16 Design Decisions & Tradeoffs
16.1 Overview
This chapter explains why we chose eBGP + ECMP for the underlay, and why we deliberately avoided other common datacenter network architectures. Understanding these tradeoffs helps with troubleshooting and future design decisions.
Summary: For OpenStack with OVN overlay (GENEVE), a pure L3 Clos fabric with eBGP + ECMP is the simplest, most scalable, and most operationally manageable choice.
16.2 Why L3 Clos with eBGP + ECMP
16.2.1 The Core Decision
Our underlay is a pure L3 leaf-spine (Clos) fabric where:
- Every link is a routed /31 point-to-point
- Every device runs eBGP for route advertisement
- ECMP provides automatic multipath load balancing
- BFD provides sub-second failure detection (a configuration sketch follows this list)
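A minimal FRR sketch of one leaf uplink in this pattern; the interface name, addresses, and ASNs are illustrative placeholders, not values from our build:
! Hypothetical leaf (AS 65101) peering with one spine (AS 65201) over a routed /31
interface Ethernet48
 description uplink-to-spine1
 ip address 192.0.2.0/31
!
router bgp 65101
 bgp router-id 10.255.0.11
 neighbor 192.0.2.1 remote-as 65201
 neighbor 192.0.2.1 description spine1
 ! BFD gives sub-second detection; BGP then withdraws affected routes immediately
 neighbor 192.0.2.1 bfd
 address-family ipv4 unicast
  redistribute connected
  ! install up to 64 equal-cost next-hops (ECMP)
  maximum-paths 64
 exit-address-family
The same stanza repeats per uplink; adding a spine is just another neighbor block.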
16.2.2 Benefits
- Simplicity: No spanning tree, no MLAG state machines, no peer-links
- Scalability: Clos topology scales horizontally (add spines) and vertically (add super-spines)
- Predictability: Traffic follows well-understood L3 forwarding rules
- Fast convergence: BGP + BFD provides sub-second failover
- Vendor independence: BGP is universally supported
- Automation friendly: Consistent configuration across all devices
16.2.3 Why eBGP (Not iBGP)
| Aspect | eBGP | iBGP |
|---|---|---|
| Route reflection | Not needed | Requires full mesh or route reflectors |
| Path selection | Independent, hop-by-hop | RRs reflect only their best path, limiting path diversity |
| Loop prevention | AS-path naturally prevents | Requires careful design |
| Simplicity | Each device is its own AS | Shared AS, more config |
| Modern DC practice | Standard (Facebook, Microsoft, etc.) | Less common in DCs |
Our choice: eBGP with unique ASN per device (or per-role). This eliminates the need for route reflectors and provides natural loop prevention.
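A hedged sketch of how this looks in FRR: only the local ASN is device-specific, and `remote-as external` accepts whatever eBGP ASN the far side presents (ASNs and addresses are examples only):
! Hypothetical ToR with its own private ASN (65112); spine ASNs are not hard-coded
router bgp 65112
 neighbor SPINES peer-group
 ! "external" = accept any eBGP peer ASN, so one template fits every device
 neighbor SPINES remote-as external
 neighbor SPINES bfd
 neighbor 192.0.2.1 peer-group SPINES
 neighbor 192.0.2.3 peer-group SPINES
 address-family ipv4 unicast
  maximum-paths 64
 exit-address-family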
16.3 Why Not L2 Fabric (VLAN Stretching, MLAG/vPC)
16.3.1 The Problem with L2 at Scale
Traditional L2 fabrics have fundamental scaling limitations:
- Spanning Tree (STP): Blocks redundant paths, wastes bandwidth
- Broadcast domains: ARP storms, unknown unicast flooding
- MAC table limits: Switches have finite MAC table sizes
- Single failure domain: L2 failures can cascade
16.3.2 Why Not MLAG/vPC
MLAG (Multi-Chassis Link Aggregation) creates a virtual switch from two physical switches:
           [Virtual Switch]
            /            \
     [ToR-A] ============ [ToR-B]   ← Peer-link (L2 state sync)
        |                    |
       [Host with bonded NICs]
Problems:
| Issue | Impact |
|---|---|
| Peer-link dependency | Single point of failure; if peer-link fails, split-brain |
| State synchronization | Complex L2 state sync between chassis |
| Vendor lock-in | Cisco vPC, Arista MLAG, Juniper MC-LAG - all different |
| Scale limits | Typically 2 switches per MLAG domain |
| Operational complexity | Coordinated upgrades required |
| Overlay conflict | MLAG designed for L2; conflicts with host-based tunnels |
Our approach: No MLAG. ToR-A and ToR-B are completely independent. Each host NIC is a separate routed interface, not a bonded pair.
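With routing on the host, each NIC carries its own /31 and its own eBGP session, and the host announces a loopback /32 that the overlay then uses as its tunnel endpoint. A minimal FRR sketch on a compute host (addresses and ASN are illustrative):
! Hypothetical compute host: no bond, one eBGP session per uplink NIC
! 192.0.2.9 = ToR-A side of the eth0 /31, 192.0.2.11 = ToR-B side of the eth1 /31
router bgp 65401
 neighbor 192.0.2.9 remote-as external
 neighbor 192.0.2.9 bfd
 neighbor 192.0.2.11 remote-as external
 neighbor 192.0.2.11 bfd
 address-family ipv4 unicast
  ! advertise the host loopback; OVN is configured to tunnel from this address
  network 10.255.1.21/32
  ! both ToR paths stay active; losing one simply halves uplink bandwidth
  maximum-paths 2
 exit-address-family
If a ToR or link fails, BFD tears the session down and traffic converges onto the surviving uplink, with no MLAG or bond state involved.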
16.3.3 Why This Works for Overlay Networks
With OVN/GENEVE overlay:
- Tunnels originate at hosts (not at switches)
- Switch only sees IP packets (outer GENEVE headers)
- No need for L2 tricks at the fabric layer
- ECMP naturally load-balances tunnel traffic (a quick on-box check follows this list)
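All of this is observable from a fabric port: the switch only ever sees unicast IP/UDP between host tunnel endpoints. A hedged check, assuming the standard GENEVE UDP port (6081) and an illustrative interface name:
# On a SONiC switch (or a host uplink): the fabric only sees outer IP/UDP headers
tcpdump -c 5 -ni Ethernet48 udp port 6081
# ECMP hashes on the outer 5-tuple; the encapsulating host varies the outer UDP
# source port per inner flow, so overlay flows spread across all equal-cost paths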
16.4 Why Not EVPN/VXLAN in the Underlay
16.4.1 What is EVPN/VXLAN?
EVPN (Ethernet VPN) is a BGP-based control plane for VXLAN overlays, typically used when:
- Switches are VTEP endpoints
- L2 domains need to stretch across L3
- Multi-tenancy is required at the fabric layer
16.4.2 Why We Don’t Need It
| Aspect | EVPN/VXLAN Underlay | Our Design |
|---|---|---|
| Tunnel termination | At switches (leaf VTEP) | At hosts (OVN TEP) |
| Overlay control | BGP EVPN | OVN Southbound DB |
| L2 learning | Data plane (switch) | Control plane (OVN) |
| Complexity | High (EVPN type-2/5, VNI mapping) | Low (pure L3) |
| Configuration | 300+ lines per switch | ~50 lines per switch |
Key insight: OVN already provides overlay functionality. Adding EVPN in the underlay creates:
- Double overlay (GENEVE over VXLAN over IP)
- Conflicting control planes (OVN vs EVPN)
- Unnecessary complexity for our use case
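The split is visible directly in OVN: each chassis (host) registers its own GENEVE tunnel endpoint in the Southbound DB, and no fabric device appears there. Two read-only commands confirm this (run on an OVN central node; hostnames and IPs will be site-specific):
# List chassis and the tunnel endpoints they registered (hostname, Encap geneve, ip)
ovn-sbctl show
# The Encap table holds every tunnel endpoint the overlay knows about
ovn-sbctl list Encap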
16.4.3 When EVPN/VXLAN Makes Sense
EVPN is appropriate when:
- You need L2 extension across L3 boundaries (we don’t)
- Switch-based multi-tenancy is required (OVN handles this)
- Bare-metal servers need direct L2 connectivity (rare in cloud)
- Legacy applications require stretched VLANs (avoid if possible)
16.5 Why Not OSPF or IS-IS for Underlay
16.5.1 The Traditional Choice
OSPF and IS-IS are link-state IGPs commonly used in enterprise networks:
| Protocol | Typical Use |
|---|---|
| OSPF | Enterprise, campus networks |
| IS-IS | Service provider, large enterprise |
| eBGP | Internet, modern datacenters |
16.5.2 Why eBGP is Better for DC
| Aspect | OSPF/IS-IS | eBGP |
|---|---|---|
| Flooding | Link-state floods to all routers | No flooding (path-vector) |
| Convergence | SPF calculation on every router | Incremental updates |
| Policy control | Limited | Rich (communities, AS-path, etc.) |
| Multi-vendor | Some interop issues | Universally compatible |
| External connectivity | Needs redistribution to BGP | Native BGP everywhere |
| Scale | Area design complexity | Naturally hierarchical |
| Operational model | Different from internet edge | Same as internet edge |
Key advantage: With eBGP everywhere, operators use the same protocol for:
- Server ↔︎ ToR
- ToR ↔︎ Spine
- Spine ↔︎ Super-Spine
- DC ↔︎ WAN/Internet
This uniformity simplifies operations, training, and automation.
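As a small illustration of the policy-control row above, BGP lets any layer of the fabric tag and match routes with communities; a hedged FRR sketch (community value, names, and prefixes are examples only):
! Hypothetical: tag host loopbacks learned from servers with community 65000:100,
! so spines or the DC edge can match the tag instead of maintaining prefix lists
ip prefix-list HOST-LOOPBACKS seq 5 permit 10.255.1.0/24 ge 32
!
route-map FROM-SERVERS permit 10
 match ip address prefix-list HOST-LOOPBACKS
 set community 65000:100
route-map FROM-SERVERS permit 20
 ! pass everything else unmodified
!
router bgp 65112
 neighbor SERVERS peer-group
 neighbor SERVERS remote-as external
 address-family ipv4 unicast
  neighbor SERVERS route-map FROM-SERVERS in
 exit-address-family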
16.5.3 When OSPF/IS-IS Might Be Valid
- Existing enterprise networks with OSPF expertise
- Very small deployments where BGP feels like overkill
- Specific vendor requirements (rare)
16.6 Platform Choices
16.6.1 Why SONiC
SONiC (Software for Open Networking in the Cloud) is our preferred network OS:
| Feature | Benefit |
|---|---|
| Open source | No vendor lock-in, community-driven |
| Linux-based | Standard Linux tools, containers, scripting |
| FRR routing | Production-proven BGP/OSPF/BFD stack |
| SAI abstraction | Works with multiple switch ASICs |
| REST/gNMI APIs | Modern automation interfaces |
| Used at scale | Azure, Alibaba, LinkedIn, and many others |
FRR (Free Range Routing) provides the routing stack:
# Same FRR commands on switches and hosts
vtysh -c "show ip bgp summary"
vtysh -c "show ip route"
vtysh -c "show bfd peer"16.6.2 Why Memory/Tomahawk ASICs
For pure L3 forwarding, Memory family ASICs (Memory/Tomahawk) are ideal:
| Feature | Tomahawk Series |
|---|---|
| Forwarding model | Pure L3 (no complex overlay features needed) |
| Port density | High radix (32x400G or 64x100G) |
| Buffer | Shared memory, good for bursts |
| Power efficiency | Optimized for throughput |
| Cost | Commodity pricing |
What we DON’T need from switches:
- VXLAN gateway functionality (OVN handles overlay)
- Complex ACLs/QoS (host-based security)
- L2 features (pure L3 fabric)
This means we can use simpler, cheaper switches optimized for L3 throughput.
16.6.3 Reference Hardware
| Role | Example Platform | Key Specs |
|---|---|---|
| ToR | Arista 7050CX4 | 24×200G down + 8×400G up |
| Spine | Arista 7800 or Edgecore | 32×400G or higher |
| Super-Spine | Cisco 8000 or Arista 7800R | 64×400G |
Note: Exact hardware depends on vendor relationships and availability. The architecture is hardware-agnostic.
16.7 Summary: Why This Design
┌─────────────────────────────────────────────────────────────┐
│                     OUR DESIGN CHOICES                       │
├─────────────────────────────────────────────────────────────┤
│ ✓ L3 Clos Fabric      │ Simple, scalable, predictable       │
│ ✓ eBGP Everywhere     │ Uniform protocol, no RRs needed     │
│ ✓ ECMP + BFD          │ Fast failover, full bandwidth       │
│ ✓ Dual ToRs (no MLAG) │ Independent failure domains         │
│ ✓ Host-based TEPs     │ OVN handles overlay, not switches   │
│ ✓ SONiC + FRR         │ Open, automatable, proven           │
├─────────────────────────────────────────────────────────────┤
│                    WE DELIBERATELY AVOID                     │
├─────────────────────────────────────────────────────────────┤
│ ✗ MLAG/vPC            │ Peer-link complexity, vendor lock   │
│ ✗ EVPN/VXLAN underlay │ Unnecessary for host-based overlay  │
│ ✗ Stretched L2        │ Scaling limits, failure domains     │
│ ✗ OSPF/IS-IS          │ Less policy control, flooding       │
└─────────────────────────────────────────────────────────────┘
Bottom line: For OpenStack with OVN, the simplest underlay is best. Let the overlay (OVN/GENEVE) handle multi-tenancy and L2 semantics. Keep the underlay pure L3.