16  Design Decisions & Tradeoffs

16.1 Overview

This chapter explains why we chose eBGP + ECMP for the underlay, and why we deliberately avoided other common datacenter network architectures. Understanding these tradeoffs helps with troubleshooting and future design decisions.

Summary: For OpenStack with OVN overlay (GENEVE), a pure L3 Clos fabric with eBGP + ECMP is the simplest, most scalable, and most operationally manageable choice.

16.2 Why L3 Clos with eBGP + ECMP

16.2.1 The Core Decision

Our underlay is a pure L3 leaf-spine (Clos) fabric where:

  • Every link is a routed /31 point-to-point
  • Every device runs eBGP for route advertisement
  • ECMP provides automatic multipath load balancing
  • BFD provides sub-second failure detection
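
As a concrete illustration, a minimal FRR sketch of a single ToR tying these four elements together might look like the following (interface names, addresses, and ASNs are placeholders, not values from this deployment):

! Minimal ToR underlay sketch (FRR); interface, address, and ASN values are placeholders
interface Ethernet1
 description uplink-to-spine-1
 ip address 10.0.0.1/31
!
router bgp 65101
 ! treat paths learned from spines with different ASNs as equal cost
 bgp bestpath as-path multipath-relax
 neighbor 10.0.0.0 remote-as 65001
 ! BFD gives sub-second failure detection on this eBGP session
 neighbor 10.0.0.0 bfd
 address-family ipv4 unicast
  redistribute connected
  ! install up to 64 equal-cost next hops (ECMP)
  maximum-paths 64
 exit-address-family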

16.2.2 Benefits

  1. Simplicity: No spanning tree, no MLAG state machines, no peer-links
  2. Scalability: Clos topology scales horizontally (add spines) and vertically (add super-spines)
  3. Predictability: Traffic follows well-understood L3 forwarding rules
  4. Fast convergence: BGP + BFD provides sub-second failover
  5. Vendor independence: BGP is universally supported
  6. Automation friendly: Consistent configuration across all devices

16.2.3 Why eBGP (Not iBGP)

Aspect              eBGP                                  iBGP
------------------  ------------------------------------  -------------------------
Route reflection    Not needed                            Requires route reflectors
Path selection      Per-hop decisions                     May need RR hierarchy
Loop prevention     AS-path naturally prevents loops      Requires careful design
Simplicity          Each device is its own AS             Shared AS, more config
Modern DC practice  Standard (Facebook, Microsoft, etc.)  Less common in DCs

Our choice: eBGP with a unique ASN per device (or per role). This eliminates the need for route reflectors and provides natural loop prevention.
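
For illustration only, a per-role ASN plan might look like the sketch below (these ranges are assumptions, not the allocation used here). Because every eBGP speaker prepends its own ASN, any device that sees its own ASN in a received AS path drops the route, which is the natural loop prevention referred to above.

# Illustrative ASN plan (placeholder ranges, not an actual allocation)
#   Super-spines : 65000           (shared, or one per device)
#   Spines       : 65001-65099     (one ASN per spine)
#   ToRs         : 65101-65999     (one ASN per ToR)
#   Hypervisors  : 4200000000+     (32-bit private ASNs, one per host)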

16.3 Why Not L2 Fabric (VLAN Stretching, MLAG/vPC)

16.3.1 The Problem with L2 at Scale

Traditional L2 fabrics have fundamental scaling limitations:

  1. Spanning Tree (STP): Blocks redundant paths, wastes bandwidth
  2. Broadcast domains: ARP storms, unknown unicast flooding
  3. MAC table limits: Switches have finite MAC table sizes
  4. Single failure domain: L2 failures can cascade

16.3.2 Why Not MLAG/vPC

MLAG (Multi-Chassis Link Aggregation) creates a virtual switch from two physical switches:

         [Virtual Switch]
        /                \
   [ToR-A] ============ [ToR-B]    ← Peer-link (L2 state sync)
      |                    |
   [Host with bonded NICs]

Problems:

Issue                   Impact
----------------------  --------------------------------------------------------
Peer-link dependency    Single point of failure; if peer-link fails, split-brain
State synchronization   Complex L2 state sync between chassis
Vendor lock-in          Cisco vPC, Arista MLAG, Juniper MC-LAG: all different
Scale limits            Typically 2 switches per MLAG domain
Operational complexity  Coordinated upgrades required
Overlay conflict        MLAG designed for L2; conflicts with host-based tunnels

Our approach: No MLAG. ToR-A and ToR-B are completely independent. Each host NIC is a separate routed interface, not a bonded pair.
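
A sketch of the host side under this model (interface names, addresses, and ASNs are placeholders): each NIC is an ordinary routed interface with its own /31 toward its ToR, and a /32 loopback, announced over both sessions, provides the stable address used as the OVN tunnel endpoint.

! Illustrative host-side FRR config; no bond, two independent routed uplinks
interface lo
 ! /32 used as the GENEVE tunnel-endpoint address
 ip address 192.0.2.10/32
!
interface eth0
 ip address 10.1.0.1/31
!
interface eth1
 ip address 10.1.1.1/31
!
router bgp 4200000010
 neighbor 10.1.0.0 remote-as 65101
 neighbor 10.1.0.0 bfd
 neighbor 10.1.1.0 remote-as 65102
 neighbor 10.1.1.0 bfd
 address-family ipv4 unicast
  network 192.0.2.10/32
  ! use both ToR uplinks as equal-cost paths
  maximum-paths 2
 exit-address-family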

16.3.3 Why This Works for Overlay Networks

With OVN/GENEVE overlay:

  • Tunnels originate at hosts (not at switches)
  • Switch only sees IP packets (outer GENEVE headers)
  • No need for L2 tricks at the fabric layer
  • ECMP naturally load-balances tunnel traffic
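
To confirm this from both vantage points (the /32 below is a placeholder for a remote hypervisor's tunnel-endpoint address), the checks are ordinary routing-table queries:

# On a ToR or spine: the remote TEP /32 should resolve via multiple equal-cost next hops
vtysh -c "show ip route 192.0.2.20/32"
# On a hypervisor: look for multiple "nexthop" entries on the kernel route toward the fabric
ip route show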

16.4 Why Not EVPN/VXLAN in the Underlay

16.4.1 What is EVPN/VXLAN?

EVPN (Ethernet VPN) is a BGP-based control plane for VXLAN overlays, typically used when:

  • Switches act as VTEPs (VXLAN tunnel endpoints)
  • L2 domains need to stretch across L3 boundaries
  • Multi-tenancy must be enforced at the fabric layer

16.4.2 Why We Don’t Need It

Aspect              EVPN/VXLAN Underlay                Our Design
------------------  ---------------------------------  --------------------
Tunnel termination  At switches (leaf VTEP)            At hosts (OVN TEP)
Overlay control     BGP EVPN                           OVN Southbound DB
L2 learning         Data plane (switch)                Control plane (OVN)
Complexity          High (EVPN type-2/5, VNI mapping)  Low (pure L3)
Configuration       300+ lines per switch              ~50 lines per switch

Key insight: OVN already provides overlay functionality. Adding EVPN in the underlay creates:

  • Double overlay (GENEVE over VXLAN over IP)
  • Conflicting control planes (OVN vs EVPN)
  • Unnecessary complexity for our use case

16.4.3 When EVPN/VXLAN Makes Sense

EVPN is appropriate when:

  • You need L2 extension across L3 boundaries (we don’t)
  • Switch-based multi-tenancy is required (OVN handles this)
  • Bare-metal servers need direct L2 connectivity (rare in cloud)
  • Legacy applications require stretched VLANs (avoid if possible)

16.5 Why Not OSPF or IS-IS for Underlay

16.5.1 The Traditional Choice

OSPF and IS-IS are link-state IGPs commonly used in enterprise networks:

Protocol  Typical Use
--------  ----------------------------------
OSPF      Enterprise, campus networks
IS-IS     Service provider, large enterprise
eBGP      Internet, modern datacenters

16.5.2 Why eBGP is Better for DC

Aspect                 OSPF/IS-IS                        eBGP
---------------------  --------------------------------  ---------------------------------
Flooding               Link-state floods to all routers  No flooding (path-vector)
Convergence            SPF calculation on every router   Incremental updates
Policy control         Limited                           Rich (communities, AS-path, etc.)
Multi-vendor           Some interop issues               Universally compatible
External connectivity  Needs redistribution to BGP       Native BGP everywhere
Scale                  Area design complexity            Naturally hierarchical
Operational model      Different from internet edge      Same as internet edge
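
As a small, hedged example of the policy-control row (the community value, prefix range, and route-map names are invented for illustration), tagging host-announced /32s at the ToR and keeping them inside the fabric at the DC edge is a few lines of standard FRR policy:

! Illustrative FRR policy; community value, prefix range, and names are placeholders
ip prefix-list SERVER-LOOPBACKS seq 5 permit 192.0.2.0/24 ge 32
bgp community-list standard SERVER-ROUTES permit 65101:100
!
! Applied outbound from ToRs toward spines: tag server /32s with a community
route-map TAG-SERVERS permit 10
 match ip address prefix-list SERVER-LOOPBACKS
 set community 65101:100
route-map TAG-SERVERS permit 20
!
! Applied outbound at the DC edge: drop the tagged routes, announce everything else
route-map EDGE-OUT deny 10
 match community SERVER-ROUTES
route-map EDGE-OUT permit 20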

Key advantage: With eBGP everywhere, operators use the same protocol for:

  • Server ↔︎ ToR
  • ToR ↔︎ Spine
  • Spine ↔︎ Super-Spine
  • DC ↔︎ WAN/Internet

This uniformity simplifies operations, training, and automation.

16.5.3 When OSPF/IS-IS Might Be Valid

  • Existing enterprise networks with OSPF expertise
  • Very small deployments where BGP feels like overkill
  • Specific vendor requirements (rare)

16.6 Platform Choices

16.6.1 Why SONiC

SONiC (Software for Open Networking in the Cloud) is our preferred network OS:

Feature          Benefit
---------------  -------------------------------------------
Open source      No vendor lock-in, community-driven
Linux-based      Standard Linux tools, containers, scripting
FRR routing      Production-proven BGP/OSPF/BFD stack
SAI abstraction  Works with multiple switch ASICs
REST/gNMI APIs   Modern automation interfaces
Used at scale    Azure, Alibaba, LinkedIn, and many others

FRR (Free Range Routing) provides the routing stack:

# Same FRR commands on switches and hosts
vtysh -c "show ip bgp summary"
vtysh -c "show ip route"
vtysh -c "show bfd peer"

16.6.2 Why Tomahawk-Class ASICs

For pure L3 forwarding, Tomahawk-class merchant-silicon ASICs are ideal:

Feature           Tomahawk Series
----------------  --------------------------------------------
Forwarding model  Pure L3 (no complex overlay features needed)
Port density      High radix (32×400G or 64×100G)
Buffer            Shared memory, good for bursts
Power efficiency  Optimized for throughput
Cost              Commodity pricing

What we DON’T need from switches:

  • VXLAN gateway functionality (OVN handles overlay)
  • Complex ACLs/QoS (host-based security)
  • L2 features (pure L3 fabric)

This means we can use simpler, cheaper switches optimized for L3 throughput.

16.6.3 Reference Hardware

Role         Example Platform            Key Specs
-----------  --------------------------  ------------------------
ToR          Arista 7050CX4              24×200G down + 8×400G up
Spine        Arista 7800 or Edgecore     32×400G or higher
Super-Spine  Cisco 8000 or Arista 7800R  64×400G

Note: Exact hardware depends on vendor relationships and availability. The architecture is hardware-agnostic.

16.7 Summary: Why This Design

┌─────────────────────────────────────────────────────────────┐
│                     OUR DESIGN CHOICES                      │
├─────────────────────────────────────────────────────────────┤
│ ✓ L3 Clos Fabric      │ Simple, scalable, predictable      │
│ ✓ eBGP Everywhere     │ Uniform protocol, no RRs needed    │
│ ✓ ECMP + BFD          │ Fast failover, full bandwidth      │
│ ✓ Dual ToRs (no MLAG) │ Independent failure domains        │
│ ✓ Host-based TEPs     │ OVN handles overlay, not switches  │
│ ✓ SONiC + FRR         │ Open, automatable, proven          │
├─────────────────────────────────────────────────────────────┤
│                    WE DELIBERATELY AVOID                     │
├─────────────────────────────────────────────────────────────┤
│ ✗ MLAG/vPC            │ Peer-link complexity, vendor lock  │
│ ✗ EVPN/VXLAN underlay │ Unnecessary for host-based overlay │
│ ✗ Stretched L2        │ Scaling limits, failure domains    │
│ ✗ OSPF/IS-IS          │ Less policy control, flooding      │
└─────────────────────────────────────────────────────────────┘

Bottom line: For OpenStack with OVN, the simplest underlay is best. Let the overlay (OVN/GENEVE) handle multi-tenancy and L2 semantics. Keep the underlay pure L3.