6  Network Architecture Overview

6.1 Overview

This chapter provides a comprehensive overview of the Independent A/B Fabrics architecture with pure L3 routing for Canonical’s OpenStack deployment. The design is optimized for OVN/OVS overlay networks using GENEVE encapsulation.

Note: For definitions of technical terms, see the Glossary.

6.2 Core Philosophy: Pure L3 Everywhere

The underlay has exactly one job: move IP packets between hosts reliably and at full bandwidth. Everything else (tenant networks, isolation, mobility) is handled by OVN on top.

6.2.1 Why L3-Only?

We deliberately choose L3 everywhere and avoid L2 constructs (bridges, MLAG, vPC, STP):

  • L3 scales cleanly: No broadcast, no flooding, no spanning tree, no split-brain risk
  • Failures are explicit and fast: Links fail → BFD detects → BGP withdraws → ECMP reconverges
  • Predictable paths: Every packet is routed; no hidden L2 behavior
  • Modern DC norm: Clos fabrics, hyperscalers, and OVN/GENEVE overlays all assume an IP fabric

6.2.2 What We Explicitly Avoid

  • No L2 bridges in the underlay
  • No MLAG / vPC / stacked ToRs
  • No shared MACs across links
  • No dependence on broadcast or ARP domains
  • No software bonding (each NIC is a separate routed interface)

6.3 Unified L3 Clos Fabric with Dual ToRs

Key Principle: Single L3 Clos fabric with dual ToRs per rack for redundancy and path diversity, providing excellent ECMP load balancing and bandwidth utilization.

6.3.1 The Mental Model

  • Physical ports = just pipes
  • Per-link IPs = how neighbors talk (point-to-point /31 links)
  • Loopback IP = who the node is (independent of physical links)
  • BGP = who can reach whom (route advertisement)
  • ECMP = use all available paths (automatic load balancing)
  • OVN/GENEVE = virtual networks on top, completely decoupled

6.3.2 Multiple NICs per Server - Pure L3 Approach

What We Do (Pure L3):

  • Each server NIC:
    • Has its own IP on a different routed network (A or B)
    • Connects to a different ToR (ToR-A or ToR-B)
  • The loopback IP is advertised via eBGP through both NICs
  • Result:
    • The fabric learns multiple equal-cost paths to the same server loopback
    • Traffic can enter or leave the server via either NIC
    • If one NIC / ToR fails, the loopback remains reachable via the other

What We Don’t Do:

  • ❌ No LACP bonding: NICs are not bonded at L2
  • ❌ No software bonding: No bond0, no balance-xor mode
  • ❌ No shared MACs: Each NIC has its own MAC and IP
  • ❌ No bridges: No L2 switching in the underlay
  • ❌ No MLAG/vPC: No shared state between ToRs

Why This Works:

  • Servers are routers with two uplinks
  • Each uplink is a separate routed network
  • Loopback identity is independent of physical links
  • ECMP handles load balancing automatically (5-tuple hashing)
  • No LACP, no shared MACs, no ambiguity
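
To make this concrete, here is a minimal sketch (Python, illustrative values only) of the kind of per-host routing configuration this approach implies, rendered as FRR-style text: the loopback is originated once, each NIC is an independent eBGP session to its ToR, and maximum-paths enables ECMP over both. The ASNs, interface names, and addresses are hypothetical placeholders, not values from the IP plan.

    # Illustrative sketch only: renders an FRR-style BGP stanza for a host with
    # two routed uplinks. ASNs, interfaces, and addresses are hypothetical
    # placeholders, not values from the Network Design & IP Addressing chapter.
    from ipaddress import IPv4Interface

    def host_bgp_config(loopback: str, host_asn: int,
                        uplinks: list[tuple[str, str, int]]) -> str:
        """uplinks: (interface name, host /31 address, ToR ASN)."""
        lines = [f"router bgp {host_asn}"]
        for ifname, addr, tor_asn in uplinks:
            iface = IPv4Interface(addr)
            # The ToR sits on the other address of the /31 pair.
            peer = next(ip for ip in iface.network if ip != iface.ip)
            lines += [f" neighbor {peer} remote-as {tor_asn}",
                      f" neighbor {peer} description uplink {ifname}"]
        lines += [" address-family ipv4 unicast",
                  f"  network {loopback}",
                  "  maximum-paths 2",    # ECMP across both uplinks
                  " exit-address-family"]
        return "\n".join(lines)

    print(host_bgp_config("10.255.1.11/32", 65101,
                          [("eth0", "172.16.1.1/31", 65001),
                           ("eth1", "172.16.1.129/31", 65002)]))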

6.4 Topology Evolution

6.4.1 Phase 1: Mesh Topology (5-6 racks)

  • Fabric-A: All ToR-A switches interconnect in mesh via BGP
  • Fabric-B: All ToR-B switches interconnect in mesh via BGP
  • 8 uplink ports per ToR: Sufficient for mesh connectivity
  • No spine switches needed: Mesh works well for small scale

Deployment:

  1. Deploy a ToR-A and ToR-B pair in each rack
  2. Connect ToR-A switches in a mesh (full mesh or ring)
  3. Connect ToR-B switches in a mesh (full mesh or ring)
  4. Configure BGP sessions between the ToR pairs
  5. 8 uplinks per ToR: sufficient for 7 other ToRs + 1 spare
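
As a quick sanity check on the uplink budget (assuming the 8 uplink ports per ToR stated above), the sketch below counts per-ToR uplinks and per-fabric links for a full mesh.

    # Sanity check on the mesh phase: 8 uplink ports per ToR, one mesh per fabric.
    def mesh_budget(racks: int) -> tuple[int, int]:
        """Return (uplinks used per ToR, total mesh links per fabric)."""
        peers = racks - 1                  # each ToR-A peers with every other ToR-A
        links = racks * (racks - 1) // 2   # full-mesh link count per fabric
        return peers, links

    for racks in (5, 6, 7):
        used, links = mesh_budget(racks)
        print(f"{racks} racks: {used}/8 uplinks per ToR, {links} mesh links per fabric")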

6.4.2 Phase 2: Leaf-Spine Topology (7+ racks, single Network Pod)

  • All ToRs connect to all spines: Both ToR-A and ToR-B connect to Spine-1, Spine-2, etc.
  • Single routing domain: One unified L3 fabric with multiple paths
  • ECMP across all paths: Traffic can use any available path
  • No peer-links between ToRs: ToR-A and ToR-B are independent (no MLAG)
  • Single Network Pod: All racks in one pod

Migration from Mesh:

  1. Deploy dedicated spine switches (Spine-1, Spine-2, etc.)
  2. Convert the ToRs to the leaf role
  3. Connect all ToRs to all spines (full connectivity)
  4. Remove the inter-ToR mesh links
  5. Reconfigure BGP sessions (all ToRs → all spines)
  6. Zero server recabling: existing hosts are unchanged

Clos Topology - Every ToR connects to Every Spine:

                    [Spine-1]           [Spine-2]
                    /  |  |  \         /  |  |  \
                   /   |  |   \       /   |  |   \
          [ToR-A1]  [ToR-B1]    [ToR-A2]  [ToR-B2]
          (Rack 1)  (Rack 1)    (Rack 2)  (Rack 2)
              |         |           |         |
            eth0      eth1        eth0      eth1
              \___ Host-1 ___/      \___ Host-2 ___/
                (Rack 1 hosts)        (Rack 2 hosts)

Key Principle: In a Clos fabric, EVERY ToR connects to EVERY Spine:

  • ToR-A1 → Spine-1 AND Spine-2
  • ToR-B1 → Spine-1 AND Spine-2
  • ToR-A2 → Spine-1 AND Spine-2
  • ToR-B2 → Spine-1 AND Spine-2

This provides maximum path diversity and bandwidth utilization. All ToRs and spines are in a single unified L3 fabric.
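
This "every ToR to every spine" rule is what produces the path diversity claimed in section 6.5: between a host in Rack 1 and a host in Rack 2 there are 2 × 2 × 2 = 8 equal-cost paths. A short enumeration (Python, names taken from the diagram above) makes this explicit.

    from itertools import product

    # Every host uplinks to both of its rack's ToRs; every ToR connects to every spine.
    src_tors = ["ToR-A1", "ToR-B1"]   # Host-1 in Rack 1 (eth0 / eth1)
    spines   = ["Spine-1", "Spine-2"]
    dst_tors = ["ToR-A2", "ToR-B2"]   # Host-2 in Rack 2 (eth0 / eth1)

    paths = [" -> ".join(("Host-1", s, sp, d, "Host-2"))
             for s, sp, d in product(src_tors, spines, dst_tors)]
    for p in paths:
        print(p)
    print(f"{len(paths)} equal-cost paths")   # 2 x 2 x 2 = 8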

For detailed network topology with IP addresses, see Network Design & IP Addressing - Network Topology.

6.4.3 Phase 3: Super-Spine Topology (Multiple Network Pods)

As the datacenter scales, evolve to hierarchical super-spine architecture with Network Pods (NP):

Super-Spine Architecture with Network Pods

Key Components:

  1. Super-Spine Layer: Interconnects multiple Network Pods
    • SuperSpine-1: 10.254.100.1/32
    • SuperSpine-2: 10.254.100.2/32
    • Point-to-point pool: 172.24.100.0/24
  2. Network Pods (NP): Each pod is a complete leaf-spine fabric
    • NP1: First network pod
    • NP2: Second network pod
    • Each pod has its own spine layer and racks
  3. Spine Layer (per Pod): Connects ToRs within a pod
    • NP1 Spines: 10.254.1.0/24 (e.g., 10.254.1.1/32, 10.254.1.2/32)
    • NP2 Spines: 10.254.2.0/24 (e.g., 10.254.2.1/32, 10.254.2.2/32)
    • ToR↔︎Spine p2p pool: 172.20.{pod}.0/22
  4. Racks (per Pod): Each rack has dual ToRs and hosts
    • ToR-A and ToR-B in each rack
    • Hosts connect to both ToRs via separate NICs

Note: We call them Network Pods (NP / NPod) to avoid confusion with Kubernetes pods.

Evolution Path:

  • When reaching 10+ racks: add more spines (Spine-A3, Spine-A4, etc.)
  • When reaching multiple pods: add a super-spine layer
  • Scalable: new Network Pods can be added without disrupting existing pods

6.5 Why Unified Fabric with Dual ToRs?

6.5.1 Excellent Path Diversity and Bandwidth Utilization

With OVN/OVS overlay networks, the physical underlay only needs to provide basic IP connectivity. All tunneling, encapsulation, and virtual routing happens in software (OVS kernel module).

Key Insight: Since TEP endpoints are at hosts (not at ToR switches), the fabric layer doesn’t need EVPN or VXLAN. Pure L3 BGP/ECMP routing is sufficient and much simpler.

Benefits of Unified Fabric:

  1. Maximum path diversity: With 2 ToRs per rack and 2 spines, there are 8 equal-cost paths between any two hosts in different racks
  2. Better bandwidth utilization: Traffic can use any available path, not restricted to separate fabric pools
  3. Automatic path selection: BGP ECMP automatically distributes across all available paths
  4. Pure L3 routing: Each NIC is a separate routed interface - no bonding complexity
  5. No peer-link dependency: Unlike MLAG, ToRs are independent - no shared state between ToR-A and ToR-B
  6. Natural multipath: Loopback advertised via both NICs creates equal-cost paths automatically
  7. Standard modern DC design: Same architecture as hyperscaler datacenters

6.5.2 Comparison with Other Architectures

Aspect | A/B Fabrics (Our Design) | MLAG/vPC | EVPN Multihoming
------ | ------------------------ | -------- | ----------------
Overlay resilience | Perfect - zero shared state | Poor - peer-link failure disrupts all tunnels | Good - but complex
200G multipath | Pure L3 ECMP (2 × 100G NICs) | LACP hardware bond | ESI-based multihoming
Failure isolation | Complete - independent fabrics | Shared - peer-link SPOF | Good but complex
Complexity | Low - pure L3 routing | Medium-High - state sync | High - EVPN control plane
Operational simplicity | Simple - independent maintenance | Complex - coordinated upgrades | Very complex - EVPN ops
NIC configuration | Separate routed interfaces | NICs bonded at L2 | Depends on implementation
Config lines/switch | ~50 lines | ~150 lines | ~300+ lines
Vendor dependency | None - standard BGP/ECMP | Vendor-specific (Cisco vPC, Arista MLAG) | EVPN support required
Suitability for OVN | Excellent - TEPs at hosts | Poor - overlay conflicts with L2 | Overkill - not needed

Rating:

  • Independent A/B Fabrics: ★★★★★ (recommended for OpenStack with OVN)
  • MLAG/vPC: ★★☆☆☆ (not recommended - the peer-link is a critical failure point for overlays)
  • EVPN Multihoming: ★★★☆☆ (overkill - not needed when TEPs are at hosts)

6.6 Design Principles

6.6.1 Single Unified L3 Fabric with Dual ToRs

All ToRs connect to all spines in a single routing domain:

  • No peer-links between ToR-A and ToR-B (no MLAG)
  • Single routing domain with multiple redundant paths
  • All ToRs peer with all spines via BGP
  • Maximum ECMP path diversity

6.6.2 Pure L3 Underlay

  • No bridges: All switching is L3 routing
  • No VLANs: IP-only underlay
  • No EVPN/VXLAN at fabric: Fabric only routes IP packets, doesn’t understand overlays
  • Point-to-point links: /31 links between all devices (RFC 3021)
  • BGP routing: Standard eBGP for route advertisement (no iBGP, no route reflectors)
  • ECMP: Automatic load balancing across multiple paths using 5-tuple hashing
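
As a small illustration of the point-to-point addressing, the sketch below (Python, pool value illustrative per section 6.8) carves a per-rack host↔ToR pool into /31 links; RFC 3021 makes both addresses of each /31 usable.

    from ipaddress import ip_network

    # Illustrative only: carve a per-rack host<->ToR pool (section 6.8 uses
    # 172.16.{pod-rack}.0/24) into RFC 3021 /31 point-to-point links.
    pool = ip_network("172.16.1.0/24")
    links = list(pool.subnets(new_prefix=31))

    print(f"{len(links)} /31 links available in {pool}")   # 128
    tor_side, host_side = links[0]                          # both addresses are usable
    print(f"first link: ToR {tor_side} <-> host {host_side}")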

6.6.3 Host-Based TEPs

TEP endpoints are at hypervisors (not at ToR switches):

  • Each host’s loopback IP is its TEP (Tunnel Endpoint)
  • OVN/OVS handles GENEVE encapsulation/decapsulation
  • The fabric layer doesn’t need EVPN/VXLAN - it just routes IP packets
  • Hardware acceleration: ConnectX-6 DX supports GENEVE offload

Note: ToR switches are actually L3 routers doing BGP routing, not L2 switches. The name “ToR” refers to the physical location; the function is pure L3 routing.
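
For orientation, here is a schematic (a Python dataclass, field values illustrative) of what a GENEVE-encapsulated tenant packet between two host TEPs looks like from the fabric’s point of view: the underlay only ever sees and routes the outer IP header.

    from dataclasses import dataclass

    # Schematic of a GENEVE-encapsulated tenant packet between two host TEPs.
    # Addresses and VNI are illustrative; 6081 is the IANA-assigned GENEVE port.
    @dataclass
    class GeneveEncap:
        outer_src: str      # sending hypervisor's loopback (its TEP)
        outer_dst: str      # receiving hypervisor's loopback
        udp_dst: int        # 6081 = GENEVE
        vni: int            # virtual network identifier managed by OVN
        inner_frame: bytes  # tenant Ethernet frame, opaque to the underlay

    pkt = GeneveEncap(outer_src="10.255.1.11", outer_dst="10.255.2.21",
                      udp_dst=6081, vni=0x1234,
                      inner_frame=b"...tenant L2 frame...")
    print(pkt)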

6.7 Why No EVPN/VXLAN at Fabric Layer?

6.7.1 Traditional DC Design (TEPs at Leaf Switches)

  • TEP endpoints are at Leaf/ToR switches
  • EVPN control plane needed to coordinate TEP discovery
  • VXLAN tunnels between switches
  • Complex switch configuration (~300+ lines)

6.7.2 Our Design (TEPs at Hosts)

  • TEP endpoints are at hypervisors (each host is a TEP)
  • OVN control plane handles all TEP discovery and VM learning
  • Fabric only routes IP packets - doesn’t need to understand overlays
  • Simple switch configuration - just BGP and ECMP (~50 lines)

6.7.3 Benefits of Host-Based TEPs

  1. Vendor neutrality: Fabric switches don’t need vendor-specific EVPN features
  2. Simpler fabric: Pure L3 routing is easier to operate and debug
  3. Better scalability: No switch FIB limits for MAC addresses
  4. Hardware acceleration: Modern NICs support GENEVE/VXLAN offload
  5. OVN integration: Native OVN control plane, no need for EVPN semantics

References:

  • OpenStack Architecture Guide - L3 Underlay
  • Canonical OpenStack Design Considerations

6.8 IP Addressing Strategy

The architecture uses hierarchical addressing for easy debugging and management. The IP address structure encodes:

  • Role: Device type (spine, ToR, host)
  • Pod: Network Pod number (for the super-spine architecture)
  • Rack: Physical rack location

6.8.1 Addressing Principles

  1. Loopback IPs: Organized by role + pod + rack
    • Network devices: 10.254.{pod-rack-role}.*
    • Host TEPs: 10.255.{pod-rack}.{host}
  2. Point-to-Point Links: Allocated from per-rack/per-pod pools
    • Host↔︎ToR: 172.16.{pod-rack}.0/24 (split A/B)
    • ToR↔︎Spine: 172.20.{pod}.0/22 per pod
    • Spine↔︎Super-Spine: 172.24.100.0/24
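
The sketch below (Python) shows how such a scheme could be encoded; the exact octet packing here is an assumption for illustration, and the authoritative allocations live in Network Design & IP Addressing.

    # Illustrative encoding of the hierarchical scheme in 6.8.1. The octet
    # packing is an assumption; see "Network Design & IP Addressing" for the
    # authoritative plan.
    def host_tep_loopback(pod: int, rack: int, host: int) -> str:
        """Host TEPs: 10.255.{pod-rack}.{host} (pod/rack packed into one octet)."""
        pod_rack = pod * 10 + rack              # assumed packing, e.g. pod 1 rack 3 -> 13
        return f"10.255.{pod_rack}.{host}/32"

    def spine_loopback(pod: int, index: int) -> str:
        """Spines: 10.254.{pod}.{index} (e.g. NP2 Spine-1 -> 10.254.2.1)."""
        return f"10.254.{pod}.{index}/32"

    print(host_tep_loopback(pod=1, rack=3, host=7))   # 10.255.13.7/32
    print(spine_loopback(pod=2, index=1))             # 10.254.2.1/32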

6.8.2 Benefits

  • Easy Debugging: IP address immediately reveals pod, rack, and role
  • Scalable: Clear allocation scheme for adding new pods/racks
  • Organized: Logical grouping by network topology
  • Consistent: Same pattern across all Network Pods

For concrete IP allocation examples, see Network Design & IP Addressing.

6.9 Hardware Configuration

6.9.1 Server NICs

  • 2 × 100G per server (NVIDIA ConnectX-6 DX)
  • Hardware GENEVE offload enabled
  • Total aggregate: 200G via pure L3 ECMP
  • Each NIC is a separate routed interface

6.9.2 Switch Hardware

  • ToR Switches:
    • Option 1: 100G × 64 ports (Tomahawk-based)
    • Option 2: 200G × 32 ports (Tomahawk-based)
  • Spine Switches:
    • 400G switches (Tomahawk-based)
  • All switches: Pure L3 routers with BGP/ECMP

6.10 Operational Benefits

6.10.1 Rolling Maintenance

  • Upgrade spines one at a time: ECMP redistributes to remaining spines
  • Upgrade ToRs rack by rack: Dual ToRs provide redundancy during upgrades
  • Rolling upgrades: No downtime - traffic shifts to redundant paths

6.10.2 Simple Troubleshooting

  • Pure L3: Standard tools (ping, traceroute, tcpdump)
  • No state sync issues: No MLAG peer-link to debug
  • Clear failure domains: Easy to isolate problems

6.10.3 Configuration Simplicity

  • ~50 config lines per switch: Pure L3 routing (BGP)
  • No MLAG complexity: No peer-link, no state synchronization
  • Standard protocols: BGP, ECMP - all well-understood

6.11 Capacity Planning Principles

6.11.1 Critical: Size for 100% Load

DO NOT assume 50/50 traffic split across fabrics. Plan for:

  • Normal operation: 40-60% per fabric (depends on flow distribution)
  • Failover scenario: 100% capacity on single fabric
  • Spine uplinks: Must handle full rack capacity during failover

6.11.2 Bandwidth Calculation

  • Per host: 200G aggregate (100G per fabric)
  • Per rack: 25 hosts × 200G = 5Tbps theoretical
  • Per ToR: Must handle 100% of rack traffic during failover
  • Per spine: Must handle sum of all ToR uplinks in fabric
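
The arithmetic behind these figures, using the numbers stated in this chapter (25 hosts per rack, 2 × 100G per host):

    # Capacity arithmetic using this chapter's numbers.
    hosts_per_rack = 25
    gbps_per_host = 2 * 100                     # two 100G NICs, pure L3 ECMP

    rack_edge_gbps = hosts_per_rack * gbps_per_host
    print(f"Per-rack theoretical edge bandwidth: {rack_edge_gbps / 1000:.1f} Tbps")   # 5.0

    # Failover: with one ToR (or one NIC per host) down, each host still has
    # 100G of capacity, all of it landing on the surviving ToR.
    surviving_gbps = hosts_per_rack * 100
    print(f"Surviving-ToR worst case: {surviving_gbps} Gbps of host-facing capacity")  # 2500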

6.12 Firewall Integration

6.12.1 Active-Active Firewall Design

  • Firewall-1:
    • eth0 → ToR-A (Rack 1)
    • eth1 → ToR-B (Rack 1)
  • Firewall-2:
    • eth0 → ToR-A (Rack 2)
    • eth1 → ToR-B (Rack 2)

Each firewall connects to dual ToRs with ECMP routing. Traffic can use any available path through the unified fabric.

6.13 Power Domain Separation

Best practice for failure isolation:

  • Rack ToR-A switches → PDU-A → UPS-A
  • Rack ToR-B switches → PDU-B → UPS-B
  • Spines: Distribute across different power domains for redundancy

This provides redundancy at the power level. If PDU-A fails, ToR-B switches and some spines remain operational.

6.14 Key Design Decisions

6.14.1 Why /32 for loopbacks?

  • Provides stable identity for OVN TEP
  • Simplifies routing (no subnet concerns)
  • Works well with BGP
  • At 150 hosts, FIB scale is not an issue
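
A rough route count at this scale (150 hosts and 25 hosts per rack from this document, dual ToRs, and an assumed 4 spines for illustration) shows why /32 loopbacks are not a FIB concern:

    # Rough underlay route count. 150 hosts and 25 hosts/rack come from this
    # document; the 4-spine count is an illustrative assumption.
    hosts, hosts_per_rack, tors_per_rack, spines = 150, 25, 2, 4

    racks = hosts // hosts_per_rack
    loopbacks = hosts + racks * tors_per_rack + spines        # /32 routes
    p2p_links = hosts * 2 + racks * tors_per_rack * spines    # host uplinks + ToR-spine /31s

    print(f"{racks} racks, ~{loopbacks} /32 loopbacks, ~{p2p_links} /31 links")
    # -> 6 racks, ~166 /32s, ~348 /31s: trivial next to typical switch FIB capacity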

6.14.2 Why no summarization initially?

  • Avoids blackholes during failures
  • Simpler to debug
  • At this scale, /32s are manageable
  • Can add summarization later with proper safeguards

6.14.3 Why GENEVE over VXLAN?

  • More extensible (variable-length options)
  • Better for OVN’s use case
  • Native OVN support
  • Similar performance characteristics
  • Hardware offload: Modern NICs support GENEVE offload

6.14.4 Why pure L3 multipath (no bonding)?

  • Each NIC is separate routed interface: No L2 constructs, no bonding complexity
  • Loopback advertised via both NICs: BGP creates equal-cost paths automatically
  • ECMP handles load balancing: Kernel routing table distributes traffic across both paths
  • 5-tuple hashing: GENEVE (UDP) provides excellent hash entropy
  • Automatic failover: BGP withdraws failed path, ECMP uses remaining path
  • No hardware dependency: Pure L3 routing, no switch-side LACP needed
  • Servers are routers: First-class multipath support, not a hack

6.14.5 How GENEVE Uses ECMP

  • 5-tuple hashing: Source IP, Dest IP, Source Port, Dest Port, Protocol
  • Unique source ports: Each connection gets different source port
  • Kernel ECMP: Routes across both NICs (eth0/eth1) based on 5-tuple
  • Fabric ECMP: Routes across multiple spine paths based on 5-tuple
  • Automatic distribution: No manual configuration needed
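
A toy model of the selection step (Python; the hash function is illustrative, since kernels and switch ASICs use their own): because OVS derives the outer UDP source port from the inner flow, different tenant connections hash to different uplinks while any single flow stays on one path.

    import random
    from zlib import crc32

    # Toy ECMP selection: hash the outer 5-tuple of a GENEVE packet and pick one
    # of the host's two uplinks. Real implementations use their own hash functions,
    # but the flow-sticky, per-flow spreading behaviour is the same.
    uplinks = ["eth0 (via ToR-A)", "eth1 (via ToR-B)"]

    def pick_uplink(src_ip: str, dst_ip: str, src_port: int,
                    dst_port: int = 6081, proto: int = 17) -> str:
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        return uplinks[crc32(key) % len(uplinks)]

    random.seed(0)
    for _ in range(5):
        sport = random.randint(32768, 60999)   # outer source port varies per inner flow
        print(sport, "->", pick_uplink("10.255.1.11", "10.255.2.21", sport))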

6.14.6 Kubernetes Integration

  • Kubernetes uses OVN: OVN-Kubernetes CNI integrates with OVN control plane
  • No double overlay: Pods and VMs share same GENEVE overlay
  • Unified networking: One control plane (OVN) for all workloads
  • Consistent policies: OVN security groups apply to both pods and VMs

6.15 Why Not EVPN at Fabric Layer?

Common Question: Do we need EVPN at the fabric layer?

Answer: No. EVPN is not needed at the fabric layer because:

  1. TEP endpoints are at hosts (hypervisors), not at ToR switches
  2. OVN control plane handles all TEP discovery and VM learning
  3. Fabric only routes IP packets - it doesn’t need to understand overlays
  4. Simpler operations - pure L3 routing is easier than EVPN

EVPN semantics: OVN provides TEP registration, not EVPN.

  • The OVN control plane handles TEP registration when hosts come up
  • The OVN Southbound DB maintains TEP-to-VM mappings
  • No EVPN needed - OVN’s control plane replaces EVPN’s function

6.16 References

For detailed implementation:

  • Network Design & IP Addressing - Concrete IP plans
  • BGP & Routing Configuration - BGP configuration details
  • Operations & Maintenance - Operational procedures

6.16.1 Technical References