6 Network Architecture Overview
6.1 Overview
This chapter provides a comprehensive overview of the Independent A/B Fabrics architecture with pure L3 routing for Canonical’s OpenStack deployment. The design is optimized for OVN/OVS overlay networks using GENEVE encapsulation.
Note: For definitions of technical terms, see the Glossary.
6.2 Core Philosophy: Pure L3 Everywhere
The underlay’s only job is to move IP packets between hosts reliably and at full bandwidth. Everything else (tenant networks, isolation, mobility) is handled by OVN on top.
6.2.1 Why L3-Only?
We deliberately choose L3 everywhere and avoid L2 constructs (bridges, MLAG, vPC, STP):
- L3 scales cleanly: No broadcast, no flooding, no spanning tree, no split-brain risk
- Failures are explicit and fast: Links fail → BFD detects → BGP withdraws → ECMP reconverges
- Predictable paths: Every packet is routed; no hidden L2 behavior
- Modern DC norm: Clos fabrics, hyperscalers, and OVN/GENEVE overlays all assume an IP fabric
6.2.2 What We Explicitly Avoid
- No L2 bridges in the underlay
- No MLAG / vPC / stacked ToRs
- No shared MACs across links
- No dependence on broadcast or ARP domains
- No software bonding (each NIC is a separate routed interface)
6.3 Unified L3 Clos Fabric with Dual ToRs
Key Principle: Single L3 Clos fabric with dual ToRs per rack for redundancy and path diversity, providing excellent ECMP load balancing and bandwidth utilization.
6.3.1 The Mental Model
- Physical ports = just pipes
- Per-link IPs = how neighbors talk (point-to-point /31 links)
- Loopback IP = who the node is (independent of physical links)
- BGP = who can reach whom (route advertisement)
- ECMP = use all available paths (automatic load balancing)
- OVN/GENEVE = virtual networks on top, completely decoupled
6.3.2 Multiple NICs per Server - Pure L3 Approach
What We Do (Pure L3):
- Each server NIC:
- Has its own IP on a different routed network (A or B)
- Connects to a different ToR (ToR-A or ToR-B)
- The loopback IP is advertised via eBGP through both NICs
- Result:
- The fabric learns multiple equal-cost paths to the same server loopback
- Traffic can enter or leave the server via either NIC
- If one NIC / ToR fails, the loopback remains reachable via the other
What We Don’t Do:
- ❌ No LACP bonding: NICs are not bonded at L2
- ❌ No software bonding: No bond0, no balance-xor mode
- ❌ No shared MACs: Each NIC has its own MAC and IP
- ❌ No bridges: No L2 switching in the underlay
- ❌ No MLAG/vPC: No shared state between ToRs
Why This Works:
- Servers are routers with two uplinks
- Each uplink is a separate routed network
- Loopback identity is independent of physical links
- ECMP handles load balancing automatically (5-tuple hashing)
- No LACP, no shared MACs, no ambiguity
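To make the host-side view concrete, here is a minimal Python sketch using the standard `ipaddress` module; the /31 uplink addresses and the loopback below are illustrative examples, not the deployment’s actual plan.

```python
# Minimal host-side model of "a server is a router with two uplinks".
# All addresses here are illustrative.
import ipaddress

loopback = ipaddress.ip_interface("10.255.1.11/32")   # host identity / GENEVE TEP
uplinks = {
    "eth0": {"local": "172.16.1.1/31",   "peer": "ToR-A"},
    "eth1": {"local": "172.16.1.129/31", "peer": "ToR-B"},
}

print(f"loopback advertised to both ToRs via eBGP: {loopback.ip}")
for nic, link in uplinks.items():
    iface = ipaddress.ip_interface(link["local"])
    # the other address of the /31 is the ToR end of the point-to-point link
    tor_end = next(a for a in iface.network if a != iface.ip)
    print(f"{nic}: {iface} -> {link['peer']} ({tor_end})")

# Result: the fabric holds two equal-cost routes to the /32 loopback
# (one per ToR), and the host holds an ECMP default route with one
# next hop per uplink.
```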
6.4 Topology Evolution
6.4.1 Phase 1: Mesh Topology (5-6 racks)
- Fabric-A: All ToR-A switches interconnect in mesh via BGP
- Fabric-B: All ToR-B switches interconnect in mesh via BGP
- 8 uplink ports per ToR: Sufficient for mesh connectivity
- No spine switches needed: Mesh works well for small scale
Deployment:
1. Deploy a ToR-A and ToR-B pair per rack
2. Connect ToR-A switches in a mesh (full mesh or ring)
3. Connect ToR-B switches in a mesh (full mesh or ring)
4. Configure BGP sessions between ToR pairs
5. 8 uplinks per ToR: sufficient for 7 other ToRs + 1 spare
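As a quick sanity check of the port budget, the sketch below computes how many mesh links each ToR needs within its own fabric; the rack counts are the ones discussed in this phase.

```python
# Sketch: uplink ports needed per ToR for a full mesh within one fabric
# (Phase 1). Each ToR peers with every other ToR of the same fabric.
def mesh_ports_per_tor(num_racks: int) -> int:
    return num_racks - 1   # one port per remote ToR in the same fabric

for racks in (5, 6, 8):
    needed = mesh_ports_per_tor(racks)
    verdict = "fits within" if needed < 8 else "exceeds"
    print(f"{racks} racks: {needed} mesh links per ToR ({verdict} 8 uplink ports)")
```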
6.4.2 Phase 2: Leaf-Spine Topology (7+ racks, single Network Pod)
- All ToRs connect to all spines: Both ToR-A and ToR-B connect to Spine-1, Spine-2, etc.
- Single routing domain: One unified L3 fabric with multiple paths
- ECMP across all paths: Traffic can use any available path
- No peer-links between ToRs: ToR-A and ToR-B are independent (no MLAG)
- Single Network Pod: All racks in one pod
Migration from Mesh:
1. Deploy dedicated spine switches (Spine-1, Spine-2, etc.)
2. Convert ToRs to leaf role
3. Connect all ToRs to all spines (full connectivity)
4. Remove inter-ToR mesh links
5. Reconfigure BGP sessions (all ToRs → all Spines)
6. Zero server recabling: existing hosts unchanged
Clos Topology - Every ToR connects to Every Spine:
```
              [Spine-1]                  [Spine-2]
             /  |   |  \                /  |   |  \
            /   |   |   \              /   |   |   \
    [ToR-A1]  [ToR-B1]  [ToR-A2]  [ToR-B2]
     (Rack1)   (Rack1)   (Rack2)   (Rack2)
        |         |         |         |
      eth0      eth1      eth0      eth1
      [---- Host-1 ----]  [---- Host-2 ----]
```
(one host shown per rack; each host's eth0 connects to its rack's ToR-A, eth1 to its rack's ToR-B)
Key Principle: In a Clos fabric, EVERY ToR connects to EVERY Spine:
- ToR-A1 → Spine-1 AND Spine-2
- ToR-B1 → Spine-1 AND Spine-2
- ToR-A2 → Spine-1 AND Spine-2
- ToR-B2 → Spine-1 AND Spine-2
This provides maximum path diversity and bandwidth utilization. All ToRs and spines are in a single unified L3 fabric.
For detailed network topology with IP addresses, see Network Design & IP Addressing - Network Topology.
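The sketch below enumerates the resulting inter-rack paths for the two-spine example above (device names follow the diagram); it is where the "8 possible paths" figure quoted in 6.5.1 comes from.

```python
# Sketch: enumerate equal-cost inter-rack paths in the 2-spine Clos above.
# A path picks a source ToR (via eth0 or eth1), a spine, and a destination ToR.
from itertools import product

src_tors = ["ToR-A1", "ToR-B1"]    # Host-1 uplinks (eth0 / eth1)
spines   = ["Spine-1", "Spine-2"]  # every ToR connects to every spine
dst_tors = ["ToR-A2", "ToR-B2"]    # Host-2 uplinks

paths = [" -> ".join(("Host-1", s, sp, d, "Host-2"))
         for s, sp, d in product(src_tors, spines, dst_tors)]
for p in paths:
    print(p)
print(f"total equal-cost paths: {len(paths)}")   # 8
```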
6.4.3 Phase 3: Super-Spine Topology (Multiple Network Pods)
As the datacenter scales, evolve to hierarchical super-spine architecture with Network Pods (NP):
Key Components:
- Super-Spine Layer: Interconnects multiple Network Pods
  - SuperSpine-1: 10.254.100.1/32
  - SuperSpine-2: 10.254.100.2/32
  - Point-to-point pool: 172.24.100.0/24
- Network Pods (NP): Each pod is a complete leaf-spine fabric
  - NP1: First network pod
  - NP2: Second network pod
  - Each pod has its own spine layer and racks
- Spine Layer (per Pod): Connects ToRs within a pod
  - NP1 Spines: 10.254.1.0/24 (e.g., 10.254.1.1/32, 10.254.1.2/32)
  - NP2 Spines: 10.254.2.0/24 (e.g., 10.254.2.1/32, 10.254.2.2/32)
  - ToR↔Spine p2p pool: 172.20.{pod}.0/22
- Racks (per Pod): Each rack has dual ToRs and hosts
  - ToR-A and ToR-B in each rack
  - Hosts connect to both ToRs via separate NICs
Note: We call them Network Pods (NP / NPod) to avoid confusion with Kubernetes pods.
Evolution Path:
- When reaching 10+ racks: add more spines (Spine-A3, Spine-A4, etc.)
- When reaching multiple pods: add the super-spine layer
- Scalable: new Network Pods can be added without disrupting existing pods
6.5 Why Unified Fabric with Dual ToRs?
6.5.1 Excellent Path Diversity and Bandwidth Utilization
With OVN/OVS overlay networks, the physical underlay only needs to provide basic IP connectivity. All tunneling, encapsulation, and virtual routing happens in software (OVS kernel module).
Key Insight: Since TEP endpoints are at hosts (not at ToR switches), the fabric layer doesn’t need EVPN or VXLAN. Pure L3 BGP/ECMP routing is sufficient and much simpler.
Benefits of Unified Fabric:
- Maximum path diversity: With 2 ToRs per rack and 2 spines, there are 8 equal-cost paths between any two hosts in different racks (2 source ToRs × 2 spines × 2 destination ToRs)
- Better bandwidth utilization: Traffic can use any available path, not restricted to separate fabric pools
- Automatic path selection: BGP ECMP automatically distributes across all available paths
- Pure L3 routing: Each NIC is a separate routed interface - no bonding complexity
- No peer-link dependency: Unlike MLAG, ToRs are independent - no shared state between ToR-A and ToR-B
- Natural multipath: Loopback advertised via both NICs creates equal-cost paths automatically
- Standard modern DC design: Same architecture as hyperscaler datacenters
6.5.2 Comparison with Other Architectures
| Aspect | A/B Fabrics (Our Design) | MLAG/vPC | EVPN Multihoming |
|---|---|---|---|
| Overlay resilience | Perfect - zero shared state | Poor - peer-link failure disrupts all tunnels | Good - but complex |
| 200G multipath | Pure L3 ECMP (2 × 100G NICs) | LACP hardware bond | ESI-based multihoming |
| Failure isolation | Complete - independent fabrics | Shared - peer-link SPOF | Good but complex |
| Complexity | Low - pure L3 routing | Medium-High - state sync | High - EVPN control plane |
| Operational simplicity | Simple - independent maintenance | Complex - coordinated upgrades | Very complex - EVPN ops |
| NIC configuration | Separate routed interfaces | NICs bonded at L2 | Depends on implementation |
| Config lines/switch | ~50 lines | ~150 lines | ~300+ lines |
| Vendor dependency | None - standard BGP/ECMP | Vendor-specific (Cisco vPC, Arista MLAG) | EVPN support required |
| Suitability for OVN | Excellent - TEPs at hosts | Poor - overlay conflicts with L2 | Overkill - not needed |
Rating:
- Independent A/B Fabrics: ★★★★★ (Recommended for OpenStack with OVN)
- MLAG/vPC: ★★☆☆☆ (Not recommended - peer-link is a critical failure point for overlays)
- EVPN Multihoming: ★★★☆☆ (Overkill - not needed when TEPs are at hosts)
6.6 Design Principles
6.6.1 1. Single Unified L3 Fabric with Dual ToRs
All ToRs connect to all spines in a single routing domain:
- No peer-links between ToR-A and ToR-B (no MLAG)
- Single routing domain with multiple redundant paths
- All ToRs peer with all spines via BGP
- Maximum ECMP path diversity
6.6.2 2. Pure L3 Underlay
- No bridges: All switching is L3 routing
- No VLANs: IP-only underlay
- No EVPN/VXLAN at fabric: Fabric only routes IP packets, doesn’t understand overlays
- Point-to-point links: /31 links between all devices (RFC 3021)
- BGP routing: Standard eBGP for route advertisement (no iBGP, no route reflectors)
- ECMP: Automatic load balancing across multiple paths using 5-tuple hashing
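As a hedged illustration of how little per-switch configuration this principle implies, the sketch below renders a minimal FRR-style eBGP stanza for one ToR; the ASNs, router-id, loopback, and neighbor addresses are assumptions for the example, not the deployment's values.

```python
# Sketch: render a minimal FRR-style eBGP stanza for one ToR.
# ASNs, router-id, loopback, and neighbor addresses are illustrative only.
def tor_bgp_config(asn, router_id, loopback, neighbors):
    lines = [f"router bgp {asn}",
             f" bgp router-id {router_id}"]
    for addr, remote_asn in neighbors:
        lines += [f" neighbor {addr} remote-as {remote_asn}",
                  f" neighbor {addr} bfd"]          # fast failure detection
    lines += [" address-family ipv4 unicast",
              f"  network {loopback}",              # advertise the loopback
              "  maximum-paths 8",                  # enable ECMP
              " exit-address-family"]
    return "\n".join(lines)

print(tor_bgp_config(
    asn=65101, router_id="10.254.1.11", loopback="10.254.1.11/32",
    neighbors=[("172.20.1.0", 65001),   # Spine-1 end of a /31 link
               ("172.20.1.2", 65002)])) # Spine-2 end of a /31 link
```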
6.6.3 3. Host-Based TEPs
TEP endpoints are at hypervisors (not at ToR switches):
- Each host’s loopback IP is its TEP (Tunnel Endpoint)
- OVN/OVS handles GENEVE encapsulation/decapsulation
- Fabric layer doesn’t need EVPN/VXLAN - just routes IP packets
- Hardware acceleration: ConnectX-6 DX supports GENEVE offload
Note: ToR switches are actually L3 routers doing BGP routing, not L2 switches. The name “ToR” refers to physical location; the function is pure L3 routing.
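A minimal sketch of the host-side TEP wiring, assuming ovn-controller and OVS are installed on the hypervisor: the host loopback is registered as the GENEVE encap IP through the standard Open vSwitch external-ids keys (the addresses below are illustrative).

```python
# Sketch: point ovn-controller's GENEVE TEP at the host loopback.
# Addresses are illustrative; run on a hypervisor with OVS/OVN installed.
import subprocess

LOOPBACK_IP = "10.255.1.11"            # host loopback used as the TEP
OVN_SB_DB   = "tcp:10.254.0.10:6642"   # OVN Southbound DB endpoint (example)

subprocess.run([
    "ovs-vsctl", "set", "Open_vSwitch", ".",
    "external-ids:ovn-encap-type=geneve",
    f"external-ids:ovn-encap-ip={LOOPBACK_IP}",
    f"external-ids:ovn-remote={OVN_SB_DB}",
], check=True)
```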
6.7 Why No EVPN/VXLAN at Fabric Layer?
6.7.1 Traditional DC Design (TEPs at Leaf Switches)
- TEP endpoints are at Leaf/ToR switches
- EVPN control plane needed to coordinate TEP discovery
- VXLAN tunnels between switches
- Complex switch configuration (~300+ lines)
6.7.2 Our Design (TEPs at Hosts)
- TEP endpoints are at hypervisors (each host is a TEP)
- OVN control plane handles all TEP discovery and VM learning
- Fabric only routes IP packets - doesn’t need to understand overlays
- Simple switch configuration - just BGP and ECMP (~50 lines)
6.7.3 Benefits of Host-Based TEPs
- Vendor neutrality: Fabric switches don’t need vendor-specific EVPN features
- Simpler fabric: Pure L3 routing is easier to operate and debug
- Better scalability: No switch FIB limits for MAC addresses
- Hardware acceleration: Modern NICs support GENEVE/VXLAN offload
- OVN integration: Native OVN control plane, no need for EVPN semantics
References:
- OpenStack Architecture Guide - L3 Underlay
- Canonical OpenStack Design Considerations
6.8 IP Addressing Strategy
The architecture uses hierarchical addressing for easy debugging and management. The IP address structure encodes:
- Role: Device type (spine, ToR, host)
- Pod: Network Pod number (for super-spine architecture)
- Rack: Physical rack location
6.8.1 Addressing Principles
- Loopback IPs: Organized by role + pod + rack
  - Network devices: 10.254.{pod-rack-role}.*
  - Host TEPs: 10.255.{pod-rack}.{host}
- Point-to-Point Links: Allocated from per-rack/per-pod pools
  - Host↔ToR: 172.16.{pod-rack}.0/24 (split A/B)
  - ToR↔Spine: 172.20.{pod}.0/22 per pod
  - Spine↔Super-Spine: 172.24.100.0/24
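A small sketch of how the hierarchy pays off during debugging: the prefix alone identifies the device class. The octet-level decoding below is an illustrative assumption; the authoritative layout lives in Network Design & IP Addressing.

```python
# Sketch: decode an address into its role using the documented ranges.
# The per-octet interpretation is an assumption for illustration only.
import ipaddress

def describe(addr: str) -> str:
    first, second, third, fourth = ipaddress.ip_address(addr).packed
    if (first, second) == (10, 254):
        return f"network-device loopback (pod/rack/role byte={third}, unit={fourth})"
    if (first, second) == (10, 255):
        return f"host TEP (pod/rack byte={third}, host={fourth})"
    if (first, second) == (172, 16):
        return f"host<->ToR point-to-point link (pod/rack byte={third})"
    return "outside the documented ranges"

for a in ("10.254.100.1", "10.255.1.11", "172.16.1.1"):
    print(a, "->", describe(a))
```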
6.8.2 Benefits
- Easy Debugging: IP address immediately reveals pod, rack, and role
- Scalable: Clear allocation scheme for adding new pods/racks
- Organized: Logical grouping by network topology
- Consistent: Same pattern across all Network Pods
For concrete IP allocation examples, see Network Design & IP Addressing.
6.9 Hardware Configuration
6.9.1 Server NICs
- 2 × 100G per server (NVIDIA ConnectX-6 DX)
- Hardware GENEVE offload enabled
- Total aggregate: 200G via pure L3 ECMP
- Each NIC is a separate routed interface
6.9.2 Switch Hardware
- ToR Switches:
- Option 1: 100G × 64 ports (Tomahawk-based)
- Option 2: 200G × 32 ports (Tomahawk-based)
- Spine Switches:
- 400G switches (Tomahawk-based)
- All switches: Pure L3 routers with BGP/ECMP
6.10 Operational Benefits
6.10.1 Rolling Maintenance
- Upgrade spines one at a time: ECMP redistributes to remaining spines
- Upgrade ToRs rack by rack: Dual ToRs provide redundancy during upgrades
- Rolling upgrades: No downtime - traffic shifts to redundant paths
6.10.2 Simple Troubleshooting
- Pure L3: Standard tools (ping, traceroute, tcpdump)
- No state sync issues: No MLAG peer-link to debug
- Clear failure domains: Easy to isolate problems
6.10.3 Configuration Simplicity
- ~50 config lines per switch: Pure L3 routing (BGP)
- No MLAG complexity: No peer-link, no state synchronization
- Standard protocols: BGP, ECMP - all well-understood
6.11 Capacity Planning Principles
6.11.1 Critical: Size for 100% Load
DO NOT assume 50/50 traffic split across fabrics. Plan for:
- Normal operation: 40-60% per fabric (depends on flow distribution)
- Failover scenario: 100% capacity on single fabric
- Spine uplinks: Must handle full rack capacity during failover
6.11.2 Bandwidth Calculation
- Per host: 200G aggregate (100G per fabric)
- Per rack: 25 hosts × 200G = 5 Tbps theoretical
- Per ToR: Must handle 100% of rack traffic during failover
- Per spine: Must handle sum of all ToR uplinks in fabric
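A worked version of this arithmetic, assuming the 25-host, 2 × 100G figures used above:

```python
# Sketch of the sizing arithmetic: size for failover, not for a 50/50 split.
hosts_per_rack = 25
nic_gbps, nics_per_host = 100, 2

rack_theoretical = hosts_per_rack * nic_gbps * nics_per_host   # 5000 Gbps (5 Tbps)
# With one ToR down, each host keeps one 100G uplink, and ALL remaining
# rack traffic traverses the surviving ToR - its uplinks must carry it alone:
surviving_tor_load = hosts_per_rack * nic_gbps                 # 2500 Gbps

print(f"rack theoretical aggregate: {rack_theoretical} Gbps")
print(f"surviving-ToR load during failover: {surviving_tor_load} Gbps")
```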
6.12 Firewall Integration
6.12.1 Active-Active Firewall Design
- Firewall-1:
- eth0 → ToR-A (Rack 1)
- eth1 → ToR-B (Rack 1)
- Firewall-2:
- eth0 → ToR-A (Rack 2)
- eth1 → ToR-B (Rack 2)
Each firewall connects to dual ToRs with ECMP routing. Traffic can use any available path through the unified fabric.
6.13 Power Domain Separation
Best practice for failure isolation:
- Rack ToR-A switches → PDU-A → UPS-A
- Rack ToR-B switches → PDU-B → UPS-B
- Spines: Distribute across different power domains for redundancy
This provides redundancy at the power level. If PDU-A fails, ToR-B switches and some spines remain operational.
6.14 Key Design Decisions
6.14.1 Why /32 for loopbacks?
- Provides stable identity for OVN TEP
- Simplifies routing (no subnet concerns)
- Works well with BGP
- At 150 hosts, FIB scale is not an issue
6.14.2 Why /31 for point-to-point links?
- Standard for P2P links (RFC 3021)
- Saves IP addresses
- Clear intent: this is a P2P link
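A short sketch with Python’s `ipaddress` module showing how a per-rack pool (the pool address is illustrative) carves into /31 links, each with two usable addresses per RFC 3021:

```python
# Sketch: carve a per-rack /24 pool (address illustrative) into
# RFC 3021 /31 point-to-point links - two usable addresses each.
import ipaddress

pool = ipaddress.ip_network("172.16.1.0/24")
links = list(pool.subnets(new_prefix=31))          # 128 /31 links per /24

for i, link in enumerate(links[:3]):               # show the first few
    tor_end, host_end = list(link)                 # both addresses are usable
    print(f"link {i}: {link}  ToR={tor_end}  host={host_end}")
print(f"total /31 links available in {pool}: {len(links)}")
```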
6.14.3 Why no summarization initially?
- Avoids blackholes during failures
- Simpler to debug
- At this scale, /32s are manageable
- Can add summarization later with proper safeguards
6.14.4 Why GENEVE over VXLAN?
- More extensible (variable-length options)
- Better for OVN’s use case
- Native OVN support
- Similar performance characteristics
- Hardware offload: Modern NICs support GENEVE offload
6.14.5 Why pure L3 multipath (no bonding)?
- Each NIC is a separate routed interface: No L2 constructs, no bonding complexity
- Loopback advertised via both NICs: BGP creates equal-cost paths automatically
- ECMP handles load balancing: Kernel routing table distributes traffic across both paths
- 5-tuple hashing: GENEVE (UDP) provides excellent hash entropy
- Automatic failover: BGP withdraws failed path, ECMP uses remaining path
- No hardware dependency: Pure L3 routing, no switch-side LACP needed
- Servers are routers: First-class multipath support, not a hack
6.14.6 How GENEVE Uses ECMP
- 5-tuple hashing: Source IP, Dest IP, Source Port, Dest Port, Protocol
- Unique source ports: Each connection gets different source port
- Kernel ECMP: Routes across both NICs (eth0/eth1) based on 5-tuple
- Fabric ECMP: Routes across multiple spine paths based on 5-tuple
- Automatic distribution: No manual configuration needed
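The sketch below imitates this behaviour with a toy 5-tuple hash (a stand-in for whatever hash the kernel or switch ASIC actually uses) to show why varying GENEVE source ports spreads flows roughly evenly across both uplinks:

```python
# Sketch: 5-tuple hashing spreads GENEVE (UDP 6081) flows across next hops.
# The CRC32 hash is a stand-in; real kernels/ASICs use their own functions.
import random, zlib
from collections import Counter

next_hops = ["eth0 -> ToR-A", "eth1 -> ToR-B"]
GENEVE_DST_PORT = 6081

def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

rng = random.Random(0)
flows = Counter()
for _ in range(10_000):
    # each tunnelled connection gets a distinct source port -> good hash entropy
    src_port = rng.randint(32768, 60999)
    flows[pick_next_hop("10.255.1.11", "10.255.2.21", src_port, GENEVE_DST_PORT)] += 1

print(flows)   # roughly even split across the two uplinks
```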
6.14.7 Kubernetes Integration
- Kubernetes uses OVN: OVN-Kubernetes CNI integrates with OVN control plane
- No double overlay: Pods and VMs share same GENEVE overlay
- Unified networking: One control plane (OVN) for all workloads
- Consistent policies: OVN security groups apply to both pods and VMs
6.15 Why Not EVPN at Fabric Layer?
Common Question: Do we need EVPN at the fabric layer?
Answer: No. EVPN is not needed at the fabric layer because:
- TEP endpoints are at hosts (hypervisors), not at ToR switches
- OVN control plane handles all TEP discovery and VM learning
- Fabric only routes IP packets - it doesn’t need to understand overlays
- Simpler operations - pure L3 routing is easier than EVPN
EVPN semantics: OVN provides TEP registration, not EVPN.
- OVN control plane handles TEP registration when hosts come up
- OVN Southbound DB maintains TEP-to-VM mappings
- No EVPN needed: OVN’s control plane replaces EVPN’s function
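As a hedged operational sketch, assuming `ovn-sbctl` can reach the Southbound database, listing the Encap table is one way to confirm each hypervisor registered its loopback as a GENEVE TEP:

```python
# Sketch: confirm each hypervisor registered a GENEVE TEP in the OVN
# Southbound DB (run where ovn-sbctl can reach the SB database).
import subprocess

out = subprocess.run(["ovn-sbctl", "--format=csv", "list", "Encap"],
                     capture_output=True, text=True, check=True)
print(out.stdout)   # expect one geneve encap per host, ip = host loopback
```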
6.16 References
For detailed implementation:
- Network Design & IP Addressing - Concrete IP plans
- BGP & Routing Configuration - BGP configuration details
- Operations & Maintenance - Operational procedures
6.16.1 Technical References
- OVN Architecture Documentation - Official OVN architecture
- Red Hat OpenStack Platform - Networking with OVN - OVN in OpenStack
- OpenStack OVN Wiki - OpenStack OVN networking
- Canonical OpenStack Design Considerations - Canonical’s OpenStack design
- RFC 8926 - GENEVE - GENEVE protocol specification
- RFC 3021 - /31 Point-to-Point Links - Point-to-point link addressing