4  Network Architecture Overview

4.1 Overview

This chapter describes the network architecture for OpenStack deployment with OVN/OVS overlay networks using GENEVE encapsulation.

The design uses dual ToRs (A/B) per rack in a single unified L3 Clos fabric. “A/B” refers to the dual ToRs and dual host uplinks (separate switches and power domains for redundancy); within a Network Pod, the underlay is one unified L3 routing domain, with no MLAG or peer-link between the ToRs.

Note: For definitions of technical terms, see the Glossary.

4.2 Core Philosophy: Pure L3 Everywhere

The underlay’s only job is to move IP packets between hosts reliably and at full bandwidth. Everything else (tenant networks, isolation, mobility) is handled by OVN on top.

4.2.1 Why L3-Only?

We deliberately choose L3 everywhere and avoid L2 constructs (bridges, MLAG, vPC, STP):

  • L3 scales cleanly: No broadcast, no flooding, no spanning tree, no split-brain risk
  • Failures are explicit and fast: Links fail → BFD detects → BGP withdraws → ECMP reconverges
  • Predictable paths: Every packet is routed; no hidden L2 behavior
  • Modern DC norm: Clos fabrics, hyperscalers, and OVN/GENEVE overlays all assume an IP fabric

No VXLAN/EVPN at fabric layer: We’re not running VXLAN/EVPN in the fabric; OVN handles overlay (GENEVE) at the hosts, and the fabric stays pure L3 (BGP/ECMP/BFD). For a detailed explanation, see Appendix: Why Not VXLAN/EVPN.

4.2.2 What We Explicitly Avoid

  • No L2 bridges in the underlay
  • No MLAG / vPC / stacked ToRs
  • No shared MACs across links
  • No dependence on broadcast or ARP domains
  • No software bonding (each NIC is separate routed interface)

4.3 Unified L3 Clos Fabric with Dual ToRs

The design uses a single L3 Clos fabric with dual ToRs per rack for redundancy and path diversity, providing excellent ECMP load balancing and bandwidth utilization.

4.3.1 The Mental Model

Think of the network in layers:

  • Physical ports are just pipes.
  • Per-link IPs define how neighbors talk (point-to-point /31 links).
  • The loopback IP defines who the node is (independent of physical links).
  • BGP determines who can reach whom (route advertisement).
  • ECMP uses all available paths (automatic load balancing).
  • OVN/GENEVE creates virtual networks on top, completely decoupled from the physical fabric.

4.3.2 Multiple NICs per Server - Pure L3 Approach

Each server has two NICs, each with its own IP on a different routed network. eth0 connects to ToR-A, eth1 connects to ToR-B. The server’s loopback IP is advertised via eBGP through both NICs.

Result: The fabric learns multiple equal-cost paths to the same server loopback. Traffic can enter or leave the server via either NIC. If one NIC or ToR fails, the loopback remains reachable via the other.

What we avoid: We avoid L2 dual-homing constructs (MLAG/vPC) and we do not use host bonding/LACP for the underlay; redundancy and bandwidth come from L3 ECMP. No software bonding (bond0, balance-xor), no shared MACs, no bridges between ToRs.

Why this works: Servers are routers with two uplinks. Each uplink is a separate routed network. The loopback identity is independent of physical links. ECMP handles load balancing automatically using 5-tuple hashing. No ambiguity, no shared state.
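The per-host layout described above can be sketched with Python’s ipaddress module. All addresses below are illustrative placeholders, not the deployment’s actual plan (see Network Design & IP Addressing for that):

```python
import ipaddress

# Illustrative sketch of one host's pure-L3 multipath setup.
loopback = ipaddress.ip_interface("10.255.11.1/32")     # host identity / GENEVE TEP
uplinks = {
    "eth0": ipaddress.ip_interface("172.16.11.0/31"),   # routed /31 toward ToR-A
    "eth1": ipaddress.ip_interface("172.16.11.128/31"), # routed /31 toward ToR-B
}

def peer(iface: ipaddress.IPv4Interface) -> ipaddress.IPv4Address:
    """On an RFC 3021 /31, the BGP neighbor is simply the other address."""
    a, b = iface.network.network_address, iface.network.broadcast_address
    return b if iface.ip == a else a

# The host runs eBGP on both links and advertises its /32 loopback to each
# ToR; the fabric then holds two equal-cost next-hops toward the loopback.
ecmp_next_hops = {nic: peer(ifc) for nic, ifc in uplinks.items()}
print(ecmp_next_hops)  # eth0 -> 172.16.11.1 (ToR-A), eth1 -> 172.16.11.129 (ToR-B)
```

Note that the loopback never appears in the uplink subnets: the host’s identity survives the loss of either physical link.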

4.4 Topology Design

4.4.1 Current Deployment: Leaf-Spine (Clos) Fabric

  • Start: 3 racks
  • Scale to: 16 racks maximum
  • Servers per rack: ~25
  • Total capacity: up to 400 servers

The topology is a standard Clos fabric where all ToRs (leaves) connect to all spines. Each rack has dual ToRs (ToR-A and ToR-B), and each ToR is an independent L3 router with 24×200G downlink ports for servers (hosts attach at 100G per NIC) and 8×400G uplink ports, of which two per ToR are used for spine connectivity in this design.

We deploy 2× 32-port 400G spine switches (Spine-1, Spine-2). Every ToR connects to both spines, creating a single unified L3 routing domain with maximum path diversity. ECMP automatically distributes traffic across all available paths.

Single Network Pod - Leaf-Spine Architecture

The capacity limit comes from spine port count: each rack needs 4 spine ports (2 ToRs × 2 connections). With 2× 32-port spines = 64 total ports, we support 16 racks maximum.

Path diversity between any two hosts in different racks: 2 source NICs × 2 spines × 2 destination ToRs = 8 distinct paths.
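Both limits quoted above are simple arithmetic; as a sanity check (a sketch mirroring the numbers in the text):

```python
# Fabric limits for the single-Network-Pod leaf-spine design.
SPINES = 2
PORTS_PER_SPINE = 32
TORS_PER_RACK = 2
SPINE_LINKS_PER_TOR = 2  # each ToR connects once to each spine

spine_ports_total = SPINES * PORTS_PER_SPINE                 # 64 ports
spine_ports_per_rack = TORS_PER_RACK * SPINE_LINKS_PER_TOR   # 4 ports per rack
max_racks = spine_ports_total // spine_ports_per_rack        # 16 racks

# Inter-rack path diversity: source NICs x spines x destination ToRs
NICS_PER_HOST = 2
paths = NICS_PER_HOST * SPINES * TORS_PER_RACK               # 8 distinct paths
print(max_racks, paths)  # 16 8
```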

Deployment is straightforward: deploy ToR pairs per rack, deploy both spine switches, cable all ToRs to both spines (2 connections per ToR), configure eBGP peering between all ToRs and all spines. Each host connects to both ToRs via 2×100G NICs.
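As an illustration of how small the per-switch configuration stays, the sketch below renders a minimal FRR-style eBGP stanza for one ToR. The ASNs, addresses, and rendering helper are invented placeholders; the actual configuration lives in BGP & Routing Configuration:

```python
def tor_bgp_config(local_asn: int, loopback: str,
                   peers: list[tuple[str, int]]) -> str:
    """Render a minimal FRR-style eBGP config for a ToR (illustrative only)."""
    lines = [
        f"router bgp {local_asn}",
        f" bgp router-id {loopback}",
    ]
    for peer_ip, peer_asn in peers:
        lines.append(f" neighbor {peer_ip} remote-as {peer_asn}")
    lines += [
        " address-family ipv4 unicast",
        f"  network {loopback}/32",
        "  maximum-paths 8",  # allow the full ECMP fan-out
        " exit-address-family",
    ]
    return "\n".join(lines)

# Hypothetical ToR peering with the two spines over /31 links.
cfg = tor_bgp_config(
    local_asn=65101,
    loopback="10.254.11.1",
    peers=[("172.20.1.0", 65001), ("172.20.1.2", 65002)],
)
print(cfg)
```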

For detailed IP addressing, see Network Design & IP Addressing.

4.4.2 Future Expansion: Super-Spine (Beyond 16 Racks)

Beyond 16 racks, the architecture can evolve to a hierarchical super-spine design with Network Pods (NP):

Multi Network Pod - Super-Spine Architecture

4.5 Network Pods (NP)

A Network Pod (NP) is a modular scaling unit in the datacenter architecture.

Definition: A Network Pod is a group of racks whose ToRs connect to the same spine set. In our design, one Network Pod = up to 16 racks with 2 dedicated spine switches.

Why Network Pods: They provide a modular scaling unit. When spine port-count or cabling becomes a limit (beyond 16 racks), you create additional Network Pods. Each pod operates independently with its own spine layer, and super-spine switches interconnect the pods.

Traffic patterns:

  • Intra-pod (within the same NP): leaf → spine → leaf (2 hops)
  • Inter-pod (between NPs): leaf → spine → super-spine → spine → leaf (4 hops)

Naming clarification: We use “Network Pod (NP)” to avoid confusion with Kubernetes pods. NP refers to physical infrastructure grouping, not containerized workloads.

When you need it: For deployments under 16 racks, you have a single Network Pod (NP1). The super-spine evolution is not needed until you exceed 16 racks, but the hierarchical IP addressing scheme (see Network Design & IP Addressing) is designed with this future growth in mind.

For a comparison of our approach with Google’s Jupiter datacenter network architecture, see Appendix: Google Jupiter Comparison.

4.6 Why Unified Fabric with Dual ToRs?

4.6.1 Excellent Path Diversity and Bandwidth Utilization

With OVN/OVS overlay networks, the physical underlay only needs to provide basic IP connectivity. All tunneling, encapsulation, and virtual routing happens in software (OVS kernel module).

Key Insight: Since TEP endpoints are at hosts (not at ToR switches), the fabric layer doesn’t need EVPN or VXLAN. Pure L3 BGP/ECMP routing is sufficient and much simpler.

Benefits of Unified Fabric:

  1. Maximum path diversity: With 2 ToRs per rack and 2 spines, there are 8 possible paths between any two hosts
  2. Better bandwidth utilization: Traffic can use any available path across the unified fabric
  3. Automatic path selection: BGP ECMP automatically distributes across all available paths
  4. Pure L3 routing: Each NIC is a separate routed interface - no bonding complexity
  5. No peer-link dependency: Unlike MLAG, ToRs are independent - no shared state between ToR-A and ToR-B
  6. Natural multipath: Loopback advertised via both NICs creates equal-cost paths automatically
  7. Standard modern DC design: Same architecture as hyperscaler datacenters

4.6.2 Comparison with Other Architectures

| Aspect | Dual ToRs + Unified Fabric (Our Design) | MLAG/vPC | EVPN Multihoming |
|--------|-----------------------------------------|----------|------------------|
| Overlay resilience | Perfect - zero shared state between ToRs | Poor - peer-link failure disrupts all tunnels | Good - but complex |
| 200G multipath | Pure L3 ECMP (2 × 100G NICs) | LACP hardware bond | ESI-based multihoming |
| Failure isolation | Complete - separate ToRs, no peer-link | Shared - peer-link SPOF | Good but complex |
| Complexity | Low - pure L3 routing | Medium-High - state sync | High - EVPN control plane |
| Operational simplicity | Simple - independent maintenance | Complex - coordinated upgrades | Very complex - EVPN ops |
| NIC configuration | Separate routed interfaces | NICs bonded at L2 | Depends on implementation |
| Config lines/switch | ~50 lines | ~150 lines | ~300+ lines |
| Vendor dependency | None - standard BGP/ECMP | Vendor-specific (Cisco vPC, Arista MLAG) | EVPN support required |
| Suitability for OVN | Excellent - TEPs at hosts | Poor - overlay conflicts with L2 | Overkill - not needed |

Rating:

  • Dual ToRs in Unified L3 Fabric: ★★★★★ (Recommended for OpenStack with OVN)
  • MLAG/vPC: ★★☆☆☆ (Not recommended - peer-link is critical failure point for overlays)
  • EVPN Multihoming: ★★★☆☆ (Overkill - not needed when TEPs are at hosts)

4.7 Design Principles

4.7.1 Single Unified L3 Fabric with Dual ToRs

All ToRs connect to all spines in a single routing domain:

  • No peer-links between ToR-A and ToR-B (no MLAG)
  • Single routing domain with multiple redundant paths
  • All ToRs peer with all spines via BGP
  • Maximum ECMP path diversity

4.7.2 Pure L3 Underlay

  • No bridges: All switching is L3 routing
  • No VLANs: IP-only underlay
  • No EVPN/VXLAN at fabric: Fabric only routes IP packets, doesn’t understand overlays
  • Point-to-point links: /31 links between all devices (RFC 3021)
  • BGP routing: Standard eBGP for route advertisement (no iBGP, no route reflectors)
  • ECMP: Automatic load balancing across multiple paths using 5-tuple hashing
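The /31 point-to-point principle is easy to see in code. A small sketch (the pool prefix is a placeholder) carving a per-rack /24 into RFC 3021 /31 links:

```python
import ipaddress

# A hypothetical per-rack pool, split into /31 point-to-point links.
pool = ipaddress.ip_network("172.16.11.0/24")
p2p_links = list(pool.subnets(new_prefix=31))  # 128 /31s per /24

first = p2p_links[0]
# Each /31 has exactly two usable addresses - one per end of the link
# (no separate network/broadcast addresses are burned, per RFC 3021).
ends = (first.network_address, first.broadcast_address)
print(len(p2p_links), ends)  # 128 (IPv4Address('172.16.11.0'), IPv4Address('172.16.11.1'))
```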

4.7.3 Host-Based TEPs

TEP endpoints are at the hypervisors (not at ToR switches):

  • Each host’s loopback IP is its TEP (Tunnel Endpoint)
  • OVN/OVS handles GENEVE encapsulation/decapsulation
  • The fabric layer doesn’t need EVPN/VXLAN - it just routes IP packets
  • Hardware acceleration: ConnectX-6 DX supports GENEVE offload

Note: ToR switches are actually L3 routers doing BGP routing, not L2 switches. The name “ToR” refers to physical location, but function is pure L3 routing.

4.8 Why No EVPN/VXLAN at Fabric Layer?

The key choice: overlay terminates at hypervisors (host-to-host), not at switches (ToR-to-ToR).

Traditional: ToRs are VTEPs with EVPN control plane (~300 lines config per switch).
Ours: Hypervisors are TEPs with OVN control plane (~50 lines config per switch).

Why host-based: Modern NICs hardware-offload GENEVE (ConnectX-6 DX), OVN already provides overlay control, switches stay simple and vendor-neutral.

For detailed diagrams, EVPN-MH discussion, and complete comparison, see Appendix: Why Not VXLAN/EVPN.

References:

  • OpenStack Architecture Guide - L3 Underlay
  • Canonical OpenStack Design Considerations

4.9 IP Addressing Strategy

The architecture uses hierarchical addressing for easy debugging and management. The IP address structure encodes:

  • Role: Device type (spine, ToR, host)
  • Pod: Network Pod number (for super-spine architecture)
  • Rack: Physical rack location

4.9.1 Addressing Principles

  1. Loopback IPs: Organized by role + pod + rack
    • Network devices: 10.254.{pod-rack-role}.*
    • Host TEPs: 10.255.{pod-rack}.{host}
  2. Point-to-Point Links: Allocated from per-rack/per-pod pools
    • Host↔ToR: 172.16.{pod-rack}.0/24 (split A/B)
    • ToR↔Spine: 172.20.{pod}.0/22 per pod
    • Spine↔Super-Spine: 172.24.100.0/24
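One way this scheme can be realized in tooling is sketched below. The packing of pod and rack into a single octet (pod×16+rack) is an assumed encoding for illustration only; the authoritative plan is in Network Design & IP Addressing:

```python
def host_tep(pod: int, rack: int, host: int) -> str:
    """Host TEP loopback: 10.255.{pod-rack}.{host}.
    The {pod-rack} octet is packed as pod*16+rack - an assumed encoding."""
    return f"10.255.{pod * 16 + rack}.{host}"

def host_tor_pool(pod: int, rack: int) -> str:
    """Per-rack host-to-ToR /31 pool: 172.16.{pod-rack}.0/24 (split A/B)."""
    return f"172.16.{pod * 16 + rack}.0/24"

# Hypothetical example: pod 1, rack 3, host 25.
print(host_tep(1, 3, 25))   # 10.255.19.25
print(host_tor_pool(1, 3))  # 172.16.19.0/24
```

With an encoding like this, an address seen in a packet capture immediately reveals the pod, rack, and host, which is the debugging benefit listed below.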

4.9.2 Benefits

  • Easy Debugging: IP address immediately reveals pod, rack, and role
  • Scalable: Clear allocation scheme for adding new pods/racks
  • Organized: Logical grouping by network topology
  • Consistent: Same pattern across all Network Pods

For concrete IP allocation examples, see Network Design & IP Addressing.

4.10 Hardware Configuration

4.10.1 Server NICs

  • 2 × 100G per server (NVIDIA ConnectX-6 DX)
  • Hardware GENEVE offload enabled
  • Total aggregate: 200G via pure L3 ECMP
  • Each NIC is a separate routed interface

4.10.2 Switch Hardware

  • ToR Switches:
    • Option 1: 100G × 64 ports (Tomahawk-based)
    • Option 2: 200G × 32 ports (Tomahawk-based)
  • Spine Switches:
    • 400G switches (Tomahawk-based)
  • All switches: Pure L3 routers with BGP/ECMP

4.11 Operational Benefits

4.11.1 Rolling Maintenance

  • Upgrade spines one at a time: ECMP redistributes to remaining spines
  • Upgrade ToRs rack by rack: Dual ToRs provide redundancy during upgrades
  • Rolling upgrades: No downtime - traffic shifts to redundant paths

4.11.2 Simple Troubleshooting

  • Pure L3: Standard tools (ping, traceroute, tcpdump)
  • No state sync issues: No MLAG peer-link to debug
  • Clear failure domains: Easy to isolate problems

4.11.3 Configuration Simplicity

  • ~50 config lines per switch: Pure L3 routing (BGP)
  • No MLAG complexity: No peer-link, no state synchronization
  • Standard protocols: BGP, ECMP - all well-understood

4.12 Capacity Planning Principles

4.12.1 Critical: Size for 100% Load

DO NOT assume a 50/50 traffic split across the two uplinks/ToRs. Plan for:

  • Normal operation: 40-60% per uplink (depends on flow distribution)
  • Failover scenario: 100% of traffic on the surviving uplink/ToR
  • Spine uplinks: must handle full rack capacity during failover

4.12.2 Bandwidth Calculation

  • Per host: 200G aggregate (100G per ToR)
  • Per rack: 25 hosts × 200G = 5 Tbps theoretical
  • Per ToR: must handle 100% of the rack's traffic if its peer ToR fails
  • Per spine: must handle the sum of all ToR uplinks during spine failover
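The figures above follow directly from the NIC counts (values in Gbit/s; a sketch, not a capacity-planning tool):

```python
# Per-rack bandwidth arithmetic for the dual-ToR design.
HOSTS_PER_RACK = 25
NIC_GBPS = 100
NICS_PER_HOST = 2

per_host = NICS_PER_HOST * NIC_GBPS       # 200G aggregate per host
per_rack = HOSTS_PER_RACK * per_host      # 5000G = 5 Tbps theoretical per rack

# If ToR-A fails, every host falls back to its single NIC toward ToR-B,
# so the surviving ToR must absorb the whole rack's access traffic:
tor_failover = HOSTS_PER_RACK * NIC_GBPS  # 2500G through one ToR
print(per_host, per_rack, tor_failover)   # 200 5000 2500
```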

4.13 Firewall Integration

4.13.1 Active-Active Firewall Design

  • Firewall-1:
    • eth0 → ToR-A (Rack 1)
    • eth1 → ToR-B (Rack 1)
  • Firewall-2:
    • eth0 → ToR-A (Rack 2)
    • eth1 → ToR-B (Rack 2)

Each firewall connects to dual ToRs with ECMP routing. Traffic can use any available path through the unified fabric.

4.14 Power Domain Separation

Best practice for failure isolation:

  • Rack ToR-A switches → PDU-A → UPS-A
  • Rack ToR-B switches → PDU-B → UPS-B
  • Spines: Distribute across different power domains for redundancy

This provides redundancy at the power level. If PDU-A fails, ToR-B switches and some spines remain operational.

4.15 Key Design Decisions

4.15.1 Why /32 for loopbacks?

  • Provides stable identity for OVN TEP
  • Simplifies routing (no subnet concerns)
  • Works well with BGP
  • At 150 hosts, FIB scale is not an issue

4.15.2 Why no summarization initially?

  • Avoids blackholes during failures
  • Simpler to debug
  • At this scale, /32s are manageable
  • Can add summarization later with proper safeguards

4.15.3 Why GENEVE over VXLAN?

  • More extensible (variable-length options)
  • Better for OVN’s use case
  • Native OVN support
  • Similar performance characteristics
  • Hardware offload: Modern NICs support GENEVE offload

4.15.4 Why pure L3 multipath (no bonding)?

  • Each NIC is separate routed interface: No L2 constructs, no bonding complexity
  • Loopback advertised via both NICs: BGP creates equal-cost paths automatically
  • ECMP handles load balancing: Kernel routing table distributes traffic across both paths
  • 5-tuple hashing: GENEVE (UDP) provides excellent hash entropy
  • Automatic failover: BGP withdraws failed path, ECMP uses remaining path
  • No hardware dependency: Pure L3 routing, no switch-side LACP needed
  • Servers are routers: First-class multipath support, not a hack

4.15.5 How GENEVE Uses ECMP

  • 5-tuple hashing: Source IP, Dest IP, Source Port, Dest Port, Protocol
  • Unique source ports: Each connection gets different source port
  • Kernel ECMP: Routes across both NICs (eth0/eth1) based on 5-tuple
  • Fabric ECMP: Routes across multiple spine paths based on 5-tuple
  • Automatic distribution: No manual configuration needed
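The mechanism above can be imitated with a toy hash. SHA-256 stands in for the hardware/kernel hash function, and the addresses and port range are illustrative (though 6081 is GENEVE’s registered UDP port):

```python
import hashlib
import ipaddress
import struct

def ecmp_bucket(src: str, dst: str, sport: int, dport: int,
                proto: int, n_paths: int) -> int:
    """Toy 5-tuple hash: same flow always picks the same path,
    different flows spread across paths (real devices use their own hash)."""
    key = struct.pack("!4s4sHHB",
                      ipaddress.ip_address(src).packed,
                      ipaddress.ip_address(dst).packed,
                      sport, dport, proto)
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

# GENEVE is UDP (protocol 17, destination port 6081); OVN varies the UDP
# source port per inner flow, giving the 5-tuple hash plenty of entropy.
buckets = {ecmp_bucket("10.255.19.1", "10.255.20.1", sport, 6081, 17, 2)
           for sport in range(49152, 49252)}
print(sorted(buckets))  # [0, 1] - 100 flows spread over both NICs
```

Because the hash is deterministic per flow, packets of one connection stay on one path (no reordering), while the population of flows fills both NICs and all spine paths.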

4.15.6 Kubernetes Integration

  • Kubernetes uses OVN: OVN-Kubernetes CNI integrates with OVN control plane
  • No double overlay: Pods and VMs share same GENEVE overlay
  • Unified networking: One control plane (OVN) for all workloads
  • Consistent policies: OVN security groups apply to both pods and VMs

4.16 OVN Control Plane: Why We Don’t Need EVPN

OVN provides the overlay control plane, eliminating the need for EVPN at the fabric layer:

  • TEP endpoints are at hosts (hypervisors), not at ToR switches
  • OVN Southbound DB tracks where VMs are located (which host/TEP)
  • OVN control plane handles TEP registration and VM learning
  • Fabric just routes IP - doesn’t need to understand VMs or overlays

For the complete ToR-to-ToR vs Host-to-Host overlay comparison, see Appendix: Why Not VXLAN/EVPN.

4.17 References

For detailed implementation:

  • Network Design & IP Addressing - concrete IP plans
  • BGP & Routing Configuration - BGP configuration details
  • Operations & Maintenance - operational procedures

4.17.1 Technical References