8 Hardware Acceleration and Future Evolution
8.1 Why Hardware Acceleration is Mandatory
At modern network speeds (40+ Gbps), software packet processing hits fundamental CPU limitations.
8.1.1 The 40 Gbps Per-Core Ceiling
Time available to process a standard 1500-byte Ethernet frame:
\[t_{processing} = \frac{S}{R} = \frac{1500 \times 8}{R} = \frac{12000}{R}\]
Where:
- \(S\) = frame size in bits (1500 bytes × 8 = 12,000 bits)
- \(R\) = line rate in bits per second
| Line Rate | Time per Frame | Reality Check |
|---|---|---|
| 1 Gbps | 12 μs | Software can handle this |
| 10 Gbps | 1.2 μs | Software struggles |
| 40 Gbps | 300 ns | Impossible without hardware |
| 100 Gbps | 120 ns | Hardware-only domain |
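The table values follow directly from the formula above; a quick sketch (function name is my own):

```python
def frame_time_ns(frame_bytes: int, rate_gbps: float) -> float:
    """Time budget to process one frame at a given line rate, in nanoseconds."""
    bits = frame_bytes * 8        # 1500 bytes -> 12,000 bits
    return bits / rate_gbps       # bits / (Gbit/s) comes out directly in ns

for rate in (1, 10, 40, 100):
    print(f"{rate:>3} Gbps -> {frame_time_ns(1500, rate):,.0f} ns")
# 1 Gbps -> 12,000 ns; 40 Gbps -> 300 ns; 100 Gbps -> 120 ns
```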
8.1.2 What Happens in 300 Nanoseconds?
At 40 Gbps, you have 300 nanoseconds to:
- Receive frame from NIC
- Parse headers (Ethernet, IP, UDP, GENEVE, inner Ethernet, inner IP)
- Consult routing/flow tables
- Apply security groups/ACLs
- Encapsulate/decapsulate tunnels
- Forward to destination
In practice, software cannot do this at scale. A modern 4 GHz CPU can execute roughly 1,200 instructions in 300 ns (assuming one instruction per cycle), barely enough to parse headers, let alone make forwarding decisions.
Recent experiments (PacketMill, aggressive DPDK optimizations) have demonstrated over 100 Gbps per core, but they rely on specialized kernel-bypass techniques and are not yet production-practical.
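The ~1,200-instruction figure is simple arithmetic, assuming roughly one instruction per cycle:

```python
def instruction_budget(clock_ghz: float, window_ns: float, ipc: float = 1.0) -> int:
    """Rough count of instructions retirable in a time window.

    Assumes a fixed IPC; real out-of-order cores vary widely, and a single
    DRAM miss (~100 ns) can consume a third of a 300 ns budget by itself.
    """
    return int(clock_ghz * window_ns * ipc)  # GHz * ns = cycles

print(instruction_budget(4.0, 300))  # 4 GHz core, 300 ns window -> 1200
```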
8.1.3 Why This Matters for OVN/GENEVE
With a GENEVE overlay, every packet requires:
- Outer header processing (physical network)
- GENEVE decapsulation (remove outer headers)
- Inner header processing (virtual network)
- Security group lookup (OVN rules)
- Re-encapsulation (for forwarding)
This is why hardware offload is mandatory, not optional, for 100G NICs.
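For concreteness, the GENEVE base header the NIC must parse and strip on every packet is only 8 bytes (RFC 8926). A minimal Python parser, illustrative only and not the OVS implementation:

```python
import struct

def parse_geneve_base_header(data: bytes) -> dict:
    """Parse the fixed 8-byte GENEVE base header (RFC 8926)."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    b0, b1, proto, vni_bytes, _rsvd = struct.unpack("!BBH3sB", data[:8])
    return {
        "version": b0 >> 6,                       # 2-bit version, currently 0
        "opt_len_bytes": (b0 & 0x3F) * 4,         # options length, 4-byte units
        "oam": bool(b1 & 0x80),                   # O bit: OAM/control packet
        "critical": bool(b1 & 0x40),              # C bit: critical options present
        "protocol": proto,                        # 0x6558 = inner Ethernet frame
        "vni": int.from_bytes(vni_bytes, "big"),  # 24-bit virtual network id
    }

# A header for VNI 5001 carrying an inner Ethernet frame:
hdr = bytes([0x00, 0x00, 0x65, 0x58]) + (5001).to_bytes(3, "big") + b"\x00"
print(parse_geneve_base_header(hdr))
```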
8.2 Current Hardware Configuration
8.2.1 Server NICs
Each server has 2 × 100G NICs (ConnectX-6 DX):
- eth0: 100G connection to ToR-A (Network-A)
- eth1: 100G connection to ToR-B (Network-B)
- Total aggregate: 200G per server via pure L3 ECMP
8.2.2 ConnectX-6 DX Hardware Acceleration
Mellanox/NVIDIA ConnectX-6 DX provides hardware acceleration for OVN/OVS:
8.2.2.1 GENEVE Offload
- Hardware GENEVE encapsulation/decapsulation: Offloads GENEVE processing from CPU
- Flow steering: Hardware-based packet classification and forwarding
- OVS hardware offload: Direct integration with OVS for accelerated forwarding
8.2.2.2 Benefits
- Reduced CPU overhead: GENEVE processing handled by NIC
- Higher throughput: Hardware acceleration provides line-rate performance
- Lower latency: Hardware forwarding faster than software
- Better scalability: More CPU available for workloads
8.2.2.3 Configuration
# Enable OVS hardware offload on ConnectX-6 DX
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# Verify offload status
ovs-vsctl get Open_vSwitch . other_config:hw-offload

Reference: NVIDIA ConnectX-6 DX Documentation
8.2.3 Switch Hardware
8.2.3.1 ToR Switches
- Option 1: 100G switches with 64 ports (e.g., Tomahawk-based)
- Option 2: 200G switches with 32 ports (e.g., Tomahawk-based)
- Chip: Broadcom Tomahawk ASIC
- Function: Pure L3 routing with BGP/ECMP
8.2.3.2 Spine Switches
- 400G switches (e.g., Tomahawk-based)
- High port density for leaf-spine connectivity
- Function: Pure L3 transit with ECMP
Key: All switches are L3 routers, not L2 switches. Tomahawk ASICs provide excellent L3 forwarding performance.
8.3 Future Evolution: DPUs (Data Processing Units)
8.3.1 What are DPUs?
DPUs (Data Processing Units) are specialized processors that offload networking, storage, and security functions from the host CPU. Examples include:
- NVIDIA BlueField DPU
- AMD Pensando
- Intel IPU (Infrastructure Processing Unit)
8.3.2 How DPUs Fit Our Architecture
8.3.2.1 Current Architecture (Host-Based TEPs)
┌─────────────────────────────────────┐
│ Host CPU │
│ ┌──────────┐ ┌──────────┐ │
│ │ OVN │ │ OVS │ │
│ │ Control │ │ Dataplane│ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ ┌────▼─────────────▼─────┐ │
│ │ ConnectX-6 DX (NIC) │ │
│  │ Hardware GENEVE Offload │  │
│ └─────────────────────────┘ │
└─────────────────────────────────────┘
8.3.2.2 Future Architecture (DPU-Based TEPs)
┌─────────────────────────────────────┐
│ Host CPU (Workloads Only) │
│ ┌──────────┐ │
│ │ VMs │ │
│ │ Pods │ │
│ └────┬─────┘ │
│ │ │
│ ┌────▼──────────────────────────┐ │
│ │ BlueField DPU │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │ OVN │ │ OVS │ │ │
│ │ │ Control │ │ Dataplane│ │ │
│ │ └──────────┘ └──────────┘ │ │
│ │ Hardware GENEVE Offload │ │
│ │ Hardware BGP/ECMP │ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────┘
8.3.3 Benefits of DPU Evolution
- Host CPU Offload: OVN/OVS processing moves to DPU, freeing host CPU for workloads
- Hardware Acceleration: DPUs provide hardware acceleration for:
  - GENEVE encapsulation/decapsulation
  - BGP routing
  - ECMP load balancing
  - Security policies (ACLs, firewalling)
- Consistent Architecture: TEPs still at “host” (now DPU), fabric still pure L3
- Better Performance: Dedicated processing for networking functions
- Isolation: Network processing isolated from workload CPU
8.3.4 Migration Path
When migrating to DPUs:
- TEP moves to DPU: DPU becomes the TEP endpoint
- Fabric unchanged: Still pure L3 BGP/ECMP
- OVN control plane: Runs on DPU, connects to same OVN databases
- BGP on DPU: DPU advertises host loopback via BGP
- Zero fabric changes: Underlay architecture remains identical
Key Insight: DPU evolution is transparent to the fabric. The underlay remains pure L3 BGP/ECMP regardless of where TEPs run.
8.4 Future Evolution: Higher Bandwidth Servers
8.4.1 Current: 2 × 100G (200G aggregate)
8.4.2 Future: 2 × 400G (800G aggregate)
8.4.2.1 Architecture Extension
No changes needed to fabric architecture:
- Same topology: Dual ToRs in unified L3 Clos fabric
- Same routing: Pure L3 BGP/ECMP
- Same principles: Loopback advertised via both NICs
- ECMP scales: Automatically handles higher bandwidth
8.4.2.2 What Changes
- NIC speeds: 100G → 400G per NIC
- Switch ports: ToR switches need 400G ports (or aggregate multiple 100G)
- Link speeds: Point-to-point links become 400G
- ECMP behavior: Same, just more bandwidth per path
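ECMP's bandwidth-agnostic behavior follows from how paths are chosen: a hash of the flow's 5-tuple selects one of N equal-cost next-hops, so faster links change capacity but not the selection logic. A toy sketch (real switch ASICs use proprietary hardware hash functions, not SHA-256):

```python
import hashlib

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int,
                  dst_port: int, proto: int, next_hops: list) -> str:
    """Pick a next-hop by hashing the 5-tuple. The same flow always maps
    to the same path, preserving per-flow packet ordering."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    idx = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(next_hops)
    return next_hops[idx]

paths = ["ToR-A", "ToR-B"]  # selection logic is identical for 100G or 400G links
print(ecmp_next_hop("10.255.11.11", "10.255.12.12", 49152, 6081, 17, paths))
```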
8.4.2.3 Example Evolution
Current (2 × 100G):
- eth0: 100G → ToR-A
- eth1: 100G → ToR-B
- Loopback: 10.255.11.11/32 advertised via both

Future (2 × 400G):
- eth0: 400G → ToR-A (or 4×100G aggregated)
- eth1: 400G → ToR-B (or 4×100G aggregated)
- Loopback: 10.255.11.11/32 advertised via both (same!)
Key: The architecture is bandwidth-agnostic. Same design principles apply at any speed.
8.4.3 Future: 2 × 800G (1.6T aggregate)
Same principles:
- Dual ToRs in unified L3 fabric
- Pure L3 BGP/ECMP
- Loopback-based identity
- ECMP automatic load balancing
Scalability: The architecture scales seamlessly from 100G to 800G+ per server.
8.5 Switch Evolution
8.5.1 Current ToR Options
- 100G × 64 ports: Sufficient for current server density
- 200G × 32 ports: Higher bandwidth per port
8.5.2 Future ToR Options
- 400G × 32 ports: For 400G servers
- 800G × 16 ports: For 800G servers
8.5.3 Spine Evolution
- Current: 400G spine switches
- Future: 800G or 1.6T spine switches
Key: Spine capacity must scale with aggregate ToR bandwidth.
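The "spine must scale" rule can be made concrete as an oversubscription ratio. The port counts below are hypothetical, not from this design:

```python
def oversubscription(server_ports: int, server_gbps: int,
                     uplink_ports: int, uplink_gbps: int) -> float:
    """Ratio of ToR downlink (server-facing) to uplink (spine-facing) capacity."""
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical 64x100G ToR: 48 ports to servers, 16 uplinks to spines
print(oversubscription(48, 100, 16, 100))  # 3.0 -> 3:1 oversubscribed
# Upgrading the same 16 uplinks to 400G more than restores 1:1
print(oversubscription(48, 100, 16, 400))  # 0.75 -> non-blocking
```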