8  Hardware Acceleration and Future Evolution

8.1 Why Hardware Acceleration is Mandatory

At modern network speeds (40+ Gbps), software packet processing hits fundamental CPU limitations.

8.1.1 The 40 Gbps Per-Core Ceiling

Time available to process a standard 1500-byte Ethernet frame:

\[t_{processing} = \frac{S}{R} = \frac{1500 \times 8}{R} = \frac{12000}{R}\]

Where:

  • \(S\) = frame size in bits (1500 bytes × 8 = 12,000 bits)
  • \(R\) = line rate in bits per second

Line Rate    Time per Frame    Reality Check
1 Gbps       12 μs             Software can handle this
10 Gbps      1.2 μs            Software struggles
40 Gbps      300 ns            Impossible without hardware
100 Gbps     120 ns            Hardware-only domain
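The table values follow directly from the formula: since 1 Gbps carries exactly one bit per nanosecond, the per-frame budget in nanoseconds is simply 12,000 divided by the rate in Gbps. A quick sanity check:

```shell
# Per-frame budget: t_ns = 12000 bits / (R in Gbps), since 1 Gbps = 1 bit/ns
for rate_gbps in 1 10 40 100; do
  echo "${rate_gbps} Gbps -> $((12000 / rate_gbps)) ns per 1500-byte frame"
done
```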

8.1.2 What Happens in 300 Nanoseconds?

At 40 Gbps, you have 300 nanoseconds to:

  1. Receive frame from NIC
  2. Parse headers (Ethernet, IP, UDP, GENEVE, inner Ethernet, inner IP)
  3. Consult routing/flow tables
  4. Apply security groups/ACLs
  5. Encapsulate/decapsulate tunnels
  6. Forward to destination

In practice, software cannot do this at scale. A modern 4 GHz core has only ~1,200 clock cycles in 300 ns: barely enough to parse headers, let alone make forwarding decisions, and a single cache miss to DRAM (~100 ns) consumes a third of that budget.

Recent experiments (PaperMill, DPDK optimizations) have demonstrated >100 Gbps per core, but they are not yet production-practical and depend on specialized kernel-bypass techniques.

8.1.3 Why This Matters for OVN/GENEVE

With a GENEVE overlay, every packet requires:

  • Outer header processing (physical network)
  • GENEVE decapsulation (remove outer headers)
  • Inner header processing (virtual network)
  • Security group lookup (OVN rules)
  • Re-encapsulation (for forwarding)

This is why hardware offload is mandatory, not optional, for 100G NICs.

8.2 Current Hardware Configuration

8.2.1 Server NICs

Each server has 2 × 100G NICs (ConnectX-6 DX):

  • eth0: 100G connection to ToR-A (Network-A)
  • eth1: 100G connection to ToR-B (Network-B)
  • Total aggregate: 200G per server via pure L3 ECMP

8.2.2 ConnectX-6 DX Hardware Acceleration

Mellanox/NVIDIA ConnectX-6 DX provides hardware acceleration for OVN/OVS:

8.2.2.1 GENEVE Offload

  • Hardware GENEVE encapsulation/decapsulation: Offloads GENEVE processing from CPU
  • Flow steering: Hardware-based packet classification and forwarding
  • OVS hardware offload: Direct integration with OVS for accelerated forwarding

8.2.2.2 Benefits

  • Reduced CPU overhead: GENEVE processing handled by NIC
  • Higher throughput: Hardware acceleration provides line-rate performance
  • Lower latency: Hardware forwarding faster than software
  • Better scalability: More CPU available for workloads

8.2.2.3 Configuration

# Enable OVS hardware offload on ConnectX-6 DX
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

# Verify offload status
ovs-vsctl get Open_vSwitch . other_config:hw-offload
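On ConnectX-class NICs, setting hw-offload=true alone is typically not sufficient: the NIC's embedded switch must first be in switchdev mode, and TC offload must be enabled on the uplink. A sketch of the usual prerequisites and a way to confirm flows are actually offloaded (the PCI address and interface name are placeholders for this host):

```shell
# Prerequisite: put the NIC eSwitch into switchdev mode
# (0000:03:00.0 is an assumed PCI address; check with `lspci`)
devlink dev eswitch set pci/0000:03:00.0 mode switchdev

# Enable TC-based flow offload on the uplink (eth0 assumed)
ethtool -K eth0 hw-tc-offload on

# After restarting OVS with hw-offload=true, confirm flows are in hardware
ovs-appctl dpctl/dump-flows type=offloaded
```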

Reference: NVIDIA ConnectX-6 DX Documentation

8.2.3 Switch Hardware

8.2.3.1 ToR Switches

  • Option 1: 100G switches with 64 ports (e.g., Tomahawk-based)
  • Option 2: 200G switches with 32 ports (e.g., Tomahawk-based)
  • Chip: Broadcom Tomahawk ASIC
  • Function: Pure L3 routing with BGP/ECMP

8.2.3.2 Spine Switches

  • 400G switches (e.g., Tomahawk-based)
  • High port density for leaf-spine connectivity
  • Function: Pure L3 transit with ECMP

Key: All switches are L3 routers, not L2 switches. Tomahawk ASICs provide excellent L3 forwarding performance.

8.3 Future Evolution: DPUs (Data Processing Units)

8.3.1 What are DPUs?

DPUs (Data Processing Units) are specialized processors that offload networking, storage, and security functions from the host CPU. Examples include:

  • NVIDIA BlueField DPU
  • AMD Pensando
  • Intel IPU (Infrastructure Processing Unit)

8.3.2 How DPUs Fit Our Architecture

8.3.2.1 Current Architecture (Host-Based TEPs)

┌─────────────────────────────────────┐
│  Host CPU                           │
│  ┌──────────┐  ┌──────────┐         │
│  │   OVN    │  │   OVS    │         │
│  │ Control  │  │ Dataplane│         │
│  └────┬─────┘  └────┬─────┘         │
│       │             │               │
│  ┌────▼─────────────▼──────┐        │
│  │  ConnectX-6 DX (NIC)    │        │
│  │  Hardware GENEVE Offload│        │
│  └─────────────────────────┘        │
└─────────────────────────────────────┘

8.3.2.2 Future Architecture (DPU-Based TEPs)

┌─────────────────────────────────────┐
│  Host CPU (Workloads Only)          │
│  ┌──────────┐                       │
│  │   VMs    │                       │
│  │  Pods    │                       │
│  └────┬─────┘                       │
│       │                             │
│  ┌────▼──────────────────────────┐  │
│  │  BlueField DPU                │  │
│  │  ┌──────────┐  ┌──────────┐   │  │
│  │  │   OVN    │  │   OVS    │   │  │
│  │  │ Control  │  │ Dataplane│   │  │
│  │  └──────────┘  └──────────┘   │  │
│  │  Hardware GENEVE Offload      │  │
│  │  Hardware BGP/ECMP            │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

8.3.3 Benefits of DPU Evolution

  1. Host CPU Offload: OVN/OVS processing moves to DPU, freeing host CPU for workloads
  2. Hardware Acceleration: DPUs provide hardware acceleration for:
    • GENEVE encapsulation/decapsulation
    • BGP routing
    • ECMP load balancing
    • Security policies (ACLs, firewalling)
  3. Consistent Architecture: TEPs still at “host” (now DPU), fabric still pure L3
  4. Better Performance: Dedicated processing for networking functions
  5. Isolation: Network processing isolated from workload CPU

8.3.4 Migration Path

When migrating to DPUs:

  1. TEP moves to DPU: DPU becomes the TEP endpoint
  2. Fabric unchanged: Still pure L3 BGP/ECMP
  3. OVN control plane: Runs on DPU, connects to same OVN databases
  4. BGP on DPU: DPU advertises host loopback via BGP
  5. Zero fabric changes: Underlay architecture remains identical

Key Insight: DPU evolution is transparent to the fabric. The underlay remains pure L3 BGP/ECMP regardless of where TEPs run.
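Step 4 above could be sketched with FRR running on the DPU. Everything here is an illustrative assumption rather than this document's configuration: the ASN, the BlueField uplink names (p0/p1), and the use of BGP-unnumbered sessions toward the ToRs:

```shell
# Hypothetical FRR configuration on the BlueField DPU:
# advertise the host loopback to both ToRs over unnumbered BGP uplinks.
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65011' \
  -c ' neighbor p0 interface remote-as external' \
  -c ' neighbor p1 interface remote-as external' \
  -c ' address-family ipv4 unicast' \
  -c '  network 10.255.11.11/32'
```

Because both sessions advertise the same /32, the fabric sees the host exactly as it does today; only the device originating the route has changed.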

8.4 Future Evolution: Higher Bandwidth Servers

8.4.1 Current: 2 × 100G (200G aggregate)

8.4.2 Future: 2 × 400G (800G aggregate)

8.4.2.1 Architecture Extension

No changes needed to fabric architecture:

  1. Same topology: Dual ToRs in unified L3 Clos fabric
  2. Same routing: Pure L3 BGP/ECMP
  3. Same principles: Loopback advertised via both NICs
  4. ECMP scales: Automatically handles higher bandwidth

8.4.2.2 What Changes

  • NIC speeds: 100G → 400G per NIC
  • Switch ports: ToR switches need 400G ports (or aggregate multiple 100G)
  • Link speeds: Point-to-point links become 400G
  • ECMP behavior: Same, just more bandwidth per path

8.4.2.3 Example Evolution

Current (2 × 100G):

  • eth0: 100G → ToR-A
  • eth1: 100G → ToR-B
  • Loopback: 10.255.11.11/32 advertised via both

Future (2 × 400G):

  • eth0: 400G → ToR-A (or 4×100G aggregated)
  • eth1: 400G → ToR-B (or 4×100G aggregated)
  • Loopback: 10.255.11.11/32 advertised via both (same!)

Key: The architecture is bandwidth-agnostic. Same design principles apply at any speed.
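The bandwidth-agnostic claim can be checked the same way at any NIC speed: on an upstream router running Linux routing (e.g., a ToR or spine with FRR), the server loopback should resolve to a single multipath route. A sketch (interface names and the exact output format vary by platform):

```shell
# Hypothetical check on an upstream router: with BGP multipath enabled,
# the server loopback appears as one ECMP route with one 'nexthop ... weight 1'
# entry per path, regardless of whether the links are 100G or 400G.
ip route show 10.255.11.11/32
```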

8.4.3 Future: 2 × 800G (1.6T aggregate)

Same principles:

  • Dual ToRs in unified L3 fabric
  • Pure L3 BGP/ECMP
  • Loopback-based identity
  • ECMP automatic load balancing

Scalability: The architecture scales seamlessly from 100G to 800G+ per server.

8.5 Switch Evolution

8.5.1 Current ToR Options

  • 100G × 64 ports: Sufficient for current server density
  • 200G × 32 ports: Higher bandwidth per port

8.5.2 Future ToR Options

  • 400G × 32 ports: For 400G servers
  • 800G × 16 ports: For 800G servers

8.5.3 Spine Evolution

  • Current: 400G spine switches
  • Future: 800G or 1.6T spine switches

Key: Spine capacity must scale with aggregate ToR bandwidth.

8.6 References