8  Hardware Acceleration and Future Evolution

8.1 Why Hardware Acceleration is Mandatory

At modern network speeds (40+ Gbps), software packet processing hits fundamental CPU limitations.

8.1.1 The 40 Gbps Per-Core Ceiling

Time available to process a standard 1500-byte Ethernet frame:

\[t_{processing} = \frac{S}{R} = \frac{1500 \times 8}{R} = \frac{12000}{R}\]

Where:

  • \(S\) = frame size in bits (1500 bytes × 8 = 12,000 bits)
  • \(R\) = line rate in bits per second

Line Rate    Time per Frame    Reality Check
1 Gbps       12 μs             Software can handle this
10 Gbps      1.2 μs            Software struggles
40 Gbps      300 ns            Impossible without hardware
100 Gbps     120 ns            Hardware-only domain
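The table values follow directly from the formula: since 1 Gbps carries exactly one bit per nanosecond, the per-frame budget in nanoseconds is simply 12,000 divided by the rate in Gbps. A quick sanity check:

```shell
# Per-frame budget: t_ns = 12000 bits / (R in Gbps), since 1 Gbps = 1 bit/ns
for rate_gbps in 1 10 40 100; do
  echo "${rate_gbps} Gbps -> $((12000 / rate_gbps)) ns per 1500-byte frame"
done
```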

8.1.2 What Happens in 300 Nanoseconds?

At 40 Gbps, you have 300 nanoseconds to:

  1. Receive frame from NIC
  2. Parse headers (Ethernet, IP, UDP, GENEVE, inner Ethernet, inner IP)
  3. Consult routing/flow tables
  4. Apply security groups/ACLs
  5. Encapsulate/decapsulate tunnels
  6. Forward to destination

In practice, software cannot do this at scale. A modern 4 GHz core has only ~1,200 clock cycles in 300 ns: barely enough to parse headers, let alone make forwarding decisions, and a single cache miss to DRAM (~100 ns) consumes a third of that budget.

Recent experiments (PaperMill, DPDK optimizations) have demonstrated >100 Gbps per core, but they are not yet production-practical and depend on specialized kernel-bypass techniques.

8.1.3 Why This Matters for OVN/GENEVE

With a GENEVE overlay, every packet requires:

  • Outer header processing (physical network)
  • GENEVE decapsulation (remove outer headers)
  • Inner header processing (virtual network)
  • Security group lookup (OVN rules)
  • Re-encapsulation (for forwarding)

This is why hardware offload is mandatory, not optional, for 100G NICs.

8.2 Current Hardware Configuration

8.2.1 Server NICs

Each server has 2 × 100G NICs (ConnectX-6 DX):

  • eth0: 100G connection to ToR-A (Network-A)
  • eth1: 100G connection to ToR-B (Network-B)
  • Total aggregate: 200G per server via pure L3 ECMP

8.2.2 ConnectX-6 DX Hardware Acceleration

Mellanox/NVIDIA ConnectX-6 DX provides hardware acceleration for OVN/OVS:

8.2.2.1 GENEVE Offload

  • Hardware GENEVE encapsulation/decapsulation: Offloads GENEVE processing from CPU
  • Flow steering: Hardware-based packet classification and forwarding
  • OVS hardware offload: Direct integration with OVS for accelerated forwarding

8.2.2.2 Benefits

  • Reduced CPU overhead: GENEVE processing handled by NIC
  • Higher throughput: Hardware acceleration provides line-rate performance
  • Lower latency: Hardware forwarding faster than software
  • Better scalability: More CPU available for workloads

8.2.2.3 Configuration

# Enable OVS hardware offload on ConnectX-6 DX
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

# Verify offload status
ovs-vsctl get Open_vSwitch . other_config:hw-offload
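On ConnectX-class NICs, setting hw-offload=true alone is typically not sufficient: the NIC's embedded switch must first be in switchdev mode, and TC offload must be enabled on the uplink. A sketch of the usual prerequisites and a way to confirm flows are actually offloaded (the PCI address and interface name are placeholders for this host):

```shell
# Prerequisite: put the NIC eSwitch into switchdev mode
# (0000:03:00.0 is an assumed PCI address; check with `lspci`)
devlink dev eswitch set pci/0000:03:00.0 mode switchdev

# Enable TC-based flow offload on the uplink (eth0 assumed)
ethtool -K eth0 hw-tc-offload on

# After restarting OVS with hw-offload=true, confirm flows are in hardware
ovs-appctl dpctl/dump-flows type=offloaded
```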

Reference: NVIDIA ConnectX-6 DX Documentation

8.2.3 Switch Hardware

8.2.3.1 ToR Switches

  • Option 1: 100G switches with 64 ports (e.g., Tomahawk-based)
  • Option 2: 200G switches with 32 ports (e.g., Tomahawk-based)
  • Chip: Broadcom Tomahawk ASIC
  • Function: Pure L3 routing with BGP/ECMP

8.2.3.2 Spine Switches

  • 400G switches (e.g., Tomahawk-based)
  • High port density for leaf-spine connectivity
  • Function: Pure L3 transit with ECMP

Key: All switches are L3 routers, not L2 switches. Tomahawk ASICs provide excellent L3 forwarding performance.

8.3 Future Evolution: DPUs (Data Processing Units)

8.3.1 What are DPUs?

DPUs (Data Processing Units) are specialized processors that offload networking, storage, and security functions from the host CPU. Examples include:

  • NVIDIA BlueField DPU
  • AMD Pensando
  • Intel IPU (Infrastructure Processing Unit)

8.3.2 How DPUs Fit Our Architecture

8.3.2.1 Current Architecture (Host-Based TEPs)

┌─────────────────────────────────────┐
│  Host CPU                           │
│  ┌──────────┐  ┌──────────┐         │
│  │   OVN    │  │   OVS    │         │
│  │ Control  │  │ Dataplane│         │
│  └────┬─────┘  └────┬─────┘         │
│       │             │               │
│  ┌────▼─────────────▼──────┐        │
│  │  ConnectX-6 DX (NIC)    │        │
│  │  Hardware GENEVE Offload│        │
│  └─────────────────────────┘        │
└─────────────────────────────────────┘

8.3.2.2 Future Architecture (DPU-Based TEPs)

┌─────────────────────────────────────┐
│  Host CPU (Workloads Only)          │
│  ┌──────────┐                       │
│  │   VMs    │                       │
│  │  Pods    │                       │
│  └────┬─────┘                       │
│       │                             │
│  ┌────▼──────────────────────────┐  │
│  │  BlueField DPU                │  │
│  │  ┌──────────┐  ┌──────────┐   │  │
│  │  │   OVN    │  │   OVS    │   │  │
│  │  │ Control  │  │ Dataplane│   │  │
│  │  └──────────┘  └──────────┘   │  │
│  │  Hardware GENEVE Offload      │  │
│  │  Hardware BGP/ECMP            │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

8.3.3 Benefits of DPU Evolution

  1. Host CPU Offload: OVN/OVS processing moves to DPU, freeing host CPU for workloads
  2. Hardware Acceleration: DPUs provide hardware acceleration for:
    • GENEVE encapsulation/decapsulation
    • BGP routing
    • ECMP load balancing
    • Security policies (ACLs, firewalling)
  3. Consistent Architecture: TEPs still at “host” (now DPU), fabric still pure L3
  4. Better Performance: Dedicated processing for networking functions
  5. Isolation: Network processing isolated from workload CPU

8.3.4 Migration Path

When migrating to DPUs:

  1. TEP moves to DPU: DPU becomes the TEP endpoint
  2. Fabric unchanged: Still pure L3 BGP/ECMP
  3. OVN control plane: Runs on DPU, connects to same OVN databases
  4. BGP on DPU: DPU advertises host loopback via BGP
  5. Zero fabric changes: Underlay architecture remains identical

Key Insight: DPU evolution is transparent to the fabric. The underlay remains pure L3 BGP/ECMP regardless of where TEPs run.
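Step 4 above could be sketched with FRR running on the DPU. Everything here is an illustrative assumption rather than this document's configuration: the ASN, the BlueField uplink names (p0/p1), and the use of BGP-unnumbered sessions toward the ToRs:

```shell
# Hypothetical FRR configuration on the BlueField DPU:
# advertise the host loopback to both ToRs over unnumbered BGP uplinks.
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65011' \
  -c ' neighbor p0 interface remote-as external' \
  -c ' neighbor p1 interface remote-as external' \
  -c ' address-family ipv4 unicast' \
  -c '  network 10.255.11.11/32'
```

Because both sessions advertise the same /32, the fabric sees the host exactly as it does today; only the device originating the route has changed.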

8.4 Future Evolution: Higher Bandwidth Servers

8.4.1 Current: 2 × 100G (200G aggregate)

8.4.2 Future: 2 × 400G (800G aggregate)

8.4.2.1 Architecture Extension

No changes needed to fabric architecture:

  1. Same topology: Dual ToRs in unified L3 Clos fabric
  2. Same routing: Pure L3 BGP/ECMP
  3. Same principles: Loopback advertised via both NICs
  4. ECMP scales: Automatically handles higher bandwidth

8.4.2.2 What Changes

  • NIC speeds: 100G → 400G per NIC
  • Switch ports: ToR switches need 400G ports (or aggregate multiple 100G)
  • Link speeds: Point-to-point links become 400G
  • ECMP behavior: Same, just more bandwidth per path

8.4.2.3 Example Evolution

Current (2 × 100G):

  • eth0: 100G → ToR-A
  • eth1: 100G → ToR-B
  • Loopback: 10.255.11.11/32 advertised via both

Future (2 × 400G):

  • eth0: 400G → ToR-A (or 4×100G aggregated)
  • eth1: 400G → ToR-B (or 4×100G aggregated)
  • Loopback: 10.255.11.11/32 advertised via both (same!)

Key: The architecture is bandwidth-agnostic. Same design principles apply at any speed.
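The bandwidth-agnostic claim can be checked the same way at any NIC speed: on an upstream router running Linux routing (e.g., a ToR or spine with FRR), the server loopback should resolve to a single multipath route. A sketch (interface names and the exact output format vary by platform):

```shell
# Hypothetical check on an upstream router: with BGP multipath enabled,
# the server loopback appears as one ECMP route with one 'nexthop ... weight 1'
# entry per path, regardless of whether the links are 100G or 400G.
ip route show 10.255.11.11/32
```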

8.4.3 Future: 2 × 800G (1.6T aggregate)

Same principles:

  • Dual ToRs in unified L3 fabric
  • Pure L3 BGP/ECMP
  • Loopback-based identity
  • ECMP automatic load balancing

Scalability: The architecture scales seamlessly from 100G to 800G+ per server.

8.5 Switch Evolution

8.5.1 Current ToR Options

  • 100G × 64 ports: Sufficient for current server density
  • 200G × 32 ports: Higher bandwidth per port

8.5.2 Future ToR Options

  • 400G × 32 ports: For 400G servers
  • 800G × 16 ports: For 800G servers

8.5.3 Spine Evolution

  • Current: 400G spine switches
  • Future: 800G or 1.6T spine switches

Key: Spine capacity must scale with aggregate ToR bandwidth.

8.6 References