9 Packet Flows and ECMP Load Balancing
9.1 Overview
This chapter provides a deep dive into how packets flow through our network, how 5-tuple hashing works at both underlay and overlay layers, and how load balancing is achieved end-to-end.
Note: For definitions of terms used in this chapter, see the Glossary.
9.2 The 5-Tuple
The 5-tuple consists of:
1. Source IP address
2. Destination IP address
3. Source port
4. Destination port
5. Protocol (TCP, UDP, etc.)
This 5-tuple is used for:
- Underlay ECMP: Distributing traffic across multiple physical paths
- Overlay load balancing: GENEVE UDP source port selection for better distribution
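For reference, these five fields can be read directly off live flows on any Linux host using ss from iproute2 (shown only as an illustration of the tuple's components):

ss -tuna    # TCP and UDP flows: protocol (Netid), local address:port, peer address:port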
9.3 Packet Flow: VM to VM (Step-by-Step)
9.3.1 Scenario
- VM-A on Server-1 (loopback: 10.0.1.11/32) wants to reach VM-B on Server-2 (loopback: 10.0.2.22/32)
- Both servers have 2 × 100G NICs connected to independent A/B fabrics
9.3.2 Step 1: VM Packet Generation
VM-A generates packet:
Src IP: VM-A IP (e.g., 192.168.1.10)
Dst IP: VM-B IP (e.g., 192.168.1.20)
Src Port: 54321 (random)
Dst Port: 80
Protocol: TCP
9.3.3 Step 2: OVN Encapsulation (GENEVE)
OVN/OVS encapsulates the packet in GENEVE:
Outer Header (GENEVE):
Src IP: 10.0.1.11 (Server-1 loopback / TEP)
Dst IP: 10.0.2.22 (Server-2 loopback / TEP)
Src Port: 45678 (unique per flow)
Dst Port: 6081 (GENEVE standard port)
Protocol: UDP
Inner Header (Original VM packet):
Src IP: 192.168.1.10 (VM-A)
Dst IP: 192.168.1.20 (VM-B)
Src Port: 54321
Dst Port: 80
Protocol: TCP
Key: OVS derives the GENEVE UDP source port from a hash of the inner packet headers, so it stays constant for a given flow but differs between flows. This provides excellent hash entropy for ECMP.
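One way to observe this on the wire, assuming eth0 is one of Server-1's underlay NICs, is to capture GENEVE traffic and compare outer source ports across flows:

tcpdump -ni eth0 'udp dst port 6081'
# Packets of the same VM-to-VM flow keep one outer UDP source port;
# different flows show different source ports.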
9.3.4 Step 3: Kernel Routing (Server-1)
Server-1’s kernel routing table has two equal-cost paths to 10.0.2.22/32:
10.0.2.22/32 via 172.16.1.1 dev eth0 (via ToR-A)
10.0.2.22/32 via 172.16.1.3 dev eth1 (via ToR-B)
ECMP Hash Calculation (kernel uses 5-tuple):
Hash Input:
Src IP: 10.0.1.11
Dst IP: 10.0.2.22
Src Port: 45678 (GENEVE UDP src port)
Dst Port: 6081 (GENEVE UDP dst port)
Protocol: UDP
Hash Result → Selects eth0 (→ToR-A) or eth1 (→ToR-B)
Result: Packet goes out via eth0 (to ToR-A) or eth1 (to ToR-B) based on hash.
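The kernel's choice for a given outer 5-tuple can be checked directly with ip route get (a sketch; it assumes net.ipv4.fib_multipath_hash_policy=1 so that L4 ports feed the hash, as described in Section 9.5.1):

ip route get 10.0.2.22 from 10.0.1.11 ipproto udp sport 45678 dport 6081
# Output names the selected next hop (172.16.1.1 dev eth0 or 172.16.1.3 dev eth1).
# Re-running with a different sport simulates another GENEVE flow and may pick the other path.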
9.3.5 Step 4: ToR Routing (ToR-A or ToR-B)
ToR receives packet and performs L3 lookup. Since destination host (10.0.2.22) is in Rack 2, ToR sees multiple paths:
ToR Routing Table (learned via BGP):
10.0.2.22/32 via 172.20.1.0 (Spine-1 path)
10.0.2.22/32 via 172.20.1.2 (Spine-2 path)
(and potentially more spine paths)
ECMP Hash Calculation (switch uses 5-tuple):
Hash Input:
Src IP: 10.0.1.11
Dst IP: 10.0.2.22
Src Port: 45678 (GENEVE UDP src port)
Dst Port: 6081 (GENEVE)
Protocol: UDP
Hash Result → Selects spine path
Result: Packet forwarded to Spine-1 or Spine-2 based on hash.
Key: In a unified fabric, this ToR can reach the destination via any spine, providing excellent path diversity.
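The multipath behaviour described above depends on the ToR's BGP configuration allowing ECMP. A minimal FRR sketch follows; the ASN is a placeholder, not the value used in this design:

router bgp 65001                        # placeholder ToR ASN
 bgp bestpath as-path multipath-relax   # treat equal-length AS paths from different neighbors as equal cost
 address-family ipv4 unicast
  maximum-paths 64                      # install up to 64 equal-cost next hops per prefix
 exit-address-family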
9.3.6 Step 5: Spine Routing
Spine performs L3 lookup. Since destination host is in Rack 2, spine sees paths via both ToRs in Rack 2:
Spine Routing Table (learned via BGP):
10.0.2.22/32 via 172.20.2.0 (ToR-A Rack2 path)
10.0.2.22/32 via 172.20.2.2 (ToR-B Rack2 path)
ECMP Hash: Same 5-tuple hash selects which ToR to use.
Key: In a unified fabric, the spine learns the same host /32 from both ToRs in the destination rack, creating even more path diversity!
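On any FRR-based node (spine or ToR) the resulting ECMP set can be verified from the CLI, for example:

vtysh -c "show ip route 10.0.2.22/32"   # should list one entry with multiple next hops (ECMP)
vtysh -c "show ip bgp 10.0.2.22/32"     # shows the BGP paths marked as multipath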
9.3.7 Step 6: Destination ToR to Server-2
Destination ToR (either ToR-A or ToR-B in Rack 2) routes to Server-2:
ToR Routing Table:
10.0.2.22/32 via 172.16.2.x (directly connected to Server-2)
- If ToR-A selected: Delivers via Server-2’s eth0
- If ToR-B selected: Delivers via Server-2’s eth1
Packet delivered to Server-2 via one of its NICs.
9.3.8 Step 7: OVN Decapsulation (Server-2)
Server-2’s OVS:
1. Receives the GENEVE packet
2. Decapsulates it (removes the GENEVE header)
3. Delivers the original VM packet to VM-B
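For troubleshooting, the tunnel and decapsulation state on Server-2 can be inspected with standard OVS tools (a sketch; the exact port names are whatever ovn-controller created):

ovs-vsctl show                # tunnel ports of type "geneve" with options:remote_ip appear under br-int
ovs-appctl dpctl/dump-flows   # datapath flows for decapsulated traffic match on tunnel(...) fields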
9.4 Why This Provides Excellent Load Balancing
9.4.1 1. Unique GENEVE Source Ports
Each VM-to-VM flow gets a unique GENEVE UDP source port:
- Flow 1: Src port 45678
- Flow 2: Src port 45679
- Flow 3: Src port 51234
- etc.
Result: Different 5-tuples → different hash results → different paths.
9.4.2 2. Multiple Hash Points with Maximum Path Diversity
In a unified fabric, load balancing happens at multiple points with excellent path diversity:
- Server egress: Hash across eth0/eth1 (2 NICs)
- ToR egress: Hash across all spine paths (2+ spines)
- Spine egress: Hash across both destination ToRs (ToR-A and ToR-B)
Total paths between hosts: 2 NICs × 2 spines × 2 destination ToRs = 8 paths minimum
Result: Traffic naturally distributes across all available paths for excellent bandwidth utilization!
9.4.3 3. Flow-Level Stability
- Same flow (same 5-tuple) → same path (no reordering)
- Different flows → different paths (good distribution)
9.5 ECMP Hash Algorithms
9.5.1 Kernel ECMP (Linux)
Linux kernel ECMP hashing:
- The default multipath hash policy uses only L3 fields (source and destination IP)
- Setting net.ipv4.fib_multipath_hash_policy=1 enables L4 (5-tuple) hashing, which this design relies on
- Hashing is per flow, so all packets of a flow follow one path and are not reordered
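A minimal sketch of the corresponding host configuration (the sysctl names are standard kernel knobs; the file name under /etc/sysctl.d/ is arbitrary):

sysctl -w net.ipv4.fib_multipath_hash_policy=1     # 1 = L4 (5-tuple) hash, 0 = L3-only (default)
sysctl -w net.ipv6.fib_multipath_hash_policy=1     # equivalent knob for IPv6
echo 'net.ipv4.fib_multipath_hash_policy = 1' > /etc/sysctl.d/90-ecmp.conf   # persist across reboots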
9.5.2 Switch ECMP (Tomahawk)
Broadcom Tomahawk ASIC uses:
- L3 hash: Source IP, Destination IP
- L4 hash: Source port, Destination port, Protocol
- Configurable: Hash fields can be adjusted
Result: The kernel and the switches hash on the same 5-tuple fields, so path selection behaves consistently end to end.
9.6 Load Balancing Analysis
9.6.1 Per-Server Load Balancing
2 × 100G NICs per server:
- Theoretical: 200G aggregate
- Practical: Depends on flow distribution
- ECMP: Automatically distributes across both NICs
Flow Distribution Example:
100 flows from Server-1:
- 48 flows → eth0 (Network-A) = ~48G
- 52 flows → eth1 (Network-B) = ~52G
Total: ~100G utilized across both NICs
Key: Perfect 50/50 split is rare. ECMP provides statistical distribution.
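The actual split can be observed from per-interface counters, for example (interface names assumed; sar comes from the sysstat package):

sar -n DEV 1 | grep -E 'eth0|eth1'   # per-second RX/TX rates on both NICs
ip -s -h link show dev eth0          # cumulative byte/packet counters per NIC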
9.6.2 Per-Fabric Load Balancing
Network-A and Network-B:
- Each fabric sees ~50% of traffic (statistically)
- During normal operation: 40-60% per fabric is typical
- During failover: One fabric handles 100% (must be sized accordingly)
9.6.3 Per-Spine Load Balancing
Multiple spines per fabric:
- Traffic hashes across all spine paths
- ECMP distributes based on 5-tuple
- Provides redundancy and bandwidth aggregation
9.7 BFD (Bidirectional Forwarding Detection)
9.7.1 What is BFD?
BFD is a fast failure detection protocol that works with BGP to quickly detect link failures.
9.7.2 Why BFD?
Without BFD:
- BGP keepalive interval: 60 seconds, hold time: 180 seconds (defaults)
- Failure detection: up to 180 seconds (hold-timer expiry)
- Too slow for production networks
With BFD:
- BFD interval: 100-300 ms (configurable)
- Failure detection: interval × multiplier (e.g., 100 ms × 3 = 300 ms), well under 1 second
- Fast enough for production
9.7.3 BFD Configuration
9.7.3.1 Server BFD (FRR)
# Enable BFD on BGP neighbors
router bgp 66111
 neighbor 172.16.1.1 bfd
 neighbor 172.16.1.3 bfd
# BFD timers: 100 ms TX/RX interval, detect multiplier 3 (≈300 ms detection time)
bfd
 peer 172.16.1.1
  receive-interval 100
  transmit-interval 100
  detect-multiplier 3
 peer 172.16.1.3
  receive-interval 100
  transmit-interval 100
  detect-multiplier 3
9.7.3.2 ToR BFD (FRR/SONiC)
# BFD for all BGP neighbors (172.16.1.0, 172.20.1.0, 172.20.1.2): define a
# profile with the shared timers, then reference it with
# "neighbor <address> bfd profile fast-detect" under router bgp
bfd
 profile fast-detect
  receive-interval 100
  transmit-interval 100
  detect-multiplier 3
9.7.4 BFD Failure Detection Flow
- Link fails (physical or logical)
- BFD detects within 100-300ms
- BFD notifies BGP immediately
- BGP withdraws route on failed path
- ECMP removes path from routing table
- Traffic shifts to remaining paths
- Total convergence: <1 second
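After a failure (or during failover testing), convergence can be confirmed from FRR, for example:

vtysh -c "show bfd peers brief"          # one line per BFD session with its current state
vtysh -c "show bgp summary"              # failed neighbors leave Established; the rest stay up
vtysh -c "show ip route 10.0.2.22/32"    # ECMP set should no longer include the withdrawn path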
9.7.5 BFD Benefits
- Fast failure detection: <1 second vs 60+ seconds
- Automatic path removal: Failed paths removed from ECMP
- No traffic blackholing: Fast convergence prevents packet loss
- Works with ECMP: Each path monitored independently
9.8 Load Balancing Summary
9.8.1 End-to-End Load Balancing
VM Flow → GENEVE (random UDP src port)
↓
Server ECMP (hash 5-tuple) → eth0 or eth1 (2 × 100G)
↓
ToR ECMP (hash 5-tuple) → Spine-1, Spine-2, ...
↓
Spine ECMP (hash 5-tuple) → ToR paths
↓
Destination Server
Result: Traffic automatically distributes across:
- Both server NICs (eth0/eth1): 2 × 100G = 200G aggregate
- Multiple spine paths (400G spines)
- All available fabric paths
9.8.2 Key Insight: Two-Level Hashing
- Overlay Level: GENEVE UDP source port provides hash entropy
  - Each flow gets a unique UDP src port
  - Creates different 5-tuples for different flows
  - Enables good distribution at the underlay
- Underlay Level: 5-tuple hash distributes across paths
  - Server: Hash across eth0/eth1
  - ToR: Hash across spine paths
  - Spine: Hash across ToR paths
Result: Excellent load balancing at every level.
9.8.3 Bandwidth Utilization
Per Server (2 × 100G):
- Normal: 40-60% per NIC (80-120G total)
- Peak: Up to 200G aggregate
- Failover: 100% on a single NIC (must plan for this)
Per Fabric:
- Normal: 40-60% utilization
- Failover: 100% capacity required
Key Planning Principle: Size each fabric for 100% load during failover.
9.9 References
- RFC 5880: Bidirectional Forwarding Detection (BFD)
- Linux Kernel ECMP: kernel ECMP configuration
- OVS Flow Hashing: OVS flow distribution