9  Packet Flows and ECMP Load Balancing

9.1 Overview

This chapter provides a deep dive into how packets flow through our network, how 5-tuple hashing works at both underlay and overlay layers, and how load balancing is achieved end-to-end.

Note: For definitions of terms used in this chapter, see the Glossary.

9.2 The 5-Tuple

The 5-tuple consists of:

  1. Source IP address
  2. Destination IP address
  3. Source port
  4. Destination port
  5. Protocol (TCP, UDP, etc.)

This 5-tuple is used for:

  • Underlay ECMP: distributing traffic across multiple physical paths
  • Overlay load balancing: GENEVE UDP source port selection for better distribution
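The selection logic can be sketched in a few lines of Python. This is illustrative only: real kernels and switch ASICs use their own (typically CRC-based, seeded) hash functions, but the principle of hashing the 5-tuple and mapping the result onto the set of equal-cost next hops is the same. The addresses and port values match the walkthrough in 9.3.

  import hashlib

  def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
      # Hash the 5-tuple and map it onto the list of equal-cost next hops.
      # SHA-256 is only a stand-in for the hardware hash functions.
      key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
      digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
      return next_hops[digest % len(next_hops)]

  paths = ["eth0 (via ToR-A)", "eth1 (via ToR-B)"]
  # Same 5-tuple -> always the same path; a different source port may pick another.
  print(pick_next_hop("10.0.1.11", "10.0.2.22", 45678, 6081, "udp", paths))
  print(pick_next_hop("10.0.1.11", "10.0.2.22", 45679, 6081, "udp", paths))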

9.3 Packet Flow: VM to VM (Step-by-Step)

9.3.1 Scenario

  • VM-A on Server-1 (loopback: 10.0.1.11/32) wants to reach VM-B on Server-2 (loopback: 10.0.2.22/32)
  • Both servers have 2 × 100G NICs connected to independent A/B fabrics

9.3.2 Step 1: VM Packet Generation

VM-A generates packet:
  Src IP: VM-A IP (e.g., 192.168.1.10)
  Dst IP: VM-B IP (e.g., 192.168.1.20)
  Src Port: 54321 (random)
  Dst Port: 80
  Protocol: TCP

9.3.3 Step 2: OVN Encapsulation (GENEVE)

OVN/OVS encapsulates the packet in GENEVE:

Outer Header (GENEVE):
  Src IP: 10.0.1.11 (Server-1 loopback / TEP)
  Dst IP: 10.0.2.22 (Server-2 loopback / TEP)
  Src Port: 45678 (random, unique per flow)
  Dst Port: 6081 (GENEVE standard port)
  Protocol: UDP
  
Inner Header (Original VM packet):
  Src IP: 192.168.1.10 (VM-A)
  Dst IP: 192.168.1.20 (VM-B)
  Src Port: 54321
  Dst Port: 80
  Protocol: TCP

Key: The GENEVE UDP source port is chosen per flow (typically derived from a hash of the inner headers), so every overlay flow presents a different outer 5-tuple. This provides excellent hash entropy for ECMP.
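As a sketch of how a tunnel endpoint can derive that per-flow source port (an assumption about the mechanism, not OVN-specific code): hashing the inner 5-tuple yields a port that is stable for one flow but differs between flows, and keeping it in the ephemeral range avoids well-known ports.

  import hashlib

  def geneve_src_port(inner_src_ip, inner_dst_ip, inner_sport, inner_dport, proto):
      # Derive the outer UDP source port from a hash of the inner 5-tuple:
      # stable per flow, different across flows, confined to 49152-65535.
      key = f"{inner_src_ip}|{inner_dst_ip}|{inner_sport}|{inner_dport}|{proto}".encode()
      h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
      return 49152 + (h % (65536 - 49152))

  print(geneve_src_port("192.168.1.10", "192.168.1.20", 54321, 80, "tcp"))
  print(geneve_src_port("192.168.1.10", "192.168.1.20", 54322, 80, "tcp"))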

9.3.4 Step 3: Kernel Routing (Server-1)

Server-1’s kernel routing table has two equal-cost paths to 10.0.2.22/32:

10.0.2.22/32 via 172.16.1.1 dev eth0 (via ToR-A)
10.0.2.22/32 via 172.16.1.3 dev eth1 (via ToR-B)

ECMP Hash Calculation (kernel 5-tuple hash; see 9.5.1 for the required hash-policy setting):

Hash Input:
  Src IP: 10.0.1.11
  Dst IP: 10.0.2.22
  Src Port: 45678 (GENEVE UDP src port)
  Dst Port: 6081 (GENEVE UDP dst port)
  Protocol: UDP

Hash Result → Selects eth0 (→ToR-A) or eth1 (→ToR-B)

Result: Packet goes out via eth0 (to ToR-A) or eth1 (to ToR-B) based on hash.

9.3.5 Step 4: ToR Routing (ToR-A or ToR-B)

ToR receives packet and performs L3 lookup. Since destination host (10.0.2.22) is in Rack 2, ToR sees multiple paths:

ToR Routing Table (learned via BGP):
  10.0.2.22/32 via 172.20.1.0 (Spine-1 path)
  10.0.2.22/32 via 172.20.1.2 (Spine-2 path)
  (and potentially more spine paths)

ECMP Hash Calculation (switch uses 5-tuple):

Hash Input:
  Src IP: 10.0.1.11
  Dst IP: 10.0.2.22
  Src Port: 45678 (GENEVE UDP src port)
  Dst Port: 6081 (GENEVE)
  Protocol: UDP

Hash Result → Selects spine path

Result: Packet forwarded to Spine-1 or Spine-2 based on hash.

Key: In a unified fabric, this ToR can reach the destination via any spine, providing excellent path diversity.

9.3.6 Step 5: Spine Routing

Spine performs L3 lookup. Since destination host is in Rack 2, spine sees paths via both ToRs in Rack 2:

Spine Routing Table (learned via BGP):
  10.0.2.22/32 via 172.20.2.0 (ToR-A Rack2 path)
  10.0.2.22/32 via 172.20.2.2 (ToR-B Rack2 path)

ECMP Hash: Same 5-tuple hash selects which ToR to use.

Key: In a unified fabric, each spine learns the same host /32 from both ToRs in the destination rack, creating even more path diversity!

9.3.7 Step 6: Destination ToR to Server-2

Destination ToR (either ToR-A or ToR-B in Rack 2) routes to Server-2:

ToR Routing Table:
  10.0.2.22/32 via 172.16.2.x (directly connected to Server-2)
  • If ToR-A selected: Delivers via Server-2’s eth0
  • If ToR-B selected: Delivers via Server-2’s eth1

Packet delivered to Server-2 via one of its NICs.

9.3.8 Step 7: OVN Decapsulation (Server-2)

Server-2’s OVS:

  1. Receives GENEVE packet
  2. Decapsulates (removes GENEVE header)
  3. Delivers original VM packet to VM-B
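Putting steps 2 through 6 together, the sketch below (illustrative only, reusing the toy hash from 9.2) shows how two TCP connections between the same pair of VMs can end up on different NICs, spines, and destination ToRs, because their inner 5-tuples produce different GENEVE source ports and therefore different outer 5-tuples.

  import hashlib

  def h(parts):
      # Toy stand-in for the kernel/ASIC hash functions used at each hop.
      key = "|".join(str(p) for p in parts).encode()
      return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

  def end_to_end_path(inner_flow):
      # Step 2: derive the per-flow GENEVE source port from the inner 5-tuple.
      outer_sport = 49152 + (h(inner_flow) % 16384)
      outer = ("10.0.1.11", "10.0.2.22", outer_sport, 6081, "udp")
      # Steps 3-5: every hop hashes the same outer 5-tuple onto its own set of
      # next hops; the per-hop "salt" models each device seeding its hash differently.
      nic      = ["eth0", "eth1"][h(outer + ("server",)) % 2]
      spine    = ["Spine-1", "Spine-2"][h(outer + ("tor",)) % 2]
      dest_tor = ["ToR-A Rack2", "ToR-B Rack2"][h(outer + ("spine",)) % 2]
      return outer_sport, nic, spine, dest_tor

  # Two TCP connections between the same VMs, differing only in inner source
  # port, may take entirely different NIC/spine/ToR combinations.
  print(end_to_end_path(("192.168.1.10", "192.168.1.20", 54321, 80, "tcp")))
  print(end_to_end_path(("192.168.1.10", "192.168.1.20", 54322, 80, "tcp")))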

9.4 Why This Provides Excellent Load Balancing

9.4.1 Unique GENEVE Source Ports

Each VM-to-VM flow gets a unique GENEVE UDP source port:

  • Flow 1: Src port 45678
  • Flow 2: Src port 45679
  • Flow 3: Src port 51234
  • etc.

Result: Different 5-tuples → different hash results → different paths.

9.4.2 Multiple Hash Points with Maximum Path Diversity

In a unified fabric, load balancing happens at multiple points with excellent path diversity:

  1. Server egress: Hash across eth0/eth1 (2 NICs)
  2. ToR egress: Hash across all spine paths (2+ spines)
  3. Spine egress: Hash across both destination ToRs (ToR-A and ToR-B)

Total paths between hosts: 2 NICs × 2 spines × 2 destination ToRs = 8 paths minimum

Result: Traffic naturally distributes across all available paths for excellent bandwidth utilization!
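The eight combinations behind this count can be enumerated directly; a tiny sketch using the names from the example topology above:

  from itertools import product

  server_nics = ["eth0", "eth1"]
  spines = ["Spine-1", "Spine-2"]
  dest_tors = ["ToR-A Rack2", "ToR-B Rack2"]

  # Every (NIC, spine, destination ToR) combination is a distinct end-to-end path.
  paths = list(product(server_nics, spines, dest_tors))
  print(len(paths))   # 8
  for p in paths:
      print(" -> ".join(p))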

9.4.3 Flow-Level Stability

  • Same flow (same 5-tuple) → same path (no reordering)
  • Different flows → different paths (good distribution); see the sketch below
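A short sketch (toy hash again) of why per-flow hashing avoids reordering: every packet of a flow carries the same 5-tuple, so every packet of that flow maps to the same path.

  import hashlib

  def path_for(five_tuple, paths):
      key = "|".join(str(f) for f in five_tuple).encode()
      return paths[int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(paths)]

  flow = ("10.0.1.11", "10.0.2.22", 45678, 6081, "udp")
  paths = ["eth0", "eth1"]

  # 1,000 packets of the same flow all map to one path, so ECMP cannot reorder them.
  assert len({path_for(flow, paths) for _ in range(1000)}) == 1
  print(path_for(flow, paths))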

9.5 ECMP Hash Algorithms

9.5.1 Kernel ECMP (Linux)

Linux kernel ECMP hashing:

  • Controlled by the net.ipv4.fib_multipath_hash_policy sysctl
  • The default policy (0) hashes on L3 fields only; policy 1 enables the L4 (5-tuple) hash assumed in this chapter
  • The hash is deterministic per flow, so each direction of a flow stays on a single path (no per-packet reordering)
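A minimal check of the current policy, assuming a Linux host that exposes this sysctl through procfs:

  from pathlib import Path

  # Read the current ECMP hash policy (0 = L3 src/dst IP hash, the default;
  # 1 = L4/5-tuple hash, which this design relies on for GENEVE entropy).
  policy = int(Path("/proc/sys/net/ipv4/fib_multipath_hash_policy").read_text())
  meaning = {0: "L3 hash (src/dst IP only)", 1: "L4 hash (full 5-tuple)"}
  print(policy, meaning.get(policy, "other/extended policy"))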

9.5.2 Switch ECMP (Tomahawk)

Broadcom Tomahawk ASIC uses:

  • L3 hash: Source IP, Destination IP
  • L4 hash: Source port, Destination port, Protocol
  • Configurable: hash fields can be adjusted

Result: The kernel and the switches hash on the same 5-tuple fields, so flows are distributed consistently end to end.

9.6 Load Balancing Analysis

9.6.1 Per-Server Load Balancing

2 × 100G NICs per server:

  • Theoretical: 200G aggregate
  • Practical: depends on flow distribution
  • ECMP: automatically distributes across both NICs

Flow Distribution Example:

100 flows from Server-1:
  - 48 flows → eth0 (Network-A) = ~48G
  - 52 flows → eth1 (Network-B) = ~52G
  Total: ~100G utilized across both NICs

Key: Perfect 50/50 split is rare. ECMP provides statistical distribution.
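The statistical nature of the split can be seen with a quick simulation (toy hash, synthetic per-flow GENEVE source ports; the roughly 1G-per-flow figure in the example above is an assumption):

  import hashlib
  import random
  from collections import Counter

  random.seed(7)
  counts = Counter()
  nics = ["eth0", "eth1"]

  for _ in range(100):
      sport = random.randint(49152, 65535)                 # per-flow GENEVE src port
      key = f"10.0.1.11|10.0.2.22|{sport}|6081|udp".encode()
      idx = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(nics)
      counts[nics[idx]] += 1

  print(counts)   # roughly, but rarely exactly, a 50/50 split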

9.6.2 Per-Fabric Load Balancing

Network-A and Network-B:

  • Each fabric sees ~50% of traffic (statistically)
  • During normal operation: 40-60% per fabric is typical
  • During failover: one fabric handles 100% (must be sized accordingly)

9.6.3 Per-Spine Load Balancing

Multiple spines per fabric:

  • Traffic hashes across all spine paths
  • ECMP distributes based on the 5-tuple
  • Provides redundancy and bandwidth aggregation

9.7 BFD (Bidirectional Forwarding Detection)

9.7.1 What is BFD?

BFD is a lightweight failure-detection protocol that runs alongside BGP to detect link and forwarding-path failures within milliseconds.

9.7.2 Why BFD?

Without BFD:

  • BGP keepalive interval: 60 seconds, hold time: 180 seconds (defaults)
  • Failure detection: 60-180 seconds (governed by the hold timer when a link fails silently)
  • Too slow for production networks

With BFD:

  • BFD Tx/Rx interval: 100-300ms (configurable)
  • Failure detection: interval × detect multiplier, well under 1 second (see the worked calculation below)
  • Fast enough for production
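A quick worked calculation, using the timer values from the configuration in 9.7.3:

  def bfd_detection_ms(rx_interval_ms: int, detect_multiplier: int) -> int:
      # BFD declares a session down after `detect_multiplier` consecutive
      # missed control packets, so worst-case detection is roughly
      # receive interval x multiplier.
      return rx_interval_ms * detect_multiplier

  print(bfd_detection_ms(100, 3))   # 300 ms, with the timers used in 9.7.3
  print(bfd_detection_ms(300, 3))   # 900 ms, still under one second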

9.7.3 BFD Configuration

9.7.3.1 Server BFD (FRR)

# Enable BFD on BGP neighbors
router bgp 66111
 neighbor 172.16.1.1 bfd
 neighbor 172.16.1.3 bfd

# BFD timers: 100ms Tx/Rx interval, detect multiplier 3 (~300ms detection)
bfd
 peer 172.16.1.1
  receive-interval 100
  transmit-interval 100
  detect-multiplier 3
 !
 # repeat for peer 172.16.1.3

9.7.3.2 ToR BFD (FRR/SONiC)

# BFD for all BGP neighbors (same 100ms timers and multiplier 3 as the servers)
bfd
 peer 172.16.1.0
  receive-interval 100
  transmit-interval 100
  detect-multiplier 3
 !
 # repeat for peers 172.20.1.0 and 172.20.1.2

9.7.4 BFD Failure Detection Flow

  1. Link fails (physical or logical)
  2. BFD detects within 100-300ms
  3. BFD notifies BGP immediately
  4. BGP withdraws route on failed path
  5. ECMP removes path from routing table
  6. Traffic shifts to remaining paths
  7. Total convergence: <1 second

9.7.5 BFD Benefits

  • Fast failure detection: <1 second vs 60+ seconds
  • Automatic path removal: Failed paths removed from ECMP
  • No traffic blackholing: Fast convergence prevents packet loss
  • Works with ECMP: Each path monitored independently

9.8 Load Balancing Summary

9.8.1 End-to-End Load Balancing

VM Flow → GENEVE (random UDP src port)
  ↓
Server ECMP (hash 5-tuple) → eth0 or eth1 (2 × 100G)
  ↓
ToR ECMP (hash 5-tuple) → Spine-1, Spine-2, ...
  ↓
Spine ECMP (hash 5-tuple) → ToR paths
  ↓
Destination Server

Result: Traffic automatically distributes across:

  • Both server NICs (eth0/eth1): 2 × 100G = 200G aggregate
  • Multiple spine paths (400G spines)
  • All available fabric paths

9.8.2 Key Insight: Two-Level Hashing

  1. Overlay Level: GENEVE UDP source port provides hash entropy
    • Each flow gets unique UDP src port
    • Creates different 5-tuples for different flows
    • Enables good distribution at underlay
  2. Underlay Level: 5-tuple hash distributes across paths
    • Server: Hash across eth0/eth1
    • ToR: Hash across spine paths
    • Spine: Hash across ToR paths

Result: Excellent load balancing at every level.

9.8.3 Bandwidth Utilization

Per Server (2 × 100G):

  • Normal: 40-60% per NIC (80-120G total)
  • Peak: up to 200G aggregate
  • Failover: 100% on a single NIC (must plan for this)

Per Fabric:

  • Normal: 40-60% utilization
  • Failover: 100% capacity required

Key Planning Principle: Size each fabric for 100% load during failover.

9.9 References