2 L3 & Routing Trends
Note: This chapter expands on the “Why L3+?” section from the Overview. The overview provides motivation and intuition; here we dive into Moore’s Law, hardware offloading math, and the technical reasons why L3+ routing is the only viable path for modern datacenters at scale.
Created Dec 24 2025
2.1 Big Picture
Across all of computing, as systems grow, interconnects shift from shared media and broadcast to packet networks formed by a mesh of point-to-point links, with routing at intelligent junctions.
- On-chip: shared data bus ⇨ Network-on-Chip (NoC), a packet-switched network inside SoCs
- Motherboard: PCI parallel bus ⇨ PCIe serial lanes with switching between devices
- Network: L2 broadcast-based Ethernet ⇨ L3 point-to-point links and routed datacenter networks
It’s the same fundamental pattern repeating at different scales! As systems grow denser (more transistors per chip, more chips per board, more servers per datacenter):
- Broadcast doesn’t scale: a shared medium caps aggregate bandwidth.
- Connecting everything point-to-point (the other extreme) is impractical: it needs on the order of N² wires.
- The solution is the middle path: multi-hop interconnected links, like a road network, with a mesh of switches connecting them. We get point-to-point-like functionality plus the right level of link sharing, with intelligence at the junctions called routing.
(No broadcast, no full mesh of wires, but a mesh of switches.)
This principle was articulated by Dally & Towles in a foundational 2001 paper titled “Route Packets, Not Wires” for on-chip networks. See Appendix: Route Packets, Not Wires.
2.2 Moore’s Law Again!
2.2.1 High Bandwidth Between Dense Units
An exponentially increasing number of small, dense active elements in silicon needs high-bandwidth communication between them.
In earlier designs, elements of the same scale communicated by broadcast or time-sharing over a shared medium. But broadcast wastes capacity and no longer suffices. The other extreme, isolated point-to-point links between every pair of elements, maximizes bandwidth but the wire count explodes to N², which is impractical.
The only scalable option is a mesh of multi-hop point-to-point links with intelligent routing of traffic at the junctions, like a road network with junctions and traffic lights. We get point-to-point-like bandwidth combined with practical sharing of links, and the intelligence at the junctions is called routing. This is why L3 routing is the only way to scale bandwidth in dense spaces.
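To make the N² argument concrete, here is a rough back-of-the-envelope sketch comparing wire counts for a full mesh against a two-tier switched fabric. The leaf-spine parameters (32 servers per leaf, 4 spines) are illustrative assumptions, not figures from this design:

```python
import math

def full_mesh_links(n):
    # A dedicated point-to-point wire between every pair of nodes: N*(N-1)/2
    return n * (n - 1) // 2

def leaf_spine_links(servers, servers_per_leaf=32, spines=4):
    # Two-tier switched fabric (toy parameters): one link per server to its
    # leaf switch, plus one uplink from every leaf to every spine
    leaves = math.ceil(servers / servers_per_leaf)
    return servers + leaves * spines

for n in (100, 1000, 10000):
    print(f"{n:>6} nodes: full mesh {full_mesh_links(n):>10,} wires, "
          f"switched fabric {leaf_spine_links(n):>7,} links")
```

At 10,000 servers the full mesh needs roughly 50 million wires while the switched fabric needs on the order of 11,000 links, which is the whole argument in two numbers.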
2.2.2 Network Scaleout to utilize the CPU cores
Ever since the limits of physical reality shifted CPU manufacturer competition from clock speed to core count in the mid-2000s, the industry has fundamentally changed how we harness compute performance.
MICROPROCESSOR PERFORMANCE TRENDS (2000-2020)
Single-thread Performance (Clock Speed)
4GHz+ | ┌───────────────────────── ← PLATEAUED ~2005
| ┌─┘ (Hit physical limits)
3GHz | ┌─┘
2GHz | ┌─┘
1GHz |┌─┘
└────────────────────────────────────────────────► Time
2000 2005 2020
Core Count per CPU
64+ | ┌────
32 | ┌──────┘
16 | ┌──────┘
8 | ┌──────┘
4 | ┌──────┘
2 | ┌──────┘
1 |───┘
└────────────────────────────────────────────────► Time
2000 2005 2020
Transistor Count (Moore's Law)
100B+ | ┌────
10B | ┌──────┘
1B | ┌──────┘
100M | ┌──────┘
10M | ┌──────┘
1M | ┌──────┘
|───┘
└────────────────────────────────────────────────► Time
2000 2005 2020
═══════════════════════════════════════════════════════════════
KEY INSIGHT: After 2005, clock speed hit a wall. But Moore's law
continued - more transistors went into MORE CORES.
Performance now = PARALLELISM, not faster clocks.
This drives horizontal scaling everywhere.
═══════════════════════════════════════════════════════════════
Source: Canonical, “Data Centre Networking: What is OVN?”, referencing Karl Rupp, Microprocessor Trend Data, 2022
2.2.3 High Density needs Virtualization
The endpoint of the network is no longer a physical machine—it’s the hundreds of VMs and thousands of containers running on each physical machine, each with individual network service and policy requirements.
This explosion of endpoints is why we need software-defined networking and network virtualization:
What are overlay networks?
Think of it as “networks within networks”:
- Underlay Network (Physical): The real L3+ fabric - routers, switches, cables connecting physical servers
- Overlay Network (Virtual): Virtual networks for VMs/containers, created in software on top of the underlay
The overlay wraps virtual network packets (VM-to-VM traffic) inside real network packets (server-to-server UDP). This is tunneling, routing over routing:
- Outer packet: real server IPs, routed by the physical fabric (L3+ underlay)
- Inner packet: virtual VM IPs, routed by software (L3 overlay)
Example: VM1 on Server-A talks to VM2 on Server-B:
┌─────────────────────────────────────────────────────────┐
│ VM1 sends IP packet to VM2 │
│ Src: 10.10.1.5 (VM1) Dst: 10.10.1.6 (VM2) │
└─────────────────────────────────────────────────────────┘
↓
OVN/OVS encapsulates into UDP packet
↓
┌─────────────────────────────────────────────────────────┐
│ Outer: Src: Server-A, Dst: Server-B (UDP/GENEVE) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Inner: Src: VM1, Dst: VM2 │ │
│ │ (original packet) │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
↓
Physical L3+ fabric routes to Server-B
↓
OVN/OVS on Server-B unwraps and delivers to VM2
This two-layer approach is why the underlay MUST be simple, scalable L3+—it needs to efficiently route massive amounts of encapsulated traffic without caring about the virtual networks inside.
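The wrap-route-unwrap flow above can be sketched in a few lines. This is a conceptual model only; the `Packet` class, field names, and TEP addresses are illustrative assumptions, not OVN’s actual data structures:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Packet:
    src: str                          # source address
    dst: str                          # destination address
    payload: Union["Packet", bytes]   # inner packet, or raw application data

# Inner packet: VM addresses, meaningful only inside the overlay
inner = Packet(src="10.10.1.5", dst="10.10.1.6", payload=b"hello VM2")

# Encapsulation: wrap the inner packet inside an outer packet addressed
# by the servers' tunnel endpoint (TEP) IPs (hypothetical addresses)
outer = Packet(src="192.168.0.11", dst="192.168.0.12", payload=inner)

# The physical fabric routes on outer.src/outer.dst only; on arrival,
# the receiving hypervisor strips the outer header and delivers `inner`
delivered = outer.payload
print(delivered.src, "->", delivered.dst)
```

The key point the sketch captures: the fabric never inspects `inner`, which is exactly why the underlay can stay simple.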
2.2.4 Compute for Routing Is Cheap, Is Everywhere
Routing (or switching) is the intelligence in the control plane that decides the next hop (minimizing distance to the destination, maximizing capacity utilization of links, etc.), plus the intelligence in the management plane that automatically learns and maintains metadata like routing tables using protocols and algorithms such as BGP and ECMP.
Such intelligence is now abundant, available in all transmission equipment as a base capability:
- Switches include full BGP/OSPF stacks
- Servers run FRR for routing
- NICs include hardware-accelerated routing (SmartNICs)
- Every device becomes a mini-router
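As one concrete flavor of “servers run FRR”: a minimal frr.conf BGP stanza that makes a Linux host peer with its top-of-rack switch and announce a local prefix might look roughly like this (the ASNs, peer address, and prefix are hypothetical):

```
router bgp 65001
 neighbor 192.0.2.1 remote-as 65000
 address-family ipv4 unicast
  network 10.0.0.0/24
 exit-address-family
```

A few lines of configuration like this are all it takes to turn an ordinary server into a first-class routing participant in the fabric.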
The motherboard itself is becoming like a datacenter, with hardware-accelerated virtualized units and L3 networking everywhere.
2.2.5 More Scalability, Stability & Smaller Blast Radius
Switched networks achieve more stability: it is easier to build multiple paths and add intelligence for failover and recovery (redundancy), and the components are loosely coupled (smaller blast radius), which also makes them easier to scale and upgrade.
As silicon gets denser, with more cores, more accelerators, and more flash storage (more elements per SoC and more chips per motherboard), the motherboard itself will become like a datacenter, with hardware-accelerated virtualized units and L3 networking everywhere.
This is why hyperscalers (Google, Facebook, Microsoft) all use pure L3 fabrics with BGP and ECMP.
2.3 Why L3+ (Not Just L3)
Throughout this document, we use “L3” as shorthand, but our design is actually L3+ (L3/L4-aware).
Here’s why this matters: ECMP load balancing. Our fabric uses 5-tuple hashing for ECMP:
Hash(Src_IP, Dst_IP, Src_Port, Dst_Port, Protocol) → Path_Choice
This means:
- L3 only (IP-based hashing): all flows between two hosts take the same path (bad!)
- L3/L4 (IP + ports): each TCP connection or UDP flow spreads across multiple paths (good!)
GENEVE’s clever trick: OVN generates random UDP source ports per flow, providing the entropy needed for effective L4-based ECMP. Without L4 awareness, GENEVE traffic would all hash to the same path, wasting our carefully built multipath fabric.
So when we say “pure L3,” we really mean pure L3+ with L4 awareness. The underlay routes at L3, but makes intelligent decisions at L3/L4. See Packet Flows & ECMP for the detailed mechanics.
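The difference between L3-only and L3/L4 hashing can be sketched with a toy hash function. CRC32 stands in for the switch ASIC’s real hash; port 6081 is GENEVE’s registered UDP port, while the IPs, port range, and path count are illustrative:

```python
import zlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths=8):
    # Hash the 5-tuple and pick one of n_paths equal-cost next hops.
    # Real ASICs use their own hash functions; CRC32 is a stand-in.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_paths

# L3-only hashing (ports excluded, zeroed here): every flow between the
# same pair of tunnel endpoints collapses onto a single path
l3_paths = {ecmp_path("192.168.0.11", "192.168.0.12", 0, 0, 17)
            for _ in range(100)}

# L3/L4 hashing: GENEVE varies the outer UDP source port per inner flow,
# so flows between the same two servers spread across the paths
l4_paths = {ecmp_path("192.168.0.11", "192.168.0.12", sp, 6081, 17)
            for sp in range(49152, 49252)}

print(f"L3-only paths used: {len(l3_paths)}, L3/L4 paths used: {len(l4_paths)}")
```

The L3-only set always contains exactly one path, while varying the source port spreads the same host pair’s flows across multiple equal-cost paths, which is precisely the entropy trick described above.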
For the math on why hardware acceleration is mandatory at 40+ Gbps, see Hardware Acceleration.
2.4 How to build deep expertise in L3 techniques?
Last week, I casually started learning how OpenStack does network virtualization and was stumped by the complexity of (1) the network overlay, (2) the options available for the main underlay network, and (3) the options available for the Kubernetes network.
I used to think knowing networking was just about IP, TCP, UDP, NAT, DNAT, MAC, MTU, etc. But the following are the terms that opened up!
2.4.1 Key Networking Terms
2.4.1.1 Essential for L3+ & Overlay (focus on mastering these!)
- OVS - Virtual switch data plane
- OVN - Virtual network control plane
- GENEVE - Overlay tunneling protocol
- TEP - Tunnel endpoint IP
- FRR / BGP / eBGP / iBGP - Routing protocols
- ECMP - Multipath load balancing (& WCMP for weighted)
- BFD - Fast failure detection
- MPTCP - Multipath TCP
- SR-IOV - NIC virtualization
- TC-Flower - Linux traffic control offload
- VRRP - Virtual router HA
- VRF - Virtual routing tables
- Tomahawk chip - Broadcom switch ASIC
2.4.1.2 Know but avoid using these complex L2 constructs
- LACP, MLAG, bridge & bonding types (xor, alb etc), peer links / vPC
2.4.1.3 Nice to know (not needed for this design)
- VXLAN, EVPN, VTEP, EVPN-MH, Trident chip
2.5 Topics to Cover
- Building the L3+ underlay Network (with dumb L2)
- The GENEVE Overlay Network - OVN & Open vSwitch (L3 virtualization)
- OVN-Kubernetes reusing GENEVE overlay