11  Monitoring & Observability

11.1 Overview

Effective monitoring is critical for operating a production datacenter network. This chapter covers what to monitor, how to collect metrics, and suggested alerting strategies.

Philosophy: monitor at the right layer. The underlay (BGP/ECMP) needs simple but comprehensive monitoring; the overlay (OVN) has its own observability tooling (see 11.8).

11.2 What to Monitor

11.2.1 BGP Session Health

BGP is the control plane: if BGP is unhealthy, routing is broken.

Metric              Normal           Warning           Critical
BGP session state   Established      Active/Connect    Idle/Down
Peer count          Expected count   1 peer down       >1 peer down
Received prefixes   Expected range   ±10% change       Major change
Sent prefixes       Expected count   Mismatch          Zero

Key commands:

# On SONiC/FRR
vtysh -c "show ip bgp summary"
vtysh -c "show ip bgp neighbor <IP> received-routes"
vtysh -c "show ip bgp neighbor <IP> advertised-routes"

Sample output to parse:

Neighbor        V    AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
172.16.11.0     4 66111     12543     12498        0    0    0 01:23:45           12
172.16.11.2     4 66112     12501     12499        0    0    0 01:23:40           12
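The State/PfxRcd column is numeric only when a session is Established, which makes the table easy to check mechanically. A minimal sketch (the sample rows below are illustrative, not live output):

```shell
# Flag neighbors whose State/PfxRcd field is a state name rather than a
# prefix count (i.e. the session is not Established).
summary='172.16.11.0     4 66111     12543     12498        0    0    0 01:23:45           12
172.16.11.2     4 66112     12501     12499        0    0    0 01:23:40       Active'

bad_peers=$(echo "$summary" | awk '$NF !~ /^[0-9]+$/ {print $1, $NF}')
echo "$bad_peers"
```

In production you would feed it the output of `vtysh -c "show ip bgp summary"` with the header lines stripped.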

11.2.2 BFD Session Health

BFD provides sub-second failure detection.

Metric           Normal       Warning    Critical
BFD state        Up           -          Down
BFD flaps        0            1-2/hour   >2/hour
Detection time   Configured   -          Mismatched

Key commands:

vtysh -c "show bfd peer"
vtysh -c "show bfd peer <IP> json"
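The JSON form is the easiest to consume programmatically. A rough sketch that counts sessions not in the up state (the abbreviated sample mirrors FRR's "status" field; verify field names against your FRR version):

```shell
# Count BFD sessions whose "status" field is anything but "up".
bfd_json='[{"peer":"172.16.11.0","status":"up"},{"peer":"172.16.11.2","status":"down"}]'

down_bfd=$(echo "$bfd_json" | grep -o '"status":"[a-z]*"' | grep -vc '"up"')
echo "BFD sessions down: $down_bfd"
```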

11.2.3 Route Table Health

ECMP paths should be balanced and complete.

Metric            Normal           Warning   Critical
ECMP path count   Expected (2-8)   Reduced   Single path
Total routes      Expected range   ±5%       Major drop
Default route     Present          -         Missing

Key commands:

# On hosts
ip route show 10.255.0.0/16 | grep -c "nexthop"

# On switches
vtysh -c "show ip route summary"
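Comparing the live nexthop count against the expected count catches silent ECMP degradation. A sketch against a canned `ip route show` sample (addresses and interface names are illustrative):

```shell
# Sample multipath route as printed by "ip route show 10.255.0.0/16".
route_output='10.255.0.0/16 proto bgp metric 20
    nexthop via 172.16.11.1 dev eth0 weight 1
    nexthop via 172.16.11.3 dev eth1 weight 1'

expected_paths=2
actual_paths=$(echo "$route_output" | grep -c "nexthop")
if [ "$actual_paths" -lt "$expected_paths" ]; then
  echo "ECMP degraded: $actual_paths/$expected_paths paths"
fi
```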

11.2.4 Interface Health

Physical layer issues cause forwarding problems.

Metric          Normal   Warning   Critical
Link state      Up       -         Down
Input errors    0        >0        Increasing
Output errors   0        >0        Increasing
CRC errors      0        >0        Increasing
Discards        Near 0   >1%       >5%
Utilization     <80%     >80%      >95%

Key commands:

# On SONiC
show interfaces counters
show interfaces status

# On Linux hosts
ethtool -S eth0 | grep -E "error|drop|crc"
ip -s link show eth0
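Error counters are more useful summed and trended than eyeballed. A sketch that totals the error counters from `ethtool -S`-style output (the sample counters are made up):

```shell
# Abbreviated, illustrative "ethtool -S" output.
counters='rx_errors: 0
tx_errors: 0
rx_crc_errors: 3
rx_dropped: 12'

# Sum every counter with "error" in its name; any nonzero total is a warning.
error_total=$(echo "$counters" | awk -F': ' '/error/ {sum += $2} END {print sum+0}')
echo "error counters total: $error_total"
```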

11.2.5 Buffer/Queue Health

Micro-bursts can cause drops even on links whose average utilization looks low.

Metric        Normal   Warning      Critical
Queue depth   Low      Increasing   Near max
Tail drops    0        >0           Sustained
ECN marks     Low      Increasing   High

SONiC commands:

show queue counters
show priority-group watermark

11.2.6 System Health

Switch hardware health affects network reliability.

Metric               Normal   Warning    Critical
CPU utilization      <50%     >70%       >90%
Memory utilization   <70%     >80%       >95%
Temperature          Normal   High       Critical
PSU status           All OK   1 failed   Multiple failed
Fan status           All OK   1 failed   Multiple failed

SONiC commands:

show system-health summary
show platform psustatus
show platform fan
show platform temperature

11.3 Data Collection Methods

11.3.1 SNMP (Simple Network Management Protocol)

Traditional but widely supported:

# Example SNMP polling targets
- ifHCInOctets / ifHCOutOctets  # Interface bytes
- ifInErrors / ifOutErrors      # Interface errors
- bgpPeerState                  # BGP session state
- sysUpTime                     # System uptime

Pros: universal support, mature tooling.
Cons: polling-based (not real-time), limited scalability.

11.3.2 Streaming Telemetry (gNMI/gRPC)

Modern, push-based collection:

# Example gNMI subscription paths
- /interfaces/interface/state/counters
- /network-instances/network-instance/protocols/protocol/bgp
- /components/component/state/temperature

Pros: real-time, efficient, structured data.
Cons: requires modern switches, more complex setup.

11.3.3 SONiC-Specific Collection

SONiC exposes data via Redis and REST:

# Direct Redis access
redis-cli -n 2 hgetall "INTERFACE_TABLE:Ethernet0"

# REST API (if enabled)
curl http://localhost:8080/api/v1/interface/Ethernet0

11.3.4 Host-Side Collection

For FRR on hosts:

# Prometheus metrics via frr_exporter
# https://github.com/tynany/frr_exporter

# Manual collection
vtysh -c "show ip bgp summary json" | jq .
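The JSON output can also be reduced to a single gauge without jq. A rough sketch that counts non-Established peers (the sample JSON is heavily abbreviated; field names follow FRR's summary output but should be verified against your version):

```shell
# Abbreviated, illustrative "show ip bgp summary json" output.
bgp_json='{"ipv4Unicast":{"peers":{"172.16.11.0":{"state":"Established"},"172.16.11.2":{"state":"Active"}}}}'

down_peers=$(echo "$bgp_json" | grep -o '"state":"[A-Za-z]*"' | grep -vc '"Established"')
echo "peers not established: $down_peers"
```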

11.4 Alerting Strategy

11.4.1 Tier 1: Page Immediately (Critical)

Alert                Condition                        Impact
Spine down           All BGP sessions to spine lost   Reduced fabric capacity
ToR down             All BGP sessions to ToR lost     Rack isolated
Multiple BFD flaps   >3 flaps in 5 min                Unstable path
ECMP degraded        <50% of expected paths           Reduced resilience

11.4.2 Tier 2: Alert (Warning)

Alert                  Condition          Impact
Single BGP peer down   One session lost   Reduced redundancy
Interface errors       >100 errors/min    Potential issues
High utilization       >80% sustained     Approaching capacity
Route count change     >10% change        Possible misconfiguration

11.4.3 Tier 3: Informational

Alert                     Condition           Notes
BGP session established   New peer up         Normal operation
Maintenance mode          Graceful shutdown   Expected
Config change             Config applied      Audit trail

11.5 Sample Dashboards

11.5.1 Fabric Overview Dashboard

┌─────────────────────────────────────────────────────────────┐
│                    FABRIC HEALTH OVERVIEW                    │
├───────────────┬───────────────┬───────────────┬─────────────┤
│ Spines: 2/2 ✓ │ ToRs: 12/12 ✓ │ Hosts: 48/48 ✓│ BGP: 100% ✓ │
├───────────────┴───────────────┴───────────────┴─────────────┤
│                                                              │
│   [Spine-1] ────────────────────────────── [Spine-2]        │
│      │ ╲                              ╱ │                    │
│      │  ╲                            ╱  │                    │
│   ┌──┴───┴──┐  ┌──────┐  ┌──────┐  ┌┴───┴──┐               │
│   │ Rack 1  │  │Rack 2│  │Rack 3│  │Rack 4 │  ...          │
│   │ ToR-A/B │  │ToR-AB│  │ToR-AB│  │ToR-A/B│               │
│   └─────────┘  └──────┘  └──────┘  └───────┘               │
│                                                              │
├─────────────────────────────────────────────────────────────┤
│ ECMP Paths: avg 7.8/8 │ BFD Flaps: 0 │ Drops: 0.001%       │
└─────────────────────────────────────────────────────────────┘

11.5.2 Per-Device Dashboard

┌─────────────────────────────────────────────────────────────┐
│                    ToR-A1 (Rack 1)                          │
├─────────────────────────────────────────────────────────────┤
│ BGP Peers:                                                   │
│   ├─ Spine-1 (172.20.1.0): Established, 48 prefixes        │
│   ├─ Spine-2 (172.20.1.2): Established, 48 prefixes        │
│   ├─ Host-11 (172.16.11.0): Established, 1 prefix          │
│   ├─ Host-12 (172.16.11.2): Established, 1 prefix          │
│   └─ ... (10 more hosts)                                    │
├─────────────────────────────────────────────────────────────┤
│ Interface Utilization:                                       │
│   Eth1/1 (Spine-1): ████████░░ 78%                         │
│   Eth1/2 (Spine-2): ███████░░░ 65%                         │
│   Eth1/3 (Host-11): ██░░░░░░░░ 15%                         │
├─────────────────────────────────────────────────────────────┤
│ System: CPU 23% │ Mem 45% │ Temp 42°C │ PSU: 2/2 │ Fan: OK │
└─────────────────────────────────────────────────────────────┘

11.6 Monitoring Stack Options

11.6.1 Option 1: Prometheus + Grafana

[Switches] ──SNMP──> [SNMP Exporter] ──> [Prometheus] ──> [Grafana]
[Hosts]    ──FRR───> [FRR Exporter]  ──>      │
                                              │
                                    [AlertManager] ──> PagerDuty/Slack

Components:
- Prometheus: time-series database
- SNMP Exporter: converts SNMP data to Prometheus metrics
- FRR Exporter: collects FRR/BGP metrics
- Grafana: visualization
- AlertManager: alert routing
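As a sketch of how the tiers in 11.4 map onto this stack, a Prometheus alerting rule might look like the following. The metric name `frr_bgp_peer_state` and the Established value are assumptions; check what your exporter actually exports before deploying.

```yaml
groups:
  - name: fabric-underlay
    rules:
      - alert: BGPPeerDown            # Tier 2: one session lost
        # frr_bgp_peer_state is an assumed metric name; 6 = Established
        # in the BGP FSM enum. Verify both against your frr_exporter.
        expr: frr_bgp_peer_state != 6
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "BGP peer {{ $labels.peer }} down on {{ $labels.instance }}"
```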

11.6.2 Option 2: InfluxDB + Telegraf + Grafana

[Switches] ──SNMP──────> [Telegraf] ──> [InfluxDB] ──> [Grafana]
[Switches] ──gNMI──────>     │
[Hosts]    ──exec/proc──>    │

Components:
- Telegraf: collection agent (supports SNMP, gNMI, exec)
- InfluxDB: time-series database
- Grafana: visualization
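A minimal Telegraf SNMP input sketch for this stack. The agent address and community string are placeholders; the OID shown is the standard sysUpTime.

```toml
[[inputs.snmp]]
  agents = ["udp://tor-a1.example:161"]   # placeholder switch address
  version = 2
  community = "public"                    # placeholder community string

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"             # sysUpTime

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]        # placeholder InfluxDB endpoint
```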

11.6.3 Option 3: Vendor-Specific (SONiC)

SONiC includes built-in telemetry:

# Enable telemetry container
sudo config feature state telemetry enabled

# Configure streaming
gnmi_cli -address localhost:8080 -subscribe /interfaces

11.7 Best Practices

11.7.1 Monitor at Multiple Layers

┌─────────────────────────────────────────────────────────────┐
│ Layer 7: Application metrics (OVN, OpenStack)               │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: Flow metrics (GENEVE tunnel health)                │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Routing metrics (BGP, ECMP, BFD)        ← FOCUS   │
├─────────────────────────────────────────────────────────────┤
│ Layer 1-2: Interface metrics (errors, utilization)         │
├─────────────────────────────────────────────────────────────┤
│ Physical: Hardware health (temp, PSU, fan)                  │
└─────────────────────────────────────────────────────────────┘

11.7.2 Establish Baselines

Before alerting, understand what normal looks like:
- What's the typical BGP convergence time?
- What's normal interface utilization?
- How many ECMP paths should exist?

11.7.3 Correlate Events

When issues occur, correlate signals:
- BGP state change + interface down = physical failure
- Multiple BGP flaps + no interface errors = control-plane issue
- High drops + high utilization = congestion

11.7.4 Automate Remediation

For known issues, consider automated responses:
- BGP peer down → check the interface, attempt a session clear
- High memory → log and alert, prepare for failover
- High temperature → alert, check environmentals

11.8 Integration with OVN Monitoring

The underlay and overlay should be monitored together:

Layer      What to Check                   Why
Underlay   BGP/ECMP paths to remote TEPs   Connectivity foundation
Overlay    OVN tunnel status               Actual VM connectivity
Combined   End-to-end VM ping              Full stack validation

OVN monitoring commands:

# Check tunnel status
ovs-vsctl show | grep -A 2 "Bridge br-int"

# Check OVN connectivity
ovn-sbctl show | grep Chassis
ovn-trace --summary <datapath> 'inport=="<port>" && eth.src==<mac>'
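A quick overlay sanity check is simply counting registered chassis against the expected host count. A sketch against canned `ovn-sbctl show`-style output (hostnames are illustrative):

```shell
# Abbreviated, illustrative "ovn-sbctl show" output.
chassis_sample='Chassis "host-11"
    hostname: host-11
Chassis "host-12"
    hostname: host-12'

expected_chassis=2
chassis_count=$(echo "$chassis_sample" | grep -c '^Chassis')
echo "registered chassis: $chassis_count/$expected_chassis"
```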

See Operations & Maintenance for detailed troubleshooting procedures.