11 Monitoring & Observability
11.1 Overview
Effective monitoring is critical for operating a production datacenter network. This chapter covers what to monitor, how to collect metrics, and suggested alerting strategies.
Philosophy: Monitor at the right layer. The underlay (BGP/ECMP) needs simple but comprehensive monitoring. The overlay (OVN) has its own observability.
11.2 What to Monitor
11.2.1 BGP Session Health
BGP is the control plane: if BGP is unhealthy, routing is broken.
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| BGP session state | Established | Active/Connect | Idle/Down |
| Peer count | Expected count | 1 peer down | >1 peer down |
| Received prefixes | Expected range | ±10% change | Zero or large swing |
| Sent prefixes | Expected count | Mismatch | Zero |
Key commands:
```shell
# On SONiC/FRR
vtysh -c "show ip bgp summary"
vtysh -c "show ip bgp neighbor <IP> received-routes"
vtysh -c "show ip bgp neighbor <IP> advertised-routes"
```
Sample output to parse:
```text
Neighbor     V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down  State/PfxRcd
172.16.11.0  4 66111    12543    12498       0    0     0 01:23:45            12
172.16.11.2  4 66112    12501    12499       0    0     0 01:23:40            12
```
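The text table above is awkward to parse reliably; the JSON form (`show ip bgp summary json`) is easier to consume. A minimal poller sketch, assuming recent FRR field names (`ipv4Unicast`, `peers`, `state`, `pfxRcd`; verify against your FRR version):

```python
def bgp_peer_problems(summary_json: dict, expected_peers: int) -> list[str]:
    """Flag peers that are not Established, plus a peer-count shortfall.

    Field names (ipv4Unicast/peers/state/pfxRcd) assume recent FRR JSON
    output and may differ between versions.
    """
    problems = []
    peers = summary_json.get("ipv4Unicast", {}).get("peers", {})
    for ip, info in peers.items():
        state = info.get("state", "unknown")
        if state != "Established":
            problems.append(f"peer {ip} is {state}")
        elif info.get("pfxRcd", 0) == 0:
            problems.append(f"peer {ip} is Established but sent zero prefixes")
    if len(peers) < expected_peers:
        problems.append(f"only {len(peers)} of {expected_peers} peers configured")
    return problems
```

In production the input would come from `subprocess.check_output(["vtysh", "-c", "show ip bgp summary json"])` passed through `json.loads`.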
11.2.2 BFD Session Health
BFD provides sub-second failure detection.
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| BFD state | Up | - | Down |
| BFD flaps | 0 | 1-2/hour | >2/hour |
| Detection time | Configured | - | Mismatched |
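The flap thresholds in this table translate directly into a severity classifier. A minimal sketch; the state string would come from parsing `show bfd peer <IP> json` (exact field names vary by FRR version), and the flap count from a counter maintained by the poller:

```python
def bfd_severity(status: str, flaps_last_hour: int) -> str:
    """Map BFD state and flap rate onto the table's thresholds.

    A Down session is always critical; otherwise classify by flap rate
    (0 = ok, 1-2/hour = warning, >2/hour = critical).
    """
    if status.lower() != "up":
        return "critical"
    if flaps_last_hour == 0:
        return "ok"
    if flaps_last_hour <= 2:
        return "warning"
    return "critical"
```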
Key commands:
```shell
vtysh -c "show bfd peer"
vtysh -c "show bfd peer <IP> json"
```
11.2.3 Route Table Health
ECMP paths should be balanced and complete.
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| ECMP path count | Expected (2-8) | Reduced | Single path |
| Total routes | Expected range | ±5% | Major drop |
| Default route | Present | - | Missing |
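On hosts, the ECMP path count can be read from iproute2's JSON output instead of grepping. A sketch, assuming the `ip -j route show` layout in which multipath routes carry a `nexthops` list and single-path routes put the gateway at the top level:

```python
def ecmp_path_count(routes: list[dict], prefix: str) -> int:
    """Count nexthops for `prefix` in parsed `ip -j route show` output."""
    for route in routes:
        if route.get("dst") == prefix:
            # Multipath routes have a "nexthops" list; single-path
            # routes have one implicit nexthop at the top level.
            return len(route.get("nexthops", [])) or 1
    return 0

# In production:
#   raw = subprocess.check_output(["ip", "-j", "route", "show"])
#   paths = ecmp_path_count(json.loads(raw), "10.255.0.0/16")
```

Alert when the returned count drops below the expected path count for the prefix.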
Key commands:
```shell
# On hosts
ip route show 10.255.0.0/16 | grep -c "nexthop"
# On switches
vtysh -c "show ip route summary"
```
11.2.4 Interface Health
Physical layer issues cause forwarding problems.
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Link state | Up | - | Down |
| Input errors | 0 | >0 | Increasing |
| Output errors | 0 | >0 | Increasing |
| CRC errors | 0 | >0 | Increasing |
| Discards | Near 0 | >1% | >5% |
| Utilization | <80% | >80% | >95% |
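The Warning/Critical split above keys on whether counters are increasing, so the poller must diff successive samples rather than look at absolute values. A minimal sketch; counter names are whatever `ethtool -S` reports for your NIC:

```python
def counter_deltas(prev: dict, curr: dict) -> dict:
    """Return per-counter increases between two samples of error counters
    (e.g. parsed from `ethtool -S` or `ip -s link`).

    Counters that did not increase are omitted, since the table above
    alerts on increasing errors, not on nonzero totals.
    """
    return {name: curr[name] - prev.get(name, 0)
            for name in curr
            if curr[name] > prev.get(name, 0)}
```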
Key commands:
```shell
# On SONiC
show interfaces counters
show interfaces status
# On Linux hosts
ethtool -S eth0 | grep -E "error|drop|crc"
ip -s link show eth0
```
11.2.5 Buffer/Queue Health
Micro-bursts can cause drops even on non-congested links.
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Queue depth | Low | Increasing | Near max |
| Tail drops | 0 | >0 | Sustained |
| ECN marks | Low | Increasing | High |
SONiC commands:
```shell
show queue counters
show priority-group watermark
```
11.2.6 System Health
Switch hardware health affects network reliability.
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| CPU utilization | <50% | >70% | >90% |
| Memory utilization | <70% | >80% | >95% |
| Temperature | Normal | High | Critical |
| PSU status | All OK | 1 failed | Multiple failed |
| Fan status | All OK | 1 failed | Multiple failed |
SONiC commands:
```shell
show system-health summary
show platform psustatus
show platform fan
show platform temperature
```
11.3 Data Collection Methods
11.3.1 SNMP (Simple Network Management Protocol)
Traditional but widely supported:
```text
# Example SNMP polling targets
- ifHCInOctets / ifHCOutOctets   # Interface bytes
- ifInErrors / ifOutErrors       # Interface errors
- bgpPeerState                   # BGP session state
- sysUpTime                      # System uptime
```
Pros: universal support, mature tooling.
Cons: polling-based (not real-time), limited scalability.
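Turning two `ifHCInOctets` polls into a utilization figure is a common trap: the delta must be taken modulo the 64-bit counter width and scaled by the poll interval and link speed. A sketch:

```python
COUNTER64_MAX = 2 ** 64

def link_utilization(prev_octets: int, curr_octets: int,
                     interval_s: float, speed_bps: float) -> float:
    """Utilization fraction from two ifHCInOctets samples.

    Taking the delta modulo 2**64 handles a single counter wrap; with
    64-bit counters a wrap is effectively impossible on realistic links,
    but the guard is cheap.
    """
    delta = (curr_octets - prev_octets) % COUNTER64_MAX
    bits = delta * 8
    return bits / (interval_s * speed_bps)
```

For example, 125 MB received in one second on a 1 Gbit/s link is 100% utilization.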
11.3.2 Streaming Telemetry (gNMI/gRPC)
Modern, push-based collection:
```text
# Example gNMI subscription paths
- /interfaces/interface/state/counters
- /network-instances/network-instance/protocols/protocol/bgp
- /components/component/state/temperature
```
Pros: real-time, efficient, structured data.
Cons: requires modern switches, more complex setup.
11.3.3 SONiC-Specific Collection
SONiC exposes data via Redis and REST:
```shell
# Direct Redis access
redis-cli -n 2 hgetall "INTERFACE_TABLE:Ethernet0"
# REST API (if enabled)
curl http://localhost:8080/api/v1/interface/Ethernet0
```
11.3.4 Host-Side Collection
For FRR on hosts:
```shell
# Prometheus metrics via frr_exporter
# https://github.com/tynany/frr_exporter
# Manual collection
vtysh -c "show ip bgp summary json" | jq .
```
11.4 Alerting Strategy
11.4.1 Tier 1: Page Immediately (Critical)
| Alert | Condition | Impact |
|---|---|---|
| Spine down | All BGP sessions to spine lost | Reduced fabric capacity |
| ToR down | All BGP sessions to ToR lost | Rack isolated |
| Multiple BFD flaps | >3 flaps in 5 min | Unstable path |
| ECMP degraded | <50% expected paths | Reduced resilience |
11.4.2 Tier 2: Alert (Warning)
| Alert | Condition | Impact |
|---|---|---|
| Single BGP peer down | One session lost | Reduced redundancy |
| Interface errors | >100 errors/min | Potential issues |
| High utilization | >80% sustained | Approaching capacity |
| Route count change | >10% change | Possible misconfiguration |
11.4.3 Tier 3: Informational
| Alert | Condition | Notes |
|---|---|---|
| BGP session established | New peer up | Normal operation |
| Maintenance mode | Graceful shutdown | Expected |
| Config change | Config applied | Audit trail |
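The three tiers can be encoded as a lookup so the paging decision lives in one place. A sketch using illustrative alert names derived from the tables above; a real system would evaluate the conditions against live metrics rather than hard-code names:

```python
# Tier per alert name (illustrative; taken from the tier tables above).
ALERT_TIERS = {
    "spine_down": 1,
    "tor_down": 1,
    "bfd_flapping": 1,
    "ecmp_degraded": 1,
    "bgp_peer_down": 2,
    "interface_errors": 2,
    "high_utilization": 2,
    "route_count_change": 2,
    "bgp_established": 3,
    "maintenance_mode": 3,
    "config_change": 3,
}

def should_page(alert: str) -> bool:
    """Only tier 1 pages immediately; unknown alerts default to tier 2."""
    return ALERT_TIERS.get(alert, 2) == 1
```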
11.5 Sample Dashboards
11.5.1 Fabric Overview Dashboard
┌─────────────────────────────────────────────────────────────┐
│ FABRIC HEALTH OVERVIEW │
├───────────────┬───────────────┬───────────────┬─────────────┤
│ Spines: 2/2 ✓ │ ToRs: 12/12 ✓ │ Hosts: 48/48 ✓│ BGP: 100% ✓ │
├───────────────┴───────────────┴───────────────┴─────────────┤
│ │
│ [Spine-1] ────────────────────────────── [Spine-2] │
│ │ ╲ ╱ │ │
│ │ ╲ ╱ │ │
│ ┌──┴───┴──┐ ┌──────┐ ┌──────┐ ┌┴───┴──┐ │
│ │ Rack 1 │ │Rack 2│ │Rack 3│ │Rack 4 │ ... │
│ │ ToR-A/B │ │ToR-AB│ │ToR-AB│ │ToR-A/B│ │
│ └─────────┘ └──────┘ └──────┘ └───────┘ │
│ │
├─────────────────────────────────────────────────────────────┤
│ ECMP Paths: avg 7.8/8 │ BFD Flaps: 0 │ Drops: 0.001% │
└─────────────────────────────────────────────────────────────┘
11.5.2 Per-Device Dashboard
┌─────────────────────────────────────────────────────────────┐
│ ToR-A1 (Rack 1) │
├─────────────────────────────────────────────────────────────┤
│ BGP Peers: │
│ ├─ Spine-1 (172.20.1.0): Established, 48 prefixes │
│ ├─ Spine-2 (172.20.1.2): Established, 48 prefixes │
│ ├─ Host-11 (172.16.11.0): Established, 1 prefix │
│ ├─ Host-12 (172.16.11.2): Established, 1 prefix │
│ └─ ... (10 more hosts) │
├─────────────────────────────────────────────────────────────┤
│ Interface Utilization: │
│ Eth1/1 (Spine-1): ████████░░ 78% │
│ Eth1/2 (Spine-2): ███████░░░ 65% │
│ Eth1/3 (Host-11): ██░░░░░░░░ 15% │
├─────────────────────────────────────────────────────────────┤
│ System: CPU 23% │ Mem 45% │ Temp 42°C │ PSU: 2/2 │ Fan: OK │
└─────────────────────────────────────────────────────────────┘
11.6 Monitoring Stack Options
11.6.1 Option 1: Prometheus + Grafana
[Switches] ──SNMP──> [SNMP Exporter] ──> [Prometheus] ──> [Grafana]
[Hosts] ──FRR───> [FRR Exporter] ──> │
│
[AlertManager] ──> PagerDuty/Slack
Components:
- Prometheus: time-series database
- SNMP Exporter: converts SNMP to Prometheus metrics
- FRR Exporter: collects FRR/BGP metrics
- Grafana: visualization
- AlertManager: alert routing
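With this stack, a warning-tier rule for a single BGP peer down might look like the fragment below. The metric name `bgp_peer_up` is a placeholder; replace it with whatever your exporter actually exposes:

```yaml
groups:
  - name: fabric-bgp
    rules:
      - alert: BgpPeerDown
        # "bgp_peer_up" is an illustrative metric name; check your exporter
        expr: bgp_peer_up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "BGP peer {{ $labels.peer }} down on {{ $labels.instance }}"
```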
11.6.2 Option 2: InfluxDB + Telegraf + Grafana
[Switches] ──SNMP──────> [Telegraf] ──> [InfluxDB] ──> [Grafana]
[Switches] ──gNMI──────> │
[Hosts] ──exec/proc──> │
Components:
- Telegraf: collection agent (supports SNMP, gNMI, exec)
- InfluxDB: time-series database
- Grafana: visualization
11.6.3 Option 3: Vendor-Specific (SONiC)
SONiC includes built-in telemetry:
```shell
# Enable telemetry container
sudo config feature state telemetry enabled
# Configure streaming
gnmi_cli -address localhost:8080 -subscribe /interfaces
```
11.7 Best Practices
11.7.1 Monitor at Multiple Layers
┌─────────────────────────────────────────────────────────────┐
│ Layer 7: Application metrics (OVN, OpenStack) │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: Flow metrics (GENEVE tunnel health) │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Routing metrics (BGP, ECMP, BFD) ← FOCUS │
├─────────────────────────────────────────────────────────────┤
│ Layer 1-2: Interface metrics (errors, utilization) │
├─────────────────────────────────────────────────────────────┤
│ Physical: Hardware health (temp, PSU, fan) │
└─────────────────────────────────────────────────────────────┘
11.7.2 Establish Baselines
Before alerting, understand normal:
- What is the typical BGP convergence time?
- What is normal interface utilization?
- How many ECMP paths should exist?
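A baseline check can be as simple as flagging values that fall outside a few standard deviations of recent history. A sketch; window size and tolerance are deployment-specific:

```python
from statistics import mean, pstdev

def deviates_from_baseline(history: list[float], current: float,
                           sigmas: float = 3.0) -> bool:
    """True if `current` lies outside mean +/- sigmas * stddev of history.

    A crude baseline: collect at least a week of samples per metric first,
    and tune the tolerance per metric. The tiny floor on the stddev keeps
    perfectly flat history (e.g. a constant ECMP path count) from dividing
    the tolerance down to zero.
    """
    mu = mean(history)
    sd = pstdev(history)
    return abs(current - mu) > sigmas * max(sd, 1e-9)
```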
11.7.3 Correlate Events
When issues occur, correlate:
- BGP state change + interface down = physical failure
- Multiple BGP flaps + no interface errors = control plane issue
- High drops + high utilization = congestion
11.7.4 Automate Remediation
For known issues, consider automated responses:
- BGP peer down → check the interface, attempt a session clear
- High memory → log and alert, prepare for failover
- Temperature high → alert, check environmental conditions
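Automated responses are easiest to audit when expressed as a condition-to-action playbook. A sketch with placeholder handlers; real ones would shell out to the show/clear commands covered earlier and must be rate-limited to avoid remediation loops:

```python
def check_interface_then_clear(target: str) -> str:
    # Placeholder: would verify link state, then issue a BGP session clear.
    return f"{target}: interface checked, session clear queued"

def log_and_prepare_failover(target: str) -> str:
    # Placeholder: would capture state and drain traffic before failover.
    return f"{target}: memory state logged, failover prepared"

PLAYBOOK = {
    "bgp_peer_down": check_interface_then_clear,
    "high_memory": log_and_prepare_failover,
}

def remediate(condition: str, target: str) -> str:
    """Dispatch a known condition to its handler; unknown ones only alert."""
    handler = PLAYBOOK.get(condition)
    if handler is None:
        return f"{target}: no automated action for {condition}, alerting only"
    return handler(target)
```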
11.8 Integration with OVN Monitoring
The underlay and overlay should be monitored together:
| Layer | What to Check | Why |
|---|---|---|
| Underlay | BGP/ECMP paths to remote TEPs | Connectivity foundation |
| Overlay | OVN tunnel status | Actual VM connectivity |
| Combined | End-to-end VM ping | Full stack validation |
OVN monitoring commands:
```shell
# Check tunnel status
ovs-vsctl show | grep -A 2 "Bridge br-int"
# Check OVN connectivity
ovn-sbctl show | grep Chassis
ovn-trace --summary <datapath> 'inport=="<port>" && eth.src==<mac>'
```
See Operations & Maintenance for detailed troubleshooting procedures.
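The "Combined" row's end-to-end VM ping is straightforward to automate by parsing the iputils `ping` summary line. A sketch:

```python
import re

def packet_loss_pct(ping_output: str) -> float:
    """Extract the packet-loss percentage from Linux `ping` output.

    Expects the iputils summary line, e.g.
    "5 packets transmitted, 5 received, 0% packet loss, time 4005ms".
    Raises ValueError if no summary line is found.
    """
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if m is None:
        raise ValueError("no ping summary line found")
    return float(m.group(1))
```

Any loss above zero on the VM-to-VM path warrants correlating against the underlay metrics described above.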