11  Monitoring & Observability

11.1 Overview

Effective monitoring is critical for operating a production datacenter network. This chapter covers what to monitor, how to collect metrics, and suggested alerting strategies.

Philosophy: monitor at the right layer. The underlay (BGP/ECMP) needs simple but comprehensive monitoring; the overlay (OVN) has its own observability tooling (see 11.8).

11.2 What to Monitor

11.2.1 BGP Session Health

BGP is the control plane: if BGP is unhealthy, routing is broken.

Metric              Normal           Warning           Critical
BGP session state   Established      Active/Connect    Idle/Down
Peer count          Expected count   1 peer down       >1 peer down
Received prefixes   Expected range   ±10% change       Major change
Sent prefixes       Expected count   Mismatch          Zero

Key commands:

# On SONiC/FRR
vtysh -c "show ip bgp summary"
vtysh -c "show ip bgp neighbor <IP> received-routes"
vtysh -c "show ip bgp neighbor <IP> advertised-routes"

Sample output to parse:

Neighbor        V    AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
172.16.11.0     4 66111     12543     12498        0    0    0 01:23:45           12
172.16.11.2     4 66112     12501     12499        0    0    0 01:23:40           12
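The State/PfxRcd column is numeric only when a session is Established, which makes the table easy to check mechanically. A minimal sketch (the sample rows below are illustrative, not live output):

```shell
# Flag neighbors whose State/PfxRcd field is a state name rather than a
# prefix count (i.e. the session is not Established).
summary='172.16.11.0     4 66111     12543     12498        0    0    0 01:23:45           12
172.16.11.2     4 66112     12501     12499        0    0    0 01:23:40       Active'

bad_peers=$(echo "$summary" | awk '$NF !~ /^[0-9]+$/ {print $1, $NF}')
echo "$bad_peers"
```

In production you would feed it the output of `vtysh -c "show ip bgp summary"` with the header lines stripped.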

11.2.2 BFD Session Health

BFD provides sub-second failure detection.

Metric           Normal       Warning    Critical
BFD state        Up           -          Down
BFD flaps        0            1-2/hour   >2/hour
Detection time   Configured   -          Mismatched

Key commands:

vtysh -c "show bfd peer"
vtysh -c "show bfd peer <IP> json"
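The JSON form is the easiest to consume programmatically. A rough sketch that counts sessions not in the up state (the abbreviated sample mirrors FRR's "status" field; verify field names against your FRR version):

```shell
# Count BFD sessions whose "status" field is anything but "up".
bfd_json='[{"peer":"172.16.11.0","status":"up"},{"peer":"172.16.11.2","status":"down"}]'

down_bfd=$(echo "$bfd_json" | grep -o '"status":"[a-z]*"' | grep -vc '"up"')
echo "BFD sessions down: $down_bfd"
```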

11.2.3 Route Table Health

ECMP paths should be balanced and complete.

Metric            Normal           Warning   Critical
ECMP path count   Expected (2-8)   Reduced   Single path
Total routes      Expected range   ±5%       Major drop
Default route     Present          -         Missing

Key commands:

# On hosts
ip route show 10.255.0.0/16 | grep -c "nexthop"

# On switches
vtysh -c "show ip route summary"
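Comparing the live nexthop count against the expected count catches silent ECMP degradation. A sketch against a canned `ip route show` sample (addresses and interface names are illustrative):

```shell
# Sample multipath route as printed by "ip route show 10.255.0.0/16".
route_output='10.255.0.0/16 proto bgp metric 20
    nexthop via 172.16.11.1 dev eth0 weight 1
    nexthop via 172.16.11.3 dev eth1 weight 1'

expected_paths=2
actual_paths=$(echo "$route_output" | grep -c "nexthop")
if [ "$actual_paths" -lt "$expected_paths" ]; then
  echo "ECMP degraded: $actual_paths/$expected_paths paths"
fi
```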

11.2.4 Interface Health

Physical layer issues cause forwarding problems.

Metric          Normal   Warning   Critical
Link state      Up       -         Down
Input errors    0        >0        Increasing
Output errors   0        >0        Increasing
CRC errors      0        >0        Increasing
Discards        Near 0   >1%       >5%
Utilization     <80%     >80%      >95%

Key commands:

# On SONiC
show interfaces counters
show interfaces status

# On Linux hosts
ethtool -S eth0 | grep -E "error|drop|crc"
ip -s link show eth0
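Error counters are more useful summed and trended than eyeballed. A sketch that totals the error counters from `ethtool -S`-style output (the sample counters are made up):

```shell
# Abbreviated, illustrative "ethtool -S" output.
counters='rx_errors: 0
tx_errors: 0
rx_crc_errors: 3
rx_dropped: 12'

# Sum every counter with "error" in its name; any nonzero total is a warning.
error_total=$(echo "$counters" | awk -F': ' '/error/ {sum += $2} END {print sum+0}')
echo "error counters total: $error_total"
```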

11.2.5 Buffer/Queue Health

Micro-bursts can cause drops even on links whose average utilization looks low.

Metric        Normal   Warning      Critical
Queue depth   Low      Increasing   Near max
Tail drops    0        >0           Sustained
ECN marks     Low      Increasing   High

SONiC commands:

show queue counters
show priority-group watermark

11.2.6 System Health

Switch hardware health affects network reliability.

Metric               Normal   Warning    Critical
CPU utilization      <50%     >70%       >90%
Memory utilization   <70%     >80%       >95%
Temperature          Normal   High       Critical
PSU status           All OK   1 failed   Multiple failed
Fan status           All OK   1 failed   Multiple failed

SONiC commands:

show system-health summary
show platform psustatus
show platform fan
show platform temperature

11.3 Data Collection Methods

11.3.1 SNMP (Simple Network Management Protocol)

Traditional but widely supported:

# Example SNMP polling targets
- ifHCInOctets / ifHCOutOctets  # Interface bytes
- ifInErrors / ifOutErrors      # Interface errors
- bgpPeerState                  # BGP session state
- sysUpTime                     # System uptime

Pros: universal support, mature tooling.
Cons: polling-based (not real-time), limited scalability.

11.3.2 Streaming Telemetry (gNMI/gRPC)

Modern, push-based collection:

# Example gNMI subscription paths
- /interfaces/interface/state/counters
- /network-instances/network-instance/protocols/protocol/bgp
- /components/component/state/temperature

Pros: real-time, efficient, structured data.
Cons: requires modern switches, more complex setup.

11.3.3 SONiC-Specific Collection

SONiC exposes data via Redis and REST:

# Direct Redis access
redis-cli -n 2 hgetall "INTERFACE_TABLE:Ethernet0"

# REST API (if enabled)
curl http://localhost:8080/api/v1/interface/Ethernet0

11.3.4 Host-Side Collection

For FRR on hosts:

# Prometheus metrics via frr_exporter
# https://github.com/tynany/frr_exporter

# Manual collection
vtysh -c "show ip bgp summary json" | jq .
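The JSON output can also be reduced to a single gauge without jq. A rough sketch that counts non-Established peers (the sample JSON is heavily abbreviated; field names follow FRR's summary output but should be verified against your version):

```shell
# Abbreviated, illustrative "show ip bgp summary json" output.
bgp_json='{"ipv4Unicast":{"peers":{"172.16.11.0":{"state":"Established"},"172.16.11.2":{"state":"Active"}}}}'

down_peers=$(echo "$bgp_json" | grep -o '"state":"[A-Za-z]*"' | grep -vc '"Established"')
echo "peers not established: $down_peers"
```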

11.4 Alerting Strategy

11.4.1 Tier 1: Page Immediately (Critical)

Alert                Condition                        Impact
Spine down           All BGP sessions to spine lost   Reduced fabric capacity
ToR down             All BGP sessions to ToR lost     Rack isolated
Multiple BFD flaps   >3 flaps in 5 min                Unstable path
ECMP degraded        <50% of expected paths           Reduced resilience

11.4.2 Tier 2: Alert (Warning)

Alert                  Condition          Impact
Single BGP peer down   One session lost   Reduced redundancy
Interface errors       >100 errors/min    Potential issues
High utilization       >80% sustained     Approaching capacity
Route count change     >10% change        Possible misconfiguration

11.4.3 Tier 3: Informational

Alert                     Condition           Notes
BGP session established   New peer up         Normal operation
Maintenance mode          Graceful shutdown   Expected
Config change             Config applied      Audit trail

11.5 Sample Dashboards

11.5.1 Fabric Overview Dashboard

┌─────────────────────────────────────────────────────────────┐
│                    FABRIC HEALTH OVERVIEW                    │
├───────────────┬───────────────┬───────────────┬─────────────┤
│ Spines: 2/2 ✓ │ ToRs: 12/12 ✓ │ Hosts: 48/48 ✓│ BGP: 100% ✓ │
├───────────────┴───────────────┴───────────────┴─────────────┤
│                                                              │
│   [Spine-1] ────────────────────────────── [Spine-2]        │
│      │ ╲                              ╱ │                    │
│      │  ╲                            ╱  │                    │
│   ┌──┴───┴──┐  ┌──────┐  ┌──────┐  ┌┴───┴──┐               │
│   │ Rack 1  │  │Rack 2│  │Rack 3│  │Rack 4 │  ...          │
│   │ ToR-A/B │  │ToR-AB│  │ToR-AB│  │ToR-A/B│               │
│   └─────────┘  └──────┘  └──────┘  └───────┘               │
│                                                              │
├─────────────────────────────────────────────────────────────┤
│ ECMP Paths: avg 7.8/8 │ BFD Flaps: 0 │ Drops: 0.001%       │
└─────────────────────────────────────────────────────────────┘

11.5.2 Per-Device Dashboard

┌─────────────────────────────────────────────────────────────┐
│                    ToR-A1 (Rack 1)                          │
├─────────────────────────────────────────────────────────────┤
│ BGP Peers:                                                   │
│   ├─ Spine-1 (172.20.1.0): Established, 48 prefixes        │
│   ├─ Spine-2 (172.20.1.2): Established, 48 prefixes        │
│   ├─ Host-11 (172.16.11.0): Established, 1 prefix          │
│   ├─ Host-12 (172.16.11.2): Established, 1 prefix          │
│   └─ ... (10 more hosts)                                    │
├─────────────────────────────────────────────────────────────┤
│ Interface Utilization:                                       │
│   Eth1/1 (Spine-1): ████████░░ 78%                         │
│   Eth1/2 (Spine-2): ███████░░░ 65%                         │
│   Eth1/3 (Host-11): ██░░░░░░░░ 15%                         │
├─────────────────────────────────────────────────────────────┤
│ System: CPU 23% │ Mem 45% │ Temp 42°C │ PSU: 2/2 │ Fan: OK │
└─────────────────────────────────────────────────────────────┘

11.6 Monitoring Stack Options

11.6.1 Option 1: Prometheus + Grafana

[Switches] ──SNMP──> [SNMP Exporter] ──> [Prometheus] ──> [Grafana]
[Hosts]    ──FRR───> [FRR Exporter]  ──>      │
                                              │
                                    [AlertManager] ──> PagerDuty/Slack

Components:
- Prometheus: time-series database
- SNMP Exporter: converts SNMP data to Prometheus metrics
- FRR Exporter: collects FRR/BGP metrics
- Grafana: visualization
- AlertManager: alert routing
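As a sketch of how the tiers in 11.4 map onto this stack, a Prometheus alerting rule might look like the following. The metric name `frr_bgp_peer_state` and the Established value are assumptions; check what your exporter actually exports before deploying.

```yaml
groups:
  - name: fabric-underlay
    rules:
      - alert: BGPPeerDown            # Tier 2: one session lost
        # frr_bgp_peer_state is an assumed metric name; 6 = Established
        # in the BGP FSM enum. Verify both against your frr_exporter.
        expr: frr_bgp_peer_state != 6
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "BGP peer {{ $labels.peer }} down on {{ $labels.instance }}"
```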

11.6.2 Option 2: InfluxDB + Telegraf + Grafana

[Switches] ──SNMP──────> [Telegraf] ──> [InfluxDB] ──> [Grafana]
[Switches] ──gNMI──────>     │
[Hosts]    ──exec/proc──>    │

Components:
- Telegraf: collection agent (supports SNMP, gNMI, exec)
- InfluxDB: time-series database
- Grafana: visualization
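A minimal Telegraf SNMP input sketch for this stack. The agent address and community string are placeholders; the OID shown is the standard sysUpTime.

```toml
[[inputs.snmp]]
  agents = ["udp://tor-a1.example:161"]   # placeholder switch address
  version = 2
  community = "public"                    # placeholder community string

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"             # sysUpTime

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]        # placeholder InfluxDB endpoint
```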

11.6.3 Option 3: Vendor-Specific (SONiC)

SONiC includes built-in telemetry:

# Enable telemetry container
sudo config feature state telemetry enabled

# Configure streaming
gnmi_cli -address localhost:8080 -subscribe /interfaces

11.7 Best Practices

11.7.1 Monitor at Multiple Layers

┌─────────────────────────────────────────────────────────────┐
│ Layer 7: Application metrics (OVN, OpenStack)               │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: Flow metrics (GENEVE tunnel health)                │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Routing metrics (BGP, ECMP, BFD)        ← FOCUS   │
├─────────────────────────────────────────────────────────────┤
│ Layer 1-2: Interface metrics (errors, utilization)         │
├─────────────────────────────────────────────────────────────┤
│ Physical: Hardware health (temp, PSU, fan)                  │
└─────────────────────────────────────────────────────────────┘

11.7.2 Establish Baselines

Before alerting, understand what normal looks like:
- What's the typical BGP convergence time?
- What's normal interface utilization?
- How many ECMP paths should exist?

11.7.3 Correlate Events

When issues occur, correlate signals:
- BGP state change + interface down = physical failure
- Multiple BGP flaps + no interface errors = control-plane issue
- High drops + high utilization = congestion

11.7.4 Automate Remediation

For known issues, consider automated responses:
- BGP peer down → check the interface, attempt a session clear
- High memory → log and alert, prepare for failover
- High temperature → alert, check environmentals

11.8 Integration with OVN Monitoring

The underlay and overlay should be monitored together:

Layer      What to Check                   Why
Underlay   BGP/ECMP paths to remote TEPs   Connectivity foundation
Overlay    OVN tunnel status               Actual VM connectivity
Combined   End-to-end VM ping              Full stack validation

OVN monitoring commands:

# Check tunnel status
ovs-vsctl show | grep -A 2 "Bridge br-int"

# Check OVN connectivity
ovn-sbctl show | grep Chassis
ovn-trace --summary <datapath> 'inport=="<port>" && eth.src==<mac>'
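A quick overlay sanity check is simply counting registered chassis against the expected host count. A sketch against canned `ovn-sbctl show`-style output (hostnames are illustrative):

```shell
# Abbreviated, illustrative "ovn-sbctl show" output.
chassis_sample='Chassis "host-11"
    hostname: host-11
Chassis "host-12"
    hostname: host-12'

expected_chassis=2
chassis_count=$(echo "$chassis_sample" | grep -c '^Chassis')
echo "registered chassis: $chassis_count/$expected_chassis"
```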

See Operations & Maintenance for detailed troubleshooting procedures.