2025 Guide: Network Uptime & Security Monitoring for AI Data Centers

  • SEMNET TEAM
  • Dec 24, 2025
  • 6 min read


Introduction: Why AI‑scale networking is different


AI training and inference generate far more east–west traffic than classic web workloads. Collectives such as all‑reduce, parameter sharding, and pipeline parallelism require high‑throughput RDMA fabrics with ultra‑low latency and near‑zero packet loss. Even small loss or jitter can degrade NCCL performance and lower GPU utilization, inflating time to train and cost per token. NVIDIA’s 2025 guidance highlights that even minor drops or flaps can stall collectives and reduce effective throughput when fabrics are not resilient to interruptions. (developer.nvidia.com)


The business impact is material in 2025. New Relic’s 2025 Observability Forecast reports a median $2 million per hour cost for high‑impact outages and a median $76 million annually, with network failures among the top three causes. Full‑stack observability halves outage costs to $1 million per hour and reduces mean time to detection by 7 minutes. (newrelic.com)


Threat levels have also escalated. Cloudflare blocked a record 11.5 Tbps DDoS attack in 2025, while its Q3 report shows DDoS attacks grew 15% quarter over quarter and 40% year over year, with attacks against AI companies surging up to 347% month over month. (tomshardware.com)


Finally, AI factories are scaling across rooms, halls, and even sites. Vendors are introducing photonics and “scale‑across” Ethernet to connect millions of GPUs and interconnect data centers across metro distances. (investor.nvidia.com)



What to monitor in AI data center networks


Availability signals

  • Device and link up or down across ToR, leaf, spine, super‑spine.

  • Routing adjacencies and control plane health: BGP and OSPF sessions, EVPN overlays.

  • Optical power levels and FEC error rates on 400G or 800G optics, plus DOM thresholds on pluggables or co‑packaged optics.

  • RDMA NIC health, MLAG or MC‑LAG peer state, EVPN VTEP reachability, ECMP path entropy.


Performance signals

  • Latency (p50, p95, p99) and jitter on east–west paths.

  • Packet loss, retransmits, queue depth, ECN marks, buffer occupancy, and microbursts.

  • Link utilization per class of service; headroom and incast indicators.

  • RDMA and RoCEv2 counters: PFC pause frames, CNPs, retransmit timeouts, out‑of‑order or duplicate ACKs. OCI’s RoCE design favors ECN‑based congestion control with limited PFC at the edge to avoid congestion spreading. (developer.nvidia.com)
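The latency and jitter signals above are straightforward to derive from raw probe samples. A minimal sketch with the standard library (function names are illustrative, not from any vendor tooling):

```python
import statistics

def latency_percentiles(samples_us):
    """Return p50/p95/p99 latency (microseconds) from raw probe samples."""
    # quantiles with n=100 yields the 1st..99th percentile cut points
    q = statistics.quantiles(sorted(samples_us), n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

def jitter(samples_us):
    """Mean absolute delta between consecutive samples (RFC 3550 style)."""
    return sum(abs(b - a) for a, b in zip(samples_us, samples_us[1:])) / (len(samples_us) - 1)
```

In practice these would run over a sliding window per path so that p99 and jitter track microburst-driven tails rather than long-term averages.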


Security signals

  • Authentication and authorization events across TACACS, RADIUS, SSO, and certificate lifecycle events.

  • Segmentation and microsegmentation policy hits and denials.

  • IDS or IPS alerts, anomaly scores, and lateral movement indicators in east–west traffic.

  • DDoS indicators: sudden spikes in pps or Tbps, protocol amplification, reflection signatures, and botnet fingerprinting. Cloudflare mitigated 8.3 million attacks in Q3 2025 alone. (blog.cloudflare.com)



Telemetry and data collection


Streaming telemetry, flow records, and INT

  • Favor streaming model‑driven telemetry over infrequent SNMP polling for sub‑second freshness.

  • Export flow data via NetFlow or IPFIX for traffic baselining and investigation. IPFIX is the IETF standard for exporting flow information. (datatracker.ietf.org)

  • Add in‑band telemetry where supported. IOAM or INT can stamp path, timestamps, and queue depth inside packets for hop‑by‑hop visibility, useful for diagnosing tail latency and incast. (datatracker.ietf.org)
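Once flow records are decoded from the IPFIX or NetFlow exporter, baselining is a simple aggregation. A minimal sketch, assuming records already decoded into dicts (the `src`/`dst`/`bytes` field names are illustrative, not the IPFIX wire format):

```python
from collections import defaultdict

def baseline_bytes_per_pair(flows):
    """Aggregate bytes by (src, dst) from decoded flow records.

    Each flow is a dict with 'src', 'dst', and 'bytes' keys -- an
    illustrative shape for records already decoded from IPFIX/NetFlow.
    """
    totals = defaultdict(int)
    for f in flows:
        totals[(f["src"], f["dst"])] += f["bytes"]
    return dict(totals)

def top_talkers(totals, k=3):
    """Return the k highest-volume (src, dst) pairs for dashboarding."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Baselines built this way feed both traffic engineering (is a pod's east–west volume drifting?) and investigation (which pair dominated during the incident window?).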


Logs and events

  • Syslog from switches and routers, OS and agent logs from hosts, change events from controllers and SDN, and API audit logs from orchestrators.

  • NAC outcomes, PKI events, and certificate expiry alarms to catch trust breaks that masquerade as “network flaps.”


Correlate network metrics with GPU and cluster metrics

  • Tie NIC and fabric counters to GPU utilization, GPU idle time, NCCL collective durations, pod restarts, and job preemptions. NCCL 2.24 adds RAS features that help diagnose hangs and crashes across multi‑node collectives, improving the cross‑layer view. (developer.nvidia.com)

  • For inference, track time to first token and per‑token throughput alongside network latency and loss. AWS reports up to a 97% reduction in TTFT p90 for optimized inference profiles; TTFT is highly sensitive to transport jitter and queue spikes. (aws.amazon.com)
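The cross‑layer correlation described above can start as simply as aligning two per‑minute time series and checking whether fabric loss spikes coincide with GPU idle spikes. A minimal sketch with hypothetical sample values:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two aligned time series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative per-minute series: fabric loss (%) vs GPU idle time (%).
# A strong positive correlation suggests collectives stalling on loss.
loss = [0.001, 0.002, 0.050, 0.002, 0.045]
gpu_idle = [3.0, 3.1, 18.0, 3.2, 16.5]
r = pearson(loss, gpu_idle)
```

A high correlation is a hint, not a verdict; confirm causality against NCCL collective durations and link‑level counters for the paths the job actually used.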



Architecture patterns for AI fabrics


Spine–leaf and super‑spine designs

  • Build non‑blocking fabrics with consistent oversubscription targets at each tier. Use ECMP and topology‑aware schedulers to localize east–west traffic.

  • Provide ToR and leaf redundancy with MLAG or EVPN multihoming and fast convergence.


Ethernet RDMA, congestion control, and photonics

  • For RoCEv2, combine ECN‑based DCQCN with limited PFC at the edge to minimize loss without congestion spreading. (developer.nvidia.com)

  • Consider photonics and co‑packaged optics for high‑density spine layers and long‑reach inter‑hall links. NVIDIA’s Spectrum‑X Photonics targets 1.6 Tbps per port with substantial energy savings. (investor.nvidia.com)


Scale‑across and metro interconnects

  • Multi‑site AI factories need deterministic latency and congestion control across tens of kilometers. 2025 announcements enable “scale‑across” Ethernet with precision latency management to unify distributed facilities. Broadcom’s Jericho4 family also targets metro‑scale fabrics with integrated congestion and security features. (investor.nvidia.com)


Service meshes for inference

  • For microservices‑based inference, observe mesh‑level latency, retries, and circuit breakers since these directly influence p95 inference latency and SLO attainment.



Uptime detection and alerting


Black‑box and white‑box monitoring

  • Black‑box: synthetic probes from multiple vantage points to API gateways, model endpoints, and gRPC services; measure packet loss, TLS handshake success, and response codes.

  • White‑box: device sensors, flow and INT data, control plane telemetry, and GPU cluster metrics.
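A black‑box probe can be as small as a timed TCP connect plus TLS handshake, mapped to an alerting verdict. A minimal sketch with the standard library (the 250 ms SLO is an illustrative placeholder):

```python
import socket
import ssl
import time

def tls_probe(host, port=443, timeout=3.0):
    """Black-box probe: measure TCP connect + TLS handshake time in ms."""
    ctx = ssl.create_default_context()
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host):
            pass  # handshake completes inside wrap_socket
    return (time.monotonic() - start) * 1000.0

def evaluate(handshake_ms, slo_ms=250.0):
    """Map a probe measurement (None = unreachable) to a verdict."""
    if handshake_ms is None:
        return "down"
    return "ok" if handshake_ms <= slo_ms else "degraded"
```

Run variants of this from several vantage points so a single probe site's own connectivity issues do not masquerade as an endpoint outage.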


Thresholds and anomaly detection

  • Set SLO‑driven static thresholds for hard limits like packet loss above 0.01% on RDMA classes and queue occupancy above target. Layer on unsupervised anomaly detection to catch novel failures while deduplicating noisy flaps.
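The layering described above — a hard SLO ceiling plus statistical detection below it — can be sketched as a small detector (the 0.01% hard limit mirrors the RDMA loss target; window size and z‑score cutoff are illustrative):

```python
from collections import deque

class LossDetector:
    """Static SLO threshold plus a rolling z-score for novel anomalies.

    hard_limit is the SLO ceiling (e.g. 0.01% loss on RDMA classes);
    the z-score catches unusual behaviour well below that ceiling.
    """
    def __init__(self, hard_limit=0.01, window=20, z_max=3.0):
        self.hard_limit, self.z_max = hard_limit, z_max
        self.window = deque(maxlen=window)

    def observe(self, loss_pct):
        alerts = []
        if loss_pct > self.hard_limit:
            alerts.append("slo-breach")
        if len(self.window) >= 5:
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = var ** 0.5
            if std > 0 and (loss_pct - mean) / std > self.z_max:
                alerts.append("anomaly")
        self.window.append(loss_pct)
        return alerts
```

The point of the two layers: the z‑score fires on a 10x jump that is still under the SLO limit, buying investigation time before the hard breach.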


Event correlation and noise reduction

  • Correlate link‑down storms with optics logs and change events. Apply topological correlation so a single fiber cut does not generate hundreds of alerts.
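Topological correlation reduces to a simple rule: suppress any alert whose upstream device is itself alerting. A minimal sketch over a parent map (device names are illustrative):

```python
def collapse_alerts(alerts, upstream):
    """Suppress alerts whose upstream device is itself alerting.

    alerts: set of device names currently down.
    upstream: dict mapping child -> parent in the physical topology.
    Returns the probable root causes only.
    """
    roots = set()
    for dev in alerts:
        node, is_symptom = dev, False
        while node in upstream:
            node = upstream[node]
            if node in alerts:
                is_symptom = True
                break
        if not is_symptom:
            roots.add(dev)
    return roots
```

With this in the alert pipeline, a spine failure that takes down dozens of leaves and ToRs surfaces as one actionable alert instead of hundreds.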


SLOs, runbooks, and incident timelines

  • Define SLOs for availability, latency, and packet loss per service tier. Maintain runbooks for link drains, route dampening, host quarantines, and rollbacks. Track MTTA, MTTR, and change success rate in post‑incident reviews.
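Availability SLOs become operational once expressed as an error budget. A minimal sketch of the arithmetic (30‑day window is an illustrative default):

```python
def error_budget_minutes(slo, period_days=30):
    """Downtime allowance in minutes for an availability SLO over a window."""
    return (1.0 - slo) * period_days * 24 * 60

def budget_remaining(slo, downtime_min, period_days=30):
    """Fraction of the error budget still unspent (negative = overspent)."""
    budget = error_budget_minutes(slo, period_days)
    return (budget - downtime_min) / budget
```

Tie the remaining budget to change policy: for example, freeze risky fabric changes on a tier once its budget drops below an agreed fraction.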



Security monitoring for AI workloads


Zero trust and microsegmentation

  • Adopt ZTA with strong identity for workloads and users, continuous verification, and least privilege. NIST SP 800‑207 provides the architectural foundation. (csrc.nist.gov)

  • Enforce microsegmentation between training clusters, inference gateways, data stores, and admin planes. Log and analyze policy hits and denials to validate coverage.


Lateral movement detection in east–west traffic

  • Baseline internal flows with NetFlow or IPFIX and raise alerts on unusual fan‑out, port scans, and new service ports inside cluster namespaces. Use INT or IOAM selective sampling to measure path changes during suspected events. (datatracker.ietf.org)
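Unusual fan‑out is one of the cheapest lateral‑movement signals to compute from flow records. A minimal sketch, assuming flows reduced to (source, destination) pairs and a per‑source baseline learned offline (the 5x factor is an illustrative starting point):

```python
from collections import defaultdict

def fanout_by_source(flows):
    """Count distinct destinations per source from (src, dst) flow pairs."""
    dests = defaultdict(set)
    for src, dst in flows:
        dests[src].add(dst)
    return {src: len(d) for src, d in dests.items()}

def flag_fanout(current, baseline, factor=5):
    """Flag sources whose fan-out exceeds factor x their learned baseline."""
    return {s for s, n in current.items() if n > factor * baseline.get(s, 1)}
```

A training host that normally talks to a handful of peers suddenly reaching dozens of destinations is exactly the fan‑out spike this flags for triage.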


DDoS posture for inference endpoints

  • Model endpoints exposed to the internet must be protected against hyper‑volumetric floods and application‑layer attacks. 2025 data shows record peaks and rapid growth in attack frequency. Ensure scrubbing capacity, automatic attack detection, and BGP diversion are tested and measurable. (tomshardware.com)



The observability stack that works for AI


Data pipeline and storage

  • Collect metrics, events, logs, traces, and flow data into a scalable time‑series database plus an event bus for enrichment and correlation.

  • Standardize on OpenTelemetry for service telemetry and adopt current semantic conventions to keep dashboards and alerts stable across upgrades. The community advanced multiple conventions in 2025, and cloud providers now ingest OTLP natively. (opentelemetry.io)


Topology‑aware dashboards and NOC wall design

  • Build topology maps that overlay alarms, congestion, and packet loss on physical and logical views. Include GPU utilization, NCCL durations, and queue depth next to fabric heatmaps.


ITSM and CMDB integrations

  • Sync incidents, changes, and asset metadata so network changes and model deployments are auditable and reversible.



Automation and self‑healing


Intent‑based networking and config as code

  • Express policies for segmentation, QoS, and routing intent in code. Validate in CI with pre‑change tests, then deploy through change windows with automated rollbacks.
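Pre‑change validation in CI can start as plain assertions over the declarative intent before any device is touched. A minimal sketch against a hypothetical intent schema (the `ecn_min_kb`/`ecn_max_kb` and `default_action` fields are illustrative, not a real controller's model):

```python
def validate_intent(intent):
    """Pre-change CI checks on a declarative fabric intent (illustrative schema)."""
    errors = []
    for link in intent.get("links", []):
        # ECN marking must start below the max threshold to be meaningful.
        if link.get("ecn_min_kb", 0) >= link.get("ecn_max_kb", 0):
            errors.append(f"{link['name']}: ECN min threshold must be below max")
    for seg in intent.get("segments", []):
        # Microsegmentation policy must fail closed.
        if seg.get("default_action") != "deny":
            errors.append(f"{seg['name']}: segments must default-deny")
    return errors
```

A non‑empty error list fails the pipeline, so a misordered ECN threshold or a fail‑open segment never reaches a change window.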


Auto‑remediation examples

  • Drain links that exceed error thresholds and reroute traffic to healthy ECMP paths.

  • Quarantine hosts that trigger excessive PFC storms or anomalous east–west fan‑out.

  • Rate‑limit or geo‑block during DDoS bursts while preserving VIPs for inference APIs.
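The link‑drain case above needs a guardrail so remediation cannot shrink the fabric below safe capacity. A minimal sketch of the decision (error limit and path floor are illustrative placeholders for tier‑specific values):

```python
def should_drain(link_errors_per_min, healthy_ecmp_paths, err_limit=100, min_paths=2):
    """Drain a link only when errors exceed the limit AND enough ECMP paths remain.

    Refusing to drain when healthy_ecmp_paths would drop to min_paths or
    fewer avoids self-inflicted capacity loss during a correlated failure.
    """
    return link_errors_per_min > err_limit and healthy_ecmp_paths > min_paths
```

The same pattern — condition plus capacity guardrail plus automatic rollback — applies to host quarantine and DDoS rate‑limiting actions as well.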



Compliance and reporting


  • Retain evidence with immutable logs and strong encryption. Restrict access with least privilege and just‑in‑time elevation.

  • For executive reporting, summarize uptime, SLO attainment, major incidents, blocked threats, and the correlation between GPU utilization and network health.



KPI scorecard for AI data center networks


Define targets by tier and workload. Example ranges below are starting points that you should refine in pilot tests.


  • Availability: five‑nines for core switching and critical inference gateways where feasible. Five‑nines allows roughly 5 minutes 15 seconds of downtime per year. Four‑nines allows roughly 52 minutes 36 seconds. (techtarget.com)

  • Latency: intra‑pod sub‑millisecond, intra‑rack low microseconds for RDMA paths, inter‑rack single‑digit microseconds to low milliseconds depending on hop count and optics.

  • Loss: RDMA classes target near‑zero loss. Investigate any sustained loss above 0.01%.

  • Jitter: keep p95 jitter tightly bounded on inference paths since TTFT and streaming quality are sensitive to jitter bursts. Optimized cloud inference profiles in 2025 show up to 97% TTFT p90 reductions when transport and stack tuning are aligned. (aws.amazon.com)

  • GPU utilization: correlate fabric health to GPU idle time and NCCL collective durations. Investigate dips during congestion or link flaps that can stall collectives. (developer.nvidia.com)

  • Security: MTTR to contain lateral movement under 15 minutes on critical segments. Track blocked threats and policy coverage percentage.
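The availability figures in the scorecard follow directly from the "nines" arithmetic; a quick sketch for converting a target into a yearly downtime allowance (using a 365.25‑day year, which matches the figures commonly cited):

```python
def downtime_per_year(nines, days=365.25):
    """Yearly downtime allowance as (minutes, seconds) for N nines of availability."""
    unavailability = 10.0 ** (-nines)          # e.g. five nines -> 1e-5
    total_min = unavailability * days * 24 * 60
    return int(total_min), (total_min - int(total_min)) * 60
```

This is also handy in reverse during SLO negotiation: asking whether a tier can really tolerate only five minutes of downtime a year tends to sharpen the conversation about redundancy spend.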



Conclusion and next steps


Networks are the nervous system of AI factories. In 2025, the cost of downtime, the surge in DDoS volume, and the sensitivity of RDMA fabrics to loss make network infrastructure monitoring and security monitoring mission‑critical. Invest in streaming telemetry, flow and INT data, and OpenTelemetry alignment so you can correlate network health with GPU utilization and model throughput. Codify runbooks, practice auto‑remediation, and pressure‑test your SLOs with chaos drills. The outcome is measurable: higher GPU efficiency, faster time to train, lower inference latency, and fewer costly incidents. (newrelic.com)


Call to action: assess your current monitoring maturity and download the readiness checklist below.


  • Asset 1: Spine–leaf topology diagram with RDMA classes and ECN settings.

  • Asset 2: Sample NOC dashboard layout with topology, queue depth, ECN marks, GPU utilization, and TTFT.

  • Asset 3: KPI scorecard template for uptime, latency, loss, GPU utilization, and security MTTR.

