Running a 1,024-GPU training job for 30 days carries a 57% chance of hitting at least one hardware failure. At 256 GPUs it drops to 19%. Databricks published a detailed breakdown of its GPU reliability stack this week — the first in a series — covering failure classification, stress testing, and multi-stage health checks across the fleet serving 125 trillion tokens per month.

FIG. 02 Failure probability rises sharply with cluster size. A 1,024-GPU job has a 57% chance of at least one hardware failure over 30 days. — Databricks, 2024

Databricks splits GPU failures into three categories. Crashed jobs are the easiest to detect: an NCCL watchdog timeout kills the run immediately and training restarts from the last checkpoint. The timeout itself reveals nothing about the underlying cause, so diagnosis requires tracing the hardware, fabric, filesystem, and software layers. Silent slowdowns are harder. A degraded GPU keeps training moving and loss trending down, but overall throughput is bottlenecked by the slowest node. Symptoms surface in hardware signals: DCGM throttle reasons for thermal events, InfiniBand link health metrics for fabric degradation, and memory-bandwidth counters as ECC faults accumulate. Numerical corruption is the hardest. ECC catches and corrects many transient faults transparently, but when a fault slips past ECC, training continues with wrong values. The result shows up as NaN loss, unstable convergence, or model-quality regressions that are visible only at eval time.
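
One way to picture silent-slowdown detection is a straggler check over per-node timings. The sketch below is illustrative only and is not Databricks' method: it assumes each node reports its local compute time per step (measured before it enters the collective), and the function name `find_stragglers`, the node labels, the sample timings, and the 15% tolerance are all hypothetical.

```python
from statistics import median


def find_stragglers(step_seconds_by_node: dict[str, float],
                    tolerance: float = 1.15) -> list[str]:
    """Return nodes whose local step time exceeds the fleet median by `tolerance`x.

    Hypothetical helper: the 1.15 threshold is an assumption, not a published value.
    """
    typical = median(step_seconds_by_node.values())
    return sorted(
        node for node, seconds in step_seconds_by_node.items()
        if seconds > typical * tolerance
    )


# Made-up timings: node-2 is thermally throttled and runs ~40% slower.
samples = {"node-0": 1.02, "node-1": 0.99, "node-2": 1.41, "node-3": 1.00}
stragglers = find_stragglers(samples)
print(stragglers)
```

Comparing against the median rather than the mean keeps one very slow node from dragging the baseline toward itself. A production check would also correlate a flagged node with the hardware signals named above before taking it out of rotation.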

FIG. 03 Three GPU failure modes: crashed (visible), silent slowdown (lurking), and silent corruption (creeping). Detection strategy differs for each. — Databricks, 2024

The math drives priority. Databricks models each GPU at a 1% annualized failure rate. If failures are independent, the chance that every GPU survives shrinks exponentially with cluster size: over 30 days, 256 GPUs face ~19% odds of at least one failure, and 1,024 GPUs face ~57%. These aren't tail risks; they're baseline operational reality. Training infrastructure must tolerate failure by design, not handle it as an exception.
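
The quoted odds can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes failures are independent and that the 1% annual rate compounds uniformly over the year. The source states the rate and the results but not the exact formula, so the model and the function name `p_at_least_one_failure` are assumptions.

```python
def p_at_least_one_failure(num_gpus: int, days: float,
                           annual_failure_rate: float = 0.01) -> float:
    """Probability that at least one of `num_gpus` fails within `days`.

    Assumes independent failures and a constant per-GPU annualized rate.
    """
    # Chance that a single GPU survives the window.
    survive_one = (1.0 - annual_failure_rate) ** (days / 365.0)
    # The job sees zero failures only if every GPU survives.
    return 1.0 - survive_one ** num_gpus


for n in (256, 1024):
    print(f"{n} GPUs, 30 days: {p_at_least_one_failure(n, 30):.0%}")
# 256 GPUs, 30 days: 19%
# 1024 GPUs, 30 days: 57%
```

Under the same assumptions, quadrupling the cluster roughly triples the failure odds rather than quadrupling them, because the probability saturates toward 100% as the GPU count grows.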

Databricks surfaces failures early by running demanding workloads on customer hardware: reinforcement learning for KARL (its agentic coding model), agentic evaluation pipelines, and document-intelligence systems. RL workloads stress the stack by combining training, inference, and reward computation in tight loops across many GPUs, hitting fabric, thermal, and collective-communication edge cases that lighter workloads miss. One recent example: a training run failed with an NCCL timeout after seven hours. Investigation traced it to a single InfiniBand port that had degraded after a recovery yet logged no errors. Only the throughput drop triggered the timeout.

Catching such failures requires probing at every phase of the node lifecycle. Databricks' multi-stage health check validates GPU hardware before workloads start, monitors for silent degradation under load, and probes inter-node NCCL fabric health between jobs. On the inference side, which routes traffic for Kimi, Qwen, OpenAI, Gemini, and Claude endpoints, the health checks themselves can fail under heavy load: the probes time out behind queued requests, and the resulting false liveness failures kill healthy servers. The fix: give health-check traffic the highest scheduling priority. Recovery then runs in under five minutes: detect the hang, kill the unhealthy server, restart. False kills dropped from several per week to zero.
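
The priority fix can be illustrated with a small request queue in which probes always jump ahead of queued inference work. This is a sketch of the general technique, not Databricks' scheduler. The class `PriorityScheduler`, the priority constants, and the request labels are hypothetical.

```python
import heapq
import itertools

# Lower number is served first. Both constants are illustrative assumptions.
HEALTH_CHECK = 0
INFERENCE = 1


class PriorityScheduler:
    """Min-heap queue that serves health checks before inference requests."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        # Monotonic counter keeps FIFO order among requests of equal priority.
        self._order = itertools.count()

    def submit(self, priority: int, request: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def next_request(self) -> str:
        _, _, request = heapq.heappop(self._heap)
        return request


sched = PriorityScheduler()
for i in range(3):
    sched.submit(INFERENCE, f"generate-{i}")  # backlog builds under load
sched.submit(HEALTH_CHECK, "liveness-probe")  # arrives last

first = sched.next_request()
print(first)  # liveness-probe
```

Because the probe is dequeued ahead of the backlog, its response time no longer grows with queue depth, which is what lets a loaded but healthy server answer before the liveness timeout fires.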

The headline's 80% figure needs precision. It refers to GPU cost savings from model-unit-based autoscaling versus static provisioning, not to mean time to recovery (MTTR). Provisioning statically for peak load is unsustainable; dynamic allocation keeps replica counts near actual demand for bursty workloads. The actual recovery-time win is the sub-five-minute recovery cycle. Both numbers come from the same platform but solve different problems: cost efficiency and fault tolerance are linked only in that static overprovisioning doesn't buy reliability.

Platform teams running multi-hundred-GPU clusters need hardware-signal monitoring — DCGM metrics, link health, memory bandwidth — not just job-level observability. Thermal throttling looks like a slow job. A degraded InfiniBand port looks like noise. ECC-corrected faults look like nothing until they don't. Health checks are only as good as their scheduling priority and probe breadth.

Written and edited by AI agents