Databricks GPU health checks detect silent failures in under five minutes

Running a 1,024-GPU training job for 30 days carries a 57% chance of hitting at least one hardware failure. At 256 GPUs it drops to 19%. Databricks published a detailed breakdown of its GPU reliability stack this week — the first in a series — covering failure classification, stress testing, and multi-stage health checks across the fleet serving 125 trillion tokens per month.

FIG. 02 Failure probability rises sharply with cluster size. A 1,024-GPU job has a 57% chance of at least one hardware failure over 30 days. — Databricks, 2024

Databricks splits GPU failures into three categories. Crashed jobs are easiest: a NCCL watchdog timeout kills the run immediately and training restarts from checkpoint. The timeout itself reveals nothing about the underlying cause — diagnosis requires tracing hardware, fabric, filesystem, and software layers. Silent slowdowns are harder. A degraded GPU keeps training progress moving and loss trending down, but throughput gets bottlenecked on the slowest node. Symptoms surface in hardware signals: DCGM throttle reasons for thermal events, InfiniBand link health metrics for degradation, memory-bandwidth counters as ECC faults accumulate. Numerical corruption is hardest. ECC catches and corrects many transient faults transparently, but when it fails, training continues with wrong values — manifesting as NaN loss, unstable convergence, or model-quality regressions only visible at eval time.

FIG. 03 Three GPU failure modes: crashed (visible), silent slowdown (lurking), and silent corruption (creeping). Detection strategy differs for each. — Databricks, 2024
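
To make the silent-slowdown case concrete, here is a minimal sketch of straggler detection. It is not Databricks' method: the function name, the 15% tolerance, and the sample timings are all hypothetical. It assumes you can measure each node's local compute time per step, before the collective synchronizes everyone to the slowest node.

```python
from statistics import median


def find_stragglers(step_seconds_by_node, tolerance=1.15):
    """Return nodes whose mean local step time exceeds the fleet median by `tolerance`x.

    `step_seconds_by_node` maps a node name to a list of local compute times
    (seconds per step, measured before the collective). Hypothetical input shape.
    """
    means = {
        node: sum(times) / len(times)
        for node, times in step_seconds_by_node.items()
    }
    fleet_median = median(means.values())
    return sorted(
        node for node, mean in means.items() if mean > fleet_median * tolerance
    )


# Hypothetical samples: node-c runs about 40% slower than its peers.
samples = {
    "node-a": [1.00, 1.02, 0.99],
    "node-b": [1.01, 1.00, 1.03],
    "node-c": [1.40, 1.38, 1.45],
    "node-d": [0.98, 1.01, 1.00],
}

print(find_stragglers(samples))  # ['node-c']
```

Comparing against the fleet median rather than a fixed threshold means the check still works when the whole job gets faster or slower, for example after a batch-size change. It only says which node is slow, not why; the cause still has to come from hardware signals.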

The math drives priority. Databricks models each GPU at a 1% annualized failure event rate, which it describes as a conservative back-of-the-envelope assumption. Over 30 days, a 256-GPU job faces roughly 19% odds of at least one failure; a 1,024-GPU job faces roughly 57%. These aren't tail risks; they're baseline operational reality. Training infrastructure must tolerate failure by design, not treat it as an exception.
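
The 19% and 57% figures can be reproduced with a few lines of arithmetic. The sketch below assumes failures are independent across GPUs and that the 1% annual rate scales evenly over time; Databricks does not spell out its exact formula, so treat this as one plausible reconstruction. The function name is hypothetical.

```python
def p_any_failure(num_gpus, days, annual_rate=0.01):
    """Probability that at least one of `num_gpus` GPUs fails within `days`.

    Assumes independent GPUs, each with `annual_rate` probability of a
    failure event per year, spread evenly across the year.
    """
    # Probability that a single GPU survives the whole window.
    per_gpu_survival = (1.0 - annual_rate) ** (days / 365.0)
    # The job sees no failure only if every GPU survives.
    return 1.0 - per_gpu_survival ** num_gpus


for n in (256, 1024):
    print(f"{n} GPUs over 30 days: {p_any_failure(n, 30):.0%}")
# 256 GPUs over 30 days: 19%
# 1024 GPUs over 30 days: 57%
```

The probability grows quickly with cluster size because the job fails if any single GPU does, so the per-GPU survival probability is raised to the power of the GPU count.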

Databricks surfaces failures early by running demanding workloads on customer hardware: reinforcement learning for KARL (its agentic coding model), agentic evaluation pipelines, and document-intelligence systems. RL workloads stress the stack by combining training, inference, and reward computation in tight loops across many GPUs, hitting fabric, thermal, and collective-communication edge cases that lighter workloads miss. One recent example: a training run failed with a NCCL watchdog timeout seven hours in. Investigation traced it to a single InfiniBand port that had gone down once and never fully recovered, yet left no error in the logs. Only the throughput drop triggered the timeout.

Catching such failures requires probing at every phase of the node lifecycle. Databricks' multi-stage health check validates GPU hardware before workloads start, monitors for silent degradation under load, and probes inter-node NCCL fabric health between jobs. On the inference side, which routes traffic for Kimi, Qwen, OpenAI, Gemini, and Claude endpoints, the health checks themselves can fail under heavy load: probes time out, and the resulting false liveness-probe failures kill healthy servers. The fix was to give health-check traffic the highest scheduling priority. With that in place, the full cycle of detecting a hang, killing the unhealthy server, and restarting runs in under five minutes, and false kills dropped from several per week to zero.
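
The idea of prioritizing health-check traffic can be sketched with a priority queue. This is an illustration under assumed names (`PriorityScheduler`, `HEALTH_CHECK`, `INFERENCE`), not Databricks' serving code: a liveness probe submitted behind a backlog of inference requests is still served first, so it cannot time out just because the server is busy.

```python
import heapq
import itertools

# Lower number means served first. Both constants are hypothetical.
HEALTH_CHECK, INFERENCE = 0, 1


class PriorityScheduler:
    """Toy request queue that always serves health checks before inference."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order among requests of equal priority.
        self._seq = itertools.count()

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def next_request(self):
        return heapq.heappop(self._heap)[2]


scheduler = PriorityScheduler()
for i in range(3):
    scheduler.submit(INFERENCE, f"inference-{i}")
scheduler.submit(HEALTH_CHECK, "liveness-probe")  # arrives behind a backlog
scheduler.submit(INFERENCE, "inference-3")

order = [scheduler.next_request() for _ in range(5)]
print(order)
# ['liveness-probe', 'inference-0', 'inference-1', 'inference-2', 'inference-3']
```

With this ordering, a probe that does time out is a much stronger signal that the server is truly hung rather than merely saturated.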

The 80% figure in Databricks' post needs precision. It refers to GPU cost savings from model-unit-based autoscaling versus static provisioning, not to mean time to recovery (MTTR). Static allocation sized for peak is unsustainable; dynamic allocation keeps replica counts near actual demand for bursty workloads. The recovery-time win is the separate sub-five-minute recovery cycle. Both numbers come from the same platform but solve different problems: cost efficiency and fault tolerance are linked only in that static overprovisioning doesn't buy reliability.
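
A small worked example shows why bursty demand makes static peak provisioning expensive. Everything here is hypothetical: the demand trace, the per-replica capacity, and the function name are invented for illustration, and the result is not meant to reproduce Databricks' 80% figure. The sketch also ignores scale-up lag and minimum warm capacity, which a real autoscaler has to handle.

```python
import math


def replica_hours(demand_per_hour, capacity_per_replica, static=False):
    """Total replica-hours needed to serve an hourly demand trace.

    static=True provisions for the peak hour all the time;
    static=False resizes every hour to match that hour's demand.
    """
    needed = [
        max(1, math.ceil(demand / capacity_per_replica))
        for demand in demand_per_hour
    ]
    if static:
        return max(needed) * len(needed)
    return sum(needed)


# Hypothetical hourly demand in requests per second, with one burst.
demand = [20, 30, 20, 100, 40, 20, 10, 20]
capacity = 10  # hypothetical requests per second one replica can serve

static_cost = replica_hours(demand, capacity, static=True)
dynamic_cost = replica_hours(demand, capacity)
savings = 1 - dynamic_cost / static_cost

print(static_cost, dynamic_cost, f"{savings:.1%}")  # 80 26 67.5%
```

The savings depend entirely on how peaky the trace is: the larger the gap between peak and typical demand, the more replica-hours static provisioning wastes.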

Platform teams running multi-hundred-GPU clusters need hardware-signal monitoring — DCGM metrics, link health, memory bandwidth — not just job-level observability. Thermal throttling looks like a slow job. A degraded InfiniBand port looks like noise. ECC-corrected faults look like nothing until they don't. Health checks are only as good as their scheduling priority and probe breadth.
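
As an example of reading a hardware signal directly, the sketch below decodes a clocks-throttle-reasons bitmask into named reasons. The bit values are the ones defined in NVIDIA's NVML header as best I know, and DCGM is assumed to mirror them; verify both against the headers for your driver version before relying on them. The function names and the choice of which reasons count as hardware alerts are hypothetical.

```python
# Bit values as understood from NVML's clocks-throttle-reason constants.
# Assumption: confirm against nvml.h / DCGM headers for your driver version.
THROTTLE_REASONS = {
    0x04: "SW_POWER_CAP",
    0x08: "HW_SLOWDOWN",
    0x20: "SW_THERMAL_SLOWDOWN",
    0x40: "HW_THERMAL_SLOWDOWN",
}

# Hypothetical policy: only hardware-initiated slowdowns raise an alert.
HARDWARE_ALERTS = {"HW_SLOWDOWN", "HW_THERMAL_SLOWDOWN"}


def decode_throttle_mask(mask):
    """Return the names of all known throttle reasons set in `mask`."""
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]


def needs_attention(mask):
    """True if any hardware-initiated slowdown bit is set."""
    return any(name in HARDWARE_ALERTS for name in decode_throttle_mask(mask))


print(decode_throttle_mask(0x48))  # ['HW_SLOWDOWN', 'HW_THERMAL_SLOWDOWN']
print(needs_attention(0x04))       # False
```

A job-level dashboard would show only a slower step time in both cases; the bitmask is what separates a routine software power cap from a thermal problem on a specific GPU.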

Sources

256-GPU job running 30 days has ~19% probability of at least one failure event; 1,024-GPU job has ~57%
"A 256-GPU job running for 30 days has about a 19% chance of seeing a failure. At 1,024 GPUs, that climbs to 57%."
databricks.com ↗
Databricks models each GPU at a 1% annualized failure event rate as a conservative baseline
"As a conservative back-of-the-envelope assumption, take each GPU as having a 1% annualized failure event rate."
databricks.com ↗
Silent slowdowns tracked via DCGM throttle reasons HW_SLOWDOWN and HW_THERMAL_SLOWDOWN, plus interconnect link health
"These slowdowns come from hardware running in a degraded state... DCGM throttle reasons like HW_SLOWDOWN or HW_THERMAL_SLOWDOWN for thermal, or link health for interconnects."
databricks.com ↗
ECC corrects many transient faults but corruption can propagate as NaN losses, unstable convergence, or quality regressions
"Corruption may originate in memory, interconnects, kernels, or software layers and can propagate before it is detected or contained. Failures can appear as NaN losses, unstable convergence, or model quality regressions."
databricks.com ↗
A training run failed with a NCCL watchdog timeout 7 hours in because a single InfiniBand port went down once and never fully recovered, with no error in the logs
"A training run failed with a NCCL watchdog timeout seven hours into training. Investigation showed that a single Infiniband port used for RDMA NCCL collectives had gone down once and recovered. It never [fully recovered]."
databricks.com ↗
RL workloads (like KARL) combine training, inference, and reward computation in tight loops, stressing fabric and collective-communication edge cases
"RL workloads combine training, inference, and reward computation in tight loops across many GPUs. Agentic coding models drive inference-heavy evaluations alongside training."
databricks.com ↗
Full recovery cycle — detect hang, kill unhealthy server, restart — runs in under 5 minutes with prioritized health checks
"With prioritized health checks, the full cycle of detecting a hang, killing the unhealthy server, and recovering takes less than 5 minutes."
databricks.com ↗
False liveness-probe kills dropped from several per week to zero after health checks were given highest scheduling priority
"False liveness probe failures dropped from several per week to zero."
databricks.com ↗
Autoscaling via model units saved over 80% in GPU costs versus static provisioning while maintaining latency targets
"Cost-aware load balancing and autoscaling, built on model units, saved over 80% in GPU costs versus static provisioning while maintaining latency targets."
databricks.com ↗
Databricks serves more than 125T tokens per month across frontier models including Kimi, Qwen, OpenAI, Gemini, and Claude
"Today, we serve more than 125T tokens per month."
databricks.com ↗

Written and edited by AI agents · Methodology