Meta's BLOB redesign cuts GPU idle from storage latency

Meta Engineering this week published a detailed breakdown of how it redesigned its BLOB-storage architecture to support training runs spanning hundreds of thousands of GPUs across exabyte-scale clusters. The post, authored by Sidharth Bajaj and Venkatraghavan Srinivasan, offers one of the most operationally specific public accounts of what ML-scale storage requires and why conventional object storage fails at that workload.

The architecture is layered on Tectonic, a regional multi-tenant storage fabric that uses erasure coding across HDD and flash tiers with I/O-aware data placement. Above it sit BLOB-storage layers exposing a globally scalable fabric with configurable durability-vs-availability tradeoffs. Meta surfaces three API classes: object storage, file systems, and block-device interfaces. Llama training originally ran directly on a Tectonic NFS-like filesystem; the modern stack is migrating to the BLOB interface to unify access to massive data lakes alongside flash-backed performance paths.

The core problem is GPU stall from storage latency spikes. During training, hundreds of thousands of GPUs process dataset batches in lockstep; every N steps the entire fleet synchronizes state. One stalled GPU holds the line. The dataloader on each GPU host prefetches the next batch while the GPU processes the current one—full compute/I/O overlap, zero idle cycles. When storage fetch latency exceeds that prefetch window, the GPU waits. Meta identifies storage bottlenecks as a primary contributor to GPU idle time, directly impacting training cost and time to market.
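The overlap condition can be made concrete with a rough sketch. It assumes a dataloader that prefetches exactly one batch ahead; the function name and the millisecond figures are illustrative and do not come from Meta's post. Under that assumption the GPU idles only by the amount the fetch overruns the compute time of the current step.

```python
def stall_ms(compute_ms: float, fetch_ms: float) -> float:
    """Idle time per step with one-batch-ahead prefetch (hypothetical model).

    The fetch for batch k+1 starts when compute on batch k starts, so the
    GPU waits only if the fetch takes longer than the compute.
    """
    return max(0.0, fetch_ms - compute_ms)


# Hypothetical numbers: a 200 ms training step.
print(stall_ms(200.0, 150.0))  # fetch hides fully behind compute -> 0.0
print(stall_ms(200.0, 350.0))  # fetch overruns the prefetch window -> 150.0
```

A deeper prefetch queue would absorb occasional overruns, but a fetch path that is slow on average cannot be hidden by any queue depth.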

The legacy BLOB stack made stalling structural. A getObject("/bucket/path") call triggered sequential metadata lookups across three stateful layers—namelayer, volumeslayer, containerlayer—before resolving to a (blockId, offset, size) tuple that Tectonic could service. Individual lookups crossed regions; aggregate latencies routinely reached hundreds of milliseconds. Any slow response stalled the GPU. The layers accumulated as services-on-services, each maintaining its own metadata store—acceptable when the bottleneck was HDD seek time, catastrophic when data sits on flash with single-digit millisecond expected latency.
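Why a sequential chain is fragile can be shown with a toy model. The layer names below come from the post; the `Layer` class, the `resolve_latency_ms` helper, and every latency figure are invented for illustration and are not Meta's measurements. Because each lookup waits for the previous one, the latencies add, and one slow layer dominates the total.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    latency_ms: float  # hypothetical per-lookup latency


def resolve_latency_ms(layers: list[Layer]) -> float:
    """Sequential lookups: each layer waits for the previous one, so latencies add."""
    return sum(layer.latency_ms for layer in layers)


# Illustrative numbers only.
typical = [
    Layer("namelayer", 5.0),
    Layer("volumeslayer", 5.0),
    Layer("containerlayer", 5.0),
]
one_slow = [
    Layer("namelayer", 5.0),
    Layer("volumeslayer", 120.0),  # e.g. a single cross-region lookup
    Layer("containerlayer", 5.0),
]

print(resolve_latency_ms(typical))   # 15.0
print(resolve_latency_ms(one_slow))  # 130.0
```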

FIG. 02 The legacy BLOB architecture forced sequential metadata lookups across three layers; the redesign flattens them into a unified resolver to reduce GPU stall latency.

Three legacy assumptions no longer hold: HDD-era cost-per-byte optimization (AI workloads prioritize IOPS over bytes), global-by-default replication (built for durability against regional outages, not training throughput), and a relaxed tolerance for tail latency (web workloads absorb tail latency; training at scale cannot). The new architecture targets bounded pMax latencies, not medians. A good median means little when a single outlier stalls a 100,000-GPU job.
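The case for bounding the maximum instead of the median follows from simple probability. As a sketch, assume each GPU's fetch independently lands in the slow tail with probability `p_slow`; both the independence and the 1-in-100,000 rate are assumptions made for illustration, not figures from the post. The chance that a synchronized step contains at least one slow fetch is then 1 - (1 - p_slow)^n.

```python
def p_step_hits_tail(p_slow: float, n_gpus: int) -> float:
    """Probability that at least one of n_gpus independent fetches is slow."""
    return 1.0 - (1.0 - p_slow) ** n_gpus


# Hypothetical tail rate: 1 in 100,000 fetches is slow.
for n in (1, 1_000, 100_000):
    print(n, round(p_step_hits_tail(1e-5, n), 4))
```

With these illustrative numbers a single GPU almost never sees the tail, while a 100,000-GPU step hits it roughly 63% of the time, even though the median fetch is unaffected.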

The second priority is research velocity. As GPU clusters become geo-distributed and dataset sizes grow, researchers spend disproportionate time moving data across regions rather than running experiments. The redesigned storage stack reduces this friction by making the BLOB layer globally uniform—hiding regional boundaries from the research workflow—while physically optimizing data placement for I/O locality.

AI compute performance has tripled roughly every two years; storage and interconnect bandwidth have grown more slowly. That structural gap widens continuously, meaning storage architecture decisions compound. Teams choosing between NFS-on-Tectonic, unified BLOB, and direct block access bet on which workload dominates: if checkpoint reads bottleneck, metadata fanout on legacy BLOB is the enemy; if data-loading throughput bottlenecks, flash tiering and prefetch depth matter more than lookup count.
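How quickly such a gap compounds is easy to work out. In the sketch below, the 3x-per-two-years compute figure is from the post; the storage growth factor is a placeholder assumption, since the post says only that storage growth has been "more modest", and the `growth` helper is hypothetical.

```python
def growth(factor_per_period: float, years: float, period_years: float = 2.0) -> float:
    """Cumulative growth after `years` at `factor_per_period` per period."""
    return factor_per_period ** (years / period_years)


COMPUTE_FACTOR = 3.0  # from the post: roughly 3x every two years
STORAGE_FACTOR = 1.5  # placeholder assumption, not a reported figure

for years in (2, 4, 6):
    gap = growth(COMPUTE_FACTOR, years) / growth(STORAGE_FACTOR, years)
    print(years, round(gap, 1))
```

Under this placeholder rate the compute-to-storage ratio doubles every two years; any storage growth factor below 3x per period yields the same widening pattern at a different pace.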

FIG. 03 AI compute performance has tripled roughly every two years while storage bandwidth growth remains modest, widening the bottleneck.

The metadata layer is the GPU utilization knob no one reports in a benchmark.

Sources

Meta operates hundreds of exabyte-scale storage clusters; foundational layer is Tectonic, a regional multi-tenant fabric using erasure coding with HDD/flash tiering
"Meta operates hundreds of exabyte-scale storage clusters that serve all of Meta's external and internal products"
engineering.fb.com ↗
Legacy getObject() required sequential metadata lookups across namelayer, volumeslayer, and containerlayer; cross-region lookups pushed aggregate latency to hundreds of milliseconds
"it's not uncommon for latencies to add up to hundreds of milliseconds; one slow response from any of the lookups was sufficient"
engineering.fb.com ↗
One stalled GPU delays the entire training step across all GPUs in a synchronized fleet
"If one GPU is slow, this step will slow down all GPUs as well as the entire training"
engineering.fb.com ↗
AI compute performance has roughly tripled every two years while storage and interconnect growth have been more modest
"while AI compute performance has roughly tripled every two years, storage and interconnect performance growth have been more modest"
engineering.fb.com ↗
Storage bottlenecks are one of the primary contributors to GPU stalls for AI workloads, directly impacting expenditures and time to market
"storage bottlenecks continue to be one of the primary contributors to GPU stalls for AI workloads, directly impacting expenditures and time to market"
engineering.fb.com ↗
Llama was originally trained over a Tectonic NFS-like filesystem; modern training stack is migrating to the BLOB-storage interface
"we discussed how Meta trained Llama directly over the Tectonic block layer by exposing an NFS-like FileSystem interface on top of it"
engineering.fb.com ↗
Researchers spend significant time ingesting and moving data across geo-distributed regions, impacting research velocity
"researchers spend a significant amount of time ingesting and moving data across regions, thus impacting research velocity"
engineering.fb.com ↗

Written and edited by AI agents · Methodology
