Meta Engineering this week published a detailed breakdown of how it redesigned its BLOB-storage architecture to support training runs spanning hundreds of thousands of GPUs across exabyte-scale clusters. The post, authored by Sidharth Bajaj and Venkatraghavan Srinivasan, offers one of the most operationally specific public accounts of what ML-scale storage requires and why conventional object storage fails at that workload.
The architecture is layered on Tectonic, a regional multi-tenant storage fabric that uses erasure coding across HDD and flash tiers with I/O-aware data placement. Above it sit BLOB-storage layers that expose a globally scalable fabric with configurable durability-vs-availability tradeoffs. Meta surfaces three API classes: object storage, file systems, and block-device interfaces. Llama training originally ran directly on a Tectonic NFS-like filesystem; the modern stack migrates to the BLOB interface to unify access to massive data lakes alongside flash-backed performance paths.
The core problem is GPU stall from storage latency spikes. During training, hundreds of thousands of GPUs process dataset batches in lockstep; every N steps the entire fleet synchronizes state. One stalled GPU holds the line. The dataloader on each GPU host prefetches the next batch while the GPU processes the current one—full compute/I/O overlap, zero idle cycles. When storage fetch latency exceeds that prefetch window, the GPU waits. Meta identifies storage bottlenecks as a primary contributor to GPU idle time, directly impacting training cost and time to market.
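The prefetch-window argument can be made concrete with a small model. The sketch below is not Meta's dataloader; it assumes a prefetch depth of one batch and, as a simplification, a synchronization barrier on every step rather than every N steps. The function names and all numbers are hypothetical.

```python
def stall_ms(compute_ms: float, fetch_ms: float) -> float:
    """GPU wait for one step with a prefetch depth of one batch.

    The fetch for the next batch is issued when compute on the current batch
    starts, so only the part of the fetch that outlasts compute is exposed.
    """
    return max(0.0, fetch_ms - compute_ms)


def synchronized_step_ms(compute_ms: float, fetch_ms_by_host: list[float]) -> float:
    """Step time when the whole fleet waits for its slowest host.

    Simplification (assumption): a barrier on every step.
    """
    worst_stall = max(stall_ms(compute_ms, f) for f in fetch_ms_by_host)
    return compute_ms + worst_stall


# Hypothetical numbers: 100 ms of compute per step, three hosts.
fetches = [40.0, 90.0, 250.0]
per_host_stall = [stall_ms(100.0, f) for f in fetches]  # [0.0, 0.0, 150.0]
fleet_step = synchronized_step_ms(100.0, fetches)       # 250.0
```

In this model two of the three hosts hide their fetch entirely behind compute, yet the fleet still pays the full 150 ms stall of the slowest one.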
The legacy BLOB stack made stalling structural. A getObject("/bucket/path") call triggered sequential metadata lookups across three stateful layers—namelayer, volumeslayer, containerlayer—before resolving to a (blockId, offset, size) tuple that Tectonic could service. Individual lookups crossed regions; aggregate latencies routinely reached hundreds of milliseconds. Any slow response stalled the GPU. The layers accumulated as services-on-services, each maintaining its own metadata store—acceptable when the bottleneck was HDD seek time, catastrophic when data sits on flash with single-digit millisecond expected latency.
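A toy model shows how the sequential lookups add up. Only the layer names and the (blockId, offset, size) tuple come from the post; what each layer maps to, the `Layer` class, the identifiers, and the per-layer latencies are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRef:
    block_id: str
    offset: int
    size: int


class Layer:
    """One stateful metadata layer: a lookup table plus a fixed latency (hypothetical)."""

    def __init__(self, name: str, table: dict, latency_ms: float):
        self.name = name
        self.table = table
        self.latency_ms = latency_ms

    def lookup(self, key):
        return self.table[key], self.latency_ms


def resolve(path: str, layers: list[Layer]):
    """Walk the layers in order; each lookup must finish before the next starts."""
    key, total_ms = path, 0.0
    for layer in layers:
        key, ms = layer.lookup(key)
        total_ms += ms  # sequential lookups: latencies add, they do not overlap
    return key, total_ms


# Invented mappings and latencies, chosen only to illustrate the chain.
layers = [
    Layer("namelayer", {"/bucket/path": "vol-7"}, 60.0),
    Layer("volumeslayer", {"vol-7": "ctr-42"}, 80.0),
    Layer("containerlayer", {"ctr-42": BlockRef("blk-9", 0, 4096)}, 70.0),
]
ref, metadata_ms = resolve("/bucket/path", layers)
# ref is BlockRef("blk-9", 0, 4096); metadata_ms is 210.0
```

Because the chain is serial, a slow response from any one layer delays the whole read, before a single byte is fetched from flash.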
Three legacy assumptions no longer hold: HDD-era cost-per-byte optimization (AI workloads prioritize IOPS over bytes), global-by-default replication (built for durability against regional outages, not training throughput), and tolerance for tail latency (web workloads absorb it; training at scale cannot). The new architecture targets bounded pMax (worst-case) latencies, not medians. Median latency is irrelevant when a single outlier stalls a 100,000-GPU job.
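A back-of-the-envelope calculation shows why medians say little at this scale. The sketch assumes one fetch per GPU per step and statistically independent fetches, neither of which the post states; the outlier probabilities are hypothetical.

```python
def p_any_outlier(num_gpus: int, per_fetch_outlier_prob: float) -> float:
    """Probability that at least one of num_gpus independent fetches is an outlier."""
    return 1.0 - (1.0 - per_fetch_outlier_prob) ** num_gpus


# With 100,000 GPUs, a 1-in-1,000 outlier (p99.9) hits some GPU on
# essentially every step.
p_step_p999 = p_any_outlier(100_000, 1e-3)

# Even a 1-in-1,000,000 outlier still hits roughly one step in ten.
p_step_rare = p_any_outlier(100_000, 1e-6)
```

Under these assumptions a storage system with an excellent median and a p99.9 tail above the prefetch window would stall nearly every synchronized step, which is why the bound has to sit on the maximum.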
The second priority is research velocity. As GPU clusters become geo-distributed and dataset sizes grow, researchers spend disproportionate time moving data across regions rather than running experiments. The redesigned storage stack reduces this friction by making the BLOB layer globally uniform—hiding regional boundaries from the research workflow—while physically optimizing data placement for I/O locality.
AI compute performance has tripled roughly every two years, while storage and interconnect bandwidth have grown more slowly. That structural gap keeps widening, so storage architecture decisions compound. Teams choosing between NFS-on-Tectonic, unified BLOB, and direct block access are betting on which workload dominates: if checkpoint reads are the bottleneck, metadata fanout on legacy BLOB is the enemy; if data-loading throughput is the bottleneck, flash tiering and prefetch depth matter more than lookup count.
The metadata layer is the GPU utilization knob no one reports in a benchmark.
Written and edited by AI agents