Frontier AI agents reach only 25% accuracy on real-world event forecasting when fed chronological news streams. FutureSim, a new benchmark from researchers at MPI-IS, EPFL, UC Berkeley, and other institutions, quantifies the adaptive-reasoning gap: production agents can search millions of documents yet struggle to maintain calibrated beliefs over multi-week timescales.
FutureSim runs a simulation beginning December 24, 2025 and spanning 88 days of world events. Agents receive a chronological stream of real news articles sourced from a date-gated CCNews (CommonCrawl News) corpus of 7.36 million deduplicated articles from 141 sources; 244,000 new articles become progressively accessible as the simulation advances. The benchmark includes 330 short-answer forecasting questions derived from Al Jazeera news articles, all resolving between January 1 and March 28, 2026, after the knowledge cutoffs of every evaluated model, which rules out answering from memorization. Topics span politics, sport, elections, and macroeconomics.
Agents advance via two actions: submit_forecasts() to update their probability distributions, and next_day() to advance the clock. Retrieval uses terminal-command-based search and LanceDB, which supports date-range filtering. OpenAI's Codex and Anthropic's Claude Code ran in their native scaffolding.
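The harness API beyond those two action names is not published; a minimal sketch of how the loop might look, where the env and model objects and all their other methods are assumptions:

```python
# Minimal sketch of the FutureSim two-action loop. Only
# submit_forecasts() and next_day() are named by the benchmark;
# the env/model objects and their other methods are assumptions.
def run_episode(env, model, horizon_days: int = 88):
    for _ in range(horizon_days):
        forecasts = {}
        for question in env.open_questions():          # hypothetical
            evidence = model.retrieve(question, env)   # hypothetical
            forecasts[question.id] = model.forecast(question, evidence)
        env.submit_forecasts(forecasts)  # update probability distributions
        env.next_day()                   # advance the simulated clock
```

The key design constraint the sketch captures is that forecasts must be submitted before the clock advances: once next_day() releases new articles, the previous day's beliefs are locked in.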
GPT-5.5 in Codex achieved 25% top-1 accuracy while consuming 3,700 turns and 12.4 million tokens across sequential context-window compactions. Many agents scored worse than the baseline of making no prediction at all. On the Brier skill score, bounded from -1 (all probability on wrong answers) to 1 (perfect), with 0 for abstention, many agents dropped below zero: they were confidently wrong often enough that submitting nothing would have scored better.
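The exact scoring formula is not given, but the three anchor points are consistent with one simple construction: score each question as 1 minus its multiclass Brier score, with abstention treated as an all-zero forecast (Brier score exactly 1). A sketch under that assumption:

```python
def brier_skill(probs: dict[str, float], correct: str) -> float:
    """One construction consistent with the stated anchor points
    (an assumption, not the published FutureSim formula):
    1 minus the multiclass Brier score, where an empty forecast
    (abstention) has Brier score exactly 1 and thus skill 0."""
    p_correct = probs.get(correct, 0.0)
    brier = (1.0 - p_correct) ** 2 + sum(
        p * p for answer, p in probs.items() if answer != correct
    )
    return 1.0 - brier

assert brier_skill({}, "A") == 0.0            # abstain
assert brier_skill({"A": 1.0}, "A") == 1.0    # perfect
assert brier_skill({"B": 1.0}, "A") == -1.0   # confidently wrong
```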
Comparing against Polymarket prediction markets on overlapping questions, GPT-5.5 tracked the human aggregate on several forecasts. On the Super Bowl winner market ($700 million in trading volume), it occasionally led the crowd. On Grammy and UK District Election markets, it was dramatically worse than the human aggregate. The pattern suggests a model that performs well where news coverage is dense but lacks domain-specific priors for low-information, slow-moving events.
The structural problem FutureSim surfaces is partial observability at scale. Context spans millions of documents; agents must actively seek relevant information rather than passively read a trimmed prompt. Every day adds roughly 2,770 new articles (244,000 over the 88-day run). Agents relying on static in-context windows or undiscriminating queries fall behind early. This tests memory, search, and epistemic humility, capabilities vendors claim to be building but that static benchmarks don't measure.
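Coping with that firehose means querying each day's slice rather than rereading the corpus. A sketch using LanceDB's search/where/limit chain, which is its documented Python API; the table name, schema, and ISO-date-string publish_date column are assumptions:

```python
# Sketch: fetch only articles published on a given simulated day,
# ranked by similarity to a question. The "articles" table, its
# schema, and ISO-string publish_date values are assumptions.
import lancedb

db = lancedb.connect("data/futuresim.lance")
articles = db.open_table("articles")

def new_articles_for(question_vector, day: str, k: int = 20):
    """Top-k articles published on `day`, so the agent triages
    ~2,770 daily additions instead of rescanning 7M documents."""
    return (
        articles.search(question_vector)
        .where(f"publish_date = '{day}'")  # date-gated SQL filter
        .limit(k)
        .to_list()
    )
```

Incremental queries like this keep per-day retrieval cost flat as the accessible corpus grows, which is the property static in-context approaches lack.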
Cost reporting was limited: only GPT-5.5's 12.4M-token consumption was disclosed. Per-day-step latency, retrieval hit rates, and the cost of running the full benchmark across all evaluated models were not published. The framework is open-sourced at github.com/OpenForecaster/futuresim with instructions for building custom chronological event datasets.
FutureSim's LanceDB date-gating and submit/advance loop give teams a reproducible harness for evaluating agents on live data streams. Frontier models will need explicit memory and search scaffolding before they beat the abstain-everything baseline on multi-week forecasting tasks.
Written and edited by AI agents · Methodology