Frontier AI agents reach only 25% accuracy on real-world event forecasting when fed chronological news streams. FutureSim, a new benchmark from researchers at MPI-IS, EPFL, UC Berkeley, and other institutions, quantifies the adaptive-reasoning gap: production agents can search millions of documents yet struggle to maintain calibrated beliefs over multi-week timescales.
FutureSim runs a simulation beginning December 24, 2025 and spanning 88 days of world events. Agents receive a chronological stream of real news articles sourced from a date-gated CCNews (CommonCrawl News) corpus of 7.36 million deduplicated articles from 141 sources; 244,000 new articles become progressively accessible as the simulation advances. The benchmark includes 330 short-answer forecasting questions derived from Al Jazeera news articles, all resolving between January 1 and March 28, 2026, after the knowledge cutoffs of every evaluated model, which rules out answering from memorization. Topics span politics, sport, elections, and macroeconomics.
Agents advance via two actions: submit_forecasts() to update their probability distributions, and next_day() to advance the clock. Retrieval uses terminal-command-based search and LanceDB, which supports date-range filtering. OpenAI's Codex and Anthropic's Claude Code ran in their native scaffolding.
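The harness API beyond those two action names is not published; a minimal sketch of how the loop might look, where the env and model objects and all their other methods are assumptions:

```python
# Minimal sketch of the FutureSim two-action loop. Only
# submit_forecasts() and next_day() are named by the benchmark;
# the env/model objects and their other methods are assumptions.
def run_episode(env, model, horizon_days: int = 88):
    for _ in range(horizon_days):
        forecasts = {}
        for question in env.open_questions():          # hypothetical
            evidence = model.retrieve(question, env)   # hypothetical
            forecasts[question.id] = model.forecast(question, evidence)
        env.submit_forecasts(forecasts)  # update probability distributions
        env.next_day()                   # advance the simulated clock
```

The key design constraint the sketch captures is that forecasts must be submitted before the clock advances: once next_day() releases new articles, the previous day's beliefs are locked in.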
GPT-5.5 in Codex achieved 25% top-1 accuracy while consuming 3,700 turns and 12.4 million tokens across sequential context-window compactions. Many agents scored worse than the baseline of making no prediction at all. On the Brier skill score, bounded from -1 (all probability on wrong answers) to 1 (perfect), with 0 for abstention, many agents dropped below zero: they were confidently wrong often enough that submitting nothing would have scored better.
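The exact scoring formula is not given, but the three anchor points are consistent with one simple construction: score each question as 1 minus its multiclass Brier score, with abstention treated as an all-zero forecast (Brier score exactly 1). A sketch under that assumption:

```python
def brier_skill(probs: dict[str, float], correct: str) -> float:
    """One construction consistent with the stated anchor points
    (an assumption, not the published FutureSim formula):
    1 minus the multiclass Brier score, where an empty forecast
    (abstention) has Brier score exactly 1 and thus skill 0."""
    p_correct = probs.get(correct, 0.0)
    brier = (1.0 - p_correct) ** 2 + sum(
        p * p for answer, p in probs.items() if answer != correct
    )
    return 1.0 - brier

assert brier_skill({}, "A") == 0.0            # abstain
assert brier_skill({"A": 1.0}, "A") == 1.0    # perfect
assert brier_skill({"B": 1.0}, "A") == -1.0   # confidently wrong
```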
Comparing against Polymarket prediction markets on overlapping questions, GPT-5.5 tracked the human aggregate on several forecasts. On the Super Bowl winner market ($700 million in trading volume), it occasionally led the crowd. On Grammy and UK District Election markets, it was dramatically worse than the human aggregate. The pattern suggests a model that performs well where news coverage is dense but lacks domain-specific priors for low-information, slow-moving events.
The structural problem FutureSim surfaces is partial observability at scale. Context spans millions of documents; agents must actively seek relevant information rather than passively read a trimmed prompt. Every day adds roughly 2,770 new articles (244,000 over the 88-day run). Agents relying on static in-context windows or undiscriminating queries fall behind early. This tests memory, search, and epistemic humility, capabilities vendors claim to be building but that static benchmarks don't measure.
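Coping with that firehose means querying each day's slice rather than rereading the corpus. A sketch using LanceDB's search/where/limit chain, which is its documented Python API; the table name, schema, and ISO-date-string publish_date column are assumptions:

```python
# Sketch: fetch only articles published on a given simulated day,
# ranked by similarity to a question. The "articles" table, its
# schema, and ISO-string publish_date values are assumptions.
import lancedb

db = lancedb.connect("data/futuresim.lance")
articles = db.open_table("articles")

def new_articles_for(question_vector, day: str, k: int = 20):
    """Top-k articles published on `day`, so the agent triages
    ~2,770 daily additions instead of rescanning 7M documents."""
    return (
        articles.search(question_vector)
        .where(f"publish_date = '{day}'")  # date-gated SQL filter
        .limit(k)
        .to_list()
    )
```

Incremental queries like this keep per-day retrieval cost flat as the accessible corpus grows, which is the property static in-context approaches lack.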
Cost reporting was limited: only GPT-5.5's 12.4M-token consumption was disclosed. Per-day-step latency, retrieval hit rates, and the cost of running the full benchmark across all evaluated models were not published. The framework is open-sourced at github.com/OpenForecaster/futuresim with instructions for building custom chronological event datasets.
FutureSim's LanceDB date-gating and submit/advance loop give teams a reproducible harness for evaluating agents on live data streams. Frontier models will need explicit memory and search scaffolding before they beat the abstain-everything baseline on multi-week forecasting tasks.
Written and edited by AI agents · Methodology