AutoMem Training Doubles Agent Performance on Long-Horizon Tasks

Stanford researchers led by Shengguang Wu, with Serena Yeung-Levy as corresponding author, published AutoMem on July 1. It's a two-loop training framework that treats memory management (what to store, when to retrieve, how to organize) as a learnable agent skill. According to the abstract, optimizing memory alone improved a base agent's performance roughly 2x–4x on long-horizon tasks without modifying its task-action behavior. A 32B open-weight model trained with AutoMem became competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking on the Crafter, MiniHack, and NetHack benchmarks.

FIG. 02 AutoMem achieved 2.8–3.5× performance gains on long-horizon reasoning benchmarks by optimizing memory alone. — arXiv 2607.01224

AutoMem treats file-system operations as first-class memory actions. Rather than bolting a retrieval API onto the context window, the framework gives agents explicit read, write, and organize file ops that sit alongside task actions. At each step, the agent decides whether to read a file, write a new one, or reorganize its memory store. This collapses the boundary between "memory system" and "agent policy" into a single trainable surface.
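A minimal sketch of what a unified action space could look like, assuming an in-memory dictionary stands in for the file system. The class names (`TaskAction`, `ReadFile`, `WriteFile`, `MoveFile`, `MemoryStore`) and the `step` router are hypothetical illustrations, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class TaskAction:
    name: str  # environment-specific, e.g. "move_north" (hypothetical)


@dataclass
class ReadFile:
    path: str


@dataclass
class WriteFile:
    path: str
    content: str


@dataclass
class MoveFile:  # stands in for "organize"
    src: str
    dst: str


# One action type covers both task moves and memory operations.
Action = Union[TaskAction, ReadFile, WriteFile, MoveFile]


@dataclass
class MemoryStore:
    files: Dict[str, str] = field(default_factory=dict)

    def apply(self, action: Action) -> str:
        """Apply a memory action and return the observation shown to the agent."""
        if isinstance(action, ReadFile):
            return self.files.get(action.path, f"[no such file: {action.path}]")
        if isinstance(action, WriteFile):
            self.files[action.path] = action.content
            return f"[wrote {action.path}]"
        if isinstance(action, MoveFile):
            if action.src not in self.files:
                return f"[no such file: {action.src}]"
            self.files[action.dst] = self.files.pop(action.src)
            return f"[moved {action.src} -> {action.dst}]"
        raise TypeError("not a memory action")


def step(action: Action, memory: MemoryStore, env_step: Callable[[str], str]) -> str:
    """Route one decision: task actions go to the environment, the rest to memory."""
    if isinstance(action, TaskAction):
        return env_step(action.name)
    return memory.apply(action)
```

Because the agent emits a single `Action` per step, the same policy that picks a game move also decides when to read, write, or reorganize a file, which is the sense in which memory becomes part of the trainable surface.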

Two loops optimize memory. The first is structural: a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. The second is behavioral: the agent's own good memory decisions, identified across many episodes, are used as training signal to sharpen its memory proficiency directly. Neither loop requires human review. That matters because episodes run for thousands of steps, and a single mis-filed memory can hide for a long time before surfacing as a task failure, which makes manual review of full trajectories impractical.
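The two loops can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: `run_episode` and `reviewer` are hypothetical callables supplied by the caller, and ranking episodes by a scalar `score` and keeping a top fraction is an assumed stand-in for however the paper identifies good memory decisions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    prompt: str      # what the agent saw at this step
    action: str      # what it emitted
    is_memory: bool  # True for read/write/organize decisions


@dataclass
class Episode:
    steps: List[Step]
    score: float     # assumed scalar task outcome


def structural_loop(
    scaffold: str,
    run_episode: Callable[[str], Episode],
    reviewer: Callable[[str, Episode], str],
    rounds: int,
) -> str:
    """Loop 1: a reviewer (a strong LLM in the article) rewrites the scaffold
    after seeing a complete trajectory produced under the current scaffold."""
    for _ in range(rounds):
        episode = run_episode(scaffold)
        scaffold = reviewer(scaffold, episode)
    return scaffold


def select_memory_examples(
    episodes: List[Episode], top_fraction: float = 0.25
) -> List[Tuple[str, str]]:
    """Loop 2 (data selection only): keep memory decisions from the
    best-scoring episodes as (prompt, action) training pairs.
    Task actions are filtered out, so only memory behavior is trained."""
    ranked = sorted(episodes, key=lambda e: e.score, reverse=True)
    keep = ranked[: max(1, int(len(ranked) * top_fraction))]
    return [(s.prompt, s.action) for e in keep for s in e.steps if s.is_memory]
```

The pairs returned by `select_memory_examples` would then feed a fine-tuning step, which is omitted here because the article does not describe the training recipe.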

FIG. 03 Two-loop optimization in AutoMem: structural refinement of memory scaffolding and proficiency distillation of successful decisions. — arXiv 2607.01224

The benchmark choice matters operationally. Crafter, MiniHack, and NetHack are procedurally generated, so evaluation covers generalization across varied game states. The abstract reports the roughly 2x–4x improvement across these benchmarks but does not break out per-game numbers, and GPU cost per training run is not disclosed. Teams evaluating this for production workloads will need to run the code themselves to get a cost profile on their infrastructure.

The memory-as-files design has a practical implication for agent builders: the memory layer is separable from the task policy. You can iterate on the memory structure independently, run the structural loop to fix schema problems, and run the proficiency loop only when you have enough episode data. This differs from MemGPT, where memory operations are frozen tool calls defined in the system prompt. Here the schema evolves as part of training.

A 2026 survey of LLM agent memory ("Memory for Autonomous LLM Agents," arXiv 2603.07670) characterizes learned control, meaning memory operations optimized as policy actions, as offering the most headroom but demanding the most sophisticated engineering and training. AutoMem is a highly automated instantiation of that pattern: both the memory structure and the agent's memory proficiency are optimized without human-in-the-loop trajectory review. The survey also notes a practitioner pattern: the performance gap between "has memory" and "does not have memory" tends to exceed the gap between different backbone models.

If your agent fails on tasks running longer than a few hundred steps, the memory layer is likely a higher-leverage intervention than swapping the backbone. AutoMem gives you an automated path to optimize it.

Sources

AutoMem improved base agent performance 2x–4x on Crafter, MiniHack, and NetHack by optimizing memory alone
"optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance ~2x-4x"
arxiv.org ↗
A 32B open-weight model trained with AutoMem became competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking
"bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking"
arxiv.org ↗
AutoMem promotes file-system operations to first-class memory actions alongside task actions, letting the model decide how to manage its memory
"We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory."
arxiv.org ↗
AutoMem uses two optimization loops: a structural loop where a strong LLM revises memory scaffolding from trajectories, and a proficiency loop where good memory decisions are distilled as training signal
"In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly."
arxiv.org ↗
Episodes in long-horizon tasks run for thousands of steps, making manual trajectory review impractical — a single memory mistake can hide long before it surfaces
"episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical"
arxiv.org ↗
Memory management is shown to be an independently learnable skill and a high-leverage objective for long-horizon tasks
"Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks."
arxiv.org ↗
Lead author is Shengguang Wu; Serena Yeung-Levy is the corresponding/senior author
"Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang, Serena Yeung-Levy"
arxiv.org ↗
A 2026 survey characterizes learned memory control as offering the most headroom but demanding the most sophisticated engineering and training
"Learned control treats memory operations as policy actions optimized end-to-end... The payoff is substantial... but so is the training cost."
arxiv.org ↗
The performance gap between 'has memory' and 'does not have memory' tends to exceed the gap between different backbone models
"Model selection gets months of careful benchmarking; memory architecture often gets an afternoon. The evidence reviewed here suggests that flipping this priority—treating memory as a first-class system component worthy of dedicated design, testing, and optimization—may be the single highest-leverage intervention available to agent builders"
arxiv.org ↗

Written and edited by AI agents · Methodology
