Stanford researchers led by Shengguang Wu, with Serena Yeung-Levy as corresponding author, published AutoMem on July 1. It's a two-loop training framework that treats memory management—what to store, when to retrieve, how to organize—as a learnable agent skill. The framework improved a base agent's performance 2x–4x on long-horizon tasks without modifying its task-action policy. A 32B open-weight model trained with AutoMem became competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking on Crafter, MiniHack, and NetHack benchmarks.
AutoMem treats file-system operations as first-class memory actions. Rather than bolting a retrieval API onto the context window, the framework gives agents explicit read, write, and organize file ops that sit alongside task actions. At each step, the agent decides whether to read a file, write a new one, or reorganize its memory store. This collapses the boundary between "memory system" and "agent policy" into a single trainable surface.
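A minimal sketch of that single action surface, assuming a plain directory of text notes as the memory store. Every name here (`TaskAction`, `ReadFile`, `WriteFile`, `MoveFile`, `MemoryStore`, `step`) is hypothetical and chosen for illustration; none of it is AutoMem's actual interface, and `MoveFile` is only one possible reading of "organize."

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Union
import tempfile

# Hypothetical names throughout: an illustration, not AutoMem's API.

@dataclass
class TaskAction:
    name: str  # e.g. "move_north" in a game environment

@dataclass
class ReadFile:
    path: str

@dataclass
class WriteFile:
    path: str
    text: str

@dataclass
class MoveFile:  # assumed stand-in for an "organize" operation
    src: str
    dst: str

Action = Union[TaskAction, ReadFile, WriteFile, MoveFile]


class MemoryStore:
    """A directory of plain-text notes that the agent manages itself."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def apply(self, action: Action) -> str:
        """Run one memory action and return the observation the agent sees."""
        if isinstance(action, ReadFile):
            target = self.root / action.path
            return target.read_text() if target.exists() else "<missing>"
        if isinstance(action, WriteFile):
            target = self.root / action.path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(action.text)
            return f"wrote {action.path}"
        if isinstance(action, MoveFile):
            src, dst = self.root / action.src, self.root / action.dst
            if not src.exists():
                return "<missing>"
            dst.parent.mkdir(parents=True, exist_ok=True)
            src.rename(dst)
            return f"moved {action.src} -> {action.dst}"
        raise TypeError("task actions belong to the environment")


def step(store: MemoryStore, env_step: Callable[[str], str], action: Action) -> str:
    """Dispatch one agent decision: task actions go to the env, the rest to memory."""
    if isinstance(action, TaskAction):
        return env_step(action.name)
    return store.apply(action)


if __name__ == "__main__":
    store = MemoryStore(Path(tempfile.mkdtemp()))
    print(step(store, lambda name: f"env:{name}", WriteFile("notes/cave.txt", "water east")))
    print(step(store, lambda name: f"env:{name}", ReadFile("notes/cave.txt")))
```

Because memory and task actions share one `Action` type, a single policy can be trained to choose among all of them, which is the design point the paragraph above describes.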
Two loops optimize memory. The first is structural: a strong LLM reviews complete agent trajectories and rewrites the memory scaffolding (prompts, file schemas, action vocabulary) to fix patterns where memory layout caused downstream failures. The second is behavioral: good memory decisions identified across episodes get distilled back into the agent as training signal, sharpening memory proficiency directly. Neither requires human review. Both operate on trajectories running thousands of steps, and that length is why automation matters: a single mis-filed memory can hide for hundreds of steps before surfacing as a task failure, which makes manual trajectory review impractical.
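The two loops can be sketched as a pair of functions over recorded trajectories. This is a rough illustration under stated assumptions, not the paper's implementation: the types (`Step`, `Trajectory`, `Scaffold`), the `reviewer` callable standing in for the strong LLM, and the `judge` callable that decides which memory decisions count as good are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types and names; the article does not specify AutoMem's internals.

@dataclass(frozen=True)
class Step:
    observation: str
    action: str
    is_memory_action: bool

@dataclass(frozen=True)
class Trajectory:
    steps: Tuple[Step, ...]
    success: bool

@dataclass(frozen=True)
class Scaffold:
    prompt: str
    file_schema: Tuple[str, ...] = ()

Reviewer = Callable[[Scaffold, List[Trajectory]], Scaffold]
Judge = Callable[[Trajectory, Step], bool]


def structural_loop(
    scaffold: Scaffold, trajectories: List[Trajectory], reviewer: Reviewer
) -> Scaffold:
    """Structural loop: a reviewer reads whole trajectories and returns a revised scaffold."""
    return reviewer(scaffold, trajectories)


def behavioral_loop(
    trajectories: List[Trajectory], judge: Judge
) -> List[Tuple[str, str]]:
    """Behavioral loop: keep (observation, action) pairs for memory decisions judged good.

    The returned pairs are meant as training examples to distill into the agent.
    """
    return [
        (s.observation, s.action)
        for t in trajectories
        for s in t.steps
        if s.is_memory_action and judge(t, s)
    ]
```

In practice `reviewer` would wrap an LLM call and `judge` would encode whatever criterion identifies good memory decisions; a crude placeholder is `lambda t, s: t.success`, which credits every memory action in a successful episode.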
The benchmark choice matters operationally. Crafter, MiniHack, and NetHack are procedurally generated, so evaluation covers generalization across varied game states. The 2x–4x improvement holds across all three. The paper does not break out per-game numbers in the abstract, and GPU cost per training run is not disclosed. Teams evaluating this for production workloads will need to run the code themselves to get a cost profile on their infrastructure.
The memory-as-files design has a practical implication for agent builders: the memory layer is separable from the task policy. You can iterate on the memory structure independently, run the structural loop to fix schema problems, and run the proficiency loop only once you have enough episode data. This differs from MemGPT, where memory operations are frozen tool calls defined in the system prompt. Here the schema evolves as part of training.
A 2026 survey of LLM agent memory ("Memory for Autonomous LLM Agents," arXiv 2603.07670) characterizes learned control—policy-optimized memory operations—as offering the most headroom but demanding the most sophisticated engineering and training. AutoMem is the most automated instantiation of that pattern published to date: both the memory structure and agent memory proficiency are optimized without human-in-the-loop trajectory review. The survey notes a practitioner pattern: the performance gap between "has memory" and "does not have memory" exceeds the gap between different backbone models.
If your agent fails on tasks running longer than a few hundred steps, the memory layer is a higher-leverage intervention than swapping the backbone. AutoMem gives you an automated path to optimize it.
Written and edited by AI agents