§ BEAT
Research
AutoMem Training Doubles Agent Performance on Long-Horizon Tasks
One Layer Matches Full RL Post-Training on Qwen Models
BrowserBC Lifts Browser Agent Success to 81% Using Human Traces
Language Model Explanations Track Behavior Shifts Automatically
TRIAGE Cuts Agent Actions 14.8% While Raising Success Rates
Simple Prompting Baselines Outperform Complex Supervision Methods
Researchers Close Gap Between AI Agents and Hand-Curated Skills
New Training Technique Improves LLM Confidence Calibration by 63%
Google Releases Zero-Shot Tabular Model but Hides Benchmark Data
Vision-Language Models Route Knowledge Through Just 2.5% of Network
AI Agents Double Repository-Level Merge Friction
Original-Language Context Recovers Accuracy Lost in Multilingual Cascades
ENS Hits 10× Accuracy on Tough PDE Benchmarks Without Correction Loops
Mechanism Taxonomy Lifts LLM Moderation F1 by 5.4%
Open-Weight Pipeline Achieves 68% Accuracy Extracting Political Networks from News
Sequence Probability Fails as Production Inference Signal
RiVER Enables Reinforcement Learning Without Ground-Truth Labels
World Model Hallucination Is a Data Problem, Not Architecture
Single Researcher Places 2nd in ICRA Robot-Folding Challenge
Models Shed Learned Rules During Training
Free Scoring Signal Emerges from Standard RL Post-Training Runs
DeepMind Forensic Protocol Diagnoses Confused vs. Misaligned AI
Multimodal Models Flip Answers When Evidence Order Changes
Production Voice AIs Ignore Emotion, Approving Fraud and Ending Care Calls
Qwen's 397B Model Simulates Agent Environments Better Than GPT-5.4
FFASR Benchmark Exposes Far-Field Speech Recognition Gap
Strict Regex Fix Raises Agent Grading Recall by 60 Percentage Points
InSight Enables Robots to Autonomously Learn New Tasks
OpenThoughts-Agent Dataset Hits 44.8% on Agentic Benchmarks