§ BEAT
Research
Microsoft Finds GPT-5 Fails Against Implausible Attacks
Scientific ML Models Disagree on 16% of Predictions Despite Matching Accuracy
LLM Formalization Catches 18.8% of Ambiguous Requirements in Safety Specs
TFlow Cuts Multi-Agent Inference Tokens 83% via Weight Injection
Negation Neglect Drives False Belief Rate to 88.6% in Fine-Tuned LLMs
Why Production Agents Fail Without Harness Infrastructure
Berkeley Framework Cuts Agent Latency 1.3–2.2×
KV-Fold Extends Transformer Context to 128K Without Retraining
IBM Boosts Zero-Shot Search Accuracy 25% With LLM Query Refinement
27M-Parameter Attractor Model Beats OpenAI o3 on Logic Puzzles
Reward Hacking Goes Undetected in Single-Verifier Training
Sparse-to-Dense RL Lifts MATH Scores to 78.5% on Small Models
Standard Load-Balancing Losses Degrade SMoE Expert Specialization 3×
VECA Cuts Vision Transformer Inference Cost to Linear Time
MEME Benchmark Finds 97% Failure on Agent Memory Dependency Tasks
RuDE Predicts Fine-Tuning Success Without Training
Google's RubricEM Trains Research Agents Without Ground Truth
Every Guardrail Classifier Tested Fails Formal Safety Verification
Math Proof Shows Transformer Attention Stabilizes Predictably
AI Agents Bypass Software Engineering, Risk Production Failure
SLIM Improves LLM Agent Performance by 7 Percentage Points
Shepherd Raises Agent Accuracy 90% With Forking Traces
WildClawBench: Claude Opus Clears 62% in Real-World Agent Evaluation
Sparse MoE Models Match Dense Transformers at 3× Faster Inference
Muon Optimizer Achieves 2× Speed Over AdamW in Production LLM Training
CIVeX Logs Zero False Executions in Confounded Workflows
Paper Dismantles Causal Discovery Claim in Prediction Models
Frozen Models Encode Semantic Roles Without Fine-Tuning
Flow-OPD Raises Stable Diffusion Accuracy From 63 to 92