§ BEAT
Research
Scientific ML Models Disagree on 16% of Predictions Despite Matching Accuracy
MEME Benchmark Finds 97% Failure on Agent Memory Dependency Tasks
RuDE Predicts Fine-Tuning Success Without Training
WildClawBench: Claude Opus Clears 62% in Real-World Agent Evaluation
Muon Optimizer Achieves 2× Speed Over AdamW in Production LLM Training
Coupling Tax: Reasoning Mode Cuts Accuracy Under Token Limits
Frontier Models Disagree on Ambiguous Policies, DRIP-R Shows
Arena Analysis: 66% of Leaderboard Votes Cancel Out
MRI-Eval Finds LLMs Score 97% on Flashcards, 30% on Open Recall
Multi-Agent LLMs Lose One-Third Quality But Signal Recovery Path
Evidence Quality, Not Model Scale, Cuts Clinical LLM Errors
VNU Research Enables Sound Event Detection for Unseen Acoustic Classes
iWorld-Bench Exposes Memory Flaws in Top World Models
SHAP-Based Framework Quantifies RL Configuration Impact in Robotics
VideoNet Exposes Action Recognition Gaps Across Vision-Language Models
LightKV Halves Vision-Token Cache Size in LVLMs
Benchmark Scores Mask LLM Failures on Multi-Step Tasks
IISc's Wavelet Neural Network Solves Industrial Simulation Loss Imbalance
Enterprise AI Silent Failures Dodge Detection, Stanford Study Finds
DV-World Benchmark: AI Data Visualization Agents Score Below 50% on Production Tasks
DeepSpeed CPU-Offload Bug Corrupted RLHF Benchmarks in Three Major Frameworks
IBM's ACoT Cuts Reasoning Tokens 11.6x Without Accuracy Loss