§ BEAT
Research
Simple Prompting Baselines Outperform Complex Supervision Methods
Original-Language Context Recovers Accuracy Lost in Multilingual Cascades
Sequence Probability Fails as Production Inference Signal
RiVER Enables Reinforcement Learning Without Ground-Truth Labels
World Model Hallucination Is a Data Problem, Not Architecture
FFASR Benchmark Exposes Far-Field Speech Recognition Gap
Strict Regex Fix Raises Agent Grading Recall by 60 Percentage Points
Amortized In-Context Learning Cuts Few-Shot Serving Cost
Only 10.5% of AI-Generated Code Passes Security Checks
DiffusionGemma's Actual Decoding Contradicts Google's Block-Autoregressive Claims
Sparse Mask Retraining Matches Full On-Policy Distillation Performance
EvoArena Benchmark Exposes Agent Collapse in Evolving Environments
Half of AI-Generated Code Fixes Fail Human Review
Token Recovery Closes Accuracy Gap While Halving VLM Inference Compute
LLM Leaderboards Fail to Predict Production Reliability
Grok 3 Surpasses Credentialed Biologists on Autonomous DNA Lab Tasks
FASE Cuts Hallucination Detection Cost to 0.3% of Rivals'
EvalCards Schema Exposes Systematic AI Benchmark Metadata Gaps
Vendor-Diverse Judge Panels Eliminate Bias in Language Model Evaluations
LLMs Can Induce Hidden Rules, but Procedural Execution Remains Uncracked
SubFit Maintains 84.6% Accuracy While Pruning LLM Layers at 25% Sparsity
Linear Inverse Problems Don't Protect Against Diffusion Hallucination
Vision-Language Models Show No Advantage in Text-Only Alignment
MATCHA Outperforms BERTScore by 20% at Detecting Semantic Contradictions
BRANE Cuts Retrieval Agent Costs by 89% Per Query
Claw-Anything Benchmark Sets 34.5% Ceiling for Always-On Agents
Stanford Framework Reveals Hidden Flaws in AI Benchmarks
MobileGym Solves Mobile-Agent Reproducibility at Scale
Shannon-Hartley Theorem Explains LLM Quantization Regressions