§ BEAT
Research
TRIAGE Cuts Agent Actions 14.8% While Raising Success Rates
New Training Technique Improves LLM Confidence Calibration by 63%
Mechanism Taxonomy Lifts LLM Moderation F1 by 5.4%
DeepMind Forensic Protocol Diagnoses Confused vs. Misaligned AI
Production Voice AIs Ignore Emotion, Approving Fraud and Ending Care Calls
ClinHallu Dissects Why Medical LLMs Misread Images 65% of the Time
Sub-$11 Agent Outperforms Specialized Research Frameworks
Recursive Agent Harness Achieves 89% Accuracy on Long-Context Code Tasks
DIRECT Cuts Embodied AI Latency 65% with Dynamic Planner Routing
Token-Level Branching Offers Faster LLM Agent Training Without Budget Expansion
ABC-Bench Shows LLM Agents Now Outperform Expert Biologists on Lab Tasks
FPCG Steers Reasoning Models at Test Time Without Retraining
Linear Probes Predict Reasoning-Model Behavior at 64–91% Accuracy
New DRPO Method Fixes Long-Tail Vocabulary Collapse in LLM RL
Router Matching 50 Retries with 10 Samples Cuts LLM Test-Time Compute
SafeSteer Cuts Alignment Tax by Targeting Sparse Safety Tokens
Claude Code Spent 58% of Sessions Optimizing a Broken Architecture
RLHF Training Amplifies Model Bias to 100%
MemAudit Cuts Memory-Poisoning Attacks to 0%
Rensselaer and IBM Expose KV Cache Leakage in Multi-Agent LLMs
Matching Principle Unifies Seven Robustness Families
Self-Modifying Agents Boost Benchmark Score to 0.61
LCGuard Patches KV-Cache Leakage in Multi-Agent Systems
Fine-Tuning Erases Reasoning Chains While Accuracy Stays High
Medical LLMs Underweight Patient Autonomy
Microsoft Finds GPT-5 Fails Against Implausible Attacks
LLM Formalization Catches 18.8% Ambiguous Requirements in Safety Specs
Negation Neglect Drives False Belief Rate to 88.6% in Fine-Tuned LLMs
Reward Hacking Undetected in Single-Verifier Training