THU, JUL 02, 2026
Issue Nº 72
aiexpert
§ BEAT

Research

30 stories · Benchmarks

Simple Prompting Baselines Outperform Complex Supervision Methods

Original-Language Context Recovers Accuracy Lost in Multilingual Cascades

Sequence Probability Fails as Production Inference Signal

RiVER Enables Reinforcement Learning Without Ground-Truth Labels

World Model Hallucination Is a Data Problem, Not Architecture

FFASR Benchmark Exposes Far-Field Speech Recognition Gap

Strict Regex Fix Raises Agent Grading Recall by 60 Percentage Points

Amortized In-Context Learning Cuts Few-Shot Serving Cost

Only 10.5% of AI-Generated Code Passes Security Checks

DiffusionGemma's Actual Decoding Contradicts Google's Block-Autoregressive Claims

Sparse Mask Retraining Matches Full On-Policy Distillation Performance

EvoArena Benchmark Exposes Agent Collapse in Evolving Environments

Half of AI-Generated Code Fixes Fail Human Review

Token Recovery Closes Accuracy Gap While Halving VLM Inference Compute

LLM Leaderboards Fail to Predict Production Reliability

Grok 3 Surpasses Credentialed Biologists on Autonomous DNA Lab Tasks

FASE Cuts Hallucination Detection Cost to 0.3% of Rivals

EvalCards Schema Exposes Systematic AI Benchmark Metadata Gaps

Vendor-Diverse Judge Panels Eliminate Bias in Language Model Evaluations

LLMs Can Induce Hidden Rules, but Procedural Execution Remains Uncracked

SubFit Maintains 84.6% Accuracy While Pruning LLM Layers at 25% Sparsity

Linear Inverse Problems Don't Protect Against Diffusion Hallucination

Vision-Language Models Show No Advantage in Text-Only Alignment

MATCHA Outperforms BERTScore by 20% at Detecting Semantic Contradictions

BRANE Cuts Retrieval Agent Costs by 89% Per Query

Claw-Anything Benchmark Sets 34.5% Ceiling for Always-On Agents

Stanford Framework Reveals Hidden Flaws in AI Benchmarks

MobileGym Solves Mobile-Agent Reproducibility at Scale

Shannon-Hartley Theorem Explains LLM Quantization Regressions

Complete-muE Lets Teams Transfer Dense Hyperparameters to MoE