SUN, MAY 17, 2026
Issue Nº 26
aiexpert
§ BEAT

Research

23 stories · Benchmarks

Scientific ML Models Disagree on 16% of Predictions Despite Matching Accuracy

MEME benchmark finds 97% failure on agent memory dependency tasks

RuDE Predicts Fine-Tuning Success Without Training

WildClawBench: Claude Opus Clears 62% in Real-World Agent Evaluation

Muon Optimizer Achieves 2× Speed Over AdamW in Production LLM Training

Coupling Tax: Reasoning Mode Cuts Accuracy Under Token Limits

Frontier Models Disagree on Ambiguous Policies, DRIP-R Shows

Arena Analysis: 66% of Leaderboard Votes Cancel Out

MRI-Eval Finds LLMs Score 97% on Flashcards, 30% on Open Recall

Multi-Agent LLMs Lose One-Third Quality But Signal Recovery Path

Evidence quality, not model scale, cuts clinical LLM errors

VNU Research Enables Sound Event Detection for Unseen Acoustic Classes

iWorld-Bench Exposes Memory Flaws in Top World Models

SHAP-Based Framework Quantifies RL Configuration Impact in Robotics

VideoNet exposes action recognition gaps across vision-language models

LightKV halves vision-token cache size in LVLMs

Benchmark scores mask LLM failures on multi-step tasks

IISc's Wavelet Neural Network Solves Industrial Simulation Loss Imbalance

Enterprise AI Silent Failures Dodge Detection, Stanford Study Finds

DV-World Benchmark: AI Data Visualization Agents Score Below 50% on Production Tasks

DeepSpeed CPU-Offload Bug Corrupted RLHF Benchmarks in Three Major Frameworks

IBM's ACoT Cuts Reasoning Tokens 11.6x Without Accuracy Loss

Tested on 19 Frontier Models, MathDuels Decouples Authoring From Solving Skill