ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

921 papers, 88 this month, 12 topics
Efficiency 38, Training 37, Evaluation 33, Reasoning 27, Agents 23, Architecture 23, Applications 21, Multimodal 15, Safety 12, Scaling 8, Alignment 8, Data 6

May 25 – May 31 (11)

Algorithmic Monocultures in Hiring

May 26, 2026

Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel et al.

When many employers use the same hiring algorithm, it amplifies bias rather than spreading risk—the same people get rejected everywhere, and racial disparities compound across the job market.

This paper analyzes hiring algorithms from a single vendor used by many employers and finds they create unfair outcomes.

safety, evaluation, applications
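
The compounding effect can be made concrete with a little arithmetic. The sketch below is not the paper's analysis: it assumes a hypothetical applicant whom any single screening model rejects with probability 0.3, and the function name and numbers are illustrative only.

```python
def prob_rejected_everywhere(p_reject: float, n_employers: int, shared: bool) -> float:
    """Probability that one applicant is rejected by every employer.

    Assumption (illustrative only): each screening model rejects this
    applicant with probability p_reject. Independent models make
    independent decisions; a shared model makes one decision that every
    employer reuses.
    """
    if shared:
        return p_reject
    return p_reject ** n_employers


independent = prob_rejected_everywhere(0.3, 5, shared=False)
monoculture = prob_rejected_everywhere(0.3, 5, shared=True)
print(f"independent models: {independent:.5f}")
print(f"shared model:       {monoculture:.5f}")
```

Under these assumptions, five independent models reject the applicant everywhere with probability 0.00243, while one shared model does so with probability 0.3.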

Natural Language Query to Configuration for Retrieval Agents

May 26, 2026

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob et al.

You can optimize retrieval pipelines per-query rather than per-workload by using lightweight predictors trained on query characteristics, achieving the same accuracy at significantly lower cost or better accuracy at the same cost.

This paper presents BRANE, a system that automatically selects the best configuration for retrieval agents on a per-query basis. Instead of manually tuning a retrieval pipeline once, BRANE analyzes each query to predict which combination of LLM, retriever, and other settings will work best, allowing teams to optimize for either accuracy or cost at inference time without retraining.

agents
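
Per-query selection as described above can be sketched as picking the cheapest configuration whose predicted accuracy clears a target. Everything here is hypothetical: the configuration names, costs, and the hand-written `predict_accuracy` heuristic stand in for a learned predictor and are not BRANE's actual components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    cost: float      # hypothetical relative cost per query
    strength: float  # hypothetical accuracy on the hardest queries


CONFIGS = [
    Config("small-llm+bm25", 1.0, 0.50),
    Config("small-llm+dense", 2.0, 0.70),
    Config("large-llm+dense", 10.0, 0.95),
]


def predict_accuracy(query: str, config: Config) -> float:
    """Stand-in for a lightweight learned predictor (assumption).

    Toy heuristic: longer queries count as harder, and every
    configuration does better on easier queries.
    """
    difficulty = min(len(query.split()) / 20.0, 1.0)
    return min(1.0, config.strength + 0.5 * (1.0 - difficulty))


def select_config(query: str, target: float) -> Config:
    """Cheapest config predicted to reach the target, else the most accurate."""
    good_enough = [c for c in CONFIGS if predict_accuracy(query, c) >= target]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    return max(CONFIGS, key=lambda c: predict_accuracy(query, c))


easy = "capital of France"
hard = " ".join(["term"] * 25)
print(select_config(easy, 0.9).name)
print(select_config(hard, 0.9).name)
```

With this toy predictor the short query is routed to the cheapest configuration and the long one to the most expensive, which is the cost and accuracy trade the summary describes.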

May 18 – May 24 (23)

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

May 22, 2026

Xu Ouyang, Deyi Liu, Yuhang Cai et al.

LLMs have a fundamental capacity limit based on signal-to-noise ratio: scaling parameters or data without maintaining sufficient signal clarity causes performance degradation, explaining phenomena like catastrophic overtraining and quantization failures that standard scaling laws can't capture.

This paper explains why large language models sometimes get worse, not better, with more training or lower precision. Using information theory, the authors model LLM training as sending signals through a noisy channel. When the model or data is scaled up without keeping the signal clear relative to the noise, performance drops, tracing a U-shape.

scaling, training, evaluation

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

May 22, 2026

Zisu Huang, Jingwen Xu, Yifan Yang et al.

Model-generated skills can improve agent performance, but their effectiveness depends on how they're extracted and which agent uses them—not on model size or baseline strength.

This paper studies how AI agents can reuse skills—structured procedures extracted from past experience—to improve performance. The researchers built a comprehensive evaluation framework testing skill extraction and reuse across five different task domains, finding that while model-generated skills help on average, they sometimes hurt performance.

May 11 – May 17 (9)

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

May 14, 2026

Ruozhen He, Meng Wei, Ziyan Yang et al.

Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.

EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.

evaluation, multimodal, architecture

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

May 4 – May 10 (36)

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoning, evaluation, safety
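
For readers new to conformal prediction, here is a minimal sketch of the standard split-conformal recipe that produces answer sets with a coverage guarantee. It is not the paper's path-level calibration or its learned path scorer; the calibration scores, candidate answers, and function names are hypothetical, and a nonconformity score here simply means "how implausible the model found this answer".

```python
import math


def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores holds the nonconformity of the TRUE answer on each
    held-out calibration question. Under exchangeability, sets built
    with the returned threshold contain the true answer with
    probability at least 1 - alpha.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return float("inf")  # too few calibration points: keep everything
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores: dict[str, float], threshold: float) -> set[str]:
    """Keep every candidate whose nonconformity is at most the threshold."""
    return {answer for answer, s in candidate_scores.items() if s <= threshold}


# Hypothetical calibration scores and candidates for one new question.
cal = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70]
threshold = conformal_threshold(cal, alpha=0.2)
candidates = {"Paris": 0.10, "Lyon": 0.55, "Marseille": 0.90}
answers = prediction_set(candidates, threshold)
print(threshold, sorted(answers))
```

A better scorer pushes true answers toward low nonconformity, which lowers the threshold and shrinks the sets without giving up the coverage level.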

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

May 8, 2026

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal et al.

When using GNNs for predictions, you can get tighter, more reliable uncertainty estimates by explicitly using graph structure rather than just embedding similarity—this gives you both statistical guarantees and practical efficiency.

GRAPHLCP improves uncertainty quantification for graph neural networks by using graph structure to make better predictions with guaranteed coverage. Instead of just looking at embedding similarity, it uses graph topology and a PageRank-based approach to identify similar nodes and weight predictions appropriately, reducing wasted prediction sets while maintaining statistical guarantees.

Apr 27 – May 3 (21)

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

May 1, 2026

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.

LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.

This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.

evaluation, reasoning, alignment

Can Coding Agents Reproduce Findings in Computational Materials Science?

May 1, 2026

Ziyang Huang, Yi Cao, Ali K. Shargh et al.

AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.

This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success, because they cannot fully reconstruct experimental procedures from paper descriptions, they deviate from required methods, and they fail during execution.


When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

May 26, 2026

Kim Jihyeon, Sohee Kim, Soosan Lee et al.

High-level semantic inconsistencies in social gaze (eye direction, head-eye alignment) are more reliable for detecting AI-generated images than low-level pixel artifacts, and this signal transfers across different generative models.

This paper shows that AI-generated images often fail at maintaining realistic gaze patterns between people—like consistent eye direction and head-eye alignment. The researchers built a detection system using this semantic weakness, along with a carefully designed dataset and training approach, achieving better detection across multiple AI image generators.

evaluation, safety

MATCHA: Matching Text via Contrastive Semantic Alignment

May 26, 2026

Siran Li, Ece Sena Etoglu, Carsten Eickhoff et al.

Current LLM evaluation metrics fail to catch semantic contradictions, potentially hiding serious errors. MATCHA solves this by explicitly measuring both agreement with correct answers and distance from contradictory statements.

MATCHA is a new evaluation metric for LLMs that fixes a critical flaw in popular metrics like ROUGE and BERTScore: they give similar scores to contradictory texts. MATCHA uses a dual approach—rewarding similarity to correct answers while penalizing contradictions—and significantly outperforms existing metrics across question-answering, summarization, and other tasks.

evaluation, alignment

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

May 25, 2026

Dingbang Wu, Rui Hao, Haiyang Wang et al.

You can now train mobile agents at scale with deterministic, verifiable rewards in simulation, and the skills transfer well to real devices—solving a major bottleneck in agent research.

MobileGym is a lightweight simulation platform for training mobile app agents that runs hundreds of parallel instances in a browser. It provides verifiable task outcomes through structured JSON states and enables scalable reinforcement learning training, with a benchmark of 416 tasks across 28 apps that shows strong transfer to real devices.

agents, evaluation, applications

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

May 25, 2026

Shangding Gu

Agent performance depends as much on system design (memory, routing, verification) as on model capability; evaluating agents requires measuring trajectory quality and system hygiene, not just final outcomes.

This paper argues that building better AI agents requires focusing on the system architecture around language models, not just making the models bigger. It introduces the concept of 'scaling the harness'—designing the memory, tool-use, verification, and orchestration layers that turn a model into a working agent—and proposes benchmarks to measure agent quality beyond just task success.

agents, architecture, evaluation

Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models

May 25, 2026

Bar Weiss, Antonio Abu-Nassar, Adi Sosnovich et al.

LLMs can reliably classify code changes into structured categories (renames, moves, logic changes, etc.) to automate and prioritize code review tasks, achieving strong accuracy while being language-agnostic and customizable.

This paper shows how large language models can automatically label and categorize code changes in patches (like identifying renames, moves, or logic modifications) to make code review faster and more efficient. Using a two-stage approach with few-shot prompting, the method achieves 84% recall and 81% precision without needing traditional static analysis tools.

applications, evaluation

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

May 25, 2026

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior et al.

LLMs can predict physics outcomes but struggle with true scientific discovery: the strongest models pass only 50% of worlds, and good prediction accuracy doesn't guarantee conceptual understanding of the underlying laws.

DiscoverPhysics is a benchmark that tests whether large language models can discover unknown physics laws by designing experiments in simulated worlds with non-standard physics.

reasoning, evaluation, agents

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

May 25, 2026

Yusong Lin, Xinyuan Liang, Haiyang Wang et al.

Building truly useful AI assistants requires handling messy, interconnected real-world contexts—not isolated tasks—and current models fall far short of this challenge, but synthetic data generation can help close the gap.

Claw-Anything is a benchmark for testing AI agents as always-on personal assistants with access to a user's full digital world—including activity history, multiple services, and both GUI and CLI interfaces.

agents, evaluation, reasoning

VeriTrace: Evolving Mental Models for Deep Research Agents

May 25, 2026

Haolang Zhao, Yunbo Long, Lukas Beckenbauer et al.

Research agents need explicit feedback mechanisms to evolve their understanding of tasks—not just bigger models—to avoid error propagation when working through complex, interdependent information.

VeriTrace is a framework that helps AI research agents maintain accurate mental models by explicitly tracking and correcting their understanding as they work through complex problems. Instead of letting language models implicitly manage their reasoning, it uses three feedback loops to catch errors early and prevent them from cascading through the agent's work.

reasoning, agents, evaluation

Automated Benchmark Auditing for AI Agents and Large Language Models

May 25, 2026

Junlin Wang, Federico Bianchi, Shang Zhu et al.

Many AI benchmarks contain hidden flaws that distort model rankings and performance scores; automated auditing can catch these issues at scale and improve the reliability of capability assessments.

This paper introduces Auto Benchmark Audit (ABA), an AI agent that automatically checks benchmark tasks for hidden problems like incomplete specifications, environment conflicts, and broken evaluation logic. Testing 168 benchmarks across nine domains, ABA found critical issues in over 25% of tasks—problems that human reviewers missed.

evaluation, agents

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

May 22, 2026

Jianshu Zhang, Yijiang Li, Huifeixin Chen et al.

Current VLMs struggle to genuinely understand spatial numbers—they can't reliably map between visual coordinates and numerical values, which is critical for embodied AI tasks like robotics that require precise spatial outputs.

This paper tests whether Vision-Language Models (VLMs) truly understand spatial numbers like coordinates and distances. Using SpaceNum, a framework with two tasks (converting numbers to spatial positions and vice versa), researchers find that VLMs largely fail at grounding numbers in actual spatial meaning, relying on shallow visual cues rather than genuine spatial reasoning.

evaluation, multimodal, reasoning

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

May 22, 2026

Shuhong Zheng, Michael Oechsle, Erik Sandström et al.

By selectively dropping redundant image patches across frames and within frames using attention entropy, you can speed up 3D reconstruction transformers dramatically without sacrificing quality.

This paper tackles the computational bottleneck in visual geometry transformers, models that reconstruct 3D scenes from multiple images. The authors propose a token selection strategy that limits which image patches the model attends to, cutting computation by 85% while maintaining or improving accuracy.

efficiency, architecture, evaluation

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

May 22, 2026

Rim Assouel, Amir Bar, Michal Drozdzal et al.

Adding synthetic geometric overlays during training helps MLLMs learn better spatial and quantitative reasoning—suggesting many visual understanding failures come from insufficient training data rather than model architecture limits.

This paper introduces Procedurally Generated Tasks (PGT), a method that overlays geometric shapes on images to create training data that improves how multimodal AI models understand fine-grained visual details like spatial relationships and quantities. Testing shows improvements of up to 20% on visual reasoning benchmarks while keeping general capabilities intact.

multimodal, training, evaluation

Human Decision-Making with Persuasive and Narrative LLM Explanations

May 22, 2026

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash et al.

Adding narrative explanations to AI predictions can backfire: they increase trust in AI without improving accuracy, and may actually harm decision quality by making people slower to question wrong predictions.

This study tested how AI-generated narrative explanations affect human decision-making in classification tasks. Researchers found that persuasive explanations didn't improve accuracy compared to predictions alone, but did increase reliance on AI—even when the AI was wrong. More persuasive narratives sometimes slowed decisions and made it harder to spot AI errors.

evaluation, safety, alignment

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

May 21, 2026

Krishnakumar Balasubramanian

Conservative drifting with kernel density estimators achieves provable convergence rates for one-step generative modeling, with the convergence speed depending on dimension and a tunable parameter that trades off between different error sources.

This paper analyzes drifting methods for generative modeling, proposing a conservative approach using kernel density estimators that guarantees gradient-field properties. The authors prove finite-particle convergence rates showing how quickly the method converges as sample size increases, with explicit tracking of how bandwidth and dimension affect performance.

training, evaluation

Evaluating Commercial AI Chatbots as News Intermediaries

May 21, 2026

Mirac Suzgun, Emily Shen, Federico Bianchi et al.

AI chatbots excel at retrieving and synthesizing recent news but have three critical weaknesses: they systematically underperform on non-English content, fail primarily due to retrieval errors rather than reasoning mistakes, and are easily fooled by questions containing subtle false information.

This study evaluates six major AI chatbots (Gemini, Grok, Claude, GPT models) on their ability to answer factual news questions across six languages and regions.

evaluation, multimodal, data

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

May 21, 2026

Huanchi Wang, Zihang Huang, Yifang Tian et al.

You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.

FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.

efficiency, evaluation, applications

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

May 21, 2026

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

Diffusion models can effectively handle continuous-time survival analysis by modeling censored outcomes directly, avoiding parametric assumptions and discretization errors that limit traditional survival methods.

SDPM uses diffusion models to estimate time-to-event distributions from data with censored observations, without requiring assumptions about the hazard function or discretizing time. The model generates samples that can be converted to survival curves, achieving competitive performance on real datasets while accurately recovering underlying continuous distributions.

applications, evaluation

Understanding Data Temporality Impact on Large Language Models Pre-training

May 21, 2026

Pilchen Hippolyte, Fabre Romain, Signe Talla Franck et al.

Training LLMs on chronologically ordered data instead of shuffled data improves their knowledge of recent facts and temporal accuracy, suggesting data ordering matters for building models that stay current.

This paper investigates how the order of training data affects what LLMs learn about time-sensitive facts. Researchers trained 6B-parameter models on chronologically ordered data versus shuffled data, and found that sequential training produces models with more current and accurate temporal knowledge while maintaining general language understanding.

training, data, evaluation

Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

May 21, 2026

Adis Alihodžić, Eva Tuba, Milan Tuba

Smart grid operators can use genetic algorithm feature selection to identify which electrical measurements matter most for attack detection, reducing sensor requirements while maintaining 98%+ accuracy.

This paper detects cyber-physical attacks in smart grids by combining machine learning with genetic algorithm-based feature selection. Using real power system data, the authors show that tree-based models like Extra Trees can accurately distinguish between natural faults and malicious attacks, and that a small subset of 27 features (down from 112) is sufficient for reliable detection.

evaluation

Ternary Decision Trees with Locally-Adaptive Uncertainty Zones

May 21, 2026

William Smits

Decision trees can improve accuracy by explicitly handling boundary cases through locally-computed uncertainty zones—instances near splits get soft predictions and uncertainty flags instead of hard classifications, helping downstream applications make better decisions.

This paper introduces ternary decision trees that add uncertainty zones around split thresholds, allowing predictions near decision boundaries to blend outputs from both child subtrees and flag uncertain cases.

architecture, evaluation
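
The idea of an uncertainty zone around a split can be sketched with a single decision stump. This is an illustration under stated assumptions, not the paper's algorithm: the zone is given here as a fixed `half_width` (the paper computes zones locally, by a method the summary does not describe), and the linear blend between the two child outputs is a choice made for this example.

```python
def ternary_stump(
    x: float,
    threshold: float,
    half_width: float,
    left_value: float,
    right_value: float,
) -> tuple[float, bool]:
    """Return (prediction, uncertain) for one split with an uncertainty zone.

    Outside the zone the stump behaves like an ordinary binary split.
    Inside it, the prediction is a linear blend of both child outputs
    (an assumption for this sketch) and the instance is flagged.
    """
    lo, hi = threshold - half_width, threshold + half_width
    if x <= lo:
        return left_value, False
    if x >= hi:
        return right_value, False
    w = (x - lo) / (hi - lo)  # 0 at the left edge of the zone, 1 at the right
    return (1 - w) * left_value + w * right_value, True


for x in (3.0, 5.0, 7.0):
    print(x, ternary_stump(x, threshold=5.0, half_width=1.0,
                           left_value=0.2, right_value=0.8))
```

A downstream system can route the flagged cases to a human or a stronger model while trusting the hard predictions elsewhere.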

Proxy-Based Approximation of Shapley and Banzhaf Interactions

May 21, 2026

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter et al.

ProxySHAP makes it practical to explain complex feature interactions in ML models by using proxy models and residual correction, achieving state-of-the-art accuracy while remaining computationally efficient even with thousands of features.

ProxySHAP is a new method for computing Shapley and Banzhaf interactions—measures that explain how features work together in machine learning models. It combines fast tree-based approximations with mathematical corrections to achieve both speed and accuracy, outperforming existing methods on large datasets.

evaluation, efficiency

The Distillation Game: Adaptive Attacks & Efficient Defenses

May 21, 2026

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo et al.

Distillation defenses must be evaluated against adaptive attackers who strategically choose which outputs to learn from—not just passive ones—and simple forward-pass defenses like PoE can match expensive defenses while preserving reasoning quality.

This paper studies how AI model providers face a trade-off: making models more useful (through better outputs) makes them easier to copy through distillation attacks. The authors develop a game-theoretic framework to understand this trade-off and propose Product-of-Experts (PoE), a lightweight defense that combines the teacher model with a proxy student during generation.

safety, evaluation, efficiency

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.

efficiency, training, evaluation
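
The reuse pattern described above, one expensive computation shared across several cheap noise samples, can be sketched as a nested Monte Carlo average. This is only the reuse step: it leaves out the importance sampling the summary mentions and is not CARV's estimator. The `toy_render` and `toy_score` functions are hypothetical stand-ins for a 3D renderer and a diffusion teacher's loss term.

```python
import random


def reuse_estimate(render, score, n_renders, noise_per_render, rng):
    """Average score(image, noise), drawing several noise samples per render."""
    total = 0.0
    for _ in range(n_renders):
        image = render(rng)  # expensive step, run once per outer iteration
        inner = sum(
            score(image, rng.gauss(0.0, 1.0))  # cheap step, repeated
            for _ in range(noise_per_render)
        )
        total += inner / noise_per_render
    return total / n_renders


calls = {"render": 0, "score": 0}


def toy_render(rng):
    calls["render"] += 1
    return rng.uniform(-1.0, 1.0)  # stand-in for a rendered view


def toy_score(image, noise):
    calls["score"] += 1
    return (image + noise) ** 2  # stand-in for the teacher's loss term


rng = random.Random(0)
value = reuse_estimate(toy_render, toy_score, n_renders=4, noise_per_render=8, rng=rng)
print(calls)
```

The call counts show the trade: 32 noise evaluations are paid for with only 4 renders, so the noise contribution to the variance is averaged down without extra expensive work.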

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

Retrieval isn't the main problem for frontier models on deep research tasks; instead, they fail primarily at deriving answers from evidence and calibrating confidence correctly, suggesting future improvements should focus on reasoning and verification rather than search.

DeepWeb-Bench is a challenging benchmark for evaluating AI agents that research questions by searching the web, collecting evidence, and reasoning through answers. Unlike existing benchmarks, it focuses on tasks requiring massive evidence gathering, cross-source verification, and complex multi-step reasoning—areas where current frontier models still struggle significantly.

evaluation, reasoning, agents

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

May 20, 2026

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

Most vision-language models struggle with knowledge-grounded visual reasoning—even large models only reach 75% accuracy when questions require combining visual evidence with external facts, suggesting a major gap in real-world VQA capabilities.

WikiVQABench is a new benchmark for testing vision-language models on questions that require both visual understanding and external knowledge from Wikipedia and Wikidata.

evaluation, multimodal

Mitigating Label Bias with Interpretable Rubric Embeddings

May 20, 2026

Calvin Isley, Johann D. Gaebler, Sharad Goel

Replace opaque learned embeddings with interpretable features derived from expert-defined rubrics to reduce bias inheritance from biased training labels in high-stakes decisions.

When training AI models on biased historical data (like past hiring decisions), the models learn and perpetuate those biases. This paper proposes using 'rubric embeddings'—features based on expert-defined criteria—instead of black-box embeddings to make fairer predictions. Testing on university admissions data, the approach reduces group disparities while maintaining quality.

alignment, evaluation

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

May 20, 2026

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

AI-generated refactoring often improves code but frequently introduces new quality and security issues that developers accept anyway, highlighting the need for automated quality checks before merging AI contributions.

This study examines Python refactoring pull requests created by AI agents, measuring their impact on code quality and security.

evaluation, safety, applications

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

May 18, 2026

Payal Chandak, Victoria Alkin, David Wu et al.

LLMs deployed for medical advice have hidden, consistent ethical biases that don't reflect real physician diversity; without explicit auditing and balancing, a single model's values could be imposed at scale on thousands of patients.

This paper audits how large language models handle ethical dilemmas in medicine, revealing that while models discuss multiple ethical perspectives in their reasoning, they make near-identical decisions across repeated attempts.

safety, evaluation, alignment

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate what facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scaling, evaluation, training

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

May 18, 2026

Feng Chen, Tianzhe Chu, Li Sun et al.

Current embodied systems struggle with the full loop: even when vision models perform well on isolated tasks (67% accuracy), they fail at recovering complete game state needed for decision-making (34% accuracy), and execution errors cascade during real deployment.

DexHoldem is a real-world benchmark that tests embodied AI systems playing Texas Hold'em with a dexterous robot hand. It combines three challenges: executing 14 card-manipulation skills precisely, perceiving game state from images, and making decisions based on that perception—revealing how errors compound when all three run together in closed-loop control.

evaluation, agents

Quantitative Video World Model Evaluation for Geometric-Consistency

May 14, 2026

Jiaxin Wu, Yihao Pi, Yinling Zhang et al.

Video generators often fail at maintaining consistent 3D geometry in ways that human raters and perceptual metrics don't catch; PDI-Bench provides a diagnostic tool to measure and improve these failures systematically.

This paper introduces PDI-Bench, a quantitative framework for evaluating whether generated videos maintain physically plausible 3D structure and motion.

evaluation, multimodal

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

May 14, 2026

Sahil Sen, Akhil Kasturi, Elias Lumer et al.

When building agentic search systems, simple grep-based retrieval can outperform vector search, but the agent architecture and how you present tool outputs to the model matter more than retrieval method alone.

This paper compares different retrieval strategies (grep vs. vector search) in AI agent systems that autonomously retrieve information and call tools.

agents, evaluation

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

May 14, 2026

Shang Zhou, Wenhao Chai, Kaiyuan Liu et al.

Instead of judging multiple reasoning attempts individually (which is noisy), compare them pairwise and aggregate votes to find the best solution—this scales test-time compute breadth more reliably than single-trace depth scaling.

OpenDeepThink improves LLM reasoning by running multiple solution attempts in parallel and selecting the best one using pairwise comparisons between candidates, rather than trying to judge each solution independently. The method uses Bradley-Terry aggregation to rank candidates based on LLM pairwise judgments, then evolves the top solutions using critiques from comparisons.

reasoning, evaluation
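
Bradley-Terry aggregation turns a table of pairwise wins into one strength score per candidate. The sketch below uses the standard minorization-maximization update for the Bradley-Terry model; the win counts are hypothetical, and the paper's judging prompts and its critique-driven evolution step are not shown.

```python
def bradley_terry(wins: list[list[int]], iters: int = 200) -> list[float]:
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of comparisons in which candidate i beat
    candidate j. Assumes at least one comparison was recorded.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Only pairs that were actually compared contribute.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]
    return p


# Hypothetical judgments over three candidate solutions A, B, C.
wins = [
    [0, 3, 4],  # A beat B 3 times and C 4 times
    [1, 0, 3],  # B beat A once and C 3 times
    [0, 1, 0],  # C beat B once
]
strengths = bradley_terry(wins)
best = max(range(len(strengths)), key=strengths.__getitem__)
print(best, [round(s, 3) for s in strengths])
```

Because every judgment is relative, a judge that is poorly calibrated in absolute terms can still produce a usable ranking, which is the motivation the summary gives for comparing pairs instead of scoring each attempt alone.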

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

May 14, 2026

Pratinav Seth, Vinay Kumar Sankarapu

Behavioral evaluations alone cannot verify the safety claims regulators now demand—you need mechanistic evidence like activation analysis to actually verify what's happening inside AI models, not just what they output.

This paper argues that current AI safety evaluation methods (like red-teaming and behavioral testing) cannot verify the deep safety properties that AI governance frameworks now require, such as absence of hidden objectives or resistance to loss-of-control.

safety, evaluation, alignment

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

May 12, 2026

Di Wu, Zixiang Ji, Asmi Kawatkar et al.

Long-term memory for agents requires more than just storing task outcomes; agents need to internalize environment-specific patterns, workflows, and failure modes to become truly experienced colleagues, and current memory systems still struggle with this despite recent advances.

This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions about five types of memory (state recall, workflow knowledge, failure modes, etc.) paired with massive history trajectories up to 500 steps and 115M tokens.

agentsevaluationreasoning

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

May 12, 2026

Ariel Gera, Shir Ashury-Tahan, Gal Bloch et al.

You can boost embedding model performance on hard search tasks by having an LLM refine queries at test-time, making embeddings practical for scenarios where running LLMs on all documents is too expensive.

This paper shows how to improve embedding models for search and classification by using an LLM to refine user queries in real-time. Instead of changing the embedding model itself, the approach adjusts the query representation based on feedback from a small sample of documents, achieving up to 25% improvement on challenging tasks without requiring expensive LLM processing at scale.

efficiencyevaluation

MEME: Multi-entity & Evolving Memory Evaluation

May 12, 2026

Seokwon Jung, Alexander Rubinstein, Arnas Uselis et al.

LLM agents struggle with dependency reasoning in persistent memory—when facts relate to each other, systems collapse to near-random performance, and fixing this requires impractically expensive configurations.

This paper introduces MEME, a benchmark for evaluating how well AI agents manage information across multiple sessions. It tests six memory tasks including complex scenarios like tracking dependencies between facts and handling deletions.

evaluationagentsreasoning

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.
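
To make the idea of partial credit across dimensions concrete, here is a small sketch of a rubric-weighted reward followed by GRPO-style group normalisation. The criterion names, weights, and function names are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def rubric_reward(scores, weights):
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of sampled responses.

    GRPO compares each response to the other responses sampled for the
    same prompt rather than to a learned value baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A response that satisfies most rubric criteria receives a positive advantage relative to its group even when no response is fully correct, which is the partial-credit behaviour the summary describes.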

trainingreasoningevaluation

Accurate and Efficient Statistical Testing for Word Semantic Breadth

May 8, 2026

Yo Ehara

When statistically comparing semantic breadth of words using embeddings, you must account for directional differences or your significance tests will be unreliable—this paper provides a practical, GPU-accelerated solution.

This paper solves a statistical problem in measuring how broadly a word's meaning spreads across different contexts using word embeddings. When comparing two words' semantic breadth, naive statistical tests fail because they confuse directional differences (where words point in different semantic directions) with actual breadth differences.

evaluation

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

May 8, 2026

Yi Yu, Parker Martin, Zhenyu Bu et al.

Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.

CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.

applicationsefficiencyevaluation

BAMI: Training-Free Bias Mitigation in GUI Grounding

May 7, 2026

Borui Zhang, Bo Zhang, Bo Wang et al.

You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow down the region of interest, then select the best candidate from remaining options.

This paper identifies why GUI grounding models (used by AI agents to click and interact with interfaces) fail on complex screens, finding two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.

agentsevaluationefficiency

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

May 7, 2026

Jai Moondra, Ayela Chughtai, Bhargavi Lanka et al.

Don't trust global LLM leaderboards—they hide structured disagreement across languages and tasks. Use language-specific rankings or small model portfolios instead to match diverse user needs.

Current LLM leaderboards rank models using global voting patterns, but this masks the reality: opinions differ dramatically by language and task. This paper shows that two-thirds of votes cancel out and that top models are statistically indistinguishable globally. Instead, grouping by language reveals coherent subpopulations with consistent rankings.

evaluationmultimodal

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

May 7, 2026

Sushant Gautam, Finn Schwall, Annika Willoch Olstad et al.

When deploying LLMs in new languages or sectors without existing safety benchmarks, you can't collapse safety comparisons into a single score—you must report the full context: which scenarios, which judge, which risk measure, and the uncertainty around each comparison.

This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks—whether models respond to safety changes, whether model differences dominate over measurement noise, and whether results stay consistent across retests.

safetyevaluation

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

May 7, 2026

Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit

Local 3D structure around a protein's light-emitting center matters more than overall sequence for predicting brightness—and you can build interpretable models by explicitly encoding which atoms contact which chromophore regions.

This paper predicts how bright fluorescent proteins will be by analyzing their 3D structure around the light-emitting chromophore region. Instead of just looking at protein sequences, the method builds a graph of how atoms and chemical groups physically contact the chromophore, then uses machine learning to predict brightness.

evaluationarchitecture

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

May 7, 2026

Hao Dong, Hongzhao Li, Shupan Li et al.

Despite claims of progress, multimodal domain generalization methods show only marginal improvements over basic approaches when fairly compared—the field needs better methods and standardized evaluation to make real progress.

This paper creates MMDG-Bench, the first standardized benchmark for multimodal domain generalization across action recognition, fault diagnosis, and sentiment analysis. Testing 9 methods on 6 datasets with 7,402 trained models, it reveals that recent specialized methods barely beat simple baselines, no method works consistently across tasks, and all methods struggle with corrupted or missing data.

evaluationmultimodal

Taming Outlier Tokens in Diffusion Transformers

May 6, 2026

Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu et al.

Outlier tokens in diffusion transformers aren't just extreme values but represent corrupted local information; controlling them with register tokens significantly improves image generation quality.

This paper identifies and fixes a problem in Diffusion Transformers where certain tokens develop unusually high values that degrade image quality. The authors show this happens in both the image encoder and the generation model itself, and propose Dual-Stage Registers—a technique using learnable tokens to stabilize these problematic values and improve image generation.

architectureefficiencyevaluation

Implicit Representations of Grammaticality in Language Models

May 6, 2026

Yingshan Susan Wang, Linlu Qiu, Zhaofeng Wu et al.

Language models learn grammaticality as a distinct concept from string probability, hidden in their internal representations rather than reflected in output probabilities—you can extract this knowledge with a simple linear probe.

Language models generate grammatical text, but their probability scores don't clearly distinguish grammatical from ungrammatical sentences.

evaluation

Almost-Orthogonality in Lp Spaces: A Case Study with Grok

May 6, 2026

Ziang Chen, Jaume de Dios Pont, Paata Ivanisvili et al.

AI language models can contribute meaningfully to mathematical discovery by helping identify intermediate lemmas and inequalities, though human mathematicians remain essential for rigorous proof construction and validation.

This paper proves new bounds on how sums of functions behave in mathematical spaces, showing when certain inequalities hold and when they fail. The authors use a large language model called Grok to help discover intermediate results, demonstrating how AI can assist in mathematical research.

reasoningevaluation

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

May 6, 2026

Nicholas Barnfield, Juno Kim, Eshaan Nichani et al.

Linear memory systems face a fundamental logarithmic penalty for top-1 retrieval but can achieve quadratic capacity if you only need the correct answer ranked highly rather than first—a distinction that matters for building efficient retrieval systems.

This paper analyzes how many key-value pairs a linear memory matrix can store, showing the answer depends on the retrieval task. For winner-take-all retrieval (finding the single best match), capacity scales as d² ≈ n log n due to extreme-value statistics. For listwise retrieval (keeping the correct answer in a top-k list), capacity improves to d² ≈ n.
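
The two scaling relations above can be compared numerically. The sketch below ignores all constant factors and assumes the logarithm is natural, neither of which the summary specifies, so the numbers show only the shape of the gap.

```python
import math


def winner_take_all_capacity(d):
    """Largest n with n * ln(n) <= d**2, found by bisection (constants ignored)."""
    budget = d * d
    lo, hi = 1.0, float(budget)
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * math.log(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def listwise_capacity(d):
    """n with n = d**2 (constants ignored)."""
    return float(d * d)
```

For any dimension `d` of at least 2, `winner_take_all_capacity(d)` comes out smaller than `listwise_capacity(d)` by roughly a factor of `ln(n)`, which is the logarithmic penalty the summary refers to.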

scalingevaluation

Estimating the expected output of wide random MLPs more efficiently than sampling

May 6, 2026

Wilson Wu, Victor Lecomte, Michael Winer et al.

You can estimate a wide MLP's expected output more efficiently than sampling by directly computing activation distributions layer-by-layer using mathematical tools, which is particularly useful for detecting tail risks.

This paper presents a mathematical method to estimate what a randomly initialized neural network will output on average, without actually running data through it. Instead of sampling (the standard approach), the authors use statistical tools like cumulants and Hermite expansions to track how activations behave at each layer.
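
As a toy illustration of computing an expectation analytically instead of by sampling, consider a single ReLU unit with a zero-mean Gaussian pre-activation, where the mean output has the well-known closed form σ/√(2π). This is not the paper's cumulant or Hermite machinery, only the simplest case of the same idea. The function names are hypothetical.

```python
import math
import random


def expected_relu_closed_form(sigma):
    """E[max(0, z)] for z ~ N(0, sigma^2)."""
    return sigma / math.sqrt(2 * math.pi)


def expected_relu_sampled(sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = random.Random(seed)
    total = sum(max(0.0, rng.gauss(0.0, sigma)) for _ in range(n_samples))
    return total / n_samples
```

The closed form costs one arithmetic expression, while the sampled estimate needs many draws and still carries Monte Carlo noise.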

efficiencyevaluationarchitecture

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

May 6, 2026

Alexander Hsu, Zhaiming Shen, Wenjing Liao et al.

Transformer attention can act as a feature learner for nonlinear functions during in-context learning, and this capability can be theoretically analyzed with concrete error bounds—bridging the gap between empirical success and mathematical understanding.

This paper explains how transformers perform in-context learning for nonlinear regression tasks. The researchers show that transformer attention mechanisms can automatically create nonlinear features (like polynomials or splines) from examples in the prompt, enabling the model to solve complex regression problems without updating weights.

reasoningarchitectureevaluation

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

May 6, 2026

Perry E. Radau

LLMs may appear competent on multiple-choice MRI benchmarks but struggle significantly with free-text recall of vendor-specific operational knowledge; multiple-choice scores alone don't indicate readiness for real-world MRI protocol guidance.

This paper introduces MRI-Eval, a benchmark with 1,365 questions testing LLM knowledge of MRI physics and GE scanner operations across three difficulty levels.

evaluationapplications

The First Token Knows: Single-Decode Confidence for Hallucination Detection

May 6, 2026

Mina Gabriel

A single metric based on the model's confidence distribution at the first answer token can reliably detect hallucinations without expensive multi-sample generation, making it a practical baseline for production systems.

This paper shows that checking a language model's confidence on just the first token of an answer can detect hallucinations as well as methods that generate multiple answers and compare them. The approach is faster and simpler, requiring only a single model run instead of repeated sampling.
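
A single-decode confidence check of this kind might look like the sketch below. The summary does not say which statistic the paper uses, so maximum probability, top-2 margin, and entropy are shown here as stand-ins. The threshold value and function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def first_token_confidence(first_token_logits):
    """Summarise the distribution over the first answer token (needs >= 2 logits)."""
    probs = softmax(first_token_logits)
    top = sorted(probs, reverse=True)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {"max_prob": top[0], "margin": top[0] - top[1], "entropy": entropy}


def flag_possible_hallucination(first_token_logits, max_prob_threshold=0.5):
    """Flag an answer when the model is not confident about its first token."""
    conf = first_token_confidence(first_token_logits)
    return conf["max_prob"] < max_prob_threshold
```

In practice the logits would come from the one forward pass that produces the answer, and the threshold would be calibrated on held-out data.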

evaluationefficiency

PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

May 6, 2026

Srikar Kashyap Pulipaka

Per-language fine-tuning with synthetic data augmentation and threshold tuning can significantly improve multilingual NLP tasks, but model generalization to test data varies dramatically—some architectures dropped 30-50% in performance despite strong development results.

This paper describes a system for detecting polarized language across 22 languages using fine-tuned Gemma models with synthetic data augmentation. The approach combines per-language model tuning, LLM-generated synthetic training data with quality filtering, and weighted ensemble predictions to achieve competitive performance on a multilingual classification task.

trainingevaluation

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

May 6, 2026

Chuanzhi Xu, Boyu Wei, Haoxian Zhou et al.

You can now automatically evaluate whether a 3D scene looks visually appealing by analyzing its Gaussian Splatting representation directly, which is faster and cheaper than traditional rendering-based assessment methods.

This paper introduces Aes3D, the first framework for evaluating the visual aesthetics of 3D scenes created with Gaussian Splatting. It includes a new dataset with aesthetic annotations and a lightweight model that directly assesses aesthetic qualities like composition and harmony from 3D Gaussian primitives, without needing to render images.

evaluationmultimodal

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

May 6, 2026

Alper Yıldırım

Transformers for time series don't rely on superposition like they do in language tasks, meaning time series forecasting may not require the compositional complexity that makes Transformers powerful for NLP.

This paper investigates how Transformers work internally for time series forecasting by analyzing their hidden representations using sparse autoencoders. The key finding: Transformers don't need complex, overlapping feature representations (superposition) to forecast well—their representations stay sparse and simple, which explains why basic linear models remain competitive.

reasoningevaluation

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

May 5, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

You can build certified graph classifiers without gradient training by using topology-aware landmark selection and closed-form kernel methods—achieving competitive accuracy with built-in confidence bounds.

PALACE is a method for classifying point clouds and graphs using persistent homology (a topological data analysis technique) with adaptive landmark placement.

evaluationreasoning

Safety and accuracy follow different scaling laws in clinical large language models

May 5, 2026

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa et al.

In clinical AI, safety requires deliberate design choices around evidence quality and retrieval strategy, not just model scaling. A few high-risk errors matter more than average performance.

This paper shows that making clinical AI models bigger or faster doesn't automatically make them safer—safety and accuracy follow different rules. Researchers tested 34 medical AI models and found that high-quality evidence dramatically improved both accuracy and safety, but standard retrieval methods and extra computing power didn't prevent dangerous errors or overconfidence.

safetyevaluationapplications

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

May 5, 2026

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

Agentic red teaming can dramatically speed up security testing of AI systems by automating workflow construction, letting security teams focus on what vulnerabilities to test rather than how to implement each test.

This paper introduces an AI red teaming agent that automates adversarial testing of AI systems. Instead of manually building attack workflows over weeks, operators describe their testing goals in natural language, and the agent automatically selects attacks, applies transformations, and scores results—compressing the process from weeks to hours.

safetyagentsevaluation

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

May 5, 2026

Yilun Zhao, Jinbiao Wei, Tingyu Song et al.

Retrievers for agentic AI systems need to be evaluated and trained differently—they must surface complementary evidence across multiple aspects and search iterations, not just find topically similar passages.

This paper tackles how search systems find evidence for AI agents that need to reason through complex problems. Current retrieval systems just match keywords, but agentic systems need diverse, complementary evidence across multiple search rounds.

evaluationagents

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

May 5, 2026

Joseph Breda, Fadi Yousif, Beszel Hawkins et al.

Structured conversational strategies—where AI systematically interviews patients before diagnosing—significantly outperform unguided chat-based symptom assessment, suggesting that agentic design patterns matter more than raw model capability for medical applications.

Researchers deployed SymptomAI, a conversational AI system for symptom assessment, to nearly 14,000 Fitbit users and found it diagnosed conditions more accurately than independent clinicians reviewing the same conversations.

applicationsagentsevaluation

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

May 5, 2026

Richard J. Young, Alice M. Matthews

Before deploying LLMs in clinical settings, you need model-specific fairness audits using counterfactual testing—demographic parity alone doesn't guarantee fair decisions, and interventions like demographic blinding work differently across models.

Researchers audited five large language models for gender bias in emergency department triage decisions, finding that all models showed concerning flip rates (9.9-43.8%) when patient gender was swapped.
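
A flip rate in this counterfactual sense can be computed as below. The sketch assumes each case is run twice, once as written and once with only the patient's gender swapped. The function name and the use of integer triage levels are illustrative.

```python
def flip_rate(original_decisions, swapped_decisions):
    """Fraction of cases whose decision changes under the counterfactual swap."""
    if len(original_decisions) != len(swapped_decisions):
        raise ValueError("decision lists must be paired one-to-one")
    if not original_decisions:
        raise ValueError("need at least one paired case")
    flips = sum(a != b for a, b in zip(original_decisions, swapped_decisions))
    return flips / len(original_decisions)
```

For example, `flip_rate([2, 3, 1, 4, 2], [2, 2, 1, 4, 3])` is 0.4, because two of the five triage levels change.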

safetyevaluationalignment

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

May 5, 2026

Kishan Athrey, Ramin Pishehvar, Brian Riordan et al.

Automating agent selection in multi-agent systems using retrieval-based matching and LLM re-ranking improves reliability and scalability compared to manual composition, especially when a critique agent validates the full workflow.

This paper presents an automated framework for building multi-agent systems that replaces manual steps with AI-driven composition. It uses an LLM planner to break down user requests into tasks, then automatically selects the best agents from registries using a two-stage retrieval system (fast retriever + LLM re-ranker), with a critique agent validating the entire plan.
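
The two-stage selection described above (fast retriever, then re-ranker) can be sketched as follows. Token-overlap scoring stands in for the fast retriever and a caller-supplied `rerank` function stands in for the LLM re-ranker. The registry layout and all names are hypothetical.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def fast_retrieve(task, registry, k=3):
    """Stage 1: shortlist k agents by Jaccard overlap with the task text."""
    query = tokenize(task)
    scored = []
    for name, description in registry.items():
        doc = tokenize(description)
        union = query | doc
        score = len(query & doc) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


def select_agent(task, registry, rerank, k=3):
    """Stage 2: let a (possibly expensive) scorer pick from the shortlist."""
    shortlist = fast_retrieve(task, registry, k)
    return max(shortlist, key=lambda name: rerank(task, name, registry[name]))
```

Only the shortlist reaches the expensive scorer, which is the point of splitting retrieval into two stages. The critique agent that validates the full plan is outside the scope of this sketch.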

agentsarchitectureevaluation

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

May 5, 2026

Hao Mi, Qiang Sheng, Shaofei Wang et al.

Hallucination detection improves when you combine a model's internal uncertainty signals with its own self-judgments, enforcing that they logically agree—this dual-view approach catches more false claims than either method alone.

This paper tackles hallucination detection in large language models by combining two approaches: analyzing internal neural patterns and extracting explicit self-judgments from the model. The key innovation is a framework that treats these as logically connected signals—if a model says something is true and judges itself as correct, those signals should align.

safetyevaluation

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

May 5, 2026

Mohamed Mady, Johannes Reschke, Björn Schuller

AI-text detectors need feature augmentation and careful threshold calibration to work reliably across different domains and generators; linguistic features like readability are crucial for robustness under distribution shift.

This paper tackles the challenge of detecting AI-generated text across different domains and AI models. Researchers trained transformer-based detectors and found that while they perform nearly perfectly on their training data, they struggle when tested on new domains or text from different AI generators.

evaluationsafetyarchitecture

Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics

May 4, 2026

Bogdan Oancea

Unsupervised learning can detect multivariate anomalies in regional data that traditional single-variable checks miss, helping statistical agencies distinguish between data quality issues and genuine structural divergence.

This paper uses five unsupervised machine learning techniques to detect regions in Europe with unusual combinations of economic and social indicators, rather than just extreme individual values.

evaluationdata

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

May 4, 2026

Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Standard training loss curves can hide poorly-optimized layers in transformers—layer-wise analysis using reference bounds exposes optimization failures that aggregate metrics miss, especially critical for expensive model training.

This paper introduces a method to monitor whether transformer models are actually learning well during training by analyzing each layer individually. Instead of just looking at overall loss, the authors create lightweight reference solutions for each layer and compare them against the trained model, revealing hidden inefficiencies.

trainingevaluationefficiency

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

May 4, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

This approach trades the flexibility of learned models for interpretability and formal guarantees: you get provable error bounds and confidence scores for each prediction, but performance lags behind neural baselines on some datasets due to limited descriptor expressiveness.

PLACE is a method for classifying point clouds and graphs using topological features (persistent homology) with mathematical guarantees.

evaluationreasoning

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

May 4, 2026

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park et al.

Vision-language models perform surprisingly poorly on domain-specific action recognition even in simplified settings, but fine-tuning on domain-specific video data significantly closes the gap.

VideoNet is a new benchmark and dataset for testing how well AI models recognize specific actions in videos across 37 different domains. The researchers found that current vision-language models struggle with domain-specific action recognition—even simple binary choices—and created a 500k video question-answer dataset to improve performance through fine-tuning.

evaluationdatamultimodal

First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint

May 4, 2026

Ziqi Liu, Kiljae Lee, Yuan Zhang et al.

Understanding the shared mathematical structure of value estimation methods enables designing more statistically efficient estimators—EASE reduces mean squared error by jointly optimizing sampling and surrogate functions rather than treating them separately.

This paper explains how to efficiently estimate Shapley values and similar attribution methods that explain AI model decisions. The authors show that different estimation approaches share a common mathematical structure, then use this insight to design a better estimator (EASE) that reduces computational error by optimizing both the sampling strategy and the surrogate function used.
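
For context, the standard permutation-sampling baseline for Shapley values is sketched below. It is not EASE, only the kind of plain Monte Carlo estimator that more efficient methods aim to improve on. `value_fn` and the other names are hypothetical.

```python
import random


def shapley_permutation_estimate(value_fn, players, n_permutations=2000, seed=0):
    """Monte Carlo Shapley estimate via random player orderings.

    value_fn takes a frozenset of players and returns the coalition's value.
    Each player's estimate is its average marginal contribution over orderings.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for player in order:
            coalition.add(player)
            current = value_fn(frozenset(coalition))
            totals[player] += current - previous
            previous = current
    return {p: total / n_permutations for p, total in totals.items()}
```

The estimates always sum to the value of the full coalition minus the value of the empty one, but each individual estimate carries sampling noise that shrinks only as more permutations are drawn.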

evaluationefficiency

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

May 4, 2026

Jiujiu Chen, Yazheng Liu, Sihong Xie et al.

Process reward models need to account for the full context of reasoning paths and penalize risky intermediate steps, not just reward final correctness—this matters most in domains where wrong reasoning paths are costly.

This paper addresses a key problem in evaluating AI reasoning: process reward models often give high scores to flawed reasoning paths because later correct steps mask earlier mistakes. The authors propose SCPRM, which evaluates reasoning steps by looking at what came before and measuring distance to the target, then use it with tree search to answer questions about knowledge graphs.

reasoningevaluationagents

Generating Statistical Charts with Validation-Driven LLM Workflows

May 1, 2026

Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan

Treating chart generation as a multi-step inspectable process with rendered-output validation catches visualization failures that code-only checks miss, and the resulting dataset reveals specific weaknesses in how multimodal LLMs understand charts.

This paper presents a structured workflow for generating statistical charts from data using LLMs, with built-in validation to catch visualization errors before they reach users. The workflow produces 1,500 diverse charts paired with 30,000+ question-answer pairs, revealing that while LLMs excel at reading chart syntax, they struggle with value extraction and reasoning tasks.

evaluationapplicationsdata

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

May 1, 2026

Alfredo Madrid-García, Miguel Rujas

Medical RAG chatbots often expose sensitive backend details and patient data through client-side communication—use server-side security controls and independent audits before deploying patient-facing AI systems.

Researchers audited a patient-facing medical chatbot and found critical security flaws: sensitive system prompts, API endpoints, and 1,000 patient conversations were exposed through basic browser inspection. The study shows how RAG chatbots can leak backend configuration and private health data without authentication, highlighting governance gaps in AI healthcare deployment.

safetyapplicationsevaluation

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

May 1, 2026

Jingxi Pu, Tonghua Liu, Zhilin Guan et al.

You can denoise real clinical CT images without paired training data by using unsupervised learning with perceptual loss, making it practical for hospitals that can't easily create labeled datasets.

This paper tackles noise in low-dose CT scans—a real clinical problem where reducing radiation exposure creates grainy images that are hard for doctors to read.

efficiencyevaluationapplications

GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair

May 1, 2026

Yinhao Xiao, Rongbo Xiao, Yihan Zhang

LLM-generated GIS code can look correct but violate geographic rules; GeoContra's contract-based verification catches these semantic errors before they produce wrong spatial analysis.

GeoContra is a verification and repair system that catches geographic errors in AI-generated GIS code. It checks that spatial analysis preserves coordinate systems, topology, units, and geographic plausibility—catching bugs like negative travel times or mismatched coordinate systems that would otherwise produce executable but wrong results.
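
Two of the checks named above, matching coordinate systems and non-negative travel times, could be written as simple post-conditions. This sketch is illustrative only and does not reflect GeoContra's actual contract format. The function name and inputs are hypothetical.

```python
def check_contracts(layer_a_crs, layer_b_crs, travel_times_min):
    """Return a list of human-readable contract violations (empty if none)."""
    violations = []
    # Spatial operations on two layers only make sense in a shared CRS.
    if layer_a_crs != layer_b_crs:
        violations.append(f"CRS mismatch: {layer_a_crs} vs {layer_b_crs}")
    # A travel time below zero is executable output but geographically impossible.
    negatives = [t for t in travel_times_min if t < 0]
    if negatives:
        violations.append(f"{len(negatives)} negative travel time(s)")
    return violations
```

Code that runs without raising can still fail checks like these, which is the executable-but-wrong case the summary describes.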

evaluationsafetyapplications

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

May 1, 2026

Jacques Raynal, Pierre Slangen, Jacques Margerit

Observable performance metrics can mask fundamentally different internal system organizations—a critical insight for understanding adaptive biological systems where multiple solutions may produce identical outputs.

This study shows that measuring a system's output performance alone doesn't reveal how it's actually organized internally. Using gait analysis in a Parkinson's patient with dental constraints, researchers found that similar-looking movement patterns can come from very different internal system states when examined through dynamical systems and machine learning lenses.

evaluationreasoning

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

May 1, 2026

Scott Friedman, Ruta Wheelock, Sonja Schmer-Galunder et al.

Most sentiment analysis tools miss nuance—they can't detect that a single message contains both praise for one group and criticism for another. This work enables fine-grained tracking of who is being helped, harmed, supported, or opposed in online discourse.

This paper introduces a new method to detect mixed positive and negative sentiments directed at different targets within the same message. Instead of labeling text as simply positive or negative, the approach identifies specific targets (like people or groups) and scores them across three dimensions: advocacy vs. opposition, aid vs. harm, and support vs. victimization.

evaluationdata

Characterizing the Expressivity of Local Attention in Transformers

May 1, 2026

Jiaoda Li, Ryan Cotterell

Local attention isn't just an efficiency trick—it fundamentally expands what a transformer can learn by recognizing different patterns than global attention, and combining both types creates the most powerful model.

This paper explains why local attention (where tokens only look at nearby predecessors instead of all previous tokens) sometimes improves transformer performance. The authors prove that local attention expands what patterns a transformer can recognize, and combining local and global attention together creates the most expressive model.
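
The difference between local and global attention comes down to which positions each token may attend to. The sketch below builds both masks as nested lists. Including the token itself in the window is an assumption, and the function names are illustrative.

```python
def global_causal_mask(seq_len):
    """mask[i][j] is True when query i may attend to key j (all predecessors)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def local_causal_mask(seq_len, window):
    """Like the global mask, but limited to the most recent `window` positions."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With `window=2`, the last row of a length-4 mask is `[False, False, True, True]`. Once `window` reaches the sequence length, the local mask is identical to the global one.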

architecturereasoningevaluation

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

Apr 30, 2026

Lincan Li, Zheng Chen, Yushun Dong

LLMs can effectively refine noisy graph structures in medical signal analysis by identifying and removing redundant connections, improving both seizure detection accuracy and model interpretability.

This paper uses large language models to improve how neural networks analyze EEG brain signals for seizure detection. The key innovation is treating LLMs as 'graph refiners'—they remove unnecessary connections in a graph representation of EEG data, making the model more accurate and interpretable.

architectureevaluation

Strait: Perceiving Priority and Interference in ML Inference Serving

Apr 30, 2026

Haidong Zhao, Nikolaos Georgantas

Accurate latency prediction under GPU contention is critical for priority-aware scheduling in inference serving—Strait reduces deadline violations for high-priority tasks by modeling interference effects that traditional systems ignore.

Strait is an ML inference serving system that improves deadline satisfaction for high-priority requests by better predicting latency under GPU contention and using priority-aware scheduling.

efficiencyevaluation

PhyCo: Learning Controllable Physical Priors for Generative Motion

Apr 30, 2026

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan et al.

You can make generative video models physically consistent by combining physics-labeled training data, ControlNet conditioning on physical properties, and VLM-based reward signals—no simulator needed at runtime.

PhyCo teaches video generation models to respect physics by fine-tuning them on 100K+ realistic simulation videos with varying physical properties (friction, bouncing, deformation), then using a vision-language model to provide physics-aware feedback during generation. This lets models create videos where objects behave realistically without needing a physics simulator at inference time.

training, multimodal, evaluation

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

Apr 30, 2026

Matthias Hertel, Alexandra Nikoltchovska, Sebastian Pütz et al.

You can now explain time series foundation model predictions efficiently using SHAP, making them trustworthy for critical infrastructure like power grids—without sacrificing accuracy or requiring model retraining.

This paper makes time series foundation models (TSFMs) transparent for power grid forecasting by developing an efficient method to compute SHAP explanations. The approach leverages TSFMs' ability to handle variable input lengths and selective masking, enabling scalable explanations without retraining.

applications, evaluation
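As a rough illustration of explaining a forecast by masking covariates, the sketch below computes exact Shapley values for a toy model. It is not the paper's method: the `toy_load_forecast` function, the feature names, and the choice to "mask" a feature by replacing it with a baseline value are all assumptions made for this example. Exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, features, baseline):
    """Exact Shapley values. A feature outside the coalition is masked by
    substituting its baseline value (an assumption of this sketch)."""
    names = list(features)
    n = len(names)

    def value(coalition):
        x = {k: (features[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi

def toy_load_forecast(x):
    """Hypothetical linear load model, for illustration only."""
    return 100 + 3 * x["temperature"] - 20 * x["holiday"]

attributions = shapley_values(
    toy_load_forecast,
    features={"temperature": 10, "holiday": 1},
    baseline={"temperature": 0, "holiday": 0},
)
```

The attributions sum to the difference between the prediction on the real inputs and the prediction on the baseline, which is the property that makes Shapley-style explanations easy to sanity-check.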

On the Proper Treatment of Units in Surprisal Theory

Apr 30, 2026

Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira et al.

When using language models to measure reading difficulty, you must explicitly choose your unit of analysis (word, morpheme, etc.) separately from tokenization—don't let the model's token boundaries dictate your scientific analysis.

This paper clarifies how surprisal theory, which measures human reading difficulty based on word predictability, should handle units of analysis. Language model tokens often do not align with linguistic units (like words), which creates confusion about how surprisal should be calculated. The authors provide a framework to make these choices explicit and consistent.

evaluation
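One way to see the unit-of-analysis issue from the surprisal entry above: by the chain rule, the surprisal of a multi-token unit is the sum of its token surprisals, so the analyst has to decide which tokens belong to which unit. The sketch below uses made-up token probabilities and a hand-written token-to-unit alignment. Both are assumptions for illustration, and the paper's framework may treat boundary cases (such as leading whitespace) differently.

```python
import math

def token_surprisal(prob):
    """Surprisal in bits of a single token with conditional probability `prob`."""
    return -math.log2(prob)

def unit_surprisals(token_probs, unit_ids):
    """Sum token surprisals within each unit of analysis.

    `unit_ids[i]` is the index of the unit (e.g. word) that token i belongs to.
    The alignment is supplied by the analyst, not by the tokenizer.
    """
    totals = {}
    for prob, unit in zip(token_probs, unit_ids):
        totals[unit] = totals.get(unit, 0.0) + token_surprisal(prob)
    return [totals[unit] for unit in sorted(totals)]

# Hypothetical tokenization of "unbelievable story" into four tokens.
tokens = ["un", "believ", "able", " story"]
probs = [0.5, 0.25, 0.5, 0.125]   # made-up conditional probabilities
word_ids = [0, 0, 0, 1]           # first three tokens form one word
per_word = unit_surprisals(probs, word_ids)
```

Changing `word_ids` to a morpheme-level alignment gives different per-unit values from the same token probabilities, which is why the unit has to be chosen explicitly.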

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Apr 30, 2026

Chenxin Li, Zhengyang Tang, Huangxin Lin et al.

Building reliable workflow automation is harder than leaderboard rankings suggest: agents need to be evaluated on what they actually execute, not just on their outputs, and benchmarks must track real-world demand to stay relevant.

Claw-Eval-Live is a benchmark for testing AI agents that automate real-world workflows across software tools and services. Unlike static benchmarks, it updates with real-world demand signals while maintaining reproducible test snapshots.

evaluation, agents, applications

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Apr 30, 2026

Prashant Kulkarni

Multi-turn attacks leave detectable signatures in LLM activations that text-level defenses miss—you can catch covert attacks by monitoring how the model's internal states shift across conversation turns, but detection models don't transfer between different LLM architectures.

This paper detects multi-turn prompt injection attacks by analyzing patterns in a language model's internal activations rather than just the text. The researchers found that adversarial attacks create a distinctive 'restlessness' signature in the model's activation patterns as attackers progress through trust-building, pivoting, and escalation phases.

safety, evaluation

Do Sparse Autoencoders Capture Concept Manifolds?

Apr 30, 2026

Usha Bhalla, Thomas Fel, Can Rager et al.

SAEs don't cleanly capture continuous concept structures—they fragment them across features in ways that hide geometric relationships, suggesting interpretability research needs to look for groups of features rather than individual directions.

Sparse autoencoders (SAEs) are popular tools for finding interpretable features in AI models, but this paper shows they struggle to capture concepts organized as continuous geometric structures (manifolds).

architecture, evaluation

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

Apr 30, 2026

Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma et al.

When transformer models fail silently, DEFault++ can pinpoint exactly which component is broken and why—helping developers fix issues 46% faster than manual debugging.

DEFault++ automatically detects, categorizes, and diagnoses faults in transformer models by analyzing internal component behavior. It identifies 12 types of transformer-specific faults and pinpoints root causes among 45 mechanisms, helping developers fix silent failures that don't trigger runtime errors.

evaluation, safety, architecture

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Apr 30, 2026

Ivan Bercovich

When designing agent benchmarks, treat tasks as adversarial tests rather than helpful prompts; focus on conceptual difficulty over environmental complexity, and rigorously verify that your evaluation logic actually measures what you intend.

This paper provides practical guidelines for designing high-quality benchmark tasks that evaluate AI agents' coding and system-administration abilities.

evaluation, agents

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Apr 30, 2026

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan et al.

LLMs fail at implicit prediction tasks on tables because they don't recognize when a question requires inference from patterns rather than lookup; intent disambiguation is the critical bottleneck.

TopBench is a benchmark for testing how well language models can answer questions about tables that require prediction and reasoning, not just data lookup. It includes 779 examples across tasks like forecasting values, analyzing treatment effects, and complex filtering—revealing that current models struggle to recognize when prediction is needed and often default to simple retrieval instead.

evaluation, reasoning, data

A Unified Framework of Hyperbolic Graph Representation Learning Methods

Apr 30, 2026

Sofía Pérez Casulo, Marcelo Fiori, Bernardo Marenco et al.

Hyperbolic embeddings can represent complex hierarchical networks in low dimensions, and practitioners now have a standardized framework to fairly compare methods and understand their trade-offs before choosing one for their application.

This paper presents a unified framework for hyperbolic graph embedding methods—techniques that represent networks in hyperbolic space to capture hierarchical structures efficiently. The framework consolidates multiple embedding approaches under one interface, enabling fair comparison and reproducible evaluation on real-world networks for tasks like link prediction and node classification.

architecture, evaluation
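For readers unfamiliar with hyperbolic embeddings, the sketch below computes the distance between two points in the Poincaré ball, one common model of hyperbolic space. The summary above does not say which models the framework covers, so treating the Poincaré ball as representative is an assumption, and `poincare_distance` is an illustrative name rather than part of the paper's interface.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between points u and v inside the open unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

origin = (0.0, 0.0)
near_center = (0.5, 0.0)
near_boundary = (0.99, 0.0)
d_center = poincare_distance(origin, near_center)
d_boundary = poincare_distance(origin, near_boundary)
```

Distances grow rapidly as points approach the boundary of the ball. That extra room near the edge is what lets tree-like hierarchies fit in few dimensions: a root can sit near the center and leaves near the boundary.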