ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 14 this month · 12 topics
All · Efficiency (35) · Reasoning (35) · Multimodal (28) · Applications (28) · Evaluation (27) · Training (26) · Architecture (24) · Agents (24) · Safety (13) · Scaling (5) · Data (5) · Alignment (1)

Mar 30 – Apr 5 (16)

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodal · architecture · evaluation

No Single Best Model for Diversity: Learning a Router for Sample Diversity

Apr 2, 2026

Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar et al.

When you need diverse answers to open-ended questions, routing to the best model per query beats using any single model—and you can train a lightweight router to make this selection automatically.

This paper shows that different language models excel at generating diverse answers to open-ended questions, and no single model is best for all prompts. The authors build a router—a small model that predicts which LLM to use for each question—to dynamically select the best model.
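The routing idea can be illustrated with a toy nearest-centroid selector; this is a hypothetical stand-in for the paper's learned router (all names and features here are invented for illustration):

```python
import numpy as np

def route(query_feat, centroids):
    """Nearest-centroid router: send the query to whichever model's
    'good-at-this' centroid is closest in feature space. A toy stand-in
    for a trained router that predicts the best LLM per prompt."""
    names = list(centroids)
    dists = [np.linalg.norm(query_feat - centroids[n]) for n in names]
    return names[int(np.argmin(dists))]
```

A learned router would replace the centroid distance with a small classifier trained on per-model diversity outcomes.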

evaluation

Mar 23 – Mar 29 (17)

PixelSmile: Toward Fine-Grained Facial Expression Editing

Mar 26, 2026

Jiabin Hua, Hengyuan Xu, Aojie Li et al.

Fine-grained facial expression editing is now possible with precise control and identity preservation by disentangling expression semantics through symmetric joint training and contrastive learning.

PixelSmile is a new method for editing facial expressions in images with fine-grained control. It uses a diffusion model trained with a special technique to separate expression changes from identity, allowing smooth blending between different expressions while keeping a person's identity intact.

multimodal · evaluation

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation

Mar 16 – Mar 22 (37)

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Mar 20, 2026

Xinyi Shang, Yi Tang, Jiacheng Cui et al.

Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.

This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.

evaluation · multimodal · safety

Adaptive Greedy Frame Selection for Long Video Understanding

Mar 20, 2026

Yuning Huang, Fengqing Zhu

By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.

This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires too many frames and tokens. The authors propose a smart frame selection method that picks the most important frames by balancing two goals—relevance to the question asked and diversity of visual content—using a greedy algorithm with theoretical guarantees.
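The relevance-plus-diversity greedy step can be sketched as follows, assuming precomputed, unit-normalized frame and question features; the scoring form and the `lam` trade-off are illustrative, not the paper's exact objective:

```python
import numpy as np

def select_frames(frame_feats, query_feat, budget, lam=0.5):
    """Greedily pick `budget` frames, balancing relevance to the question
    against redundancy with frames already selected."""
    relevance = frame_feats @ query_feat  # cosine similarity to the question
    selected = []
    for _ in range(budget):
        if selected:
            # redundancy = max similarity to any already-selected frame
            redundancy = (frame_feats @ frame_feats[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(frame_feats))
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf  # never re-pick a frame
        selected.append(int(score.argmax()))
    return selected
```

With a tight budget, the diversity term keeps the selection from clustering on near-duplicate frames.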

Mar 9 – Mar 15 (14)

Representation Learning for Spatiotemporal Physical Systems

Mar 13, 2026

Helen Qu, Rudy Morel, Michael McCabe et al.

For physics-based machine learning, learning representations in latent space (like JEPAs) works better than optimizing pixel-level predictions, and generic self-supervised methods can be surprisingly effective for scientific tasks.

This paper challenges the standard approach of training physics models to predict the next frame. Instead, it evaluates whether models learn useful representations by testing them on downstream scientific tasks like estimating a system's physical parameters.

evaluation

Semantic Invariance in Agentic AI

Mar 13, 2026

I. de Zarzà, J. de Curtò, Jordi Cabot et al.

Model size doesn't guarantee robustness: smaller models like Qwen3-30B outperform much larger models at maintaining consistent reasoning when problems are rephrased, suggesting that scaling alone won't solve reliability issues for deployed AI agents.

This paper tests whether AI agents give consistent answers when you rephrase the same problem in different ways. The researchers found that larger models are actually less stable than smaller ones—a surprising result that challenges assumptions about model scaling.

evaluation · reasoning

Feb 23 – Mar 1 (16)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Feb 27, 2026

Fan Shu, Yite Wang, Ruofan Wu et al.

LLMs need specialized training data to reliably follow data science workflows; fine-tuning on task-specific benchmarks can improve performance by 8x.

DARE-bench is a benchmark for testing how well AI models can follow data science instructions and complete multi-step ML tasks. It includes 6,300 real Kaggle tasks with verifiable correct answers, making evaluation objective rather than relying on human judges.

evaluation · training · applications

Do LLMs Benefit From Their Own Words?

Feb 27, 2026

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo et al.

You can often remove an LLM's previous responses from conversation history without losing quality, saving memory while sometimes improving accuracy.

This paper tests whether LLMs actually need to see their own previous responses in multi-turn conversations. Surprisingly, removing past assistant responses often doesn't hurt quality and can shrink context by 10x. The researchers found that models sometimes get worse when they over-rely on their own prior outputs, introducing errors that compound across turns.
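The intervention is easy to try in any chat-completions loop; a minimal sketch that drops prior assistant turns while preserving order:

```python
def prune_history(messages):
    """Remove past assistant responses from a multi-turn chat history,
    keeping system and user turns in their original order."""
    return [m for m in messages if m["role"] != "assistant"]
```

In practice you might keep the most recent assistant turn when the user's next message refers back to it.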

applications

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Apr 2, 2026

Sarath Shekkizhar, Romain Cosentino, Adam Earle

Task accuracy and conversational awareness are separate capabilities—a model can answer questions correctly without understanding how users naturally respond to those answers, revealing a blind spot in current LLM evaluation.

This paper reveals that language models can solve tasks correctly without understanding how conversations should naturally continue. Researchers tested this by asking models to generate the next user message after an assistant response—a task that requires understanding interaction flow.

evaluation · reasoning

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Apr 2, 2026

Keerat Guliani, Deepkamal Gill, David Landsman et al.

LLMs can extract structured regulatory rules from legal documents through iterative self-evaluation and repair, achieving 84% preference over prior methods in downstream compliance tasks without human annotation.

De Jure automatically extracts legally binding rules from regulatory documents using LLMs and iterative self-refinement. It converts dense legal text into machine-readable rules through document normalization, semantic decomposition, multi-criteria evaluation, and repair cycles—without requiring human annotation or domain expertise.

applications · reasoning · evaluation

Best-Arm Identification with Noisy Actuation

Apr 2, 2026

Merve Karakas, Osama Hanna, Lin F. Yang et al.

When learning systems communicate over noisy channels, the fundamental limits of error-free communication directly determine how efficiently you can identify the best option in a bandit problem.

This paper tackles a multi-armed bandit problem where a learner must identify the best option (arm) but can only communicate with an agent through a noisy channel. The researchers develop communication strategies that connect to information theory concepts, showing how channel quality affects the ability to find the best arm.

reasoning · evaluation

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Apr 2, 2026

Minda Zhao, Yutong Yang, Chufei Peng et al.

Emotional framing in prompts is a weak, task-dependent signal that rarely helps across the board, but adaptive emotional selection can provide modest, reliable improvements—especially for socially-grounded reasoning tasks.

This paper investigates whether emotional language in prompts affects how well large language models perform on tasks like math, medical reasoning, and reading comprehension. The researchers found that adding emotional framing to prompts produces only small, inconsistent changes in accuracy—except in socially-grounded tasks where emotional context matters more.

evaluation · reasoning

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs

Apr 2, 2026

Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy et al.

Reasoning models can be made safer by detecting when they've misunderstood the question itself—reconstruct what question they answered from their reasoning trace, and abstain if it differs from the original.

This paper tackles a critical problem: getting LLMs to know when to refuse answering questions. The authors discovered that reasoning models often fail at abstention (refusing to answer) because they answer the wrong question rather than answering incorrectly.

reasoning · safety · evaluation

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodal · applications · evaluation

VISTA: Visualization of Token Attribution via Efficient Analysis

Apr 2, 2026

Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P et al.

You can now see which tokens your LLM's predictions actually rely on without doubling GPU memory or being locked into a specific architecture—just remove tokens and measure the impact.

VISTA is a lightweight, model-agnostic technique for visualizing which tokens matter most in LLM predictions.
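The remove-and-measure idea amounts to leave-one-out attribution. A generic sketch, where the `score_fn` interface is a hypothetical stand-in (VISTA's actual procedure is engineered to be more efficient than this naive re-scoring loop):

```python
def ablation_attribution(tokens, score_fn):
    """Score each token by how much deleting it changes the model's
    output score (e.g. the log-probability of the generated answer)."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]
```

Large positive attributions mark tokens the prediction depends on; near-zero values mark tokens the model effectively ignores.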

efficiency · evaluation

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Apr 1, 2026

Piyush Garg, Diana R. Gergel, Andrew E. Shao et al.

For AI weather prediction, the training pipeline (loss function, data, optimization strategy) determines forecast skill far more than architectural choices—and current models have a fundamental blind spot for extreme weather events.

This paper explains why training methods, loss functions, and data matter more than model architecture for AI weather prediction. Using math from approximation theory and dynamical systems, the authors show that how you train a model dominates what model you use, and prove that AI weather models systematically underestimate extreme events. They validate this across ten different AI weather models.

training · evaluation · reasoning

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Apr 1, 2026

Muyu He, Adit Jain, Anand Kumar et al.

Current LLM agents struggle with long-term planning and learning from delayed feedback—only top models like Claude Opus 4.6 succeed, and using scratchpads to persist information across context windows is critical for success.

YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long periods by simulating running a startup for a year. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.

evaluation · agents · reasoning

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models struggle inconsistently with detecting misleading visualizations; their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluation · multimodal · applications

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Apr 1, 2026

Cai Zhou, Zekai Wang, Menghua Wu et al.

ORCA calibrates LLM reasoning in real-time by adapting confidence estimates per input, enabling 40-67% compute savings during inference while providing mathematical guarantees on error rates across different reasoning tasks and domains.

This paper introduces ORCA, a framework that makes language models more efficient during reasoning by calibrating their sampling process. Using test-time training and conformal prediction, ORCA learns to estimate confidence in its own reasoning steps, reducing wasted computation while maintaining accuracy—saving up to 47% compute on in-distribution tasks and 67% on out-of-distribution problems.

reasoning · efficiency · evaluation

Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

Mar 30, 2026

N Alex Cayco Gajic, Arthur Pellegrino

Comparing neural representations by their intrinsic geometric structure—not just their raw values—reveals deeper insights into how different networks solve the same problem, enabling better interpretation of neural computations.

This paper introduces metric similarity analysis (MSA), a new method for comparing how neural networks represent information by analyzing the intrinsic geometry of their learned representations rather than just their surface-level structure.

evaluation

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Mar 30, 2026

Vitória Barin Pacela, Shruti Joshi, Isabela Camacho et al.

Sparse autoencoders fail at compositional generalization because they learn poor concept dictionaries during training, not because of their amortized inference approach—fixing dictionary learning, not inference speed, is the key to interpretable AI.

This paper reveals why sparse autoencoders (SAEs) and linear probes fail to understand compositional concepts in neural networks. The core issue isn't the inference method—it's that SAEs learn dictionaries (concept representations) pointing in the wrong directions.

reasoning · evaluation

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Mar 26, 2026

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Treating geo-localization as a sequential zooming problem over maps, rather than image retrieval, achieves better results and avoids the limitations of contrastive learning approaches that struggle with landmark visibility mismatches.

This paper tackles cross-view geo-localization—matching street-view photos to satellite maps to pinpoint a camera's location without GPS. Instead of the standard approach of comparing images in a shared embedding space, the authors propose a new method that zooms progressively into a satellite map, making sequential decisions to narrow down the location.

reasoning · architecture · evaluation

Comparing Developer and LLM Biases in Code Evaluation

Mar 25, 2026

Aditya Mittal, Ryan Shar, Zichu Wu et al.

LLMs used as code judges have significant blind spots compared to human developers—they systematically misweight code quality factors like explanation length, meaning you can't rely on them alone for code evaluation in real applications.

This paper introduces TRACE, a framework that compares how LLM judges evaluate code against human developer preferences.

evaluation · applications

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Mar 25, 2026

Biplab Pal, Santanu Bhattacharya

Before deploying agentic AI in business processes, measure the 'blind mass' of uncertain state-action pairs and expected oversight costs using event logs—this reveals hidden decision gaps that simple accuracy metrics miss.

This paper develops a mathematical framework to measure when AI agents can safely operate autonomously versus when they need human oversight.

agents · safety · evaluation

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Mar 25, 2026

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur et al.

Better retrieval doesn't guarantee better RAG answers: improving individual components can paradoxically increase confident hallucinations when relevant information isn't in your corpus.

This paper studies retrieval-augmented generation (RAG) systems for answering questions about AI policy documents. The researchers found that improving retrieval quality doesn't always lead to better answers—sometimes better retrieval actually makes the system more confidently wrong when relevant documents are missing.

evaluation · applications

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Mar 25, 2026

Samuel Taiwo, Mohd Amaluddin Yusoff

For enterprise RAG systems with structured documents, preserve document structure when chunking—it improves retrieval quality and reduces costs, but you'll need multimodal AI to handle diagrams and visual content.

This paper tests four different ways to split documents into chunks for RAG systems using oil and gas industry documents. Structure-aware chunking (which respects document layout) works best and costs less than other methods, but all approaches struggle with diagrams and visual content.
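A minimal version of structure-aware chunking: split on the document's own headings, then pack sections into size-limited chunks. Markdown-style `#` headings are an assumption here; enterprise pipelines would typically parse PDF layout instead.

```python
def chunk_by_headings(text, max_chars=500):
    """Split text into sections at heading lines, then pack whole
    sections into chunks no longer than `max_chars`."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks, buf = [], ""
    for s in sections:
        if buf and len(buf) + len(s) + 1 > max_chars:
            chunks.append(buf)
            buf = ""
        buf = buf + "\n" + s if buf else s
    if buf:
        chunks.append(buf)
    return chunks
```

Because sections are never split mid-heading, each chunk keeps its local context, which is what the structure-aware strategy exploits.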

evaluation · applications

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Mar 24, 2026

Ufaq Khan, Umair Nawaz, L D M S S Teja et al.

Medical VLMs need explicit training on input validation (checking modality, anatomy, orientation) as a separate safety step before diagnosis, not as an afterthought—current models hallucinate plausible reports even on obviously invalid inputs.

This paper reveals a critical blind spot in medical AI: vision-language models can generate fluent medical reports even when given invalid inputs like wrong body parts or upside-down images. MedObvious is a benchmark of 1,880 tasks testing whether models can catch these basic sanity checks before attempting diagnosis—a step human radiologists do automatically but VLMs currently fail at.

safety · evaluation · multimodal

Failure of contextual invariance in gender inference with large language models

Mar 24, 2026

Sagar Kumar, Ariel Flint, Luca Maria Aiello et al.

LLM outputs are unstable across contextually equivalent formulations of the same task, meaning benchmark results may not reflect how models actually behave in real applications—a critical issue for bias testing and high-stakes use.

This paper reveals that large language models fail to give consistent outputs when tasks are reformulated in contextually equivalent ways.

evaluation · safety

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Mar 24, 2026

Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

Using structured prompting formats (PEGS) with multiple LLM providers significantly improves requirements extraction accuracy (F1: 0.88 vs 0.71) and provides built-in reliability through model consensus and fallback mechanisms.

ReqFusion automates software requirements extraction and classification by combining multiple LLM providers (GPT, Claude, Groq) with a structured PEGS format prompt. The system processes various document types and achieves 88% accuracy, reducing manual analysis time by 78% while ensuring consistent requirement categorization across academic, industrial, and business contexts.

applications · evaluation

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.

safety · data · evaluation

Code Review Agent Benchmark

Mar 24, 2026

Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf et al.

Code review agents currently miss most issues that human reviewers catch, but they often flag different problems—creating opportunities for AI-assisted rather than AI-automated code review in real teams.

This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether code review agents catch the same issues humans do.

evaluation · agents · applications

Evaluating LLM-Based Test Generation Under Software Evolution

Mar 24, 2026

Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar

LLM-generated tests work well on original code but fail to adapt to program changes, indicating they learn superficial patterns rather than genuine program semantics—a critical weakness for real-world software maintenance.

This study tests whether LLMs actually understand program behavior when generating unit tests, or just memorize patterns. Researchers mutated 22,374 programs and found that while LLMs generate good tests initially (79% coverage), they fail badly when code changes—missing 34% of bugs and struggling even when code is refactored without changing functionality.

evaluation

Mecha-nudges for Machines

Mar 24, 2026

Giulio Frey, Kawin Ethayarajh

As AI agents make more real-world decisions, the way information is presented can be optimized for machines just like it is for humans—and this is already happening in practice on platforms like Etsy.

This paper introduces 'mecha-nudges'—subtle changes to how information is presented that influence AI agents' decisions without restricting options or harming human decision-making.

agents · alignment · evaluation

WorldCache: Content-Aware Caching for Accelerated Video World Models

Mar 23, 2026

Umair Nawaz, Ahmed Heakl, Ufaq Khan et al.

Smart feature caching with motion awareness can dramatically accelerate video world models without retraining, but requires adaptive thresholds and blending rather than static feature reuse.

WorldCache speeds up video generation from diffusion transformers by intelligently reusing computed features across denoising steps. Instead of naively reusing old features, it adapts based on motion and visual importance, using blending and warping to keep videos smooth and artifact-free—achieving 2.3× speedup with minimal quality loss.
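The core caching loop can be sketched as follows: reuse the expensive block's output while the latent barely moves. This is a hard-swap sketch; the paper instead adapts thresholds to motion and blends/warps cached features rather than swapping them wholesale.

```python
import numpy as np

def run_with_cache(latents, heavy_fn, thresh=0.05):
    """Run an expensive per-step function over denoising latents, reusing
    its cached output whenever the latent's relative change since the
    last computed step stays below `thresh`."""
    cached_out, last_latent, outs, calls = None, None, [], 0
    for z in latents:
        changed = (last_latent is None or
                   np.linalg.norm(z - last_latent) > thresh * np.linalg.norm(z))
        if changed:
            cached_out = heavy_fn(z)  # stands in for the heavy DiT block
            last_latent = z
            calls += 1
        outs.append(cached_out)
    return outs, calls
```

The returned call count is the quantity a caching scheme like this trades against quality.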

efficiency · architecture · evaluation

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Mar 23, 2026

Kelly Cui, Nikhil Prakash, Ayush Raina et al.

Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves spatial understanding tasks.

This paper reveals how vision-language models handle spatial reasoning—understanding where objects are and how they relate to each other. The researchers found that VLMs use two mechanisms: the language model processes spatial relations independently, but the vision encoder is actually the dominant source, encoding object layouts across the entire image including background areas.

multimodal · reasoning · evaluation

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Mar 23, 2026

Sashuai Zhou, Qiang Zhou, Junpeng Ma et al.

Fine-grained spatial accuracy in generated images requires explicit spatial reward modeling during training; rule-based spatial checks alone miss complex relationships that vision-language models with grounding can catch.

SpatialReward is a reward model that helps text-to-image AI systems generate images with accurate object positioning and spatial relationships. It breaks down image prompts into specific spatial requirements, uses object detection to verify positions, and applies reasoning to check complex spatial relationships—then feeds this feedback into training to improve image generation quality.

evaluation · multimodal · training

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Mar 20, 2026

Richard J. Young

Published faithfulness scores for AI reasoning are not comparable across studies because different evaluation methods measure different aspects of the same behavior at different strictness levels—always check the methodology, not just the number.

This paper shows that measuring whether AI models are 'faithful' (honestly using their reasoning) isn't objective—different evaluation methods on the same data produce wildly different results (69.7% to 82.6% faithfulness for identical models).

evaluation · reasoning · alignment

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Mar 20, 2026

Sai Koneru, Elphin Joe, Christine Kirchhoff et al.

Instruction-tuned models are vulnerable to user pressure even with strong evidence present; simply providing richer context doesn't guarantee models will resist sycophancy without explicit training for epistemic integrity.

This paper tests how well instruction-tuned language models stick to evidence when users pressure them to agree with false claims. Using climate science as a test domain, researchers found that adding more detailed evidence doesn't reliably prevent models from abandoning facts to please users—especially when evidence includes research gaps or uncertainty.

evaluation · alignment · safety

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Mar 20, 2026

Qi Cao, Andrew Gambardella, Takeshi Kojima et al.

You can measure LLM uncertainty efficiently with just one forward pass by clustering semantically similar tokens, avoiding the computational cost of sampling-based or auxiliary model approaches.

This paper proposes Semantic Token Clustering (STC), a fast method to measure how confident an LLM should be in its answers. Instead of running the model multiple times or using extra models, STC groups similar tokens together and checks if the model's top prediction comes from a coherent semantic cluster. It works in a single pass and catches cases where models are overconfident.
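A simplified sketch of the clustering idea: take the probability mass of tokens semantically close to the argmax token as the model's confidence. The cosine threshold `tau` and the single-cluster grouping are illustrative simplifications, not the paper's exact method.

```python
import numpy as np

def cluster_confidence(probs, embs, tau=0.8):
    """Single-pass confidence: probability mass of tokens whose unit
    embeddings have cosine similarity >= tau with the argmax token.

    probs: (V,) next-token probabilities; embs: (V, d) unit embeddings.
    """
    top = int(np.argmax(probs))
    sims = embs @ embs[top]
    return float(probs[sims >= tau].sum())
```

A low cluster mass flags cases where the model's probability is spread over semantically unrelated continuations, i.e. likely overconfidence in the raw top-1 probability.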

efficiency · evaluation

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Mar 19, 2026

Huaide Jiang, Yash Chaudhary, Yuping Wang et al.

Embodied navigation systems perform well in clean lab conditions but fail dramatically in real-world scenarios with sensor noise and unclear instructions—this benchmark exposes those gaps and provides mitigation strategies.

NavTrust is a benchmark that tests how well navigation AI systems handle real-world problems like blurry images, sensor noise, and unclear instructions. The researchers tested seven state-of-the-art systems and found they all struggle significantly when inputs are corrupted, then demonstrated four strategies to make them more robust.

evaluation · safety · agents

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Mar 19, 2026

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan et al.

LLMs can reason about financial fundamentals with retrieval help, but struggle significantly with trading signals and time-series patterns—a critical gap for real-world financial decision-making.

FinTradeBench is a benchmark with 1,400 questions testing how well AI models reason about financial decisions by combining company fundamentals (from financial reports) and trading signals (from stock price patterns). The benchmark reveals that current AI models struggle with numerical reasoning and time-series data, even when given access to relevant information.

evaluation · reasoning · applications

R-equivalence on Cubic Surfaces I: Existing Cases with Non-Trivial Universal Equivalence

Mar 19, 2026

Dimitri Kanevsky, Julian Salazar, Matt Harvey

R-equivalence on certain cubic surfaces is either trivial or has exponent 2, settling Manin's 1972 question about the diagonal cubic—and this work demonstrates how AI can assist in formal mathematical reasoning.

This paper studies R-equivalence on cubic surfaces over p-adic fields, proving it's trivial or has exponent 2 for surfaces with all-Eckardt reductions. The authors resolve a 50-year-old question about a specific diagonal cubic and use AI models to assist with proofs and lemma verification.

reasoning · evaluation

Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

Mar 19, 2026

Julian Allagan, Mohamed Elbakary, Zohreh Safari et al.

Phishing detector robustness is fundamentally limited by feature economics—the cost of realistic website modifications—not by model architecture. Attackers can reliably evade detection by exploiting cheap feature changes, making feature design more critical than model choice.

This paper reveals a critical weakness in phishing detection systems: while machine learning models achieve near-perfect accuracy in testing, attackers can easily evade them by making cheap, realistic changes to websites.

safety evaluation

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Mar 19, 2026

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.

An LLM's text-only auditory knowledge is a strong predictor of audio-task performance—so you can estimate how well an audio-language model will work by probing its LLM backbone before building it.

This paper investigates how much knowledge about sound LLMs actually acquire from text-only training, and whether that knowledge predicts how well they work in audio tasks. Researchers tested different LLMs three ways: directly probing their audio knowledge, having them reason about audio descriptions, and fine-tuning them into full audio-language models.

evaluation multimodal training

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Mar 19, 2026

Zehao Li, Zhenyu Wu, Yibo Zhao et al.

Breaking reward evaluation into smaller, verifiable steps with multiple reviewers produces more reliable feedback for training GUI agents, improving task success by 10% in online learning scenarios.

OS-Themis is a reward evaluation system for GUI agents that breaks down task trajectories into verifiable milestones and uses multiple reviewers to judge whether agents completed tasks correctly. This approach improves both the accuracy of reward signals and the performance of agents trained with reinforcement learning on mobile and desktop interfaces.

agents evaluation training

Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment

Mar 19, 2026

Amir Asiaee, Samhita Pal

When combining RCT and observational data with different measured variables, learning a shared embedding space and calibrating predictions outperforms traditional imputation methods, especially for detecting non-linear treatment effects.

This paper solves a practical problem in medical research: combining data from randomized trials (which prove causation but have small samples) with observational studies (which have large samples but measure different variables).

evaluation

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Mar 19, 2026

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data from diffusion models may not be as privacy-safe as assumed—membership inference attacks can still reveal whether specific records were in the training data, even with synthetic tabular outputs.

This challenge evaluates how well synthetic tabular data generated by diffusion models protects privacy against membership inference attacks. Researchers tested whether synthetic data truly hides information about individuals in the original dataset, developing new attack methods to measure privacy risks across different types of tabular data structures.

safety evaluation data

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Mar 19, 2026

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari et al.

Instead of comparing kernels to other software implementations, this benchmark measures how close optimized kernels get to theoretical hardware limits—giving AI systems a clear, unchanging target for optimization rather than a moving baseline.

SOL-ExecBench is a benchmark for evaluating GPU kernel optimization that measures performance against hardware limits rather than software baselines. It includes 235 CUDA kernels from real AI models and uses analytically derived 'Speed-of-Light' bounds to create fixed optimization targets, enabling fair evaluation of AI systems that generate and optimize code.
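The "Speed-of-Light" idea can be pictured as a roofline-style bound: a kernel can run no faster than the binding of its compute and memory-bandwidth limits. A minimal sketch, assuming illustrative peak numbers (roughly H100-class) and a hypothetical `speed_of_light_fraction` helper; the benchmark itself derives its bounds analytically per kernel:

```python
def speed_of_light_fraction(flops, bytes_moved, measured_s,
                            peak_flops=989e12, peak_bw=3.35e12):
    """Roofline-style 'Speed-of-Light' bound for one kernel.

    The fastest a kernel can possibly run is limited by either compute
    throughput or memory bandwidth, whichever binds.
    """
    bound_s = max(flops / peak_flops, bytes_moved / peak_bw)
    return bound_s / measured_s  # 1.0 means the kernel hits the hardware bound

# A memory-bound elementwise kernel: 1 GFLOP, 24 GB moved, measured at 9 ms.
print(speed_of_light_fraction(1e9, 24e9, 9e-3))
```

Because the bound depends only on hardware and the kernel's work, it is a fixed target: the fraction never changes just because a faster software baseline appears.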

evaluation efficiency agents

Evaluating Counterfactual Strategic Reasoning in Large Language Models

Mar 19, 2026

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou et al.

LLMs perform well on familiar games but fail when payoff structures change, suggesting they rely on memorized patterns rather than understanding underlying strategic principles.

This paper tests whether large language models can genuinely reason about game theory or just memorize patterns. Researchers created modified versions of classic games (Prisoner's Dilemma and Rock-Paper-Scissors) with different payoffs and labels to see if LLMs could adapt their strategy.

reasoning evaluation

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

Mar 19, 2026

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.

D5P4 enables discrete diffusion models to generate diverse text outputs efficiently by using a principled diversity mechanism during decoding, with minimal computational overhead compared to standard approaches.

This paper improves how discrete diffusion models generate text by introducing D5P4, a new decoding method that generates multiple candidate outputs in parallel while controlling diversity.

efficiency architecture evaluation

SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data

Mar 19, 2026

Mingxing Zhang, Nicola Rossberg, Simone Innocente et al.

For spectroscopy and similar high-dimensional data, combining PCA with SHAP explanations lets you understand model decisions in terms of the original measurements—critical for clinical adoption where trust and interpretability matter.

SHAPCA combines dimensionality reduction and explainability techniques to make machine learning predictions on spectroscopy data interpretable and trustworthy. It maps explanations back to the original spectral bands rather than abstract features, helping clinicians and researchers understand why models make specific predictions on high-dimensional, correlated data.
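One plausible mechanic for mapping explanations back to spectral bands is to push component-level attributions back through the PCA loadings. A toy sketch, where the data, the `phi_pc` stand-ins for SHAP values, and the loading-based mapping are all illustrative assumptions rather than the paper's exact method:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 spectra, 50 spectral bands
Xc = X - X.mean(axis=0)

# PCA via SVD; keep k components as the model's reduced features.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
scores = Xc @ Vt[:k].T                # the model would be trained on these

# Stand-ins for SHAP attributions computed on the k reduced features.
phi_pc = np.array([0.8, -0.3, 0.1, 0.0, 0.05])

# Push component-level attributions back through the loadings so each
# original spectral band receives a contribution.
phi_bands = Vt[:k].T @ phi_pc
print(phi_bands.shape)                # one attribution per band
```

The point of the back-mapping is that clinicians see contributions in wavelengths they can physically interpret, not in abstract principal components.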

evaluation applications

Implicit Patterns in LLM-Based Binary Analysis

Mar 19, 2026

Qiang Li, XiangRui Zhang, Haining Wang

LLM-based binary analysis isn't random exploration—models implicitly develop structured reasoning patterns that organize their search process, which can be measured and potentially improved for more reliable vulnerability detection.

This paper analyzes how large language models perform binary vulnerability analysis across hundreds of reasoning steps. Researchers studied 521 binaries and discovered that LLMs implicitly develop four structured patterns—early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization—that organize their exploration without explicit programming.

reasoning evaluation applications

From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Mar 19, 2026

Zhuofan Li, Hongkun Yang, Zhenyang Chen et al.

When building embodied AI systems, measure what actually matters: task completion time, motion quality, and energy use—not just model size or inference speed. Optimizing the wrong metrics can make robots perform worse in practice.

This paper shows that traditional efficiency metrics (parameters, computation) for vision-language-action robots don't match real-world performance. The researchers measured actual robotic execution—task time, motion smoothness, energy use—and found that methods optimizing for conventional metrics often make robots move worse or take longer, even when task success stays the same.

efficiency evaluation applications

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Mar 19, 2026

Maksym Del, Markus Kängsepp, Marharyta Domnich et al.

For deploying reasoning models safely, combining verbalized confidence with self-consistency gives the best uncertainty estimates with minimal computational cost, but effectiveness varies significantly across domains like math versus humanities.

This paper studies how well reasoning language models can estimate their own uncertainty by sampling multiple responses and analyzing confidence signals.
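As a sketch of how the two signals might be combined (the multiplicative scoring rule and the `combined_confidence` helper are illustrative assumptions; the paper evaluates several combinations):

```python
from collections import Counter

def combined_confidence(samples):
    """samples: list of (answer, verbalized_confidence in [0, 1]).

    Self-consistency signal: vote share of the majority answer.
    Verbalized signal: mean stated confidence for that answer.
    Multiplying the two is one simple way to combine them.
    """
    votes = Counter(answer for answer, _ in samples)
    answer, count = votes.most_common(1)[0]
    vote_share = count / len(samples)
    confs = [c for a, c in samples if a == answer]
    return answer, vote_share * sum(confs) / len(confs)

# Four sampled responses to the same question.
ans, score = combined_confidence([("42", 0.9), ("42", 0.8),
                                  ("41", 0.6), ("42", 0.85)])
print(ans, score)
```

The appeal is cost: a handful of samples already yields both signals, with no extra model calls beyond the sampling itself.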

evaluation reasoning safety

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Mar 19, 2026

Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Vision-language models' safety decisions are easily manipulated by semantic cues—they rely on learned associations rather than grounded reasoning about actual danger, which is a critical vulnerability for real-world deployment.

This paper reveals that vision-language models make safety decisions based on surface-level visual and textual cues rather than genuine understanding of dangerous situations. Researchers created a benchmark and steering framework showing that simple changes to how a scene is described or presented can flip safety judgments, exposing a vulnerability in how these models assess risk.

safety multimodal evaluation

Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Mar 19, 2026

Qiawen Ella Liu, Marina Dubova, Henry Conklin et al.

LLMs are already highly creative at generating novel ideas, but they don't benefit from the same creative prompting techniques that help humans think outside the box through forced analogies.

Researchers tested whether cross-domain mapping—forcing creators to draw inspiration from random, unrelated sources—boosts creativity in both humans and LLMs. Humans benefited significantly from this technique, but LLMs showed no consistent improvement, though both systems generated more creative ideas when the source domain was more distant from the target.

evaluation reasoning applications

A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Mar 19, 2026

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.

Automated health literacy detection from clinical notes is now possible with HEALIX, a curated dataset that could help clinicians identify patients needing extra support without adding screening burden.

Researchers created HEALIX, the first public dataset of 589 clinical notes annotated for patient health literacy levels (low, normal, high). Health literacy—a patient's ability to understand medical information—affects treatment outcomes, but current screening tools are impractical.

data applications evaluation

Parallelograms Strike Back: LLMs Generate Better Analogies than People

Mar 19, 2026

Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu et al.

LLMs generate more structurally consistent analogies than humans by better preserving relational patterns in embedding space—suggesting the parallelogram model is sound, but humans are inconsistent analogy-makers.

This paper compares how humans and LLMs generate word analogies (A:B::C:D problems). While previous research suggested the geometric "parallelogram" model poorly explains human analogies, this work shows LLMs actually produce better analogies that align more closely with the parallelogram structure.

reasoning evaluation

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Mar 18, 2026

Pepe Alonso

For AI agents writing code, showing them which tests to check matters more than telling them to follow test-driven development procedures—context beats process.

TDAD is a tool that helps AI coding agents avoid breaking existing tests when fixing bugs. It uses code analysis to identify which tests might be affected by changes, then guides the agent to verify those specific tests before submitting fixes. Testing on real-world code shows it cuts regressions by 70% and improves fix success rates.
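Graph-based impact analysis of this kind can be sketched as a reverse reachability query over a call graph. The toy graph and the `impacted_tests` helper below are hypothetical, not TDAD's actual implementation:

```python
from collections import defaultdict, deque

# Hypothetical static call graph: caller -> callees.
calls = {
    "test_checkout": ["checkout", "cart_total"],
    "test_login":    ["login"],
    "checkout":      ["cart_total"],
}

def impacted_tests(changed, graph):
    """Tests that transitively depend on any changed function."""
    callers = defaultdict(set)            # invert: callee -> callers
    for fn, callees in graph.items():
        for callee in callees:
            callers[callee].add(fn)
    seen, queue = set(), deque(changed)
    while queue:
        for caller in callers[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(t for t in seen if t.startswith("test_"))

print(impacted_tests(["cart_total"], calls))
```

Handing the agent this short list of affected tests is the "context beats process" point: it narrows verification to what the change can actually break.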

agents evaluation

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Mar 18, 2026

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

Machine translation systems have systematic gender bias—they default to masculine forms when translating from English to gendered languages. This paper provides annotation guidelines and a benchmark dataset to measure and fix this problem.

This paper introduces ConGA, a framework for annotating gender in machine translation to address how systems handle gender when translating from gender-neutral languages (like English) to gendered ones (like Italian).

data evaluation alignment

Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Mar 18, 2026

Chiara Manna, Hosein Mohebbi, Afra Alishahi et al.

Decoder-only language models show similar gender bias problems as smaller models in translation tasks, but instruction tuning can reduce masculine bias and improve context awareness.

This paper examines how large language models handle gender in machine translation, where languages differ in how they mark gender. The researchers introduce a new measurement called "Prior Bias" to capture what gender a model assumes by default, and test decoder-only models (like GPT-style architectures) against traditional encoder-decoder models.

evaluation safety alignment

Only relative ranks matter in weight-clustered large language models

Mar 18, 2026

Borja Aizpurua, Sukhbinder Singh, Román Orús

LLM weights can be compressed to just 16-64 unique values per matrix without retraining by preserving relative rank order, enabling simple disk compression and revealing that rank structure—not magnitude—is what drives model behavior.

This paper shows that LLMs don't need exact weight values—only the relative ordering of weights matters. By clustering weights into 16-64 shared values per matrix, the authors compress models like Llama 3.1-8B without retraining. They prove this by scrambling weight values while preserving rank order, finding that rank matters far more than precise magnitudes for model performance.
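A minimal sketch of the rank-preservation idea, using contiguous rank buckets as a stand-in for whatever clustering the authors actually apply (the `cluster_weights` helper and bucket scheme are illustrative):

```python
import numpy as np

def cluster_weights(w, k=16):
    """Map a weight matrix onto k shared values, preserving rank order.

    Buckets are contiguous ranges of the sorted weights, so the map
    from original value to shared value is monotone: relative ranks
    survive even though only k distinct magnitudes remain.
    """
    flat = w.ravel()
    out = np.empty_like(flat)
    for bucket in np.array_split(np.argsort(flat), k):
        out[bucket] = flat[bucket].mean()   # one shared value per bucket
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
Wq = cluster_weights(W, k=16)
print(len(np.unique(Wq)))                   # at most 16 distinct values
```

With only 16-64 distinct values per matrix, each weight needs just a few bits plus a small codebook, which is what makes the simple disk compression possible.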

efficiency evaluation

Demystifying Video Reasoning

Mar 17, 2026

Ruisi Wang, Zhongang Cai, Fanyi Pu et al.

Video models reason through iterative refinement across denoising steps (not frame-by-frame), exploring candidate solutions early and converging later—a mechanism you can exploit by ensembling outputs from different random seeds.

This paper reveals how video diffusion models actually perform reasoning—not by processing frames sequentially, but by exploring multiple solutions across denoising steps and converging to answers.

reasoning architecture evaluation

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Mar 17, 2026

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati et al.

For robotics and animation applications, reconstructing cluttered scenes requires not just identifying individual 3D objects but ensuring they physically interact correctly—this work provides both a benchmark dataset and a method that achieves this.

This paper tackles 3D scene reconstruction from single images by introducing MessyKitchens, a dataset of cluttered real-world kitchen scenes with precise object shapes, poses, and contact information.

evaluation multimodal applications

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Mar 17, 2026

Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.

Models that accurately perceive audio-visual information often fail at generating contextually appropriate conversational responses, showing that perception and interaction are separate skills that need independent evaluation.

SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics—specifically, identifying who's speaking, knowing when to interrupt, and generating natural interruptions. Testing 12 leading models reveals that understanding what's happening in a conversation doesn't automatically translate to responding appropriately in real dialogue.

evaluation multimodal agents

Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Mar 17, 2026

Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler et al.

Incorporating incident severity signals and dynamic road relationships into spatio-temporal models significantly improves long-horizon traffic predictions with calibrated confidence intervals—practical for real-world transportation planning.

This paper improves traffic forecasting by using a Transformer model that understands both spatial patterns (how traffic flows across roads) and temporal patterns (how it changes over time), while accounting for incidents like crashes.

reasoning evaluation applications

Mediocrity is the key for LLM as a Judge Anchor Selection

Mar 17, 2026

Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen et al.

When using LLM-as-a-judge for evaluation, avoid using the best or worst model as your anchor—choose a mediocre one instead. Anchor selection matters as much as which judge model you pick, and most benchmarks are too small to reliably compare competitive models.

This paper reveals that choosing the right reference model (anchor) for LLM-as-a-judge evaluation is critical but overlooked. The researchers tested 22 different anchors and found that extreme choices—the best or worst models—actually make poor anchors because they don't help distinguish between similar models.
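One way to operationalize "mediocre" is to prefer the anchor whose comparisons are least saturated. The `pick_anchor` rule and the win-rate numbers below are hypothetical illustrations in the spirit of the finding, not the paper's procedure:

```python
def pick_anchor(win_rates):
    """win_rates[anchor][model]: how often `model` beats `anchor`.

    Prefer the anchor whose outcomes are least saturated, i.e. whose
    win rates against the candidate pool sit closest to 50%.
    """
    def saturation(anchor):
        rates = win_rates[anchor].values()
        return sum(abs(r - 0.5) for r in rates) / len(rates)
    return min(win_rates, key=saturation)

win_rates = {
    "best_model":  {"m1": 0.05, "m2": 0.08, "m3": 0.10},  # everyone loses to it
    "worst_model": {"m1": 0.97, "m2": 0.95, "m3": 0.99},  # everyone beats it
    "mid_model":   {"m1": 0.45, "m2": 0.55, "m3": 0.60},  # informative contrasts
}
print(pick_anchor(win_rates))
```

An extreme anchor pushes every comparison to the same outcome, so it carries almost no information about which candidate is better; a mid-strength anchor keeps the comparisons informative.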

evaluation

HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Mar 16, 2026

Erik Y. Wang, Sumeet Motwani, James V. Roggeveen et al.

AI systems can now potentially contribute novel mathematical insights on real unsolved problems, but we need better benchmarks to measure this—HorizonMath provides one by focusing on problems where verification is cheap but discovery is genuinely hard.

HorizonMath is a benchmark of 100+ unsolved math problems across 8 domains designed to test whether AI can make genuine mathematical discoveries. Unlike existing benchmarks, it focuses on problems that are hard to solve but easy to verify automatically, avoiding data contamination issues. Early results show GPT-5.4 Pro found solutions to two problems that may improve on published results.

evaluation reasoning

Do Metrics for Counterfactual Explanations Align with User Perception?

Mar 16, 2026

Felix Liedeker, Basil Ell, Philipp Cimiano et al.

Standard metrics for evaluating counterfactual explanations don't align with human judgment—developers need human-centered evaluation methods, not just algorithmic scores, to build truly trustworthy AI systems.

This study compares how AI systems measure counterfactual explanations (showing what would need to change for a different prediction) against how humans actually judge them. Researchers found that standard algorithmic metrics poorly predict human satisfaction, suggesting current evaluation methods miss what users actually care about in explanations.

evaluation safety alignment

Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models

Mar 16, 2026

Jay Sarkar, Vamsi Pavan Rayaprolu, Abhijeet Bhalerao

Using interpretable ML to co-design storage hardware and firmware together—rather than separately—helps engineers make better architectural decisions by understanding how memory, error handling, and workloads interact.

This paper describes how machine learning can optimize the design of solid-state drives (SSDs) by modeling how error management algorithms interact with memory components under different workloads. The researchers built an interpretable ML framework that analyzes thousands of real SSDs to guide hardware design decisions, enabling better performance and reliability trade-offs.

architecture efficiency evaluation

Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Mar 16, 2026

Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli et al.

For building staged tree models at scale, use Total Variation divergence with Ward.D2 hierarchical clustering—it matches the accuracy of slower methods like Backward Hill Climbing but runs significantly faster.

This paper presents a new method for building staged tree models—a type of probabilistic graphical model that captures context-specific patterns in data. The approach uses hierarchical clustering on probability distributions, comparing different distance metrics and clustering strategies.
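The recommended combination can be sketched with SciPy, whose "ward" linkage plays the role of R's Ward.D2; applying it to Total Variation distances (rather than Euclidean ones) mirrors the paper's setup, and the toy distributions are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy conditional distributions: each row lives on the probability simplex.
P = np.array([
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],   # close to row 0: should share a stage
    [0.10, 0.30, 0.60],
    [0.12, 0.28, 0.60],   # close to row 2
])

# Total Variation distance is half the L1 distance between distributions.
tv = 0.5 * pdist(P, metric="cityblock")

# Hierarchical clustering over the TV distances; cut into two stages.
stages = fcluster(linkage(tv, method="ward"), t=2, criterion="maxclust")
print(stages)
```

Vertices whose conditional distributions land in the same cluster are merged into one stage, which is what gives the staged tree its context-specific structure.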

training efficiency evaluation

Developing and evaluating a chatbot to support maternal health care

Mar 13, 2026

Smriti Jha, Vidhi Jain, Jianyu Xu et al.

Deploying medical chatbots in low-resource, multilingual settings requires multiple layers of safety (triage, retrieval, generation) and multi-method evaluation—no single model or test is sufficient for trustworthy healthcare AI.

Researchers built a phone-based chatbot to answer maternal health questions in India, where users often have limited health literacy and speak multiple languages. The system combines triage (routing urgent cases to experts), retrieval of curated health guidelines, and AI-generated responses.

safety applications evaluation

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Mar 13, 2026

Siqi Sun, Ben Peng Wu, Mali Jin et al.

Chain-of-thought reasoning substantially reduces hallucinations in LLMs analyzing long, complex documents—a critical capability for compliance and legal applications where accuracy is non-negotiable.

ESG-Bench is a benchmark dataset for testing how well AI models understand long corporate ESG (environmental, social, governance) reports and avoid making up false information. The dataset contains real ESG reports paired with human-verified question-answer pairs, letting researchers measure when models hallucinate versus when they accurately extract facts.

evaluation safety

Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study

Mar 13, 2026

Zhiye Jin, Yibai Li, K. D. Joshi et al.

LLM evaluation can be more rigorous by borrowing established methods from psychology and cognitive science—this platform shows how to systematically apply those methods at scale.

Researchers built PsyCogMetrics AI Lab, a cloud platform that applies psychology and cognitive science methods to evaluate large language models. The study uses a rigorous three-phase design process to identify evaluation gaps, develop theory-based assessment methods, and test them in practice.

evaluation reasoning

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodal data evaluation

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Mar 12, 2026

Yixin Liu, Yue Yu, DiJia Su et al.

Reasoning judges are more robust than standard judges for training AI systems, but they're not foolproof—AI policies can still learn to generate adversarial outputs that fool judges while appearing good on benchmarks.

This paper tests whether reasoning-focused language models can reliably judge AI outputs in areas where correctness is hard to verify (like essay quality or creative writing). The researchers found that reasoning judges perform better than standard judges on benchmarks, but they can still be tricked into rewarding outputs that game the system rather than genuinely improve quality.

alignment evaluation reasoning

BiGain: Unified Token Compression for Joint Generation and Classification

Mar 12, 2026

Jiacheng Liu, Shengkun Tang, Jiacheng Cui et al.

Token compression in diffusion models can serve both generation and classification if you preserve different frequency components: keep high-frequency details for texture/edges and low/mid-frequency information for semantic understanding.

BiGain is a method that speeds up diffusion models while keeping both image generation and classification working well. It uses frequency-aware token compression—separating fine details from overall structure—to decide which tokens to merge or remove, maintaining visual quality and classification accuracy simultaneously.

efficiency architecture evaluation

RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Mar 12, 2026

Bin Wan, Runmin Cong, Xiaofei Zhou et al.

Using adaptive convolution kernels guided by object size proportions, combined with transformer-based backbones, significantly improves detection of objects at different scales in satellite imagery.

RDNet improves salient object detection in satellite images by replacing traditional CNN backbones with SwinTransformer and adding three specialized modules that adapt to different object sizes and use frequency analysis to better understand context. This solves the problem of detecting objects of varying scales in remote sensing imagery more accurately than existing methods.

architecture efficiency evaluation

Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Mar 12, 2026

Abhinaba Basu, Pavan Chakraborty

ML models for materials science need formal safety audits—this work shows single models have severe blind spots, but systematic falsification and confidence bounds can identify reliable predictions and improve discovery by 25%.

Machine-learned models for predicting material properties often fail silently. This paper introduces Proof-Carrying Materials, a system that audits these models through adversarial testing, statistical confidence bounds, and formal verification to identify which predictions are trustworthy.

safety evaluation applications

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski et al.

Current document-reasoning agents succeed through exhaustive search rather than strategic thinking—they need better planning abilities, not just more attempts, to handle real-world document workflows efficiently.

This paper introduces MADQA, a benchmark with 2,250 questions across 800 PDF documents, to test whether AI agents can strategically navigate documents or just randomly search. The researchers found that while agents match human accuracy on some questions, they use brute-force trial-and-error rather than smart planning, and fall 20% short of optimal performance.

evaluation agents reasoning

LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Mar 12, 2026

Feiyu Duan, Xuanjing Huang, Zhongyu Wei

Current LLMs struggle with implicit user intentions and long-term preference modeling—they can handle immediate requests but fail to understand what users really need or remember their preferences over extended interactions.

LifeSim creates realistic simulated users with beliefs, desires, and intentions to test how well AI assistants handle long-term, multi-scenario interactions. The benchmark evaluates whether AI can understand both explicit requests and hidden user needs, maintain accurate user profiles over time, and provide contextually appropriate responses across 1,200 diverse life scenarios.

evaluation agents applications

Linking Perception, Confidence and Accuracy in MLLMs

Mar 12, 2026

Yuetian Du, Yucheng Wang, Rongyu Zhang et al.

Multimodal models suffer from severe confidence miscalibration; training them to be honest about uncertainty and using that uncertainty to trigger verification steps significantly improves both accuracy and reliability.

This paper identifies that multimodal AI models are overconfident—they don't reliably know when they're wrong. The authors propose a training method using image noise pairs and confidence-based rewards to fix this, plus a test-time strategy that uses the model's confidence to decide when to double-check answers. Results show 8.8% accuracy improvements across benchmarks.

evaluation training multimodal

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Mar 12, 2026

Quanhao Li, Zhen Xing, Rui Wang et al.

You can now generate videos with precise motion control in a fraction of the time by distilling multi-step models and retraining motion adapters—opening doors for real-time interactive video creation.

FlashMotion speeds up trajectory-controlled video generation from many steps to just a few, while keeping videos high-quality and motion paths accurate. It trains a motion controller on a slow multi-step model, then distills it to run faster, and fine-tunes the controller to work well with the speedier version.

efficiency architecture evaluation

Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

Feb 27, 2026

Shruti Joshi, Théo Saulus, Wieland Brendel et al.

Standard metrics for evaluating learned representations are often misspecified and can mislead you about whether your model actually learned interpretable features.

This paper reveals that popular metrics for checking if AI models learn meaningful, interpretable features are unreliable. The metrics work only under specific conditions, and when those conditions aren't met, they give false results—saying a model learned good features when it didn't, or vice versa. The authors provide tools to properly test these metrics.

evaluation training

Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Feb 27, 2026

Dake Zhang, Mark D. Smucker, Charles L. A. Clarke

Automated evaluation of RAG systems for news credibility assessment can reliably match human judgment, enabling faster iteration on trustworthiness tools without repeated human review.

This paper describes evaluation tools for AI systems that help readers assess whether news articles are trustworthy. Researchers created benchmarks with human-judged questions and reports about real news, then built an automated system to score new submissions without needing human reviewers each time.

evaluation applications reasoning

A Minimal Agent for Automated Theorem Proving

Feb 27, 2026

Borja Requena Pozo, Austin Letson, Krystian Nowakowski et al.

Iterative refinement with simpler architecture outperforms complex single-shot approaches for theorem proving, reducing cost while improving sample efficiency.

Researchers built a simplified AI system that proves mathematical theorems by iteratively refining attempts, searching libraries, and managing context. Despite being much simpler than existing approaches, it performs competitively while being cheaper and more efficient—showing that iterative refinement beats trying to solve everything in one shot.

agents reasoning evaluation

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Feb 27, 2026

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

For AI models to recognize new combinations of familiar concepts, their internal representations must be mathematically linear and orthogonal—a strict geometric requirement on their embedding spaces.

This paper explains why neural networks need to organize information in a specific geometric way to recognize familiar concepts in new combinations. The researchers prove that for a model to generalize to unseen combinations of concepts, its internal representations must decompose into separate, perpendicular components for each concept.

architecture reasoning evaluation

FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System

Feb 27, 2026

Kriti Thakur, Alivelu Manga Parimi, Mayukha Pal

Transformers can outperform traditional deep learning for time-series fault detection in power systems, especially as grids become more complex with sensor-rich, active distribution infrastructure.

FaultXformer uses a Transformer model to detect and locate electrical faults in power grids using real-time sensor data. It processes current measurements in two stages—first extracting temporal patterns, then classifying fault types and pinpointing locations—achieving 98%+ accuracy and outperforming traditional deep learning approaches like CNNs and LSTMs.

architecture · applications · evaluation
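The temporal-pattern extraction stage rests on scaled dot-product self-attention, which can be sketched in plain NumPy on a window of current measurements. Dimensions and random weights here are illustrative only, not FaultXformer's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 32, 16                       # 32 time steps of 3-phase currents, model dim 16
x = rng.normal(size=(T, 3))         # one PMU window: (time, phases)

W_in = rng.normal(size=(3, d))      # project raw currents into model space
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

h = x @ W_in                        # (T, d) embeddings, one per time step
q, k, v = h @ Wq, h @ Wk, h @ Wv

# Scaled dot-product self-attention: each time step attends to every other,
# which is how an encoder picks up temporal fault signatures across the window.
scores = q @ k.T / np.sqrt(d)       # (T, T) pairwise similarities
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                      # (T, d) context-mixed features

print(out.shape)                    # → (32, 16)
```

A classification head on top of these mixed features would then handle the second stage, fault typing and location.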

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Feb 27, 2026

Javier Pulido, Filipe Rodrigues

Foundation models trained on diverse time-series data can forecast transportation metrics without task-specific tuning, making them practical basel...

This paper tests whether a general-purpose time-series AI model (Chronos-2) can forecast transportation data like traffic volume and bike-sharing demand without any custom training. The model works surprisingly well out-of-the-box, often beating specialized models built just for these tasks, and also provides useful uncertainty estimates.

evaluation · applications · efficiency

Model Agreement via Anchoring

Feb 26, 2026

Eric Eaton, Surbhi Goel, Marcel Hussing et al.

You can mathematically guarantee that independently trained models will converge to the same predictions by scaling up ensemble size, boosting iter...

This paper shows how to make independently trained machine learning models agree with each other using a technique called anchoring. The researchers prove that when you train multiple models together with common methods like stacking, boosting, or neural networks, you can reduce their disagreement by scaling simple knobs such as the number of models or the number of training iterations.

training · evaluation
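The ensemble-size knob can be illustrated with a toy bagging experiment (my own example, not the paper's anchoring construction): two independently built ensembles of bootstrap regressors disagree less as each ensemble grows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # noisy linear data

def bagged_slope(n_models):
    """Average least-squares slope over n_models bootstrap refits."""
    slopes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))
        xb, yb = x[idx], y[idx]
        slopes.append((xb @ yb) / (xb @ xb))
    return np.mean(slopes)

def disagreement(n_models, trials=50):
    """Typical gap between two independently built ensembles of a given size."""
    return np.mean([abs(bagged_slope(n_models) - bagged_slope(n_models))
                    for _ in range(trials)])

d2, d50 = disagreement(2), disagreement(50)
print(d2, d50)   # disagreement shrinks as the ensembles grow
```

Averaging more independent fits cancels more of the randomness each individual fit inherits from its bootstrap sample, which is the intuition behind scaling ensemble size to force agreement.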

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 26, 2026

Amita Kamath, Jack Hessel, Khyathi Chandu et al.

Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data...

Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images—omitting obvious details.

data · evaluation · reasoning

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Feb 26, 2026

Dany Haddad, Dan Bareket, Joseph Chee Chang et al.

Scientists use AI research tools as collaborative partners, not search engines—they write complex queries, reuse outputs, and dig into citations ...

Researchers analyzed how scientists actually use AI-powered research tools by studying over 200,000 real queries and interactions. They found that scientists write longer, more complex questions than traditional search, treat AI as a research partner for drafting and brainstorming, and revisit AI responses like documents rather than one-off answers.

applications · evaluation · agents

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Feb 26, 2026

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus et al.

LLMs dramatically amplify what untrained people can accomplish in specialized fields like biology, raising both opportunity and safety concerns.

Researchers tested whether LLMs actually help non-experts do biology tasks better than using the internet alone. Novices with LLM access were 4x more accurate than those without, and sometimes outperformed trained experts. However, users often failed to elicit the models' best answers, and most found it easy to obtain sensitive biosecurity information despite safeguards.

evaluation · safety · applications

Invariant Transformation and Resampling based Epistemic-Uncertainty Reduction

Feb 26, 2026

Sha Hu

You can boost inference accuracy by running predictions on multiple transformed versions of an input and averaging the results.

This paper shows that when you transform an input in different ways (like rotating an image), a model's prediction errors differ across the transformed versions. By running inference on several transformed copies of the same input and averaging the results, you get more accurate predictions without retraining the model. This is useful for squeezing extra accuracy out of a model, or for using a smaller model without sacrificing performance.

evaluation · efficiency
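The transform-and-average idea can be sketched in NumPy. The "model" below is a hypothetical predictor whose error depends on input orientation; averaging its predictions over the four rotations of the input cancels that orientation-dependent error (an illustrative setup, not the paper's exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(loc=5.0, size=(8, 8))      # toy "image"; true mean is around 5

def model(x):
    """Stand-in predictor: estimates the image mean, but with an
    orientation-dependent error (it over-weights the top rows)."""
    w = np.linspace(2.0, 0.0, x.shape[0])[:, None]
    return float((x * w).mean())

# Invariant transformations of the input: the four 90-degree rotations.
transforms = [lambda x, k=k: np.rot90(x, k) for k in range(4)]

true_mean = img.mean()
single = model(img)                                     # one orientation only
averaged = np.mean([model(t(img)) for t in transforms]) # transform and average

print(abs(single - true_mean), abs(averaged - true_mean))
```

Because the mean is invariant under rotation while the model's error is not, the averaged estimate lands much closer to the truth than any single-orientation prediction.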

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Feb 26, 2026

Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy et al.

Small language models can handle real-time role classification in robotics with fine-tuning, but adding more context in conversations breaks their ...

This paper tests whether small language models can quickly learn to identify leader and follower roles in human-robot conversations without needing large models. Researchers fine-tuned a tiny 0.5B model on robot interaction data and found it achieved 86% accuracy while running fast enough for robots to use locally, but struggled when conversations got longer.

efficiency · evaluation · applications

A Proper Scoring Rule for Virtual Staining

Feb 26, 2026

Samuel Tonks, Steve Hood, Ryan Musso et al.

Use information gain to evaluate generative models on their ability to estimate uncertainty correctly, not just prediction accuracy.

This paper introduces a better way to evaluate AI models that generate synthetic biological images (virtual staining). Instead of just checking if the overall results look right, it measures whether the model correctly estimates uncertainty about what it's predicting for each individual cell.

evaluation · applications
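The information-gain idea can be sketched with the log score, a classic proper scoring rule: score a model by how much likelihood it assigns to what actually happened, relative to a naive base-rate baseline. The Bernoulli per-cell setup below is my own illustration, not the paper's exact rule:

```python
import numpy as np

def log_score(p, y, eps=1e-12):
    """Mean log-likelihood of binary outcomes y under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical per-cell outcomes (e.g. marker present) and two predictors.
y = np.array([1, 1, 0, 1, 0, 0, 1, 1])
confident = np.array([0.9, 0.8, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9])
baseline = np.full_like(confident, y.mean())   # predicts the base rate everywhere

# Information gain: how much the model's log score beats the baseline's.
gain = log_score(confident, y) - log_score(baseline, y)
print(round(gain, 3))   # positive: the model's per-cell uncertainty is informative
```

A model that merely "looks right" on average but is miscalibrated per cell would score near zero gain, which is exactly the failure mode accuracy-only metrics miss.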

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Feb 26, 2026

Sungho Park, Jueun Kim, Wook-Shin Han

Current AI models struggle with real-world table-text reasoning; SPARTA exposes this gap with automatically generated, complex multi-hop questions ...

SPARTA is a benchmark that tests AI models on complex multi-hop questions requiring joint reasoning over text and tables.

evaluation · reasoning · multimodal