Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel et al.
When many employers use the same hiring algorithm, it amplifies bias rather than spreading risk—the same people get rejected everywhere, and racial disparities compound across the job market.
This paper analyzes hiring algorithms from a single vendor used by many employers and finds they create unfair outcomes.
Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob et al.
You can optimize retrieval pipelines per-query rather than per-workload by using lightweight predictors trained on query characteristics, achieving the same accuracy at significantly lower cost or better accuracy at the same cost.
This paper presents BRANE, a system that automatically selects the best configuration for retrieval agents on a per-query basis. Instead of manually tuning a retrieval pipeline once, BRANE analyzes each query to predict which combination of LLM, retriever, and other settings will work best, allowing teams to optimize for either accuracy or cost at inference time without retraining.
Xu Ouyang, Deyi Liu, Yuhang Cai et al.
LLMs have a fundamental capacity limit based on signal-to-noise ratio: scaling parameters or data without maintaining sufficient signal clarity causes performance degradation, explaining phenomena like catastrophic overtraining and quantization failures that standard scaling laws can't capture.
This paper explains why large language models sometimes get worse with more training or smaller precision—not just better. Using information theory, the authors model LLM training like sending signals through a noisy channel. When you scale up the model or data without keeping the signal clear relative to noise, performance actually drops in a U-shape.
Zisu Huang, Jingwen Xu, Yifan Yang et al.
Model-generated skills can improve agent performance, but their effectiveness depends on how they're extracted and which agent uses them—not on model size or baseline strength.
This paper studies how AI agents can reuse skills—structured procedures extracted from past experience—to improve performance. The researchers built a comprehensive evaluation framework testing skill extraction and reuse across five different task domains, finding that while model-generated skills help on average, they sometimes hurt performance.
Ruozhen He, Meng Wei, Ziyan Yang et al.
Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.
EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.
Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.
Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.
FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.
Shuhang Lin, Chuhao Zhou, Xiao Lin et al.
Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.
This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.
Peyman Baghershahi, Fangxin Wang, Debmalya Mandal et al.
When using GNNs for predictions, you can get tighter, more reliable uncertainty estimates by explicitly using graph structure rather than just embedding similarity—this gives you both statistical guarantees and practical efficiency.
GRAPHLCP improves uncertainty quantification for graph neural networks by using graph structure to make better predictions with guaranteed coverage. Instead of just looking at embedding similarity, it uses graph topology and a PageRank-based approach to identify similar nodes and weight predictions appropriately, reducing wasted prediction sets while maintaining statistical guarantees.
Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.
LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.
This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.
Ziyang Huang, Yi Cao, Ali K. Shargh et al.
AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.
This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly—achieving only 54% success—because they can't fully reconstruct experimental procedures from paper descriptions, deviate from required methods, and fail during execution.