Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel et al.
When many employers use the same hiring algorithm, that shared algorithm amplifies bias instead of spreading risk: the same people get rejected everywhere, and racial disparities compound across the job market.
This paper analyzes hiring algorithms from a single vendor used by many employers and finds they create unfair outcomes.
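The "same people get rejected everywhere" effect can be illustrated with a toy probability model. This is a sketch under stated assumptions, not the paper's analysis: the function name, the single `reject_rate`, and the assumption that separate employers decide independently are all hypothetical simplifications.

```python
def p_rejected_everywhere(reject_rate: float, employers: int, shared: bool) -> float:
    """Probability that one applicant is rejected by every employer (toy model)."""
    if not 0.0 <= reject_rate <= 1.0:
        raise ValueError("reject_rate must be in [0, 1]")
    if employers < 1:
        raise ValueError("employers must be at least 1")
    if shared:
        # Assumption: one shared algorithm makes one decision,
        # and that decision repeats at every employer.
        return reject_rate
    # Assumption: each employer decides independently, so rejections multiply.
    return reject_rate ** employers


if __name__ == "__main__":
    for shared in (False, True):
        p = p_rejected_everywhere(reject_rate=0.5, employers=3, shared=shared)
        print(f"shared={shared}: P(rejected by all 3) = {p}")
```

In this toy setting, independent decisions shrink the chance of being locked out as more employers are added, while a shared algorithm keeps it fixed at the single-employer rejection rate.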
Tamerlan Aghayev, Maxime Elkael, Michele Polese et al.
AI agents can handle complex domain-specific engineering when grounded in real-world validation and persistent knowledge—LLMs alone fail on RAN work because they hallucinate APIs and break on real hardware, but agents with feedback loops and ground truth don't.
GENESIS is an AI agent framework that automates cellular network (6G RAN) development by converting specifications and problems into tested code solutions. It combines LLMs with real hardware validation and a persistent knowledge base to handle tasks like feature implementation, testing, and optimization that normally take months of manual engineering.
Aneesh Komanduri, Xintao Wu
You can leverage existing pretrained models for causal reasoning tasks by building a modular pipeline that extracts concepts, manipulates them causally, and generates counterfactuals—no need to retrain from scratch.
This paper presents FM-CGM, a framework that combines pretrained foundation models (reasoning models and diffusion models) to perform causal reasoning on images. It enables zero-shot discovery of causal relationships, intervention on concepts, and generation of counterfactual images—all without retraining the models.
Carlos Heredia, Daniel Roncel
Neural demand models can be designed to respect economic constraints (integrability), producing more reliable price-elasticity estimates that are both mathematically consistent and practically useful for retail pricing.
This paper introduces ICDN, a neural network model that learns demand patterns for multiple products based on prices. Unlike traditional approaches, it directly models how demand changes with price (elasticity) in a mathematically consistent way, making the learned relationships more economically realistic and stable.
James Petullo, Nianwen Xue
Allocating more computational effort to harder SQL generation tasks—by exploring more candidate solutions—significantly improves accuracy without needing larger models.
CA-SQL improves LLM performance on complex SQL generation tasks by estimating question difficulty and dynamically adjusting how many candidate queries to explore. It uses evolutionary search principles and a custom voting method to find better SQL solutions, achieving state-of-the-art results on the BIRD benchmark's hardest problems.
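The general idea of spending more candidates on harder questions can be sketched as follows. This is a minimal illustration, not CA-SQL itself: the difficulty score, the linear budget rule, and the plain majority vote over execution results are all assumptions, and the paper's evolutionary search and custom voting method are not reproduced here.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def candidate_budget(difficulty: float, min_k: int = 2, max_k: int = 16) -> int:
    """Map a difficulty score in [0, 1] to a number of candidate queries (hypothetical rule)."""
    clipped = min(1.0, max(0.0, difficulty))
    return min_k + round(clipped * (max_k - min_k))


def vote_by_result(
    candidates: Sequence[str],
    execute: Callable[[str], Hashable],
) -> Optional[str]:
    """Return the first candidate whose execution result is the most common one."""
    scored = []
    for sql in candidates:
        try:
            scored.append((sql, execute(sql)))
        except Exception:
            # Candidates that fail to execute get no vote.
            continue
    if not scored:
        return None
    winning_result, _ = Counter(result for _, result in scored).most_common(1)[0]
    return next(sql for sql, result in scored if result == winning_result)


if __name__ == "__main__":
    # Stand-in for a database: maps each query string to a fake result.
    fake_db = {"q1": (42,), "q2": (7,), "q3": (42,)}
    print(candidate_budget(0.9))                       # harder question, larger budget
    print(vote_by_result(["q1", "q2", "q3"], fake_db.__getitem__))
```

`execute` is expected to return something hashable, such as a tuple of rows, so that identical results can be counted together.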
Yi Yu, Parker Martin, Zhenyu Bu et al.
Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.
CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.
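One way per-field confidence scores could drive human review is a simple threshold split. The sketch below is an assumption about how such scores might be used, not the CMR-EXTR pipeline: the `Extraction` type, the field names, the sample values, and the 0.9 threshold are all invented for illustration.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Extraction(NamedTuple):
    field: str         # hypothetical field name
    value: str         # extracted text
    confidence: float  # model confidence in [0, 1]


def split_for_review(
    extractions: Sequence[Extraction],
    threshold: float = 0.9,
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into auto-accepted items and a review queue."""
    accepted, review = [], []
    for item in extractions:
        (accepted if item.confidence >= threshold else review).append(item)
    # Least confident items go first so reviewers see them earliest.
    review.sort(key=lambda e: e.confidence)
    return accepted, review


if __name__ == "__main__":
    # Invented sample data, not taken from the paper.
    sample = [
        Extraction("field_a", "55", 0.99),
        Extraction("field_b", "present", 0.62),
        Extraction("field_c", "120", 0.85),
    ]
    accepted, review = split_for_review(sample)
    print("accepted:", [e.field for e in accepted])
    print("review:  ", [e.field for e in review])
```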
Ziyang Huang, Yi Cao, Ali K. Shargh et al.
AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.
This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success, because they cannot fully reconstruct experimental procedures from paper descriptions, they deviate from required methods, and they fail during execution.
Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan
Treating chart generation as a multi-step inspectable process with rendered-output validation catches visualization failures that code-only checks miss, and the resulting dataset reveals specific weaknesses in how multimodal LLMs understand charts.
This paper presents a structured workflow for generating statistical charts from data using LLMs, with built-in validation to catch visualization errors before they reach users. The workflow produces 1,500 diverse charts paired with 30,000+ question-answer pairs, revealing that while LLMs excel at reading chart syntax, they struggle with value extraction and reasoning tasks.
Inês Oliveira e Silva, Sérgio Jesus, Iker Perez et al.
Quantitative metrics for evaluating AI explanations (like sparsity and faithfulness) don't predict whether explanations actually help humans make better decisions in high-stakes settings—you need human-centered evaluation, not just mathematical benchmarks.
This paper evaluates eight different Shapley value methods—a popular AI explanation technique—by testing them with real financial analysts on fraud detection and risk assessment tasks.
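For readers new to Shapley values, the textbook definition can be computed exactly for a tiny example. This sketch shows only that standard definition; it is not one of the eight methods the paper evaluates, and the feature names and value function are made up. Exact computation enumerates every coalition, so it is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


if __name__ == "__main__":
    # Hypothetical toy "model": the score is a sum of per-feature contributions.
    contributions = {"amount": 3.0, "country": 1.0, "hour": 0.5}
    score = lambda s: sum(contributions[f] for f in s)
    print(shapley_values(list(contributions), score))
```

For an additive value function like this one, each feature's Shapley value equals its own contribution, which makes the toy case easy to check by hand.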
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil et al.
LLMs outperform traditional word-error metrics for evaluating speech recognition by understanding semantic meaning rather than just counting mistakes, opening the door to more human-aligned ASR evaluation.
This paper shows that large language models can evaluate speech recognition quality better than traditional metrics like Word Error Rate. Instead of just counting wrong words, LLMs can understand meaning and classify errors in ways that match how humans judge speech quality, reaching 92-94% agreement with human raters.
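To see what "just counting wrong words" means, here is a minimal sketch of Word Error Rate as it is conventionally defined: word-level edit distance divided by the number of reference words. It illustrates the baseline metric only, not the paper's LLM-based evaluation; splitting on whitespace with no text normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "turn the lights off"
    print(word_error_rate(reference, "turn the light off"))  # minor slip
    print(word_error_rate(reference, "turn the lights on"))  # meaning reversed
```

Both hypotheses in the usage example differ from the reference by one substituted word out of four, so they receive the same WER even though only the second changes the meaning.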
Thomas Bayer, Alexander Lohr, Sarah Weiß et al.
LLMs can dynamically query Knowledge Graphs to generate contextual, domain-aware explanations of ML model predictions—making AI decisions more transparent and trustworthy in specialized industries like manufacturing.
This paper combines Knowledge Graphs and Large Language Models to explain machine learning predictions in manufacturing. The system stores domain knowledge and ML results in a structured graph, then uses an LLM to convert relevant information into clear, user-friendly explanations. Testing shows the approach works well for both standard and complex questions in real manufacturing settings.
Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir et al.
LLMs show promise for drug discovery, but RL-based post-training on domain-specific tasks is critical: a smaller model trained this way outperformed much larger models that had not received such post-training, suggesting a practical path toward real-world drug design applications.
This paper creates a benchmark of chemistry tasks to test how well large language models can help design new drugs. The researchers test three model families on tasks like predicting molecular properties and designing molecules, then show that reinforcement learning post-training can significantly boost performance, even making smaller models competitive with frontier models.