Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

1552 papers · 39 this month · 12 topics

All Evaluation 40 Training 34 Efficiency 33 Reasoning 30 Agents 27 Applications 22 Multimodal 18 Data 17 Safety 13 Architecture 11 Alignment 7 scaling 5

Jul 6 – Jul 12 (23)

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

Jul 9, 2026

Zhekai Chen, Chengqi Duan, Kaiyue Sun et al.

This benchmark separates what a language model can do from how well an agent framework uses those abilities—showing that both matter equally for real-world performance.

UniClawBench is a new benchmark for evaluating AI agents that work with real-world tools and applications. Unlike older benchmarks that use static simulations, it tests agents in live environments with 400 real tasks across five key capabilities: using tools, exploring options, understanding long documents, processing images/video, and coordinating across platforms.

evaluation agents reasoning

OpenCoF: Learning to Reason Through Video Generation

Jul 9, 2026

Xinyan Chen, Ziyu Guo, Renrui Zhang et al.

Video generation can be a reasoning mechanism: training models on diverse temporal reasoning tasks and adding explicit reasoning tokens improves their ability to solve logical problems by generating step-by-step visual explanations.

OpenCoF introduces a dataset and fine-tuned video model designed to teach AI systems to reason by generating sequences of video frames. Unlike text-based reasoning, this 'Chain-of-Frame' approach lets models unfold logical steps visually across time. The work shows that video models trained on diverse reasoning tasks with special reasoning tokens perform better at solving complex problems.

Jun 29 – Jul 5 (25)

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

Jul 2, 2026

Yanjun Zhao, Ruizhong Qiu, Tianxin Wei et al.

You can boost long-context reasoning without retraining by identifying relevant evidence through attention patterns and replaying it before generation—a simple inference-time trick that works across different model sizes.

ReContext improves how LLMs use information in long documents by replaying relevant evidence before generating answers. Instead of training or pruning context, it uses the model's internal attention signals to identify and reorder important passages, helping the model focus on what matters for each question.

reasoning
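
The replay idea in the ReContext summary can be illustrated with a toy prompt builder. This is only a sketch under stated assumptions, not the paper's implementation: the per-passage `scores` stand in for whatever attention-derived signal the method actually computes, and the function name `replay_prompt`, the `top_k` parameter, and the prompt layout are all hypothetical.

```python
from typing import List, Sequence


def replay_prompt(passages: Sequence[str], scores: Sequence[float],
                  question: str, top_k: int = 2) -> str:
    """Build a prompt that repeats the top-scoring passages before the question.

    `scores` is an assumed stand-in for attention-based relevance signals.
    """
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    # Rank passage indices by score, highest first; ties keep document order.
    ranked = sorted(range(len(passages)), key=lambda i: -scores[i])
    # Replay the selected passages in document order (an illustrative choice).
    chosen = sorted(ranked[:top_k])
    parts: List[str] = ["Context:"]
    parts.extend(passages)
    parts.append("Relevant evidence (replayed):")
    parts.extend(passages[i] for i in chosen)
    parts.append(f"Question: {question}")
    return "\n".join(parts)


prompt = replay_prompt(
    ["The order shipped Monday.", "Refunds take 5 days.", "Support is open 9-5."],
    [0.1, 0.9, 0.5],
    "How long do refunds take?",
)
```

The original context is left untouched here and the selected evidence is appended just before the question, so nothing is pruned; how the paper itself orders or formats the replayed passages is not stated in the summary.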

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Jul 2, 2026

Yuxuan Li, Lingxi Xie, Xinyue Huo et al.

Reasoning models can improve speaker identification in video by combining multiple modalities and contextual evidence, outperforming traditional audio-only approaches on challenging cases.

This paper tackles speaker recognition in long-form TV dramas by introducing DramaSR-532K, a large benchmark with 532K annotated dialogue lines, and DramaSR-LRM, a reasoning-based approach that combines audio, text, and visual information to accurately identify which character is speaking. The method works especially well on short utterances where voice alone isn't reliable.

Jun 22 – Jun 28 (26)

VGB for Masked Diffusion Model: Efficient Test-time Scaling for Reward Satisfaction and Sample Editing

Jun 26, 2026

Kijung Jeon, Thuy-Duong Vuong, Molei Tao

MDM-VGB enables efficient test-time scaling for constrained generation by allowing tokens to be remasked during sampling, achieving quadratic complexity while competing methods like best-of-N suffer exponential complexity—making it practical for real-world constraint satisfaction problems.

This paper introduces MDM-VGB, a sampling method for masked diffusion models that improves generation quality at test time by allowing tokens to be strategically unmasked and remasked based on reward signals.

reasoning evaluation

Democratic ICAI: Debating Our Way to Steering Principles from Preferences

Jun 26, 2026

Kevin Kingslin, Anish Natekar, Ashutosh Ranjan et al.

Using multi-perspective debate to extract alignment principles from preferences captures richer decision-making reasoning than single-pass explanations, leading to more faithful and interpretable AI steering.

This paper improves how AI systems learn from human preferences by using structured debates between different viewpoints to uncover the reasoning behind choices. Instead of just recording which option humans prefer, Democratic ICAI captures multiple competing arguments that influence decisions, then distills these into clear principles that guide AI behavior.

Jun 15 – Jun 21 (26)

Multi-Task Bayesian In-Context Learning

Jun 18, 2026

Qingyang Zhu, Eric Karl Oermann, Kyunghyun Cho

You can train a transformer to act as a fast Bayesian predictor by treating prior information as part of the input context, achieving oracle-level accuracy orders of magnitude faster than traditional Bayesian methods.

This paper presents a method for training transformers to perform Bayesian inference quickly by learning from examples of prior distributions and target datasets. Instead of computing exact Bayesian predictions (which is slow), the model learns to map sequences of prior information and data directly to predictions, enabling fast uncertainty-aware inference that adapts to new priors at test time.

training reasoning efficiency

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Jun 18, 2026

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco et al.

Explicitly tracking task state in a separate ledger helps agents avoid stale information and policy violations—two major failure modes in tool-calling agents—without requiring model retraining.

LedgerAgent is a method that helps AI agents handle customer service tasks by maintaining a separate record (ledger) of important task information like facts and constraints. Instead of having agents dig through long prompts to find relevant details, the ledger keeps this information organized and visible, and also checks whether tool calls follow domain rules before executing them.
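
A rough picture of the ledger idea, as a minimal sketch rather than the authors' design: a small record of task facts plus a list of rules that are checked before a tool call runs. The `Ledger` class, the rule signature, the `issue_refund` tool, and the refund rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

ToolCall = Dict[str, Any]
# A rule returns an error message if the call violates policy, else None.
Rule = Callable[[Dict[str, Any], ToolCall], Optional[str]]


@dataclass
class Ledger:
    facts: Dict[str, Any] = field(default_factory=dict)
    rules: List[Rule] = field(default_factory=list)

    def update(self, key: str, value: Any) -> None:
        """Overwrite a fact so later steps never read a stale value."""
        self.facts[key] = value

    def check(self, call: ToolCall) -> List[str]:
        """Return all policy violations for a proposed tool call."""
        violations = []
        for rule in self.rules:
            message = rule(self.facts, call)
            if message is not None:
                violations.append(message)
        return violations


def refund_within_limit(facts: Dict[str, Any], call: ToolCall) -> Optional[str]:
    # Hypothetical domain rule: refunds may not exceed the order total.
    if call["name"] == "issue_refund" and call["args"]["amount"] > facts.get("order_total", 0):
        return "refund exceeds order total"
    return None


ledger = Ledger(rules=[refund_within_limit])
ledger.update("order_total", 40)
ok = ledger.check({"name": "issue_refund", "args": {"amount": 25}})
blocked = ledger.check({"name": "issue_refund", "args": {"amount": 60}})
```

In this sketch an agent loop would execute a tool call only when `check` returns an empty list, and would feed any violation messages back to the model otherwise.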

Papers

Ideas Have Genomes: Benchmarking Scientific Lineage Reasoning and Lineage-Grounded Idea Generation

Workflow as Knowledge: Semantic Persistence for LLM-Mediated Workflows

Latent Memory Palace: Reasoning for Control as Autoregressive Variational Inference

Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents

MPFlow: Learning Budgeted Max-Flow Optimization on the Lightning Network with Deep Graph Reinforcement Learning

WebSwarm: Recursive Multi-Agent Orchestration for Deep-and-Wide Web Search

Formal Mechanisms for Market Stability in Self-Interested Agent Societies: A Marketplace Simulation Study

Accurate, Interdisciplinary and Transparent Structure-property Understanding with Deep Native Structural Reasoning

The Key to Going Linear: Analysis-Driven Transformer Linearization

From Noisy Traces to Root Causes: Structural Trajectory Analysis and Causal Extraction for Agent Optimization

Agon: Competitive Cross-Model RL with Implicit Rival Grading of Reasoning

Neural Operator-enabled Topology-informed Evolutionary Strategy for PDE-Constrained Optimization

How Data Shapes RoPE Frequency Usage: From Positional Scale Matching to Length Generalization

Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems

Graph Convolutional Attention: A Spectral Perspective on Graph Denoising and Diffusion

RSF-GLLM: Bridging the Semantic Gap in Multi-Hop Knowledge Graph QA via Recurrent Soft-Flow and Decoupled LLM Generation

Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment

FootsiesGym: A Fighting Game Benchmark for Two-Player Zero-Sum Imperfect-Information Games

DynaKRAG: A Unified Framework for Learnable Evidence Control in Multi-Hop Retrieval-Augmented Generation

Weak-to-Strong Generalization via Direct On-Policy Distillation

LLM-as-a-Verifier: A General-Purpose Verification Framework

DemoPSD: Disagreement-Modulated Policy Self-Distillation

G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

Extreme Adaptive Transformer for Time Series Forecasting

DecompRL: Solving Harder Problems by Learning Modular Code Generation

Measuring the Gap Between Human and LLM Research Ideas

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

AutoMem: Automated Learning of Memory as a Cognitive Skill

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Optimal Resource Utilization for Autonomous Laboratory Orchestrators

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

Generative Skill Composition for LLM Agents

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

AxDafny: Agentic Verified Code Generation in Dafny

PolicyGuard: From Organizational Policies to Neuro-Symbolic Compliance Review Engines

Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization

Self-Evolving World Models for LLM Agent Planning

GROW$^2$: Grounding Which and Where for Robot Tool Use

Uncertainty-Aware Generation and Decision-Making Under Ambiguity

Towards Automating Scientific Review with Google's Paper Assistant Tool

Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

When are likely answers right? On Sequence Probability and Correctness in LLMs

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

Multilingual Reasoning Cascades Need More Context

EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting

E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy

LMs as Task-Specific Knowledge Bases: An Interpretability Analysis

Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Forecasting With LLMs: Improved Generalization Through Feature Steering

RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment

World Models in Pieces: Structural Certification for General Agents

Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System

Large-Language-Model Discovery of Quantum LDPC Codes through Structured Concept Evolution

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles