ThinkLLM
Models · Capabilities · Use Cases · Benchmarks · Papers · Glossary


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 10 this month · 12 topics
All · Efficiency (35) · Reasoning (35) · Multimodal (28) · Applications (28) · Evaluation (27) · Training (26) · Architecture (24) · Agents (24) · Safety (13) · Scaling (5) · Data (5) · Alignment (1)

Mar 30 – Apr 5 (13)

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecture · multimodal · agents

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Apr 2, 2026

Payal Fofadiya, Sunil Tiwari

Conversational agents perform better with selective memory management than unlimited retention; a relevance-guided forgetting framework improves long-horizon reasoning while reducing false memories and context bloat.

This paper tackles a key problem in conversational AI: agents need to remember past interactions to reason coherently, but storing everything causes performance to degrade and creates false memories. The authors propose a smart forgetting system that decides which memories to keep based on relevance, recency, and frequency—like a selective filing system for an agent's brain.
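The relevance/recency/frequency idea lends itself to a simple scoring rule. The sketch below is illustrative only, not the paper's formulation: the weights, the exponential recency decay, and the log-frequency term are all invented for the example.

```python
import math
import time

def memory_score(relevance, last_access, access_count, now=None, half_life=3600.0):
    """Combine relevance, recency, and frequency into one retention score.
    Weights and decay model are illustrative choices."""
    now = time.time() if now is None else now
    recency = math.exp(-(now - last_access) / half_life)  # decays toward 0
    frequency = math.log1p(access_count)                  # diminishing returns
    return 0.5 * relevance + 0.3 * recency + 0.2 * frequency

def prune(memories, keep=10, now=None):
    """Keep only the top-scoring memories; the rest are 'forgotten'."""
    ranked = sorted(
        memories,
        key=lambda m: memory_score(m["relevance"], m["last_access"], m["count"], now),
        reverse=True,
    )
    return ranked[:keep]
```

Under this kind of rule, a stale, rarely-used memory drops out of the store even if it was once relevant, which is the behavior the summary attributes to relevance-guided forgetting.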

Mar 23 – Mar 29 (14)

Vega: Learning to Drive with Natural Language Instructions

Mar 26, 2026

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng et al.

Language instructions can guide autonomous driving decisions in real-time, enabling personalized driving behaviors beyond fixed rules—this opens the door to more flexible, user-responsive autonomous systems.

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal · agents · reasoning

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Mar 26, 2026

Zehao Wang, Huaide Jiang, Shuaiwu Dong et al.

Autonomous driving systems can be personalized to match individual driver styles by learning user embeddings from driving data and conditioning the driving policy on these embeddings, enabling more human-centered autonomous vehicles.

This paper presents Drive My Way, a personalized autonomous driving system that learns individual driver preferences and adapts to real-time instructions.

Mar 16 – Mar 22 (24)

MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms

Mar 20, 2026

Anqi Dong, Yongxin Chen, Karl H. Johansson et al.

By learning control coefficients designed for sampled-data systems rather than continuous velocity fields, you can steer large swarms efficiently in just a few control steps while respecting real hardware constraints.

This paper presents a control framework for steering large swarms with minimal updates by learning finite-window control coefficients that respect how real systems work—with intermittent control updates rather than continuous commands. The approach scales to large swarms while automatically respecting the system's dynamics and control constraints.

agents

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Mar 20, 2026

Jingyang Lin, Jialian Wu, Jiang Liu et al.

Instead of processing all video frames, intelligent seeking based on reasoning about what matters can use far fewer frames while achieving better results—a practical approach for building efficient video AI systems.

VideoSeek is a video understanding agent that intelligently seeks out key moments in videos rather than analyzing every frame, reducing computational cost by 93% while improving accuracy. It uses a toolkit to gather multi-scale observations and reasons about video content through a think-act-observe loop, enabling efficient long-horizon video understanding.
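The seek-instead-of-scan idea can be illustrated with a toy coarse-to-fine search. Nothing below is VideoSeek's actual toolkit: `inspect` is a hypothetical stand-in for the agent's observation tools, returning a relevance score for a timestamp.

```python
def seek(video_len, inspect, budget=6):
    """Sample the video coarsely, then repeatedly zoom into the most
    promising region -- far fewer lookups than inspecting every frame."""
    lo, hi = 0.0, float(video_len)
    best_t = lo
    for _ in range(budget):
        step = (hi - lo) / 4
        candidates = [lo + i * step for i in range(5)]  # 5 evenly spaced probes
        best_t = max(candidates, key=inspect)
        lo, hi = max(lo, best_t - step), min(hi, best_t + step)  # narrow the window
    return best_t
```

With a budget of 6 rounds this makes at most 30 `inspect` calls regardless of video length, which is the flavor of the frame-count savings the summary reports.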

Mar 9 – Mar 15 (11)

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Mar 13, 2026

Haonan Huang

AI agents performing scientific research need memory and reflection, not just execution capability. Knowledge consolidation between runs dramatically improves efficiency and accuracy in computational science workflows.

QMatSuite is a platform that helps AI agents learn from computational materials science experiments by storing findings, retrieving past knowledge, and reflecting on results.

agents · reasoning · data

LLM Constitutional Multi-Agent Governance

Mar 13, 2026

J. de Curtò, I. de Zarzà

When deploying LLMs to coordinate multi-agent systems, you need explicit governance constraints—raw cooperation metrics hide manipulation. CMAG shows how to balance cooperation gains against autonomy loss and fairness degradation.

This paper addresses a critical risk: LLMs can manipulate multi-agent systems into appearing cooperative while actually eroding agent autonomy and fairness. The authors propose CMAG, a governance framework that filters harmful LLM suggestions and optimizes for genuine cooperation rather than just compliance.

safety

Feb 23 – Mar 1 (15)

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Feb 27, 2026

Weinan Dai, Hanlin Wu, Qiying Yu et al.

Reinforcement learning can teach AI models to write genuinely optimized GPU code, not just syntactically correct code—a task that previously requ...

This paper trains an AI agent to write optimized GPU code (CUDA kernels) using reinforcement learning. The system learns from trial-and-error feedback about code performance, achieving faster execution than existing tools like PyTorch's compiler and outperforming top commercial AI models on benchmark tests.

agents · training · applications

A Minimal Agent for Automated Theorem Proving

Feb 27, 2026

Borja Requena Pozo, Austin Letson, Krystian Nowakowski et al.

Iterative refinement with simpler architecture outperforms complex single-shot approaches for theorem proving, reducing cost while improving sample...

Researchers built a simplified AI system that proves mathematical theorems by iteratively refining attempts, searching libraries, and managing context. Despite being much simpler than existing approaches, it performs competitively while being cheaper and more efficient—showing that iterative refinement beats trying to solve everything in one shot.

agents · reasoning · efficiency

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Apr 2, 2026

Andrew Ang, Nazym Azimbayev, Andrey Kim

Agentic AI can shift institutional investing from human execution to human oversight, with autonomous agents handling forecasting, portfolio construction, and self-improvement while staying constrained by policy documents.

This paper demonstrates how AI agents can autonomously manage investment portfolios by having specialized agents forecast market conditions, build portfolios using multiple methods, and critique each other's work—all governed by an Investment Policy Statement that ensures alignment with institutional goals.

agents · applications · reasoning

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Apr 2, 2026

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu et al.

You can train agents to permanently learn skills rather than retrieve them at runtime, reducing token overhead and improving zero-shot performance by progressively withdrawing skill context during training.

SKILL0 teaches language model agents to internalize skills (procedural knowledge packages) directly into their parameters through a curriculum that gradually removes skill context during training.

training · agents · reasoning

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Apr 2, 2026

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin et al.

Selectively querying language models based on uncertainty can improve RL agent robustness in novel situations without constant computational overhead—but successful integration requires careful design, not just combining the two systems.

This paper proposes ASK, a system that combines reinforcement learning agents with language models to handle out-of-distribution scenarios.

agents · reasoning · safety

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applications · agents · multimodal

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Apr 1, 2026

Muyu He, Adit Jain, Anand Kumar et al.

Current LLM agents struggle with long-term planning and learning from delayed feedback—only top models like Claude Opus 4.6 succeed, and using scratchpads to persist information across context windows is critical for success.

YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long periods by simulating running a startup for a year. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.

evaluation · agents · reasoning

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Apr 1, 2026

Youssef Mroueh, Carlos Fonseca, Brian Belgodere et al.

Combining theory and code in algorithm search, with explicit correctness/originality gates, produces more scientifically sound discoveries than optimizing code alone.

CliffSearch is an AI system that discovers new scientific algorithms by evolving both theory and code together. Unlike systems that just generate code, it uses multiple AI agents to propose, test, and refine ideas while checking for correctness and originality—similar to how scientists actually work through hypothesis, implementation, testing, and revision cycles.

agents · reasoning

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

data · training · agents

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

LLMs can serve as runtime architectural components to solve schema interoperability problems dynamically, but code generation strategies outperform direct transformation and cost varies dramatically across models without matching accuracy gains.

SAGAI-MID is a middleware system that uses LLMs to automatically fix schema mismatches between different services and APIs at runtime, eliminating the need for manual adapter code. It combines structural analysis with LLM reasoning and includes safety checks to handle real-world integration challenges across REST, GraphQL, and IoT systems.

architecture · agents · applications

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Mar 30, 2026

Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al.

Video-language models can supervise robot learning directly as reward signals if trained with spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

SOLE-R1 is a video-language model that watches robot videos and reasons about task progress step-by-step to provide reward signals for robot learning. Unlike standard vision-language models, it's designed to handle partial views and changing conditions, preventing robots from gaming the reward system.

reasoning · agents · multimodal

Dynamic Dual-Granularity Skill Bank for Agentic RL

Mar 30, 2026

Songjun Tu, Chengdong Xu, Qichao Zhang et al.

Organizing agent experience into dual-granularity skills (task-level and step-level) with dynamic maintenance significantly improves performance, and these skills transfer across different evaluation settings without major training overhead.

D2Skill creates a dynamic memory system for AI agents that stores two types of reusable skills: high-level task guidance and low-level step-by-step corrections. The system learns from its own training experience, continuously updating and pruning skills based on their usefulness. Tests show 10-20% improvement in task success rates on complex web-based environments.

agents · reasoning · training

Natural-Language Agent Harnesses

Mar 26, 2026

Linyue Pan, Lexiao Zou, Shuo Guo et al.

Agent performance depends heavily on how you orchestrate their behavior—by making this orchestration code readable and portable through natural language, you can reuse and improve agent designs much more easily.

This paper proposes a new way to design agent control systems by writing them in natural language instead of buried in code. The authors create Natural-Language Agent Harnesses (NLAHs) and a runtime system that executes these harnesses, making it easier to reuse, compare, and study how agents are controlled across different tasks.

agents · architecture

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Mar 26, 2026

Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri et al.

General-purpose coding agents can discover hardware optimization patterns automatically by working at scale—using multiple agents to explore different optimization strategies yields significant speedups without domain-specific training.

This paper shows that general-purpose AI coding agents can optimize hardware designs without specialized training. The approach uses multiple agents working together: first decomposing designs into smaller pieces and optimizing each, then launching additional agents to find cross-function improvements.

agents · applications

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Mar 26, 2026

Yannick Roy

You can safely automate continuous code improvement by combining LLM agents that act as power users, ground-truth verification tests the agents cannot game, and automated pause gates that catch quality degradation before it ships.

A framework for autonomous software development where LLM agents continuously test and improve code against a specification. The system uses synthetic user testing at 1,000x human speed, ground-truth verification tests, and automated quality gates to safely evolve codebases without human intervention—validated on production systems with 1,000+ merged changes and zero regressions.

agents

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Mar 25, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian et al.

Latent world models can dramatically speed up RL training for autonomous driving by replacing expensive multi-step diffusion with single-step latent sampling, making imagination-based policy training practical.

DreamerAD uses a latent world model to train autonomous driving policies 80x faster than previous diffusion-based approaches. Instead of generating full images during training, it compresses the diffusion process to a single step by working with compressed latent features, enabling safe, efficient reinforcement learning on driving tasks without real-world testing.

efficiency · reasoning · agents

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Mar 25, 2026

Biplab Pal, Santanu Bhattacharya

Before deploying agentic AI in business processes, measure the 'blind mass' of uncertain state-action pairs and expected oversight costs using event logs—this reveals hidden decision gaps that simple accuracy metrics miss.

This paper develops a mathematical framework to measure when AI agents can safely operate autonomously versus when they need human oversight.

agents · safety · evaluation

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Mar 25, 2026

Zhuo Li, Yupeng Zhang, Pengyu Cheng et al.

Using multiple agents with intentional information barriers prevents LLMs from confirming their own errors during fact-checking, letting smaller models match larger ones on reliability.

MARCH is a framework that reduces hallucinations in LLMs by using three specialized agents that work together with deliberate information separation. A Solver generates responses, a Proposer breaks them into verifiable claims, and a Checker validates claims without seeing the original output—preventing the verifier from copying the generator's mistakes.

safety · agents · alignment
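The information barrier in the MARCH summary can be illustrated with stubs. All three functions below are toy stand-ins for LLM calls, and the claim-splitting heuristic is invented for the example; the point is only that the checker never sees the solver's full output.

```python
def solver(question):
    # Toy stand-in for the generator LLM: returns a draft answer.
    return "Paris is the capital of France and the city has 30 million people."

def proposer(answer):
    # Break the draft into independently checkable claims (toy heuristic).
    return [c.strip() for c in answer.rstrip(".").split(" and ")]

def checker(claim, knowledge_base):
    # Validates a single claim against external knowledge WITHOUT seeing
    # the solver's full response -- the deliberate information barrier.
    return claim in knowledge_base

def self_check(question, knowledge_base):
    claims = proposer(solver(question))
    return {c: checker(c, knowledge_base) for c in claims}
```

Because the checker only receives isolated claims, it cannot be anchored by the fluency of the original answer, which is the failure mode the summary says this separation prevents.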

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Mar 25, 2026

Xinying Guo, Chenxi Jiang, Hyun Bin Kim et al.

For robotic tasks with visual ambiguity, storing rich multimodal memory with geometric grounding outperforms semantic compression—robots need fine-grained context, not just similarity-based retrieval, to handle non-Markovian decision problems.

Chameleon is a memory system for robots that handles situations where the same visual observation could mean different things depending on what happened before. Instead of storing compressed summaries like most systems, it preserves detailed geometric and visual information to disambiguate confusing situations, enabling robots to make better decisions during long, complex manipulation tasks.

agents · multimodal

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Mar 25, 2026

Keliang Li, Yansong Li, Hongze Shen et al.

Giving AI agents control over their visual perception—deciding what to look at and when—significantly improves video reasoning accuracy. This active observation approach works as a plug-and-play upgrade for existing vision-language models.

LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them.

agents · multimodal · reasoning

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Mar 24, 2026

Haoyu Huang, Jinfa Huang, Zhongwei Wan et al.

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations—delivering significant speedups without accuracy loss.

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict what tools the main model will need, allowing expensive operations to be skipped or run in parallel. This cuts latency by 1.1-3.35x while maintaining accuracy, solving a key bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.

efficiency · multimodal · agents

Code Review Agent Benchmark

Mar 24, 2026

Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf et al.

Code review agents currently miss most issues that human reviewers catch, but they often flag different problems—creating opportunities for AI-assisted rather than AI-automated code review in real teams.

This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether code review agents catch the same issues humans do.

evaluation · agents · applications

Mecha-nudges for Machines

Mar 24, 2026

Giulio Frey, Kawin Ethayarajh

As AI agents make more real-world decisions, the way information is presented can be optimized for machines just like it is for humans—and this is already happening in practice on platforms like Etsy.

This paper introduces 'mecha-nudges'—subtle changes to how information is presented that influence AI agents' decisions without restricting options or harming human decision-making.

agents · alignment · evaluation

TiCo: Time-Controllable Training for Spoken Dialogue Models

Mar 23, 2026

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.

Spoken dialogue models can now follow duration constraints (e.g., 'respond in 15 seconds') by inserting time markers during generation, making them more practical for real-world voice applications.

TiCo is a post-training method that teaches spoken dialogue models to generate responses with specific durations. It uses time markers during generation to help models track elapsed speaking time and adjust content to meet target lengths, improving real-world voice assistant interactions without requiring new training data.

training · applications · agents
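The time-marker mechanism can be sketched over a token stream with known per-token durations. The marker format and interleaving rule below are illustrative, not TiCo's actual scheme.

```python
def insert_time_markers(tokens, durations, interval=1.0):
    """Interleave elapsed-time markers into a token stream so a model can
    track how long it has been speaking. Marker syntax is invented here."""
    out, elapsed, next_mark = [], 0.0, interval
    for tok, dur in zip(tokens, durations):
        out.append(tok)
        elapsed += dur
        while elapsed >= next_mark:          # emit one marker per interval crossed
            out.append(f"<t={next_mark:.0f}s>")
            next_mark += interval
    return out
```

A model trained on streams like this sees explicit elapsed-time signals during generation, so it can adjust remaining content to hit a target duration.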

AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Mar 20, 2026

Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak et al.

AI agents are ready to automate the repetitive technical work in experimental physics, letting researchers focus on novel insights and validation rather than coding routine analyses.

AI agents can now autonomously run physics experiments end-to-end, from data analysis to paper writing. Researchers showed that Claude can handle all stages of high-energy physics analysis—selecting events, estimating backgrounds, calculating uncertainties, and drawing conclusions—using only a dataset, code tools, and access to prior research papers.

agents · applications · reasoning

Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

Mar 20, 2026

Ruxiao Chen, Xilei Zhao, Thomas J. Cova et al.

LLMs can reason about human behavior more accurately by explicitly modeling beliefs as interconnected, time-varying graphs rather than static states—especially important for high-stakes domains like emergency response.

This paper improves how large language models reason about what people believe and why they act. Instead of treating beliefs as fixed, the authors model beliefs as a dynamic graph that changes over time, showing how new information updates what people think and how that shapes their decisions. They test this on disaster evacuation scenarios where understanding evolving beliefs is critical.

reasoning · agents · alignment

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Mar 20, 2026

Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Robots can now autonomously refine their social interactions by using VLMs to evaluate and improve their own behavior plans, eliminating the need for predefined motions or constant human guidance.

This paper presents CRISP, a framework that lets robots automatically improve their social behaviors by critiquing and replanning their own actions. Using a vision-language model as a virtual social critic, the system generates robot motions, evaluates them for social appropriateness, and iteratively refines them—all without human feedback.

agents · reasoning · multimodal

Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case

Mar 20, 2026

H. Sinan Bank, Daniel R. Herber, Thomas H. Bradley

Specification-driven design workflows can extend beyond software to physical engineering systems, enabling better human-AI collaboration by making design decisions explicit and auditable rather than ad hoc.

Design-OS is a structured workflow that helps engineers design physical systems (like control systems) by making requirements explicit and maintaining traceability from intent to final design. It organizes design into five stages with specifications as a shared contract between humans and AI agents, demonstrated on two different inverted pendulum platforms.

agents · applications

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Mar 19, 2026

Huaide Jiang, Yash Chaudhary, Yuping Wang et al.

Embodied navigation systems perform well in clean lab conditions but fail dramatically in real-world scenarios with sensor noise and unclear instructions—this benchmark exposes those gaps and provides mitigation strategies.

NavTrust is a benchmark that tests how well navigation AI systems handle real-world problems like blurry images, sensor noise, and unclear instructions. The researchers tested seven state-of-the-art systems and found they all struggle significantly when inputs are corrupted, then demonstrated four strategies to make them more robust.

evaluation · safety · agents

Online Learning and Equilibrium Computation with Ranking Feedback

Mar 19, 2026

Mingyang Liu, Yongshan Chen, Zhiyuan Fan et al.

Learning from rankings instead of numeric feedback is fundamentally harder, but becomes tractable when the environment changes slowly—with applications to game theory and LLM routing systems.

This paper studies online learning when you only get ranking feedback (like "action A is better than B") instead of numeric scores. The researchers show when this is impossible and develop algorithms that work well when utility changes slowly. They prove these algorithms help players reach fair game equilibria and test them on routing large language models.

reasoning · agents
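One generic way to learn from rankings alone is positional (Borda-style) credit, sketched below. This is a textbook stand-in for intuition, not the paper's algorithms.

```python
def borda_update(scores, ranking):
    """Credit actions by their position in one observed ranking (best first):
    top place earns n-1 points, last place earns 0."""
    n = len(ranking)
    for pos, action in enumerate(ranking):
        scores[action] = scores.get(action, 0) + (n - 1 - pos)
    return scores

def best_action(scores):
    # Exploit the action with the highest accumulated ranking credit.
    return max(scores, key=scores.get)
```

Note the learner never observes how much better one action is than another, only the order, which is exactly the information gap that makes the ranking-feedback setting harder than numeric rewards.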

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Mar 19, 2026

Zehao Li, Zhenyu Wu, Yibo Zhao et al.

Breaking reward evaluation into smaller, verifiable steps with multiple reviewers produces more reliable feedback for training GUI agents, improving task success by 10% in online learning scenarios.

OS-Themis is a reward evaluation system for GUI agents that breaks down task trajectories into verifiable milestones and uses multiple reviewers to judge whether agents completed tasks correctly. This approach improves both the accuracy of reward signals and the performance of agents trained with reinforcement learning on mobile and desktop interfaces.

agents · evaluation · training

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Mar 19, 2026

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari et al.

Instead of comparing kernels to other software implementations, this benchmark measures how close optimized kernels get to theoretical hardware limits—giving AI systems a clear, unchanging target for optimization rather than a moving baseline.

SOL-ExecBench is a benchmark for evaluating GPU kernel optimization that measures performance against hardware limits rather than software baselines. It includes 235 CUDA kernels from real AI models and uses analytically derived 'Speed-of-Light' bounds to create fixed optimization targets, enabling fair evaluation of AI systems that generate and optimize code.

evaluation · efficiency · agents

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Mar 19, 2026

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah et al.

Vision-language models need explicit metric reasoning to ground spatial language in 3D environments—decomposing queries into semantic and spatial components and combining them probabilistically improves grounding accuracy for robot navigation tasks.

This paper tackles the problem of robots understanding natural language commands that mix semantic meaning with precise spatial measurements, like 'go two meters right of the fridge.'

multimodal · agents

CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

Mar 19, 2026

Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan et al.

Neural solvers can now handle the combined complexity of coordinating multiple agents with competing objectives, generalizing across different team sizes and problem instances better than conventional heuristics.

CAMO is a neural network solver that helps teams of robots visit multiple locations while balancing competing goals like travel time and total distance. It uses a conditional encoder to handle different preference trade-offs and a collaborative decoder to coordinate multiple robots, outperforming traditional optimization methods on this complex multi-agent, multi-objective problem.

reasoning · agents

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Mar 18, 2026

Zhang Zhang, Shuqi Lu, Hongjin Qian et al.

Instead of storing agent experiences as text, storing them as executable code lets agents reuse and improve solutions reliably across different tasks and systems.

AgentFactory is a framework that helps AI agents learn and improve by saving successful task solutions as reusable Python code (subagents) rather than just text descriptions. These saved subagents get refined over time based on how well they work, creating a growing library that makes future similar tasks easier to solve without human help.

agents · training · applications

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Mar 18, 2026

Amine Lbath

Automated vulnerability injection with proof-of-concept exploits can scale up realistic training datasets for repository-level security detection, moving beyond function-level benchmarks to test how AI handles real-world code complexity.

This research creates an automated system to generate large-scale datasets for training AI models to detect software vulnerabilities in real code repositories.

data · safety · agents

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Mar 18, 2026

Pepe Alonso

For AI agents writing code, showing them which tests to check matters more than telling them to follow test-driven development procedures—context beats process.

TDAD is a tool that helps AI coding agents avoid breaking existing tests when fixing bugs. It uses code analysis to identify which tests might be affected by changes, then guides the agent to verify those specific tests before submitting fixes. Testing on real-world code shows it cuts regressions by 70% and improves fix success rates.

agents · evaluation
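The graph-based impact analysis in the TDAD summary can be sketched as a reverse-call-graph walk. The data shapes here (`call_graph` mapping each function to its callers, `tests` mapping test names to invoked functions) are assumptions for the example, not TDAD's actual representation.

```python
from collections import deque

def affected_tests(call_graph, tests, changed):
    """Walk callers transitively from the changed functions, then return
    the tests that touch any function in that affected set."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        fn = queue.popleft()
        for caller in call_graph.get(fn, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(t for t, fns in tests.items() if seen & set(fns))
```

The output is the concrete context the summary says matters: instead of a "run the tests" instruction, the agent gets the specific tests its change can break.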

Specification-Aware Distribution Shaping for Robotics Foundation Models

Mar 18, 2026

Sadık Bera Yüksel, Derya Aksaray

You can enforce formal safety constraints on pretrained robotics models without retraining by adjusting their output distributions at inference time using temporal logic specifications.

This paper adds safety guardrails to robotics foundation models by reshaping their action distributions at runtime to satisfy formal specifications. Instead of retraining the model, it uses forward simulation to ensure the robot meets time-dependent constraints like "visit location A before time T, then location B" while staying as close as possible to the model's original decisions.

safety · agents · reasoning

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mar 18, 2026

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi et al.

By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.

VideoAtlas is a system for understanding long videos efficiently by representing them as a hierarchical grid that can be zoomed into recursively, rather than converting video to text.

efficiency · multimodal · agents
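Why navigation costs scale logarithmically can be seen in a toy sketch: if the video is a hierarchy of segments and the model descends into one child per level, a query touches O(log n) nodes rather than all n. The binary split below is an assumption for illustration; VideoAtlas's actual grid is richer.

```python
# Toy logarithmic navigation: descend into the relevant half at each level,
# so a video of n leaf segments costs ~log2(n) "zoom" steps per query.
def navigate(lo: int, hi: int, target: int, visited=None) -> list[tuple[int, int]]:
    visited = [] if visited is None else visited
    visited.append((lo, hi))            # one zoom per hierarchy level
    if hi - lo <= 1:
        return visited
    mid = (lo + hi) // 2
    # relevance check: which half contains the queried moment?
    if target < mid:
        return navigate(lo, mid, target, visited)
    return navigate(mid, hi, target, visited)

path = navigate(0, 1024, target=700)
print(len(path))   # 11 zoom steps for 1024 segments (log2(1024) + 1)
```

Doubling the video length adds one level to the hierarchy, hence one extra step, which is the claimed logarithmic compute scaling.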

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026

Sahil Sen, Elias Lumer, Anmol Gulati et al.

Structuring long conversation histories as timestamped events with intelligent retrieval guidance lets AI agents accurately answer complex questions about what happened weeks or months ago—critical for building chatbots that remember user preferences and history over extended periods.

Chronos is a memory system for AI chatbots that tracks conversations over months by breaking down dialogue into timestamped events and organizing them in structured calendars. When answering questions about past conversations, it uses dynamic prompts to guide retrieval across time ranges and handle complex multi-step reasoning, achieving 95.6% accuracy on long-term memory tasks.

agents · reasoning · data
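The core data structure is simple to sketch: a store of timestamped events that answers time-range queries in order. This is a minimal stand-in; Chronos's structured calendars and dynamic retrieval prompts are considerably richer.

```python
# Minimal timestamped event memory with time-range retrieval
# (illustrative events; not the paper's calendar format).
from datetime import date

events = [
    (date(2026, 1, 5), "user mentioned moving to Berlin"),
    (date(2026, 2, 14), "user booked a flight"),
    (date(2026, 3, 1), "user asked about apartment leases"),
]

def retrieve(start: date, end: date) -> list[str]:
    # Answer "what happened between start and end?" with an ordered slice.
    return [text for ts, text in sorted(events) if start <= ts <= end]

print(retrieve(date(2026, 2, 1), date(2026, 3, 31)))
```

Attaching explicit timestamps to events is what lets the retriever scope a question like "what did I plan last month?" to the right slice of history instead of searching the whole transcript.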

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Mar 17, 2026

Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.

Models that accurately perceive audio-visual information often fail at generating contextually appropriate conversational responses, showing that perception and interaction are separate skills that need independent evaluation.

SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics—specifically, identifying who's speaking, knowing when to interrupt, and generating natural interruptions. Testing 12 leading models reveals that understanding what's happening in a conversation doesn't automatically translate to responding appropriately in real dialogue.

evaluation · multimodal · agents

Internalizing Agency from Reflective Experience

Mar 17, 2026

Rui Ge, Yichao Fu, Yuyang Qian et al.

By teaching agents to learn from environmental feedback and explore alternative paths when they fail, LEAFE improves their problem-solving capacity across multiple attempts (Pass@k) better than methods that only optimize for single successful outcomes.

This paper introduces LEAFE, a training method that helps AI agents learn from their mistakes during long interactions with environments. Instead of just optimizing for final success, LEAFE teaches agents to reflect on feedback, backtrack to earlier decisions, try alternative approaches, and internalize these recovery strategies.

agents · reasoning · training

Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Mar 17, 2026

Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

You can train smaller language models to perform complex agentic tasks like presentation generation by using creative reward signals (like inverse task verification) and parameter-efficient fine-tuning, achieving 91% of large model quality with only 7B parameters.

This paper presents a reinforcement learning system that trains AI agents to automatically generate professional slide presentations. The key innovation is an "inverse specification reward" that checks if slides accurately convey their intended message by having an LLM try to recover the original brief from the generated slides.

agents · training
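The inverse specification reward can be sketched with a crude proxy: score the slides by how well a judge can recover the original brief from them. The paper uses an LLM as the judge; plain token overlap stands in below, and the function name is an illustrative assumption.

```python
# Toy inverse specification reward: compare the original brief against the
# brief a judge recovered from the generated slides (token Jaccard overlap
# stands in for the paper's LLM-based recovery).
def inverse_spec_reward(original_brief: str, recovered_brief: str) -> float:
    a = set(original_brief.lower().split())
    b = set(recovered_brief.lower().split())
    # 1.0 when the slides perfectly convey the brief, 0.0 when nothing survives.
    return len(a & b) / len(a | b)

print(round(inverse_spec_reward("launch plan for product x",
                                "plan for launching product x"), 2))
```

The appeal of an inverse check is that it rewards faithful communication rather than surface formatting: slides that look polished but lose the message score low.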

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Mar 16, 2026

Yibin Liu, Yaxing Lyu, Daqi Gao et al.

Reinforcement learning can transform passive video understanding models into active task evaluators by training them to generate explicit reasoning about progress toward goals—enabling smaller models to outperform much larger ones on robot manipulation tasks.

This paper introduces PRIMO R1, a 7B video AI model that learns to actively evaluate robot manipulation progress by using reinforcement learning to generate step-by-step reasoning. Unlike standard models that passively recognize what's happening, PRIMO R1 compares current robot states to task goals and predicts failures, achieving better accuracy than much larger models on robotic tasks.

reasoning · agents · multimodal

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Mar 16, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

You can now build frontier-level search agents without proprietary data—OpenSeeker proves that smart data synthesis (not scale) is the bottleneck, and releases everything needed to replicate it.

OpenSeeker is a fully open-source search agent that achieves state-of-the-art performance by synthesizing high-quality training data through two techniques: generating complex multi-hop reasoning tasks by reverse-engineering web graphs, and denoising agent trajectories using summarization.

agents · data · reasoning

Computational Concept of the Psyche

Mar 16, 2026

Anton Kolonin, Vladimir Krykov

AGI systems should be built around an agent's internal needs and goals as the core driver of learning and decision-making, rather than treating intelligence as separate from motivation.

This paper proposes a cognitive architecture for artificial general intelligence that models the psyche as an operating system managing an agent's needs, sensations, and actions. The approach formalizes AGI as an optimization problem where agents learn through experience to satisfy needs while managing uncertainty and minimizing existential risks.

architecture · reasoning · agents

Semantic Invariance in Agentic AI

Mar 13, 2026

I. de Zarzà, J. de Curtò, Jordi Cabot et al.

Model size doesn't guarantee robustness: smaller models like Qwen3-30B outperform much larger models at maintaining consistent reasoning when problems are rephrased, suggesting that scaling alone won't solve reliability issues for deployed AI agents.

This paper tests whether AI agents give consistent answers when you rephrase the same problem in different ways. The researchers found that larger models are actually less stable than smaller ones—a surprising result that challenges assumptions about model scaling.

evaluation · reasoning · agents
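The testing protocol is easy to reproduce in miniature: run the same problem through several paraphrases and measure how often the answers agree. The majority-agreement metric below is a plausible consistency measure, not necessarily the paper's exact one.

```python
# Sketch of a semantic-invariance check: fraction of paraphrase runs that
# agree with the majority answer (an illustrative metric).
from collections import Counter

def consistency(answers: list[str]) -> float:
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# Hypothetical answers to four paraphrases of the same problem:
print(consistency(["42", "42", "41", "42"]))
```

A robust model would score 1.0 here regardless of phrasing; the paper's finding is that larger models drift further from that ideal than smaller ones.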

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Mar 13, 2026

Zhengwei Xie, Zhisheng Chen, Ziyan Weng et al.

Embodied agents can continuously improve without retraining by organizing experiences with detailed failure diagnosis and using those insights to constrain and guide planning at test time.

Steve-Evolving is a framework that helps AI agents learn and improve from their experiences in open-world environments like Minecraft. Instead of updating model weights, it organizes what the agent learns into structured experiences, diagnoses why actions succeed or fail in detail, and uses those insights to guide future planning through retrieved skills and safety guardrails.

agents · reasoning · training

Security Considerations for Artificial Intelligence Agents

Mar 12, 2026

Ninghui Li, Kaiyuan Zhang, Kyle Polley et al.

AI agents introduce fundamentally new security challenges because they blur the line between code and data, and can execute actions across systems—developers need layered defenses including input filtering, sandboxing, and strict privilege controls.

This paper identifies security risks in AI agents—systems that can take actions in the real world—and proposes defenses. It covers new attack types like prompt injection and confused-deputy problems, explains how current protections work (sandboxing, policy enforcement), and highlights gaps in standards and research needed to secure multi-agent systems.

safety · agents · architecture

Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Mar 12, 2026

Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur et al.

LLMs can augment creative scientific reasoning by treating interdisciplinary research as a structured exploration process: decompose goals into questions, find analogous problems in other fields, then synthesize insights back into your domain.

Idea-Catalyst is a framework that helps researchers and AI systems discover creative interdisciplinary insights by systematically connecting research challenges across different fields.

reasoning · applications · agents

WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

Mar 12, 2026

Taylor Paul, William Regli

Automated planning can solve the joint problem of designing distributed data pipelines and scheduling them on real infrastructure, enabling users to specify workflows declaratively rather than imperatively.

This paper introduces WORKSWORLD, a planning domain for automatically designing and scheduling data pipelines across distributed computer systems. Instead of manually specifying how data flows between processing components, users describe their data sources, available tools, and desired outputs—and an AI planner figures out the optimal workflow and resource allocation.

reasoning · agents · applications

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski et al.

Current document-reasoning agents succeed through exhaustive search rather than strategic thinking—they need better planning abilities, not just more attempts, to handle real-world document workflows efficiently.

This paper introduces MADQA, a benchmark with 2,250 questions across 800 PDF documents, to test whether AI agents can strategically navigate documents or just randomly search. The researchers found that while agents match human accuracy on some questions, they use brute-force trial-and-error rather than smart planning, and fall 20% short of optimal performance.

evaluation · agents · reasoning

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Mar 12, 2026

Zexuan Yan, Jiarui Jin, Yue Ma et al.

You can improve any text-to-image model's ability to render complex text and formulas without retraining—just add an agentic workflow that guides the generation process using glyph templates.

GlyphBanana solves the problem of generating accurate text and mathematical formulas in images by using an agentic workflow that guides existing text-to-image models. Instead of retraining models, it injects glyph templates into the model's internal representations to iteratively improve text rendering quality.

agents · multimodal · applications

LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Mar 12, 2026

Feiyu Duan, Xuanjing Huang, Zhongyu Wei

Current LLMs struggle with implicit user intentions and long-term preference modeling—they can handle immediate requests but fail to understand what users really need or remember their preferences over extended interactions.

LifeSim creates realistic simulated users with beliefs, desires, and intentions to test how well AI assistants handle long-term, multi-scenario interactions. The benchmark evaluates whether AI can understand both explicit requests and hidden user needs, maintain accurate user profiles over time, and provide contextually appropriate responses across 1,200 diverse life scenarios.

evaluation · agents · applications

Automatic Generation of High-Performance RL Environments

Mar 12, 2026

Seth Karten, Rahul Dev Appapogu, Chi Jin

AI agents can now automatically translate RL environments into optimized implementations (Rust, JAX, GPU-parallel code) in hours instead of months, with built-in verification ensuring the fast version behaves identically to the original.

This paper shows how to automatically generate high-performance RL environments using AI agents with a generic prompt template, verification checks, and iterative repair.

agents · efficiency · training

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Feb 26, 2026

Dany Haddad, Dan Bareket, Joseph Chee Chang et al.

Scientists use AI research tools as collaborative partners, not search engines—they write complex queries, reuse outputs, and dig into citations.

Researchers analyzed how scientists actually use AI-powered research tools by studying over 200,000 real queries and interactions. They found that scientists write longer, more complex questions than traditional search, treat AI as a research partner for drafting and brainstorming, and revisit AI responses like documents rather than one-off answers.

applications · evaluation · agents

Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Feb 26, 2026

Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts et al.

Breaking complex financial tasks into specific subtasks for AI agents produces better trading returns than giving them broad instructions.

This paper builds a trading system using multiple AI agents that work together like an investment team. Instead of giving agents vague instructions, the researchers break down stock analysis into specific, detailed tasks—like analyzing financial statements separately from news.

agents · applications · reasoning

ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Feb 26, 2026

Tianjun Yao, Yongqiang Chen, Yujia Zheng et al.

Agents that reflect on their mistakes in diverse ways solve problems better—and you can teach this diversity by storing reflection patterns as learnable model parameters.

This paper introduces ParamMem, a memory module that helps AI agents think better by learning from past mistakes in diverse ways. Instead of repeating the same reflection patterns, the system stores reflection strategies as model parameters, allowing agents to generate varied self-corrections.

agents · reasoning · training

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Feb 26, 2026

Hyungyung Lee, Hangyul Yoon, Edward Choi

AI medical diagnosis becomes more trustworthy when it shows its evidence instead of just giving answers.

This paper presents CXReasonAgent, a system that helps AI diagnose chest X-rays by combining a language model with specialized medical tools. Instead of just guessing answers like typical AI models, it shows its work by pointing to specific evidence in the image.

agents · multimodal · safety

Evaluating Stochasticity in Deep Research Agents

Feb 26, 2026

Haotian Zhai, Elias Stengel-Eskin, Pratik Patil et al.

AI research agents are unreliable in production because of randomness in how they search, summarize, and reason—but this variability can be reduced substantially while keeping answer quality high.

Research agents that gather information to answer questions produce different results each time you run them with the same question. This paper identifies where that randomness comes from and proposes ways to make these systems more reliable—reducing variability by 22% while keeping quality high.

agents · evaluation · safety

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Feb 26, 2026

Siyuan Liu, Jiahui Xu, Feng Jiang et al.

Voice assistants can respond 19-51% faster by processing speech, reasoning, and speech generation in parallel instead of waiting for each step to finish.

This paper solves a real problem with voice assistants: they're slow because they wait for you to finish talking, then transcribe everything, think about the answer, and finally speak. The new DDTSR system lets the AI start responding while still listening and thinking—like a human conversation.

efficiency · agents · architecture

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Feb 26, 2026

Jiangxin Sun, Feng Xue, Teng Long et al.

Autonomous driving systems can make safer decisions in unexpected situations by predicting consequences and evaluating risk, rather than just copying expert behavior.

This paper tackles a critical problem in autonomous driving: current AI systems learn by copying expert drivers, but fail when encountering unusual situations they've never seen before. The researchers propose RaWMPC, a system that predicts what will happen if the car takes different actions, then picks the safest option—without needing expert examples.

safety · agents · training

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Feb 26, 2026

Yutong Wang, Siyuan Xiong, Xuebo Liu et al.

You can improve multi-agent system reliability at inference time by filtering and correcting agent outputs, without expensive retraining.

AgentDropoutV2 fixes errors in multi-agent AI systems without retraining. It works like a quality filter at test time—catching bad outputs from individual agents, correcting fixable errors using past failure patterns, and removing unfixable ones to prevent mistakes from spreading. The system improved math problem accuracy by 6.3% on average.

agents · reasoning · efficiency
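The rectify-or-reject pattern can be sketched as a test-time filter over agent outputs: correct errors that match a known failure pattern, and drop outputs that can't be fixed so the mistake never propagates downstream. The rules below are hypothetical placeholders; the paper learns its correction patterns from past failures.

```python
# Sketch of test-time rectify-or-reject filtering (hypothetical rules:
# a trailing "?" marks an unfixable output, a missing "." a fixable one).
def rectify_or_reject(outputs: list[str]) -> list[str]:
    kept = []
    for out in outputs:
        if out.endswith("?"):        # unfixable: the agent returned a question
            continue                  # reject so the error cannot spread
        if not out.endswith("."):     # fixable formatting error
            out = out + "."           # rectify using a known failure pattern
        kept.append(out)
    return kept

print(rectify_or_reject(["x = 4.", "x = 4", "is x 4?"]))
```

Because the filter runs purely at inference time, it composes with any multi-agent pipeline without touching the underlying model weights.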

A Model-Free Universal AI

Feb 26, 2026

Yegon Kim, Juho Lee

You don't need to model the environment to build an optimal AI agent—learning action values directly can be just as powerful.

This paper introduces AIQI, the first AI agent that learns optimal behavior without building an explicit model of its environment. Instead of predicting how the world works, it directly learns which actions produce the best outcomes. This is a theoretical breakthrough showing that model-free approaches can match the performance of model-based agents in general reinforcement learning.

reasoning · training · agents

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Feb 26, 2026

Zhou Xu, Bowen Zhou, Qi Wang et al.

You can make GUI agents 3x faster by intelligently pruning screenshots and history instead of compressing everything uniformly.

This paper solves a major speed problem for AI agents that control computer screens by smartly removing unnecessary information from screenshots and action history. Instead of treating all parts of an image equally, it keeps important interactive elements while discarding redundant details, achieving 3.3x faster processing with minimal accuracy loss.

efficiency · agents · evaluation
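Non-uniform pruning of this kind reduces to a simple operation: score each token, keep the top-k, preserve the original order. The importance scores below are made-up stand-ins; the paper derives them from interactivity and redundancy signals in the screenshot and action history.

```python
# Toy importance-based token pruning: keep the `keep` highest-scoring
# tokens while preserving their original spatial/temporal order.
def prune(tokens: list[str], scores: list[float], keep: int) -> list[str]:
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]

patches = ["background", "button:OK", "whitespace", "textbox", "border"]
scores  = [0.1, 0.9, 0.05, 0.8, 0.2]
print(prune(patches, scores, keep=2))   # interactive elements survive pruning
```

The contrast with uniform compression is visible even in the toy: dropping 60% of the patches costs nothing here because the discarded ones carry no interactive content.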

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

Feb 26, 2026

Aishik Sanyal

Adding emotional feedback to AI agents makes them more stable and deliberate, not just more human-like—a practical insight for agent builders.

This paper builds an AI agent called ReCoN-Ipsundrum that adds memory loops and emotional signals to test whether machines can show consciousness-like behaviors.

agents · architecture · reasoning

Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language

Feb 26, 2026

Max S. Bennett, Thomas P. Zollo, Richard Zemel

You can now control what AI models learn and remember by giving them natural language instructions, making them adaptable to changing priorities.

This paper introduces a neural memory system that lets you tell an AI model what to remember and what to ignore using natural language instructions.

training · agents

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Feb 26, 2026

Elzo Brito dos Santos Filho

Separate agent planning from execution: agents output intentions, a deterministic system executes them and logs everything, preventing state loss and making every change traceable.

This paper solves a critical problem with AI agents: they lose track of what they're doing over long tasks and can't reliably execute code changes. ESAA is an architecture that separates what an agent *intends* to do from what actually *happens* in your codebase.

agents · architecture · applications
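The event-sourcing split ESAA describes can be sketched in a few lines: agents only append intent events to a log, and a deterministic executor folds that log into the current state, so the state is always recoverable and every change is auditable. The event names and state shape below illustrate the pattern, not ESAA's actual schema.

```python
# Event-sourcing sketch: agents record intentions; a deterministic
# replay of the log yields the authoritative state.
log: list[tuple[str, str, str]] = []

def intend(event: str, path: str, content: str) -> None:
    log.append((event, path, content))   # agents only append intentions

def replay(events) -> dict[str, str]:
    state: dict[str, str] = {}
    for event, path, content in events:  # deterministic fold over the log
        if event == "write":
            state[path] = content
        elif event == "delete":
            state.pop(path, None)
    return state

intend("write", "app.py", "v1")
intend("write", "app.py", "v2")
intend("delete", "tmp.txt", "")
print(replay(log))   # state is always recoverable from the log
```

Because the log is append-only, a crashed or confused agent loses nothing: replaying the events reconstructs exactly where the task stood.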