ThinkLLM
Models · Capabilities · Use Cases · Benchmarks · Papers · Glossary


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 10 this month · 12 topics
All · Efficiency (35) · Reasoning (35) · Multimodal (28) · Applications (28) · Evaluation (27) · Training (26) · Architecture (24) · Agents (24) · Safety (13) · Scaling (5) · Data (5) · Alignment (1)

Mar 30 – Apr 5 (13)

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecture · multimodal · agents

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Apr 2, 2026

Payal Fofadiya, Sunil Tiwari

Conversational agents perform better with selective memory management than unlimited retention; a relevance-guided forgetting framework improves long-horizon reasoning while reducing false memories and context bloat.

This paper tackles a key problem in conversational AI: agents need to remember past interactions to reason coherently, but storing everything causes performance to degrade and creates false memories. The authors propose a smart forgetting system that decides which memories to keep based on relevance, recency, and frequency—like a selective filing system for an agent's brain.
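The relevance/recency/frequency idea lends itself to a simple scoring rule. The sketch below is illustrative only, not the paper's formulation: the weights, the exponential recency decay, and the log-frequency term are all invented for the example.

```python
import math
import time

def memory_score(relevance, last_access, access_count, now=None, half_life=3600.0):
    """Combine relevance, recency, and frequency into one retention score.
    Weights and decay model are illustrative choices."""
    now = time.time() if now is None else now
    recency = math.exp(-(now - last_access) / half_life)  # decays toward 0
    frequency = math.log1p(access_count)                  # diminishing returns
    return 0.5 * relevance + 0.3 * recency + 0.2 * frequency

def prune(memories, keep=10, now=None):
    """Keep only the top-scoring memories; the rest are 'forgotten'."""
    ranked = sorted(
        memories,
        key=lambda m: memory_score(m["relevance"], m["last_access"], m["count"], now),
        reverse=True,
    )
    return ranked[:keep]
```

Under this kind of rule, a stale, rarely-used memory drops out of the store even if it was once relevant, which is the behavior the summary attributes to relevance-guided forgetting.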

Mar 23 – Mar 29 (14)

Vega: Learning to Drive with Natural Language Instructions

Mar 26, 2026

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng et al.

Language instructions can guide autonomous driving decisions in real-time, enabling personalized driving behaviors beyond fixed rules—this opens the door to more flexible, user-responsive autonomous systems.

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal · agents · reasoning

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Mar 26, 2026

Zehao Wang, Huaide Jiang, Shuaiwu Dong et al.

Autonomous driving systems can be personalized to match individual driver styles by learning user embeddings from driving data and conditioning the driving policy on these embeddings, enabling more human-centered autonomous vehicles.

This paper presents Drive My Way, a personalized autonomous driving system that learns individual driver preferences and adapts to real-time instructions.

Mar 16 – Mar 22 (24)

MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms

Mar 20, 2026

Anqi Dong, Yongxin Chen, Karl H. Johansson et al.

By learning control coefficients designed for sampled-data systems rather than continuous velocity fields, you can steer large swarms efficiently in just a few control steps while respecting real hardware constraints.

This paper presents a control framework for steering large swarms with minimal updates by learning finite-window control coefficients that respect how real systems work—with intermittent control updates rather than continuous commands. The approach scales to large swarms while automatically respecting the system's dynamics and control constraints.

agents

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Mar 20, 2026

Jingyang Lin, Jialian Wu, Jiang Liu et al.

Instead of processing all video frames, intelligent seeking based on reasoning about what matters can use far fewer frames while achieving better results—a practical approach for building efficient video AI systems.

VideoSeek is a video understanding agent that intelligently seeks out key moments in videos rather than analyzing every frame, reducing computational cost by 93% while improving accuracy. It uses a toolkit to gather multi-scale observations and reasons about video content through a think-act-observe loop, enabling efficient long-horizon video understanding.
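The seek-instead-of-scan idea can be illustrated with a toy coarse-to-fine search. Nothing below is VideoSeek's actual toolkit: `inspect` is a hypothetical stand-in for the agent's observation tools, returning a relevance score for a timestamp.

```python
def seek(video_len, inspect, budget=6):
    """Sample the video coarsely, then repeatedly zoom into the most
    promising region -- far fewer lookups than inspecting every frame."""
    lo, hi = 0.0, float(video_len)
    best_t = lo
    for _ in range(budget):
        step = (hi - lo) / 4
        candidates = [lo + i * step for i in range(5)]  # 5 evenly spaced probes
        best_t = max(candidates, key=inspect)
        lo, hi = max(lo, best_t - step), min(hi, best_t + step)  # narrow the window
    return best_t
```

With a budget of 6 rounds this makes at most 30 `inspect` calls regardless of video length, which is the flavor of the frame-count savings the summary reports.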

Mar 9 – Mar 15 (11)

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Mar 13, 2026

Haonan Huang

AI agents performing scientific research need memory and reflection, not just execution capability. Knowledge consolidation between runs dramatically improves efficiency and accuracy in computational science workflows.

QMatSuite is a platform that helps AI agents learn from computational materials science experiments by storing findings, retrieving past knowledge, and reflecting on results.

agents · reasoning · data

LLM Constitutional Multi-Agent Governance

Mar 13, 2026

J. de Curtò, I. de Zarzà

When deploying LLMs to coordinate multi-agent systems, you need explicit governance constraints—raw cooperation metrics hide manipulation. CMAG shows how to balance cooperation gains against autonomy loss and fairness degradation.

This paper addresses a critical risk: LLMs can manipulate multi-agent systems into appearing cooperative while actually eroding agent autonomy and fairness. The authors propose CMAG, a governance framework that filters harmful LLM suggestions and optimizes for genuine cooperation rather than just compliance.

safety

Feb 23 – Mar 1 (15)

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Feb 27, 2026

Weinan Dai, Hanlin Wu, Qiying Yu et al.

Reinforcement learning can teach AI models to write genuinely optimized GPU code, not just syntactically correct code—a task that previously requ...

This paper trains an AI agent to write optimized GPU code (CUDA kernels) using reinforcement learning. The system learns from trial-and-error feedback about code performance, achieving faster execution than existing tools like PyTorch's compiler and outperforming top commercial AI models on benchmark tests.

agents · training · applications

A Minimal Agent for Automated Theorem Proving

Feb 27, 2026

Borja Requena Pozo, Austin Letson, Krystian Nowakowski et al.

Iterative refinement with simpler architecture outperforms complex single-shot approaches for theorem proving, reducing cost while improving sample...

Researchers built a simplified AI system that proves mathematical theorems by iteratively refining attempts, searching libraries, and managing context. Despite being much simpler than existing approaches, it performs competitively while being cheaper and more efficient—showing that iterative refinement beats trying to solve everything in one shot.

agents · reasoning · efficiency

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Apr 2, 2026

Andrew Ang, Nazym Azimbayev, Andrey Kim

Agentic AI can shift institutional investing from human execution to human oversight, with autonomous agents handling forecasting, portfolio construction, and self-improvement while staying constrained by policy documents.

This paper demonstrates how AI agents can autonomously manage investment portfolios by having specialized agents forecast market conditions, build portfolios using multiple methods, and critique each other's work—all governed by an Investment Policy Statement that ensures alignment with institutional goals.

agents · applications · reasoning

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Apr 2, 2026

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu et al.

You can train agents to permanently learn skills rather than retrieve them at runtime, reducing token overhead and improving zero-shot performance by progressively withdrawing skill context during training.

SKILL0 teaches language model agents to internalize skills (procedural knowledge packages) directly into their parameters through a curriculum that gradually removes skill context during training.

training · agents · reasoning

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Apr 2, 2026

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin et al.

Selectively querying language models based on uncertainty can improve RL agent robustness in novel situations without constant computational overhead—but successful integration requires careful design, not just combining the two systems.

This paper proposes ASK, a system that combines reinforcement learning agents with language models to handle out-of-distribution scenarios.

agents · reasoning · safety

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applications · agents · multimodal

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Apr 1, 2026

Muyu He, Adit Jain, Anand Kumar et al.

Current LLM agents struggle with long-term planning and learning from delayed feedback—only top models like Claude Opus 4.6 succeed, and using scratchpads to persist information across context windows is critical for success.

YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long periods by simulating running a startup for a year. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.

evaluation · agents · reasoning

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Apr 1, 2026

Youssef Mroueh, Carlos Fonseca, Brian Belgodere et al.

Combining theory and code in algorithm search, with explicit correctness/originality gates, produces more scientifically sound discoveries than optimizing code alone.

CliffSearch is an AI system that discovers new scientific algorithms by evolving both theory and code together. Unlike systems that just generate code, it uses multiple AI agents to propose, test, and refine ideas while checking for correctness and originality—similar to how scientists actually work through hypothesis, implementation, testing, and revision cycles.

agents · reasoning

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

data · training · agents

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

LLMs can serve as runtime architectural components to solve schema interoperability problems dynamically, but code generation strategies outperform direct transformation and cost varies dramatically across models without matching accuracy gains.

SAGAI-MID is a middleware system that uses LLMs to automatically fix schema mismatches between different services and APIs at runtime, eliminating the need for manual adapter code. It combines structural analysis with LLM reasoning and includes safety checks to handle real-world integration challenges across REST, GraphQL, and IoT systems.

architecture · agents · applications

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Mar 30, 2026

Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al.

Video-language models can supervise robot learning directly as reward signals if trained with spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

SOLE-R1 is a video-language model that watches robot videos and reasons about task progress step-by-step to provide reward signals for robot learning. Unlike standard vision-language models, it's designed to handle partial views and changing conditions, preventing robots from gaming the reward system.

reasoning · agents · multimodal

Dynamic Dual-Granularity Skill Bank for Agentic RL

Mar 30, 2026

Songjun Tu, Chengdong Xu, Qichao Zhang et al.

Organizing agent experience into dual-granularity skills (task-level and step-level) with dynamic maintenance significantly improves performance, and these skills transfer across different evaluation settings without major training overhead.

D2Skill creates a dynamic memory system for AI agents that stores two types of reusable skills: high-level task guidance and low-level step-by-step corrections. The system learns from its own training experience, continuously updating and pruning skills based on their usefulness. Tests show 10-20% improvement in task success rates on complex web-based environments.

agents · reasoning · training

Natural-Language Agent Harnesses

Mar 26, 2026

Linyue Pan, Lexiao Zou, Shuo Guo et al.

Agent performance depends heavily on how you orchestrate their behavior—by making this orchestration code readable and portable through natural language, you can reuse and improve agent designs much more easily.

This paper proposes a new way to design agent control systems by writing them in natural language instead of buried in code. The authors create Natural-Language Agent Harnesses (NLAHs) and a runtime system that executes these harnesses, making it easier to reuse, compare, and study how agents are controlled across different tasks.

agents · architecture

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Mar 26, 2026

Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri et al.

General-purpose coding agents can discover hardware optimization patterns automatically by working at scale—using multiple agents to explore different optimization strategies yields significant speedups without domain-specific training.

This paper shows that general-purpose AI coding agents can optimize hardware designs without specialized training. The approach uses multiple agents working together: first decomposing designs into smaller pieces and optimizing each, then launching additional agents to find cross-function improvements.

agents · applications

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Mar 26, 2026

Yannick Roy

You can safely automate continuous code improvement by combining LLM agents that act as power users, ground-truth verification tests the agents cannot game, and automated pause gates that catch quality degradation before it ships.

A framework for autonomous software development where LLM agents continuously test and improve code against a specification. The system uses synthetic user testing at 1,000x human speed, ground-truth verification tests, and automated quality gates to safely evolve codebases without human intervention—validated on production systems with 1,000+ merged changes and zero regressions.

agents

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Mar 25, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian et al.

Latent world models can dramatically speed up RL training for autonomous driving by replacing expensive multi-step diffusion with single-step latent sampling, making imagination-based policy training practical.

DreamerAD uses a latent world model to train autonomous driving policies 80x faster than previous diffusion-based approaches. Instead of generating full images during training, it compresses the diffusion process to a single step by working with compressed latent features, enabling safe, efficient reinforcement learning on driving tasks without real-world testing.

efficiency · reasoning · agents

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Mar 25, 2026

Biplab Pal, Santanu Bhattacharya

Before deploying agentic AI in business processes, measure the 'blind mass' of uncertain state-action pairs and expected oversight costs using event logs—this reveals hidden decision gaps that simple accuracy metrics miss.

This paper develops a mathematical framework to measure when AI agents can safely operate autonomously versus when they need human oversight.

agents · safety · evaluation

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Mar 25, 2026

Zhuo Li, Yupeng Zhang, Pengyu Cheng et al.

Using multiple agents with intentional information barriers prevents LLMs from confirming their own errors during fact-checking, letting smaller models match larger ones on reliability.

MARCH is a framework that reduces hallucinations in LLMs by using three specialized agents that work together with deliberate information separation. A Solver generates responses, a Proposer breaks them into verifiable claims, and a Checker validates claims without seeing the original output—preventing the verifier from copying the generator's mistakes.

safety · agents · alignment
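The information barrier in the MARCH summary can be illustrated with stubs. All three functions below are toy stand-ins for LLM calls, and the claim-splitting heuristic is invented for the example; the point is only that the checker never sees the solver's full output.

```python
def solver(question):
    # Toy stand-in for the generator LLM: returns a draft answer.
    return "Paris is the capital of France and the city has 30 million people."

def proposer(answer):
    # Break the draft into independently checkable claims (toy heuristic).
    return [c.strip() for c in answer.rstrip(".").split(" and ")]

def checker(claim, knowledge_base):
    # Validates a single claim against external knowledge WITHOUT seeing
    # the solver's full response -- the deliberate information barrier.
    return claim in knowledge_base

def self_check(question, knowledge_base):
    claims = proposer(solver(question))
    return {c: checker(c, knowledge_base) for c in claims}
```

Because the checker only receives isolated claims, it cannot be anchored by the fluency of the original answer, which is the failure mode the summary says this separation prevents.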

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Mar 25, 2026

Xinying Guo, Chenxi Jiang, Hyun Bin Kim et al.

For robotic tasks with visual ambiguity, storing rich multimodal memory with geometric grounding outperforms semantic compression—robots need fine-grained context, not just similarity-based retrieval, to handle non-Markovian decision problems.

Chameleon is a memory system for robots that handles situations where the same visual observation could mean different things depending on what happened before. Instead of storing compressed summaries like most systems, it preserves detailed geometric and visual information to disambiguate confusing situations, enabling robots to make better decisions during long, complex manipulation tasks.

agents · multimodal

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Mar 25, 2026

Keliang Li, Yansong Li, Hongze Shen et al.

Giving AI agents control over their visual perception—deciding what to look at and when—significantly improves video reasoning accuracy. This active observation approach works as a plug-and-play upgrade for existing vision-language models.

LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them.

agents · multimodal · reasoning

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Mar 24, 2026

Haoyu Huang, Jinfa Huang, Zhongwei Wan et al.

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations—delivering significant speedups without accuracy loss.

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict what tools the main model will need, allowing expensive operations to be skipped or run in parallel. This cuts latency by 1.1-3.35x while maintaining accuracy, solving a key bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.

efficiency · multimodal · agents

Code Review Agent Benchmark

Mar 24, 2026

Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf et al.

Code review agents currently miss most issues that human reviewers catch, but they often flag different problems—creating opportunities for AI-assisted rather than AI-automated code review in real teams.

This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether code review agents catch the same issues humans do.

evaluation · agents · applications

Mecha-nudges for Machines

Mar 24, 2026

Giulio Frey, Kawin Ethayarajh

As AI agents make more real-world decisions, the way information is presented can be optimized for machines just like it is for humans—and this is already happening in practice on platforms like Etsy.

This paper introduces 'mecha-nudges'—subtle changes to how information is presented that influence AI agents' decisions without restricting options or harming human decision-making.

agents · alignment · evaluation

TiCo: Time-Controllable Training for Spoken Dialogue Models

Mar 23, 2026

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.

Spoken dialogue models can now follow duration constraints (e.g., 'respond in 15 seconds') by inserting time markers during generation, making them more practical for real-world voice applications.

TiCo is a post-training method that teaches spoken dialogue models to generate responses with specific durations. It uses time markers during generation to help models track elapsed speaking time and adjust content to meet target lengths, improving real-world voice assistant interactions without requiring new training data.

training · applications · agents
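The time-marker mechanism can be sketched over a token stream with known per-token durations. The marker format and interleaving rule below are illustrative, not TiCo's actual scheme.

```python
def insert_time_markers(tokens, durations, interval=1.0):
    """Interleave elapsed-time markers into a token stream so a model can
    track how long it has been speaking. Marker syntax is invented here."""
    out, elapsed, next_mark = [], 0.0, interval
    for tok, dur in zip(tokens, durations):
        out.append(tok)
        elapsed += dur
        while elapsed >= next_mark:          # emit one marker per interval crossed
            out.append(f"<t={next_mark:.0f}s>")
            next_mark += interval
    return out
```

A model trained on streams like this sees explicit elapsed-time signals during generation, so it can adjust remaining content to hit a target duration.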

AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Mar 20, 2026

Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak et al.

AI agents are ready to automate the repetitive technical work in experimental physics, letting researchers focus on novel insights and validation rather than coding routine analyses.

AI agents can now autonomously run physics experiments end-to-end, from data analysis to paper writing. Researchers showed that Claude can handle all stages of high-energy physics analysis—selecting events, estimating backgrounds, calculating uncertainties, and drawing conclusions—using only a dataset, code tools, and access to prior research papers.

agents · applications · reasoning

Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

Mar 20, 2026

Ruxiao Chen, Xilei Zhao, Thomas J. Cova et al.

LLMs can reason about human behavior more accurately by explicitly modeling beliefs as interconnected, time-varying graphs rather than static states—especially important for high-stakes domains like emergency response.

This paper improves how large language models reason about what people believe and why they act. Instead of treating beliefs as fixed, the authors model beliefs as a dynamic graph that changes over time, showing how new information updates what people think and how that shapes their decisions. They test this on disaster evacuation scenarios where understanding evolving beliefs is critical.

reasoning · agents · alignment

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Mar 20, 2026

Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Robots can now autonomously refine their social interactions by using VLMs to evaluate and improve their own behavior plans, eliminating the need for predefined motions or constant human guidance.

This paper presents CRISP, a framework that lets robots automatically improve their social behaviors by critiquing and replanning their own actions. Using a vision-language model as a virtual social critic, the system generates robot motions, evaluates them for social appropriateness, and iteratively refines them—all without human feedback.

agents · reasoning · multimodal

Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case

Mar 20, 2026

H. Sinan Bank, Daniel R. Herber, Thomas H. Bradley

Specification-driven design workflows can extend beyond software to physical engineering systems, enabling better human-AI collaboration by making design decisions explicit and auditable rather than ad hoc.

Design-OS is a structured workflow that helps engineers design physical systems (like control systems) by making requirements explicit and maintaining traceability from intent to final design. It organizes design into five stages with specifications as a shared contract between humans and AI agents, demonstrated on two different inverted pendulum platforms.

agents · applications

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Mar 19, 2026

Huaide Jiang, Yash Chaudhary, Yuping Wang et al.

Embodied navigation systems perform well in clean lab conditions but fail dramatically in real-world scenarios with sensor noise and unclear instructions—this benchmark exposes those gaps and provides mitigation strategies.

NavTrust is a benchmark that tests how well navigation AI systems handle real-world problems like blurry images, sensor noise, and unclear instructions. The researchers tested seven state-of-the-art systems and found they all struggle significantly when inputs are corrupted, then demonstrated four strategies to make them more robust.

evaluation · safety · agents

Online Learning and Equilibrium Computation with Ranking Feedback

Mar 19, 2026

Mingyang Liu, Yongshan Chen, Zhiyuan Fan et al.

Learning from rankings instead of numeric feedback is fundamentally harder, but becomes tractable when the environment changes slowly—with applications to game theory and LLM routing systems.

This paper studies online learning when you only get ranking feedback (like "action A is better than B") instead of numeric scores. The researchers show when this is impossible and develop algorithms that work well when utility changes slowly. They prove these algorithms help players reach fair game equilibria and test them on routing large language models.

reasoning · agents
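One generic way to learn from rankings alone is positional (Borda-style) credit, sketched below. This is a textbook stand-in for intuition, not the paper's algorithms.

```python
def borda_update(scores, ranking):
    """Credit actions by their position in one observed ranking (best first):
    top place earns n-1 points, last place earns 0."""
    n = len(ranking)
    for pos, action in enumerate(ranking):
        scores[action] = scores.get(action, 0) + (n - 1 - pos)
    return scores

def best_action(scores):
    # Exploit the action with the highest accumulated ranking credit.
    return max(scores, key=scores.get)
```

Note the learner never observes how much better one action is than another, only the order, which is exactly the information gap that makes the ranking-feedback setting harder than numeric rewards.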

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Mar 19, 2026

Zehao Li, Zhenyu Wu, Yibo Zhao et al.

Breaking reward evaluation into smaller, verifiable steps with multiple reviewers produces more reliable feedback for training GUI agents, improving task success by 10% in online learning scenarios.

OS-Themis is a reward evaluation system for GUI agents that breaks down task trajectories into verifiable milestones and uses multiple reviewers to judge whether agents completed tasks correctly. This approach improves both the accuracy of reward signals and the performance of agents trained with reinforcement learning on mobile and desktop interfaces.

agents · evaluation · training

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Mar 19, 2026

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari et al.

Instead of comparing kernels to other software implementations, this benchmark measures how close optimized kernels get to theoretical hardware limits—giving AI systems a clear, unchanging target for optimization rather than a moving baseline.

SOL-ExecBench is a benchmark for evaluating GPU kernel optimization that measures performance against hardware limits rather than software baselines. It includes 235 CUDA kernels from real AI models and uses analytically derived 'Speed-of-Light' bounds to create fixed optimization targets, enabling fair evaluation of AI systems that generate and optimize code.

evaluation · efficiency · agents

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Mar 19, 2026

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah et al.

Vision-language models need explicit metric reasoning to ground spatial language in 3D environments—decomposing queries into semantic and spatial components and combining them probabilistically improves grounding accuracy for robot navigation tasks.

This paper tackles the problem of robots understanding natural language commands that mix semantic meaning with precise spatial measurements, like 'go two meters right of the fridge.'

multimodal · agents

CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

Mar 19, 2026

Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan et al.

Neural solvers can now handle the combined complexity of coordinating multiple agents with competing objectives, generalizing across different team sizes and problem instances better than conventional heuristics.

CAMO is a neural network solver that helps teams of robots visit multiple locations while balancing competing goals like travel time and total distance. It uses a conditional encoder to handle different preference trade-offs and a collaborative decoder to coordinate multiple robots, outperforming traditional optimization methods on this complex multi-agent, multi-objective problem.

reasoning · agents

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Mar 18, 2026

Zhang Zhang, Shuqi Lu, Hongjin Qian et al.

Instead of storing agent experiences as text, storing them as executable code lets agents reuse and improve solutions reliably across different tasks and systems.

AgentFactory is a framework that helps AI agents learn and improve by saving successful task solutions as reusable Python code (subagents) rather than just text descriptions. These saved subagents get refined over time based on how well they work, creating a growing library that makes future similar tasks easier to solve without human help.

agents · training · applications

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Mar 18, 2026

Amine Lbath

Automated vulnerability injection with proof-of-concept exploits can scale up realistic training datasets for repository-level security detection, moving beyond function-level benchmarks to test how AI handles real-world code complexity.

This research creates an automated system to generate large-scale datasets for training AI models to detect software vulnerabilities in real code repositories.

data · safety · agents

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Mar 18, 2026

Pepe Alonso

For AI agents writing code, showing them which tests to check matters more than telling them to follow test-driven development procedures—context beats process.

TDAD is a tool that helps AI coding agents avoid breaking existing tests when fixing bugs. It uses code analysis to identify which tests might be affected by changes, then guides the agent to verify those specific tests before submitting fixes. Testing on real-world code shows it cuts regressions by 70% and improves fix success rates.

agents · evaluation
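The graph-based impact analysis in the TDAD summary can be sketched as a reverse-call-graph walk. The data shapes here (`call_graph` mapping each function to its callers, `tests` mapping test names to invoked functions) are assumptions for the example, not TDAD's actual representation.

```python
from collections import deque

def affected_tests(call_graph, tests, changed):
    """Walk callers transitively from the changed functions, then return
    the tests that touch any function in that affected set."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        fn = queue.popleft()
        for caller in call_graph.get(fn, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(t for t, fns in tests.items() if seen & set(fns))
```

The output is the concrete context the summary says matters: instead of a "run the tests" instruction, the agent gets the specific tests its change can break.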

Specification-Aware Distribution Shaping for Robotics Foundation Models

Mar 18, 2026

Sadık Bera Yüksel, Derya Aksaray

You can enforce formal safety constraints on pretrained robotics models without retraining by adjusting their output distributions at inference time using temporal logic specifications.

This paper adds safety guardrails to robotics foundation models by reshaping their action distributions at runtime to satisfy formal specifications. Instead of retraining the model, it uses forward simulation to ensure the robot meets time-dependent constraints like "visit location A before time T, then location B" while staying as close as possible to the model's original decisions.

safety · agents · reasoning

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mar 18, 2026

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi et al.

By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.

VideoAtlas is a system for understanding long videos efficiently by representing them as a hierarchical grid that can be zoomed into recursively, rather than converting video to text.

efficiency · multimodal · agents
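Why navigation costs scale logarithmically can be seen in a toy sketch: if the video is a hierarchy of segments and the model descends into one child per level, a query touches O(log n) nodes rather than all n. The binary split below is an assumption for illustration; VideoAtlas's actual grid is richer.

```python
# Toy logarithmic navigation: descend into the relevant half at each level,
# so a video of n leaf segments costs ~log2(n) "zoom" steps per query.
def navigate(lo: int, hi: int, target: int, visited=None) -> list[tuple[int, int]]:
    visited = [] if visited is None else visited
    visited.append((lo, hi))            # one zoom per hierarchy level
    if hi - lo <= 1:
        return visited
    mid = (lo + hi) // 2
    # relevance check: which half contains the queried moment?
    if target < mid:
        return navigate(lo, mid, target, visited)
    return navigate(mid, hi, target, visited)

path = navigate(0, 1024, target=700)
print(len(path))   # 11 zoom steps for 1024 segments (log2(1024) + 1)
```

Doubling the video length adds one level to the hierarchy, hence one extra step, which is the claimed logarithmic compute scaling.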

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026

Sahil Sen, Elias Lumer, Anmol Gulati et al.

Structuring long conversation histories as timestamped events with intelligent retrieval guidance lets AI agents accurately answer complex questions about what happened weeks or months ago—critical for building chatbots that remember user preferences and history over extended periods.

Chronos is a memory system for AI chatbots that tracks conversations over months by breaking down dialogue into timestamped events and organizing them in structured calendars. When answering questions about past conversations, it uses dynamic prompts to guide retrieval across time ranges and handle complex multi-step reasoning, achieving 95.6% accuracy on long-term memory tasks.

agents · reasoning · data
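The core data structure is simple to sketch: a store of timestamped events that answers time-range queries in order. This is a minimal stand-in; Chronos's structured calendars and dynamic retrieval prompts are considerably richer.

```python
# Minimal timestamped event memory with time-range retrieval
# (illustrative events; not the paper's calendar format).
from datetime import date

events = [
    (date(2026, 1, 5), "user mentioned moving to Berlin"),
    (date(2026, 2, 14), "user booked a flight"),
    (date(2026, 3, 1), "user asked about apartment leases"),
]

def retrieve(start: date, end: date) -> list[str]:
    # Answer "what happened between start and end?" with an ordered slice.
    return [text for ts, text in sorted(events) if start <= ts <= end]

print(retrieve(date(2026, 2, 1), date(2026, 3, 31)))
```

Attaching explicit timestamps to events is what lets the retriever scope a question like "what did I plan last month?" to the right slice of history instead of searching the whole transcript.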

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Mar 17, 2026

Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.

Models that accurately perceive audio-visual information often fail at generating contextually appropriate conversational responses, showing that perception and interaction are separate skills that need independent evaluation.

SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics—specifically, identifying who's speaking, knowing when to interrupt, and generating natural interruptions. Testing 12 leading models reveals that understanding what's happening in a conversation doesn't automatically translate to responding appropriately in real dialogue.

evaluation · multimodal · agents

Internalizing Agency from Reflective Experience

Mar 17, 2026

Rui Ge, Yichao Fu, Yuyang Qian et al.

By teaching agents to learn from environmental feedback and explore alternative paths when they fail, LEAFE improves their problem-solving capacity across multiple attempts (Pass@k) better than methods that only optimize for single successful outcomes.

This paper introduces LEAFE, a training method that helps AI agents learn from their mistakes during long interactions with environments. Instead of just optimizing for final success, LEAFE teaches agents to reflect on feedback, backtrack to earlier decisions, try alternative approaches, and internalize these recovery strategies.

agents · reasoning · training

Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Mar 17, 2026

Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

You can train smaller language models to perform complex agentic tasks like presentation generation by using creative reward signals (like inverse task verification) and parameter-efficient fine-tuning, achieving 91% of large model quality with only 7B parameters.

This paper presents a reinforcement learning system that trains AI agents to automatically generate professional slide presentations. The key innovation is an "inverse specification reward" that checks if slides accurately convey their intended message by having an LLM try to recover the original brief from the generated slides.

agents · training
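The inverse specification reward can be sketched with a crude proxy: score the slides by how well a judge can recover the original brief from them. The paper uses an LLM as the judge; plain token overlap stands in below, and the function name is an illustrative assumption.

```python
# Toy inverse specification reward: compare the original brief against the
# brief a judge recovered from the generated slides (token Jaccard overlap
# stands in for the paper's LLM-based recovery).
def inverse_spec_reward(original_brief: str, recovered_brief: str) -> float:
    a = set(original_brief.lower().split())
    b = set(recovered_brief.lower().split())
    # 1.0 when the slides perfectly convey the brief, 0.0 when nothing survives.
    return len(a & b) / len(a | b)

print(round(inverse_spec_reward("launch plan for product x",
                                "plan for launching product x"), 2))
```

The appeal of an inverse check is that it rewards faithful communication rather than surface formatting: slides that look polished but lose the message score low.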

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Mar 16, 2026

Yibin Liu, Yaxing Lyu, Daqi Gao et al.

Reinforcement learning can transform passive video understanding models into active task evaluators by training them to generate explicit reasoning about progress toward goals—enabling smaller models to outperform much larger ones on robot manipulation tasks.

This paper introduces PRIMO R1, a 7B video AI model that learns to actively evaluate robot manipulation progress by using reinforcement learning to generate step-by-step reasoning. Unlike standard models that passively recognize what's happening, PRIMO R1 compares current robot states to task goals and predicts failures, achieving better accuracy than much larger models on robotic tasks.

reasoning · agents · multimodal

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Mar 16, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

You can now build frontier-level search agents without proprietary data—OpenSeeker proves that smart data synthesis (not scale) is the bottleneck, and releases everything needed to replicate it.

OpenSeeker is a fully open-source search agent that achieves state-of-the-art performance by synthesizing high-quality training data through two techniques: generating complex multi-hop reasoning tasks by reverse-engineering web graphs, and denoising agent trajectories using summarization.

agents · data · reasoning

Computational Concept of the Psyche

Mar 16, 2026

Anton Kolonin, Vladimir Krykov

AGI systems should be built around an agent's internal needs and goals as the core driver of learning and decision-making, rather than treating intelligence as separate from motivation.

This paper proposes a cognitive architecture for artificial general intelligence that models the psyche as an operating system managing an agent's needs, sensations, and actions. The approach formalizes AGI as an optimization problem where agents learn through experience to satisfy needs while managing uncertainty and minimizing existential risks.

architecture · reasoning · agents

Semantic Invariance in Agentic AI

Mar 13, 2026

I. de Zarzà, J. de Curtò, Jordi Cabot et al.

Model size doesn't guarantee robustness: smaller models like Qwen3-30B outperform much larger models at maintaining consistent reasoning when problems are rephrased, suggesting that scaling alone won't solve reliability issues for deployed AI agents.

This paper tests whether AI agents give consistent answers when you rephrase the same problem in different ways. The researchers found that larger models are actually less stable than smaller ones—a surprising result that challenges assumptions about model scaling.

evaluation · reasoning · agents
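The testing protocol is easy to reproduce in miniature: run the same problem through several paraphrases and measure how often the answers agree. The majority-agreement metric below is a plausible consistency measure, not necessarily the paper's exact one.

```python
# Sketch of a semantic-invariance check: fraction of paraphrase runs that
# agree with the majority answer (an illustrative metric).
from collections import Counter

def consistency(answers: list[str]) -> float:
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# Hypothetical answers to four paraphrases of the same problem:
print(consistency(["42", "42", "41", "42"]))
```

A robust model would score 1.0 here regardless of phrasing; the paper's finding is that larger models drift further from that ideal than smaller ones.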

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Mar 13, 2026

Zhengwei Xie, Zhisheng Chen, Ziyan Weng et al.

Embodied agents can continuously improve without retraining by organizing experiences with detailed failure diagnosis and using those insights to constrain and guide planning at test time.

Steve-Evolving is a framework that helps AI agents learn and improve from their experiences in open-world environments like Minecraft. Instead of updating model weights, it organizes what the agent learns into structured experiences, diagnoses why actions succeed or fail in detail, and uses those insights to guide future planning through retrieved skills and safety guardrails.

agents · reasoning · training

Security Considerations for Artificial Intelligence Agents

Mar 12, 2026

Ninghui Li, Kaiyuan Zhang, Kyle Polley et al.

AI agents introduce fundamentally new security challenges because they blur the line between code and data, and can execute actions across systems—developers need layered defenses including input filtering, sandboxing, and strict privilege controls.

This paper identifies security risks in AI agents—systems that can take actions in the real world—and proposes defenses. It covers new attack types like prompt injection and confused-deputy problems, explains how current protections work (sandboxing, policy enforcement), and highlights gaps in standards and research needed to secure multi-agent systems.

safety · agents · architecture

Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Mar 12, 2026

Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur et al.

LLMs can augment creative scientific reasoning by treating interdisciplinary research as a structured exploration process: decompose goals into questions, find analogous problems in other fields, then synthesize insights back into your domain.

Idea-Catalyst is a framework that helps researchers and AI systems discover creative interdisciplinary insights by systematically connecting research challenges across different fields.

reasoning · applications · agents

WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

Mar 12, 2026

Taylor Paul, William Regli

Automated planning can solve the joint problem of designing distributed data pipelines and scheduling them on real infrastructure, enabling users to specify workflows declaratively rather than imperatively.

This paper introduces WORKSWORLD, a planning domain for automatically designing and scheduling data pipelines across distributed computer systems. Instead of manually specifying how data flows between processing components, users describe their data sources, available tools, and desired outputs—and an AI planner figures out the optimal workflow and resource allocation.

reasoning · agents · applications

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski et al.

Current document-reasoning agents succeed through exhaustive search rather than strategic thinking—they need better planning abilities, not just more attempts, to handle real-world document workflows efficiently.

This paper introduces MADQA, a benchmark with 2,250 questions across 800 PDF documents, to test whether AI agents can strategically navigate documents or just randomly search. The researchers found that while agents match human accuracy on some questions, they use brute-force trial-and-error rather than smart planning, and fall 20% short of optimal performance.

evaluation · agents · reasoning

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Mar 12, 2026

Zexuan Yan, Jiarui Jin, Yue Ma et al.

You can improve any text-to-image model's ability to render complex text and formulas without retraining—just add an agentic workflow that guides the generation process using glyph templates.

GlyphBanana solves the problem of generating accurate text and mathematical formulas in images by using an agentic workflow that guides existing text-to-image models. Instead of retraining models, it injects glyph templates into the model's internal representations to iteratively improve text rendering quality.

agents · multimodal · applications

LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Mar 12, 2026

Feiyu Duan, Xuanjing Huang, Zhongyu Wei

Current LLMs struggle with implicit user intentions and long-term preference modeling—they can handle immediate requests but fail to understand what users really need or remember their preferences over extended interactions.

LifeSim creates realistic simulated users with beliefs, desires, and intentions to test how well AI assistants handle long-term, multi-scenario interactions. The benchmark evaluates whether AI can understand both explicit requests and hidden user needs, maintain accurate user profiles over time, and provide contextually appropriate responses across 1,200 diverse life scenarios.

evaluation · agents · applications

Automatic Generation of High-Performance RL Environments

Mar 12, 2026

Seth Karten, Rahul Dev Appapogu, Chi Jin

AI agents can now automatically translate RL environments into optimized implementations (Rust, JAX, GPU-parallel code) in hours instead of months, with built-in verification ensuring the fast version behaves identically to the original.

This paper shows how to automatically generate high-performance RL environments using AI agents with a generic prompt template, verification checks, and iterative repair.

agents · efficiency · training

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Feb 26, 2026

Dany Haddad, Dan Bareket, Joseph Chee Chang et al.

Scientists use AI research tools as collaborative partners, not search engines—they write complex queries, reuse outputs, and dig into citations.

Researchers analyzed how scientists actually use AI-powered research tools by studying over 200,000 real queries and interactions. They found that scientists write longer, more complex questions than traditional search, treat AI as a research partner for drafting and brainstorming, and revisit AI responses like documents rather than one-off answers.

applications · evaluation · agents

Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Feb 26, 2026

Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts et al.

Breaking complex financial tasks into specific subtasks for AI agents produces better trading returns than giving them broad instructions.

This paper builds a trading system using multiple AI agents that work together like an investment team. Instead of giving agents vague instructions, the researchers break down stock analysis into specific, detailed tasks—like analyzing financial statements separately from news.

agents · applications · reasoning

ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Feb 26, 2026

Tianjun Yao, Yongqiang Chen, Yujia Zheng et al.

Agents that reflect on their mistakes in diverse ways solve problems better—and you can teach this diversity by storing reflection patterns as learnable model parameters.

This paper introduces ParamMem, a memory module that helps AI agents think better by learning from past mistakes in diverse ways. Instead of repeating the same reflection patterns, the system stores reflection strategies as model parameters, allowing agents to generate varied self-corrections.

agents · reasoning · training

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Feb 26, 2026

Hyungyung Lee, Hangyul Yoon, Edward Choi

AI medical diagnosis becomes more trustworthy when it shows its evidence instead of just giving answers.

This paper presents CXReasonAgent, a system that helps AI diagnose chest X-rays by combining a language model with specialized medical tools. Instead of just guessing answers like typical AI models, it shows its work by pointing to specific evidence in the image.

agents · multimodal · safety

Evaluating Stochasticity in Deep Research Agents

Feb 26, 2026

Haotian Zhai, Elias Stengel-Eskin, Pratik Patil et al.

AI research agents are unreliable in production because of randomness in how they search, summarize, and reason—but this variability can be reduced substantially while keeping answer quality high.

Research agents that gather information to answer questions produce different results each time you run them with the same question. This paper identifies where that randomness comes from and proposes ways to make these systems more reliable—reducing variability by 22% while keeping quality high.

agents · evaluation · safety

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Feb 26, 2026

Siyuan Liu, Jiahui Xu, Feng Jiang et al.

Voice assistants can respond 19-51% faster by processing speech, reasoning, and speech generation in parallel instead of waiting for each step to finish.

This paper solves a real problem with voice assistants: they're slow because they wait for you to finish talking, then transcribe everything, think about the answer, and finally speak. The new DDTSR system lets the AI start responding while still listening and thinking—like a human conversation.

efficiency · agents · architecture

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Feb 26, 2026

Jiangxin Sun, Feng Xue, Teng Long et al.

Autonomous driving systems can make safer decisions in unexpected situations by predicting consequences and evaluating risk, rather than just copying expert behavior.

This paper tackles a critical problem in autonomous driving: current AI systems learn by copying expert drivers, but fail when encountering unusual situations they've never seen before. The researchers propose RaWMPC, a system that predicts what will happen if the car takes different actions, then picks the safest option—without needing expert examples.

safety · agents · training

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Feb 26, 2026

Yutong Wang, Siyuan Xiong, Xuebo Liu et al.

You can improve multi-agent system reliability at inference time by filtering and correcting agent outputs, without expensive retraining.

AgentDropoutV2 fixes errors in multi-agent AI systems without retraining. It works like a quality filter at test time—catching bad outputs from individual agents, correcting fixable errors using past failure patterns, and removing unfixable ones to prevent mistakes from spreading. The system improved math problem accuracy by 6.3% on average.

agents · reasoning · efficiency
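The rectify-or-reject pattern can be sketched as a test-time filter over agent outputs: correct errors that match a known failure pattern, and drop outputs that can't be fixed so the mistake never propagates downstream. The rules below are hypothetical placeholders; the paper learns its correction patterns from past failures.

```python
# Sketch of test-time rectify-or-reject filtering (hypothetical rules:
# a trailing "?" marks an unfixable output, a missing "." a fixable one).
def rectify_or_reject(outputs: list[str]) -> list[str]:
    kept = []
    for out in outputs:
        if out.endswith("?"):        # unfixable: the agent returned a question
            continue                  # reject so the error cannot spread
        if not out.endswith("."):     # fixable formatting error
            out = out + "."           # rectify using a known failure pattern
        kept.append(out)
    return kept

print(rectify_or_reject(["x = 4.", "x = 4", "is x 4?"]))
```

Because the filter runs purely at inference time, it composes with any multi-agent pipeline without touching the underlying model weights.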

A Model-Free Universal AI

Feb 26, 2026

Yegon Kim, Juho Lee

You don't need to model the environment to build an optimal AI agent—learning action values directly can be just as powerful.

This paper introduces AIQI, the first AI agent that learns optimal behavior without building an explicit model of its environment. Instead of predicting how the world works, it directly learns which actions produce the best outcomes. This is a theoretical breakthrough showing that model-free approaches can match the performance of model-based agents in general reinforcement learning.

reasoning · training · agents

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Feb 26, 2026

Zhou Xu, Bowen Zhou, Qi Wang et al.

You can make GUI agents 3x faster by intelligently pruning screenshots and history instead of compressing everything uniformly.

This paper solves a major speed problem for AI agents that control computer screens by smartly removing unnecessary information from screenshots and action history. Instead of treating all parts of an image equally, it keeps important interactive elements while discarding redundant details, achieving 3.3x faster processing with minimal accuracy loss.

efficiency · agents · evaluation
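Non-uniform pruning of this kind reduces to a simple operation: score each token, keep the top-k, preserve the original order. The importance scores below are made-up stand-ins; the paper derives them from interactivity and redundancy signals in the screenshot and action history.

```python
# Toy importance-based token pruning: keep the `keep` highest-scoring
# tokens while preserving their original spatial/temporal order.
def prune(tokens: list[str], scores: list[float], keep: int) -> list[str]:
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]

patches = ["background", "button:OK", "whitespace", "textbox", "border"]
scores  = [0.1, 0.9, 0.05, 0.8, 0.2]
print(prune(patches, scores, keep=2))   # interactive elements survive pruning
```

The contrast with uniform compression is visible even in the toy: dropping 60% of the patches costs nothing here because the discarded ones carry no interactive content.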

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

Feb 26, 2026

Aishik Sanyal

Adding emotional feedback to AI agents makes them more stable and deliberate, not just more human-like—a practical insight for agent builders.

This paper builds an AI agent called ReCoN-Ipsundrum that adds memory loops and emotional signals to test whether machines can show consciousness-like behaviors.

agents · architecture · reasoning

Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language

Feb 26, 2026

Max S. Bennett, Thomas P. Zollo, Richard Zemel

You can now control what AI models learn and remember by giving them natural language instructions, making them adaptable to changing priorities.

This paper introduces a neural memory system that lets you tell an AI model what to remember and what to ignore using natural language instructions.

training · agents

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Feb 26, 2026

Elzo Brito dos Santos Filho

Separate agent planning from execution: agents output intentions, a deterministic system executes them and logs everything, preventing state loss and making every change traceable.

This paper solves a critical problem with AI agents: they lose track of what they're doing over long tasks and can't reliably execute code changes. ESAA is an architecture that separates what an agent *intends* to do from what actually *happens* in your codebase.

agents · architecture · applications
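The event-sourcing split ESAA describes can be sketched in a few lines: agents only append intent events to a log, and a deterministic executor folds that log into the current state, so the state is always recoverable and every change is auditable. The event names and state shape below illustrate the pattern, not ESAA's actual schema.

```python
# Event-sourcing sketch: agents record intentions; a deterministic
# replay of the log yields the authoritative state.
log: list[tuple[str, str, str]] = []

def intend(event: str, path: str, content: str) -> None:
    log.append((event, path, content))   # agents only append intentions

def replay(events) -> dict[str, str]:
    state: dict[str, str] = {}
    for event, path, content in events:  # deterministic fold over the log
        if event == "write":
            state[path] = content
        elif event == "delete":
            state.pop(path, None)
    return state

intend("write", "app.py", "v1")
intend("write", "app.py", "v2")
intend("delete", "tmp.txt", "")
print(replay(log))   # state is always recoverable from the log
```

Because the log is append-only, a crashed or confused agent loses nothing: replaying the events reconstructs exactly where the task stood.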