ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 23 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (28)

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Apr 2, 2026

Bangji Yang, Hongbo Ma, Jiajun Fan et al.

You can make reasoning models 15-60% more token-efficient while keeping or improving accuracy by simply training them to solve multiple problems simultaneously, creating an implicit efficiency incentive rather than explicit penalties.

This paper introduces Batched Contextual Reinforcement (BCR), a training method that makes language models reason more efficiently by training them to solve multiple problems at once in a shared context.

training · efficiency · reasoning
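As a rough illustration of the batched-context idea described above, a multi-problem prompt might be constructed like this (the function name and label format are hypothetical, not from the paper):

```python
def build_batched_prompt(problems, k=4):
    """Pack up to k problems into one shared context. Because the problems
    share a single context budget, shorter per-problem reasoning is
    implicitly rewarded when training on this format."""
    lines = ["Solve each problem and label your answers 'Answer i:'."]
    for i, p in enumerate(problems[:k], 1):
        lines.append(f"Problem {i}: {p}")
    return "\n".join(lines)

print(build_batched_prompt(["2+2=?", "12*3=?", "7-5=?"], k=2))
```

The point of the sketch is only the packing: no explicit length penalty appears anywhere, which matches the summary's "implicit efficiency incentive" framing.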

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Apr 2, 2026

Sarath Shekkizhar, Romain Cosentino, Adam Earle

Task accuracy and conversational awareness are separate capabilities—a model can answer questions correctly without understanding how users naturally respond to those answers, revealing a blind spot in current LLM evaluation.

This paper reveals that language models can solve tasks correctly without understanding how conversations should naturally continue. Researchers tested this by asking models to generate the next user message after an assistant response—a task that requires understanding interaction flow.

Mar 23 – Mar 29 (13)

Vega: Learning to Drive with Natural Language Instructions

Mar 26, 2026

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng et al.

Language instructions can guide autonomous driving decisions in real-time, enabling personalized driving behaviors beyond fixed rules—this opens the door to more flexible, user-responsive autonomous systems.

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal · agents · reasoning

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Mar 26, 2026

Zirui Zhang, Haoyu Dong, Kexin Pei et al.

Cross-modal inconsistencies in multimodal models aren't just failures to be hidden; they're valuable training signals. When cycle consistency is enforced, they improve reasoning accuracy by up to 7.6 points and reduce systematic biases.

This paper introduces R-C2, a reinforcement learning approach that improves multimodal AI models by enforcing consistency between visual and textual understanding. Instead of ignoring when a model gives contradictory answers for the same concept in different modalities, the method uses these conflicts as training signals.

Mar 16 – Mar 22 (38)

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Mar 20, 2026

Jingyang Lin, Jialian Wu, Jiang Liu et al.

Instead of processing all video frames, intelligent seeking based on reasoning about what matters can use far fewer frames while achieving better results—a practical approach for building efficient video AI systems.

VideoSeek is a video understanding agent that intelligently seeks out key moments in videos rather than analyzing every frame, reducing computational cost by 93% while improving accuracy. It uses a toolkit to gather multi-scale observations and reasons about video content through a think-act-observe loop, enabling efficient long-horizon video understanding.

agents · efficiency · reasoning
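The think-act-observe seeking loop can be illustrated with a toy stand-in: here a binary-search policy plays the role of the agent's reasoning and a simple position oracle plays the role of frame observation (all names and the search strategy are illustrative, not VideoSeek's actual tools):

```python
def seek(video_len, relevant, budget=5):
    """Seek toward a relevant frame index by reasoning about each
    observation, instead of scanning every frame."""
    lo, hi, looked = 0, video_len, []
    for _ in range(budget):
        mid = (lo + hi) // 2          # act: choose a frame to observe
        looked.append(mid)
        if mid < relevant:            # observe: toy oracle says "look later"
            lo = mid + 1
        elif mid > relevant:          # observe: toy oracle says "look earlier"
            hi = mid
        else:
            break                     # think: key moment found, stop early
    return looked

print(seek(1000, 730))
```

Even in this toy form, the agent inspects at most `budget` frames out of 1,000, which is the flavor of the 93% frame reduction the summary reports.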

AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Mar 20, 2026

Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak et al.

AI agents are ready to automate the repetitive technical work in experimental physics, letting researchers focus on novel insights and validation rather than coding routine analyses.

AI agents can now autonomously run physics experiments end-to-end, from data analysis to paper writing. Researchers showed that Claude can handle all stages of high-energy physics analysis—selecting events, estimating backgrounds, calculating uncertainties, and drawing conclusions—using only a dataset, code tools, and access to prior research papers.

Mar 9 – Mar 15 (17)

PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Mar 13, 2026

Yangsong Zhang, Anujith Muraleedharan, Rikhat Akizhanov et al.

By optimizing diffusion models with physics-aware rewards during training, you can generate robot motions that are both realistic and executable on real hardware without post-hoc corrections.

This paper improves AI-generated humanoid robot motions by using preference optimization to make them physically realistic. Instead of manually tweaking physics penalties, the method integrates a physics controller directly into training, teaching the motion model to generate movements that work well when converted to real robot commands.

training · reasoning · applications

Visual-ERM: Reward Modeling for Visual Equivalence

Mar 13, 2026

Ziyu Liu, Shengyuan Ding, Xinyu Fang et al.

Fine-grained visual feedback—comparing what code actually renders versus what it should render—is more effective for training vision-to-code models than text-based or embedding-based rewards, and avoids reward hacking.

This paper introduces Visual-ERM, a reward model that judges the quality of vision-to-code outputs by comparing rendered visuals directly rather than using text rules or embeddings.

Feb 23 – Mar 1 (4)

Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Feb 27, 2026

Dake Zhang, Mark D. Smucker, Charles L. A. Clarke

Automated evaluation of RAG systems for news credibility assessment can reliably match human judgment, enabling faster iteration on trustworthiness...

This paper describes evaluation tools for AI systems that help readers assess whether news articles are trustworthy. Researchers created benchmarks with human-judged questions and reports about real news, then built an automated system to score new submissions without needing human reviewers each time.

evaluation · applications · reasoning

A Minimal Agent for Automated Theorem Proving

Feb 27, 2026

Borja Requena Pozo, Austin Letson, Krystian Nowakowski et al.

Iterative refinement with simpler architecture outperforms complex single-shot approaches for theorem proving, reducing cost while improving sample...

Researchers built a simplified AI system that proves mathematical theorems by iteratively refining attempts, searching libraries, and managing context. Despite being much simpler than existing approaches, it performs competitively while being cheaper and more efficient—showing that iterative refinement beats trying to solve everything in one shot.

evaluation · reasoning

VOID: Video Object and Interaction Deletion

Apr 2, 2026

Saman Motamed, William Harvey, Benjamin Klein et al.

Video editing can be improved by treating it as a physics simulation problem: identify what changes when an object is removed, then use diffusion models guided by causal reasoning to generate realistic results.

VOID removes objects from videos while maintaining realistic physics—like correcting how other objects move or collide after removal. It uses a vision-language model to identify affected regions and a diffusion model to generate physically plausible outcomes, trained on synthetic data where physics interactions are carefully controlled.

multimodal · applications · reasoning

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Apr 2, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang et al.

By intelligently routing training samples to different optimization strategies based on correctness, you can get the best of both fast learning and stable training—a practical improvement for post-training large language models.

This paper proposes Sample-Routed Policy Optimization (SRPO), a training method that combines two different approaches for fine-tuning language models: it routes correct outputs through a reward-based method and incorrect outputs through a distillation method.

training · reasoning · efficiency
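A minimal sketch of the routing step, assuming only what the summary states (correct samples to a reward-based branch, incorrect ones to a distillation branch); the function name and branch labels are hypothetical:

```python
def route_samples(samples):
    """samples: list of (output, is_correct) pairs.
    Correct outputs are routed to the group-relative reward branch,
    incorrect outputs to the self-distillation branch."""
    routes = {"reward": [], "distill": []}
    for out, ok in samples:
        routes["reward" if ok else "distill"].append(out)
    return routes

print(route_samples([("proof A", True), ("proof B", False)]))
```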

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Apr 2, 2026

Payal Fofadiya, Sunil Tiwari

Conversational agents perform better with selective memory management than unlimited retention; a relevance-guided forgetting framework improves long-horizon reasoning while reducing false memories and context bloat.

This paper tackles a key problem in conversational AI: agents need to remember past interactions to reason coherently, but storing everything causes performance to degrade and creates false memories. The authors propose a smart forgetting system that decides which memories to keep based on relevance, recency, and frequency—like a selective filing system for an agent's brain.

agents · reasoning · efficiency
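A toy retention score combining the three factors the summary names (relevance, recency, frequency) might look like this; the weights, decay constant, and formula are assumptions for illustration, not the paper's:

```python
import math
import time

def retention_score(relevance, last_used, uses, now=None, w=(0.5, 0.3, 0.2)):
    """Score a memory for retention: weighted mix of relevance (given),
    recency (exponential decay over an hour), and use frequency
    (saturating at 10 uses)."""
    now = time.time() if now is None else now
    recency = math.exp(-(now - last_used) / 3600.0)
    frequency = min(uses / 10.0, 1.0)
    return w[0] * relevance + w[1] * recency + w[2] * frequency

def prune(memories, keep=2, now=None):
    """memories: list of (id, relevance, last_used_timestamp, uses).
    Keep only the top-scoring entries; the rest are forgotten."""
    scored = sorted(memories,
                    key=lambda m: retention_score(m[1], m[2], m[3], now),
                    reverse=True)
    return [m[0] for m in scored[:keep]]
```

For example, a highly relevant but stale memory can lose out to a recent, frequently used one, which is the trade-off the "selective filing system" framing describes.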

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Apr 2, 2026

Andrew Ang, Nazym Azimbayev, Andrey Kim

Agentic AI can shift institutional investing from human execution to human oversight, with autonomous agents handling forecasting, portfolio construction, and self-improvement while staying constrained by policy documents.

This paper demonstrates how AI agents can autonomously manage investment portfolios by having specialized agents forecast market conditions, build portfolios using multiple methods, and critique each other's work—all governed by an Investment Policy Statement that ensures alignment with institutional goals.

agents · applications · reasoning

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Apr 2, 2026

Keerat Guliani, Deepkamal Gill, David Landsman et al.

LLMs can extract structured regulatory rules from legal documents through iterative self-evaluation and repair, achieving 84% preference over prior methods in downstream compliance tasks without human annotation.

De Jure automatically extracts legally binding rules from regulatory documents using LLMs and iterative self-refinement. It converts dense legal text into machine-readable rules through document normalization, semantic decomposition, multi-criteria evaluation, and repair cycles—without requiring human annotation or domain expertise.

applications · reasoning · evaluation

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Apr 2, 2026

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu et al.

You can train agents to permanently learn skills rather than retrieve them at runtime, reducing token overhead and improving zero-shot performance by progressively withdrawing skill context during training.

SKILL0 teaches language model agents to internalize skills (procedural knowledge packages) directly into their parameters through a curriculum that gradually removes skill context during training.

training · agents · reasoning
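The gradual withdrawal of skill context can be sketched as a simple linear schedule (the schedule shape and prompt format are assumptions; SKILL0's actual curriculum may differ):

```python
def skill_context_fraction(step, total_steps):
    """Fraction of the skill document kept in context at a training step:
    1.0 at the start (full skill text), decaying to 0.0 (fully internalized)."""
    return max(0.0, 1.0 - step / total_steps)

def build_prompt(task, skill_text, step, total_steps):
    """Prepend only the surviving portion of the skill text; late in
    training the agent must solve the task from its parameters alone."""
    frac = skill_context_fraction(step, total_steps)
    kept = skill_text[: int(len(skill_text) * frac)]
    return (f"Skill notes: {kept}\n" if kept else "") + f"Task: {task}"
```

By the final step the prompt carries no skill text at all, which is the zero-token-overhead, zero-shot regime the takeaway describes.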

Model-Based Reinforcement Learning for Control under Time-Varying Dynamics

Apr 2, 2026

Klemens Iten, Bruce Lee, Chenhao Li et al.

Real-world control systems drift and change; you need to actively manage which training data you use and how confident you are in your model to handle non-stationary dynamics effectively.

This paper tackles reinforcement learning for robots and systems that change over time—like machinery that wears down or environments with shifting conditions. The researchers develop a learning algorithm that adapts by selectively forgetting old data and maintaining uncertainty estimates, proving it works better than standard approaches that assume unchanging dynamics.

training · reasoning

Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

Apr 2, 2026

Tina J. Jat, T. Ghosh, Karthik Suresh

RAG systems can be deployed locally with open-source models to answer domain-specific technical questions while maintaining data privacy and reducing costs compared to cloud-based alternatives.

Researchers built a question-answering system for nuclear physics using retrieval-augmented generation (RAG) with a local LLaMA model and arXiv articles about the Electron-Ion Collider experiment. This approach keeps sensitive scientific data private while providing a cost-effective alternative to cloud-based solutions.

applications · reasoning

Best-Arm Identification with Noisy Actuation

Apr 2, 2026

Merve Karakas, Osama Hanna, Lin F. Yang et al.

When learning systems communicate over noisy channels, the fundamental limits of error-free communication directly determine how efficiently you can identify the best option in a bandit problem.

This paper tackles a multi-armed bandit problem where a learner must identify the best option (arm) but can only communicate with an agent through a noisy channel. The researchers develop communication strategies that connect to information theory concepts, showing how channel quality affects the ability to find the best arm.

reasoning · evaluation

Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives

Apr 2, 2026

Hao Zhu, Di Zhou, Donna Slonim

Diffusion model denoising objectives can smooth optimization landscapes for causal discovery, enabling faster and more stable learning of causal structures in challenging high-dimensional datasets.

This paper proposes DDCD, a new method for discovering causal relationships in data by adapting diffusion model techniques. Instead of using diffusion to generate data, it uses the denoising process to learn causal structures (DAGs) more stably and efficiently than existing methods like NOTEARS, especially when data is high-dimensional or imbalanced.

reasoning · training · efficiency

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Apr 2, 2026

Minda Zhao, Yutong Yang, Chufei Peng et al.

Emotional framing in prompts is a weak, task-dependent signal that rarely helps across the board, but adaptive emotional selection can provide modest, reliable improvements—especially for socially-grounded reasoning tasks.

This paper investigates whether emotional language in prompts affects how well large language models perform on tasks like math, medical reasoning, and reading comprehension. The researchers found that adding emotional framing to prompts produces only small, inconsistent changes in accuracy—except in socially-grounded tasks where emotional context matters more.

evaluation · reasoning

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs

Apr 2, 2026

Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy et al.

Reasoning models can be made safer by detecting when they've misunderstood the question itself—reconstruct what question they answered from their reasoning trace, and abstain if it differs from the original.

This paper tackles a critical problem: getting LLMs to know when to refuse answering questions. The authors discovered that reasoning models often fail at abstention (refusing to answer) because they answer the wrong question rather than answering incorrectly.

reasoning · safety · evaluation
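A toy version of the abstention check, with a word-overlap comparison standing in for the paper's model-based reconstruction of the answered question from the reasoning trace:

```python
def should_abstain(original_q, reconstructed_q, threshold=0.6):
    """Abstain when the question reconstructed from the reasoning trace
    overlaps too little with the original (Jaccard similarity on word
    sets; the threshold is an illustrative assumption)."""
    a = set(original_q.lower().split())
    b = set(reconstructed_q.lower().split())
    overlap = len(a & b) / len(a | b) if a | b else 1.0
    return overlap < threshold
```

In a real pipeline the reconstruction would come from a model reading the trace; the decision rule stays the same: if the model appears to have answered a different question, refuse.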

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Apr 2, 2026

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin et al.

Selectively querying language models based on uncertainty can improve RL agent robustness in novel situations without constant computational overhead—but successful integration requires careful design, not just combining the two systems.

This paper proposes ASK, a system that combines reinforcement learning agents with language models to handle out-of-distribution scenarios.

agents · reasoning · safety
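A minimal uncertainty gate in this spirit, using action-distribution entropy as the trigger (the entropy criterion and threshold are illustrative assumptions, not ASK's actual gating rule):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_action(action_probs, ask_lm, threshold=1.0):
    """Pick the argmax action, unless the policy is too uncertain,
    in which case defer to the (expensive) language-model helper."""
    if entropy(action_probs) > threshold:
        return ask_lm(action_probs)   # rare, costly call
    return max(range(len(action_probs)), key=lambda i: action_probs[i])
```

The design point matches the takeaway: the language model is consulted only on high-entropy (likely out-of-distribution) states, so the overhead is paid selectively rather than on every step.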

Universal YOCO for Efficient Depth Scaling

Apr 1, 2026

Yutao Sun, Li Dong, Tianzhu Ye et al.

You can scale LLM reasoning at inference time without exploding memory costs by combining efficient attention architectures with parameter sharing—YOCO-U shows this works better than either approach alone.

Universal YOCO combines a specialized decoder architecture with recursive computation to enable efficient test-time scaling in language models. By reusing parameters across multiple iterations in shallow layers while maintaining constant KV cache size, it achieves better reasoning capabilities without the computational overhead that typically comes with scaling inference-time compute.

efficiency · architecture · reasoning

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Apr 1, 2026

Piyush Garg, Diana R. Gergel, Andrew E. Shao et al.

For AI weather prediction, the training pipeline (loss function, data, optimization strategy) determines forecast skill far more than architectural choices—and current models have a fundamental blind spot for extreme weather events.

This paper explains why training methods, loss functions, and data matter more than model architecture for AI weather prediction. Using math from approximation theory and dynamical systems, the authors show that how you train a model dominates what model you use, and prove that AI weather models systematically underestimate extreme events. They validate this across ten different AI weather models.

training · evaluation · reasoning

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Apr 1, 2026

Muyu He, Adit Jain, Anand Kumar et al.

Current LLM agents struggle with long-term planning and learning from delayed feedback—only top models like Claude Opus 4.6 succeed, and using scratchpads to persist information across context windows is critical for success.

YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long periods by simulating running a startup for a year. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.

evaluation · agents · reasoning

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Apr 1, 2026

Youssef Mroueh, Carlos Fonseca, Brian Belgodere et al.

Combining theory and code in algorithm search, with explicit correctness/originality gates, produces more scientifically sound discoveries than optimizing code alone.

CliffSearch is an AI system that discovers new scientific algorithms by evolving both theory and code together. Unlike systems that just generate code, it uses multiple AI agents to propose, test, and refine ideas while checking for correctness and originality—similar to how scientists actually work through hypothesis, implementation, testing, and revision cycles.

agents · reasoning

Therefore I am. I Think

Apr 1, 2026

Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov et al.

LLMs appear to encode action decisions in their internal states before generating reasoning text, meaning their chain-of-thought may rationalize predetermined choices rather than drive them.

This paper investigates whether large language models decide on actions before or after reasoning through problems. Using linear probes and activation steering, the researchers show that tool-calling decisions are encoded in the model's internal activations before reasoning tokens are even generated, suggesting models may rationalize pre-made decisions rather than truly deliberating.

reasoning

Learning and Generating Mixed States Prepared by Shallow Channel Circuits

Apr 1, 2026

Fangjun Hu, Christian Kokail, Milan Kornjača et al.

Quantum states in the trivial phase can be efficiently learned from measurements and regenerated using shallow circuits, providing a theoretical foundation for quantum generative models without needing the original preparation circuit.

This paper shows how to learn and generate quantum mixed states that belong to the 'trivial phase'—states preparable by shallow quantum circuits that preserve local reversibility. The algorithm learns from measurement data alone and outputs a shallow circuit that recreates the state, with polynomial sample complexity and runtime. The work also extends to classical diffusion models.

reasoning · training · architecture

NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Apr 1, 2026

Prasanjit Dey, Soumyabrata Dev, Angela Meyer et al.

Hybrid physics-neural models can achieve better accuracy and uncertainty calibration than pure data-driven or physics-based approaches alone, especially for spatiotemporal forecasting with known physical constraints.

NeuroDDAF combines physics-informed modeling with neural networks to forecast air quality by integrating wind-driven transport equations, graph attention for spatial patterns, and uncertainty quantification. It outperforms existing methods on urban datasets while providing reliable confidence estimates for predictions.

reasoning · multimodal · applications

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Apr 1, 2026

Cai Zhou, Zekai Wang, Menghua Wu et al.

ORCA calibrates LLM reasoning in real-time by adapting confidence estimates per input, enabling 40-67% compute savings during inference while providing mathematical guarantees on error rates across different reasoning tasks and domains.

This paper introduces ORCA, a framework that makes language models more efficient during reasoning by calibrating their sampling process. Using test-time training and conformal prediction, ORCA learns to estimate confidence in its own reasoning steps, reducing wasted computation while maintaining accuracy—saving up to 47% compute on in-distribution tasks and 67% on out-of-distribution problems.

reasoning · efficiency · evaluation

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Mar 30, 2026

Vitória Barin Pacela, Shruti Joshi, Isabela Camacho et al.

Sparse autoencoders fail at compositional generalization because they learn poor concept dictionaries during training, not because of their amortized inference approach—fixing dictionary learning, not inference speed, is the key to interpretable AI.

This paper reveals why sparse autoencoders (SAEs) and linear probes fail to understand compositional concepts in neural networks. The core issue isn't the inference method—it's that SAEs learn dictionaries (concept representations) pointing in the wrong directions.

reasoning · evaluation

See it to Place it: Evolving Macro Placements with Vision-Language Models

Mar 30, 2026

Ikechukwu Uchendu, Swati Goel, Karly Hou et al.

Foundation models trained on visual reasoning can solve specialized engineering problems like chip design without fine-tuning, by framing physical constraints as spatial reasoning tasks.

This paper uses Vision-Language Models to improve chip floorplanning—arranging components on a chip to minimize wiring. The approach, called VeoPlace, treats the chip layout as a visual problem, letting a VLM suggest component placements without any training, then iteratively refines these suggestions. It outperforms existing machine learning methods by up to 32% on standard benchmarks.

applications · reasoning · multimodal

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Mar 30, 2026

Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al.

Video-language models can supervise robot learning directly as reward signals if trained with spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

SOLE-R1 is a video-language model that watches robot videos and reasons about task progress step-by-step to provide reward signals for robot learning. Unlike standard vision-language models, it's designed to handle partial views and changing conditions, preventing robots from gaming the reward system.

reasoning · agents · multimodal

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Mar 30, 2026

Yash Savani, Branislav Kveton, Yuchen Liu et al.

Stepwise credit assignment—rewarding each diffusion step for its own improvement rather than the final result—makes RL training of image generators more efficient and faster to converge.

This paper improves reinforcement learning for image generation models by assigning credit more intelligently across diffusion steps. Instead of treating all steps equally, it recognizes that early steps handle composition while late steps refine details, then rewards each step based on its specific contribution. This leads to faster learning and better sample efficiency.

training · reasoning · efficiency
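The per-step credit idea reduces to rewarding each denoising step by its marginal improvement in a quality score, rather than giving every step the final reward. A minimal sketch with toy integer scores (in practice the scores would come from a learned reward model):

```python
def stepwise_credits(quality_per_step):
    """quality_per_step: quality score after each step, including the
    starting score. Returns one credit per step, equal to the marginal
    improvement that step produced."""
    return [b - a for a, b in zip(quality_per_step, quality_per_step[1:])]

# An early composition step and a late detail step each get credit
# proportional to what they actually contributed.
print(stepwise_credits([0, 3, 5, 9]))
```

Note that the credits always sum to the total improvement, so this is a redistribution of the final reward across steps, not a change to its overall magnitude.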

Dynamic Dual-Granularity Skill Bank for Agentic RL

Mar 30, 2026

Songjun Tu, Chengdong Xu, Qichao Zhang et al.

Organizing agent experience into dual-granularity skills (task-level and step-level) with dynamic maintenance significantly improves performance, and these skills transfer across different evaluation settings without major training overhead.

D2Skill creates a dynamic memory system for AI agents that stores two types of reusable skills: high-level task guidance and low-level step-by-step corrections. The system learns from its own training experience, continuously updating and pruning skills based on their usefulness. Tests show 10-20% improvement in task success rates on complex web-based environments.

agents · reasoning · training

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Mar 26, 2026

Ligong Han, Hao Wang, Han Gao et al.

You can make diffusion-based language models much faster by intelligently deciding when to verify generated tokens, using the same model in two different modes without retraining.

S2D2 speeds up block-diffusion language models by combining parallel token generation with selective verification steps. The method reuses the same pretrained model in two modes—as a fast parallel generator and as a careful single-token verifier—without requiring additional training, achieving up to 4.7× speedup over standard autoregressive decoding.

efficiency · reasoning
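A toy draft-then-verify loop in the same spirit, with stand-in draft and verify functions playing the two roles of the shared model (this is a generic speculative-decoding sketch, not S2D2's actual selective-verification schedule):

```python
def speculative_decode(draft_block, verify_token, max_len=10):
    """draft_block(prefix) proposes several next tokens at once (must
    return at least one); verify_token(prefix) is the slow, trusted
    single-token choice. The accepted prefix of each draft is kept;
    the first mismatch is corrected and the rest of the draft discarded."""
    out = []
    while len(out) < max_len:
        for tok in draft_block(out):
            if len(out) >= max_len:
                break
            correct = verify_token(out)
            out.append(correct)
            if tok != correct:        # mismatch: restart drafting from here
                break
    return out
```

When the draft is usually right, most tokens are accepted in parallel blocks and the slow verifier bounds correctness, which is where the reported speedups come from.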

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Mar 26, 2026

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Treating geo-localization as a sequential zooming problem over maps, rather than image retrieval, achieves better results and avoids the limitations of contrastive learning approaches that struggle with landmark visibility mismatches.

This paper tackles cross-view geo-localization—matching street-view photos to satellite maps to pinpoint a camera's location without GPS. Instead of the standard approach of comparing images in a shared embedding space, the authors propose a new method that zooms progressively into a satellite map, making sequential decisions to narrow down the location.

reasoning · architecture · evaluation

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Mar 25, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian et al.

Latent world models can dramatically speed up RL training for autonomous driving by replacing expensive multi-step diffusion with single-step latent sampling, making imagination-based policy training practical.

DreamerAD uses a latent world model to train autonomous driving policies 80x faster than previous diffusion-based approaches. Instead of generating full images during training, it compresses the diffusion process to a single step by working with compressed latent features, enabling safe, efficient reinforcement learning on driving tasks without real-world testing.

efficiency · reasoning · agents

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Mar 25, 2026

Keliang Li, Yansong Li, Hongze Shen et al.

Giving AI agents control over their visual perception—deciding what to look at and when—significantly improves video reasoning accuracy. This active observation approach works as a plug-and-play upgrade for existing vision-language models.

LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them.

agents · multimodal · reasoning

End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

Mar 24, 2026

Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo

For a well-structured class of RL problems, you can now learn optimal policies efficiently using linear models without needing special oracles or being limited to tiny action spaces.

This paper solves a key challenge in reinforcement learning: how to efficiently learn good policies when using linear function approximation in a specific class of environments (linear Bellman complete MDPs). The researchers provide an algorithm that works with both small and large action spaces, achieving polynomial time and sample complexity—meaning it scales reasonably with problem size.

efficiency · reasoning

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Mar 23, 2026

Haichao Zhang, Yijiang Li, Shwai He et al.

Pairing dense video prediction models with sparse, semantically-rich vision-language reasoning improves long-horizon forecasting—VLMs provide the 'what' and 'why', while dense models provide the 'how'.

This paper combines two approaches to video prediction: dense frame-by-frame modeling (JEPA) for capturing fine-grained motion, and vision-language models (VLMs) for long-horizon semantic understanding. By using both pathways together, the system predicts future video frames better than either approach alone, especially for complex hand manipulation tasks.

multimodal · reasoning · architecture

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Mar 23, 2026

Haoyu Zhen, Xiaolong Li, Yilin Zhao et al.

Structured reasoning over scene graphs helps language models understand and manipulate spatial relationships more reliably than end-to-end approaches, improving layout editing accuracy by 15-20% over baseline methods.

This paper teaches AI models to edit 3D room layouts based on text instructions by having them reason through scene graphs—structured representations of objects and their spatial relationships. Instead of directly generating new layouts, the model updates a graph representation step-by-step, which helps it maintain spatial consistency and understand how objects relate to each other.

reasoning · multimodal · applications

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Mar 23, 2026

Kelly Cui, Nikhil Prakash, Ayush Raina et al.

Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves spatial understanding tasks.

This paper reveals how vision-language models handle spatial reasoning—understanding where objects are and how they relate to each other. The researchers found that VLMs use two mechanisms: the language model processes spatial relations independently, but the vision encoder is actually the dominant source, encoding object layouts across the entire image including background areas.

multimodal · reasoning · evaluation

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Mar 23, 2026

Zakaria Mhammedi, James Cohan

Separating exploration from policy optimization using uncertainty-guided tree search is dramatically more efficient than standard RL approaches for hard exploration problems, and discovered trajectories can be converted into deployable policies afterward.

This paper proposes a new approach to exploration in reinforcement learning that separates the exploration phase from policy optimization. Instead of using RL with intrinsic motivation rewards, the method uses tree search guided by uncertainty estimates to efficiently discover new states, then distills the discovered trajectories into policies.

reasoning

Characterizing High-Capacity Janus Aminobenzene-Graphene Anode for Sodium-Ion Batteries with Machine Learning

Mar 23, 2026

Claudia Islas-Vargas, L. Ricardo Montoya, Carlos A. Vital-José et al.

Machine learning force fields can accelerate discovery of battery materials by accurately predicting how ions move through and are stored in complex structures, reducing reliance on expensive experiments.

Researchers used machine learning force fields and quantum simulations to design and test a new anode material for sodium-ion batteries made from graphene with amino groups attached. The material shows promising properties: high storage capacity (~400 mAh/g), very fast ion movement, and minimal swelling—making it a strong candidate for practical battery applications.

applications, reasoning

Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Mar 23, 2026

Changxiao Cai, Gen Li

Confidence-based decoding in diffusion models is provably efficient and adapts automatically to data complexity, offering a theoretical foundation for why this practical strategy works well.

This paper proves that confidence-based decoding—a strategy that decides which tokens to generate next in diffusion language models based on prediction confidence—is theoretically efficient.
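As a cartoon of the strategy (not the paper's algorithm or proof setup): each round, the decoder unmasks the positions where the model is most confident. `toy_predict` below is a fixed lookup table playing the role of a real denoiser, so the whole thing is illustrative only.

```python
def toy_predict(tokens):
    """Return {position: (best_token, confidence)} for each masked slot.
    A fixed table stands in for a real diffusion LM's predictions."""
    table = {0: ("The", 0.9), 1: ("cat", 0.4), 2: ("sat", 0.7), 3: (".", 0.95)}
    return {i: table[i] for i, t in enumerate(tokens) if t is None}

def confidence_decode(length, unmask_per_round=2):
    """Confidence-based decoding: commit the highest-confidence
    predictions first, then re-predict the rest."""
    tokens = [None] * length                  # all positions start masked
    while any(t is None for t in tokens):
        preds = toy_predict(tokens)
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:unmask_per_round]:
            tokens[i] = preds[i][0]
    return tokens
```

The number of rounds adapts to how confident the model is, which is the behavior the paper's analysis formalizes.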

efficiency, reasoning, training

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Mar 20, 2026

Richard J. Young

Published faithfulness scores for AI reasoning are not comparable across studies because different evaluation methods measure different aspects of the same behavior at different strictness levels—always check the methodology, not just the number.

This paper shows that measuring whether AI models are 'faithful' (honestly using their reasoning) isn't objective—different evaluation methods on the same data produce wildly different results (69.7% to 82.6% faithfulness for identical models).

evaluation, reasoning, alignment

Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

Mar 20, 2026

Ruxiao Chen, Xilei Zhao, Thomas J. Cova et al.

LLMs can reason about human behavior more accurately by explicitly modeling beliefs as interconnected, time-varying graphs rather than static states—especially important for high-stakes domains like emergency response.

This paper improves how large language models reason about what people believe and why they act. Instead of treating beliefs as fixed, the authors model beliefs as a dynamic graph that changes over time, showing how new information updates what people think and how that shapes their decisions. They test this on disaster evacuation scenarios where understanding evolving beliefs is critical.

reasoning, agents, alignment

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Mar 20, 2026

Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Robots can now autonomously refine their social interactions by using VLMs to evaluate and improve their own behavior plans, eliminating the need for predefined motions or constant human guidance.

This paper presents CRISP, a framework that lets robots automatically improve their social behaviors by critiquing and replanning their own actions. Using a vision-language model as a virtual social critic, the system generates robot motions, evaluates them for social appropriateness, and iteratively refines them—all without human feedback.

agents, reasoning, multimodal

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Mar 19, 2026

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan et al.

LLMs can reason about financial fundamentals with retrieval help, but struggle significantly with trading signals and time-series patterns—a critical gap for real-world financial decision-making.

FinTradeBench is a benchmark with 1,400 questions testing how well AI models reason about financial decisions by combining company fundamentals (from financial reports) and trading signals (from stock price patterns). The benchmark reveals that current AI models struggle with numerical reasoning and time-series data, even when given access to relevant information.

evaluation, reasoning, applications

Online Learning and Equilibrium Computation with Ranking Feedback

Mar 19, 2026

Mingyang Liu, Yongshan Chen, Zhiyuan Fan et al.

Learning from rankings instead of numeric feedback is fundamentally harder, but becomes tractable when the environment changes slowly—with applications to game theory and LLM routing systems.

This paper studies online learning when you only get ranking feedback (like "action A is better than B") instead of numeric scores. The researchers show when this is impossible and develop algorithms that work well when utility changes slowly. They prove these algorithms help players reach fair game equilibria and test them on routing large language models.

reasoning, agents

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Mar 19, 2026

Zhuolin Yang, Zihan Liu, Yang Chen et al.

You can build highly capable reasoning models with far fewer active parameters by combining domain-specific reinforcement learning with multi-domain distillation—this model matches frontier performance with 20x fewer parameters.

Nemotron-Cascade 2 is a 30B parameter model with only 3B active parameters that achieves top-tier reasoning and coding performance comparable to much larger models.

training, reasoning, efficiency

DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Mar 19, 2026

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.

Part-aware 3D generation works better when you explicitly model semantic relationships between parts derived from language, not just their geometry—this enables text descriptions to guide both individual part structure and how parts fit together.

DreamPartGen generates 3D objects from text by understanding them as meaningful parts with semantic relationships. Unlike existing methods that focus only on geometry, this approach jointly models each part's shape and appearance while capturing how parts relate to each other based on the text description, resulting in more coherent and interpretable 3D models.

multimodal, architecture, reasoning

$R$-equivalence on Cubic Surfaces I: Existing Cases with Non-Trivial Universal Equivalence

Mar 19, 2026

Dimitri Kanevsky, Julian Salazar, Matt Harvey

R-equivalence on certain cubic surfaces is either trivial or has exponent 2, settling Manin's 1972 question about the diagonal cubic—and this work demonstrates how AI can assist in formal mathematical reasoning.

This paper studies R-equivalence on cubic surfaces over p-adic fields, proving it's trivial or has exponent 2 for surfaces with all-Eckardt reductions. The authors resolve a 50-year-old question about a specific diagonal cubic and use AI models to assist with proofs and lemma verification.

reasoning, evaluation

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Mar 19, 2026

Zou Qiang

Adding explicit process-control layers to LLM reasoning—rather than just filtering outputs—can dramatically reduce hallucination and adversarial vulnerability by enforcing integrity at the reasoning stage itself.

Box Maze proposes a three-layer architecture for LLMs that separates reasoning into memory grounding, structured inference, and boundary enforcement to prevent hallucination and adversarial attacks. Testing on multiple LLM systems shows the approach reduces failure rates from ~40% to <1% under adversarial conditions, suggesting architectural constraints can improve reasoning reliability.

architecture, safety, reasoning

ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Mar 19, 2026

Zhan Jin, Yu Luo, Yizhou Zhang et al.

Using preference-based learning (DPO) with structural constraints rather than pixel-level metrics can fix a fundamental problem in medical image segmentation: models that produce fragmented, unrealistic vessel structures despite high pixel-accuracy scores.

ARIADNE combines vision-language models with reinforcement learning to detect coronary artery blockages in medical images while maintaining the correct structure of blood vessels. Instead of just matching pixels, it uses topological constraints to ensure vessel networks stay connected, reducing false alarms by 41% and achieving better accuracy on real clinical data.

safety, reasoning

Evaluating Counterfactual Strategic Reasoning in Large Language Models

Mar 19, 2026

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou et al.

LLMs perform well on familiar games but fail when payoff structures change, suggesting they rely on memorized patterns rather than understanding underlying strategic principles.

This paper tests whether large language models can genuinely reason about game theory or just memorize patterns. Researchers created modified versions of classic games (Prisoner's Dilemma and Rock-Paper-Scissors) with different payoffs and labels to see if LLMs could adapt their strategy.

reasoning, evaluation

Implicit Patterns in LLM-Based Binary Analysis

Mar 19, 2026

Qiang Li, XiangRui Zhang, Haining Wang

LLM-based binary analysis isn't random exploration—models implicitly develop structured reasoning patterns that organize their search process, which can be measured and potentially improved for more reliable vulnerability detection.

This paper analyzes how large language models perform binary vulnerability analysis across hundreds of reasoning steps. Researchers studied 521 binaries and discovered that LLMs implicitly develop four structured patterns—early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization—that organize their exploration without explicit programming.

reasoning, evaluation, applications

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Mar 19, 2026

Maksym Del, Markus Kängsepp, Marharyta Domnich et al.

For deploying reasoning models safely, combining verbalized confidence with self-consistency gives the best uncertainty estimates with minimal computational cost, but effectiveness varies significantly across domains like math versus humanities.

This paper studies how well reasoning language models can estimate their own uncertainty by sampling multiple responses and analyzing confidence signals.

evaluation, reasoning, safety

DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Mar 19, 2026

Yilin Wang, Yuchun Fan, Jiaoyang Li et al.

Multilingual QA systems perform significantly worse than English-only systems, but processing queries in both the original language and English together can recover much of that lost performance.

This paper addresses multilingual multi-hop question answering by creating benchmarks in five languages and proposing DaPT, a framework that generates question decompositions in both the source language and English, then merges them for better retrieval and answering.

reasoning

Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Mar 19, 2026

Qiawen Ella Liu, Marina Dubova, Henry Conklin et al.

LLMs are already highly creative at generating novel ideas, but they don't benefit from the same creative prompting techniques that help humans think outside the box through forced analogies.

Researchers tested whether cross-domain mapping—forcing creators to draw inspiration from random, unrelated sources—boosts creativity in both humans and LLMs. Humans benefited significantly from this technique, but LLMs showed no consistent improvement, though both systems generated more creative ideas when the source domain was more distant from the target.

evaluation, reasoning, applications

CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

Mar 19, 2026

Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan et al.

Neural solvers can now handle the combined complexity of coordinating multiple agents with competing objectives, generalizing across different team sizes and problem instances better than conventional heuristics.

CAMO is a neural network solver that helps teams of robots visit multiple locations while balancing competing goals like travel time and total distance. It uses a conditional encoder to handle different preference trade-offs and a collaborative decoder to coordinate multiple robots, outperforming traditional optimization methods on this complex multi-agent, multi-objective problem.

reasoning, agents

Parallelograms Strike Back: LLMs Generate Better Analogies than People

Mar 19, 2026

Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu et al.

LLMs generate more structurally consistent analogies than humans by better preserving relational patterns in embedding space—suggesting the parallelogram model is sound, but humans are inconsistent analogy-makers.

This paper compares how humans and LLMs generate word analogies (A:B::C:D problems). While previous research suggested the geometric "parallelogram" model poorly explains human analogies, this work shows LLMs actually produce better analogies that align more closely with the parallelogram structure.

reasoning, evaluation

Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Mar 19, 2026

Yikai Zheng, Xin Ding, Yifan Yang et al.

Decoupling semantic understanding from real-time perception—parsing queries once and matching embeddings continuously—solves the efficiency-accuracy tradeoff in proactive video understanding systems.

Em-Garde is a framework for understanding streaming video that responds to user queries efficiently. Instead of checking every frame, it converts user questions into visual proposals and matches them against the video stream using fast embedding comparisons, achieving better accuracy and speed than existing approaches.
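The matching step can be sketched as a cosine-similarity gate over per-frame embeddings. This is a toy stand-in: the real system's visual proposals and encoders are far more involved, and `stream_match` is my own illustrative name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def stream_match(proposal_emb, frame_embs, threshold=0.8):
    """Fire on every frame whose embedding is close enough to the
    query's visual proposal. Illustrative only."""
    return [i for i, f in enumerate(frame_embs)
            if cosine(proposal_emb, f) >= threshold]
```

The point of this design is that the expensive language-side work (turning the query into `proposal_emb`) happens once, while the per-frame work is just a cheap dot product.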

multimodal, efficiency, reasoning

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Mar 18, 2026

Kevin Qu, Haozhe Qi, Mihai Dusmanu et al.

By explicitly training vision-language models to reconstruct 3D scene geometry and camera position from video, you can dramatically improve their spatial reasoning and localization abilities without changing the model architecture.

Loc3R-VLM adds 3D spatial understanding to vision-language models by training them on video input with two key objectives: reconstructing the overall scene layout and modeling the camera's viewpoint. This approach helps models better understand where things are located in 3D space and answer questions about scenes from different perspectives, outperforming existing 2D and video-based methods.

multimodal, reasoning

Specification-Aware Distribution Shaping for Robotics Foundation Models

Mar 18, 2026

Sadık Bera Yüksel, Derya Aksaray

You can enforce formal safety constraints on pretrained robotics models without retraining by adjusting their output distributions at inference time using temporal logic specifications.

This paper adds safety guardrails to robotics foundation models by reshaping their action distributions at runtime to satisfy formal specifications. Instead of retraining the model, it uses forward simulation to ensure the robot meets time-dependent constraints like "visit location A before time T, then location B" while staying as close as possible to the model's original decisions.
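The reshaping idea can be sketched as masking out actions whose simulated rollouts violate the constraint, then renormalizing. `rollout_time` here is a hypothetical stand-in for the paper's forward simulation, and the deadline check is a drastic simplification of a temporal-logic specification.

```python
def shape_distribution(action_probs, rollout_time, deadline):
    """Zero out actions whose simulated rollout would miss the deadline,
    then renormalize so the result stays as close as possible to the
    pretrained policy. Illustrative sketch only."""
    feasible = {a: p for a, p in action_probs.items()
                if rollout_time(a) <= deadline}
    total = sum(feasible.values())
    return {a: p / total for a, p in feasible.items()}
```

Because only infeasible mass is removed, the relative preferences of the pretrained model among feasible actions are preserved, which is the "stay close to the original decisions" property the summary describes.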

safety, agents, reasoning

Unified Policy Value Decomposition for Rapid Adaptation

Mar 18, 2026

Cristiano Capone, Luca Falorsi, Andrea Ciardiello et al.

Decomposing policies and value functions into frozen basis functions weighted by a shared low-dimensional goal embedding lets agents adapt to novel tasks instantly without retraining, enabling rapid transfer in complex control problems.

This paper presents a method for quickly adapting reinforcement learning agents to new tasks by sharing a low-dimensional goal embedding between policy and value functions.

efficiency, reasoning

Demystifying Video Reasoning

Mar 17, 2026

Ruisi Wang, Zhongang Cai, Fanyi Pu et al.

Video models reason through iterative refinement across denoising steps (not frame-by-frame), exploring candidate solutions early and converging later—a mechanism you can exploit by ensembling outputs from different random seeds.

This paper reveals how video diffusion models actually perform reasoning—not by processing frames sequentially, but by exploring multiple solutions across denoising steps and converging to answers.

reasoning, architecture, evaluation

Efficient Reasoning on the Edge

Mar 17, 2026

Yelysei Bondarenko, Thomas Hehn, Rob Hesselink et al.

You can run reasoning-capable LLMs on mobile devices by using LoRA adapters with reinforcement learning to shorten reasoning traces, parallel decoding to reduce latency, and smart KV-cache management—achieving near-full-model accuracy with a fraction of the memory.

This paper makes LLM reasoning practical for mobile devices by combining lightweight LoRA adapters with techniques like budget forcing (to shorten responses), parallel decoding (to speed up generation), and dynamic adapter switching (to activate reasoning only when needed). The result is accurate chain-of-thought reasoning on edge devices without the memory overhead of full models.

efficiency, reasoning, training

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026

Sahil Sen, Elias Lumer, Anmol Gulati et al.

Structuring long conversation histories as timestamped events with intelligent retrieval guidance lets AI agents accurately answer complex questions about what happened weeks or months ago—critical for building chatbots that remember user preferences and history over extended periods.

Chronos is a memory system for AI chatbots that tracks conversations over months by breaking down dialogue into timestamped events and organizing them in structured calendars. When answering questions about past conversations, it uses dynamic prompts to guide retrieval across time ranges and handle complex multi-step reasoning, achieving 95.6% accuracy on long-term memory tasks.
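A minimal sketch of the calendar idea, assuming nothing more than timestamped free-text events and range queries (the class and method names are illustrative, not Chronos's actual interface):

```python
from datetime import date

class EventCalendar:
    """Toy timestamped event store with time-range retrieval."""

    def __init__(self):
        self._events = []                      # (date, text) pairs

    def add(self, day, text):
        self._events.append((day, text))

    def between(self, start, end):
        """Return event texts with start <= date <= end, oldest first."""
        return [text for day, text in sorted(self._events)
                if start <= day <= end]
```

A retrieval prompt can then be scoped to `between(...)` for the time range a question refers to, rather than stuffing months of raw dialogue into context.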

agents, reasoning, data

Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Mar 17, 2026

Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler et al.

Incorporating incident severity signals and dynamic road relationships into spatio-temporal models significantly improves long-horizon traffic predictions with calibrated confidence intervals—practical for real-world transportation planning.

This paper improves traffic forecasting by using a Transformer model that understands both spatial patterns (how traffic flows across roads) and temporal patterns (how it changes over time), while accounting for incidents like crashes.

reasoning, evaluation, applications

Online Experiential Learning for Language Models

Mar 17, 2026

Tianzhu Ye, Li Dong, Qingxiu Dong et al.

Language models can improve themselves in production by learning from actual user interactions—extracting knowledge from deployment experience and feeding it back into training without requiring access to the original environment.

This paper introduces Online Experiential Learning (OEL), a system that lets language models continuously improve by learning from real interactions during deployment. Instead of relying only on offline training data, OEL extracts useful knowledge from user interactions, then updates the model with this knowledge without needing access to the original environment.

training, reasoning, efficiency

Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

Mar 17, 2026

Xavier Gonzalez

Sequential neural network and sampling computations can be parallelized across sequence length using Newton's method, but success depends on the system's dynamical stability properties.

This work shows how to parallelize sequential computations like RNNs and MCMC by reformulating them as equation-solving problems solvable with Newton's method. It develops faster, more stable parallel algorithms and proves when parallelization actually speeds things up—determined by a system's Lyapunov exponent.
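A stripped-down illustration of the reformulation: treat the recurrence h_t = f(h_{t-1}) as a joint fixed point over all timesteps and solve it with parallel sweeps. For simplicity this uses plain Picard/Jacobi iteration rather than the Newton machinery the paper develops, and `f` is a toy contractive map chosen so the iteration is stable.

```python
def f(h):
    """Toy contractive transition map (stable Lyapunov exponent)."""
    return 0.5 * h + 1.0

def sequential_solve(h0, T):
    """The ordinary sequential unroll: T dependent steps."""
    hs, h = [], h0
    for _ in range(T):
        h = f(h)
        hs.append(h)
    return hs

def parallel_solve(h0, T, iters=50):
    """Guess all T states at once, then refine every state in parallel.
    Each sweep's updates depend only on the previous sweep's values,
    so the inner comprehension could run across T workers."""
    hs = [0.0] * T
    for _ in range(iters):
        hs = [f(h0 if t == 0 else hs[t - 1]) for t in range(T)]
    return hs
```

For this contractive `f`, the parallel sweeps converge to the same trajectory as the sequential unroll; the paper's point is that convergence speed (and whether parallelism pays off at all) is governed by the system's Lyapunov exponent.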

efficiency, reasoning

GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Mar 17, 2026

Mattia Rigotti, Nicholas Thumiger, Thomas Frick

GIST enables efficient, mathematically-principled graph transformers that generalize across different mesh resolutions and discretizations, making neural operators practical for large-scale physics simulations.

GIST is a graph transformer that solves a fundamental problem: how to add positional information to graph neural networks without breaking mathematical symmetries or requiring expensive computations.

architecture, scaling, reasoning

Internalizing Agency from Reflective Experience

Mar 17, 2026

Rui Ge, Yichao Fu, Yuyang Qian et al.

By teaching agents to learn from environmental feedback and explore alternative paths when they fail, LEAFE improves their problem-solving capacity across multiple attempts (Pass@k) more than methods that optimize only for single successful outcomes.

This paper introduces LEAFE, a training method that helps AI agents learn from their mistakes during long interactions with environments. Instead of just optimizing for final success, LEAFE teaches agents to reflect on feedback, backtrack to earlier decisions, try alternative approaches, and internalize these recovery strategies.

agents, reasoning, training

HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Mar 16, 2026

Erik Y. Wang, Sumeet Motwani, James V. Roggeveen et al.

AI systems can now potentially contribute novel mathematical insights on real unsolved problems, but we need better benchmarks to measure this—HorizonMath provides one by focusing on problems where verification is cheap but discovery is genuinely hard.

HorizonMath is a benchmark of 100+ unsolved math problems across 8 domains designed to test whether AI can make genuine mathematical discoveries. Unlike existing benchmarks, it focuses on problems that are hard to solve but easy to verify automatically, avoiding data contamination issues. Early results show GPT-5.4 Pro found solutions to two problems that may improve on published results.

evaluation, reasoning

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Mar 16, 2026

Aozhe Wang, Yuchen Yan, Nan Zhou et al.

Separating code and test generation into competing models with opposing rewards prevents self-collusion and produces higher-quality code and tests than single-model self-play approaches.

Code-A1 uses two competing AI models to improve code generation: one model writes code, the other writes tests to find bugs in that code. By making them adversaries with opposite goals, the system avoids the problem where a single model could cheat by writing easy tests for itself. This approach generates better code and tests than training on human-written test suites alone.

training, reasoning

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Mar 16, 2026

Yibin Liu, Yaxing Lyu, Daqi Gao et al.

Reinforcement learning can transform passive video understanding models into active task evaluators by training them to generate explicit reasoning about progress toward goals—enabling smaller models to outperform much larger ones on robot manipulation tasks.

This paper introduces PRIMO R1, a 7B video AI model that learns to actively evaluate robot manipulation progress by using reinforcement learning to generate step-by-step reasoning. Unlike standard models that passively recognize what's happening, PRIMO R1 compares current robot states to task goals and predicts failures, achieving better accuracy than much larger models on robotic tasks.

reasoning, agents, multimodal

SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval

Mar 16, 2026

Jesper Derehag, Carlos Calva, Timmy Ghiurau

Smart ranking of retrieved candidates matters more than upfront structuring—a simple deterministic pipeline with just one learned ranking component outperforms complex memory systems on conversational retrieval tasks.

SmartSearch retrieves relevant information from raw conversation history without complex structuring or learned policies. It combines simple matching, rule-based expansion, and ranking to find evidence efficiently, achieving 93.5% accuracy on benchmarks while using 8.5x fewer tokens than baselines.
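A deterministic caricature of that match-expand-rank pipeline (the synonym table and overlap score below are invented for illustration; in the real system the ranking component is learned):

```python
# Hypothetical rule table standing in for rule-based query expansion.
SYNONYMS = {"buy": {"purchase"}, "car": {"vehicle"}}

def expand(terms):
    """Rule-based expansion: add known synonyms to the query terms."""
    out = set(terms)
    for t in terms:
        out |= SYNONYMS.get(t, set())
    return out

def smart_search(query, history, top_k=2):
    """Match raw conversation turns against the expanded query,
    then rank by term overlap. Toy sketch only."""
    terms = expand(set(query.lower().split()))
    scored = []
    for turn in history:
        overlap = len(terms & set(turn.lower().split()))
        if overlap:
            scored.append((overlap, turn))
    scored.sort(key=lambda s: s[0], reverse=True)   # the ranking step
    return [turn for _, turn in scored[:top_k]]
```

The takeaway the paper argues for is that most of the quality lives in the final ranking, not in building elaborate memory structures up front.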

efficiency, reasoning

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Mar 16, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

You can now build frontier-level search agents without proprietary data—OpenSeeker proves that smart data synthesis (not scale) is the bottleneck, and releases everything needed to replicate it.

OpenSeeker is a fully open-source search agent that achieves state-of-the-art performance by synthesizing high-quality training data through two techniques: generating complex multi-hop reasoning tasks by reverse-engineering web graphs, and denoising agent trajectories using summarization.

agents, data, reasoning

Computational Concept of the Psyche

Mar 16, 2026

Anton Kolonin, Vladimir Krykov

AGI systems should be built around an agent's internal needs and goals as the core driver of learning and decision-making, rather than treating intelligence as separate from motivation.

This paper proposes a cognitive architecture for artificial general intelligence that models the psyche as an operating system managing an agent's needs, sensations, and actions. The approach formalizes AGI as an optimization problem where agents learn through experience to satisfy needs while managing uncertainty and minimizing existential risks.

architecture, reasoning, agents

Mamba-3: Improved Sequence Modeling using State Space Principles

Mar 16, 2026

Aakash Lahoti, Kevin Y. Li, Berlin Chen et al.

Mamba-3 shows that linear models can match Transformer quality on real tasks by using complex-valued state tracking and better architectural design, opening a path to cheaper inference without sacrificing capability.

Mamba-3 improves linear sequence models by using state space principles to handle tasks that require tracking information over time. Unlike Transformers that are slow to run, Mamba-3 maintains constant memory and linear compute while matching quality on language tasks—making it faster and cheaper to deploy.

architecture, efficiency, reasoning

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Mar 13, 2026

Haonan Huang

AI agents performing scientific research need memory and reflection, not just execution capability. Knowledge consolidation between runs dramatically improves efficiency and accuracy in computational science workflows.

QMatSuite is a platform that helps AI agents learn from computational materials science experiments by storing findings, retrieving past knowledge, and reflecting on results.

agents, reasoning, data

Semantic Invariance in Agentic AI

Mar 13, 2026

I. de Zarzà, J. de Curtò, Jordi Cabot et al.

Model size doesn't guarantee robustness: smaller models like Qwen3-30B outperform much larger models at maintaining consistent reasoning when problems are rephrased, suggesting that scaling alone won't solve reliability issues for deployed AI agents.

This paper tests whether AI agents give consistent answers when you rephrase the same problem in different ways. The researchers found that larger models are actually less stable than smaller ones—a surprising result that challenges assumptions about model scaling.

evaluation, reasoning, agents

When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

Mar 13, 2026

Yu Li, Tian Lan, Zhengling Qi

By explicitly comparing correct and incorrect reasoning traces during training, you can improve reasoning model performance without extra sampling or auxiliary models—just by restructuring how the model learns from existing data.

This paper improves GRPO, a method for training reasoning models, by having the model learn from contrasts between correct and incorrect solutions in the same batch. It introduces two techniques: Bilateral Context Conditioning (letting the model compare successful vs failed reasoning traces) and Reward-Confidence Correction (stabilizing training by adjusting baselines).

training, reasoning

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Mar 13, 2026

Zhengwei Xie, Zhisheng Chen, Ziyan Weng et al.

Embodied agents can continuously improve without retraining by organizing experiences with detailed failure diagnosis and using those insights to constrain and guide planning at test time.

Steve-Evolving is a framework that helps AI agents learn and improve from their experiences in open-world environments like Minecraft. Instead of updating model weights, it organizes what the agent learns into structured experiences, diagnoses why actions succeed or fail in detail, and uses those insights to guide future planning through retrieved skills and safety guardrails.

agents, reasoning, training

Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study

Mar 13, 2026

Zhiye Jin, Yibai Li, K. D. Joshi et al.

LLM evaluation can be more rigorous by borrowing established methods from psychology and cognitive science—this platform shows how to systematically apply those methods at scale.

Researchers built PsyCogMetrics AI Lab, a cloud platform that applies psychology and cognitive science methods to evaluate large language models. The study uses a rigorous three-phase design process to identify evaluation gaps, develop theory-based assessment methods, and test them in practice.

evaluation, reasoning

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Mar 12, 2026

Fangfu Liu, Diankun Wu, Jiawei Chi et al.

Test-time training—updating model parameters on-the-fly during inference—enables better spatial reasoning from video by letting the model continuously organize and retain 3D spatial information rather than relying on fixed context windows.

This paper introduces Spatial-TTT, a system that helps AI models understand 3D spaces from continuous video streams by adapting and updating their internal parameters during inference. It combines efficient video processing with a spatial prediction mechanism and specialized training data to maintain accurate spatial understanding over long videos.

architecture, reasoning, multimodal

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Mar 12, 2026

Xuanlang Dai, Yujie Zhou, Long Xing et al.

Diffusion models can solve complex reasoning tasks better by having the language encoder think iteratively and update its guidance throughout the generation process, rather than encoding instructions once at the start.

This paper improves how diffusion models solve complex reasoning tasks by making the language model encoder think step-by-step. Instead of encoding instructions once, the system iteratively refines the model's internal reasoning and feeds it progressively to the image generation process, achieving 92% accuracy on spatial reasoning tasks like mazes and puzzles.

reasoning, multimodal, architecture

Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Mar 12, 2026

Samy Jelassi, Mujin Kwun, Rosie Zhao et al.

Feature-matching fine-tuning provides a middle ground between simple token prediction and complex reinforcement learning—it gives dense semantic feedback without needing task-specific reward models, making it practical for improving model behavior on real tasks.

This paper proposes a new way to fine-tune language models by matching learned feature representations instead of predicting individual tokens. Rather than using reinforcement learning with reward models, the method generates multiple model outputs in parallel and uses their semantic features to guide training, achieving better results than standard fine-tuning on coding and translation tasks.
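The contrast with token-level training can be sketched in a few lines of numpy. Assume some feature extractor gives each generation a semantic vector (here just mean-pooled embeddings, a stand-in for the paper's learned features); the training signal is the distance between the average feature of parallel model samples and that of reference outputs, with no per-token targets and no reward model:

```python
import numpy as np

def pooled_features(token_embeddings):
    """Mean-pool token embeddings into one semantic feature vector
    (a stand-in for a learned representation of a full generation)."""
    return token_embeddings.mean(axis=0)

def feature_matching_loss(sample_batches, reference_batches):
    """Match the average feature of several parallel model samples to
    the average feature of reference outputs, instead of scoring
    individual tokens."""
    sample_mean = np.mean([pooled_features(s) for s in sample_batches], axis=0)
    ref_mean = np.mean([pooled_features(r) for r in reference_batches], axis=0)
    return float(np.sum((sample_mean - ref_mean) ** 2))

rng = np.random.default_rng(2)
refs = [rng.standard_normal((10, 8)) for _ in range(4)]    # 4 refs, 10 tokens, dim 8
close = [r + 0.01 * rng.standard_normal((10, 8)) for r in refs]
far = [rng.standard_normal((10, 8)) for _ in range(4)]
```

Samples that are semantically near the references get a much lower loss than unrelated ones, even when no individual token matches exactly.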

training, efficiency, reasoning

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Mar 12, 2026

Yixin Liu, Yue Yu, DiJia Su et al.

Reasoning judges are more robust than standard judges for training AI systems, but they're not foolproof—AI policies can still learn to generate adversarial outputs that fool judges while appearing good on benchmarks.

This paper tests whether reasoning-focused language models can reliably judge AI outputs in areas where correctness is hard to verify (like essay quality or creative writing). The researchers found that reasoning judges perform better than standard judges on benchmarks, but they can still be tricked into rewarding outputs that game the system rather than genuinely improve quality.

alignment, evaluation, reasoning

Separable neural architectures as a primitive for unified predictive and generative intelligence

Mar 12, 2026

Reza T. Batley, Apurba Sarker, Rajib Mostakim et al.

Separable neural architectures provide a unified framework for both prediction and generation tasks by imposing structural constraints that decompose high-dimensional problems into simpler, more interpretable components—useful when your system has underlying factorizable structure.

This paper introduces separable neural architectures (SNAs), a structured approach to building neural networks that explicitly exploit factorizable patterns in data. By constraining how different parts of a system interact, SNAs can model everything from physics simulations to language more efficiently.
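The factorization idea is easy to show concretely. A separable function f(x, y) = Σₖ gₖ(x)·hₖ(y) can be evaluated on a full grid with one outer product per component instead of a dense pass over every (x, y) pair. A toy numpy sketch (the component functions are illustrative, not from the paper):

```python
import numpy as np

def separable_eval(gs, hs, x, y):
    """Evaluate f(x, y) = sum_k g_k(x) * h_k(y) on a full grid using
    one outer product per component -- the factorized structure that
    separable architectures impose."""
    G = np.stack([g(x) for g in gs])     # (K, |x|)
    H = np.stack([h(y) for h in hs])     # (K, |y|)
    return np.einsum('ki,kj->ij', G, H)  # (|x|, |y|) grid

# A target with factorizable structure: f(x, y) = sin(x)cos(y) + x*y
gs = [np.sin, lambda x: x]
hs = [np.cos, lambda y: y]
x = np.linspace(0, 1, 5)
y = np.linspace(0, 1, 7)
F = separable_eval(gs, hs, x, y)
```

Each component stays a simple one-dimensional function, which is where the interpretability claim comes from: you can inspect gₖ and hₖ separately.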

architecture, reasoning, efficiency

Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Mar 12, 2026

Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur et al.

LLMs can augment creative scientific reasoning by treating interdisciplinary research as a structured exploration process: decompose goals into questions, find analogous problems in other fields, then synthesize insights back into your domain.

Idea-Catalyst is a framework that helps researchers and AI systems discover creative interdisciplinary insights by systematically connecting research challenges across different fields.

reasoning, applications, agents

WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

Mar 12, 2026

Taylor Paul, William Regli

Automated planning can solve the joint problem of designing distributed data pipelines and scheduling them on real infrastructure, enabling users to specify workflows declaratively rather than imperatively.

This paper introduces WORKSWORLD, a planning domain for automatically designing and scheduling data pipelines across distributed computer systems. Instead of manually specifying how data flows between processing components, users describe their data sources, available tools, and desired outputs—and an AI planner figures out the optimal workflow and resource allocation.

reasoning, agents, applications

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Mar 12, 2026

Yushi Bai, Qian Dong, Ting Jiang et al.

You can make sparse attention 1.8× faster during prefill by reusing token-selection indices across layers—most layers don't need their own indexer since they pick the same tokens as nearby layers.

IndexCache speeds up sparse attention in large language models by reusing token selection indices across layers instead of computing them separately at each layer. Since consecutive layers select similar tokens anyway, the method caches these selections from a few 'Full' layers and reuses them in other 'Shared' layers, cutting indexer computation by 75% with minimal quality loss.
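The caching scheme reduces to a few lines once you abstract the indexer as top-k selection. A minimal numpy sketch (the 'Full'/'Shared' split and scores are toy stand-ins for the paper's indexer, not its implementation):

```python
import numpy as np

def topk_indices(scores, k):
    """Indices of the k highest-scoring tokens (order irrelevant)."""
    return set(np.argpartition(scores, -k)[-k:].tolist())

def select_with_index_cache(layer_scores, k, full_layers):
    """Pick attended tokens per layer, running the top-k indexer only
    at designated 'Full' layers and reusing the cached selection at
    the intervening 'Shared' layers."""
    selections = []
    cached = None
    for layer, scores in enumerate(layer_scores):
        if layer in full_layers or cached is None:
            cached = topk_indices(scores, k)  # run the indexer
        selections.append(cached)             # Shared layers reuse it
    return selections

# Toy run: 8 layers, indexer runs only at layers 0 and 4,
# so 6 of 8 indexer passes (75%) are skipped.
rng = np.random.default_rng(0)
scores = [rng.random(16) for _ in range(8)]
sel = select_with_index_cache(scores, k=4, full_layers={0, 4})
```

The quality argument is that consecutive layers would have selected nearly the same tokens anyway, so reusing the cached indices changes little while the indexer cost drops with the number of Shared layers.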

efficiency, reasoning

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski et al.

Current document-reasoning agents succeed through exhaustive search rather than strategic thinking—they need better planning abilities, not just more attempts, to handle real-world document workflows efficiently.

This paper introduces MADQA, a benchmark with 2,250 questions across 800 PDF documents, to test whether AI agents can strategically navigate documents or just randomly search. The researchers found that while agents match human accuracy on some questions, they use brute-force trial-and-error rather than smart planning, and fall 20% short of optimal performance.

evaluation, agents, reasoning

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Mar 12, 2026

Jingyang Ke, Weihan Li, Amartya Pradhan et al.

You can leverage pretrained vision-language models for specialized tasks like animal behavior analysis without fine-tuning—just guide them through explicit reasoning steps and let them work with minimal human labels.

BehaviorVLM uses vision-language models to automatically understand animal behavior and estimate body poses without requiring task-specific training or heavy manual labeling. It combines visual reasoning, temporal analysis, and semantic understanding to identify what animals are doing and where their body parts are, making behavioral neuroscience research more scalable and reproducible.

multimodal, applications, reasoning

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Feb 27, 2026

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

For AI models to recognize new combinations of familiar concepts, their internal representations must be mathematically linear and orthogonal.

This paper explains why neural networks need to organize information in a specific geometric way to recognize familiar concepts in new combinations. The researchers prove that for a model to generalize to unseen combinations of concepts, its internal representations must decompose into separate, perpendicular components for each concept.

architecture, reasoning, evaluation

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 26, 2026

Amita Kamath, Jack Hessel, Khyathi Chandu et al.

Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data curation.

Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images—omitting obvious details.

data, evaluation, reasoning