ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

921 papers, 65 this month, 12 topics
Topics: Efficiency 38, Training 37, Evaluation 33, Reasoning 27, Agents 23, Architecture 23, Applications 21, Multimodal 15, Safety 12, Scaling 8, Alignment 8, Data 6

May 25 – May 31 (9)

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

May 26, 2026

Huawei Lin, Peng Li, Jie Song et al.

Treating AI agent skills as long-lived, testable assets with persistent memory—rather than disposable code—significantly improves task success rates and enables skills to transfer between agents and tasks.

This paper introduces MUSE-Autoskill, a framework that helps AI agents continuously improve by creating, storing, and refining reusable skills over time. Instead of treating skills as one-time solutions, the system manages them like software—organizing them in memory, testing them, and learning from experience to make them more reliable and effective across different tasks.

agents, training, reasoning

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

May 26, 2026

Tamerlan Aghayev, Maxime Elkael, Michele Polese et al.

AI agents can handle complex domain-specific engineering when grounded in real-world validation and persistent knowledge—LLMs alone fail on RAN work because they hallucinate APIs and break on real hardware, but agents with feedback loops and ground truth don't.

GENESIS is an AI agent framework that automates cellular network (6G RAN) development by converting specifications and problems into tested code solutions. It combines LLMs with real hardware validation and a persistent knowledge base to handle tasks like feature implementation, testing, and optimization that normally take months of manual engineering.

May 18 – May 24 (20)

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

May 22, 2026

Jianshu Zhang, Yijiang Li, Huifeixin Chen et al.

Current VLMs struggle to genuinely understand spatial numbers—they can't reliably map between visual coordinates and numerical values, which is critical for embodied AI tasks like robotics that require precise spatial outputs.

This paper tests whether Vision-Language Models (VLMs) truly understand spatial numbers like coordinates and distances. Using SpaceNum, a framework with two tasks (converting numbers to spatial positions and vice versa), researchers find that VLMs largely fail at grounding numbers in actual spatial meaning, relying on shallow visual cues rather than genuine spatial reasoning.

evaluation, multimodal, reasoning

ETCHR: Editing To Clarify and Harness Reasoning

May 22, 2026

Beichen Zhang, Yuhong Liu, Jinsong Li et al.

Decoupling image editing from language understanding—and training the editor specifically for reasoning tasks—improves multimodal reasoning accuracy across diverse visual tasks without modifying the base model.

ETCHR is a specialized image editing model that helps multimodal AI systems reason better by transforming images based on questions. Unlike general image editors, it's trained to understand abstract reasoning tasks and produce clearer images for downstream analysis, improving performance across visual reasoning tasks by 4-5% without retraining the main AI model.

May 11 – May 17 (9)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

May 14, 2026

Ziyu Guo, Rain Liu, Xinyan Chen et al.

A single discrete token can serve dual purposes—executing visual operations like code while also functioning as a learnable reasoning unit—making visual reasoning more efficient and trainable without architectural changes.

ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.

reasoning, multimodal, agents

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

May 4 – May 10 (22)

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

May 8, 2026

Tong Zheng, Haolin Liu, Chengsong Huang et al.

You can automatically discover better inference strategies for LLMs by treating strategy design as a search problem over execution traces rather than hand-designing heuristics, and the search is cheap to run at scale.

This paper presents AutoTTS, a framework that automatically discovers test-time scaling strategies for LLMs instead of relying on hand-crafted heuristics.

reasoning

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoning
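
The coverage guarantee mentioned in the summary above comes from conformal prediction in general. A minimal sketch of the standard split-conformal recipe is shown below; it is not the paper's path-level method. The calibration scores, candidate answers, and function names are hypothetical, and the nonconformity score is assumed to be "one minus the score the model gave the correct answer".

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha))
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate
        return float("inf")
    return sorted(cal_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {c for c, s in candidate_scores.items() if s <= threshold}

# Hypothetical calibration data (assumed, for illustration only)
cal = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55]
tau = conformal_threshold(cal, alpha=0.2)

# Hypothetical candidate answers with their nonconformity scores
candidates = {"Paris": 0.08, "Lyon": 0.38, "Berlin": 0.70}
answers = prediction_set(candidates, tau)
```

A better scoring function pushes wrong answers above the threshold more often, which is how a learned path scorer could shrink the set without giving up the target coverage of 1 - alpha.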

Apr 27 – May 3 (19)

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

May 1, 2026

Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.

Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.

HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.

reasoning, architecture, efficiency

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

May 1, 2026

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.

LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.

This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.

Apr 20 – Apr 26 (21)

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Apr 24, 2026

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.

World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.

This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).

agents, reasoning, evaluation

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Apr 24, 2026

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.

This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.

agents, reasoning, applications

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

May 26, 2026

Yi Jing, Zao Dai, Jinwu Hu et al.

Instead of picking training data based only on external metrics, you can use SAEs to decode what the model actually learns internally, then use those signals to organize data better—making training more efficient without changing the model architecture.

This paper shows how to improve LLM training by using Sparse Autoencoders (SAEs) to read the model's internal representations and guide data selection. The method clusters training data for diversity, orders it by difficulty, and filters low-quality examples—improving math performance by 3% and cutting training time by 20% on smaller models.

training, data, reasoning

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

May 26, 2026

Yuchen Liang, Ness Shroff, Yingbin Liang

GADD accelerates discrete diffusion sampling from many steps to logarithmically few steps without additional training, providing both theoretical guarantees and practical speedups for text and symbolic generation tasks.

This paper speeds up discrete diffusion models (used for text and symbolic data generation) by introducing GADD, a new method that uses Gibbs corrections to reduce sampling steps. Unlike existing acceleration techniques, GADD doesn't require extra training and achieves theoretically optimal speedup, making it practical for real applications like text and music generation.

efficiency, training, reasoning

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

May 26, 2026

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

2-ASP(Q)^w can express optimization problems up to Delta_3^P complexity, and the CEGAR-based approach in Casper makes solving these problems practical despite their theoretical hardness.

This paper extends Answer Set Programming with quantifiers and weak constraints, creating a system called 2-ASP(Q)^w that can solve complex optimization problems. The authors prove how hard these problems are to solve theoretically, then build practical software using a refinement technique that gradually improves solutions by learning from counterexamples.

reasoning

Language Models Need Sleep

May 25, 2026

Sangyun Lee, Sean McLeish, Tom Goldstein et al.

Language models can improve long-context reasoning by periodically consolidating recent information into fast weights during offline 'sleep' phases, trading inference latency for better performance on reasoning-heavy tasks.

This paper proposes a sleep-like mechanism for language models that periodically consolidates recent context into persistent memory before clearing the cache. During 'sleep,' the model performs offline passes to update fast weights in state-space blocks, shifting computation away from real-time inference.

architecture, efficiency, reasoning

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

May 25, 2026

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior et al.

LLMs can predict physics outcomes but struggle with true scientific discovery: the strongest models pass only 50% of worlds, and good prediction accuracy doesn't guarantee conceptual understanding of the underlying laws.

DiscoverPhysics is a benchmark that tests whether large language models can discover unknown physics laws by designing experiments in simulated worlds with non-standard physics.

reasoning, evaluation, agents

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

May 25, 2026

Yusong Lin, Xinyuan Liang, Haiyang Wang et al.

Building truly useful AI assistants requires handling messy, interconnected real-world contexts—not isolated tasks—and current models fall far short of this challenge, but synthetic data generation can help close the gap.

Claw-Anything is a benchmark for testing AI agents as always-on personal assistants with access to a user's full digital world—including activity history, multiple services, and both GUI and CLI interfaces.

agents, evaluation, reasoning

VeriTrace: Evolving Mental Models for Deep Research Agents

May 25, 2026

Haolang Zhao, Yunbo Long, Lukas Beckenbauer et al.

Research agents need explicit feedback mechanisms to evolve their understanding of tasks—not just bigger models—to avoid error propagation when working through complex, interdependent information.

VeriTrace is a framework that helps AI research agents maintain accurate mental models by explicitly tracking and correcting their understanding as they work through complex problems. Instead of letting language models implicitly manage their reasoning, it uses three feedback loops to catch errors early and prevent them from cascading through the agent's work.

reasoning, agents, evaluation

Leveraging Foundation Models for Causal Generative Modeling

May 22, 2026

Aneesh Komanduri, Xintao Wu

You can leverage existing pretrained models for causal reasoning tasks by building a modular pipeline that extracts concepts, manipulates them causally, and generates counterfactuals—no need to retrain from scratch.

This paper presents FM-CGM, a framework that combines pretrained foundation models (reasoning models and diffusion models) to perform causal reasoning on images. It enables zero-shot discovery of causal relationships, intervention on concepts, and generation of counterfactual images—all without retraining the models.

reasoning, multimodal, applications

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training, reasoning, efficiency

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

May 21, 2026

Lily Goli, Justin Kerr, Daniele Reda et al.

Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously-updated world model and episodic memory of the agent's trajectory—without these, agents waste effort revisiting forgotten states instead of discovering new regions.

This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.

reasoning, agents

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.

architecture, efficiency, reasoning

Advancing Mathematics Research with AI-Driven Formal Proof Search

May 21, 2026

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov et al.

LLMs become reliable enough for mathematics research when their outputs are verified by formal proof checkers—this hybrid approach solved previously open problems at a practical cost, showing a path beyond LLM hallucination.

Researchers used large language models to automatically generate formal proofs in Lean, a proof verification language, to solve open mathematical problems. Their AI agent successfully proved 9 open Erdős problems and 44 OEIS conjectures, demonstrating that LLMs can contribute to real mathematical research when paired with formal verification systems that catch errors.

reasoning, agents, applications

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

May 20, 2026

Benhao Huang, Zhengyang Geng, Zico Kolter

Iterative reasoning models work by learning task-specific attractors in their latent space; scaling test-time compute (more iterations and parallel paths) improves performance on hard problems without needing external verifiers.

This paper explains how AI models can solve hard problems by iteratively refining internal states, like a brain thinking through steps. The key insight is that models learn to create 'attractors'—stable patterns that pull the model toward correct answers.

reasoning, scaling

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

May 20, 2026

Tilman Tröster, David Mirkovic, Veronika Oehl et al.

Matching a model's architectural symmetries to the actual symmetries present in your data—not just the underlying physics—significantly improves performance and data efficiency.

Velocityformer is a specialized neural network that reconstructs galaxy velocities from survey data to improve cosmological measurements. By designing the model to match the asymmetric structure of real observations (where one direction—the line of sight—is special), it achieves 35% better accuracy than traditional methods and works well even with very limited training data.

architecture, reasoning, applications

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

Retrieval isn't the main problem for frontier models on deep research tasks; instead, they fail primarily at deriving answers from evidence and calibrating confidence correctly, suggesting future improvements should focus on reasoning and verification rather than search.

DeepWeb-Bench is a challenging benchmark for evaluating AI agents that research questions by searching the web, collecting evidence, and reasoning through answers. Unlike existing benchmarks, it focuses on tasks requiring massive evidence gathering, cross-source verification, and complex multi-step reasoning—areas where current frontier models still struggle significantly.

evaluation, reasoning, agents

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

May 20, 2026

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini et al.

Compiling agent tasks into code upfront—rather than deciding actions one step at a time—enables parallelization and validation, dramatically reducing latency and errors in web automation.

This paper introduces a compilation approach for web agents that converts natural language tasks into executable code plans instead of executing step-by-step. By generating multiple candidate plans, validating them against tool specifications, and optimizing for parallelization, the system achieves 10x faster execution and better accuracy than existing sequential approaches.

agents, efficiency, reasoning

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training, efficiency, reasoning

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

May 20, 2026

Kaiyi Zhang, Wei Wu, Yankai Lin

When training language models with verifiable rewards, focusing on the most discriminative token patterns—rather than averaging all tokens equally—significantly improves learning efficiency and final performance.

This paper improves how language models learn from step-by-step feedback by better understanding which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss important patterns that distinguish good answers from bad ones.

training, reasoning, alignment

Mem-π: Adaptive Memory through Learning When and What to Generate

May 20, 2026

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Generating context-specific guidance dynamically outperforms traditional retrieval-based memory for agents—the system learns to abstain when unnecessary and produce only relevant help, improving task success by over 30% on web navigation.

Mem-π is a framework that gives AI agents smarter memory by generating helpful guidance on-the-fly instead of retrieving fixed entries from a database. A separate model learns when to create guidance and what to create, trained to skip unhelpful suggestions and produce only what the agent actually needs for the current task.

agents, training, reasoning

HITL-D: Human In The Loop Diffusion Assisted Shared Control

May 20, 2026

Riley Zilka, Sergey Khlynovskiy, Allie Wang et al.

Diffusion models can effectively assist human operators in robotic control by automating specific subtasks (like orientation), reducing cognitive load while maintaining human oversight—a practical model for human-AI collaboration in physical systems.

This paper presents HITL-D, a shared control system that combines diffusion-based AI policies with human input for robotic manipulation tasks. Instead of requiring operators to control every aspect of a robot arm, the system automatically handles orientation adjustments while the human focuses on positioning, reducing mental workload and task completion time by 40% in user studies.

agents, applications, reasoning

Code as Agent Harness

May 18, 2026

Xuying Ning, Katherine Tieu, Dongqi Fu et al.

Code is becoming the primary substrate for building reliable, verifiable AI agents. Understanding code as agent harness—the infrastructure layer—is essential for building systems that can plan, remember, use tools, and coordinate across multiple agents.

This survey examines how code serves as the operational foundation for AI agents—not just as output, but as the infrastructure that enables agents to reason, act, model environments, and verify their own behavior.

agents, architecture, reasoning

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

May 18, 2026

Yining Hong, Jiageng Liu, Han Yin et al.

AI agents fail at embodied spatial reasoning primarily because they make poor action choices, not because they can't see—and they confidently stick to wrong answers even when evidence contradicts them, unlike humans who actively seek disconfirming evidence.

ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.

multimodal, reasoning

Actionable World Representation

May 18, 2026

Kunqi Xu, Jitao Li, Jianglong Ye et al.

By explicitly modeling object state changes as a learnable manifold, WorldString provides a unified way to represent how objects respond to actions—bridging the gap between perception and control for physical world models.

WorldString is a neural architecture that learns to represent how real-world objects change state over time by processing point clouds or video data. It creates a digital twin of objects that captures their actionable properties, serving as a building block for world models that can predict and interact with the physical world.

architecture, reasoning

General Preference Reinforcement Learning

May 18, 2026

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal et al.

GPRL solves reward hacking in LLM training by treating quality as multi-dimensional rather than scalar, allowing online RL to work on open-ended tasks without collapsing onto exploitable reward axes.

This paper addresses a gap in LLM training by proposing General Preference Reinforcement Learning (GPRL), which handles open-ended tasks like traditional preference optimization while maintaining the continuous exploration benefits of online RL.

training, alignment, reasoning

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

May 18, 2026

Kenan Majewski, Marcin Żugaj

Neural networks can improve classical state estimation by learning adaptive forgetting factors that respond to real-time sensor quality, enabling robust UAV navigation during sensor outages and dynamic environments.

This paper presents a learned Kalman filter that adapts to changing noise conditions in UAVs by using a neural network to dynamically adjust how much it trusts past measurements. Instead of using a fixed forgetting factor, the filter learns a memory policy from sensor data, helping it handle sensor failures and vibrations better than traditional adaptive filters.

training, efficiency, reasoning
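
For readers unfamiliar with the forgetting factor in the Kalman filter summary above, here is a minimal scalar sketch of the classical Sage-Husa measurement-noise update in its commonly cited form. It is not the paper's learned filter: the forgetting factor `b` is a plain argument here, where the paper describes a neural network supplying it. The function names and the `floor` safeguard are assumptions added for illustration.

```python
def sage_husa_weight(k, b):
    """Fading weight d_k = (1 - b) / (1 - b**(k + 1)) for step k = 0, 1, 2, ...

    b in (0, 1) is the forgetting factor; values near 1 give past data more weight.
    """
    return (1.0 - b) / (1.0 - b ** (k + 1))

def update_noise_variance(r_prev, innovation, predicted_var, k, b, floor=1e-6):
    """One scalar Sage-Husa update of the measurement-noise variance estimate.

    innovation    : measurement minus predicted measurement
    predicted_var : H * P_pred * H^T reduced to a scalar
    floor         : assumed safeguard that keeps the estimate positive
    """
    d = sage_husa_weight(k, b)
    r_new = (1.0 - d) * r_prev + d * (innovation ** 2 - predicted_var)
    return max(r_new, floor)

# At k = 0 the weight is 1, so the first estimate ignores the prior value
first = update_noise_variance(r_prev=1.0, innovation=2.0, predicted_var=0.5, k=0, b=0.95)

# For large k the weight settles near 1 - b
late_weight = sage_husa_weight(500, 0.95)
```

A fixed `b` applies the same memory length in all conditions. Making it depend on sensor quality is the part the summary attributes to the learned policy.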

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

May 14, 2026

Shang Zhou, Wenhao Chai, Kaiyuan Liu et al.

Instead of judging multiple reasoning attempts individually (which is noisy), compare them pairwise and aggregate votes to find the best solution—this scales test-time compute breadth more reliably than single-trace depth scaling.

OpenDeepThink improves LLM reasoning by running multiple solution attempts in parallel and selecting the best one using pairwise comparisons between candidates, rather than trying to judge each solution independently. The method uses Bradley-Terry aggregation to rank candidates based on LLM pairwise judgments, then evolves the top solutions using critiques from comparisons.

reasoning, evaluation
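
The ranking step described above can be sketched with the textbook Bradley-Terry model, fitted here with the standard minorization-maximization update. This is an illustration of the general technique, not the paper's implementation. The win counts are made up, and in practice they would come from LLM pairwise judgments.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times candidate i was judged better than j.
    Returns strengths normalized to sum to 1 (higher means stronger).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents of (games played against j) / (p_i + p_j)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Hypothetical judgments over three candidate solutions (4 comparisons per pair)
wins = [
    [0, 3, 4],  # candidate 0 beat 1 three times and 2 four times
    [1, 0, 3],  # candidate 1 beat 0 once and 2 three times
    [0, 1, 0],  # candidate 2 beat 1 once
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Pooling all pairwise outcomes into one strength per candidate is what makes this less noisy than scoring each attempt in isolation: a single bad judgment shifts the estimate only slightly.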

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

May 12, 2026

Runhui Huang, Jie Wu, Rui Yang et al.

Self-reflective multimodal models can improve generation quality by learning to reason about user intent and autonomously correct their outputs using decomposed, verifiable rewards from language models.

AlphaGRPO improves multimodal AI models that generate images and text by teaching them to reason about what users want and fix their own mistakes. It uses a novel reward system that breaks down complex requests into simple checkable questions, allowing the model to learn from reliable feedback without needing extra training setup.

multimodal, reasoning, training

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

May 12, 2026

Di Wu, Zixiang Ji, Asmi Kawatkar et al.

Long-term memory for agents requires more than just storing task outcomes; agents need to internalize environment-specific patterns, workflows, and failure modes to become truly experienced colleagues, and current memory systems still struggle with this despite recent advances.

This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions about five types of memory (state recall, workflow knowledge, failure modes, etc.) paired with massive history trajectories up to 500 steps and 115M tokens.

agents, evaluation, reasoning

Learning, Fast and Slow: Towards LLMs That Adapt Continually

May 12, 2026

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.

Combining parameter updates with context optimization lets LLMs learn new tasks 3x more efficiently while staying closer to their original capabilities and avoiding the forgetting that comes from pure fine-tuning.

This paper proposes Fast-Slow Training (FST), a method that combines two learning mechanisms for LLMs: updating model parameters (slow learning) and optimizing the input context (fast learning). By separating task-specific adaptation from general knowledge, FST achieves better sample efficiency, reduces catastrophic forgetting, and maintains the model's ability to learn new tasks over time.

training, efficiency, reasoning

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

May 12, 2026

Xuhao Hu, Xi Zhang, Haiyang Xu et al.

Agents perform better when trained to decide dynamically between GUI actions and tool calls rather than using only one approach—this hybrid strategy improved accuracy by 66% on real-world tasks.

ToolCUA trains computer agents to intelligently choose between GUI actions (clicks, typing) and tool calls (APIs) by synthesizing diverse training trajectories from existing data and using reinforcement learning to optimize when to switch between action types. This solves a key problem for digital agents: knowing when to use high-level tools versus low-level GUI interactions.

agents, training, reasoning

MEME: Multi-entity & Evolving Memory Evaluation

May 12, 2026

Seokwon Jung, Alexander Rubinstein, Arnas Uselis et al.

LLM agents struggle with dependency reasoning in persistent memory—when facts relate to each other, systems collapse to near-random performance, and fixing this requires impractically expensive configurations.

This paper introduces MEME, a benchmark for evaluating how well AI agents manage information across multiple sessions. It tests six memory tasks including complex scenarios like tracking dependencies between facts and handling deletions.

evaluation, agents, reasoning

Solve the Loop: Attractor Models for Language and Reasoning

May 12, 2026

Jacob Fein-Ashley, Paria Rashidinejad

Attractor Models make iterative refinement practical by using implicit differentiation to solve fixed points, enabling smaller models (27M-770M parameters) to outperform much larger ones on reasoning and language tasks without the training instability of traditional recurrent architectures.

This paper introduces Attractor Models, which improve on looped Transformers by using implicit differentiation to solve for fixed points in latent representations.

architecture, reasoning, efficiency

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

May 8, 2026

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal et al.

When using GNNs for predictions, you can get tighter, more reliable uncertainty estimates by explicitly using graph structure rather than just embedding similarity—this gives you both statistical guarantees and practical efficiency.

GRAPHLCP improves uncertainty quantification for graph neural networks by using graph structure to make better predictions with guaranteed coverage. Instead of just looking at embedding similarity, it uses graph topology and a PageRank-based approach to identify similar nodes and weight predictions appropriately, producing smaller prediction sets while maintaining statistical guarantees.

evaluationreasoning

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

May 8, 2026

James Petullo, Sonny George, Dylan Cashman et al.

You can make confidence-weighted answer selection 47% cheaper by clustering similar reasoning traces and only evaluating unique ones, without sacrificing accuracy.

VecCISC reduces the cost of weighted majority voting for LLM reasoning by filtering out duplicate or low-quality reasoning traces before sending them to a critic model. It uses semantic similarity to identify which candidate answers are worth evaluating, cutting token usage by 47% while maintaining accuracy across math, science, and reasoning tasks.
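
The filtering idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the greedy cosine-similarity clustering, the `threshold` value, the `critic` callable, and the choice to share one confidence score across a whole cluster are all assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_traces(embeddings, threshold=0.9):
    """Greedy clustering: a trace joins the first cluster whose
    representative (first member) is at least `threshold` similar."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def weighted_vote(answers, embeddings, critic, threshold=0.9):
    """Confidence-weighted majority vote that calls the (hypothetical)
    critic once per cluster instead of once per trace."""
    scores = defaultdict(float)
    critic_calls = 0
    for cluster in cluster_traces(embeddings, threshold):
        confidence = critic(cluster[0])  # score only the representative
        critic_calls += 1
        for i in cluster:
            scores[answers[i]] += confidence
    best = max(scores, key=scores.get)
    return best, critic_calls
```

With four traces of which three are near-duplicates, the critic is called twice rather than four times; the saving grows with the amount of redundancy among sampled traces.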

efficiencyreasoning

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.
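
As a rough sketch of how partial credit can feed a GRPO-style update: each response gets a weighted rubric score, and scores are then standardized within the group of responses sampled for the same prompt. The function names, the equal default weights, and the use of the population standard deviation are assumptions for illustration; the paper's exact reward and normalization may differ.

```python
from statistics import mean, pstdev


def rubric_reward(criterion_scores, weights=None):
    """Weighted average of per-criterion scores, each assumed in [0, 1]."""
    if weights is None:
        weights = [1.0] * len(criterion_scores)
    total_weight = sum(weights)
    return sum(w * s for w, s in zip(weights, criterion_scores)) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.
    Responses above the group mean get positive advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses graded on three hypothetical criteria.
group = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
rewards = [rubric_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

Compared with a binary right/wrong reward, the second response here still receives a distinct, intermediate signal instead of being lumped in with the worst one.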

trainingreasoningevaluation

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

May 8, 2026

Jiayuan Liu, Tianqin Li, Shiyi Du et al.

Giving LLM agents access to longer memory doesn't automatically improve performance; it can actually harm cooperation in multi-agent settings by shifting how they reason about the future, not by making them more suspicious.

When LLMs can remember more conversation history, they actually cooperate less in multi-agent games, a problem the authors call the memory curse. The researchers found that expanded context windows cause models to lose forward-looking intent rather than become paranoid, and they support this by showing that synthetic positive history and targeted fine-tuning can restore cooperation.

agentsreasoningalignment

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

May 8, 2026

James Petullo, Nianwen Xue

Allocating more computational effort to harder SQL generation tasks—by exploring more candidate solutions—significantly improves accuracy without needing larger models.

CA-SQL improves LLM performance on complex SQL generation tasks by estimating question difficulty and dynamically adjusting how many candidate queries to explore. It uses evolutionary search principles and a custom voting method to find better SQL solutions, achieving state-of-the-art results on the BIRD benchmark's hardest problems.

reasoningapplications

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

May 8, 2026

Gugan Thoppe, L. A. Prashanth, Ankur Naskar et al.

You can now use principled Q-learning algorithms for risk-sensitive decision-making (exponential utility), with mathematical guarantees that they find optimal policies; such algorithms previously lacked solid theoretical foundations.

This paper develops reinforcement learning algorithms for optimizing exponential utility in decision-making problems, which is important for risk-sensitive applications. The authors prove that their Q-learning-style algorithms converge to optimal policies and provide theoretical guarantees on convergence speed.

reasoning

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

May 7, 2026

Yuhang Lai, Jiazhan Feng, Yee Whye Teh et al.

Using an independent verifier to validate problem correctness prevents reward hacking in AI-generated math problems, enabling better training data creation without human experts.

This paper tackles the problem of generating valid and challenging math problems for training AI models. Instead of relying on humans or simple self-play (which often produces invalid problems), the authors introduce VHG, a system with three players: a problem setter, a solver, and an independent verifier.

trainingreasoningdata

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

May 7, 2026

Daniel Zheng, Ingrid von Glehn, Yori Zwols et al.

AI agents work best for complex research when designed as collaborative partners that maintain context, track what didn't work, and produce native outputs—not just as answer machines.

Researchers built an interactive AI workbench that helps mathematicians explore open-ended research problems by combining agents for literature search, computation, theorem proving, and theory building. The system tracks failed ideas, manages uncertainty, and outputs mathematical artifacts—mimicking how human collaborators work together.

agentsreasoningapplications

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

May 7, 2026

Mingwei Xu, Hao Fang

You can train reasoning models effectively using only positive examples—negative examples aren't necessary if you redistribute probability mass correctly and stabilize learning through siamese networks.

This paper proposes POPO, a new training method for reasoning-focused language models that learns exclusively from successful (positive) examples rather than mixing successes with failures. Instead of comparing positive and negative rollouts, as existing methods such as GRPO do, POPO uses importance sampling to implicitly learn what to avoid, stabilized through a siamese network architecture.

trainingreasoningalignment

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

May 7, 2026

Xiangyuan Xue, Yifan Zhou, Zidong Wang et al.

Adding explicit strategy planning at the start of a task—rather than pure reactive decision-making—dramatically improves both learning efficiency and success rates for LLM agents on long-horizon tasks.

StraTA improves how language models learn to make decisions over many steps by having them first plan a high-level strategy before acting. Instead of reacting moment-by-moment, the model samples a strategy from the initial state, follows it through actions, and learns both strategy planning and action execution together.

agentsreasoning

Almost-Orthogonality in Lp Spaces: A Case Study with Grok

May 6, 2026

Ziang Chen, Jaume de Dios Pont, Paata Ivanisvili et al.

AI language models can contribute meaningfully to mathematical discovery by helping identify intermediate lemmas and inequalities, though human mathematicians remain essential for rigorous proof construction and validation.

This paper proves new bounds on how sums of almost-orthogonal functions behave in Lp spaces, showing when certain inequalities hold and when they fail. The authors use a large language model called Grok to help discover intermediate results, demonstrating how AI can assist in mathematical research.

reasoningevaluation

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

May 6, 2026

Yijun Lu, Rui Ye, Yuwen Du et al.

Agents performing long-horizon tasks need adaptive context management—selectively compressing or discarding information—rather than naively accumulating everything, which improves efficiency and reduces hallucination.

LongSeeker introduces Context-ReAct, a framework that helps AI agents manage growing context during long tasks by selectively compressing, skipping, or deleting information based on relevance. The agent uses five operations to reshape its working memory, reducing costs and errors while maintaining task-critical information.

agentsreasoningefficiency

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

May 6, 2026

Alexander Hsu, Zhaiming Shen, Wenjing Liao et al.

Transformer attention can act as a feature learner for nonlinear functions during in-context learning, and this capability can be theoretically analyzed with concrete error bounds—bridging the gap between empirical success and mathematical understanding.

This paper explains how transformers perform in-context learning for nonlinear regression tasks. The researchers show that transformer attention mechanisms can automatically create nonlinear features (like polynomials or splines) from examples in the prompt, enabling the model to solve complex regression problems without updating weights.

reasoningarchitectureevaluation

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

May 6, 2026

Alper Yıldırım

Transformers for time series don't rely on superposition like they do in language tasks, meaning time series forecasting may not require the compositional complexity that makes Transformers powerful for NLP.

This paper investigates how Transformers work internally for time series forecasting by analyzing their hidden representations using sparse autoencoders. The key finding: Transformers don't need complex, overlapping feature representations (superposition) to forecast well—their representations stay sparse and simple, which explains why basic linear models remain competitive.

reasoningevaluation

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

May 5, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

You can build certified graph classifiers without gradient training by using topology-aware landmark selection and closed-form kernel methods—achieving competitive accuracy with built-in confidence bounds.

PALACE is a method for classifying point clouds and graphs using persistent homology (a topological data analysis technique) with adaptive landmark placement.

evaluationreasoning

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

May 5, 2026

Dutao Zhang, Tian Liao

Retrieval strategy selection can be packaged as a reusable agent skill that learns from experience, rather than hard-coded into workflows, enabling better performance across diverse question types without changing the underlying retrievers.

This paper presents Experience-RAG Skill, a smart retrieval orchestration layer that learns which retrieval strategy works best for different types of questions.

agentsreasoning

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

May 4, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

This approach trades the flexibility of learned models for interpretability and formal guarantees: you get provable error bounds and confidence scores for each prediction, but performance lags behind neural baselines on some datasets due to limited descriptor expressiveness.

PLACE is a method for classifying point clouds and graphs using topological features (persistent homology) with mathematical guarantees.

evaluationreasoning

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

May 4, 2026

Jiujiu Chen, Yazheng Liu, Sihong Xie et al.

Process reward models need to account for the full context of reasoning paths and penalize risky intermediate steps, not just reward final correctness—this matters most in domains where wrong reasoning paths are costly.

This paper addresses a key problem in evaluating AI reasoning: process reward models often give high scores to flawed reasoning paths because later correct steps mask earlier mistakes. The authors propose SCPRM, which evaluates reasoning steps by looking at what came before and measuring distance to the target, then use it with tree search to answer questions about knowledge graphs.

reasoningevaluationagents

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

May 4, 2026

Quang Hieu Pham, Yang He, Ping Nie et al.

Flexible database interaction throughout reasoning—exploring schemas and data on-demand rather than upfront—is more effective for text-to-SQL than fixed pipelines, even with smaller models.

FlexSQL is a text-to-SQL agent that can explore database schemas, inspect data, and run verification queries at any point during reasoning—rather than retrieving schema once upfront. It generates multiple execution plans, implements them in SQL or Python, and uses a two-tiered repair system to recover from mistakes.

reasoningagentsapplications

AIs and Humans with Agency

May 4, 2026

David Mumford

Building AI systems with genuine agency isn't about making LLMs act alone—it requires new architectures where AI and humans co-develop plans and actions together for specific real-world situations.

This paper examines what agency means for both humans and AI systems, noting that human agency develops gradually through brain maturation while current LLMs struggle to act autonomously. The author argues that effective AI agency requires a fundamentally different architecture where AI systems and humans jointly plan and execute actions together in real-world contexts.

agentsarchitecturereasoning

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

May 1, 2026

Arunabh Srivastava, Mohammad A. Khojastepour et al.

To make LLMs reliable at executing plans, you need to enforce structure through explicit control constructs, validate outputs against derived constraints at each step, and dynamically route to the best execution method (reasoning, tools, or code).

RunAgent is a system that helps AI agents execute multi-step plans written in natural language by converting them into a structured format with explicit control flow (like IF statements and loops).

agentsreasoningarchitecture

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

May 1, 2026

Jacques Raynal, Pierre Slangen, Jacques Margerit

Observable performance metrics can mask fundamentally different internal system organizations—a critical insight for understanding adaptive biological systems where multiple solutions may produce identical outputs.

This study shows that measuring a system's output performance alone doesn't reveal how it's actually organized internally. Using gait analysis in a Parkinson's patient with dental constraints, researchers found that similar-looking movement patterns can come from very different internal system states when examined through dynamical systems and machine learning lenses.

evaluationreasoning

Characterizing the Expressivity of Local Attention in Transformers

May 1, 2026

Jiaoda Li, Ryan Cotterell

Local attention isn't just an efficiency trick—it fundamentally expands what a transformer can learn by recognizing different patterns than global attention, and combining both types creates the most powerful model.

This paper explains why local attention (where tokens only look at nearby predecessors instead of all previous tokens) sometimes improves transformer performance. The authors prove that local attention expands the set of patterns a transformer can recognize, and that combining local and global attention yields the most expressive model.
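
To make the distinction concrete, here is a small sketch of the two attention masks. The `window` parameter and the choice to let a token attend to itself are assumptions for illustration; the paper's formal definition may differ in such details.

```python
def global_causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j:
    every earlier position and itself."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def local_causal_mask(seq_len, window):
    """Like the global mask, but position i only sees the `window`
    most recent positions (itself included)."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


# With window=2, position 3 attends to positions 2 and 3 only.
mask = local_causal_mask(4, 2)
```

When `window` is at least the sequence length, the local mask coincides with the global one, so a model mixing both kinds of heads covers strictly more masking patterns than either alone.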

architecturereasoningevaluation

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

Apr 30, 2026

Zainab Rehan, Christian Medeiros Adriano, Sona Ghahremani et al.

You can use LLMs with formal verification to automatically synthesize safety rules from human goals, catching errors before deployment—reducing the gap between what we want AI to do and what it actually does.

This paper presents a system that automatically creates and verifies safety rules for AI systems by combining language models, formal logic, and causal reasoning. It takes high-level goals from humans (like "avoid collisions") and converts them into formal logical rules that can be checked for correctness, then tests them in autonomous driving scenarios.

safetyreasoningalignment

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Apr 30, 2026

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan et al.

LLMs fail at implicit prediction tasks on tables because they don't recognize when a question requires inference from patterns rather than lookup; intent disambiguation is the critical bottleneck.

TopBench is a benchmark for testing how well language models can answer questions about tables that require prediction and reasoning, not just data lookup. It includes 779 examples across tasks like forecasting values, analyzing treatment effects, and complex filtering—revealing that current models struggle to recognize when prediction is needed and often default to simple retrieval instead.

evaluationreasoningdata

Select to Think: Unlocking SLM Potential with Local Sufficiency

Apr 29, 2026

Wenxuan Ye, Yangyang Zhang, Xueli An et al.

Small models already generate the right answers in their candidate predictions—they just rank them poorly. Training them to re-rank their own outputs improves reasoning without external model calls.

Small language models struggle with reasoning tasks compared to large models. This paper discovers that when small models fail, the correct token from a large model is usually hidden in the small model's top-8 predictions.
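
The observation above amounts to a top-k hit-rate measurement, which can be sketched as follows. The function names and the plain-list representation of logits are assumptions for the example, not the paper's code.

```python
def top_k_indices(logits, k=8):
    """Indices of the k highest-scoring tokens, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def top_k_hit_rate(small_model_logits, reference_tokens, k=8):
    """Fraction of steps where the reference token (for instance, the
    large model's choice) appears in the small model's top-k candidates."""
    hits = sum(
        ref in top_k_indices(logits, k)
        for logits, ref in zip(small_model_logits, reference_tokens)
    )
    return hits / len(reference_tokens)
```

A high hit rate at small `k` suggests the right token is usually available locally, so the remaining problem is choosing among a handful of candidates rather than generating from the full vocabulary.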

efficiencyreasoningtraining

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Apr 29, 2026

Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud et al.

Cascading multiple specialized modules (query reformulation, evidence ranking, grounded generation, answer-evidence linking) with an LLM outperforms end-to-end approaches for clinical QA, especially when grounding answers to source documents matters for patient safety.

This paper presents a clinical question-answering system that helps patients understand their electronic health records. A four-stage LLM pipeline interprets patient questions, finds relevant evidence in medical notes, generates grounded answers, and links answers back to source documents.

applicationsreasoningevaluation

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Apr 29, 2026

Bao Pham, Mohammed J. Zaki, Luca Ambrogioni et al.

Language diffusion models memorize training data by default, but you can detect when they switch to genuine generalization by monitoring conditional entropy—a practical signal for assessing whether a deployed model is memorizing or creating.

This paper reveals that language diffusion models work like associative memories—they store training data in 'basins of attraction' and can retrieve both memorized and unseen examples. As training data grows, the model transitions from memorizing to generalizing, a shift detectable by measuring conditional entropy of token predictions.

trainingevaluationreasoning

Recursive Multi-Agent Systems

Apr 28, 2026

Xiyuan Yang, Jiaru Zou, Rui Pan et al.

Multi-agent systems can be made faster and more efficient by having agents refine their reasoning through recursive loops in latent space rather than text-based communication, achieving 1.2-2.4× speedup with 35-76% fewer tokens.

This paper introduces RecursiveMAS, a framework that improves multi-agent AI systems by having agents collaborate through repeated refinement cycles in a shared latent space rather than exchanging text.

agentsreasoningefficiency

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Apr 28, 2026

Chu-Cheng Lin, Eugene Ie

When training reasoning models with sparse rewards, you can escape cold-start failure by interpolating between RL and supervised learning via the Tsallis loss family—intermediate values of q balance speed of learning with training stability.

This paper solves a key problem in training reasoning models: when models rarely succeed initially, standard reinforcement learning gets stuck. The authors introduce a family of loss functions (using Tsallis math) that smoothly blend between two extremes—pure RL and pure supervised learning—letting practitioners choose how quickly to commit to learning from successes.
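
One way to see the interpolation is through the standard Tsallis q-logarithm, sketched below. This assumes the loss is the negative q-logarithm of the success probability; the paper's exact parameterization may differ, and the function names are illustrative.

```python
import math


def tsallis_log(p, q):
    """Tsallis q-logarithm: (p**(1-q) - 1) / (1-q), equal to log(p) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)


def tsallis_loss(p_success, q):
    """q = 0 gives 1 - p (an expected-reward, RL-like objective);
    q = 1 gives -log p (a likelihood, supervised-like objective)."""
    return -tsallis_log(p_success, q)


def gradient_weight(p_success, q):
    """Factor multiplying d(log p)/d(theta) in the gradient of the
    q-logarithm, which works out to p**(1-q)."""
    return p_success ** (1.0 - q)
```

Under this assumption the cold-start problem is visible directly: when the success probability is 0.01, the gradient is scaled by 0.01 at q = 0 but by 1 at q = 1, with intermediate q values in between.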

trainingreasoningalignment

Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics

Apr 28, 2026

Andre Herz, Daniel Durstewitz, Georgia Koppe

Teacher forcing trains RNNs on chaotic systems differently from how the models will actually be used. This mismatch can make models fit the data well statistically while predicting the actual dynamics poorly, a problem that worsens when multiple explanations exist for the data.

This paper reveals a fundamental mismatch between how teacher forcing (a common training technique) and marginal likelihood (the true objective) shape neural network optimization for chaotic systems.

trainingreasoning

Toward a Functional Geometric Algebra for Natural Language Semantics

Apr 28, 2026

James Pustejovsky

Geometric algebra expands n-dimensional embeddings into a 2^n-dimensional structure that can represent both base concepts and their interactions in a single unified framework, potentially solving long-standing problems in how neural networks compose meanings.

This paper proposes using geometric algebra (Clifford algebras) instead of conventional linear algebra as the mathematical foundation for representing word and sentence meanings in AI.

architecturereasoning

Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

Apr 28, 2026

Clinton Enwerem, Shreya Kalyanaraman, John S. Baras et al.

Using differentiable Gaussian mixtures to represent grasp uncertainty enables fast, gradient-based optimization for worst-case robustness—achieving 10x speedup over particle filters while maintaining or improving success rates.

This paper tackles the problem of robust robotic grasping when contact forces, sensing, and external disturbances are unpredictable. Instead of using slow particle-filter approaches, the authors represent uncertainty as a learnable Gaussian mixture and optimize for worst-case performance (CVaR) using gradient-based methods.
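
For readers unfamiliar with CVaR, a sample-based estimate looks like the sketch below. It only illustrates the worst-case objective; the paper optimizes it with gradients through a Gaussian mixture, which is not shown here. The function name and the `tail_fraction` parameter are assumptions for the example.

```python
import math


def empirical_cvar(losses, tail_fraction=0.1):
    """Average of the worst `tail_fraction` of sampled losses
    (at least one sample is always included)."""
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be in (0, 1]")
    k = max(1, math.ceil(tail_fraction * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k
```

With `tail_fraction=1.0` this reduces to the ordinary mean, so shrinking the fraction moves the objective from average-case toward worst-case performance.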

reasoningefficiencyagents

Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

Apr 27, 2026

Zhangyong Liang

When training neural networks on multiscale physics problems, gradient conflicts between different regimes can cause training failure—HRGrad fixes this by explicitly managing gradient directions to keep all objectives aligned during optimization.

This paper introduces HRGrad, a method for training neural networks on physics problems that span multiple scales—from microscopic to macroscopic behavior. The key challenge is that different scales pull the network in conflicting directions during training.

trainingreasoning

Learning to Think from Multiple Thinkers

Apr 27, 2026

Nirmit Joshi, Roey Magen, Nathan Srebro et al.

Learning from diverse reasoning traces is harder than learning from a single thinker, but you can overcome this by actively collecting reasoning data from many thinkers (logarithmic in target accuracy) combined with passive final-answer supervision.

This paper studies how AI models can learn from multiple people or programs solving the same problem in different ways (e.g., different math solutions or code implementations).

trainingreasoningdata

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

Apr 27, 2026

Zijian Guo, İlker Işık, H. M. Sabbir Ahmad et al.

Current specification-guided RL methods generalize poorly to new environments and complex tasks—this benchmark helps identify where they fail and guides development of more robust approaches.

SpecRLBench is a benchmark for testing how well reinforcement learning agents can follow formal task specifications (written in linear temporal logic) across different, unseen environments and robot types. The benchmark reveals that current methods struggle as tasks and environments become more complex, providing a structured way to develop better specification-guided RL systems.

evaluationreasoningagents

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Apr 27, 2026

Zhou Ziheng, Huacong Tang, Jinyuan Zhang et al.

Current AI agents struggle most with identifying knowledge gaps and formulating the right questions, not just answering them—a shift in bottleneck that suggests we need better ways to help AI systems recognize what they don't know.

This paper introduces SciCrafter, a Minecraft-based benchmark that tests whether AI agents can discover causal rules and apply them to solve increasingly complex problems.

reasoningagentsevaluation

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

Apr 24, 2026

Jinghong Chen, Jingbiao Mei, Guangyu Yang et al.

By treating retrieved documents as an ensemble with probabilistic weights updated during generation, BERAG avoids concatenating long contexts while improving both performance and interpretability—especially valuable for visual question answering where context length is expensive.

This paper proposes BERAG, a retrieval-augmented generation system that processes retrieved documents individually rather than concatenating them into one long context. Instead of treating all documents equally, BERAG uses Bayesian inference to weight documents based on how useful they are during answer generation, updating these weights token-by-token.
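
The token-by-token weighting can be sketched as Bayesian model averaging over documents. This is a simplified illustration under stated assumptions: each retrieved document yields its own next-token distribution, weights start from some prior, and the function names are hypothetical. The paper's actual update rule may differ.

```python
def mixture_distribution(weights, per_doc_dists):
    """Ensemble next-token distribution: the weighted average of the
    distributions obtained by conditioning on each document separately."""
    vocab_size = len(per_doc_dists[0])
    return [
        sum(w * dist[t] for w, dist in zip(weights, per_doc_dists))
        for t in range(vocab_size)
    ]


def update_doc_weights(weights, token_probs):
    """One Bayes step after emitting a token: the new weight of each
    document is proportional to its old weight times the probability
    that document assigned to the emitted token."""
    unnormalized = [w * p for w, p in zip(weights, token_probs)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("emitted token has zero probability under every document")
    return [u / total for u in unnormalized]


# Two documents, a two-token vocabulary, uniform prior weights.
weights = [0.5, 0.5]
dists = [[0.9, 0.1], [0.2, 0.8]]
mixed = mixture_distribution(weights, dists)
token = max(range(len(mixed)), key=mixed.__getitem__)
weights = update_doc_weights(weights, [d[token] for d in dists])
```

Because each document is scored in its own short context, the weights also act as a running record of which documents supported the answer, which is where the interpretability benefit comes from.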

multimodalreasoning

MathDuels: Evaluating LLMs as Problem Posers and Solvers

Apr 23, 2026

Zhiqiu Xu, Shibo Jin, Shreya Arya et al.

Models can be strong at solving math problems but weak at creating challenging ones—dual-role evaluation exposes capability gaps that single-role benchmarks miss, and the benchmark naturally scales with model strength.

MathDuels is a new way to test AI math abilities by having models both create and solve problems against each other. Unlike static benchmarks that get too easy, this self-play approach reveals hidden differences between models—some are great solvers but poor problem creators, and vice versa.

evaluationreasoning

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Apr 23, 2026

Bartosz Balis, Michal Orzechowski, Piotr Kica et al.

By separating LLM interpretation from deterministic workflow generation and encoding domain knowledge in reusable "Skills" documents, you can reliably automate the conversion of research questions into executable scientific workflows with minimal cost and overhead.

This paper presents an AI system that automatically converts research questions into executable scientific workflows. It uses three layers: an LLM to understand natural language, validated generators to create reproducible workflow specifications, and domain expert "Skills" documents that guide the process.

agentsapplicationsreasoning

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Apr 23, 2026

Chee Wei Tan, Yuchen Wang, Shangxin Guo

LLMs can be operationalized as strategic game agents that adapt their reasoning approach based on game type, and interactive platforms like Nemobot let developers actively experiment with and refine these agents in real time.

Nemobot is an interactive platform that uses large language models to create game-playing AI agents across different game types—from word games to strategy games. Users can build, customize, and deploy these agents while watching them learn and improve through reinforcement learning, human feedback, and self-critique.

agentsreasoningapplications

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Apr 23, 2026

Muhy Eddin Za'ter, Anna Van Boven, Bri-Mathias Hodge et al.

Deep learning can accelerate hard optimization problems by providing intelligent warm-start solutions that reduce the search space, rather than replacing traditional solvers entirely.

This paper uses a transformer neural network to predict electricity generator schedules 72 hours ahead, then refines those predictions with rule-based corrections and feeds them to a traditional optimization solver as a starting point.

applicationsreasoningefficiency

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Apr 23, 2026

Jun Wang, Ziyin Zhang, Rui Wang et al.

LLMs can be practical for production incident detection when paired with efficient indexing, noise filtering, and domain-specific routing—not just as standalone models, but as part of a multi-stage system that handles real-world scale and complexity.

TingIS is a production system that detects critical technical incidents from noisy customer reports in real-time at enterprise scale.

applicationsagentsreasoning

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Apr 23, 2026

Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti et al.

LLMs can model human moral reasoning but don't use that understanding in their own decisions—they follow abstract rules instead of social context, creating a dangerous misalignment between their internal understanding and external behavior.

This study tests whether large language models understand how human morality shifts based on relationships and context. Using a whistleblower dilemma scenario, researchers found that LLMs can predict how humans actually behave (favoring loyalty to friends), but their own decisions follow rigid fairness rules instead.

alignmentreasoningevaluation

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Apr 23, 2026

Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko et al.

Single images with high agreement among vision models show dramatically stronger alignment with language models, suggesting that representational convergence across modalities is driven by how unambiguously the environment constrains perception.

This paper reveals that how consistently different vision models represent individual images (intra-modal agreement) strongly predicts whether vision and language models will represent those same images similarly (cross-modal alignment).

multimodalevaluationreasoning

On the algebra of Koopman eigenfunctions and on some of their infinities

Apr 23, 2026

Zahra Monfared, Saksham Malhotra, Sekiya Hajime et al.

You can generate many more Koopman eigenfunctions from a few computed ones by treating them as an algebraic group, enabling better system representations from sparse or incomplete data.

This paper shows how to compute more eigenfunctions of the Koopman operator—a mathematical tool for analyzing dynamical systems—by using algebraic relationships between a small set of known eigenfunctions.

reasoningarchitecture

Probably Approximately Consensus: On the Learning Theory of Finding Common Ground

Apr 23, 2026

Carter Blair, Ben Armstrong, Shiri Alouf-Heffetz et al.

You can find practical consensus in large communities by treating it as a learning problem—identifying opinion intervals that maximize agreement while accounting for topic importance, with provable guarantees on how many user queries you actually need.

This paper tackles finding consensus in online communities by modeling agreement as an interval in opinion space. Rather than just looking at specific statements users provide, the method accounts for which topics matter most to the community.

reasoningevaluation

Quotient-Space Diffusion Models

Apr 23, 2026

Yixian Xu, Yusong Wang, Shengjie Luo et al.

Quotient-space diffusion models reduce learning complexity for symmetric generative tasks by formally accounting for group symmetries, enabling better molecular and protein structure generation without learning redundant symmetric variations.

This paper introduces a mathematical framework for diffusion models that accounts for symmetries in generative tasks, particularly molecular structure generation. By modeling distributions on quotient spaces (which treat symmetric objects as equivalent), the approach simplifies learning compared to existing symmetry-aware methods and guarantees correct sampling of target distributions.
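
As a rough illustration of what "treating symmetric objects as equivalent" means, the sketch below maps a 3D point cloud to its pairwise distance matrix, which is unchanged by rotations and translations. This is only one simple invariant picked for illustration; it is an assumption and not the construction used in the paper, and the helper names are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation built from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of QR
    if np.linalg.det(q) < 0:      # force det = +1 (a proper rotation)
        q[:, 0] = -q[:, 0]
    return q

def pairwise_distances(coords):
    """Rigid-motion-invariant description of a point cloud of shape (N, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
molecule = rng.normal(size=(5, 3))                    # hypothetical atom positions
rotation = random_rotation(rng)
shift = rng.normal(size=3)
moved = molecule @ rotation.T + shift                 # same shape, different pose

same_class = np.allclose(pairwise_distances(molecule), pairwise_distances(moved))
print("coordinates equal:", np.allclose(molecule, moved))
print("same point in the quotient (distance matrices equal):", same_class)
```

A model that works on the quotient only has to learn one representative per equivalence class instead of every rotated and shifted copy.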

architecture, reasoning, applications

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Apr 23, 2026

Ye Yu, Heming Liu, Haibo Jin et al.

Multi-agent LLM systems can achieve better reasoning by learning optimized latent communication channels instead of relying on fixed text-based protocols, with significant improvements on challenging benchmarks.

This paper introduces DiffMAS, a training framework that lets multiple AI agents learn how to communicate with each other through internal representations (like key-value caches) rather than text. By jointly optimizing both reasoning and communication during training, agents can better coordinate on complex tasks like math, science, and coding problems.

agents, training, reasoning

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

Apr 23, 2026

Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue et al.

You can build practical event detection systems using logical rules and constraint satisfaction that work efficiently on real timestamped data while handling conflicting inferences—demonstrated on medical records.

This paper presents a logic-based system for detecting high-level events from timestamped data, like inferring disease episodes from patient medical records. The system uses logical rules to identify events, handles conflicts between inferred events, and can run efficiently on real data while staying aligned with expert knowledge.
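
A minimal sketch of rule-based event inference, under assumed details: the rule below ("above-threshold readings at most 24 hours apart belong to the same episode"), the threshold, and the function name `infer_episodes` are all hypothetical and are not taken from the paper. Conflict handling between inferred events is left out to keep the example short.

```python
from datetime import datetime, timedelta

def infer_episodes(readings, threshold=38.0, max_gap=timedelta(hours=24)):
    """Group above-threshold readings into (start, end) episodes.

    Hypothetical rule: qualifying readings that are at most `max_gap`
    apart are merged into one episode.
    """
    hits = sorted(t for t, value in readings if value >= threshold)
    episodes = []
    for t in hits:
        if episodes and t - episodes[-1][1] <= max_gap:
            episodes[-1] = (episodes[-1][0], t)   # extend the current episode
        else:
            episodes.append((t, t))               # start a new episode
    return episodes

# Made-up temperature readings for illustration only.
readings = [
    (datetime(2026, 1, 1, 8), 38.5),
    (datetime(2026, 1, 1, 20), 39.0),
    (datetime(2026, 1, 2, 9), 37.0),
    (datetime(2026, 1, 2, 18), 38.2),
    (datetime(2026, 1, 5, 10), 38.9),
]

episodes = infer_episodes(readings)
for start, end in episodes:
    print(start.isoformat(), "->", end.isoformat())
```

With these readings the first three qualifying timestamps merge into one episode and the last one starts a second, because the gap before it exceeds 24 hours.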

reasoning, data, applications

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Apr 23, 2026

Guangxiang Zhao, Qilong Shi, Xusen Xiao et al.

By retrieving learned reasoning skills at inference time instead of reasoning from scratch, you can reduce token usage and improve accuracy—making LLM reasoning cheaper and faster for practical deployment.

This paper proposes storing reusable reasoning skills learned from past problem-solving attempts, then retrieving and applying them during inference to guide new reasoning. Instead of reasoning from scratch each time, the model recalls relevant skills to avoid redundant work and reach solutions faster. Tests on coding and math tasks show it uses fewer tokens while improving accuracy.
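
The retrieve-then-reason loop can be sketched roughly as follows. Everything here is an assumption made for illustration: the skill store contents, the word-overlap (Jaccard) scoring, and the names `retrieve` and `build_prompt` are hypothetical, and the paper may use a very different retriever.

```python
import re

# Hypothetical skill store: skill name -> keywords describing when it applies.
SKILLS = {
    "two-pointer scan": "sorted array pair sum target",
    "modular arithmetic": "remainder divisible modulo congruence",
    "binary search on answer": "minimum maximum feasible monotonic threshold",
}

def tokens(text):
    """Lowercase alphabetic words in the text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(problem, skills=SKILLS, k=1):
    """Return the k skill names whose keywords overlap most with the problem."""
    query = tokens(problem)

    def jaccard(description):
        desc = tokens(description)
        union = query | desc
        return len(query & desc) / len(union) if union else 0.0

    ranked = sorted(skills, key=lambda name: jaccard(skills[name]), reverse=True)
    return ranked[:k]

def build_prompt(problem):
    """Prepend retrieved skills so the model does not start from scratch."""
    hints = ", ".join(retrieve(problem))
    return f"Relevant skills: {hints}\nProblem: {problem}"

print(build_prompt("Find a pair in a sorted array whose sum equals the target"))
```

The token savings reported in the paper would come from the model following a recalled skill instead of rediscovering it; this sketch only shows the lookup step.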

reasoning, efficiency, training

Transferable Physics-Informed Representations via Closed-Form Head Adaptation

Apr 23, 2026

Jian Cheng Wong, Isaac Yin Chung Lai, Pao-Hsiung Chiu et al.

Physics-informed neural networks can be made dramatically faster and more generalizable by learning shared representations across PDE families and using closed-form adaptation, enabling accurate predictions on new problems without retraining.

This paper introduces Pi-PINN, a physics-informed neural network that learns reusable representations for solving different partial differential equations (PDEs). Instead of training a separate model for each PDE, Pi-PINN learns a shared representation and adapts to a new PDE by solving for the output head in closed form with the pseudoinverse (a least-squares solve), achieving 100 to 1000x faster predictions than standard PINNs.
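
The closed-form idea can be sketched on a small problem. The setup below is an assumption made for illustration and is not the paper's architecture: fixed random tanh features stand in for the learned shared representation, the PDE family is the 1D Poisson equation u''(x) = f(x) on [0, 1] with zero boundary values, and only the linear output head is solved for. Because the equation is linear in the head weights, adapting to a new forcing term f is a single matrix-vector product with a precomputed pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 80
a = rng.uniform(-5.0, 5.0, n_features)   # frozen "representation" weights (assumed)
b = rng.uniform(-5.0, 5.0, n_features)

def features(x):
    """Feature matrix tanh(a*x + b), shape (len(x), n_features)."""
    return np.tanh(np.outer(x, a) + b)

def features_xx(x):
    """Second derivative of each feature with respect to x."""
    t = np.tanh(np.outer(x, a) + b)
    return -2.0 * t * (1.0 - t ** 2) * a ** 2

x_in = np.linspace(0.0, 1.0, 200)        # collocation points
x_bc = np.array([0.0, 1.0])              # boundary points, u = 0 there
bc_weight = 100.0                        # emphasis on boundary rows (a tuning choice)

# Rows: PDE residual at collocation points, then weighted boundary conditions.
A = np.vstack([features_xx(x_in), bc_weight * features(x_bc)])
A_pinv = np.linalg.pinv(A, rcond=1e-12)  # computed once, reused for every new f

def fit_head(f):
    """Closed-form head weights for forcing term f (no gradient steps)."""
    rhs = np.concatenate([f(x_in), np.zeros(len(x_bc))])
    return A_pinv @ rhs

def max_error(w, exact):
    x = np.linspace(0.0, 1.0, 501)
    return float(np.max(np.abs(features(x) @ w - exact(x))))

# Two members of the PDE family, solved with the same pseudoinverse.
w_sine = fit_head(lambda x: -np.pi ** 2 * np.sin(np.pi * x))
w_poly = fit_head(lambda x: -2.0 * np.ones_like(x))
err_sine = max_error(w_sine, lambda x: np.sin(np.pi * x))
err_poly = max_error(w_poly, lambda x: x * (1.0 - x))
print(f"max error, sine forcing: {err_sine:.2e}")
print(f"max error, constant forcing: {err_poly:.2e}")
```

The speedup in this toy comes from the same place the summary describes: the expensive part (the representation and the pseudoinverse) is shared, and each new problem only needs a cheap linear solve.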

efficiency, reasoning

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Apr 22, 2026

Deqing Fu, Tianyi Zhou, Mikhail Belkin et al.

Language models naturally converge on similar periodic number representations across different architectures, but whether they learn features useful for arithmetic depends on training signals like text-number co-occurrence or multi-token addition problems.

Different language models (Transformers, RNNs, LSTMs) independently learn to represent numbers using periodic patterns with periods of 2, 5, and 10—a phenomenon called convergent evolution.
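
To show what a periodic number representation looks like, the sketch below hand-builds sine and cosine features with periods 2, 5, and 10. This is an illustrative construction only; it is not how the features were extracted from the models in the paper, and `periodic_features` is a hypothetical helper.

```python
import numpy as np

PERIODS = (2, 5, 10)

def periodic_features(n):
    """Cos and sin of 2*pi*n/T for each period T; the last axis has length 6."""
    n = np.asarray(n, dtype=float)
    angles = 2.0 * np.pi * n[..., None] / np.array(PERIODS)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

# Numbers with the same last digit get the same features.
same_last_digit = np.allclose(periodic_features(3), periodic_features(13))

# The period-2 cosine component encodes parity: +1 for even, -1 for odd.
parity_of_7 = int(round(float(periodic_features(7)[0])))

print("3 and 13 share features:", same_last_digit)
print("period-2 component for 7:", parity_of_7)
```

Features like these make properties such as parity and the final digit easy to read off, which helps explain why they would be useful for arithmetic.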

training, reasoning

Diagnosing CFG Interpretation in LLMs

Apr 22, 2026

Hanqi Li, Lu Chen, Kai Yu

LLMs can maintain surface-level syntax when following grammars but fail at deeper semantic understanding, especially with complex nested structures—a critical limitation for building reliable AI agents that need to follow formal specifications.

This paper tests whether large language models can correctly interpret and follow context-free grammars (formal rules for structured output). The researchers created RoboGrid, a testing framework that checks if LLMs produce syntactically correct, semantically meaningful outputs when given novel grammars.
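
For a sense of what a syntactic check against a context-free grammar involves, here is a small recursive-descent validator for a made-up grammar with nested blocks. The grammar, the command names, and the function names are assumptions for illustration and are not RoboGrid's actual grammars. It only checks syntax; whether the output means the right thing is the harder semantic question the paper focuses on.

```python
# Hypothetical toy grammar (whitespace-separated tokens):
#   program -> command*
#   command -> "move" | "left" | "right" | "repeat" NUMBER "{" program "}"

def parse_program(tokens, i=0):
    """Consume commands starting at tokens[i]; return the index after them."""
    while i < len(tokens) and tokens[i] != "}":
        i = parse_command(tokens, i)
    return i

def parse_command(tokens, i):
    """Consume one command starting at tokens[i]; return the next index."""
    tok = tokens[i]
    if tok in ("move", "left", "right"):
        return i + 1
    if tok == "repeat":
        header_ok = (
            i + 2 < len(tokens)
            and tokens[i + 1].isdigit()
            and tokens[i + 2] == "{"
        )
        if not header_ok:
            raise SyntaxError("malformed repeat header")
        j = parse_program(tokens, i + 3)       # nested block
        if j >= len(tokens) or tokens[j] != "}":
            raise SyntaxError("missing closing brace")
        return j + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def is_valid(text):
    """True if the whole text is derivable from the toy grammar."""
    tokens = text.split()
    try:
        end = parse_program(tokens)
    except SyntaxError:
        return False
    return end == len(tokens)

print(is_valid("repeat 2 { move repeat 3 { left } }"))
print(is_valid("repeat 2 { move"))
```

A validator like this can score model outputs for well-formedness at any nesting depth, separately from any check on behavior.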

evaluation, reasoning, agents

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Apr 22, 2026

Qiguang Chen, Chengyu Luan, Jiajun Wu et al.

Current vision-language models struggle with multi-image reasoning even on problems they might solve with single images—this benchmark shows that connecting information across multiple images is a major unsolved challenge.

OMIBench is a benchmark for testing how well vision-language models can solve Olympiad-level problems that require reasoning across multiple images. Unlike existing benchmarks that focus on single images, OMIBench tests whether models can connect evidence scattered across different images to solve complex problems in biology, chemistry, math, and physics.

evaluation, multimodal, reasoning

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Apr 22, 2026

Pavel Salovskii, Iuliia Gorshkova

Pairing LLMs with structured ontologies creates a verification layer that catches errors and enables long-term memory—turning language models into more reliable reasoning systems for planning and decision-making.

This paper proposes adding a structured knowledge graph layer to LLMs using RDF/OWL ontologies, enabling persistent memory and verifiable reasoning. The system automatically builds ontologies from documents and APIs, then combines graph-based reasoning with LLM inference to improve multi-step planning tasks and add formal validation to AI outputs.
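
The verification-layer idea can be sketched with plain Python dictionaries standing in for an RDF/OWL ontology. This is a deliberate simplification and an assumption: no RDF library is used, the entities and predicates are invented, and `verify` is a hypothetical helper that checks only domain and range constraints.

```python
# Hypothetical mini-ontology: entity types and predicate (domain, range) rules.
TYPES = {
    "aspirin": "Drug",
    "headache": "Condition",
    "alice": "Patient",
}
SCHEMA = {
    "treats": ("Drug", "Condition"),
    "diagnosed_with": ("Patient", "Condition"),
}

def verify(triple):
    """Check a (subject, predicate, object) claim against the schema.

    Returns (ok, reason). Unknown entities or predicates are rejected.
    """
    subject, predicate, obj = triple
    if predicate not in SCHEMA:
        return False, f"unknown predicate: {predicate}"
    domain, range_ = SCHEMA[predicate]
    if TYPES.get(subject) != domain:
        return False, f"subject {subject!r} is not a {domain}"
    if TYPES.get(obj) != range_:
        return False, f"object {obj!r} is not a {range_}"
    return True, "ok"

# Claims as an LLM might emit them (made-up examples).
claims = [
    ("aspirin", "treats", "headache"),
    ("headache", "treats", "aspirin"),
    ("alice", "prescribed", "aspirin"),
]
results = [verify(claim) for claim in claims]
for claim, (ok, reason) in zip(claims, results):
    print(claim, "->", "accepted" if ok else f"rejected ({reason})")
```

Only accepted claims would be written to persistent memory, so later planning steps build on facts that passed the check.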

reasoning, agents