Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Huawei Lin, Peng Li, Jie Song et al.
Treating AI agent skills as long-lived, testable assets with persistent memory—rather than disposable code—significantly improves task success rates and enables skills to transfer between agents and tasks.
This paper introduces MUSE-Autoskill, a framework that helps AI agents continuously improve by creating, storing, and refining reusable skills over time. Instead of treating skills as one-time solutions, the system manages them like software—organizing them in memory, testing them, and learning from experience to make them more reliable and effective across different tasks.
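For developers, the "skills managed like software" idea can be pictured with a small sketch. This is not the MUSE-Autoskill API: the `Skill` and `SkillRegistry` classes, the string-in/string-out skill signature, and the slugify example are all hypothetical, chosen only to show skills being stored, tested, and refined from failures.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Skill:
    """Hypothetical skill record: a callable plus its regression tests and usage stats."""
    name: str
    run: Callable[[str], str]
    tests: List[Tuple[str, str]] = field(default_factory=list)  # (input, expected)
    successes: int = 0
    failures: int = 0

    def passes_tests(self) -> bool:
        return all(self.run(x) == expected for x, expected in self.tests)


class SkillRegistry:
    """Hypothetical persistent store; here it is just an in-memory dict."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> bool:
        # Store the skill only if every regression test passes.
        if not skill.passes_tests():
            return False
        self._skills[skill.name] = skill
        return True

    def use(self, name: str, task_input: str, expected: str) -> str:
        skill = self._skills[name]
        output = skill.run(task_input)
        if output == expected:
            skill.successes += 1
        else:
            skill.failures += 1
            # A failure in the field becomes a new regression test.
            skill.tests.append((task_input, expected))
        return output

    def refine(self, name: str, new_run: Callable[[str], str]) -> bool:
        # Replace the implementation only if it passes all accumulated tests.
        old = self._skills[name]
        candidate = Skill(name, new_run, list(old.tests), old.successes, old.failures)
        return self.register(candidate)


registry = SkillRegistry()
slugify_v1 = Skill(
    "slugify",
    lambda s: s.lower().replace(" ", "-"),
    [("Hello World", "hello-world")],
)
registered = registry.register(slugify_v1)

# The first version mishandles surrounding whitespace, so this use is logged as a failure.
registry.use("slugify", " Padded Title ", "padded-title")

# The refined version must pass the original test and the newly recorded one.
refined = registry.refine("slugify", lambda s: s.strip().lower().replace(" ", "-"))
```

The design choice worth noticing is that a failed use is not discarded: it is turned into a test that any later version of the skill has to pass before it replaces the old one.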
Tamerlan Aghayev, Maxime Elkael, Michele Polese et al.
AI agents can handle complex domain-specific engineering when grounded in real-world validation and persistent knowledge. LLMs alone fail on RAN work because they hallucinate APIs and produce code that breaks on real hardware; agents with feedback loops and access to ground truth do not.
GENESIS is an AI agent framework that automates cellular network (6G RAN) development by converting specifications and problems into tested code solutions. It combines LLMs with real hardware validation and a persistent knowledge base to handle tasks like feature implementation, testing, and optimization that normally take months of manual engineering.
Jianshu Zhang, Yijiang Li, Huifeixin Chen et al.
Current VLMs struggle to genuinely understand spatial numbers—they can't reliably map between visual coordinates and numerical values, which is critical for embodied AI tasks like robotics that require precise spatial outputs.
This paper tests whether Vision-Language Models (VLMs) truly understand spatial numbers like coordinates and distances. Using SpaceNum, a framework with two tasks (converting numbers to spatial positions and vice versa), researchers find that VLMs largely fail to ground numbers in actual spatial meaning and rely on shallow visual cues rather than genuine spatial reasoning.
Beichen Zhang, Yuhong Liu, Jinsong Li et al.
Decoupling image editing from language understanding—and training the editor specifically for reasoning tasks—improves multimodal reasoning accuracy across diverse visual tasks without modifying the base model.
ETCHR is a specialized image editing model that helps multimodal AI systems reason better by transforming images based on questions. Unlike general image editors, it's trained to understand abstract reasoning tasks and produce clearer images for downstream analysis, improving performance across visual reasoning tasks by 4-5% without retraining the main AI model.
Ziyu Guo, Rain Liu, Xinyan Chen et al.
A single discrete token can serve two purposes, executing visual operations the way code would while also functioning as a learnable reasoning unit, which makes visual reasoning more efficient and trainable without architectural changes.
ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.
Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.
Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.
FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.
Tong Zheng, Haolin Liu, Chengsong Huang et al.
You can automatically discover better inference strategies for LLMs by treating strategy design as a search problem over execution traces rather than hand-designing heuristics, and the search is cheap to run at scale.
This paper presents AutoTTS, a framework that automatically discovers test-time scaling strategies for LLMs instead of relying on hand-crafted heuristics.
Shuhang Lin, Chuhao Zhou, Xiao Lin et al.
Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.
This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.
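If conformal prediction is new to you, the core recipe fits in a few lines. The sketch below is standard split conformal prediction, not the paper's Conformal Path Reasoning method: the nonconformity scores, the candidate answers, and the 80% coverage target are made-up assumptions, and the learned path scorer is replaced by hard-coded numbers.

```python
import math
from typing import Dict, List, Set


def conformal_threshold(calibration_scores: List[float], alpha: float) -> float:
    """Threshold that gives at least 1 - alpha coverage on exchangeable data.

    calibration_scores holds the nonconformity score of the TRUE answer for
    each held-out question (higher = the model found the truth less plausible).
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {answer for answer, score in candidate_scores.items() if score <= threshold}


# Hypothetical calibration scores from 10 held-out questions.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70]
threshold = conformal_threshold(calibration, alpha=0.2)

# Hypothetical candidate answers for a new question, with their scores.
candidates = {"Paris": 0.08, "Lyon": 0.45, "Marseille": 0.90}
answers = prediction_set(candidates, threshold)
```

With these numbers the threshold is the 9th smallest calibration score (0.55), so the prediction set keeps "Paris" and "Lyon" and drops "Marseille". A scorer that separates correct from incorrect answers more sharply produces smaller sets at the same coverage level, which is the lever the summary describes.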
Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.
Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.
HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.
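To make "a policy that picks a module and a duration" concrete, here is a toy sketch on a 1D periodic grid. It is not HyCOP's implementation: the finite-difference modules, the grid spacing, the coefficients, and especially `rule_policy` (a hand-written rule standing in for the learned policy) are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = List[float]


def advect(u: State, c: float, dt: float, dx: float) -> State:
    # First-order upwind step for u_t + c * u_x = 0 with c > 0, periodic boundary
    # (u[i - 1] wraps around to the last cell when i == 0).
    return [u[i] - c * dt / dx * (u[i] - u[i - 1]) for i in range(len(u))]


def diffuse(u: State, nu: float, dt: float, dx: float) -> State:
    # Explicit central-difference step for u_t = nu * u_xx, periodic boundary.
    n = len(u)
    return [
        u[i] + nu * dt / dx**2 * (u[(i + 1) % n] - 2 * u[i] + u[i - 1])
        for i in range(n)
    ]


def rule_policy(u: State) -> Tuple[str, float]:
    """Hand-written stand-in for a learned policy: returns (module name, dt)."""
    steepest = max(abs(u[i] - u[i - 1]) for i in range(len(u)))
    # Smooth sharp fronts first, otherwise transport the profile.
    return ("diffuse", 0.01) if steepest > 0.5 else ("advect", 0.02)


def rollout(
    u: State,
    steps: int,
    policy: Callable[[State], Tuple[str, float]],
    modules: Dict[str, Callable[[State, float], State]],
) -> Tuple[State, List[str]]:
    trace = []
    for _ in range(steps):
        name, dt = policy(u)       # state-dependent choice of module and step size
        u = modules[name](u, dt)   # apply the chosen module
        trace.append(name)
    return u, trace


# Assumed parameters, chosen so both explicit schemes stay stable.
DX = 0.1
modules = {
    "advect": lambda u, dt: advect(u, c=1.0, dt=dt, dx=DX),
    "diffuse": lambda u, dt: diffuse(u, nu=0.1, dt=dt, dx=DX),
}

initial = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # square pulse
final, trace = rollout(initial, steps=20, policy=rule_policy, modules=modules)
```

Because each module is an ordinary function behind a shared interface, swapping one (for example a different boundary treatment) means replacing a dictionary entry, while the policy and the other modules stay untouched. The `trace` list also records which module ran at each step, which is where the interpretability comes from.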
Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.
LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.
This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.
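The difference between "right answer" and "followed the procedure" is easy to see in code. The sketch below is an assumption about how such a check could look, not the paper's benchmark: the step format, the `reference_trace` and `score_trace` helpers, and the sample model trace are hypothetical.

```python
import operator
from typing import Dict, List, Optional, Tuple

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
Step = Tuple[str, int]


def reference_trace(start: int, steps: List[Step]) -> List[int]:
    """Execute the procedure exactly and record the value after every step."""
    value, trace = start, []
    for op, arg in steps:
        value = OPS[op](value, arg)
        trace.append(value)
    return trace


def score_trace(model_trace: List[int], ref: List[int]) -> Dict[str, object]:
    """Compare a model's reported intermediate values against the reference."""
    first_error: Optional[int] = None
    for i, expected in enumerate(ref):
        if i >= len(model_trace) or model_trace[i] != expected:
            first_error = i
            break
    return {
        "final_correct": bool(model_trace) and model_trace[-1] == ref[-1],
        "faithful": model_trace == ref,
        "first_error": first_error,
    }


steps = [("add", 3), ("mul", 2), ("sub", 4)]
ref = reference_trace(5, steps)     # values after each step: 8, 16, 12

# Hypothetical model output: wrong middle step, yet the final value matches.
model_trace = [8, 14, 12]
result = score_trace(model_trace, ref)
```

Here `result` reports a correct final answer, an unfaithful execution, and a first error at step index 1. Scoring only the final answer would count this run as a success, which is how strong benchmark numbers can hide step-level failures.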
Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.
World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.
This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).
Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.
This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.