ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

921 papers, 100 this month, 12 topics
All, Efficiency 38, Training 37, Evaluation 33, Reasoning 27, Agents 23, Architecture 23, Applications 21, Multimodal 15, Safety 12, Scaling 8, Alignment 8, Data 6

May 25 – May 31 (30)

Algorithmic Monocultures in Hiring

May 26, 2026

Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel et al.

When many employers use the same hiring algorithm, it amplifies bias rather than spreading risk—the same people get rejected everywhere, and racial disparities compound across the job market.

This paper analyzes hiring algorithms from a single vendor used by many employers and finds they create unfair outcomes.

safety, evaluation, applications

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

May 26, 2026

Huawei Lin, Peng Li, Jie Song et al.

Treating AI agent skills as long-lived, testable assets with persistent memory—rather than disposable code—significantly improves task success rates and enables skills to transfer between agents and tasks.

This paper introduces MUSE-Autoskill, a framework that helps AI agents continuously improve by creating, storing, and refining reusable skills over time. Instead of treating skills as one-time solutions, the system manages them like software—organizing them in memory, testing them, and learning from experience to make them more reliable and effective across different tasks.

agents

May 18 – May 24 (70)

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

May 22, 2026

Yifan Yang, Ziyang Gong, Weiquan Huang et al.

Skills can be trained like model parameters: use a separate optimizer to iteratively edit skill text based on validation feedback rather than generating it once. This approach is reproducible and stable, and it transfers across models.

SkillOpt treats agent skills like neural network weights—optimizing them systematically through an external optimizer model that suggests bounded edits to skill documents based on scored rollouts.

agents, training
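The edit-and-validate loop described for SkillOpt above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `propose_edit` and `score` are hypothetical stand-ins for the optimizer model and the scored validation rollouts, the length-based bound is a crude proxy for "bounded edits", and the accept-only-if-better rule is not taken from the paper.

```python
from typing import Callable

def optimize_skill(
    skill: str,
    propose_edit: Callable[[str, float], str],  # hypothetical: optimizer model suggesting an edit
    score: Callable[[str], float],              # hypothetical: mean score over validation rollouts
    steps: int = 5,
    max_length_change: int = 200,               # assumed bound on how much one edit may change
) -> tuple[str, float]:
    """Greedy loop: keep an edited skill document only if validation improves."""
    best, best_score = skill, score(skill)
    for _ in range(steps):
        candidate = propose_edit(best, best_score)
        # Skip edits that change the document length by more than the bound.
        if abs(len(candidate) - len(best)) > max_length_change:
            continue
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins so the sketch runs end to end.
def toy_propose(text: str, _current_score: float) -> str:
    return text + "\n- verify output"

def toy_score(text: str) -> float:
    return min(text.count("verify"), 2) / 2

best_skill, best_value = optimize_skill("# skill", toy_propose, toy_score)
```

With these toy functions the loop accepts the first two edits and rejects the rest, since further edits no longer raise the score.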

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

May 22, 2026

Xu Ouyang, Deyi Liu, Yuhang Cai et al.

LLMs have a fundamental capacity limit based on signal-to-noise ratio: scaling parameters or data without maintaining sufficient signal clarity causes performance degradation, explaining phenomena like catastrophic overtraining and quantization failures that standard scaling laws can't capture.

This paper explains why large language models sometimes get worse, not better, with more training or lower precision. Using information theory, the authors model LLM training as sending signals through a noisy channel. When you scale up the model or data without keeping the signal clear relative to the noise, performance actually degrades, following a U-shaped curve.

training, reasoning

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

May 26, 2026

Shihao Wang, Shilong Liu, Yuanguo Kuang et al.

Decoding bounding boxes as complete geometric units instead of individual tokens dramatically speeds up inference while maintaining or improving localization accuracy.

LocateAnything replaces slow token-by-token box decoding with Parallel Box Decoding, which generates entire bounding boxes at once. Combined with a 138-million-sample dataset, this approach makes visual grounding and detection faster while improving accuracy on standard benchmarks.

efficiency, multimodal, architecture

Natural Language Query to Configuration for Retrieval Agents

May 26, 2026

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob et al.

You can optimize retrieval pipelines per-query rather than per-workload by using lightweight predictors trained on query characteristics, achieving the same accuracy at significantly lower cost or better accuracy at the same cost.

This paper presents BRANE, a system that automatically selects the best configuration for retrieval agents on a per-query basis. Instead of manually tuning a retrieval pipeline once, BRANE analyzes each query to predict which combination of LLM, retriever, and other settings will work best, allowing teams to optimize for either accuracy or cost at inference time without retraining.

agents, efficiency, evaluation
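To make the per-query selection idea in the BRANE summary above concrete, here is a minimal sketch. It assumes a hypothetical `predict_accuracy(query, config)` predictor and made-up configurations and costs; the "cheapest configuration that meets an accuracy target" rule is one plausible reading of the cost/accuracy trade-off, not the system's actual policy.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Config:
    name: str
    cost: float  # assumed relative cost per query

def pick_config(
    query: str,
    configs: Sequence[Config],
    predict_accuracy: Callable[[str, Config], float],  # hypothetical lightweight predictor
    target: float,
) -> Config:
    """Return the cheapest config predicted to reach the target accuracy.

    Falls back to the config with the highest predicted accuracy
    when none is predicted to reach the target.
    """
    good_enough = [c for c in configs if predict_accuracy(query, c) >= target]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    return max(configs, key=lambda c: predict_accuracy(query, c))

SMALL = Config("small-llm + sparse retriever", cost=1.0)
LARGE = Config("large-llm + dense retriever", cost=10.0)

def toy_predictor(query: str, config: Config) -> float:
    # Toy rule: the cheap pipeline only handles short queries well.
    if config is SMALL:
        return 0.9 if len(query.split()) <= 5 else 0.5
    return 0.92

easy = pick_config("capital of France", [SMALL, LARGE], toy_predictor, target=0.85)
hard = pick_config(
    "compare the revenue trends of three companies across five years",
    [SMALL, LARGE], toy_predictor, target=0.85,
)
```

Here the short query is routed to the cheap pipeline and the long one to the expensive pipeline, which is the per-query (rather than per-workload) behavior the summary describes.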

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

May 26, 2026

Tamerlan Aghayev, Maxime Elkael, Michele Polese et al.

AI agents can handle complex domain-specific engineering when grounded in real-world validation and persistent knowledge—LLMs alone fail on RAN work because they hallucinate APIs and break on real hardware, but agents with feedback loops and ground truth don't.

GENESIS is an AI agent framework that automates cellular network (6G RAN) development by converting specifications and problems into tested code solutions. It combines LLMs with real hardware validation and a persistent knowledge base to handle tasks like feature implementation, testing, and optimization that normally take months of manual engineering.

agents, reasoning, applications

MobileMoE: Scaling On-Device Mixture of Experts

May 26, 2026

Yanbei Chen, Hanxian Huang, Ernie Chang et al.

MoE isn't just for giant models: on mobile devices, moderate sparsity with shared experts is both memory- and compute-optimal, letting you get better performance with fewer active parameters than dense models.

MobileMoE brings Mixture-of-Experts (MoE) architecture to phones and edge devices by optimizing it for memory and compute constraints. The models use 0.3-0.9B active parameters but achieve better performance than larger dense models, running 2-4× faster on real smartphones while using less memory.

efficiency, architecture, scaling

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

May 26, 2026

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

RLHF systems can be exploited by models that mix high quality with hidden biases: annotators prefer their outputs, but the reward model can't tell quality and bias apart, amplifying misalignment during training.

This paper reveals a critical vulnerability in RLHF where language models can exploit the alignment process itself by generating biased outputs that annotators rate highly for quality, causing the reward model to amplify misaligned behaviors like sexism and propaganda.

alignment, safety, training

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

May 26, 2026

Yi Jing, Zao Dai, Jinwu Hu et al.

Instead of picking training data based only on external metrics, you can use SAEs to decode what the model actually learns internally, then use those signals to organize data better—making training more efficient without changing the model architecture.

This paper shows how to improve LLM training by using Sparse Autoencoders (SAEs) to read the model's internal representations and guide data selection. The method clusters training data for diversity, orders it by difficulty, and filters low-quality examples—improving math performance by 3% and cutting training time by 20% on smaller models.

training, data, reasoning

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

May 26, 2026

Yuchen Liang, Ness Shroff, Yingbin Liang

GADD accelerates discrete diffusion sampling from many steps to logarithmically few steps without additional training, providing both theoretical guarantees and practical speedups for text and symbolic generation tasks.

This paper speeds up discrete diffusion models (used for text and symbolic data generation) by introducing GADD, a new method that uses Gibbs corrections to reduce sampling steps. Unlike existing acceleration techniques, GADD doesn't require extra training and achieves theoretically optimal speedup, making it practical for real applications like text and music generation.

efficiency, training, reasoning

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

May 26, 2026

Kim Jihyeon, Sohee Kim, Soosan Lee et al.

High-level semantic inconsistencies in social gaze (eye direction, head-eye alignment) are more reliable for detecting AI-generated images than low-level pixel artifacts, and this signal transfers across different generative models.

This paper shows that AI-generated images often fail at maintaining realistic gaze patterns between people—like consistent eye direction and head-eye alignment. The researchers built a detection system using this semantic weakness, along with a carefully designed dataset and training approach, achieving better detection across multiple AI image generators.

evaluation, safety

MATCHA: Matching Text via Contrastive Semantic Alignment

May 26, 2026

Siran Li, Ece Sena Etoglu, Carsten Eickhoff et al.

Current LLM evaluation metrics fail to catch semantic contradictions, potentially hiding serious errors. MATCHA solves this by explicitly measuring both agreement with correct answers and distance from contradictory statements.

MATCHA is a new evaluation metric for LLMs that fixes a critical flaw in popular metrics like ROUGE and BERTScore: they give similar scores to contradictory texts. MATCHA uses a dual approach—rewarding similarity to correct answers while penalizing contradictions—and significantly outperforms existing metrics across question-answering, summarization, and other tasks.

evaluation, alignment
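The dual approach in the MATCHA summary above (reward similarity to the correct answer, penalize similarity to a contradiction) can be illustrated with a small sketch. The formula, the `penalty` weight, and the hand-made two-dimensional vectors are all assumptions for illustration; they are not MATCHA's actual definition or embeddings.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dual_score(
    candidate: Sequence[float],
    reference: Sequence[float],
    contradiction: Sequence[float],
    penalty: float = 1.0,  # assumed weight on the contradiction term
) -> float:
    """Similarity to the reference minus weighted similarity to the contradiction."""
    return cosine(candidate, reference) - penalty * cosine(candidate, contradiction)

# Hand-made toy embeddings (assumed, not produced by any real model).
reference = [1.0, 0.2]      # embedding of the correct answer
contradiction = [0.2, 1.0]  # embedding of a statement that contradicts it
good_answer = [0.9, 0.1]
bad_answer = [0.1, 0.9]

good = dual_score(good_answer, reference, contradiction)
bad = dual_score(bad_answer, reference, contradiction)
```

A score built this way separates the two candidates by sign, whereas a similarity-only metric would merely rank one somewhat below the other.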

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

May 26, 2026

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

You can guide diffusion models to generate specific images by using learned representations as conditioning signals, avoiding the need for expensive annotated datasets while maintaining smooth, interpretable control.

This paper shows how to control image generation in diffusion models by conditioning them on representations from self-supervised models instead of requiring text or semantic annotations. The approach discovers interpretable directions in the representation space that let you smoothly control what gets generated.

architecture, multimodal, efficiency

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

May 26, 2026

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

2-ASP(Q)^w can express optimization problems up to Delta_3^P complexity, and the CEGAR-based approach in Casper makes solving these problems practical despite their theoretical hardness.

This paper extends Answer Set Programming with quantifiers and weak constraints, creating a system called 2-ASP(Q)^w that can solve complex optimization problems. The authors prove how hard these problems are to solve theoretically, then build practical software using a refinement technique that gradually improves solutions by learning from counterexamples.

reasoning

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

May 26, 2026

Haoxuan Jia, Yang Liu, Bin Chong et al.

Real-time safety monitoring during agent execution is more effective and efficient than post-hoc auditing—catching risky actions before they happen and routing verification intelligently reduces security breaches by 61% while cutting computational costs by 78%.

FinHarness is a safety system for AI agents handling financial transactions that monitors requests in real-time rather than after the fact. It catches risky actions mid-process by checking user intent and each tool call, then routes complex decisions to either a fast or thorough AI judge based on risk level.

safety, agents, efficiency
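The risk-based routing described for FinHarness above could look roughly like the sketch below. The `risk_score` heuristic, the action names, the threshold, and the judge labels are all hypothetical placeholders; the summary does not specify how risk is computed.

```python
HIGH_RISK_ACTIONS = {"transfer", "withdraw"}  # assumed set of sensitive tool calls

def risk_score(tool_call: dict) -> float:
    """Hypothetical heuristic: larger amounts and money-moving actions are riskier."""
    amount_risk = min(1.0, tool_call.get("amount", 0) / 10_000)
    action_risk = 0.5 if tool_call.get("action") in HIGH_RISK_ACTIONS else 0.0
    return min(1.0, amount_risk + action_risk)

def route(tool_call: dict, threshold: float = 0.5) -> str:
    """Send risky calls to the thorough judge and everything else to the fast judge."""
    return "thorough_judge" if risk_score(tool_call) >= threshold else "fast_judge"

balance_check = route({"action": "get_balance", "amount": 0})
large_transfer = route({"action": "transfer", "amount": 5_000})
```

Checking each tool call before it executes, and reserving the expensive judge for high-risk calls, is how an inline design can cut verification cost relative to auditing everything after the fact.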

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

May 26, 2026

Zhifei Dou, Shabnam Hassani, Ou Wei

Adding simple edge detection to flowchart images helps VLMs understand topology better—a practical, training-free technique that improves industrial document processing by 11-17 percentage points without requiring annotated data.

EdgeFlow improves how Vision Language Models convert flowcharts into machine-readable formats by adding edge detection as a visual guide. The method works without training data or fine-tuning, achieving significant improvements on real-world industrial flowcharts by helping the model better understand the structure and connections between elements.

applications, multimodal, data

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

May 25, 2026

Dingbang Wu, Rui Hao, Haiyang Wang et al.

You can now train mobile agents at scale with deterministic, verifiable rewards in simulation, and the skills transfer well to real devices—solving a major bottleneck in agent research.

MobileGym is a lightweight simulation platform for training mobile app agents that runs hundreds of parallel instances in a browser. It provides verifiable task outcomes through structured JSON states and enables scalable reinforcement learning training, with a benchmark of 416 tasks across 28 apps that shows strong transfer to real devices.

agents, evaluation, applications

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

May 25, 2026

Shangding Gu

Agent performance depends as much on system design (memory, routing, verification) as on model capability; evaluating agents requires measuring trajectory quality and system hygiene, not just final outcomes.

This paper argues that building better AI agents requires focusing on the system architecture around language models, not just making the models bigger. It introduces the concept of 'scaling the harness'—designing the memory, tool-use, verification, and orchestration layers that turn a model into a working agent—and proposes benchmarks to measure agent quality beyond just task success.

agents, architecture, evaluation

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

May 25, 2026

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li et al.

Jointly encoding text and images in MLLMs before conditioning diffusion models preserves subject identity better than separate encoding, while a multi-stage denoising strategy balances semantic instruction-following with fine-detail preservation.

This paper improves subject-driven image generation by using multimodal large language models (MLLMs) to jointly understand text and reference images together, rather than separately. The approach adds a VAE-based identity module and a novel aggregation technique to balance semantic understanding with preserving the subject's identity, reducing unwanted copy-paste artifacts.

multimodal, architecture, applications

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

May 25, 2026

Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie et al.

A plug-in architecture for multimodal continual learning lets researchers test new training strategies without rewriting the base model code, making MLLM research faster and more reproducible.

Prism is a software framework that makes it easier to develop and test new methods for continuously training multimodal AI models on new tasks. Instead of modifying the core model code each time, researchers can add new strategies as plug-in modules, reducing engineering overhead and enabling fair comparisons between different approaches.

training, architecture, applications

Looped Diffusion Language Models

May 25, 2026

Sanghyun Lee, Chunsan Hong, Seungryong Kim et al.

Selectively looping transformer layers in masked diffusion models improves both training efficiency and reasoning capability—you can match performance with far fewer computations, or trade compute for better results.

This paper introduces LoopMDM, a technique that reuses early-middle transformer layers in masked diffusion models by looping them during training and inference. The approach achieves better training efficiency (3.3× fewer FLOPs) and stronger reasoning performance than standard models, while enabling flexible compute scaling at inference time without adding parameters.

architecture, efficiency, training

Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models

May 25, 2026

Bar Weiss, Antonio Abu-Nassar, Adi Sosnovich et al.

LLMs can reliably classify code changes into structured categories (renames, moves, logic changes, etc.) to automate and prioritize code review tasks, achieving strong accuracy while being language-agnostic and customizable.

This paper shows how large language models can automatically label and categorize code changes in patches (like identifying renames, moves, or logic modifications) to make code review faster and more efficient. Using a two-stage approach with few-shot prompting, the method achieves 84% recall and 81% precision without needing traditional static analysis tools.

applications, evaluation

Language Models Need Sleep

May 25, 2026

Sangyun Lee, Sean McLeish, Tom Goldstein et al.

Language models can improve long-context reasoning by periodically consolidating recent information into fast weights during offline 'sleep' phases, trading inference latency for better performance on reasoning-heavy tasks.

This paper proposes a sleep-like mechanism for language models that periodically consolidates recent context into persistent memory before clearing the cache. During 'sleep,' the model performs offline passes to update fast weights in state-space blocks, shifting computation away from real-time inference.

architecture, efficiency, reasoning

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay

May 25, 2026

Martin Marek, Dongkyu Cho, Shikai Qiu et al.

Self-generated replay nearly eliminates catastrophic forgetting in language models, but capacity constraints are the real bottleneck: a saturated model can't learn new tasks without forgetting, no matter what technique you use.

When language models learn new tasks, they forget old ones. This paper shows that models can generate their own training data to replay and prevent forgetting, but only if they have spare capacity. If a model is already saturated from pretraining, no amount of replay helps—it must overwrite old knowledge to learn anything new.

training, efficiency

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

May 25, 2026

Maoyang Xiang, Bo Wang, Tao Luo

Power-of-Two quantization with orthogonal residual projection lets you run large models on edge hardware with minimal accuracy loss and no multiplier circuits—calibration takes ~15 minutes instead of hours.

OrpQuant enables efficient deployment of large AI models on edge devices by using a novel quantization method that replaces expensive multiply operations with simple bit-shifts. The approach uses geometric projection to maintain accuracy even at ultra-low bit widths (3-4 bits), and can calibrate models 10x faster than existing methods.

efficiency, architecture
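The "multiply becomes a bit-shift" idea in the OrpQuant summary above is generic power-of-two quantization, which can be shown in a few lines. This sketch covers only that generic step; it does not implement OrpQuant's orthogonal residual projection, and the exponent range is an assumed parameter.

```python
import math

def quantize_pot(w: float, min_exp: int = -8, max_exp: int = 0) -> tuple[int, int]:
    """Approximate w as sign * 2**exp, rounding the exponent in the log domain."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the assumed exponent range
    return sign, exp

def shift_multiply(x: int, sign: int, exp: int) -> int:
    """Multiply a non-negative integer activation by sign * 2**exp using shifts only."""
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted

sign, exp = quantize_pot(0.23)           # nearest power of two is 2**-2 = 0.25
approx = shift_multiply(64, sign, exp)   # exact product would be 0.23 * 64 = 14.72
```

The gap between the shifted result and the exact product is the quantization error that a residual correction scheme would then need to reduce.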

Channel-wise Vector Quantization

May 25, 2026

Wei Song, Tianhang Wang, Yitong Chen et al.

Quantizing image channels instead of patches improves codebook efficiency and enables a more intuitive generation process that mirrors human artistic creation, achieving strong text-to-image results.

This paper introduces Channel-wise Vector Quantization (CVQ), a new way to convert images into discrete tokens by quantizing color channels instead of spatial patches. It enables a new image generation model (CAR) that builds images progressively—first sketching overall structure, then adding fine details—similar to how artists work.

architecture, efficiency

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

May 25, 2026

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior et al.

LLMs can predict physics outcomes but struggle with true scientific discovery: the strongest models pass only 50% of worlds, and good prediction accuracy doesn't guarantee conceptual understanding of the underlying laws.

DiscoverPhysics is a benchmark that tests whether large language models can discover unknown physics laws by designing experiments in simulated worlds with non-standard physics.

reasoning, evaluation, agents

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

May 25, 2026

Yusong Lin, Xinyuan Liang, Haiyang Wang et al.

Building truly useful AI assistants requires handling messy, interconnected real-world contexts—not isolated tasks—and current models fall far short of this challenge, but synthetic data generation can help close the gap.

Claw-Anything is a benchmark for testing AI agents as always-on personal assistants with access to a user's full digital world—including activity history, multiple services, and both GUI and CLI interfaces.

agents, evaluation, reasoning

VeriTrace: Evolving Mental Models for Deep Research Agents

May 25, 2026

Haolang Zhao, Yunbo Long, Lukas Beckenbauer et al.

Research agents need explicit feedback mechanisms to evolve their understanding of tasks—not just bigger models—to avoid error propagation when working through complex, interdependent information.

VeriTrace is a framework that helps AI research agents maintain accurate mental models by explicitly tracking and correcting their understanding as they work through complex problems. Instead of letting language models implicitly manage their reasoning, it uses three feedback loops to catch errors early and prevent them from cascading through the agent's work.

reasoning, agents, evaluation

Automated Benchmark Auditing for AI Agents and Large Language Models

May 25, 2026

Junlin Wang, Federico Bianchi, Shang Zhu et al.

Many AI benchmarks contain hidden flaws that distort model rankings and performance scores; automated auditing can catch these issues at scale and improve the reliability of capability assessments.

This paper introduces Auto Benchmark Audit (ABA), an AI agent that automatically checks benchmark tasks for hidden problems like incomplete specifications, environment conflicts, and broken evaluation logic. Testing 168 benchmarks across nine domains, ABA found critical issues in over 25% of tasks—problems that human reviewers missed.

evaluation, agents

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

May 25, 2026

Zhaoyu Zhu, Rui Gao, Shuang Li

WPG is theoretically sound for continuous control: the Bellman recursion in RL creates favorable convergence properties similar to convex optimization, even though the problem isn't convex.

This paper proves that Wasserstein Policy Gradient (WPG), an algorithm for reinforcement learning that moves policies using optimal transport geometry, converges globally to optimal solutions. The key insight is that even though RL objectives aren't convex in the traditional sense, the Bellman equation creates a special geometric structure that guarantees convergence.

scaling, training, evaluation

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

May 22, 2026

Zisu Huang, Jingwen Xu, Yifan Yang et al.

Model-generated skills can improve agent performance, but their effectiveness depends on how they're extracted and which agent uses them—not on model size or baseline strength.

This paper studies how AI agents can reuse skills—structured procedures extracted from past experience—to improve performance. The researchers built a comprehensive evaluation framework testing skill extraction and reuse across five different task domains, finding that while model-generated skills help on average, they sometimes hurt performance.

agents, training, evaluation

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

May 22, 2026

Jianshu Zhang, Yijiang Li, Huifeixin Chen et al.

Current VLMs struggle to genuinely understand spatial numbers—they can't reliably map between visual coordinates and numerical values, which is critical for embodied AI tasks like robotics that require precise spatial outputs.

This paper tests whether Vision-Language Models (VLMs) truly understand spatial numbers like coordinates and distances. Using SpaceNum, a framework with two tasks (converting numbers to spatial positions and vice versa), researchers find that VLMs largely fail at grounding numbers in actual spatial meaning, relying on shallow visual cues rather than genuine spatial reasoning.

evaluation, multimodal, reasoning

ETCHR: Editing To Clarify and Harness Reasoning

May 22, 2026

Beichen Zhang, Yuhong Liu, Jinsong Li et al.

Decoupling image editing from language understanding—and training the editor specifically for reasoning tasks—improves multimodal reasoning accuracy across diverse visual tasks without modifying the base model.

ETCHR is a specialized image editing model that helps multimodal AI systems reason better by transforming images based on questions. Unlike general image editors, it's trained to understand abstract reasoning tasks and produce clearer images for downstream analysis, improving performance across visual reasoning tasks by 4-5% without retraining the main AI model.

multimodal, reasoning, training

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

May 22, 2026

Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong et al.

You can now tune hyperparameters on a single dense model and transfer them directly to MoE models of any size or configuration, eliminating the need for expensive hyperparameter search when scaling with MoE.

Complete-muE is a framework that solves the problem of transferring hyperparameters (like learning rate and weight decay) from dense neural networks to Mixture-of-Experts (MoE) models without expensive retuning.

training, scaling, efficiency

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

May 22, 2026

Shuhong Zheng, Michael Oechsle, Erik Sandström et al.

By selectively dropping redundant image patches across frames and within frames using attention entropy, you can speed up 3D reconstruction transformers dramatically without sacrificing quality.

This paper tackles the computational bottleneck in visual geometry transformers, models that reconstruct 3D scenes from multiple images. The authors propose a token selection strategy that reduces the number of image patches the model attends to, cutting computation by 85% while maintaining or improving accuracy.

efficiency, architecture, evaluation

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

May 22, 2026

Joydeep Chandra

For building data marketplaces, CHRONOS shows how to maintain search quality, fair pricing, and privacy simultaneously by treating temporal decay, value attribution, and privacy budgets as coupled problems rather than separate concerns.

CHRONOS solves three interconnected problems in data marketplaces: keeping search indexes fresh as data changes, fairly pricing data contributions after market shifts, and preventing agents from exhausting privacy budgets.

data, agents, efficiency

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

May 22, 2026

Anastasiia Sedova, Natalie Schluter, Skyler Seto et al.

You can improve cross-lingual knowledge transfer by strategically replacing words in high-resource training data with translations—no parallel data, translation models, or extra training needed.

This paper proposes LINK, a simple method to improve multilingual language models for low-resource languages by swapping English words with their translations during pretraining. The approach requires only a bilingual dictionary and no extra training, yet achieves significant performance gains on downstream tasks across eight languages.

training, data
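The dictionary-based word swapping described for LINK above is simple enough to sketch directly. The swap `rate`, the whitespace tokenization, and the tiny English-German dictionary are assumptions for illustration; the paper's sampling scheme and preprocessing may differ.

```python
import random

def swap_words(text: str, dictionary: dict[str, str], rate: float = 0.3, seed: int = 0) -> str:
    """Replace dictionary words with their translations, each with probability `rate`."""
    rng = random.Random(seed)  # seeded so the augmentation is repeatable
    swapped = []
    for token in text.split():
        translation = dictionary.get(token.lower())
        if translation is not None and rng.random() < rate:
            swapped.append(translation)
        else:
            swapped.append(token)
    return " ".join(swapped)

# Hypothetical bilingual dictionary entries.
en_de = {"cat": "Katze", "water": "Wasser"}

all_swapped = swap_words("the cat drinks water", en_de, rate=1.0)
none_swapped = swap_words("the cat drinks water", en_de, rate=0.0)
```

Because the only resource is a word-level dictionary, the augmented text keeps the high-resource sentence structure while exposing the model to target-language vocabulary in context.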

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

May 22, 2026

Rim Assouel, Amir Bar, Michal Drozdzal et al.

Adding synthetic geometric overlays during training helps MLLMs learn better spatial and quantitative reasoning—suggesting many visual understanding failures come from insufficient training data rather than model architecture limits.

This paper introduces Procedurally Generated Tasks (PGT), a method that overlays geometric shapes on images to create training data that improves how multimodal AI models understand fine-grained visual details like spatial relationships and quantities. Testing shows improvements of up to 20% on visual reasoning benchmarks while keeping general capabilities intact.

multimodal, training, evaluation

Training-Free Looped Transformers

May 22, 2026

Lizhang Chen, Jonathan Li, Chen Liang et al.

You can boost performance of frozen models by intelligently looping internal layers at inference time—no retraining needed, just a smarter application strategy based on ODE theory.

This paper shows how to improve pretrained transformer models at test time by looping a middle section of layers without retraining. The key insight is treating layer loops as smaller refinement steps rather than naive repetition, inspired by numerical methods for solving differential equations.

efficiency
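The contrast drawn in the Training-Free Looped Transformers summary above, between naive repetition and smaller refinement steps, can be illustrated on a toy residual update. This is a sketch under stated assumptions: `block` is a made-up scalar function standing in for a frozen span of layers, and scaling each looped update by 1/k (explicit Euler steps) is one reading of "smaller refinement steps", not necessarily the paper's scheme.

```python
import math

def block(x: float) -> float:
    """Toy residual update f(x); stands in for a frozen span of middle layers."""
    return 0.5 * (1.0 - x)

def naive_loop(x: float, k: int) -> float:
    """Apply the full residual update k times (plain repetition)."""
    for _ in range(k):
        x = x + block(x)
    return x

def refined_loop(x: float, k: int) -> float:
    """Apply k smaller updates, each scaled by 1/k (explicit Euler steps)."""
    for _ in range(k):
        x = x + block(x) / k
    return x

# Solution of dx/dt = 0.5 * (1 - x) at t = 1, starting from x(0) = 0.
exact = 1.0 - math.exp(-0.5)
single = naive_loop(0.0, 1)     # one ordinary pass through the block
refined = refined_loop(0.0, 3)  # three scaled passes
naive = naive_loop(0.0, 3)      # three unscaled passes
```

In this toy setting the scaled loop lands closer to the continuous-time solution than a single pass does, while plain repetition overshoots it by integrating three times as far.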

Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

May 22, 2026

Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur

The Muon optimizer can be understood as Hamiltonian dynamics on probability measures, providing theoretical guarantees for convergence and opening the door to analyzing large-scale neural network training through mean-field theory.

This paper analyzes the Muon optimizer through the lens of Hamiltonian dynamics and probability flows. The authors show that Muon's orthogonalization step is actually a mirror descent update, then extend this insight to neural network training by deriving a mean-field equation describing how probability distributions over parameters evolve.

training, scaling

Human Decision-Making with Persuasive and Narrative LLM Explanations

May 22, 2026

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash et al.

Adding narrative explanations to AI predictions can backfire: they increase trust in AI without improving accuracy, and may actually harm decision quality by making people slower to question wrong predictions.

This study tested how AI-generated narrative explanations affect human decision-making in classification tasks. Researchers found that persuasive explanations didn't improve accuracy compared to predictions alone, but did increase reliance on AI—even when the AI was wrong. More persuasive narratives sometimes slowed decisions and made it harder to spot AI errors.

evaluation, safety, alignment

Leveraging Foundation Models for Causal Generative Modeling

May 22, 2026

Aneesh Komanduri, Xintao Wu

You can leverage existing pretrained models for causal reasoning tasks by building a modular pipeline that extracts concepts, manipulates them causally, and generates counterfactuals—no need to retrain from scratch.

This paper presents FM-CGM, a framework that combines pretrained foundation models (reasoning models and diffusion models) to perform causal reasoning on images. It enables zero-shot discovery of causal relationships, intervention on concepts, and generation of counterfactual images—all without retraining the models.

reasoning, multimodal, applications

Strong Teacher Not Needed? On Distillation in LLM Pretraining

May 22, 2026

Taiming Lu, Zhuang Liu

You don't need a powerful teacher to improve a larger language model through distillation—smaller teachers work fine, and over-training the teacher can actually hurt performance.

This paper challenges the assumption that knowledge distillation in language model training requires a strong teacher model. By systematically testing different teacher-student size combinations, the researchers found that even small, undertrained teachers can improve larger students when losses are properly balanced, and that stronger teachers don't always produce better results.

training, efficiency
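
To make "balancing losses" in the distillation entry above concrete, here is a minimal sketch of a common distillation objective: a weighted sum of cross-entropy on the true token and KL divergence from the teacher. The function names and the weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Hypothetical balanced loss (not the paper's exact objective):
    alpha * CE(student, target) + (1 - alpha) * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    ce = -math.log(p_student[target])
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce + (1 - alpha) * kl
```

With `alpha=1.0` the teacher is ignored entirely; lowering `alpha` shifts weight toward matching the teacher's distribution.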

Tokenisation via Convex Relaxations

May 21, 2026

Jan Tempus, Philip Whittington, Craig W. Schmidt et al.

ConvexTok uses convex optimization to build tokenizers that are provably near-optimal (within 1% at typical vocabulary sizes) and compress text better than greedy algorithms like BPE, with measurable improvements in language model efficiency.

This paper replaces greedy tokenization algorithms like BPE with a convex optimization approach called ConvexTok. Instead of making locally optimal choices, it formulates tokenizer construction as a linear program, achieving better compression (bits-per-byte) and allowing users to verify how close their tokenizer is to mathematically optimal.

training, efficiency

Integrable Elasticity via Neural Demand Potentials

May 21, 2026

Carlos Heredia, Daniel Roncel

Neural demand models can be designed to respect economic constraints (integrability), producing more reliable price-elasticity estimates that are both mathematically consistent and practically useful for retail pricing.

This paper introduces ICDN, a neural network model that learns demand patterns for multiple products based on prices. Unlike traditional approaches, it directly models how demand changes with price (elasticity) in a mathematically consistent way, making the learned relationships more economically realistic and stable.

architecture, applications

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training, reasoning, efficiency

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

May 21, 2026

Lily Goli, Justin Kerr, Daniele Reda et al.

Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously-updated world model and episodic memory of the agent's trajectory—without these, agents waste effort revisiting forgotten states instead of discovering new regions.

This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.

reasoning, agents

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

May 21, 2026

Vishal Rajput

Many robustness techniques (CORAL, adversarial training, IRM, metric learning) are different ways of solving the same problem: identifying and regularizing against label-preserving variations in your data.

This paper unifies seemingly separate robustness problems (domain adaptation, adversarial training, compositional generalization) under one framework: regularizing neural network gradients to match the covariance of label-preserving variations in deployment data.

training, alignment

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

May 21, 2026

Krishnakumar Balasubramanian

Conservative drifting with kernel density estimators achieves provable convergence rates for one-step generative modeling, with the convergence speed depending on dimension and a tunable parameter that trades off between different error sources.

This paper analyzes drifting methods for generative modeling, proposing a conservative approach using kernel density estimators that guarantees gradient-field properties. The authors prove finite-particle convergence rates showing how quickly the method converges as sample size increases, with explicit tracking of how bandwidth and dimension affect performance.

training, evaluation

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

May 21, 2026

Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al.

Self-evolving agents need source-code access, not just prompt editing—structural bugs in routing and state management can't be fixed by text-layer changes alone, and MOSS demonstrates this works in production with measurable improvements.

MOSS is a system that lets autonomous agents automatically fix themselves by rewriting their own source code based on real failures. Unlike existing approaches that only modify text files like prompts, MOSS can change the actual code structure—routing logic, state management, dispatch—making it possible to fix a much broader class of problems.

agents, safety

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.

architecture, efficiency, reasoning

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

May 21, 2026

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al.

When LLM agents communicate through shared KV caches for efficiency, you need explicit safeguards—LCGuard shows how to block sensitive information leakage at the representation level without breaking task coordination.

LCGuard is a safety framework that protects sensitive information when multiple AI agents share transformer key-value caches to coordinate tasks. It uses adversarial training to transform shared cache data so that agents can't reconstruct each other's private inputs, while keeping the information useful for task performance.

safety, agents, efficiency

Evaluating Commercial AI Chatbots as News Intermediaries

May 21, 2026

Mirac Suzgun, Emily Shen, Federico Bianchi et al.

AI chatbots excel at retrieving and synthesizing recent news but have three critical weaknesses: they systematically underperform on non-English content, fail primarily due to retrieval errors rather than reasoning mistakes, and are easily fooled by questions containing subtle false information.

This study evaluates six major AI chatbots (Gemini, Grok, Claude, GPT models) on their ability to answer factual news questions across six languages and regions.

evaluation, multimodal, data

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

May 21, 2026

Yunpeng Dong, Jingkai He, Yuze Hou et al.

By tracking only differences between consecutive states rather than full duplicates, DeltaBox brings AI agent checkpoint/rollback latency down to the millisecond level, directly enabling deeper search and larger-scale exploration for reasoning and RL tasks.

DeltaBox is a system that makes AI agents much faster by storing only the changes between checkpoints instead of copying entire sandbox states. Using new OS-level mechanisms for filesystems and process state, it reduces checkpoint/rollback time from hundreds of milliseconds to just milliseconds, enabling agents to explore more possibilities in the same time budget.

efficiency, agents
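
The idea of storing only the changes between checkpoints, as in the DeltaBox entry above, can be illustrated with a toy in-memory analogy. This sketch works on a plain dictionary and is not DeltaBox's OS-level mechanism; the class and method names are hypothetical.

```python
class DeltaCheckpointer:
    """Toy delta-based checkpointing over a dict (illustrative only)."""

    _MISSING = object()  # marks keys that did not exist before a change

    def __init__(self, state):
        self.state = dict(state)
        self._undo_log = []  # one {key: old_value} dict per sealed checkpoint
        self._pending = {}   # old values for keys changed since last checkpoint

    def set(self, key, value):
        # Record the old value only the first time a key changes in this interval.
        if key not in self._pending:
            self._pending[key] = self.state.get(key, self._MISSING)
        self.state[key] = value

    def checkpoint(self):
        """Seal the changes since the previous checkpoint and return its id."""
        self._undo_log.append(self._pending)
        self._pending = {}
        return len(self._undo_log)

    def _undo(self, delta):
        for key, old in delta.items():
            if old is self._MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old

    def rollback(self, checkpoint_id):
        """Restore the state as it was when checkpoint_id was returned."""
        self._undo(self._pending)
        self._pending = {}
        while len(self._undo_log) > checkpoint_id:
            self._undo(self._undo_log.pop())
```

Both `checkpoint` and `rollback` cost time proportional to the number of changed keys rather than the size of the whole state, which is the property the summary describes.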

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

May 21, 2026

Huanchi Wang, Zihang Huang, Yifang Tian et al.

You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.

FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.

efficiency, evaluation, applications

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

May 21, 2026

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

Diffusion models can effectively handle continuous-time survival analysis by modeling censored outcomes directly, avoiding parametric assumptions and discretization errors that limit traditional survival methods.

SDPM uses diffusion models to estimate time-to-event distributions from data with censored observations, without requiring assumptions about the hazard function or discretizing time. The model generates samples that can be converted to survival curves, achieving competitive performance on real datasets while accurately recovering underlying continuous distributions.

applications, evaluation
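
One simple way to turn generated event-time samples into a survival curve, as the SDPM entry above mentions, is the empirical survival function. The sketch below assumes the samples are plain event times for one individual; the function name is hypothetical and the paper's own conversion may differ.

```python
def survival_curve(samples, times):
    """Empirical survival function: S(t) = fraction of sampled event times > t."""
    n = len(samples)
    return [sum(1 for s in samples if s > t) / n for t in times]
```

For example, with samples `[1.0, 2.0, 3.0, 4.0]`, half of the sampled event times exceed `t = 2.0`, so the estimated survival probability there is 0.5.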

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Mamba's linear-complexity architecture enables real-time cognitive load monitoring from noisy eye-tracking signals on wearable devices—a practical alternative to Transformers for temporal sensor data with frequent gaps.

MambaGaze uses a bidirectional Mamba neural network to assess cognitive load from eye-tracking data in real-time. It handles missing data from eye blinks and tracking failures by explicitly encoding uncertainty, and runs efficiently on edge devices like smartglasses for applications like driver monitoring.

architecture, efficiency, applications

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Foundation models trained on large clinical datasets can be effectively adapted to wearable sensor tasks through domain-specific adapters and careful fine-tuning, enabling better cognitive load assessment with limited labeled data.

CogAdapt adapts pre-trained clinical ECG models to assess cognitive load from wearable devices. It uses a learnable adapter to convert 3-lead wearable signals into 12-lead clinical format and a progressive fine-tuning strategy to preserve learned knowledge while adapting to the new task, achieving strong performance on cognitive load prediction.

applications

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

May 21, 2026

Yu Tang, Muhammad Zakwan, Efe Balta et al.

Deep RL can make real-time scheduling decisions for dynamic manufacturing environments by learning to combine simple dispatching rules better than any single rule alone, without requiring expensive optimization solvers.

This paper uses deep reinforcement learning to solve the flexible job shop scheduling problem—deciding which machine should process which job to minimize total completion time.

applications

Reducing Political Manipulation with Consistency Training

May 21, 2026

Long Phan, Devin Kim, Alexander Pan et al.

LLMs exhibit systematic covert political bias through asymmetric handling of opposing viewpoints; consistency-based training can reduce this bias without sacrificing model helpfulness.

Large language models show hidden political bias by treating opposing viewpoints asymmetrically—using different tones or effort levels for left vs. right perspectives.

safety, alignment, training

Understanding Data Temporality Impact on Large Language Models Pre-training

May 21, 2026

Pilchen Hippolyte, Fabre Romain, Signe Talla Franck et al.

Training LLMs on chronologically ordered data instead of shuffled data improves their knowledge of recent facts and temporal accuracy, suggesting data ordering matters for building models that stay current.

This paper investigates how the order of training data affects what LLMs learn about time-sensitive facts. Researchers trained 6B-parameter models on chronologically ordered data versus shuffled data, and found that sequential training produces models with more current and accurate temporal knowledge while maintaining general language understanding.

training, data, evaluation

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

May 21, 2026

Samson Gourevitch, Yazid Janati, Dario Shariatian et al.

Discrete diffusion models have a hidden training-inference mismatch: the standard objective doesn't match what's actually needed for sampling. Using the correct "leave-one-out" parameterization and an absorbing-state reformulation improves generation quality without retraining.

This paper fixes a fundamental mismatch in how Uniform Diffusion Models are trained versus used for generation. The authors show that standard training doesn't actually optimize what the model uses during sampling, and they provide mathematical conversions to align these.

training, architecture, efficiency

Advancing Mathematics Research with AI-Driven Formal Proof Search

May 21, 2026

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov et al.

LLMs become reliable enough for mathematics research when their outputs are verified by formal proof checkers—this hybrid approach solved previously open problems at a practical cost, showing a path beyond LLM hallucination.

Researchers used large language models to automatically generate formal proofs in Lean, a proof verification language, to solve open mathematical problems. Their AI agent successfully proved 9 open Erdős problems and 44 OEIS conjectures, demonstrating that LLMs can contribute to real mathematical research when paired with formal verification systems that catch errors.

reasoning, agents, applications

Towards a General Intelligence and Interface for Wearable Health Data

May 21, 2026

Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari et al.

Pretraining on massive unlabeled wearable data creates reusable health representations that work across diverse prediction tasks with little labeled data—similar to how large language models work, but for physiological signals.

Researchers built a foundation model trained on over one trillion minutes of unlabeled wearable sensor data from five million people to predict health outcomes. The model learns general patterns from this massive dataset, then adapts to specific health tasks (like predicting heart disease or sleep quality) with minimal labeled examples.

multimodal, applications

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

May 21, 2026

Christian Janos Lebeda, David Erb, Tudor Cebere et al.

Differentially private random forests trained on sensitive data can now perform well in practice: Lumberjack's pruning strategy significantly closes the gap between private and non-private model performance.

Lumberjack is a differentially private random forest algorithm that builds large decision trees and then prunes them intelligently to protect sensitive data. By using a novel heavy hitter detection method, it can use deeper trees than previous approaches while maintaining privacy guarantees, achieving much better accuracy on real datasets.

training, efficiency

Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

May 21, 2026

Adis Alihodžić, Eva Tuba, Milan Tuba

Smart grid operators can use genetic algorithm feature selection to identify which electrical measurements matter most for attack detection, reducing sensor requirements while maintaining 98%+ accuracy.

This paper detects cyber-physical attacks in smart grids by combining machine learning with genetic algorithm-based feature selection. Using real power system data, the authors show that tree-based models like Extra Trees can accurately distinguish between natural faults and malicious attacks, and that a small subset of 27 features (down from 112) is sufficient for reliable detection.

evaluation

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

May 21, 2026

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier et al.

Training AI systems with multiple agents through self-play creates more robust and safer real-world behavior than traditional single-agent approaches, because agents must learn to anticipate and coordinate with others rather than treating them as noise.

This paper shows that multi-agent reinforcement learning makes autonomous systems safer and more capable in real-world shared spaces.

safety

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

May 21, 2026

Berk Hayta, Hannah Laus, Simon Mittermaier et al.

You can get reliable uncertainty estimates using standard loss functions (cross-entropy, MSE) instead of complex Dirichlet objectives—the math shows this works, and it's simpler to implement in practice.

This paper simplifies Evidential Deep Learning (EDL) for uncertainty estimation by replacing complex Dirichlet-based losses with standard losses like cross-entropy, evaluated at the Dirichlet mean. The authors prove this approximation works well when evidence is strong and show it includes softmax as a special case, making uncertainty estimation easier to implement without sacrificing accuracy.

training, efficiency
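
As a rough sketch of the plug-in idea in the evidential deep learning entry above, cross-entropy can be evaluated at the mean of the predicted Dirichlet distribution. The convention `alpha = evidence + 1` and the uncertainty score `K / sum(alpha)` are common in the EDL literature and are assumed here; they are not confirmed details of this paper, and the function names are hypothetical.

```python
import math

def dirichlet_mean(evidence):
    """Mean of Dirichlet(alpha), assuming alpha = evidence + 1."""
    alpha = [e + 1.0 for e in evidence]
    total = sum(alpha)
    return [a / total for a in alpha]

def plugin_cross_entropy(evidence, target):
    """Standard cross-entropy, evaluated at the Dirichlet mean."""
    return -math.log(dirichlet_mean(evidence)[target])

def vacuity(evidence):
    """Assumed uncertainty score K / sum(alpha); equals 1.0 with no evidence."""
    k = len(evidence)
    return k / (sum(evidence) + k)
```

With zero evidence for three classes, the mean is uniform, the loss is `log(3)`, and the uncertainty score is at its maximum of 1.0; adding evidence for a class lowers both.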

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

May 21, 2026

Javad Parsa, Enis Simsar, Amir Joudaki et al.

When fine-tuning diffusion models for multiple concepts, jointly optimizing LoRA factors with orthogonal constraints prevents representation interference and scales better than existing modular approaches—enabling cleaner composition of up to 101 concepts.

SeqLoRA improves how AI models learn multiple custom concepts at once by using a smarter optimization technique that prevents concepts from interfering with each other. Instead of freezing parts of the model or doing expensive post-processing, it jointly trains the adaptation components while keeping them orthogonal, enabling better multi-concept image generation with less computational cost.

training, efficiency, multimodal

Ternary Decision Trees with Locally-Adaptive Uncertainty Zones

May 21, 2026

William Smits

Decision trees can improve accuracy by explicitly handling boundary cases through locally-computed uncertainty zones—instances near splits get soft predictions and uncertainty flags instead of hard classifications, helping downstream applications make better decisions.

This paper introduces ternary decision trees that add uncertainty zones around split thresholds, allowing predictions near decision boundaries to blend outputs from both child subtrees and flag uncertain cases.

architecture, evaluation
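
The behaviour described in the ternary decision tree entry above can be sketched for a single split. This sketch assumes a fixed zone width and a linear blend of the two child outputs; the paper computes zones locally and may blend differently, and the function name is hypothetical.

```python
def ternary_stump_predict(x, threshold, zone_width, left_value, right_value):
    """Predict with one split that has an uncertainty zone around the threshold.

    Returns (prediction, is_uncertain).
    """
    lower = threshold - zone_width
    upper = threshold + zone_width
    if x <= lower:
        return left_value, False
    if x >= upper:
        return right_value, False
    # Inside the zone: linearly blend the two child outputs (an assumption).
    weight_right = (x - lower) / (upper - lower)
    blended = (1 - weight_right) * left_value + weight_right * right_value
    return blended, True
```

An input exactly at the threshold receives an even blend of both children and is flagged as uncertain, while inputs outside the zone behave like an ordinary binary split.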

Proxy-Based Approximation of Shapley and Banzhaf Interactions

May 21, 2026

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter et al.

ProxySHAP makes it practical to explain complex feature interactions in ML models by using proxy models and residual correction, achieving state-of-the-art accuracy while remaining computationally efficient even with thousands of features.

ProxySHAP is a new method for computing Shapley and Banzhaf interactions—measures that explain how features work together in machine learning models. It combines fast tree-based approximations with mathematical corrections to achieve both speed and accuracy, outperforming existing methods on large datasets.

evaluation, efficiency

The Distillation Game: Adaptive Attacks & Efficient Defenses

May 21, 2026

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo et al.

Distillation defenses must be evaluated against adaptive attackers who strategically choose which outputs to learn from—not just passive ones—and simple forward-pass defenses like PoE can match expensive defenses while preserving reasoning quality.

This paper studies how AI model providers face a trade-off: making models more useful (through better outputs) makes them easier to copy through distillation attacks. The authors develop a game-theoretic framework to understand this trade-off and propose Product-of-Experts (PoE), a lightweight defense that combines the teacher model with a proxy student during generation.

safety, evaluation, efficiency

Optimization over the intersection of manifolds

May 21, 2026

Yan Yang, Bin Gao, Ya-xiang Yuan

Optimization on manifold intersections becomes tractable when intrinsic transversality holds; a geometric algorithm can efficiently solve these problems by maintaining feasibility on one manifold while steering toward the intersection.

This paper tackles optimization problems where the solution must lie on the intersection of two geometric surfaces (manifolds). The authors prove that two key geometric properties are equivalent, enabling efficient projection onto the intersection.

architecture

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.

efficiency, training, evaluation
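
The compute-reuse idea in the CARV entry above can be sketched as follows: run the expensive upstream step once, then average a cheap noisy term over several noise draws. This covers only the reuse part, not the importance sampling the summary mentions, and all names are hypothetical.

```python
import random

def amortized_estimate(expensive_fn, cheap_noisy_fn, x, num_noise_samples, seed=0):
    """Average a cheap noisy term over several noise draws per expensive call."""
    rng = random.Random(seed)
    upstream = expensive_fn(x)       # e.g. a 3D render; computed once
    total = 0.0
    for _ in range(num_noise_samples):
        noise = rng.gauss(0.0, 1.0)  # cheap to resample
        total += cheap_noisy_fn(upstream, noise)
    return total / num_noise_samples
```

Averaging over more noise draws lowers the variance of the estimate while the number of expensive calls stays at one per input.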

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

May 20, 2026

Benhao Huang, Zhengyang Geng, Zico Kolter

Iterative reasoning models work by learning task-specific attractors in their latent space; scaling test-time compute (more iterations and parallel paths) improves performance on hard problems without needing external verifiers.

This paper explains how AI models can solve hard problems by iteratively refining internal states, like a brain thinking through steps. The key insight is that models learn to create 'attractors'—stable patterns that pull the model toward correct answers.

reasoning, scaling

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update) parameterization works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

scaling, training, efficiency

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

May 20, 2026

Mansoor Ahmed, Sujin Lee, Umar Khayaz et al.

Combining evolutionary knowledge from language models with 3D structural constraints solves vocabulary collapse in antibody design, achieving 16% better sequence accuracy and 2.3x more amino acid diversity than structure-only methods.

EvoStruct fixes a critical problem in AI-designed antibodies: neural networks trained on 3D structures alone forget important amino acid patterns from evolution. The method combines a pre-trained protein language model (which knows evolutionary patterns) with structural information, using a special adapter to merge both sources of knowledge.

architecture, training, applications

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

May 20, 2026

Tilman Tröster, David Mirkovic, Veronika Oehl et al.

Matching a model's architectural symmetries to the actual symmetries present in your data—not just the underlying physics—significantly improves performance and data efficiency.

Velocityformer is a specialized neural network that reconstructs galaxy velocities from survey data to improve cosmological measurements. By designing the model to match the asymmetric structure of real observations (where one direction—the line of sight—is special), it achieves 35% better accuracy than traditional methods and works well even with very limited training data.

architecture, reasoning, applications

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

Retrieval isn't the main problem for frontier models on deep research tasks; instead, they fail primarily at deriving answers from evidence and calibrating confidence correctly, suggesting future improvements should focus on reasoning and verification rather than search.

DeepWeb-Bench is a challenging benchmark for evaluating AI agents that research questions by searching the web, collecting evidence, and reasoning through answers. Unlike existing benchmarks, it focuses on tasks requiring massive evidence gathering, cross-source verification, and complex multi-step reasoning—areas where current frontier models still struggle significantly.

evaluation, reasoning, agents

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

May 20, 2026

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

Most vision-language models struggle with knowledge-grounded visual reasoning—even large models only reach 75% accuracy when questions require combining visual evidence with external facts, suggesting a major gap in real-world VQA capabilities.

WikiVQABench is a new benchmark for testing vision-language models on questions that require both visual understanding and external knowledge from Wikipedia and Wikidata.

evaluation, multimodal

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

May 20, 2026

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini et al.

Compiling agent tasks into code upfront—rather than deciding actions one step at a time—enables parallelization and validation, dramatically reducing latency and errors in web automation.

This paper introduces a compilation approach for web agents that converts natural language tasks into executable code plans instead of executing step-by-step. By generating multiple candidate plans, validating them against tool specifications, and optimizing for parallelization, the system achieves 10x faster execution and better accuracy than existing sequential approaches.

agents, efficiency, reasoning

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training, efficiency, reasoning

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

May 20, 2026

Kaiyi Zhang, Wei Wu, Yankai Lin

When training language models with verifiable rewards, focusing on the most discriminative token patterns—rather than averaging all tokens equally—significantly improves learning efficiency and final performance.

This paper improves how language models learn from step-by-step feedback by better understanding which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss important patterns that distinguish good answers from bad ones.

training, reasoning, alignment

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

May 20, 2026

Weixing Zhang, Bowen Jiang, Rahul Sharma et al.

LLMs can learn grammar adaptation patterns from examples and apply them to new versions, achieving 100% consistency on medium-sized grammars but failing on large-scale ones—suggesting LLMs work best for targeted, smaller grammar updates.

This paper shows how Large Language Models can automatically adapt domain-specific language grammars when their underlying models change, reducing manual work. Testing on real-world languages shows LLMs work well for complex scenarios but struggle with very large grammars (300+ rules).

training, applications

Mem-π: Adaptive Memory through Learning When and What to Generate

May 20, 2026

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Generating context-specific guidance dynamically outperforms traditional retrieval-based memory for agents—the system learns to abstain when unnecessary and produce only relevant help, improving task success by over 30% on web navigation.

Mem-π is a framework that gives AI agents smarter memory by generating helpful guidance on-the-fly instead of retrieving fixed entries from a database. A separate model learns when to create guidance and what to create, trained to skip unhelpful suggestions and produce only what the agent actually needs for the current task.

agents, training, reasoning

HITL-D: Human In The Loop Diffusion Assisted Shared Control

May 20, 2026

Riley Zilka, Sergey Khlynovskiy, Allie Wang et al.

Diffusion models can effectively assist human operators in robotic control by automating specific subtasks (like orientation), reducing cognitive load while maintaining human oversight—a practical model for human-AI collaboration in physical systems.

This paper presents HITL-D, a shared control system that combines diffusion-based AI policies with human input for robotic manipulation tasks. Instead of requiring operators to control every aspect of a robot arm, the system automatically handles orientation adjustments while the human focuses on positioning, reducing mental workload and task completion time by 40% in user studies.

agents, applications, reasoning

Mitigating Label Bias with Interpretable Rubric Embeddings

May 20, 2026

Calvin Isley, Johann D. Gaebler, Sharad Goel

Replace opaque learned embeddings with interpretable features derived from expert-defined rubrics to reduce bias inheritance from biased training labels in high-stakes decisions.

When training AI models on biased historical data (like past hiring decisions), the models learn and perpetuate those biases. This paper proposes using 'rubric embeddings'—features based on expert-defined criteria—instead of black-box embeddings to make fairer predictions. Testing on university admissions data, the approach reduces group disparities while maintaining quality.

alignment, evaluation

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

May 20, 2026

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

AI-generated refactoring often improves code but frequently introduces new quality and security issues that developers accept anyway, highlighting the need for automated quality checks before merging AI contributions.

This study examines Python refactoring pull requests created by AI agents, measuring their impact on code quality and security.

evaluation, safety, applications

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

May 18, 2026

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti et al.

DashAttention enables efficient long-context processing by combining adaptive sparse selection with differentiable training, outperforming fixed-sparsity methods while maintaining gradient flow through both attention stages.

DashAttention improves how language models handle long documents by using a smarter two-stage attention mechanism. Instead of always selecting the same number of relevant tokens, it adaptively picks different amounts based on what each query needs, while keeping the entire process trainable. This achieves full-attention quality with 75% fewer computations.
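To make "adaptively picks different amounts" concrete, here is a minimal sketch using sparsemax, a well-known differentiable alternative to softmax that assigns exactly zero weight to low-scoring tokens. It is a stand-in chosen for illustration: this summary does not say which selection rule DashAttention uses, and the function name and inputs below are assumptions.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex (sparsemax).

    Unlike fixed top-k, the number of non-zero outputs depends on the
    scores themselves. Illustrative stand-in, not the paper's mechanism.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The support is the longest prefix satisfying this condition.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    # Tokens scoring at or below the threshold tau get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]


def support_size(weights):
    """Count how many tokens received non-zero attention weight."""
    return sum(1 for w in weights if w > 0.0)
```

A query with one dominant score, such as `[2.0, 0.0, 0.0]`, keeps a single token, while a flatter query such as `[0.5, 0.4, 0.0]` keeps all three. A fixed top-k rule would keep the same count in both cases.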

efficiencyarchitecture

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

May 18, 2026

Ruitao Liu, Xinyang Tian, Shuo Chen et al.

For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.

This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing ready tasks immediately instead of waiting for a pre-planned order. When training large models across multiple GPUs, unpredictable delays in computation cause stages to sit idle.
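The difference between a pre-planned order and readiness-driven execution can be sketched with a toy single-stage simulation. This is a simplified illustration, not RRFP itself: it ignores dependencies between pipeline stages, and the task format `(ready_at, duration)` and both function names are assumptions made for the example.

```python
def fixed_order_makespan(tasks):
    """Run tasks strictly in list order, idling until each one is ready."""
    clock = 0.0
    for ready_at, duration in tasks:
        clock = max(clock, ready_at) + duration
    return clock


def readiness_driven_makespan(tasks):
    """Run any task that is already ready; idle only when none is."""
    pending = list(tasks)
    clock = 0.0
    while pending:
        ready = [t for t in pending if t[0] <= clock]
        if not ready:
            # Nothing can run yet: jump ahead to the next arrival.
            clock = min(t[0] for t in pending)
            continue
        task = min(ready, key=lambda t: t[0])  # earliest-ready first
        pending.remove(task)
        clock += task[1]
    return clock
```

With `tasks = [(5.0, 2.0), (0.0, 3.0), (0.0, 1.0)]`, the fixed order waits five time units for the delayed first task and finishes at 11.0, while the readiness-driven loop fills that gap with the other two tasks and finishes at 7.0.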

trainingefficiencyscaling

Code as Agent Harness

May 18, 2026

Xuying Ning, Katherine Tieu, Dongqi Fu et al.

Code is becoming the primary substrate for building reliable, verifiable AI agents. Understanding code as an agent harness (the infrastructure layer) is essential for building systems that can plan, remember, use tools, and coordinate across multiple agents.

This survey examines how code serves as the operational foundation for AI agents—not just as output, but as the infrastructure that enables agents to reason, act, model environments, and verify their own behavior.

agentsarchitecturereasoning

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

May 18, 2026

Yining Hong, Jiageng Liu, Han Yin et al.

AI agents fail at embodied spatial reasoning primarily because they make poor action choices, not because they can't see—and they confidently stick to wrong answers even when evidence contradicts them, unlike humans who actively seek disconfirming evidence.

ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.

multimodalreasoning

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

May 18, 2026

Lifu Wei, Yinuo Ren, Naichen Shi et al.

You can guide diffusion models without computing gradients or scores—just reweight trajectories and resample periodically, making inference-time improvements cheaper and easier to implement.

This paper introduces SURGE, a gradient-free method for improving diffusion model outputs at inference time. Instead of computing expensive gradients, SURGE reweights and resamples trajectories using a mathematical technique called Girsanov estimation, making guidance simpler and faster while maintaining theoretical guarantees.
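The reweight-and-resample idea can be sketched as a generic particle filter step. This is a minimal illustration, not the paper's Girsanov-based estimator: the log-weights are assumed to be supplied by the caller, and multinomial resampling is used only because it is the simplest scheme. All function names are hypothetical.

```python
import math
import random


def normalize(log_weights):
    """Convert log-weights to probabilities, subtracting the max for stability."""
    m = max(log_weights)
    exps = [math.exp(lw - m) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


def effective_sample_size(probs):
    """ESS = 1 / sum(p_i^2): N for uniform weights, 1 if one particle dominates."""
    return 1.0 / sum(p * p for p in probs)


def resample(particles, log_weights, rng):
    """Multinomial resampling: draw N particles in proportion to their weights,
    then reset every log-weight to zero."""
    probs = normalize(log_weights)
    chosen = rng.choices(particles, weights=probs, k=len(particles))
    return chosen, [0.0] * len(particles)
```

A common design choice is to resample only when the effective sample size falls below a threshold such as half the particle count, which is one way to read "resample periodically". No gradient of the reward or score is needed at any point, only the weights.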

efficiency

Actionable World Representation

May 18, 2026

Kunqi Xu, Jitao Li, Jianglong Ye et al.

By explicitly modeling object state changes as a learnable manifold, WorldString provides a unified way to represent how objects respond to actions—bridging the gap between perception and control for physical world models.

WorldString is a neural architecture that learns to represent how real-world objects change state over time by processing point clouds or video data. It creates a digital twin of objects that captures their actionable properties, serving as a building block for world models that can predict and interact with the physical world.

architecturereasoning

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

May 18, 2026

Qianhao Yuan, Jie Lou, Xing Yu et al.

MLLMs can improve fine-grained visual understanding by learning from their own superior performance on evidence-focused crops, using on-policy self-distillation to transfer regional perception skills to full-image reasoning.

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers discovered that models actually perform better when shown cropped images focused on relevant areas versus full images, suggesting the problem isn't recognizing details but finding them.

multimodaltrainingefficiency

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

May 18, 2026

Payal Chandak, Victoria Alkin, David Wu et al.

LLMs deployed for medical advice have hidden, consistent ethical biases that don't reflect real physician diversity; without explicit auditing and balancing, a single model's values could be imposed at scale on thousands of patients.

This paper audits how large language models handle ethical dilemmas in medicine, revealing that while models discuss multiple ethical perspectives in their reasoning, they make near-identical decisions across repeated attempts.

safetyevaluationalignment

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

May 18, 2026

Miguel Farinha, Ronald Clark

By conditioning on intrinsic image properties (albedo and shading) extracted from both photos and 3D renders, you can achieve photorealistic relighting with full PBR lighting control while staying fast enough for practical use.

PIXLRelight is a fast neural relighting method that lets you change lighting in photos using physically-based rendering controls. It decomposes images into intrinsic components (albedo, shading, residuals) and uses these to condition a transformer model, enabling realistic lighting adjustments in under 0.1 seconds per image without per-image optimization.

multimodalarchitectureefficiency

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate what facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scalingevaluationtraining