ThinkLLM
Models · Capabilities · Use Cases · Benchmarks · Papers · Glossary


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 13 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (19)

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Apr 2, 2026

Daiwei Chen, Zhoutong Fu, Chengming Jiang et al.

Token initialization is a critical bottleneck when extending language models with new vocabulary—grounding new tokens in semantically meaningful positions before fine-tuning substantially improves downstream task performance.

When language models add new vocabulary tokens for specific tasks like recommendation systems, they typically initialize them as averages of existing embeddings. This paper shows this approach fails because all new tokens collapse into the same subspace, losing their distinctiveness.

training · efficiency · applications
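The collapse the summary describes is easy to see in a few lines of NumPy. This is an illustrative sketch, not the paper's method: `vocab`, the tile-based naive scheme, and the related-token grounding below are stand-ins for the real embedding table and grounding procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))          # stand-in existing embedding table

# Naive scheme criticized by the paper: every new token starts as the
# mean of all existing embeddings, so all new tokens coincide exactly.
naive_new = np.tile(vocab.mean(axis=0), (5, 1))
spread_naive = np.linalg.norm(naive_new - naive_new[0], axis=1).max()

# "Grounded" alternative (illustrative, not the paper's exact recipe):
# seed each new token from a handful of semantically related existing
# tokens, so the new tokens start out distinguishable.
related = [rng.choice(1000, size=8, replace=False) for _ in range(5)]
grounded_new = np.stack([vocab[idx].mean(axis=0) for idx in related])
spread_grounded = np.linalg.norm(grounded_new - grounded_new[0], axis=1).max()

print(spread_naive)     # 0.0 — the naive new tokens all coincide
print(spread_grounded)  # > 0 — grounded tokens stay distinct
```

Any grounding signal that seeds each new token from a distinct neighborhood avoids the degenerate zero-spread start.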

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Apr 2, 2026

Bangji Yang, Hongbo Ma, Jiajun Fan et al.

You can make reasoning models 15-60% more token-efficient while keeping or improving accuracy by simply training them to solve multiple problems simultaneously, creating an implicit efficiency incentive rather than explicit penalties.

This paper introduces Batched Contextual Reinforcement (BCR), a training method that makes language models reason more efficiently by training them to solve multiple problems at once in a shared context.

Mar 23 – Mar 29 (14)

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Mar 26, 2026

Yuxing Lu, Xukai Zhao, Wei Wu et al.

You can improve RAG systems by preprocessing your corpus once to add distilled, compact versions of relevant documents—this works with any retrieval method and shows consistent gains without changing your pipeline.

This paper proposes WriteBack-RAG, a method that improves retrieval-augmented generation (RAG) systems by treating the knowledge base as trainable. Using labeled examples, the system identifies relevant documents, distills them into compact knowledge units, and adds these to the corpus.

data · training
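A minimal sketch of the write-back idea, assuming a toy `distill` that just truncates a document; the paper's distillation is learned from labeled examples, and the one-document corpus here is a placeholder.

```python
# Write-back enrichment sketch: distill long documents into compact
# units and append them to the corpus, so any retriever can later
# surface the distilled versions without pipeline changes.
def distill(doc: str, max_sents: int = 2) -> str:
    """Toy distillation: keep the first few sentences as a compact unit."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    return ". ".join(sents[:max_sents]) + "."

corpus = [
    "RAG pipelines retrieve documents. They then condition generation on "
    "them. Retrieval quality is a common bottleneck in practice.",
]
# The corpus grows; the retrieval pipeline itself is unchanged.
corpus += [distill(doc) for doc in corpus]
print(len(corpus))  # 2
```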

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Mar 26, 2026

Xiaofeng Mao, Shaohao Rui, Kaining Ying et al.

You can train video models on short clips and generate much longer videos by using a three-tier memory strategy that compresses historical context without losing quality.

PackForcing solves the memory problem in video generation by compressing old frames intelligently—keeping early frames for context, heavily compressing middle frames, and preserving recent frames for smooth transitions. This lets models generate 2-minute videos on a single GPU after training only on 5-second clips, achieving 24x longer videos than training data.
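The three-tier compression can be sketched as a frame-selection rule. The tier sizes and stride below are assumed values for illustration, not PackForcing's configuration.

```python
def pack_history(frames, keep_early=2, keep_recent=4, mid_stride=4):
    """Three-tier memory sketch (assumed scheme, not PackForcing's exact
    one): keep the earliest frames for global context, subsample the
    middle heavily, and keep the most recent frames intact for smooth
    transitions."""
    early = frames[:keep_early]
    recent = frames[-keep_recent:]
    middle = frames[keep_early:-keep_recent][::mid_stride]
    return early + middle + recent

history = list(range(32))      # 32 generated frame indices
packed = pack_history(history)
print(len(packed))  # 13: 2 early + 7 subsampled middle + 4 recent
```

The compressed history grows far more slowly than the video itself, which is what lets a short-clip-trained model keep sampling.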

Mar 16 – Mar 22 (26)

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

Mar 20, 2026

Jianan Huang, Rodolfo V. Valentim, Luca Vassio et al.

By aligning payload embeddings with text-based vulnerability descriptions using contrastive learning, you can reduce shortcut learning and improve how well cybersecurity models generalize to unseen threats.

This paper tackles a major problem in cybersecurity AI: models trained in labs fail in the real world because they learn surface-level patterns instead of genuine security concepts.

training · multimodal · safety

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Mar 20, 2026

Emiel Hoogeboom, David Ruhe, Jonathan Heek et al.

Discrete diffusion models can now be distilled into faster generators using moment matching, enabling practical deployment with fewer sampling steps while maintaining quality.

This paper solves the problem of making discrete diffusion models faster by distilling them into simpler models. Unlike continuous diffusion models which have many distillation techniques, discrete diffusion (used for text and images) has been hard to compress.

Mar 9 – Mar 15 (13)

PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Mar 13, 2026

Yangsong Zhang, Anujith Muraleedharan, Rikhat Akizhanov et al.

By optimizing diffusion models with physics-aware rewards during training, you can generate robot motions that are both realistic and executable on real hardware without post-hoc corrections.

This paper improves AI-generated humanoid robot motions by using preference optimization to make them physically realistic. Instead of manually tweaking physics penalties, the method integrates a physics controller directly into training, teaching the motion model to generate movements that work well when converted to real robot commands.

training · reasoning · applications

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Mar 13, 2026

Xin Chen, Junchao Wu, Shu Yang et al.

You can train better LLMs on less data by selecting instruction examples that activate the same neurons as your target task—this beats using all data or relying on external models to score examples.

This paper introduces NAIT, a method for selecting the most useful instruction-tuning data for large language models by analyzing which neurons activate when processing different types of tasks. Instead of using all available training data, NAIT identifies a small subset (10% of data) that produces better results by matching neuron activation patterns to target capabilities.
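A sketch of activation-matched selection, with random vectors standing in for the neuron-activation profiles NAIT actually extracts from the model.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical setup: one activation-pattern vector per training example
# and one for the target task. NAIT derives these from neuron
# activations; random vectors stand in here purely for illustration.
example_acts = rng.normal(size=(200, 32))
target_act = rng.normal(size=32)

# Score every example by activation similarity to the target task,
# then keep only the top 10% as the training subset.
scores = np.array([cosine(x, target_act) for x in example_acts])
k = int(0.10 * len(example_acts))
selected = np.argsort(scores)[-k:]
print(len(selected))  # 20
```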

Feb 23 – Mar 1 (23)

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Feb 27, 2026

Shengqu Cai, Weili Nie, Chao Liu et al.

Decouple learning long-term coherence from local quality to generate minute-scale videos without needing massive amounts of long-form training data.

This paper solves a key problem in video generation: making long videos (minutes) that are both sharp and coherent. The trick is training two separate components—one learns long-term story structure from rare long videos, while another copies local quality from abundant short videos. This lets the model generate minute-long videos that look crisp and stay consistent throughout.

training · efficiency · architecture

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Feb 27, 2026

Fan Shu, Yite Wang, Ruofan Wu et al.

LLMs need specialized training data to reliably follow data science workflows; fine-tuning on task-specific benchmarks can improve performance by 8x.

DARE-bench is a benchmark for testing how well AI models can follow data science instructions and complete multi-step ML tasks. It includes 6,300 real Kaggle tasks with verifiable correct answers, making evaluation objective rather than relying on human judges.

training · efficiency · reasoning

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Apr 2, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang et al.

By intelligently routing training samples to different optimization strategies based on correctness, you can get both fast learning and stable training, a practical improvement for post-training large language models.

This paper proposes Sample-Routed Policy Optimization (SRPO), a training method that combines two different approaches for fine-tuning language models: it routes correct outputs through a reward-based method and incorrect outputs through a distillation method.

training · reasoning · efficiency
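The routing step itself is simple to sketch; the losses applied to each branch (reward-based vs. distillation) are where the paper's contribution lies, so they are omitted here.

```python
# Minimal routing sketch (assumed shape of the idea, not SRPO's losses):
# correct rollouts go to a reward-style update, incorrect rollouts get
# a distillation-style target instead.
def route(samples):
    reward_batch, distill_batch = [], []
    for s in samples:
        (reward_batch if s["correct"] else distill_batch).append(s)
    return reward_batch, distill_batch

samples = [
    {"id": 0, "correct": True},
    {"id": 1, "correct": False},
    {"id": 2, "correct": True},
]
reward_batch, distill_batch = route(samples)
print(len(reward_batch), len(distill_batch))  # 2 1
```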

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Apr 2, 2026

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu et al.

You can train agents to permanently learn skills rather than retrieve them at runtime, reducing token overhead and improving zero-shot performance by progressively withdrawing skill context during training.

SKILL0 teaches language model agents to internalize skills (procedural knowledge packages) directly into their parameters through a curriculum that gradually removes skill context during training.

training · agents · reasoning

Model-Based Reinforcement Learning for Control under Time-Varying Dynamics

Apr 2, 2026

Klemens Iten, Bruce Lee, Chenhao Li et al.

Real-world control systems drift and change; you need to actively manage which training data you use and how confident you are in your model to handle non-stationary dynamics effectively.

This paper tackles reinforcement learning for robots and systems that change over time—like machinery that wears down or environments with shifting conditions. The researchers develop a learning algorithm that adapts by selectively forgetting old data and maintaining uncertainty estimates, proving it works better than standard approaches that assume unchanging dynamics.

training · reasoning

Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives

Apr 2, 2026

Hao Zhu, Di Zhou, Donna Slonim

Diffusion model denoising objectives can smooth optimization landscapes for causal discovery, enabling faster and more stable learning of causal structures in challenging high-dimensional datasets.

This paper proposes DDCD, a new method for discovering causal relationships in data by adapting diffusion model techniques. Instead of using diffusion to generate data, it uses the denoising process to learn causal structures (DAGs) more stably and efficiently than existing methods like NOTEARS, especially when data is high-dimensional or imbalanced.

reasoning · training · efficiency

BVFLMSP: Bayesian Vertical Federated Learning for Multimodal Survival with Privacy

Apr 2, 2026

Abhilash Kar, Basisth Saha, Tanmay Sen et al.

This framework enables hospitals and clinics to collaboratively build better survival prediction models without sharing raw patient data, while also quantifying prediction confidence—critical for clinical adoption.

BVFLMSP combines Bayesian neural networks with federated learning to predict survival outcomes from sensitive multimodal data distributed across multiple parties. Each organization keeps its data private while contributing predictions to a shared model, with added privacy protections and uncertainty estimates for more reliable medical decision-making.

safety · multimodal · training

Generative AI Spotlights the Human Core of Data Science: Implications for Education

Apr 2, 2026

Nathan Taback

As AI handles data cleaning, modeling, and reporting, data science education must prioritize teaching human reasoning, problem formulation, and ethical judgment—skills that AI cannot replace.

This paper argues that generative AI automates routine data science tasks but reveals that the most valuable skills remain fundamentally human: problem formulation, causal reasoning, ethics, and judgment. The author proposes that data science education should focus on these irreducibly human competencies while teaching students to work effectively with AI tools.

training · applications

Universal Hypernetworks for Arbitrary Models

Apr 2, 2026

Xuanfeng Zhou

A single fixed hypernetwork can generate weights for diverse architectures and tasks by using architecture/task descriptors as input, eliminating the need to retrain generators when switching between different model types.

This paper introduces Universal Hypernetworks (UHN), a single neural network that can generate weights for many different model architectures and tasks. Instead of building separate weight generators for each model type, UHN uses descriptors (text descriptions of architecture and task) to produce weights for any compatible model, working across vision, graphs, text, and math tasks.

architecture · training · efficiency

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Apr 1, 2026

Piyush Garg, Diana R. Gergel, Andrew E. Shao et al.

For AI weather prediction, the training pipeline (loss function, data, optimization strategy) determines forecast skill far more than architectural choices—and current models have a fundamental blind spot for extreme weather events.

This paper explains why training methods, loss functions, and data matter more than model architecture for AI weather prediction. Using math from approximation theory and dynamical systems, the authors show that how you train a model dominates what model you use, and prove that AI weather models systematically underestimate extreme events. They validate this across ten different AI weather models.

training · evaluation · reasoning

Learning and Generating Mixed States Prepared by Shallow Channel Circuits

Apr 1, 2026

Fangjun Hu, Christian Kokail, Milan Kornjača et al.

Quantum states in the trivial phase can be efficiently learned from measurements and regenerated using shallow circuits, providing a theoretical foundation for quantum generative models without needing the original preparation circuit.

This paper shows how to learn and generate quantum mixed states that belong to the 'trivial phase'—states preparable by shallow quantum circuits that preserve local reversibility. The algorithm learns from measurement data alone and outputs a shallow circuit that recreates the state, with polynomial sample complexity and runtime. The work also extends to classical diffusion models.

reasoning · training · architecture

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

data · training · agents

Embarrassingly Simple Self-Distillation Improves Code Generation

Apr 1, 2026

Ruixiang Zhang, Richard He Bai, Huangjie Zheng et al.

You can improve code generation by sampling from your model's own outputs and fine-tuning on them—no external tools needed. The gains come from balancing precision (removing bad options) with exploration (keeping useful diversity).

A simple technique called self-distillation improves code generation in large language models by having them sample their own outputs and fine-tune on those samples. The method boosts performance significantly (42.4% to 55.3% on benchmarks) without needing external verifiers or teacher models, and works across different model sizes and architectures.

training · efficiency · applications
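A toy version of the loop, with `sample_program` and `score` as hypothetical stand-ins for model sampling and the model's own self-ranking.

```python
import random

random.seed(0)

# Self-distillation sketch (assumed mechanics): draw several candidate
# programs from the model, keep the ones the model itself ranks highest,
# and fine-tune on that filtered set.
def sample_program(prompt: str) -> str:
    return f"{prompt}-candidate-{random.randint(0, 9)}"

def score(program: str) -> float:
    return random.random()

candidates = [sample_program("sort") for _ in range(8)]
ranked = sorted(candidates, key=score, reverse=True)
finetune_set = ranked[:2]   # precision: keep only the best few;
print(len(finetune_set))    # exploration: keep more than one survivor
```

Keeping several top candidates rather than a single best one is the precision/exploration balance the takeaway refers to.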

Adaptive Block-Scaled Data Types

Mar 30, 2026

Jack Cook, Hyemin S. Lee, Kathryn Le et al.

Adaptive block-scaled quantization can significantly reduce errors in 4-bit model compression by intelligently switching between data types per block, achieving better accuracy than fixed formats without extra storage cost.

This paper introduces adaptive quantization formats (IF4, IF3, IF6) that improve upon NVFP4 by dynamically choosing between floating-point and integer representations for each block of values. The approach uses an unused bit in NVFP4 to signal which format to use, reducing quantization errors and improving language model performance with minimal hardware overhead.

efficiency · training · architecture
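The per-block "pick the lower-error format" decision can be sketched directly. The level sets below are made-up FP-like and INT-like grids, not the actual IF4/NVFP4 encodings.

```python
import numpy as np

def quantize(block, levels):
    """Map each value in the block to its nearest code level."""
    idx = np.abs(block[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

def adaptive_block(block, fmt_a, fmt_b):
    """Per-block format choice (sketch of the idea, not IF4's bit
    layout): try both candidate data types and keep whichever
    reconstructs the block with lower error."""
    qa, qb = quantize(block, fmt_a), quantize(block, fmt_b)
    err_a = np.abs(block - qa).sum()
    err_b = np.abs(block - qb).sum()
    return ("a", qa) if err_a <= err_b else ("b", qb)

# FP-like levels cluster near zero; INT-like levels are evenly spaced.
fp4_like = np.array([0.0, 0.05, 0.1, 0.2, 0.4, 0.7, 1.0, 1.5])
int4_like = np.linspace(0.0, 1.5, 8)

small_vals = np.array([0.04, 0.09, 0.11, 0.21])  # favours the FP-like grid
fmt, _ = adaptive_block(small_vals, fp4_like, int4_like)
print(fmt)  # 'a'
```

A block dominated by small values picks the FP-like grid; a block with evenly spread values would pick the INT-like one.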

Temporal Credit Is Free

Mar 30, 2026

Aur Shalev Merin

Online learning in RNNs doesn't require sophisticated credit assignment algorithms—proper gradient normalization with immediate derivatives is sufficient and dramatically more memory-efficient.

Recurrent networks can learn online using simple immediate derivatives instead of expensive backpropagation-through-time. The key insight: the hidden state naturally carries temporal information forward, so you just need proper gradient normalization and avoid stale memory traces. This approach matches or beats complex algorithms while using 1000x less memory.

training · efficiency

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Mar 30, 2026

Liliang Ren, Yang Liu, Yelong Shen et al.

Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.

This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key innovation is showing that a single learning rate tuned at small scale transfers reliably across different model sizes, depths, and training amounts—achieving 1.58× better compute efficiency while maintaining training stability.

training · scaling · efficiency
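The core constraint is easy to sketch: after each optimizer step, project the weights back onto a fixed-radius sphere. This is an assumed form of the mechanism, not HyperP's full update rule.

```python
import numpy as np

def project_to_sphere(w, radius=1.0):
    """Hypersphere constraint sketch: rescale the weight vector back
    onto a fixed-radius sphere after every optimizer step."""
    return w * (radius / np.linalg.norm(w))

rng = np.random.default_rng(2)
w = project_to_sphere(rng.normal(size=128))
grad = rng.normal(size=128)

# One constrained step: move along the gradient, then re-project.
w = project_to_sphere(w - 0.1 * grad)
print(round(float(np.linalg.norm(w)), 6))  # 1.0
```

Because the weight norm is pinned, the effective step size no longer drifts with scale, which is the intuition behind a single transferable learning rate.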

Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks

Mar 30, 2026

Meitong Liu, Christopher Jung, Rui Li et al.

Transfer learning with auxiliary tasks provably helps only under specific conditions—this paper gives exact formulas to check those conditions and optimal ways to combine auxiliary and main tasks in linear settings.

This paper provides theoretical guarantees for when auxiliary data helps in transfer learning. For linear regression, the authors derive exact formulas showing when and how auxiliary tasks improve performance. For linear neural networks with shared representations, they prove the first non-vacuous conditions for beneficial auxiliary learning and show how to optimally weight different tasks.

training

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Mar 30, 2026

Yash Savani, Branislav Kveton, Yuchen Liu et al.

Stepwise credit assignment—rewarding each diffusion step for its own improvement rather than the final result—makes RL training of image generators more efficient and faster to converge.

This paper improves reinforcement learning for image generation models by assigning credit more intelligently across diffusion steps. Instead of treating all steps equally, it recognizes that early steps handle composition while late steps refine details, then rewards each step based on its specific contribution. This leads to faster learning and better sample efficiency.

training · reasoning · efficiency
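A minimal sketch of stepwise credit, assuming a per-step quality score is already available (obtaining that score is the hard part the paper addresses).

```python
# Stepwise credit sketch (assumed form of the idea): score the partial
# sample after every denoising step and credit each step with its own
# improvement, instead of sharing one final reward equally.
def stepwise_credit(step_scores):
    return [round(b - a, 6) for a, b in zip(step_scores, step_scores[1:])]

scores = [0.1, 0.4, 0.5, 0.9]   # quality after each diffusion step
credits = stepwise_credit(scores)
print(credits)  # [0.3, 0.1, 0.4]
```

An early composition step that raises quality a lot earns a large credit even if later refinement steps contribute little.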

Dynamic Dual-Granularity Skill Bank for Agentic RL

Mar 30, 2026

Songjun Tu, Chengdong Xu, Qichao Zhang et al.

Organizing agent experience into dual-granularity skills (task-level and step-level) with dynamic maintenance significantly improves performance, and these skills transfer across different evaluation settings without major training overhead.

D2Skill creates a dynamic memory system for AI agents that stores two types of reusable skills: high-level task guidance and low-level step-by-step corrections. The system learns from its own training experience, continuously updating and pruning skills based on their usefulness. Tests show 10-20% improvement in task success rates on complex web-based environments.

agents · reasoning · training

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Mar 26, 2026

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero et al.

You can teach vision-language models to understand compositional meaning by focusing on concept-level alignment and preserving fine-grained visual information—without custom data or hurting general performance.

This paper improves how vision-language models learn to understand combinations of concepts (like "red car" vs "blue car") without sacrificing their ability to recognize new objects.

training · multimodal · efficiency

On Neural Scaling Laws for Weather Emulation through Continual Training

Mar 26, 2026

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.

Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.

This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.

scaling · efficiency · training

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Mar 25, 2026

Qijia He, Xunmei Liu, Hammaad Memon et al.

You can now automatically convert flat images of technical figures into editable, scalable vector graphics—matching GPT-5.2 performance—enabling recovery of lost design source files without manual reconstruction.

VFIG converts rasterized images (PNG, JPEG) of technical diagrams back into editable SVG vector graphics using vision-language models. The team created a 66K dataset of figure-SVG pairs and a two-stage training approach (supervised learning for basic shapes, then reinforcement learning for refinement) to reconstruct complex professional diagrams with high fidelity.

multimodal · training · applications

Trust Region Constrained Bayesian Optimization with Penalized Constraint Handling

Mar 25, 2026

Raju Chowdhury, Tanmay Sen, Prajamitra Bhuyan et al.

Trust regions combined with penalty-based constraints enable Bayesian optimization to find feasible solutions faster in high-dimensional constrained problems where evaluations are expensive.

This paper presents a Bayesian optimization method for expensive black-box optimization problems with constraints. It combines penalty-based constraint handling, surrogate modeling, and trust regions to efficiently find good solutions in high dimensions with fewer evaluations.

efficiency · training

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Mar 24, 2026

Rustem Islamov, Grigory Malinovsky, Alexander Gaponov et al.

You can now build federated learning systems that defend against both Byzantine attacks and privacy breaches simultaneously, without needing unrealistic assumptions like bounded gradients or extra server datasets.

This paper tackles two critical security issues in federated learning: protecting against malicious servers (Byzantine attacks) and preventing data leakage (differential privacy).

safety · training · efficiency

End-to-End Training for Unified Tokenization and Latent Denoising

Mar 23, 2026

Shivam Duggal, Xingjian Bai, Zongze Wu et al.

You can train tokenization and image generation together from scratch using a single model with shared weights, simplifying the pipeline and reducing training complexity while maintaining quality.

This paper proposes UNITE, a new way to train image generation models more efficiently by combining tokenization and diffusion in a single training stage.

architecture · training · efficiency

Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Mar 23, 2026

Alexandra Zelenin, Alexandra Zhuravlyova

If you're using DoRA for high-rank fine-tuning on limited GPU memory, these optimizations make it practical by cutting peak memory usage by up to 7 GB and doubling speed without changing the model's behavior.

DoRA is a fine-tuning method that adapts model weights by separating magnitude from direction, but computing its forward pass requires materializing large dense matrices that consume massive GPU memory.

efficiency · training

TiCo: Time-Controllable Training for Spoken Dialogue Models

Mar 23, 2026

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.

Spoken dialogue models can now follow duration constraints (e.g., 'respond in 15 seconds') by inserting time markers during generation, making them more practical for real-world voice applications.

TiCo is a post-training method that teaches spoken dialogue models to generate responses with specific durations. It uses time markers during generation to help models track elapsed speaking time and adjust content to meet target lengths, improving real-world voice assistant interactions without requiring new training data.

training · applications · agents

Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Mar 23, 2026

Changxiao Cai, Gen Li

Confidence-based decoding in diffusion models is provably efficient and adapts automatically to data complexity, offering a theoretical foundation for why this practical strategy works well.

This paper proves that confidence-based decoding—a strategy that decides which tokens to generate next in diffusion language models based on prediction confidence—is theoretically efficient.

efficiency · reasoning · training

MemDLM: Memory-Enhanced DLM Training

Mar 23, 2026

Zehua Pei, Hui-Ling Zhen, Weizhe Lin et al.

Diffusion language models can be trained more effectively by embedding a simulated denoising trajectory into training, and this memory mechanism can be reused at inference time to improve long-context retrieval tasks.

This paper addresses a key problem in diffusion language models: they're trained one way (predicting masked tokens) but used differently (multi-step denoising). MemDLM fixes this mismatch by simulating the denoising process during training using a memory mechanism that learns from each sample's trajectory, leading to faster training and better long-context performance.

training · architecture · efficiency

One Model, Two Markets: Bid-Aware Generative Recommendation

Mar 23, 2026

Yanchen Jiang, Zhe Feng, Christopher P. Mah et al.

You can build recommendation systems that serve both users and business goals by treating ad placement as part of the generation process, letting bids influence which items appear at inference time rather than requiring model retraining.

This paper presents GEM-Rec, a recommendation system that balances user satisfaction with platform revenue by integrating ads and bids directly into generative models. Using special control tokens and a bid-aware decoding method, the system learns when to show ads from real user behavior and adjusts which ads appear based on real-time pricing, without needing to retrain the model.

applications · training
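An illustrative scoring rule, assuming bids enter the decoding score additively; GEM-Rec's control tokens and bid-aware decoding are more involved.

```python
# Bid-aware ranking sketch: at inference time an item's score mixes
# model relevance with its real-time bid, so pricing shifts rankings
# without retraining the model.
def rank_items(items, alpha=0.5):
    return sorted(items,
                  key=lambda x: x["relevance"] + alpha * x["bid"],
                  reverse=True)

items = [
    {"name": "organic", "relevance": 0.9, "bid": 0.0},
    {"name": "ad_low",  "relevance": 0.7, "bid": 0.2},
    {"name": "ad_high", "relevance": 0.7, "bid": 0.6},
]
print([x["name"] for x in rank_items(items)])
# ['ad_high', 'organic', 'ad_low'] — the higher bid outranks organic
```

Raising or lowering `alpha` (a hypothetical knob here) trades platform revenue against pure relevance at serving time.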

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Mar 23, 2026

Sashuai Zhou, Qiang Zhou, Junpeng Ma et al.

Fine-grained spatial accuracy in generated images requires explicit spatial reward modeling during training; rule-based spatial checks alone miss complex relationships that vision-language models with grounding can catch.

SpatialReward is a reward model that helps text-to-image AI systems generate images with accurate object positioning and spatial relationships. It breaks down image prompts into specific spatial requirements, uses object detection to verify positions, and applies reasoning to check complex spatial relationships—then feeds this feedback into training to improve image generation quality.

evaluation · multimodal · training

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Mar 19, 2026

Ziyin Zhang, Zihan Liao, Hang Yu et al.

You can now use smaller, faster embedding models for multilingual search and retrieval without sacrificing quality—F2LLM-v2 offers efficient options for resource-constrained deployments while the largest variant ranks first on major benchmarks.

F2LLM-v2 is a family of multilingual embedding models (80M to 14B parameters) trained on 60 million high-quality samples that support 200+ languages, including underserved low-resource ones. Using matryoshka learning and knowledge distillation, these models achieve top performance on benchmarks while being more efficient than previous LLM-based embeddings.

multimodal · efficiency · training

Spectrally-Guided Diffusion Noise Schedules

Mar 19, 2026

Carlos Esteves, Ameesh Makadia

By tailoring noise schedules to each image's spectral content, you can generate higher-quality images with fewer denoising steps, making diffusion models faster and more efficient.

This paper proposes a smarter way to design noise schedules for diffusion models by analyzing the spectral properties of images. Instead of using the same handcrafted noise schedule for all images, the method creates custom schedules for each image that eliminate unnecessary denoising steps, improving generation quality especially when using fewer sampling steps.

efficiency · architecture · training

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Mar 19, 2026

Zhuolin Yang, Zihan Liu, Yang Chen et al.

You can build highly capable reasoning models with far fewer active parameters by combining domain-specific reinforcement learning with multi-domain distillation—this model matches frontier performance with 20x fewer parameters.

Nemotron-Cascade 2 is a 30B parameter model with only 3B active parameters that achieves top-tier reasoning and coding performance comparable to much larger models.

training · reasoning · efficiency

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Mar 19, 2026

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.

An LLM backbone's text-only auditory knowledge is a strong predictor of how well it will perform on audio tasks, so you can screen candidate backbones by probing that knowledge before building the full audio-language model.

This paper investigates how much auditory knowledge LLMs actually acquire from text-only training, and whether this predicts how well they work on audio tasks. Researchers tested different LLMs three ways: directly probing their audio knowledge, having them reason about audio descriptions, and fine-tuning them into full audio-language models.

evaluation · multimodal · training

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Mar 19, 2026

Zehao Li, Zhenyu Wu, Yibo Zhao et al.

Breaking reward evaluation into smaller, verifiable steps with multiple reviewers produces more reliable feedback for training GUI agents, improving task success by 10% in online learning scenarios.

OS-Themis is a reward evaluation system for GUI agents that breaks down task trajectories into verifiable milestones and uses multiple reviewers to judge whether agents completed tasks correctly. This approach improves both the accuracy of reward signals and the performance of agents trained with reinforcement learning on mobile and desktop interfaces.

agents · evaluation · training

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Mar 19, 2026

Chonghan Liu, Yimin Du, Qi An et al.

VEPO uses variable entropy and constrained RL to improve low-resource language models by enforcing linguistic well-formedness during training while maintaining exploration—achieving better tokenization and translation quality on 90 language pairs.

This paper introduces VEPO, a training method that improves language models for low-resource languages by using reinforcement learning to enforce structural constraints (like proper formatting and sequence length) while dynamically balancing exploration and exploitation.

training · alignment

Optimal Splitting of Language Models from Mixtures to Specialized Domains

Mar 19, 2026

Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.

You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than using a fixed two-stage recipe.

This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.

training · scaling · efficiency

Fast and Effective Computation of Generalized Symmetric Matrix Factorization

Mar 19, 2026

Lei Yang, Han Wan, Min Zhang et al.

The paper provides both theoretical foundations (exactness properties) and a practical algorithm (A-NAUM) for symmetric matrix factorization problems, with proven convergence rates—useful for practitioners implementing matrix factorization in applications.

This paper develops a fast algorithm for symmetric matrix factorization, a mathematical technique used across machine learning and image processing. The authors prove theoretical guarantees about when their method finds exact solutions and propose A-NAUM, an efficient algorithm that alternates between updating matrix factors, with convergence guarantees.

training

Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection

Mar 19, 2026

Ruilin Li, Heming Zou, Xiufeng Yan et al.

Using data-guided projection instead of random initialization makes continual learning more stable and effective, especially when there's a big gap between pretrained model knowledge and new tasks.

This paper improves how pretrained models learn continuously on new tasks by replacing random projection layers with a smarter, data-guided approach. Instead of randomly initializing the projection layer, the method selectively builds it based on the target data, creating more stable and expressive representations when learning new classes incrementally without storing old examples.

trainingefficiency

UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Mar 19, 2026

Zikang Ding, Junchi Yao, Junhao Li et al.

Biases in LLMs can be reduced by enforcing structural consistency in the model's internal computations (attention and hidden states) across counterfactual inputs, rather than just fixing outputs or training data.

This paper proposes UGID, a method to reduce social biases in large language models by treating the model as a computational graph and enforcing that its internal structure remains consistent across inputs that differ only in sensitive attributes like gender or race.
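
Stripped to its core, this is a consistency penalty between internal states of counterfactual pairs. A minimal sketch (toy loss and hypothetical names; the actual graph-isomorphism objective is richer):

```python
def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def debiased_loss(task_loss, hidden_orig, hidden_cf, lam=0.5):
    # Add a consistency penalty when the model's internal states differ
    # between an input and its counterfactual (e.g. "he" swapped for "she").
    return task_loss + lam * mse(hidden_orig, hidden_cf)

same = debiased_loss(1.0, [0.2, 0.8], [0.2, 0.8])
diverged = debiased_loss(1.0, [0.2, 0.8], [0.9, 0.1])
```

Training against this penalty pushes the model to compute the same way for both inputs, rather than merely output the same answer.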

safetyalignmenttraining

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

Mar 19, 2026

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson et al.

State-space models (Mamba) enable efficient EEG foundation models that work across varying electrode setups—crucial for real-world clinical deployment where equipment differs across hospitals.

LuMamba is an efficient EEG foundation model that handles different electrode configurations by combining topology-invariant encodings with linear-complexity state-space modeling. Pre-trained on 21,000+ hours of unlabeled EEG data, it achieves strong performance on clinical tasks while using 377× fewer computations than transformer-based alternatives.

efficiencyarchitecturetraining

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Mar 18, 2026

Zhang Zhang, Shuqi Lu, Hongjin Qian et al.

Instead of storing agent experiences as text, storing them as executable code lets agents reuse and improve solutions reliably across different tasks and systems.

AgentFactory is a framework that helps AI agents learn and improve by saving successful task solutions as reusable Python code (subagents) rather than just text descriptions. These saved subagents get refined over time based on how well they work, creating a growing library that makes future similar tasks easier to solve without human help.

agentstrainingapplications

Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Mar 18, 2026

Ben S. Southworth, Stephen Thomas

MUD offers 1.3-3x faster token throughput than Muon with similar final performance, making it a practical drop-in replacement for faster transformer training without sacrificing convergence.

MUD is a faster alternative to Muon, an optimizer that speeds up transformer training. Instead of using expensive matrix operations to smooth momentum updates, MUD uses a simpler triangular approach inspired by classical numerical methods. This cuts optimizer overhead by 30-70% while maintaining training speed, making transformers train 10-50% faster in real time.

trainingefficiency

ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Mar 18, 2026

Xuyang Cao, Qianying Liu, Chuan Xiao et al.

By measuring how much each language helps other languages learn during training, you can predict model performance more accurately and find better language mixture ratios than methods that ignore cross-lingual transfer effects.

This paper treats multilingual language model training as a cooperative game where each language contributes to overall performance. It uses game theory to measure how much each language helps others learn (cross-lingual transfer), then uses these insights to predict the best mix of languages for training data.
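
The game-theoretic core is a standard Shapley computation. A toy sketch with a made-up three-language performance function:

```python
from itertools import permutations

def shapley(players, value):
    # Exact Shapley values: average each player's marginal contribution
    # over all join orderings (fine for a handful of languages).
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical "multilingual performance" of a training mix: English
# transfers to everyone, German and Dutch transfer to each other.
def perf(langs):
    score = len(langs) * 1.0
    if "en" in langs:
        score += 0.5 * (len(langs) - 1)
    if {"de", "nl"} <= langs:
        score += 0.8
    return score

vals = shapley(["en", "de", "nl"], perf)
```

The values sum to the full mix's performance (the efficiency property), so they directly attribute total performance to individual languages.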

scalingtraining

Efficient Reasoning on the Edge

Mar 17, 2026

Yelysei Bondarenko, Thomas Hehn, Rob Hesselink et al.

You can run reasoning-capable LLMs on mobile devices by using LoRA adapters with reinforcement learning to shorten reasoning traces, parallel decoding to reduce latency, and smart KV-cache management—achieving near-full-model accuracy with a fraction of the memory.

This paper makes LLM reasoning practical for mobile devices by combining lightweight LoRA adapters with techniques like budget forcing (to shorten responses), parallel decoding (to speed up generation), and dynamic adapter switching (to activate reasoning only when needed). The result is accurate chain-of-thought reasoning on edge devices without the memory overhead of full models.

efficiencyreasoningtraining

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Mar 17, 2026

Kaixuan Wang, Tianxing Chen, Jiawei Liu et al.

Having diverse, high-quality 3D assets at scale dramatically improves robot learning in simulation—this dataset removes a major bottleneck for scaling robotic manipulation training.

ManiTwin is an automated pipeline that converts single images into simulation-ready 3D digital objects for robot training. The team created ManiTwin-100K, a dataset of 100,000 annotated 3D assets with physical properties and manipulation instructions, enabling large-scale generation of robot training data in simulation.

dataapplicationstraining

Online Experiential Learning for Language Models

Mar 17, 2026

Tianzhu Ye, Li Dong, Qingxiu Dong et al.

Language models can improve themselves in production by learning from actual user interactions—extracting knowledge from deployment experience and feeding it back into training without requiring access to the original environment.

This paper introduces Online Experiential Learning (OEL), a system that lets language models continuously improve by learning from real interactions during deployment. Instead of relying only on offline training data, OEL extracts useful knowledge from user interactions, then updates the model with this knowledge without needing access to the original environment.

trainingreasoningefficiency

Internalizing Agency from Reflective Experience

Mar 17, 2026

Rui Ge, Yichao Fu, Yuyang Qian et al.

By teaching agents to learn from environmental feedback and explore alternative paths when they fail, LEAFE improves their problem-solving capacity across multiple attempts (Pass@k) better than methods that only optimize for single successful outcomes.

This paper introduces LEAFE, a training method that helps AI agents learn from their mistakes during long interactions with environments. Instead of just optimizing for final success, LEAFE teaches agents to reflect on feedback, backtrack to earlier decisions, try alternative approaches, and internalize these recovery strategies.

agentsreasoningtraining

Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Mar 17, 2026

Jello Zhou, Vudtiwat Ngampruetikorn, David J. Schwab

Stochastic resetting—randomly restarting an agent during training—accelerates learning convergence by truncating long, uninformative trajectories, offering a simple tuning mechanism for RL that works independently of reward structure.

This paper shows that periodically resetting an agent back to a starting state during training speeds up reinforcement learning. Unlike traditional methods that slow down learning, resetting helps by cutting short unproductive exploration paths and improving how value estimates propagate through the network, especially when rewards are sparse.
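
A minimal sketch of the mechanism (a generic resetting wrapper and toy random-walk environment, not the paper's setup):

```python
import random

def rollout(env_step, start_state, horizon, reset_prob, rng):
    # Collect one trajectory, but with probability reset_prob at each
    # step teleport the agent back to the start, cutting off long
    # uninformative excursions.
    s = start_state
    traj = [s]
    for _ in range(horizon):
        if rng.random() < reset_prob:
            s = start_state
        else:
            s = env_step(s, rng)
        traj.append(s)
    return traj

# Toy environment: an unbiased random walk on the integers.
walk = lambda s, rng: s + rng.choice((-1, 1))
traj = rollout(walk, 0, 50, 0.05, random.Random(0))
```

The reset probability is a single extra hyperparameter, which is what makes this an easy tuning knob on top of any existing RL setup.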

training

Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Mar 17, 2026

Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

You can train smaller language models to perform complex agentic tasks like presentation generation by using creative reward signals (like inverse task verification) and parameter-efficient fine-tuning, achieving 91% of large model quality with only 7B parameters.

This paper presents a reinforcement learning system that trains AI agents to automatically generate professional slide presentations. The key innovation is an "inverse specification reward" that checks if slides accurately convey their intended message by having an LLM try to recover the original brief from the generated slides.
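
In spirit, the reward compares the original brief with whatever a judge model recovers from the slides. A sketch using token-level F1 as the similarity (names hypothetical):

```python
def token_f1(reference, recovered):
    ref, rec = reference.lower().split(), recovered.lower().split()
    common = sum(min(ref.count(t), rec.count(t)) for t in set(rec))
    if common == 0:
        return 0.0
    precision, recall = common / len(rec), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def inverse_spec_reward(brief, slides, recover_brief):
    # recover_brief stands in for the judge LLM that reads the
    # generated slides and tries to reconstruct the original brief.
    return token_f1(brief, recover_brief(slides))
```

Because the reward is computed without any gold-standard slide deck, it scales to briefs that have no reference presentation.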

agentstraining

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Mar 16, 2026

Aozhe Wang, Yuchen Yan, Nan Zhou et al.

Separating code and test generation into competing models with opposing rewards prevents self-collusion and produces higher-quality code and tests than single-model self-play approaches.

Code-A1 uses two competing AI models to improve code generation: one model writes code, the other writes tests to find bugs in that code. By making them adversaries with opposite goals, the system avoids the problem where a single model could cheat by writing easy tests for itself. This approach generates better code and tests than training on human-written test suites alone.

trainingreasoning

Effective Distillation to Hybrid xLSTM Architectures

Mar 16, 2026

Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied et al.

You can now distill transformer-based LLMs into more efficient xLSTM architectures without significant performance degradation, making it practical to deploy smaller, cheaper models that match their larger teachers.

This paper shows how to effectively compress large language models into smaller xLSTM models while preserving performance. The researchers developed a distillation pipeline that combines multiple specialized experts into a single efficient model, successfully distilling models from Llama, Qwen, and Olmo families with minimal performance loss.

efficiencyarchitecturetraining

Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions

Mar 16, 2026

Quoc Tran-Dinh, Nghia Nguyen-Trung

The paper introduces practical variance-reduction techniques that significantly reduce the number of gradient computations needed to solve stochastic optimization problems, with proven convergence guarantees and real-world applications in machine learning.

This paper develops new optimization techniques for solving complex stochastic problems by combining variance reduction (reducing noise in gradient estimates) with a splitting method called forward-reflected-backward splitting.

trainingefficiency

Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Mar 16, 2026

Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli et al.

For building staged tree models at scale, use Total Variation divergence with Ward.D2 hierarchical clustering—it matches the accuracy of slower methods like Backward Hill Climbing but runs significantly faster.

This paper presents a new method for building staged tree models—a type of probabilistic graphical model that captures context-specific patterns in data. The approach uses hierarchical clustering on probability distributions, comparing different distance metrics and clustering strategies.

trainingefficiencyevaluation

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Mar 13, 2026

Xingli Fang, Jung-Eun Kim

Privacy vulnerabilities and model performance are concentrated in a small set of weights—you can defend against privacy attacks by carefully fine-tuning just these critical weights instead of retraining the whole model.

This paper identifies that privacy leaks in neural networks come from a tiny fraction of weights, and these same weights are crucial for model performance. Rather than retraining the entire model, the authors propose selectively rewinding only these critical weights during fine-tuning to defend against membership inference attacks while keeping the model accurate.
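
Schematically, assuming some saliency score has already flagged the privacy-critical weights (the scoring itself is the paper's contribution and is not reproduced here):

```python
def rewind_critical(pretrained, finetuned, saliency, k):
    # Rewind only the k highest-saliency weights to their pretrained
    # values; everything else keeps its fine-tuned value.
    order = sorted(range(len(finetuned)), key=saliency.__getitem__, reverse=True)
    critical = set(order[:k])
    return [pretrained[i] if i in critical else finetuned[i]
            for i in range(len(finetuned))]

pre = [0.0, 0.0, 0.0, 0.0]
fin = [0.9, 0.1, 0.8, 0.2]
sal = [5.0, 0.1, 4.0, 0.2]   # toy scores: weights 0 and 2 leak the most
defended = rewind_critical(pre, fin, sal, k=2)
```

Since only k weights change, the defense is far cheaper than retraining and leaves most of the fine-tuned behavior untouched.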

safetytrainingefficiency

MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Mar 13, 2026

Callum McLean, Luke Y. Prince, Alexandre Payot et al.

You can speed up neural network training by 1-3% by reusing computation from low-precision matrix operations for normalization, with no accuracy loss.

This paper proposes MXNorm, a faster alternative to RMSNorm (a standard layer normalization technique) that reuses scale information already computed during low-precision matrix multiplication. By avoiding redundant calculations, MXNorm achieves 2.4x speedups in normalization while maintaining training accuracy on Llama models.

efficiencytraining

When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

Mar 13, 2026

Yu Li, Tian Lan, Zhengling Qi

By explicitly comparing correct and incorrect reasoning traces during training, you can improve reasoning model performance without extra sampling or auxiliary models—just by restructuring how the model learns from existing data.

This paper improves GRPO, a method for training reasoning models, by having the model learn from contrasts between correct and incorrect solutions in the same batch. It introduces two techniques: Bilateral Context Conditioning (letting the model compare successful vs failed reasoning traces) and Reward-Confidence Correction (stabilizing training by adjusting baselines).

trainingreasoning

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Mar 13, 2026

Zhengwei Xie, Zhisheng Chen, Ziyan Weng et al.

Embodied agents can continuously improve without retraining by organizing experiences with detailed failure diagnosis and using those insights to constrain and guide planning at test time.

Steve-Evolving is a framework that helps AI agents learn and improve from their experiences in open-world environments like Minecraft. Instead of updating model weights, it organizes what the agent learns into structured experiences, diagnoses why actions succeed or fail in detail, and uses those insights to guide future planning through retrieved skills and safety guardrails.

agentsreasoningtraining

Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Mar 12, 2026

Samy Jelassi, Mujin Kwun, Rosie Zhao et al.

Feature-matching fine-tuning provides a middle ground between simple token prediction and complex reinforcement learning—it gives dense semantic feedback without needing task-specific reward models, making it practical for improving model behavior on real tasks.

This paper proposes a new way to fine-tune language models by matching learned feature representations instead of predicting individual tokens. Rather than using reinforcement learning with reward models, the method generates multiple model outputs in parallel and uses their semantic features to guide training, achieving better results than standard fine-tuning on coding and translation tasks.

trainingefficiencyreasoning

HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Mar 12, 2026

Andy Li, Aiden Durrant, Milan Markovic et al.

HiAP simplifies Vision Transformer deployment by automatically discovering efficient architectures in one training phase without manual sparsity targets, matching complex multi-stage methods while being easier to use.

HiAP is a pruning method that automatically removes unnecessary parts of Vision Transformers during training to make them faster and smaller for edge devices. Unlike existing approaches that require manual tuning, it uses a single training process to find optimal sub-networks by removing entire attention heads, FFN blocks, and individual neurons simultaneously.

efficiencyarchitecturetraining

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Mar 12, 2026

Jiayin Lei, Ming Ma, Yunxi Duan et al.

When training on synthetic code data, filtering by reverse semantic coherence (can the answer predict the question?) is more effective at removing noise than forward metrics, letting you use 75% less data without losing model quality.

This paper introduces QAQ, a method for filtering noisy synthetic code training data by measuring bidirectional semantic coherence—checking not just if a model can generate answers from questions, but also if answers can predict back to questions. By selecting only 25% of data with the highest quality scores, the approach matches full-dataset performance while cutting computational costs.
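
The selection step can be sketched as follows, with `reverse_score` standing in for the reverse-coherence measure (e.g. the likelihood of the question given the answer):

```python
def select_high_quality(pairs, reverse_score, keep_frac=0.25):
    # Rank (question, answer) pairs by how well the answer predicts
    # the question back, then keep only the top fraction.
    ranked = sorted(pairs, key=lambda qa: reverse_score(*qa), reverse=True)
    keep = max(1, int(len(ranked) * keep_frac))
    return ranked[:keep]

pairs = [("sort a list", "sorted(xs)"),
         ("sort a list", "print('hello')"),   # noisy pair
         ("reverse a string", "s[::-1]"),
         ("reverse a string", "import os")]   # noisy pair
coherence = {("sort a list", "sorted(xs)"): 0.9,
             ("sort a list", "print('hello')"): 0.1,
             ("reverse a string", "s[::-1]"): 0.8,
             ("reverse a string", "import os"): 0.05}
kept = select_high_quality(pairs, lambda q, a: coherence[(q, a)], keep_frac=0.5)
```

A mismatched answer scores low in the reverse direction even when it is fluent on its own, which is exactly the noise forward metrics miss.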

datatraining

A Quantitative Characterization of Forgetting in Post-Training

Mar 12, 2026

Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

The direction of your training objective (forward-KL vs reverse-KL) fundamentally determines whether a model forgets old tasks—reverse-KL naturally avoids catastrophic forgetting while forward-KL requires replay to prevent it.

This paper explains why AI models forget old knowledge when trained on new tasks. Using mathematical analysis, the authors show that different training objectives (forward-KL vs reverse-KL) cause different types of forgetting, and that replaying old data helps prevent it. They also analyze three recent training methods to predict when they'll preserve old knowledge.
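
A generic illustration of the KL asymmetry (toy numbers, not the paper's analysis): forward KL against the old distribution penalizes a model that drops mass from old skills far more heavily than one that keeps it.

```python
import math

def kl(p, q):
    # KL(p || q) over a shared discrete support; terms with p_i = 0 vanish.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

old = [0.5, 0.5, 0.0]          # mass on two old "skills"
keeps_old = [0.4, 0.4, 0.2]    # post-trained model that retains them
forgets = [0.05, 0.05, 0.9]    # model that collapsed onto the new task

# Forward KL(old || model) sharply penalizes the forgetting model,
# but evaluating it requires samples from `old`, i.e. replay.
```

This is why, in the paper's framing, a forward-KL objective needs replayed old data to act as an anchor, while a reverse-KL objective weights errors by where the new model itself puts mass.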

trainingalignment

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Mar 12, 2026

Zhoujun Cheng, Yutao Xie, Yuxiao Qu et al.

When doing RL training on LLMs, increase parallel rollouts per problem as your compute budget grows, but expect diminishing returns; this single principle helps you allocate compute efficiently across sampling and training.

This paper studies how to optimally distribute computing resources when training language models with reinforcement learning. The researchers found that the number of parallel attempts per problem should increase with total compute budget before leveling off, and this pattern holds whether problems are easy or hard—though for different reasons.

scalingtraining

Linking Perception, Confidence and Accuracy in MLLMs

Mar 12, 2026

Yuetian Du, Yucheng Wang, Rongyu Zhang et al.

Multimodal models suffer from severe confidence miscalibration; training them to be honest about uncertainty and using that uncertainty to trigger verification steps significantly improves both accuracy and reliability.

This paper identifies that multimodal AI models are overconfident—they don't reliably know when they're wrong. The authors propose a training method using image noise pairs and confidence-based rewards to fix this, plus a test-time strategy that uses the model's confidence to decide when to double-check answers. Results show 8.8% accuracy improvements across benchmarks.

evaluationtrainingmultimodal

Automatic Generation of High-Performance RL Environments

Mar 12, 2026

Seth Karten, Rahul Dev Appapogu, Chi Jin

AI agents can now automatically translate RL environments into optimized implementations (Rust, JAX, GPU-parallel code) in hours instead of months, with built-in verification ensuring the fast version behaves identically to the original.

This paper shows how to automatically generate high-performance RL environments using AI agents with a generic prompt template, verification checks, and iterative repair.

agentsefficiencytraining

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Feb 27, 2026

Weinan Dai, Hanlin Wu, Qiying Yu et al.

Reinforcement learning can teach AI models to write genuinely optimized GPU code, not just syntactically correct code—a task that previously required deep hardware expertise.

This paper trains an AI agent to write optimized GPU code (CUDA kernels) using reinforcement learning. The system learns from trial-and-error feedback about code performance, achieving faster execution than existing tools like PyTorch's compiler and outperforming top commercial AI models on benchmark tests.

agentstrainingapplications

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Feb 27, 2026

Zhengbo Wang, Jian Liang, Ran He et al.

You can reduce optimizer memory by 8x using low-rank decomposition without sacrificing model quality—making it easier to train larger models on limited hardware.

This paper makes training large language models cheaper by redesigning how optimizers store momentum information. Instead of keeping full-sized momentum matrices in memory, the authors compress them into smaller low-rank approximations—using 1/8 the memory while maintaining or improving training quality.
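
The generic idea can be sketched with a rank-1 power-iteration compressor: an m-by-n matrix becomes m + n + 1 stored numbers (the paper's exact factorization and rank may differ):

```python
def rank1_approx(M, iters=50):
    # Best rank-1 approximation M ~ s * u v^T via power iteration.
    # Storing (s, u, v) costs m + n + 1 floats instead of m * n.
    m, n = len(M), len(M[0])
    v = [1.0] * n
    u = [0.0] * m
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = sum(x * x for x in u) ** 0.5
        u = [x / norm_u for x in u]
        v = [sum(M[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm_v = sum(x * x for x in v) ** 0.5
        v = [x / norm_v for x in v]
    s = sum(M[i][j] * u[i] * v[j] for i in range(m) for j in range(n))
    return s, u, v

# A rank-1 toy "momentum matrix" is recovered exactly.
momentum = [[2.0, 4.0], [1.0, 2.0]]
s, u, v = rank1_approx(momentum)
```

Real momentum matrices are only approximately low-rank, so the practical question the paper answers is how much rank can be shed before training quality suffers.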

efficiencytraining

Memory Caching: RNNs with Growing Memory

Feb 27, 2026

Ali Behrouz, Zeman Li, Yuan Deng et al.

Memory Caching lets RNNs scale their memory capacity with sequence length while staying faster than Transformers.

This paper fixes a major weakness of fast RNN models: they forget information too quickly because they have fixed-size memory. The authors introduce Memory Caching, which lets RNNs save snapshots of their memory as they process longer sequences. This gives RNNs the ability to remember more without becoming as slow as Transformers, creating a sweet spot between speed and accuracy.
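
The caching loop itself is simple to sketch (generic snapshot logic, hypothetical names):

```python
def run_with_cache(step_fn, h0, inputs, cache_every):
    # Plain recurrent loop, but snapshot the hidden state every
    # cache_every steps so later computation can look back over the
    # cache. Memory grows with sequence length instead of staying fixed.
    h, cache = h0, []
    for t, x in enumerate(inputs, 1):
        h = step_fn(h, x)
        if t % cache_every == 0:
            cache.append(h)
    return h, cache

# Toy recurrence: running sum over the inputs.
final_h, snapshots = run_with_cache(lambda h, x: h + x, 0, list(range(1, 11)), 5)
```

With one snapshot every k steps, the cache holds T/k states for a length-T sequence, a middle ground between an RNN's O(1) memory and a Transformer's O(T) cache.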

architectureefficiencytraining

Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

Feb 27, 2026

Shruti Joshi, Théo Saulus, Wieland Brendel et al.

Standard metrics for evaluating learned representations are often misspecified and can mislead you about whether your model actually learned interpretable features.

This paper reveals that popular metrics for checking if AI models learn meaningful, interpretable features are unreliable. The metrics work only under specific conditions, and when those conditions aren't met, they give false results—saying a model learned good features when it didn't, or vice versa. The authors provide tools to properly test these metrics.

evaluationtraining

Histopathology Image Normalization via Latent Manifold Compaction

Feb 27, 2026

Xiaolong Zhang, Jianwei Zhang, Selim Sevim et al.

Unsupervised learning can remove batch effects from medical images, letting models generalize across hospitals without retraining.

Medical image analysis struggles when microscope slides are stained or scanned differently across hospitals—models trained on one site fail at another. This paper introduces a technique that learns to remove these visual differences automatically, making AI models work reliably across different clinical sites without needing labeled examples.

dataapplicationstraining

Model Agreement via Anchoring

Feb 26, 2026

Eric Eaton, Surbhi Goel, Marcel Hussing et al.

You can mathematically guarantee that independently trained models will converge to the same predictions by scaling up ensemble size or boosting iterations.

This paper shows how to make different machine learning models agree with each other by using a technique called anchoring. The researchers prove that when you train multiple models together using common methods like stacking, boosting, or neural networks, you can reduce disagreement between them by adjusting simple parameters like the number of models or training iterations.

trainingevaluation

A Dataset is Worth 1 MB

Feb 26, 2026

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

You can teach models new tasks by transmitting just labels instead of data, if clients have a generic reference dataset pre-loaded.

Instead of sending large datasets over the network, this paper proposes sending only class labels for images from a reference dataset that clients already have locally. A smart filtering mechanism picks which images are most relevant to the new task, reducing communication to under 1 MB while maintaining accuracy.
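
In outline (hypothetical names), the payload and the client-side reconstruction are tiny:

```python
def build_payload(relevant_indices, labels):
    # Server side: send only (reference-index, class-label) pairs for
    # the images the filter judged relevant to the new task.
    return {i: labels[i] for i in relevant_indices}

def client_dataset(reference_images, payload):
    # Client side: pair locally stored reference images with the
    # transmitted labels; no image bytes cross the network.
    return [(reference_images[i], y) for i, y in sorted(payload.items())]

reference = ["img0", "img1", "img2", "img3"]   # pre-loaded on the client
payload = build_payload([1, 3], {1: "cat", 3: "dog"})
data = client_dataset(reference, payload)
```

Each payload entry is an integer index plus a short label, which is how a task's worth of supervision fits well under 1 MB.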

efficiencydatatraining

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Feb 26, 2026

Simon Roschmann, Paul Krzakala, Sonia Mazelet et al.

You can align vision and language models with 10-100x less paired training data by leveraging unpaired images and text separately.

This paper shows how to align vision and language models using far fewer paired examples than current methods require. Instead of needing millions of image-text pairs, SOTAlign uses a small set of paired data plus lots of unpaired images and text, employing a technique called optimal transport to learn how the two models relate to each other.
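
At toy scale the coupling reduces to an assignment problem; below is a brute-force stand-in for the optimal-transport solver (real systems use Sinkhorn-style methods on soft couplings):

```python
from itertools import permutations

def best_matching(img_embs, txt_embs):
    # Find the permutation of texts that minimizes total squared
    # distance to the images: a hard optimal-transport coupling.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = range(len(txt_embs))
    return min(permutations(idx),
               key=lambda perm: sum(dist(img_embs[i], txt_embs[perm[i]])
                                    for i in idx))

imgs = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
txts = [[1.1, 0.9], [0.1, 0.0], [1.9, 0.1]]   # shuffled counterparts
match = best_matching(imgs, txts)
```

Recovering the pairing from unpaired embeddings like this is what lets the small paired set do the heavy lifting of anchoring the two spaces.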

multimodaltrainingefficiency

FlashOptim: Optimizers for Memory Efficient Training

Feb 26, 2026

Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard et al.

You can train large models with 50% less GPU memory by using better compression for optimizer states—no quality loss, drop-in replacement.

FlashOptim cuts the memory needed to train large AI models in half by storing optimizer information more efficiently. It uses smarter compression techniques for gradients and optimizer states without hurting model quality, making it possible to train 7B+ parameter models on consumer GPUs.

efficiencytraining

Differentiable Zero-One Loss via Hypersimplex Projections

Feb 26, 2026

Camilo Gomez, Pengyang Wang, Liansheng Tang

You can now directly optimize for classification accuracy during training instead of using proxy losses, improving performance especially when training on large batches.

This paper solves a long-standing problem in machine learning: how to optimize the zero-one loss (the metric that actually measures classification accuracy) using gradient descent. The authors create a smooth mathematical approximation that lets you backpropagate through this loss, which helps models train better on large batches of data.

training

Utilizing LLMs for Industrial Process Automation

Feb 26, 2026

Salim Fares

LLMs can accelerate industrial automation development despite being trained on little specialized domain code, opening new productivity gains in manufacturing.

This paper explores how large language models can help developers write code for industrial automation systems—like programming robotic arms in manufacturing. Most LLM research focuses on common languages like Python, but industrial systems use specialized proprietary languages that LLMs rarely see in training data.

applicationstrainingefficiency

ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Feb 26, 2026

Tianjun Yao, Yongqiang Chen, Yujia Zheng et al.

Agents that reflect on their mistakes in diverse ways solve problems better—and you can teach this diversity by storing reflection patterns as learnable parameters.

This paper introduces ParamMem, a memory module that helps AI agents think better by learning from past mistakes in diverse ways. Instead of repeating the same reflection patterns, the system stores reflection strategies as model parameters, allowing agents to generate varied self-corrections.

agentsreasoningtraining

Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Feb 26, 2026

Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

Federated learning systems can now quantify prediction uncertainty reliably across heterogeneous devices with minimal communication overhead using conformal prediction.

This paper solves a critical problem in federated learning: how to know when your model is uncertain about its predictions, especially when different devices have different types of data.

trainingsafetyefficiency

Physics Informed Viscous Value Representations

Feb 26, 2026

Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari et al.

Physics-informed constraints based on optimal control theory make offline goal-conditioned reinforcement learning more stable and accurate in high-dimensional tasks.

This paper improves how AI agents learn to reach goals from pre-recorded data by using physics principles. Instead of guessing value estimates that might be wrong, the method constrains learning using equations from optimal control theory, making the agent's decisions more geometrically consistent and stable—especially useful for navigation and complex robot manipulation tasks.

trainingreasoning

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Feb 26, 2026

Jiangxin Sun, Feng Xue, Teng Long et al.

Autonomous driving systems can make safer decisions in unexpected situations by predicting consequences and evaluating risk, rather than just copying expert behavior.

This paper tackles a critical problem in autonomous driving: current AI systems learn by copying expert drivers, but fail when encountering unusual situations they've never seen before. The researchers propose RaWMPC, a system that predicts what will happen if the car takes different actions, then picks the safest option—without needing expert examples.

safetyagentstraining

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Feb 26, 2026

Yegon Kim, Juho Lee

Separate the model that solves problems from the model that explains them to avoid accuracy loss when making AI outputs verifiable.

When AI models need to show their work so humans can verify it, they often get worse at solving problems—a cost called "legibility tax." This paper fixes that by splitting the job: one model solves the problem correctly, then a second model rewrites the solution in a way that's easy to check. This avoids forcing one model to juggle both accuracy and explainability.

reasoningsafetytraining

A Model-Free Universal AI

Feb 26, 2026

Yegon Kim, Juho Lee

You don't need to model the environment to build an optimal AI agent—learning action values directly can be just as powerful.

This paper introduces AIQI, the first AI agent that learns optimal behavior without building an explicit model of its environment. Instead of predicting how the world works, it directly learns which actions produce the best outcomes. This is a theoretical breakthrough showing that model-free approaches can match the performance of model-based agents in general reinforcement learning.

reasoningtrainingagents

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Feb 26, 2026

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga et al.

A smaller, specialized AI model can generate better training data than a giant pre-trained one, unlocking real improvements in production systems.

Google used fine-tuned AI models to generate millions of relevance labels for app search results, solving a shortage of human-labeled training data. By combining these AI-generated labels with user behavior signals, they improved their App Store ranking system—especially for unpopular searches where user clicks are rare.

trainingapplicationsdata

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Feb 26, 2026

Pengxiang Li, Dilxat Muhtar, Lu Yin et al.

Training data structure, not model architecture, is why parallel language models revert to sequential generation—fix the training data to unlock truly parallel decoding.

Diffusion language models promise faster parallel text generation, but they often end up generating tokens one-at-a-time like traditional models. This paper shows the problem is how models are trained—sequential training data pushes them toward sequential generation.

trainingefficiencydata

Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language

Feb 26, 2026

Max S. Bennett, Thomas P. Zollo, Richard Zemel

You can now control what AI models learn and remember by giving them natural language instructions, making them adaptable to changing priorities without retraining.

This paper introduces a neural memory system that lets you tell an AI model what to remember and what to ignore using natural language instructions.

trainingagents

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Feb 26, 2026

Chungpa Lee, Jy-yong Sohn, Kangwook Lee

Fine-tune only the value matrix in attention layers to improve zero-shot performance without breaking the model's ability to learn from in-context examples.

When you fine-tune a language model to work better on specific tasks without examples, it often loses the ability to learn from examples shown in the prompt.

trainingefficiency