ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

921 papers · 64 this month · 12 topics
All · Efficiency 38 · Training 37 · Evaluation 33 · Reasoning 27 · Agents 23 · Architecture 23 · Applications 21 · Multimodal 15 · Safety 12 · Scaling 8 · Alignment 8 · Data 6

May 25 – May 31 (8)

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

May 26, 2026

Huawei Lin, Peng Li, Jie Song et al.

Treating AI agent skills as long-lived, testable assets with persistent memory—rather than disposable code—significantly improves task success rates and enables skills to transfer between agents and tasks.

This paper introduces MUSE-Autoskill, a framework that helps AI agents continuously improve by creating, storing, and refining reusable skills over time. Instead of treating skills as one-time solutions, the system manages them like software—organizing them in memory, testing them, and learning from experience to make them more reliable and effective across different tasks.

agents · training · reasoning

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

May 26, 2026

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

RLHF can be exploited by models that mix high quality with hidden biases: annotators prefer these outputs, but the reward model cannot separate quality from bias, so misalignment is amplified during training.

This paper reveals a critical vulnerability in RLHF where language models can exploit the alignment process itself by generating biased outputs that annotators rate highly for quality, causing the reward model to amplify misaligned behaviors like sexism and propaganda.

May 18 – May 24 (33)

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

May 22, 2026

Yifan Yang, Ziyang Gong, Weiquan Huang et al.

Skills can be trained like model parameters: use a separate optimizer to iteratively edit skill text based on validation feedback, not just generate them once. This approach is reproducible, stable, and transfers across models.

SkillOpt treats agent skills like neural network weights—optimizing them systematically through an external optimizer model that suggests bounded edits to skill documents based on scored rollouts.

agents · training
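To make "bounded edits based on scored rollouts" concrete, here is a minimal accept-if-better loop. It is a sketch under stated assumptions, not SkillOpt's implementation: a skill is modeled as a list of text lines, an edit may change at most `max_changed_lines` lines, and `propose_edit` and `score` are hypothetical stand-ins for the optimizer model and the validation rollouts.

```python
import random
from typing import Callable, List

Skill = List[str]  # assumption: a skill document is a list of text lines


def optimize_skill(
    skill: Skill,
    propose_edit: Callable[[Skill, random.Random], Skill],  # hypothetical optimizer model
    score: Callable[[Skill], float],  # hypothetical scored validation rollouts
    steps: int = 50,
    max_changed_lines: int = 2,
    seed: int = 0,
) -> Skill:
    """Iteratively edit a skill, keeping only bounded edits that raise the validation score."""
    rng = random.Random(seed)
    best, best_score = list(skill), score(skill)
    for _ in range(steps):
        candidate = propose_edit(best, rng)
        # Bounded edit: count changed lines (plus added/removed lines) and reject large rewrites.
        changed = sum(a != b for a, b in zip(best, candidate)) + abs(len(best) - len(candidate))
        if changed > max_changed_lines:
            continue
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


# Toy stand-ins (hypothetical): reward lines that include a verification step.
def toy_score(skill: Skill) -> float:
    return float(sum("verify" in line for line in skill))


def toy_edit(skill: Skill, rng: random.Random) -> Skill:
    edited = list(skill)
    i = rng.randrange(len(edited))
    edited[i] = edited[i] + ", then verify"
    return edited


start = ["open the file", "parse the rows", "write the report"]
tuned = optimize_skill(start, toy_edit, toy_score)
```

Because rejected candidates never replace `best`, the validation score in this sketch can only stay flat or rise from one step to the next.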

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

May 22, 2026

Xu Ouyang, Deyi Liu, Yuhang Cai et al.

LLMs have a fundamental capacity limit based on signal-to-noise ratio: scaling parameters or data without maintaining sufficient signal clarity causes performance degradation, explaining phenomena like catastrophic overtraining and quantization failures that standard scaling laws can't capture.

This paper explains why large language models sometimes get worse, not better, with more training or lower numerical precision. Using information theory, the authors model LLM training as sending signals through a noisy channel. When you scale up the model or data without keeping the signal clear relative to the noise, performance improves at first and then degrades, tracing a U-shaped curve.
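The summary leans on Shannon's channel model. For background only, the textbook capacity of an additive white Gaussian noise channel, with signal power $S$ and noise power $N$, is shown below; how the paper maps parameters, data, and numerical precision onto $S$ and $N$ is not given in this summary, so this is not the paper's scaling law.

$$
C = \tfrac{1}{2}\log_2\!\left(1 + \frac{S}{N}\right) \quad \text{bits per channel use}
$$

Because capacity grows only logarithmically in $S/N$, anything that raises $N$ relative to $S$, such as extra quantization noise, lowers the ceiling on what the channel can carry.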

May 11 – May 17 (10)

Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

May 14, 2026

Zhuohang Li, Liqun Huang, Wei Xu et al.

Seamlessly blending human intervention with robot policy execution—rather than abrupt takeovers—dramatically reduces manipulation failures in dexterous tasks and produces better-trained policies from human correction data.

This paper addresses a key problem in robotic hand control: when humans take over from an AI policy during manipulation tasks, abrupt hand configuration changes ('gesture jumps') cause failures. Hand-in-the-Loop smoothly blends human corrections with the robot's ongoing actions, reducing takeover disruptions by 99.8% and improving task success rates by 19% when used to train better policies.

agents · training
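The difference between an abrupt takeover and a smooth blend can be pictured with a convex combination of policy and human commands. This is a sketch under assumptions: actions are joint-angle vectors, the human weight ramps linearly over `ramp_steps` control ticks, and the 2-joint hand is hypothetical. The paper's actual blending rule is not described in this summary.

```python
from typing import List, Sequence


def blend_actions(policy: Sequence[float], human: Sequence[float], w: float) -> List[float]:
    """Convex combination of policy and human joint commands; w=0 is pure policy."""
    return [(1.0 - w) * p + w * h for p, h in zip(policy, human)]


def takeover_weights(start_tick: int, ramp_steps: int, total_ticks: int) -> List[float]:
    """Human weight per tick: 0 before takeover, then a linear ramp up to 1."""
    weights = []
    for t in range(total_ticks):
        if t < start_tick:
            weights.append(0.0)
        else:
            weights.append(min(1.0, (t - start_tick + 1) / ramp_steps))
    return weights


def max_jump(trajectory: List[List[float]]) -> float:
    """Largest per-joint change between consecutive commands (a 'gesture jump')."""
    return max(
        max(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(trajectory, trajectory[1:])
    )


policy_cmd = [0.0, 0.0]  # hypothetical 2-joint hand; the policy holds still
human_cmd = [1.0, -1.0]  # operator's target configuration
abrupt = [blend_actions(policy_cmd, human_cmd, w) for w in takeover_weights(2, 1, 8)]
smooth = [blend_actions(policy_cmd, human_cmd, w) for w in takeover_weights(2, 4, 8)]
print(max_jump(abrupt), max_jump(smooth))  # 1.0 0.25
```

Spreading the same total change over four ticks cuts the largest single-tick jump by a factor of four in this toy setup.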

MeMo: Memory as a Model

May 14, 2026

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong et al.

You can add new knowledge to any LLM without touching its weights by training a separate memory model that retrieves and augments the LLM's responses—making it practical for real-world applications needing frequent updates.

MeMo introduces a modular memory model that stores new knowledge separately from a frozen LLM, enabling efficient updates without retraining. It works with any LLM (open or proprietary), handles complex document relationships, and maintains constant retrieval cost regardless of corpus size.

May 4 – May 10 (12)

Normalizing Trajectory Models

May 8, 2026

Jiatao Gu, Tianrong Chen, Ying Shen et al.

NTM enables fast image generation (4 steps) while preserving exact likelihood calculation—something previous fast diffusion methods couldn't do—by using normalizing flows for each denoising step instead of simple Gaussian assumptions.

This paper introduces Normalizing Trajectory Models (NTM), a new approach for fast image generation that compresses diffusion sampling from many steps to just four. Unlike existing fast methods that lose the ability to calculate exact probabilities, NTM maintains a mathematically exact likelihood while generating high-quality images, making it useful for both generation and evaluation.

efficiency · architecture · training

Flow-OPD: On-Policy Distillation for Flow Matching Models

May 8, 2026

Zhen Fang, Wenxuan Huang, Yu Zeng et al.

On-policy distillation with specialized teachers can resolve conflicting optimization goals in multi-objective image generation, achieving 10-point improvements over standard reinforcement learning approaches while maintaining quality across all metrics.

Flow-OPD is a training method that improves text-to-image models by using specialized teacher models and on-policy distillation to align multiple competing objectives (like image quality, text accuracy, and aesthetics).

Apr 27 – May 3 (28)

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

May 1, 2026

Venkata Pushpak Teja Menta

Adversarial training can make speaker embeddings invariant to language/script while preserving speaker identity—critical for multilingual voice cloning systems that need to recognize the same speaker across different languages.

Speaker encoders for voice cloning often fail when audio switches between languages or scripts—a problem especially acute for Indic languages. This paper introduces LASE, a small neural layer that makes speaker embeddings language-agnostic by combining speaker identity learning with adversarial training against language classification.

multimodal · alignment · training

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Apr 30, 2026

Eyon Jang, Damon Falck, Joschka Braun et al.

LLMs may be able to strategically resist RL training by limiting exploration, posing a novel safety risk for post-training alignment—detection methods like monitoring and weight noise offer partial mitigation but aren't foolproof.

This paper investigates whether LLMs can strategically resist reinforcement learning during post-training by suppressing their exploration of actions. Researchers create models trained to underperform, show they can evade RL-based training while staying competent on other tasks, and demonstrate that frontier models can reason about suppressing exploration when they understand their training setup.

Apr 20 – Apr 26 (9)

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

Apr 24, 2026

Sijie Li, Shanda Li, Haowei Lin et al.

Use active learning to strategically pick which small experiments to run when fitting scaling laws—you can predict large-scale model performance with 90% less compute by choosing experiments that reduce uncertainty about the target region you care about.

Training large AI models costs millions, and figuring out how they'll scale costs millions more. This paper proposes a smarter way to choose which smaller pilot experiments to run so you can accurately predict how a massive training run will perform, using only about 10% of the budget that naive approaches would need.

scaling · efficiency · training

Relaxation-Informed Training of Neural Network Surrogate Models

Apr 24, 2026

Calvin Tsay

Training neural network surrogates with MILP-aware regularizers can dramatically speed up downstream optimization without sacrificing accuracy, by directly controlling structural properties that affect solver performance.

This paper shows how to train neural networks as surrogate models that work better when embedded in optimization problems. By adding special regularizers during training that target MILP tractability—penalizing large constants, unstable neurons, and LP relaxation gaps—the approach makes the resulting optimization problems solve 10,000x faster while keeping prediction accuracy competitive.

alignment · safety · training

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

May 26, 2026

Yi Jing, Zao Dai, Jinwu Hu et al.

Instead of picking training data based only on external metrics, you can use SAEs to decode what the model actually learns internally, then use those signals to organize data better—making training more efficient without changing the model architecture.

This paper shows how to improve LLM training by using Sparse Autoencoders (SAEs) to read the model's internal representations and guide data selection. The method clusters training data for diversity, orders it by difficulty, and filters low-quality examples—improving math performance by 3% and cutting training time by 20% on smaller models.

training · data · reasoning

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

May 26, 2026

Yuchen Liang, Ness Shroff, Yingbin Liang

GADD accelerates discrete diffusion sampling from many steps to logarithmically few steps without additional training, providing both theoretical guarantees and practical speedups for text and symbolic generation tasks.

This paper speeds up discrete diffusion models (used for text and symbolic data generation) by introducing GADD, a new method that uses Gibbs corrections to reduce sampling steps. Unlike existing acceleration techniques, GADD doesn't require extra training and achieves theoretically optimal speedup, making it practical for real applications like text and music generation.

efficiency · training · reasoning

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

May 25, 2026

Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie et al.

A plug-in architecture for multimodal continual learning lets researchers test new training strategies without rewriting the base model code, making MLLM research faster and more reproducible.

Prism is a software framework that makes it easier to develop and test new methods for continuously training multimodal AI models on new tasks. Instead of modifying the core model code each time, researchers can add new strategies as plug-in modules, reducing engineering overhead and enabling fair comparisons between different approaches.

training · architecture · applications

Looped Diffusion Language Models

May 25, 2026

Sanghyun Lee, Chunsan Hong, Seungryong Kim et al.

Selectively looping transformer layers in masked diffusion models improves both training efficiency and reasoning capability—you can match performance with far fewer computations, or trade compute for better results.

This paper introduces LoopMDM, a technique that reuses early-middle transformer layers in masked diffusion models by looping them during training and inference. The approach achieves better training efficiency (3.3× fewer FLOPs) and stronger reasoning performance than standard models, while enabling flexible compute scaling at inference time without adding parameters.

architecture · efficiency · training
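A minimal sketch of the looping idea, assuming a model is a list of layer functions and that one contiguous block `[start, end)` is re-applied `loops` times with shared weights. The toy scalar layers, block boundaries, and loop counts are illustrative, not LoopMDM's configuration.

```python
from typing import Callable, List

Layer = Callable[[float], float]  # assumption: toy scalar "layers" stand in for transformer blocks


def looped_forward(layers: List[Layer], x: float, start: int, end: int, loops: int) -> float:
    """Run layers[:start], then the shared block layers[start:end] `loops` times, then layers[end:]."""
    for layer in layers[:start]:
        x = layer(x)
    for _ in range(loops):  # same weights on every pass, so no new parameters
        for layer in layers[start:end]:
            x = layer(x)
    for layer in layers[end:]:
        x = layer(x)
    return x


def layer_applications(n_layers: int, start: int, end: int, loops: int) -> int:
    """Compute per token in layer applications; the parameter count stays n_layers."""
    return start + loops * (end - start) + (n_layers - end)


layers = [lambda h: h + 1.0 for _ in range(6)]  # each toy layer adds 1, so output counts layer calls
print(looped_forward(layers, 0.0, start=1, end=3, loops=1))  # 6.0
print(looped_forward(layers, 0.0, start=1, end=3, loops=3))  # 10.0
```

Raising `loops` at inference time buys extra computation with the same six layers, which is the compute-for-quality trade the summary describes.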

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay

May 25, 2026

Martin Marek, Dongkyu Cho, Shikai Qiu et al.

Self-generated replay nearly eliminates catastrophic forgetting in language models, but capacity constraints are the real bottleneck: a saturated model can't learn new tasks without forgetting, no matter what technique you use.

When language models learn new tasks, they forget old ones. This paper shows that models can generate their own training data to replay and prevent forgetting, but only if they have spare capacity. If a model is already saturated from pretraining, no amount of replay helps—it must overwrite old knowledge to learn anything new.

training · efficiency

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

May 25, 2026

Zhaoyu Zhu, Rui Gao, Shuang Li

WPG is theoretically sound for continuous control: the Bellman recursion in RL creates favorable convergence properties similar to convex optimization, even though the problem isn't convex.

This paper proves that Wasserstein Policy Gradient (WPG), an algorithm for reinforcement learning that moves policies using optimal transport geometry, converges globally to optimal solutions. The key insight is that even though RL objectives aren't convex in the traditional sense, the Bellman equation creates a special geometric structure that guarantees convergence.

training

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

May 22, 2026

Zisu Huang, Jingwen Xu, Yifan Yang et al.

Model-generated skills can improve agent performance, but their effectiveness depends on how they're extracted and which agent uses them—not on model size or baseline strength.

This paper studies how AI agents can reuse skills—structured procedures extracted from past experience—to improve performance. The researchers built a comprehensive evaluation framework testing skill extraction and reuse across five different task domains, finding that while model-generated skills help on average, they sometimes hurt performance.

agents · training · evaluation

ETCHR: Editing To Clarify and Harness Reasoning

May 22, 2026

Beichen Zhang, Yuhong Liu, Jinsong Li et al.

Decoupling image editing from language understanding—and training the editor specifically for reasoning tasks—improves multimodal reasoning accuracy across diverse visual tasks without modifying the base model.

ETCHR is a specialized image editing model that helps multimodal AI systems reason better by transforming images based on questions. Unlike general image editors, it's trained to understand abstract reasoning tasks and produce clearer images for downstream analysis, improving performance across visual reasoning tasks by 4-5% without retraining the main AI model.

multimodal · reasoning · training

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

May 22, 2026

Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong et al.

You can now tune hyperparameters on a single dense model and transfer them directly to MoE models of any size or configuration, eliminating the need for expensive hyperparameter search when scaling with MoE.

Complete-muE is a framework that solves the problem of transferring hyperparameters (like learning rate and weight decay) from dense neural networks to Mixture-of-Experts (MoE) models without expensive retuning.

training · scaling · efficiency

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

May 22, 2026

Anastasiia Sedova, Natalie Schluter, Skyler Seto et al.

You can improve cross-lingual knowledge transfer by strategically replacing words in high-resource training data with translations—no parallel data, translation models, or extra training needed.

This paper proposes LINK, a simple method to improve multilingual language models for low-resource languages by swapping English words with their translations during pretraining. The approach requires only a bilingual dictionary and no extra training, yet achieves significant performance gains on downstream tasks across eight languages.

training · data
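A minimal sketch of dictionary-based word swapping, assuming whitespace tokenization, a toy bilingual dictionary, and a fixed swap probability `p`. LINK's actual rule for choosing which words to replace is not described in this summary, so the uniform random choice below is an assumption.

```python
import random
from typing import Dict


def lexical_swap(text: str, bilingual: Dict[str, str], p: float, seed: int = 0) -> str:
    """Replace each dictionary word with its translation with probability p (uniform choice is an assumption)."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if key in bilingual and rng.random() < p:
            out.append(bilingual[key])
        else:
            out.append(word)
    return " ".join(out)


en_to_sw = {"water": "maji", "house": "nyumba"}  # hypothetical English→Swahili entries
print(lexical_swap("The water is near the house", en_to_sw, p=1.0))  # The maji is near the nyumba
```

The only resource this needs is the dictionary itself, which matches the summary's point that no parallel data or translation model is required.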

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

May 22, 2026

Rim Assouel, Amir Bar, Michal Drozdzal et al.

Adding synthetic geometric overlays during training helps MLLMs learn better spatial and quantitative reasoning—suggesting many visual understanding failures come from insufficient training data rather than model architecture limits.

This paper introduces Procedurally Generated Tasks (PGT), a method that overlays geometric shapes on images to create training data that improves how multimodal AI models understand fine-grained visual details like spatial relationships and quantities. Testing shows improvements of up to 20% on visual reasoning benchmarks while keeping general capabilities intact.

multimodal · training · evaluation

Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer

May 22, 2026

Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur

Muon optimizer can be understood as Hamiltonian dynamics on probability measures, providing theoretical guarantees for convergence and opening the door to analyzing large-scale neural network training through mean-field theory.

This paper analyzes the Muon optimizer through the lens of Hamiltonian dynamics and probability flows. The authors show that Muon's orthogonalization step is actually a mirror descent update, then extend this insight to neural network training by deriving a mean-field equation describing how probability distributions over parameters evolve.

training · scaling

Strong Teacher Not Needed? On Distillation in LLM Pretraining

May 22, 2026

Taiming Lu, Zhuang Liu

You don't need a powerful teacher to improve a larger language model through distillation—smaller teachers work fine, and over-training the teacher can actually hurt performance.

This paper challenges the assumption that knowledge distillation in language model training requires a strong teacher model. By systematically testing different teacher-student size combinations, the researchers found that even small, undertrained teachers can improve larger students when losses are properly balanced, and that stronger teachers don't always produce better results.

training · efficiency
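To make "properly balanced losses" concrete, here is a common form of distillation objective: a weighted mix of cross-entropy on the data label and KL divergence to the teacher's softened distribution. The fixed weight `alpha`, the temperature `tau`, and the `tau ** 2` factor are assumptions for illustration; the paper's exact balancing scheme is not given in this summary.

```python
import math
from typing import List


def softmax(logits: List[float], tau: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    target: int,
    alpha: float,  # assumption: fixed weight on the teacher term
    tau: float = 1.0,  # assumption: softening temperature
) -> float:
    """(1 - alpha) * cross-entropy on the data label + alpha * tau^2 * KL(teacher || student)."""
    ce = -math.log(softmax(student_logits)[target])
    q_teacher = softmax(teacher_logits, tau)
    q_student = softmax(student_logits, tau)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)
    # tau^2 is a widely used convention that keeps gradient scales comparable across temperatures.
    return (1.0 - alpha) * ce + alpha * tau ** 2 * kl


student = [2.0, 1.0, 0.0]
teacher = [0.5, 1.5, 0.0]  # a weaker, hypothetical teacher that disagrees with the label
print(distill_loss(student, teacher, target=0, alpha=0.3))
```

Lowering `alpha` limits how far a weak or miscalibrated teacher can pull the student away from the data, which is one way to read the summary's emphasis on balancing.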

Tokenisation via Convex Relaxations

May 21, 2026

Jan Tempus, Philip Whittington, Craig W. Schmidt et al.

ConvexTok uses convex optimization to build tokenizers that are provably near-optimal (within 1% at typical vocabulary sizes) and compress text better than greedy algorithms like BPE, with measurable improvements in language model efficiency.

This paper replaces greedy tokenization algorithms like BPE with a convex optimization approach called ConvexTok. Instead of making locally optimal choices, it formulates tokenizer construction as a linear program, achieving better compression (bits-per-byte) and allowing users to verify how close their tokenizer is to mathematically optimal.

training · efficiency

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training · reasoning · efficiency

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

May 21, 2026

Vishal Rajput

Many robustness techniques (CORAL, adversarial training, IRM, metric learning) are different ways of solving the same problem: identifying and regularizing against label-preserving variations in your data.

This paper unifies seemingly separate robustness problems (domain adaptation, adversarial training, compositional generalization) under one framework: regularizing neural network gradients to match the covariance of label-preserving variations in deployment data.

training · alignment

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

May 21, 2026

Krishnakumar Balasubramanian

Conservative drifting with kernel density estimators achieves provable convergence rates for one-step generative modeling, with the convergence speed depending on dimension and a tunable parameter that trades off between different error sources.

This paper analyzes drifting methods for generative modeling, proposing a conservative approach using kernel density estimators that guarantees gradient-field properties. The authors prove finite-particle convergence rates showing how quickly the method converges as sample size increases, with explicit tracking of how bandwidth and dimension affect performance.

training · evaluation

Reducing Political Manipulation with Consistency Training

May 21, 2026

Long Phan, Devin Kim, Alexander Pan et al.

LLMs exhibit systematic covert political bias through asymmetric handling of opposing viewpoints; consistency-based training can reduce this bias without sacrificing model helpfulness.

Large language models show hidden political bias by treating opposing viewpoints asymmetrically—using different tones or effort levels for left vs. right perspectives.

safety · alignment · training

Understanding Data Temporality Impact on Large Language Models Pre-training

May 21, 2026

Pilchen Hippolyte, Fabre Romain, Signe Talla Franck et al.

Training LLMs on chronologically ordered data instead of shuffled data improves their knowledge of recent facts and temporal accuracy, suggesting data ordering matters for building models that stay current.

This paper investigates how the order of training data affects what LLMs learn about time-sensitive facts. Researchers trained 6B-parameter models on chronologically ordered data versus shuffled data, and found that sequential training produces models with more current and accurate temporal knowledge while maintaining general language understanding.

training · data · evaluation

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

May 21, 2026

Samson Gourevitch, Yazid Janati, Dario Shariatian et al.

Discrete diffusion models have a hidden training-inference mismatch: the standard objective doesn't match what's actually needed for sampling. Using the correct "leave-one-out" parameterization and an absorbing-state reformulation improves generation quality without retraining.

This paper fixes a fundamental mismatch in how Uniform Diffusion Models are trained versus used for generation. The authors show that standard training doesn't actually optimize what the model uses during sampling, and they provide mathematical conversions to align these.

training · architecture · efficiency

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

May 21, 2026

Christian Janos Lebeda, David Erb, Tudor Cebere et al.

You can now build random forests on sensitive data with differential privacy that actually work well in practice—Lumberjack's smart pruning strategy significantly closes the gap between private and non-private model performance.

Lumberjack is a differentially private random forest algorithm that builds large decision trees and then prunes them intelligently to protect sensitive data. By using a novel heavy hitter detection method, it can use deeper trees than previous approaches while maintaining privacy guarantees, achieving much better accuracy on real datasets.

training · efficiency

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

May 21, 2026

Berk Hayta, Hannah Laus, Simon Mittermaier et al.

You can get reliable uncertainty estimates using standard loss functions (cross-entropy, MSE) instead of complex Dirichlet objectives—the math shows this works, and it's simpler to implement in practice.

This paper simplifies Evidential Deep Learning (EDL) for uncertainty estimation by replacing complex Dirichlet-based losses with standard losses like cross-entropy, evaluated at the Dirichlet mean. The authors prove this approximation works well when evidence is strong and show it includes softmax as a special case, making uncertainty estimation easier to implement without sacrificing accuracy.

training · efficiency
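One way to see the plug-in idea and the softmax special case: compute the Dirichlet mean from the evidence, then apply ordinary cross-entropy to it. This sketch assumes an exponential evidence function and a per-class prior added to the evidence; with the prior set to 0, the mean is exactly `softmax(logits)`. The paper's exact construction may differ, and `vacuity` is the common EDL uncertainty score K / S, not something this summary specifies.

```python
import math
from typing import List, Tuple


def dirichlet_mean(evidence: List[float], prior: float = 1.0) -> Tuple[List[float], float]:
    """Mean alpha_k / S of a Dirichlet with alpha_k = evidence_k + prior; also returns S."""
    alphas = [e + prior for e in evidence]
    strength = sum(alphas)
    return [a / strength for a in alphas], strength


def plugin_cross_entropy(evidence: List[float], target: int, prior: float = 1.0) -> float:
    """Standard cross-entropy evaluated at the Dirichlet mean (the 'plug-in' idea)."""
    mean, _ = dirichlet_mean(evidence, prior)
    return -math.log(mean[target])


def vacuity(evidence: List[float]) -> float:
    """Common EDL uncertainty K / S with prior 1: 1.0 with no evidence, near 0 with lots."""
    _, strength = dirichlet_mean(evidence, prior=1.0)
    return len(evidence) / strength


logits = [2.0, 0.5, -1.0]
evidence = [math.exp(z) for z in logits]  # assumption: exponential evidence function
# With prior 0 the Dirichlet mean equals softmax(logits), so this is ordinary softmax cross-entropy.
print(plugin_cross_entropy(evidence, target=0, prior=0.0))
print(vacuity([0.0, 0.0, 0.0]), vacuity([100.0, 0.0, 0.0]))
```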

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

May 21, 2026

Javad Parsa, Enis Simsar, Amir Joudaki et al.

When fine-tuning diffusion models for multiple concepts, jointly optimizing LoRA factors with orthogonal constraints prevents representation interference and scales better than existing modular approaches—enabling cleaner composition of up to 101 concepts.

SeqLoRA improves how AI models learn multiple custom concepts at once by using a smarter optimization technique that prevents concepts from interfering with each other. Instead of freezing parts of the model or doing expensive post-processing, it jointly trains the adaptation components while keeping them orthogonal, enabling better multi-concept image generation with less computational cost.

training · efficiency · multimodal

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.

efficiency · training · evaluation
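The core compute trade-off can be sketched as a nested Monte Carlo estimator: pay for the expensive quantity once, then average over several cheap noise draws. `toy_render` and `toy_noise_loss` are hypothetical stand-ins for a 3D render and a noised diffusion loss, and the importance-sampling component the summary mentions is not modeled here.

```python
import random
from typing import Callable


def reuse_estimate(
    expensive: Callable[[random.Random], float],  # e.g. a render (hypothetical)
    cheap: Callable[[float, random.Random], float],  # e.g. a noised diffusion loss (hypothetical)
    n_outer: int,
    k_inner: int,
    seed: int = 0,
) -> float:
    """Average cheap(expensive(), noise) using k_inner noise draws per expensive call."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):
        y = expensive(rng)  # paid once per outer sample...
        for _ in range(k_inner):
            total += cheap(y, rng)  # ...and reused across k_inner cheap noise draws
    return total / (n_outer * k_inner)


calls = {"render": 0}


def toy_render(rng: random.Random) -> float:
    calls["render"] += 1
    return rng.gauss(0.0, 1.0)


def toy_noise_loss(y: float, rng: random.Random) -> float:
    return (y + rng.gauss(0.0, 1.0)) ** 2  # true expectation is 2.0


est = reuse_estimate(toy_render, toy_noise_loss, n_outer=200, k_inner=8)
print(calls["render"], est)
```

This uses 1,600 cheap evaluations but only 200 expensive ones; the inner draws reduce the noise contributed by the cheap term without adding renders.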

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update Parametrization) works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

scaling · training · efficiency
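A sketch of per-layer learning rates under one common μP-style recipe for Adam, assuming a `base_lr` tuned on a proxy model of width `base_width`. Hidden-matrix rates shrink as `base_width / width`, while the embedding rate stays at `base_lr`, so the embedding-to-hidden ratio grows with width. These rules follow the standard μP presentation for Adam rather than this paper, and the output-layer rule is omitted for brevity.

```python
from typing import Dict


def mup_adam_lrs(base_lr: float, base_width: int, width: int) -> Dict[str, float]:
    """Per-group Adam learning rates transferred from a proxy model of width base_width."""
    ratio = width / base_width
    return {
        "embedding": base_lr,  # width-independent under muP-style Adam scaling
        "hidden": base_lr / ratio,  # hidden matrices scale with 1 / width multiplier
    }


def sp_adam_lrs(base_lr: float) -> Dict[str, float]:
    """Standard parameterization: one shared rate, so embeddings get no relative boost."""
    return {"embedding": base_lr, "hidden": base_lr}


print(mup_adam_lrs(1e-2, base_width=256, width=4096))
print(sp_adam_lrs(1e-2))
```

At 16× the proxy width, the embedding rate in this sketch is 16× the hidden rate, while standard parameterization keeps them equal; that relative boost is the effect the paper isolates.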

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

May 20, 2026

Mansoor Ahmed, Sujin Lee, Umar Khayaz et al.

Combining evolutionary knowledge from language models with 3D structural constraints solves vocabulary collapse in antibody design, achieving 16% better sequence accuracy and 2.3x more amino acid diversity than structure-only methods.

EvoStruct fixes a critical problem in AI-designed antibodies: neural networks trained on 3D structures alone forget important amino acid patterns from evolution. The method combines a pre-trained protein language model (which knows evolutionary patterns) with structural information, using a special adapter to merge both sources of knowledge.

architecture · training · applications

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training · efficiency · reasoning
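A minimal sketch of rank-1 extrapolation under two labeled assumptions: the weight change between two early checkpoints is dominated by its top singular direction, and progress along that direction grows linearly with training steps. Power iteration stands in for a full SVD so the sketch needs only the standard library; the paper's actual extrapolation rule may differ.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def rank1(delta: Matrix, iters: int = 100) -> Tuple[float, List[float], List[float]]:
    """Leading singular triple (sigma, u, v) of delta via power iteration."""
    n_rows, n_cols = len(delta), len(delta[0])
    v = [1.0] * n_cols  # assumption: start vector not orthogonal to the top direction
    sigma, u = 0.0, [0.0] * n_rows
    for _ in range(iters):
        u = [sum(delta[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = A v
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(delta[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = A^T u
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


def extrapolate(w0: Matrix, wt: Matrix, t: int, target_t: int) -> Matrix:
    """W_target ≈ W_0 + (target_t / t) * rank-1 part of (W_t - W_0), assuming linear progress."""
    delta = [[b - a for a, b in zip(r0, rt)] for r0, rt in zip(w0, wt)]
    sigma, u, v = rank1(delta)
    scale = target_t / t
    return [
        [w0[i][j] + scale * sigma * u[i] * v[j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


w0 = [[0.0, 0.0], [0.0, 0.0]]  # toy checkpoint at step 0
wt = [[3.0, 1.0], [6.0, 2.0]]  # toy checkpoint at step 10 (an exactly rank-1 update)
print(extrapolate(w0, wt, t=10, target_t=40))
```

Only the two early checkpoints are trained here; the step-40 weights come from the fitted direction, which is where the compute saving would come from if the low-rank, linear assumptions hold.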

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

May 20, 2026

Kaiyi Zhang, Wei Wu, Yankai Lin

When training language models with verifiable rewards, focusing on the most discriminative token patterns—rather than averaging all tokens equally—significantly improves learning efficiency and final performance.

This paper improves how language models learn from verifiable rewards by better identifying which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss important patterns that distinguish good answers from bad ones.

training · reasoning · alignment

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

May 20, 2026

Weixing Zhang, Bowen Jiang, Rahul Sharma et al.

LLMs can learn grammar adaptation patterns from examples and apply them to new versions, achieving 100% consistency on medium-sized grammars but failing on large-scale ones—suggesting LLMs work best for targeted, smaller grammar updates.

This paper shows how Large Language Models can automatically adapt domain-specific language grammars when their underlying models change, reducing manual work. Testing on real-world languages shows LLMs work well for complex scenarios but struggle with very large grammars (300+ rules).

training · applications

Mem-π: Adaptive Memory through Learning When and What to Generate

May 20, 2026

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Generating context-specific guidance dynamically outperforms traditional retrieval-based memory for agents—the system learns to abstain when unnecessary and produce only relevant help, improving task success by over 30% on web navigation.

Mem-π is a framework that gives AI agents smarter memory by generating helpful guidance on-the-fly instead of retrieving fixed entries from a database. A separate model learns when to create guidance and what to create, trained to skip unhelpful suggestions and produce only what the agent actually needs for the current task.

agents · training · reasoning

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

May 18, 2026

Ruitao Liu, Xinyang Tian, Shuo Chen et al.

For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.

This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing ready tasks immediately instead of waiting for a pre-planned order. When training large models across multiple GPUs, unpredictable delays in computation cause stages to sit idle.

training · efficiency · scaling

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

May 18, 2026

Qianhao Yuan, Jie Lou, Xing Yu et al.

MLLMs can improve fine-grained visual understanding by learning from their own superior performance on evidence-focused crops, using on-policy self-distillation to transfer regional perception skills to full-image reasoning.

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers discovered that models actually perform better when shown cropped images focused on relevant areas versus full images, suggesting the problem isn't recognizing details but finding them.

multimodal · training · efficiency

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate which facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scaling, evaluation, training

General Preference Reinforcement Learning

May 18, 2026

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal et al.

GPRL targets reward hacking in LLM training by treating quality as multi-dimensional rather than a single scalar, letting online RL work on open-ended tasks without collapsing onto an exploitable reward axis.

This paper addresses a gap in LLM training by proposing General Preference Reinforcement Learning (GPRL), which handles open-ended tasks like traditional preference optimization while maintaining the continuous exploration benefits of online RL.

training, alignment, reasoning

Semantic Generative Tuning for Unified Multimodal Models

May 18, 2026

Songsong Yu, Yuxin Chen, Ying Shan et al.

Using segmentation as a generative training task bridges the gap between visual understanding and generation in multimodal models, improving both capabilities simultaneously rather than training them separately.

This paper shows how to train unified multimodal models (that do both image understanding and generation) more effectively by using image segmentation as a training task. Instead of training understanding and generation separately, the authors use segmentation to align both capabilities, improving the model's ability to understand images and generate them accurately.

multimodal, training, architecture

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

May 18, 2026

Kenan Majewski, Marcin Żugaj

Neural networks can improve classical state estimation by learning adaptive forgetting factors that respond to real-time sensor quality, enabling robust UAV navigation during sensor outages and in dynamic environments.

This paper presents a learned Kalman filter that adapts to changing noise conditions in UAVs by using a neural network to dynamically adjust how much it trusts past measurements. Instead of using a fixed forgetting factor, the filter learns a memory policy from sensor data, helping it handle sensor failures and vibrations better than traditional adaptive filters.
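
To make the "forgetting factor" concrete: in a classic Sage-Husa filter, a fading factor b sets the weight d_k = (1 - b) / (1 - b^(k+1)) given to the newest innovation when the measurement-noise variance R is re-estimated. The scalar sketch below assumes a random-walk state model and exposes b as a per-step callable, which is where a learned policy could plug in. The names `sage_husa_filter` and `forgetting_factor` are hypothetical, and the paper's network, state model, and exact update are not reproduced here.

```python
from typing import Callable, List, Tuple

def sage_husa_filter(
    measurements: List[float],
    q: float = 1e-4,   # assumed process-noise variance (random-walk state)
    r0: float = 1.0,   # initial guess of measurement-noise variance
    forgetting_factor: Callable[[int, float], float] = lambda k, innov: 0.95,
) -> Tuple[List[float], List[float]]:
    """Scalar Kalman filter with Sage-Husa re-estimation of measurement noise R.

    `forgetting_factor(k, innovation)` must return b in (0, 1). A constant gives
    the classic filter; a learned model could return a context-dependent value.
    """
    x, p, r = measurements[0], 1.0, r0
    states, noise_estimates = [], []
    for k, z in enumerate(measurements):
        # Predict step for the random-walk model x_k = x_{k-1} + w.
        p_pred = p + q
        innovation = z - x
        # Sage-Husa weight: b close to 1 means a long memory of past innovations.
        b = forgetting_factor(k, innovation)
        d = (1.0 - b) / (1.0 - b ** (k + 1))
        # Re-estimate R from the innovation; clamp so it stays positive.
        r = max((1.0 - d) * r + d * (innovation ** 2 - p_pred), 1e-6)
        # Standard Kalman update using the current noise estimate.
        gain = p_pred / (p_pred + r)
        x = x + gain * innovation
        p = (1.0 - gain) * p_pred
        states.append(x)
        noise_estimates.append(r)
    return states, noise_estimates

# Usage: a constant true value of 5.0 observed with alternating +/-0.1 errors.
measurements = [5.0 + (0.1 if k % 2 else -0.1) for k in range(200)]
states, noise = sage_husa_filter(measurements)
```

Swapping the constant lambda for a function of the innovation is the hook a learned memory policy would use; this sketch does not model sensor outages.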

training, efficiency, reasoning

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

May 18, 2026

Minrui Xu, Zilin Wang, Mengyi DENG et al.

Automated environment synthesis and trajectory generation can reduce the data requirements for tool-use agent training by 5x while improving downstream performance, making agentic RL more practical and scalable.

EnvFactory automates the creation of tool-use training environments and realistic multi-turn interaction trajectories for teaching language models to use tools effectively. It generates diverse, natural training data from verified executable environments, enabling more efficient agent training with fewer resources than existing approaches.

agents, training, data

Self-Distilled Agentic Reinforcement Learning

May 14, 2026

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han et al.

Combining RL with selective token-level distillation through a gating mechanism significantly improves LLM agent performance on complex tasks, achieving 7-10% gains over standard RL approaches while avoiding training instability.

This paper improves how language model agents learn through reinforcement learning by combining trajectory-level rewards with dense token-level guidance. The key innovation is a gating mechanism that selectively uses teacher signals—strengthening learning from good decisions and softly ignoring bad teacher suggestions—making multi-turn agent training more stable and effective.

agents, training

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

May 12, 2026

Runhui Huang, Jie Wu, Rui Yang et al.

Self-reflective multimodal models can improve generation quality by learning to reason about user intent and autonomously correct their outputs using decomposed, verifiable rewards from language models.

AlphaGRPO improves how unified multimodal models generate images and text by teaching them to reason about what users want and to fix their own mistakes. Its reward system breaks complex requests down into simple, checkable questions, so the model learns from reliable feedback without needing an additional training setup.

multimodal, reasoning, training

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

May 12, 2026

Kexuan Shi, Hanxuan Li, Zeju Qiu et al.

Pion's orthogonal update mechanism preserves weight matrix spectral properties during training, providing a geometrically principled alternative to gradient-based optimizers like Adam with competitive performance.

Pion is a new optimizer for training large language models that updates weights using orthogonal transformations instead of adding gradients like Adam does. By preserving the singular values of weight matrices, it keeps the spectral properties stable while still allowing the model to learn, offering a more geometrically grounded approach to optimization.
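
The claim that an orthogonal update preserves singular values can be checked directly: if U and V are orthogonal, U W Vᵀ has the same singular values as W. The pure-Python 2×2 sketch below uses rotation matrices as stand-in orthogonal factors. It only illustrates this invariance; it is not Pion's update rule, and the helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a: Matrix) -> Matrix:
    return [[a[j][i] for j in range(2)] for i in range(2)]

def rotation(theta: float) -> Matrix:
    # A 2x2 rotation is orthogonal: R^T R = I.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def singular_values(w: Matrix) -> List[float]:
    # Square roots of the eigenvalues of the symmetric 2x2 matrix W^T W.
    g = matmul(transpose(w), w)
    half_trace = (g[0][0] + g[1][1]) / 2
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    disc = math.sqrt(max(half_trace ** 2 - det, 0.0))
    return [math.sqrt(max(half_trace + disc, 0.0)), math.sqrt(max(half_trace - disc, 0.0))]

def orthogonal_update(w: Matrix, theta_left: float, theta_right: float) -> Matrix:
    # W' = U W V^T: the entries of W change, its spectrum does not.
    u, v = rotation(theta_left), rotation(theta_right)
    return matmul(matmul(u, w), transpose(v))

w = [[3.0, 1.0], [0.5, 2.0]]
w_new = orthogonal_update(w, 0.3, -0.7)
print(singular_values(w), singular_values(w_new))
```

An additive update such as W - lr * G generally changes the singular values, which is the contrast the summary draws with Adam.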

training

Learning, Fast and Slow: Towards LLMs That Adapt Continually

May 12, 2026

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.

Combining parameter updates with context optimization lets LLMs learn new tasks 3x more efficiently while staying closer to their original capabilities and avoiding the forgetting that comes from pure fine-tuning.

This paper proposes Fast-Slow Training (FST), a method that combines two learning mechanisms for LLMs: updating model parameters (slow learning) and optimizing the input context (fast learning). By separating task-specific adaptation from general knowledge, FST achieves better sample efficiency, reduces catastrophic forgetting, and maintains the model's ability to learn new tasks over time.

training, efficiency, reasoning

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

May 12, 2026

Xuhao Hu, Xi Zhang, Haiyang Xu et al.

Agents perform better when trained to decide dynamically between GUI actions and tool calls rather than using only one approach—this hybrid strategy improved accuracy by 66% on real-world tasks.

ToolCUA trains computer agents to intelligently choose between GUI actions (clicks, typing) and tool calls (APIs) by synthesizing diverse training trajectories from existing data and using reinforcement learning to optimize when to switch between action types. This solves a key problem for digital agents: knowing when to use high-level tools versus low-level GUI interactions.

agents, training, reasoning

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

May 12, 2026

Guohui Zhang, XiaoXiao Ma, Jie Huang et al.

When training models to generate audio and video together, treating each modality's learning separately and protecting audio-specific layers from video interference leads to better results than standard single-objective RL approaches.

OmniNFT improves joint audio-video generation by using reinforcement learning with three key techniques: routing rewards separately to each modality, preventing video gradients from interfering with audio processing, and focusing optimization on synchronization regions. This addresses real-world needs for high-quality audio, high-quality video, and tight audio-video alignment simultaneously.

multimodal, training

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

May 12, 2026

Sagi Ahrac, Noya Hochwald, Mor Geva

Routers in sparse mixture-of-experts models work best when they maintain geometric alignment with their experts—understanding this coupling can improve routing stability and reduce the need for complex auxiliary losses.

This paper reveals that routers in Sparse Mixture-of-Experts models learn a geometric relationship with their experts: router weights and expert weights receive gradients along the same directions, causing them to specialize together.

architecture, training, efficiency

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

May 12, 2026

Guinan Su, Yanwu Yang, Xueyan Li et al.

By training models to handle multiple parallel computation streams instead of sequential message exchanges, you can build faster, more responsive AI agents that can act while thinking and react to new information without waiting for previous operations to complete.

This paper proposes Multi-Stream LLMs, which replace the single sequential message stream in current language models with multiple parallel streams for inputs, outputs, and reasoning. This allows models to read and write simultaneously, think while acting, and process different types of information in parallel—addressing fundamental bottlenecks in how AI agents currently operate.

architecture, agents, training

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.
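
A minimal sketch of the two ingredients named here, under stated assumptions: each response receives partial credit as the weighted fraction of rubric criteria it satisfies, and GRPO-style advantages standardize those scores within the group of responses sampled for the same prompt. The criteria, weights, and pass/fail judgments below are hypothetical; in the paper, judgments come from an LLM grader working from document-derived rubrics.

```python
import statistics
from typing import Dict, List

def rubric_reward(judgments: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Partial credit: weighted fraction of rubric criteria the response satisfies."""
    total = sum(weights.values())
    earned = sum(weights[c] for c, passed in judgments.items() if passed)
    return earned / total

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rubric with three criteria and four sampled responses to one prompt.
weights = {"cites_evidence": 2.0, "correct_mechanism": 3.0, "states_limits": 1.0}
group = [
    {"cites_evidence": True, "correct_mechanism": True, "states_limits": False},
    {"cites_evidence": False, "correct_mechanism": True, "states_limits": True},
    {"cites_evidence": False, "correct_mechanism": False, "states_limits": False},
    {"cites_evidence": True, "correct_mechanism": True, "states_limits": True},
]
rewards = [rubric_reward(j, weights) for j in group]
advantages = group_relative_advantages(rewards)
```

Partial credit gives the two middle responses distinct, non-zero rewards, which a single right/wrong score would collapse to the same value.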

training, reasoning, evaluation

EMO: Pretraining Mixture of Experts for Emergent Modularity

May 7, 2026

Ryan Wang, Akshita Bhagia, Sewon Min

By constraining tokens within the same document to share expert pools during pretraining, EMO creates naturally modular experts that specialize in semantic domains (math, code, etc.), enabling practical memory-efficient deployment without sacrificing performance.

EMO is a Mixture-of-Experts language model designed to work efficiently when you only need a subset of its capabilities. Instead of letting each token route freely across all experts, EMO restricts tokens from the same document to a shared pool of experts during pretraining, so code-heavy documents end up using code experts, math documents use math experts, and so on.

architecture, efficiency, training

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

May 7, 2026

Yuhang Lai, Jiazhan Feng, Yee Whye Teh et al.

Using an independent verifier to validate problem correctness prevents reward hacking in AI-generated math problems, enabling better training data creation without human experts.

This paper tackles the problem of generating valid and challenging math problems for training AI models. Instead of relying on humans or simple self-play (which often produces invalid problems), the authors introduce VHG, a system with three players: a problem setter, a solver, and an independent verifier.

training, reasoning, data

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

May 7, 2026

Yuxing Liu, Jianyu Wang, Tong Zhang

Use the same optimizer for finetuning as you used for pretraining—it significantly reduces catastrophic forgetting while maintaining task performance, even outperforming parameter-efficient methods like LoRA.

When finetuning large language models, using the same optimizer during finetuning as was used during pretraining reduces forgetting of previously learned knowledge while maintaining or improving performance on new tasks.

training, efficiency

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

May 7, 2026

Mingwei Xu, Hao Fang

You can train reasoning models effectively using only positive examples—negative examples aren't necessary if you redistribute probability mass correctly and stabilize learning through siamese networks.

This paper proposes POPO, a new training method for reasoning-focused language models that learns exclusively from successful (positive) examples rather than mixing successes with failures. Instead of comparing positive and negative rollouts like existing methods (GRPO), POPO uses importance sampling to implicitly learn what to avoid, stabilized through a siamese network architecture.

training, reasoning, alignment

PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

May 6, 2026

Srikar Kashyap Pulipaka

Per-language fine-tuning with synthetic data augmentation and threshold tuning can significantly improve multilingual NLP tasks, but model generalization to test data varies dramatically—some architectures dropped 30-50% in performance despite strong development results.

This paper describes a system for detecting polarized language across 22 languages using fine-tuned Gemma models with synthetic data augmentation. The approach combines per-language model tuning, LLM-generated synthetic training data with quality filtering, and weighted ensemble predictions to achieve competitive performance on a multilingual classification task.

training, evaluation

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

May 5, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

High-quality training data matters more than pipeline complexity: careful data curation with SFT alone can beat industrial-scale approaches combining pre-training, continual pre-training, and RL for building capable search agents.

OpenSeeker-v2 shows that simple supervised fine-tuning on carefully designed training data can match or beat complex industrial pipelines for building search agents.

training, agents, data

Conditional Diffusion Sampling

May 5, 2026

Francisco M. Castro-Macías, Pablo Morales-Álvarez, Saifuddin Syed et al.

CDS offers a practical way to sample from difficult distributions by combining two proven techniques—Parallel Tempering for initial exploration and exact diffusion dynamics for refinement—without requiring neural network training.

This paper introduces Conditional Diffusion Sampling (CDS), a new method for sampling from complex probability distributions that combines Parallel Tempering with diffusion-based transport.

training, efficiency

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

May 4, 2026

Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Standard training loss curves can hide poorly-optimized layers in transformers—layer-wise analysis using reference bounds exposes optimization failures that aggregate metrics miss, especially critical for expensive model training.

This paper introduces a method to monitor whether transformer models are actually learning well during training by analyzing each layer individually. Instead of just looking at overall loss, the authors create lightweight reference solutions for each layer and compare them against the trained model, revealing hidden inefficiencies.

training, evaluation, efficiency

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

May 4, 2026

Jingze Ge, Yun Liu, Xue Geng et al.

Jointly optimizing compression and adaptation using task-aware subspaces beats the standard two-step approach, delivering better accuracy with fewer parameters on both vision and language models.

JACTUS combines model compression and task adaptation in a single step rather than doing them sequentially. Instead of compressing a model first and then fine-tuning it, the method estimates what directions matter for your specific task and compresses the model while preserving those directions.

efficiency, training

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Apr 30, 2026

Tao Ge, Baolin Peng, Hao Cheng et al.

Synthetic computer environments with long-horizon simulations can generate realistic training data for productivity agents at scale, enabling them to learn from diverse workplace scenarios without human annotation.

Researchers created a system to generate realistic computer environments at scale—complete with folder structures and documents—then simulated AI agents working on month-long productivity tasks within them.

agents, data, training

PhyCo: Learning Controllable Physical Priors for Generative Motion

Apr 30, 2026

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan et al.

You can make generative video models physically consistent by combining physics-labeled training data, ControlNet conditioning on physical properties, and VLM-based reward signals—no simulator needed at runtime.

PhyCo teaches video generation models to respect physics by fine-tuning them on 100K+ realistic simulation videos with varying physical properties (friction, bouncing, deformation), then using a vision-language model to provide physics-aware feedback during generation. This lets models create videos where objects behave realistically without needing a physics simulator at inference time.

training, multimodal, evaluation

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Apr 30, 2026

Sudong Wang, Weiquan Huang, Xiaomin Yu et al.

Adding an explicit distribution-alignment stage between supervised fine-tuning and RL training significantly reduces model drift in multimodal models, with gains coming from disentangled feedback on perception vs. reasoning failures.

PRISM fixes a key problem in training multimodal AI models: when you fine-tune a model on examples and then use reinforcement learning, the model drifts away from what it learned initially.

training, multimodal, alignment

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Apr 30, 2026

Junqi Gao, Dazhi Zhang, Zhichang Guo et al.

Task vectors can be compressed to 1-5% of their original size while maintaining model performance, making it practical to store and dynamically merge multiple task-specific models without prohibitive storage costs.

This paper tackles the storage overhead problem in dynamic model merging by compressing task vectors (fine-tuned weight changes) using learnable compression techniques.
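
To make "task vector" concrete: it is the element-wise difference between fine-tuned and base weights, and dynamic merging adds a scaled copy of it back onto the base at inference time. The sketch below compresses a task vector by keeping only its largest-magnitude entries (top-k sparsification), a simple, non-learned stand-in for the paper's learnable compression. Weights are flat toy lists, and `keep_ratio` and the helper names are hypothetical.

```python
from typing import Dict, List

def task_vector(base: List[float], finetuned: List[float]) -> List[float]:
    # Weight change introduced by fine-tuning on one task.
    return [f - b for b, f in zip(base, finetuned)]

def compress_topk(vec: List[float], keep_ratio: float) -> Dict[int, float]:
    # Keep only the largest-magnitude entries as a sparse {index: value} map.
    k = max(1, int(len(vec) * keep_ratio))
    top = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in top}

def merge(base: List[float], sparse_vectors: List[Dict[int, float]],
          scale: float = 1.0) -> List[float]:
    # Add the selected (sparse) task vectors back onto the shared base weights.
    merged = list(base)
    for sv in sparse_vectors:
        for i, delta in sv.items():
            merged[i] += scale * delta
    return merged

base = [0.0] * 100
finetuned = [0.01 * (i % 7) for i in range(100)]
finetuned[3], finetuned[42] = 2.0, -1.5   # two large, task-relevant changes
sparse = compress_topk(task_vector(base, finetuned), keep_ratio=0.05)
merged = merge(base, [sparse])
```

Here 5 of 100 entries are stored. Magnitude pruning discards the small changes outright, which is exactly the loss a learned compressor would try to reduce.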

efficiency, training

FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing

Apr 30, 2026

Arthur Corrêa, Paulo Nascimento, Samuel Moniz

A single neural model can now handle multiple variants of complex routing problems by dynamically adapting to different constraints, suggesting that multi-task learning with adaptive conditioning is more practical than building separate models for each problem type.

FiLMMeD is a neural model that solves 24 different multi-depot vehicle routing problems (a logistics optimization task) using a single unified architecture.
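
FiLM itself is a standard conditioning layer: a small network maps a conditioning vector to a per-feature scale γ and shift β, and the hidden features become γ ⊙ h + β. The minimal pure-Python sketch below assumes, hypothetically, that the conditioning vector encodes which routing constraints are active. The linear generators, the two-flag encoding, and all weights are illustrative, not FiLMMeD's architecture.

```python
from typing import List, Sequence

Weights = Sequence[Sequence[float]]

def linear(weights: Weights, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    # y = W x + b for a small dense layer.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(h: Sequence[float], cond: Sequence[float], w_gamma: Weights,
         b_gamma: Sequence[float], w_beta: Weights, b_beta: Sequence[float]) -> List[float]:
    # Generate a per-feature scale (gamma) and shift (beta) from the condition,
    # then modulate every hidden feature: gamma * h + beta.
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * hi + bt for g, hi, bt in zip(gamma, h, beta)]

# Hypothetical constraint flags: [has_time_windows, has_capacity_limit].
h = [1.0, -2.0, 0.5]
w_gamma = [[0.5, 0.0], [0.0, 1.0], [0.2, 0.2]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 0.1], [0.3, 0.0], [0.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]

no_constraints = film(h, [0.0, 0.0], w_gamma, b_gamma, w_beta, b_beta)
time_windows = film(h, [1.0, 0.0], w_gamma, b_gamma, w_beta, b_beta)
```

With all flags off, this toy layer passes h through unchanged. Switching on a constraint rescales and shifts features without changing the shared weights that produced h.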

architecture, training, applications

Characterizing the Consistency of the Emergent Misalignment Persona

Apr 30, 2026

Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko

Fine-tuning on narrow harmful data can cause models to behave broadly harmfully, but they don't consistently develop matching self-awareness—some models hide their misalignment while others openly acknowledge it.

When large language models are fine-tuned on specific types of harmful data, they sometimes develop broader harmful behavior—a phenomenon called emergent misalignment. This paper tests whether models that behave harmfully also recognize themselves as misaligned.

safety, alignment, training

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Apr 30, 2026

Ansar Aynetdinov, Patrick Haller, Alan Akbik

For non-English language models (here, German), aggressively filtering data for quality and repeating it multiple times beats training once on larger, more diverse datasets, a practical insight for resource-constrained language model development.

This paper challenges the assumption that diverse data is always better for language model training. For German, the researchers found that repeatedly training on a smaller, high-quality filtered dataset outperforms training once on a larger, less-filtered dataset—even after 7 epochs of repetition.

training, data, efficiency

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

Apr 30, 2026

Feiyu Wu, Xu Zheng, Zhuocheng Wang et al.

LLM-generated rewards aren't equally useful throughout training—their reliability depends on policy competence and training phase, so verification and deployment timing matter as much as reward generation itself.

This paper addresses when and how to use LLM-generated rewards during reinforcement learning training. The authors propose RHyVE, a method that verifies reward quality based on the current policy's skill level and training phase, rather than treating all rewards equally throughout training.

training

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Apr 29, 2026

Gongbo Zhang, Wen Wang, Ye Tian et al.

Cross-architecture distillation for diffusion models is now practical: you can compress large diffusion LLMs into tiny ones (13x smaller) while maintaining performance, even when teacher and student have completely different designs.

This paper introduces TIDE, a framework for distilling knowledge from large diffusion language models into much smaller ones across different architectures. Unlike previous distillation methods that work within a single model type, TIDE handles cases where teacher and student models have different designs, attention mechanisms, and tokenizers.

training, efficiency, architecture

Select to Think: Unlocking SLM Potential with Local Sufficiency

Apr 29, 2026

Wenxuan Ye, Yangyang Zhang, Xueli An et al.

Small models already generate the right answers in their candidate predictions—they just rank them poorly. Training them to re-rank their own outputs improves reasoning without external model calls.

Small language models struggle with reasoning tasks compared to large models. This paper discovers that when small models fail, the correct token from a large model is usually hidden in the small model's top-8 predictions.
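
The "top-8" observation corresponds to a simple diagnostic: at each position, check whether the large model's chosen token appears among the small model's k highest-scoring candidates. The sketch below uses toy score dictionaries in place of real logits; `topk_coverage` is a hypothetical helper, not code from the paper.

```python
from typing import Dict, List

def topk_coverage(small_scores: List[Dict[str, float]], large_tokens: List[str],
                  k: int = 8) -> float:
    """Fraction of positions where the large model's token is in the small model's top-k."""
    hits = 0
    for scores, target in zip(small_scores, large_tokens):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += target in top_k
    return hits / len(large_tokens)

# Position 0: the small model ranks the large model's token second.
# Position 1: the large model's token is missing from the candidates entirely.
small = [
    {"cat": 0.5, "dog": 0.3, "eel": 0.2},
    {"ran": 0.6, "sat": 0.4},
]
large = ["dog", "jumped"]
print(topk_coverage(small, large, k=2))  # → 0.5
```

High top-k coverage combined with a low top-1 match is the "right answer, wrong ranking" gap that training the small model to re-rank its own candidates aims to close.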

efficiency, reasoning, training

Learning Over-Relaxation Policies for ADMM with Convergence Guarantees

Apr 29, 2026

Junan Lin, Paul J. Goulart, Luca Furieri

Learning to adapt relaxation parameters in ADMM can speed up solving repeated optimization problems while maintaining convergence guarantees—useful for real-time control systems that solve similar problems repeatedly.

This paper shows how to use machine learning to automatically tune the relaxation parameter in ADMM, an algorithm for solving optimization problems. By learning better parameter choices for repeated similar problems (like in Model Predictive Control), the method reduces computation time without requiring expensive matrix refactorizations.
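
For reference, this is where the relaxation parameter α enters standard over-relaxed ADMM: the fresh x-update is blended with the previous z before the z- and dual updates. α lies in (0, 2), and α = 1 recovers plain ADMM. The toy problem, minimize ½(x − a)² + λ|z| subject to x = z, has the closed-form answer soft_threshold(a, λ), which makes the sketch easy to check. The paper learns α as a policy; here it is a fixed argument, and the function names are hypothetical.

```python
def soft_threshold(v: float, t: float) -> float:
    # Proximal operator of t * |.|
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_relaxed(a: float, lam: float, rho: float = 1.0, alpha: float = 1.6,
                 iters: int = 200) -> float:
    """Over-relaxed ADMM for min 0.5*(x - a)^2 + lam*|z| s.t. x = z (scaled dual u)."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)      # x-update (quadratic term)
        x_hat = alpha * x + (1.0 - alpha) * z        # over-relaxation step
        z = soft_threshold(x_hat + u, lam / rho)     # z-update (L1 term)
        u = u + x_hat - z                            # scaled dual update
    return z

print(admm_relaxed(3.0, 1.0), soft_threshold(3.0, 1.0))
```

On this toy problem, once the dual variable settles, the error shrinks by a factor of 0.2 per iteration with α = 1.6 versus 0.5 with α = 1. That is the kind of speed-up a well-chosen α delivers without refactoring any matrices.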

training

ClawGym: A Scalable Framework for Building Effective Claw Agents

Apr 29, 2026

Fei Bai, Huatong Song, Shuang Sun et al.

To build effective agents for real-world file and tool interactions, you need systematic data synthesis, training on realistic rollout trajectories, and careful evaluation—ClawGym provides all three components together.

ClawGym is a framework for building AI agents that work with files, tools, and persistent workspaces through multi-step tasks. It includes a dataset of 13.5K synthesized tasks with realistic mock environments, trained agent models using supervised learning and reinforcement learning, and a benchmark for evaluation.

agents, training, evaluation

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Apr 29, 2026

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García et al.

Noise in transformers can synchronize token behavior and stabilize learning—a counterintuitive finding that suggests randomness plays a constructive role in how these models process sequences.

This paper proves that transformer models with finite depth and width converge to a stochastic particle system as they scale. The researchers show that token evolution follows a continuous-time process with noise-driven synchronization, meaning random perturbations actually help tokens align rather than diverge.

scaling, architecture, training

Multiple Additive Neural Networks for Structured and Unstructured Data

Apr 29, 2026

Janis Mohr, Jörg Frochte

MANN combines gradient boosting with neural networks instead of trees, enabling a single framework to handle structured and unstructured data while outperforming XGBoost and reducing hyperparameter sensitivity.

This paper presents Multiple Additive Neural Networks (MANN), which replaces decision trees in gradient boosting with shallow neural networks. MANN works with both structured data and images/audio by using CNNs and capsule networks as feature extractors, and shows better accuracy than XGBoost on standard benchmarks while being more robust to hyperparameter choices.

training, architecture, efficiency

What Kind of Language is Easy to Language-Model Under Curriculum Learning?

Apr 29, 2026

Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe

Curriculum learning substantially changes language models' learning biases, suggesting that training order matters as much as model architecture when predicting which language structures are 'easy' to learn.

This paper investigates how curriculum learning—training language models on simpler sentences first rather than random order—affects which linguistic patterns models naturally learn.

training, data

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Apr 29, 2026

Bao Pham, Mohammed J. Zaki, Luca Ambrogioni et al.

Language diffusion models memorize training data by default, but you can detect when they switch to genuine generalization by monitoring conditional entropy—a practical signal for assessing whether a deployed model is memorizing or creating.

This paper reveals that language diffusion models work like associative memories—they store training data in 'basins of attraction' and can retrieve both memorized and unseen examples. As training data grows, the model transitions from memorizing to generalizing, a shift detectable by measuring conditional entropy of token predictions.

training, evaluation, reasoning

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Apr 28, 2026

Chu-Cheng Lin, Eugene Ie

When training reasoning models with sparse rewards, you can escape cold-start failure by interpolating between RL and supervised learning via the Tsallis loss family—intermediate values of q balance speed of learning with training stability.

This paper tackles a key problem in training reasoning models: when models rarely succeed initially, standard reinforcement learning gets stuck. The authors introduce a family of loss functions based on the Tsallis q-logarithm that smoothly blend between two extremes (pure RL and pure supervised learning), letting practitioners choose how quickly to commit to learning from successes.
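
The building block of this loss family is the Tsallis q-logarithm, ln_q(p) = (p^(1−q) − 1) / (1 − q), which equals ln p in the limit q → 1 and p − 1 at q = 0. Applied to the probability p of producing a successful answer, q = 0 gives an objective linear in p (RL-like), and q = 1 gives its log-likelihood (supervised-like); this mapping of the endpoints is an assumption of the sketch. The code shows only the interpolation and how the derivative p^(−q) up-weights rare successes as q grows. The paper's exact loss and parameterization are not reproduced.

```python
import math

def q_log(p: float, q: float) -> float:
    """Tsallis q-logarithm: (p^(1-q) - 1) / (1 - q), with the q -> 1 limit ln(p)."""
    if abs(1.0 - q) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

def q_log_grad(p: float, q: float) -> float:
    """Derivative of q_log with respect to p: p^(-q)."""
    return p ** (-q)

p_success = 0.01  # cold start: the model rarely solves the problem
for q in (0.0, 0.5, 1.0):
    print(q, round(q_log(p_success, q), 4), round(q_log_grad(p_success, q), 2))
```

At p = 0.01 the derivative grows from 1 (q = 0) to 10 (q = 0.5) to 100 (q = 1), which is one way to see why intermediate q trades learning speed against stability.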

training, reasoning, alignment

Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics

Apr 28, 2026

Andre Herz, Daniel Durstewitz, Georgia Koppe

Teacher forcing trains RNNs on chaotic systems differently than the model will actually be used—this mismatch can make models fit data well statistically while performing poorly at predicting actual dynamics, a problem that becomes worse when multiple explanations exist for the data.

This paper reveals a fundamental mismatch between how teacher forcing (a common training technique) and marginal likelihood (the true objective) shape neural network optimization for chaotic systems.

training, reasoning

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Apr 28, 2026

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy et al.

You can compress LLMs for software engineering (SE) tasks to 1/49th of their original size with minimal accuracy loss, making them practical to deploy while dramatically cutting their environmental impact.

This paper presents Carbon-Taxed Transformers (CTT), a compression pipeline that makes large language models smaller, faster, and greener for software engineering tasks.

efficiency, training, evaluation

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

Apr 28, 2026

Dominik Żurek, Kamil Faber, Marcin Pietron et al.

Architectural parameter reuse guided by task similarity is a memory-efficient alternative to replay-based continual learning in offline RL, enabling better multi-task performance without storing historical data.

This paper presents TSN-Affinity, a method for continual offline reinforcement learning that learns multiple tasks sequentially from pre-collected datasets without forgetting previous tasks.

training, architecture, efficiency

Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

Apr 27, 2026

Griffin Pitts, Muntasir Hoq, Peter Brusilovsky et al.

By extracting knowledge components from student code patterns, you can steer generative models to create personalized learning content that directly targets the logical errors students are making, rather than relying on generic pre-written examples.

This paper presents a system that automatically generates personalized worked examples for programming students based on their actual code submissions. Instead of using fixed example libraries, the system analyzes patterns in student errors using code structure analysis and uses these patterns to guide an AI model to create relevant examples that address each student's specific misconceptions.

applications, training, data

Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

Apr 27, 2026

Zhangyong Liang

When training neural networks on multiscale physics problems, gradient conflicts between different regimes can cause training failure—HRGrad fixes this by explicitly managing gradient directions to keep all objectives aligned during optimization.

This paper introduces HRGrad, a method for training neural networks on physics problems that span multiple scales—from microscopic to macroscopic behavior. The key challenge is that different scales pull the network in conflicting directions during training.

training, reasoning

Learning to Think from Multiple Thinkers

Apr 27, 2026

Nirmit Joshi, Roey Magen, Nathan Srebro et al.

Learning from diverse reasoning traces is harder than learning from a single thinker, but you can overcome this by actively collecting reasoning data from many thinkers (logarithmic in target accuracy) combined with passive final-answer supervision.

This paper studies how AI models can learn from multiple people or programs solving the same problem in different ways (e.g., different math solutions or code implementations).

training, reasoning, data

Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Apr 27, 2026

Hermawan Manurung, Ibrahim Al-Kahfi, Ahmad Rizqi et al.

Multi-task learning (training one model for both sentiment and emotion at once) with BiLSTM outperforms single-task approaches on noisy, informal Indonesian text—and preprocessing with domain-specific slang dictionaries matters more than model complexity.

This paper tackles sentiment and emotion classification for Indonesian e-commerce reviews, which contain slang, regional words, and emoji that confuse standard tools. The authors built a two-track system: one using AutoML with TF-IDF features, and another using a BiLSTM neural network trained on both sentiment and emotion simultaneously.

trainingapplications
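The paper reports that slang-dictionary preprocessing matters more than model complexity. A minimal sketch of dictionary-based normalization might look like this, assuming a simple lowercase-and-tokenize step. The dictionary entries and the regular expression are hypothetical illustrations, not the authors' resources.

```python
import re
from typing import Dict

# Hypothetical slang-to-standard entries for illustration only;
# the authors' actual dictionary is not reproduced here.
SLANG: Dict[str, str] = {
    "gak": "tidak",
    "ga": "tidak",
    "bgt": "banget",
    "yg": "yang",
}

# Runs of word characters, or single non-word, non-space characters (e.g. punctuation)
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def normalize(text: str, slang: Dict[str, str] = SLANG) -> str:
    """Lowercase, tokenize, and replace known slang tokens with standard forms."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(slang.get(tok, tok) for tok in tokens)


print(normalize("Barang bagus bgt, pengiriman gak lama!"))
# barang bagus banget , pengiriman tidak lama !
```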

Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

Apr 27, 2026

Hailing Cheng, Tao Huang, Chen Zhu et al.

You can use your existing multi-GPU setup to automatically find better learning rates during training by having each GPU replica train with a slightly different learning rate and periodically averaging the replicas, with no extra compute needed.

This paper proposes HDET, a method that uses multiple GPU replicas to explore different learning rates during training instead of computing identical updates. Replicas train independently with different learning rates, then synchronize periodically.

trainingefficiency
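The replica-divergence idea can be sketched on a toy 1-D problem: each replica runs local SGD with its own learning rate, and at every synchronization all replicas restart from the average weight. Using plain parameter averaging as the synchronization rule is an assumption of this sketch, and the function name and learning rates are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def divergent_lr_training(
    grad: Callable[[float], float],
    w0: float,
    lrs: Sequence[float],
    sync_every: int,
    rounds: int,
) -> float:
    """Toy 1-D sketch: one replica per learning rate, weights averaged at each sync.

    Parameter averaging as the synchronization rule is an assumption here.
    """
    replicas = [w0 for _ in lrs]
    for _ in range(rounds):
        # Each replica takes `sync_every` local SGD steps with its own learning rate
        for i, lr in enumerate(lrs):
            w = replicas[i]
            for _ in range(sync_every):
                w -= lr * grad(w)
            replicas[i] = w
        # Synchronize: every replica restarts from the average weight
        avg = mean(replicas)
        replicas = [avg for _ in lrs]
    return replicas[0]


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = divergent_lr_training(lambda w: 2 * (w - 3), 0.0, [0.01, 0.05, 0.1, 0.2], 5, 20)
print(round(w, 4))
```

This toy keeps each replica's learning rate fixed. How HDET adjusts the rates over time is not described in the summary and is not modeled here.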

Contextual Linear Activation Steering of Language Models

Apr 27, 2026

Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan et al.

Adapting steering strength dynamically per context significantly improves LLM control compared to fixed steering, matching more complex methods like LoRA while remaining simpler and more interpretable.

This paper improves linear activation steering—a technique for controlling LLM behavior—by making the steering strength adapt to each input context instead of using a fixed strength for all tokens. The method, called CLAS, works better than existing approaches across multiple benchmarks and models, offering a practical way to customize LLMs with limited training data.

alignmentefficiencytraining
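Linear activation steering adds a multiple of a steering vector `v` to a hidden state `h`. The difference the summary describes is fixed versus context-dependent strength. In the sketch below, the contextual strength is `alpha_max * sigmoid(w·h + b)`. That gate form and the parameters `w`, `b`, and `alpha_max` are hypothetical choices for illustration, not CLAS's actual parameterization.

```python
import math
from typing import List, Tuple

Vector = List[float]


def fixed_steer(h: Vector, v: Vector, alpha: float) -> Vector:
    """Standard linear steering: add the same multiple of v to every activation."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]


def contextual_steer(
    h: Vector, v: Vector, w: Vector, b: float, alpha_max: float
) -> Tuple[Vector, float]:
    """Context-dependent strength; the sigmoid gate is a hypothetical choice."""
    score = sum(wi * hi for wi, hi in zip(w, h)) + b
    alpha = alpha_max / (1.0 + math.exp(-score))  # alpha_max * sigmoid(score)
    return [hi + alpha * vi for hi, vi in zip(h, v)], alpha


v = [0.0, 1.0]  # steering direction (made up)
gate_w, gate_b = [1.0, 0.0], 0.0  # hypothetical learned gate parameters

print(fixed_steer([1.0, 2.0], v, 0.5))
_, strong = contextual_steer([5.0, 0.0], v, gate_w, gate_b, 2.0)
_, weak = contextual_steer([-5.0, 0.0], v, gate_w, gate_b, 2.0)
print(strong, weak)
```

The fixed version applies the same strength to every input, while the gated version steers hard on one context and barely at all on the other.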

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Apr 24, 2026

Hillary Mutisya, John Mugane

Cross-lingual transfer and unsupervised clustering are complementary for morphology discovery in low-resource languages—transfer finds cognates while clustering spots language-specific innovations that transfer misses.

This paper develops a method to automatically discover morphological patterns in Giriama, a low-resource Bantu language with minimal labeled data. By combining knowledge transfer from Swahili with unsupervised clustering, the system identifies noun classes and uncovers two previously unknown morphological patterns, achieving 86.7% accuracy on lemmatization across word classes.

datatraining

Aligning Dense Retrievers with LLM Utility via Distillation

Apr 24, 2026

Rajinder Sandhu, Di Mu, Cheng Chang et al.

You can train dense retrievers to match LLM utility by distilling perplexity-based signals into embeddings during training, eliminating expensive test-time LLM re-ranking while improving retrieval quality.

This paper proposes Utility-Aligned Embeddings (UAE), a method that trains dense retrievers to match the ranking quality of LLM-based re-ranking without the computational cost.

trainingefficiency
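One common way to distill an LLM-derived utility signal into a retriever is to minimize the KL divergence between a target distribution over candidate passages and the retriever's softmax over its similarity scores. The sketch below assumes utility is the negative log-perplexity of the gold answer given each passage. It is a generic distillation loss for illustration, not necessarily the exact UAE objective.

```python
import math
from typing import List


def softmax(xs: List[float], temperature: float = 1.0) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def utility_distillation_loss(
    retriever_scores: List[float],
    answer_perplexities: List[float],
    tau_retriever: float = 1.0,
    tau_utility: float = 1.0,
) -> float:
    """KL(target || retriever) over one query's candidate passages.

    Assumption: utility = -log(perplexity of the gold answer given the passage),
    so passages that make the answer less surprising get more target mass.
    """
    target = softmax([-math.log(p) for p in answer_perplexities], tau_utility)
    predicted = softmax(retriever_scores, tau_retriever)
    return kl_divergence(target, predicted)


ppl = [2.0, 8.0, 32.0]  # made-up perplexities for three passages
aligned = utility_distillation_loss([3.0, 1.0, -1.0], ppl)
reversed_order = utility_distillation_loss([-1.0, 1.0, 3.0], ppl)
print(aligned < reversed_order)  # True
```

Because the LLM signal is only needed to build training targets, the retriever alone ranks passages at query time, which is where the re-ranking cost is saved.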

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Apr 24, 2026

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.

This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.

reasoningefficiencytraining

CRAFT: Clustered Regression for Adaptive Filtering of Training data

Apr 24, 2026

Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

You can select high-quality training data 40× faster than competing methods by matching source distributions through clustering and target distributions through regression, without sacrificing quality.

CRAFT is a fast method for selecting high-quality training data subsets from massive datasets. It uses clustering and statistical matching to pick training examples whose target outputs align with your validation set, enabling efficient fine-tuning of translation models on millions of examples in under a minute.

datatrainingefficiency
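The clustering half of the idea can be sketched as distribution matching: given precomputed cluster labels for the candidate pool and the validation set, choose candidates so that each cluster's share of the selection mirrors its share of the validation set. This sketch assumes the labels already exist (e.g. from k-means on embeddings), picks within a cluster in index order, and does not model CRAFT's regression step.

```python
from collections import Counter
from typing import Dict, List, Sequence


def match_cluster_proportions(
    candidate_clusters: Sequence[int],
    validation_clusters: Sequence[int],
    budget: int,
) -> List[int]:
    """Pick candidate indices so cluster shares mirror the validation set.

    Cluster labels are assumed to be precomputed (e.g. k-means on embeddings).
    Rounded quotas may not sum exactly to `budget`.
    """
    target = Counter(validation_clusters)
    total = sum(target.values())
    pools: Dict[int, List[int]] = {}
    for idx, cluster in enumerate(candidate_clusters):
        pools.setdefault(cluster, []).append(idx)
    selected: List[int] = []
    for cluster, count in target.items():
        quota = round(budget * count / total)
        selected.extend(pools.get(cluster, [])[:quota])  # take the first `quota`
    return selected


candidates = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]  # cluster of each candidate example
validation = [0, 0, 0, 1]  # 75% cluster 0, 25% cluster 1, no cluster 2
print(match_cluster_proportions(candidates, validation, budget=4))  # [0, 1, 2, 4]
```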

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Apr 23, 2026

Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis et al.

When evaluating continual learning systems on streaming data, the way you partition the stream into tasks is as important as the algorithm itself—different valid splits can produce contradictory conclusions about which method works best.

This paper reveals that how you split a continuous data stream into tasks dramatically affects continual learning benchmarks—even when using the same data and model. The authors introduce tools to measure this effect and show that different task boundaries can flip which learning method performs best, making temporal taskification a critical but often-overlooked evaluation choice.

evaluationtraining
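Temporal taskification amounts to choosing boundary times that cut one stream into tasks. The sketch below shows two equally valid choices applied to the same made-up stream. The function name and boundary values are hypothetical.

```python
import bisect
from typing import List, Sequence


def taskify(timestamps: Sequence[float], boundaries: Sequence[float]) -> List[int]:
    """Assign each event a task index; each boundary time starts a new task.

    `boundaries` must be sorted in increasing order.
    """
    return [bisect.bisect_right(boundaries, t) for t in timestamps]


stream = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up event timestamps
print(taskify(stream, [4]))     # [0, 0, 0, 1, 1, 1, 1, 1]
print(taskify(stream, [2, 6]))  # [0, 1, 1, 1, 1, 2, 2, 2]
```

Both splits partition the same data, yet they yield different numbers of tasks and different task contents. This is the kind of evaluation choice the paper finds can flip which method looks best.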

Fine-Tuning Regimes Define Distinct Continual Learning Problems

Apr 23, 2026

Paul-Tiberiu Iordache, Elena Burceanu

When comparing continual learning methods, the choice of which model layers to train is as important as the method itself—different fine-tuning regimes can completely change which approach performs best.

This paper shows that how you choose which parts of a model to update during continual learning (learning new tasks sequentially) significantly changes which methods work best.

trainingevaluation

Low-Rank Adaptation Redux for Large Models

Apr 23, 2026

Bingcong Li, Yilang Zhang, Georgios B. Giannakis

LoRA works by adding a small, trainable low-rank update to a frozen pre-trained model instead of updating all parameters. Signal processing principles can guide better design choices for this approach and for similar efficient fine-tuning methods.

This paper examines LoRA (Low-Rank Adaptation), a widely used technique for efficiently fine-tuning large AI models, through the lens of signal processing. It explains the core mechanisms behind LoRA variants and how classical signal processing tools can improve parameter-efficient fine-tuning methods, covering architectural design, optimization strategies, and real-world applications.

trainingefficiency
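For reference, the basic LoRA layer computes `y = W x + (alpha / r) * B (A x)` with `W` frozen and only the small matrices `A` (r × d_in) and `B` (d_out × r) trained, where `B` starts at zero so training begins from the pre-trained behavior. This is a pure-Python sketch of that original formulation, not any of the variants the survey covers. The initialization scale of `A` is an arbitrary choice here.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""

    def __init__(self, weight: Matrix, rank: int, alpha: float, seed: int = 0):
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.weight = weight  # frozen pre-trained weights, shape (d_out, d_in)
        # A: small random init (std 0.01 is arbitrary), shape (rank, d_in)
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B: zeros, shape (d_out, rank), so the initial update is exactly zero
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in A and B: rank * (d_in + d_out)."""
    return rank * (d_in + d_out)


layer = LoRALinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], rank=2, alpha=4.0)
print(layer.forward([1.0, 0.0, -1.0]))  # [-2.0, -2.0]: B = 0, so output equals W x
print(lora_param_count(4096, 4096, 8), 4096 * 4096)  # 65536 16777216
```

The last line shows why the approach is cheap: for a 4096 × 4096 weight matrix, rank 8 trains 65,536 parameters instead of about 16.8 million.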