Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Shihao Wang, Shilong Liu, Yuanguo Kuang et al.
Decoding bounding boxes as complete geometric units instead of individual tokens dramatically speeds up inference while maintaining or improving localization accuracy.
LocateAnything replaces slow token-by-token box decoding with Parallel Box Decoding, which generates entire bounding boxes at once. Combined with a 138-million-sample dataset, this approach makes visual grounding and detection faster while improving accuracy on standard benchmarks.
Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob et al.
You can optimize retrieval pipelines per-query rather than per-workload by using lightweight predictors trained on query characteristics, achieving the same accuracy at significantly lower cost or better accuracy at the same cost.
This paper presents BRANE, a system that automatically selects the best configuration for retrieval agents on a per-query basis. Instead of manually tuning a retrieval pipeline once, BRANE analyzes each query to predict which combination of LLM, retriever, and other settings will work best, allowing teams to optimize for either accuracy or cost at inference time without retraining.
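The per-query selection idea can be illustrated with a minimal sketch. This is not BRANE's implementation: the `Config` type, the `pick_config` function, the cost values, and the toy predictor are all hypothetical, and the only thing taken from the summary is the pattern of predicting quality per query and then trading it off against cost.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Config:
    name: str    # hypothetical label for an LLM + retriever combination
    cost: float  # assumed relative cost per query


def pick_config(
    query: str,
    configs: Sequence[Config],
    predict_accuracy: Callable[[str, Config], float],
    min_accuracy: float,
) -> Config:
    """Return the cheapest config predicted to reach min_accuracy.

    Falls back to the config with the highest predicted accuracy
    when none reaches the threshold.
    """
    scored = [(c, predict_accuracy(query, c)) for c in configs]
    good = [(c, a) for c, a in scored if a >= min_accuracy]
    if good:
        return min(good, key=lambda ca: ca[0].cost)[0]
    return max(scored, key=lambda ca: ca[1])[0]


def toy_predictor(query: str, config: Config) -> float:
    """Stand-in for a trained predictor: longer queries are assumed harder."""
    difficulty = min(len(query.split()) / 20.0, 1.0)
    strength = {"cheap": 0.6, "mid": 0.8, "large": 0.95}[config.name]
    return strength * (1.0 - 0.5 * difficulty)


CONFIGS = [Config("cheap", 1.0), Config("mid", 4.0), Config("large", 20.0)]

easy = pick_config("capital of France", CONFIGS, toy_predictor, 0.5)
hard = pick_config(" ".join(["word"] * 20), CONFIGS, toy_predictor, 0.45)
print(easy.name, hard.name)
```

In a real system the predictor would be a lightweight model trained on query characteristics, as the summary describes; the threshold is the knob that moves between the cost-first and accuracy-first ends of the trade-off.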
Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong et al.
You can now tune hyperparameters on a single dense model and transfer them directly to MoE models of any size or configuration, eliminating the need for expensive hyperparameter search when scaling with MoE.
Complete-muE is a framework that solves the problem of transferring hyperparameters (like learning rate and weight decay) from dense neural networks to Mixture-of-Experts (MoE) models without expensive retuning.
Shuhong Zheng, Michael Oechsle, Erik Sandström et al.
By selectively dropping redundant image patches across frames and within frames using attention entropy, you can speed up 3D reconstruction transformers dramatically without sacrificing quality.
This paper tackles the computational bottleneck in visual geometry transformers, models that reconstruct 3D scenes from multiple images. The authors propose a token selection strategy that limits which image patches the model attends to, cutting computation by 85% while maintaining or improving accuracy.
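A rough sketch of entropy-based token selection, for intuition only. The summary does not say how the paper turns attention entropy into a keep/drop decision, so the rule here (keep the tokens whose attention rows have the lowest entropy) is an assumption, and `attention_entropy` and `select_tokens` are hypothetical names.

```python
import numpy as np


def attention_entropy(attn: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of a (num_tokens, num_keys) attention matrix.

    Rows are expected to sum to 1 (softmax output).
    """
    p = np.clip(attn, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)


def select_tokens(attn: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return sorted indices of the tokens to keep.

    Assumption: tokens with focused (low-entropy) attention are kept,
    and diffuse (high-entropy) ones are treated as redundant.
    """
    ent = attention_entropy(attn)
    k = max(1, int(round(keep_ratio * attn.shape[0])))
    keep = np.argsort(ent)[:k]
    return np.sort(keep)


# Toy example: tokens 0 and 2 attend sharply, tokens 1 and 3 attend uniformly.
attn = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.01, 0.01, 0.97, 0.01],
    [0.25, 0.25, 0.25, 0.25],
])
kept = select_tokens(attn, keep_ratio=0.5)
print(kept)
```

The opposite rule (keeping high-entropy tokens) is just as easy to express by reversing the sort; which direction works better is an empirical question this sketch does not answer.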
Xiang Fan, Yuheng Wang, Bohan Fang et al.
Video generation systems lose detail because their decoders ignore the input image—adding reference conditioning to the decoder recovers this information and improves quality by up to 2.1dB PSNR.
RefDecoder improves video generation by conditioning the decoder on a reference image, fixing a common architectural flaw where decoders ignore input details. By injecting reference image information through attention mechanisms during decoding, it preserves fine details and consistency without requiring retraining of existing systems.
Ellwil Sharma, Arastu Sharma
Sparse mixture-of-experts routing can solve the problem of conflicting physics domains in foundation models by automatically routing different physics problems to specialized experts while maintaining shared knowledge for universal principles.
This paper tackles negative transfer in multi-physics AI models—where training on different physics problems simultaneously hurts performance. The authors propose Shodh-MoE, which uses sparse expert routing to let different parts of the model specialize in different physics regimes (like fluid dynamics vs. porous media flows) while sharing knowledge where it helps.
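For readers new to sparse routing, here is a minimal NumPy sketch of a top-k mixture-of-experts layer with an always-on shared path. It is not Shodh-MoE's architecture: the single shared expert, the linear experts, and the `moe_forward` signature are assumptions chosen to show the general mechanism of specialized experts plus shared parameters.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_forward(x, router_w, expert_ws, shared_w, top_k=1):
    """Route each token to its top_k experts and add a shared expert's output.

    x: (tokens, d), router_w: (d, num_experts),
    expert_ws: list of (d, d) matrices, shared_w: (d, d).
    """
    logits = x @ router_w                                   # (tokens, num_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]           # chosen expert ids
    gate = softmax(np.take_along_axis(logits, top, axis=-1))  # weights over chosen experts
    out = x @ shared_w                                      # shared path sees every token
    for slot in range(top_k):
        for e, w in enumerate(expert_ws):
            mask = top[:, slot] == e
            if mask.any():
                out[mask] += gate[mask][:, slot:slot + 1] * (x[mask] @ w)
    return out, top


# Toy setup: the router sends positive first-feature tokens to expert 0,
# negative ones to expert 1.
x = np.array([[1.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]])
router_w = np.array([[1.0, -1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
experts = [np.eye(4), 2.0 * np.eye(4)]
shared = np.eye(4)

out, top = moe_forward(x, router_w, experts, shared, top_k=1)
print(top.ravel(), out[:, 0])
```

Only the selected experts run for each token, which is what keeps the layer sparse; the shared matrix is the part every input updates during training.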
Jiatao Gu, Tianrong Chen, Ying Shen et al.
NTM enables fast image generation (4 steps) while preserving exact likelihood calculation—something previous fast diffusion methods couldn't do—by using normalizing flows for each denoising step instead of simple Gaussian assumptions.
This paper introduces Normalizing Trajectory Models (NTM), a new approach for fast image generation that compresses diffusion sampling from many steps to just four. Unlike existing fast methods that lose the ability to calculate exact probabilities, NTM maintains a mathematically exact likelihood while generating high-quality images, making it useful for both generation and evaluation.
Wei Yu, Yunhang Qian
State space models offer a practical alternative to transformers for event-based image reconstruction, achieving better results with linear computational complexity instead of quadratic, making high-resolution processing feasible.
EmambaIR uses state space models, a sequence architecture whose cost grows linearly with input length rather than quadratically, to reconstruct clear images from event camera data.
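The linear-complexity claim is easiest to see in the basic state space recurrence. The sketch below is the textbook discrete linear form, not EmambaIR's model; the matrices `A`, `B`, `C` and the function name `ssm_scan` are illustrative assumptions.

```python
import numpy as np


def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a sequence.

    x: (T, d_in), A: (n, n), B: (n, d_in), C: (d_out, n).
    One fixed-size state update per step, so cost is linear in T.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # carry a fixed-size summary of the past
        ys.append(C @ h)
    return np.stack(ys)


# Toy example: a 1-dimensional state that decays by half each step.
A = np.array([[0.5]])
B = np.array([[1.0]])
C = np.array([[1.0]])
x = np.ones((3, 1))

y = ssm_scan(x, A, B, C)
print(y.ravel())
```

Self-attention compares every position with every other position, which is where the quadratic cost comes from; the recurrence above touches each position once.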
Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.
Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.
HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.
Siyuan Huang, Xiaoye Qu, Yafu Li et al.
PVM solves a fundamental problem in vision-language models where visual understanding degrades during long text generation by creating a separate, always-accessible pathway to visual information—improving reasoning tasks with minimal added parameters.
Large vision-language models struggle when generating long text because visual information gets diluted by accumulated text tokens. This paper introduces Persistent Visual Memory (PVM), a lightweight add-on module that maintains direct access to visual embeddings throughout generation, preventing the model from losing sight of the image as it produces longer outputs.
Sijie Li, Shanda Li, Haowei Lin et al.
Use active learning to strategically pick which small experiments to run when fitting scaling laws—you can predict large-scale model performance with 90% less compute by choosing experiments that reduce uncertainty about the target region you care about.
Training large AI models costs millions, and figuring out how they'll scale costs millions more. This paper proposes a smarter way to choose which smaller pilot experiments to run so you can accurately predict how a massive training run will perform, using only about 10% of the budget that naive approaches would need.
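One simple way to make "choose experiments that reduce uncertainty about the target" concrete is classical experimental design for a straight-line fit in log-log space. This is a swapped-in textbook technique, not the paper's algorithm: the linear-in-log model, the greedy variance-per-cost rule, and the names `target_variance` and `pick_next` are all assumptions.

```python
import numpy as np


def target_variance(log_c_runs: np.ndarray, log_c_target: float) -> float:
    """Variance factor x*^T (X^T X)^-1 x* of a line fit, evaluated at the target.

    log_c_runs: log10 compute of the pilot runs included in the fit.
    Multiply by the noise variance to get the actual prediction variance.
    """
    X = np.column_stack([np.ones(len(log_c_runs)), log_c_runs])
    x_star = np.array([1.0, log_c_target])
    return float(x_star @ np.linalg.inv(X.T @ X) @ x_star)


def pick_next(done: np.ndarray, candidates: np.ndarray, log_c_target: float) -> float:
    """Greedy choice: largest drop in target variance per unit of compute.

    Assumption: a run at log10 compute c costs 10**c.
    """
    before = target_variance(done, log_c_target)
    gains = []
    for c in candidates:
        after = target_variance(np.append(done, c), log_c_target)
        gains.append((before - after) / 10.0 ** c)
    return float(candidates[int(np.argmax(gains))])


done = np.array([17.0, 18.0])              # pilot runs already finished (log10 FLOPs)
candidates = np.array([17.5, 19.0, 20.0])  # runs we could afford next
choice = pick_next(done, candidates, log_c_target=23.0)
print(choice)
```

The largest candidate reduces uncertainty the most in absolute terms, but the mid-sized run wins once cost is counted, which is the kind of trade-off the summary points at.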
Longju Bai, Zhemin Huang, Xingyao Wang et al.
AI agents are expensive and unpredictable: token costs vary wildly (up to 30x difference on the same task), models differ dramatically in efficiency, and even frontier models can't accurately predict their own token usage before running.
This paper analyzes how much AI agents spend on tokens when solving coding tasks. Researchers studied eight frontier LLMs on real-world coding benchmarks and found that agentic tasks consume 1000x more tokens than simpler coding tasks, with huge variability between runs. Surprisingly, spending more tokens doesn't guarantee better results—accuracy often peaks at intermediate costs then plateaus.