ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 11 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (15)

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecture · multimodal · agents

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodal

Mar 23 – Mar 29 (20)

Vega: Learning to Drive with Natural Language Instructions

Mar 26, 2026

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng et al.

Language instructions can guide autonomous driving decisions in real-time, enabling personalized driving behaviors beyond fixed rules—this opens the door to more flexible, user-responsive autonomous systems.

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal · agents · reasoning

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Mar 26, 2026

Zehao Wang, Huaide Jiang, Shuaiwu Dong et al.

Autonomous driving systems can be personalized to match individual driver styles by learning user embeddings from driving data and conditioning the driving policy on these embeddings, enabling more human-centered autonomous vehicles.

This paper presents Drive My Way, a personalized autonomous driving system that learns individual driver preferences and adapts to real-time instructions.

Mar 16 – Mar 22 (25)

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Mar 20, 2026

Xinyi Shang, Yi Tang, Jiacheng Cui et al.

Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.

This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.

evaluation · multimodal · safety
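The shift to pixel-level scoring can be illustrated with a toy metric. This is not the paper's actual metric suite, just a minimal sketch: IoU over binary tamper maps, grouped by a hypothetical edit-type label.

```python
def tamper_iou(pred, gt):
    """Pixel-level IoU between a predicted tamper map and ground truth.

    pred, gt: 2-D lists of 0/1 pixel labels (1 = tampered).
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p or g:
                union += 1
            if p and g:
                inter += 1
    # Both maps empty means the detector correctly flagged nothing.
    return 1.0 if union == 0 else inter / union

def per_type_scores(preds, gts, edit_types):
    """Average IoU grouped by semantic edit type (replace, remove, splice, ...)."""
    scores = {}
    for p, g, t in zip(preds, gts, edit_types):
        scores.setdefault(t, []).append(tamper_iou(p, g))
    return {t: sum(v) / len(v) for t, v in scores.items()}
```

A detector that flags untouched pixels is penalized through the union term, and per-type grouping exposes which edit categories it misses, which is exactly what a coarse object-mask score hides.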

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Mar 20, 2026

Jiazheng Xing, Fei Du, Hangjie Yuan et al.

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

Mar 9 – Mar 15 (9)

Visual-ERM: Reward Modeling for Visual Equivalence

Mar 13, 2026

Ziyu Liu, Shengyuan Ding, Xinyu Fang et al.

Fine-grained visual feedback—comparing what code actually renders versus what it should render—is more effective for training vision-to-code models than text-based or embedding-based rewards, and avoids reward hacking.

This paper introduces Visual-ERM, a reward model that judges the quality of vision-to-code outputs by comparing rendered visuals directly rather than using text rules or embeddings.

multimodal · reasoning

Towards Faithful Multimodal Concept Bottleneck Models

Mar 13, 2026

Pierre Moreau, Emeline Pineau Ferrand, Yann Choho et al.

Concept Bottleneck Models can now work reliably across text and images by jointly addressing concept detection and information leakage—enabling interpretable AI without sacrificing accuracy.

This paper introduces f-CBM, a framework for building interpretable multimodal AI models that make predictions through human-understandable concepts. The key innovation is solving two problems simultaneously: accurately detecting concepts and preventing 'leakage' (where irrelevant information sneaks into predictions).

multimodal · architecture

Feb 23 – Mar 1 (8)

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

Feb 27, 2026

Hainan Xu, Vladimir Bataev, Travis M. Bartley et al.

You can make streaming speech-to-text models faster and more accurate by processing audio in fixed chunks instead of one token at a time.

This paper introduces CHAT, an improved version of RNN-T models for converting speech to text in real-time. By processing audio in small chunks and using a smarter attention mechanism, CHAT runs 1.7x faster during inference, uses 46% less memory during training, and produces more accurate transcriptions—especially for translating speech between languages.

efficiency · architecture · multimodal
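The core efficiency idea, processing fixed chunks instead of single frames, can be sketched in a few lines. This is a toy illustration, not CHAT's architecture: `encode_chunk` is a stand-in for the model's per-chunk attention pass.

```python
def chunks(frames, chunk_size):
    """Split a stream of audio frames into fixed-size chunks (last may be shorter)."""
    for i in range(0, len(frames), chunk_size):
        yield frames[i:i + chunk_size]

def stream_decode(frames, encode_chunk, chunk_size=8):
    """Run one encoder pass per chunk instead of one per frame."""
    transcript = []
    for chunk in chunks(frames, chunk_size):
        # One attention pass amortized over chunk_size frames.
        transcript.extend(encode_chunk(chunk))
    return transcript
```

The speedup comes from amortizing the per-step overhead across each chunk, at the cost of a small, bounded latency of up to one chunk of audio.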

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Feb 26, 2026

Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat et al.

AI image generators can now understand and correctly render partially hidden objects when you specify 3D layouts and camera positions.

This paper solves a key problem in AI image generation: when you ask an AI to create a scene with specific 3D positions and camera angles, it often gets confused about which objects should be hidden behind others. SeeThrough3D adds 'occlusion awareness' by representing objects as transparent 3D boxes, letting the model understand what's visible and what's blocked before generating the final image.

architecture · evaluation

VOID: Video Object and Interaction Deletion

Apr 2, 2026

Saman Motamed, William Harvey, Benjamin Klein et al.

Video editing can be improved by treating it as a physics simulation problem: identify what changes when an object is removed, then use diffusion models guided by causal reasoning to generate realistic results.

VOID removes objects from videos while maintaining realistic physics—like correcting how other objects move or collide after removal. It uses a vision-language model to identify affected regions and a diffusion model to generate physically plausible outcomes, trained on synthetic data where physics interactions are carefully controlled.

multimodal · applications · reasoning

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal · architecture · data

BVFLMSP: Bayesian Vertical Federated Learning for Multimodal Survival with Privacy

Apr 2, 2026

Abhilash Kar, Basisth Saha, Tanmay Sen et al.

This framework enables hospitals and clinics to collaboratively build better survival prediction models without sharing raw patient data, while also quantifying prediction confidence—critical for clinical adoption.

BVFLMSP combines Bayesian neural networks with federated learning to predict survival outcomes from sensitive multimodal data distributed across multiple parties. Each organization keeps its data private while contributing predictions to a shared model, with added privacy protections and uncertainty estimates for more reliable medical decision-making.

safety · multimodal · training

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodal · applications · evaluation

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applications · agents · multimodal

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models struggle inconsistently with detecting misleading visualizations; their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluation · multimodal · applications

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Apr 1, 2026

J. E. Domínguez-Vidal

Florence-2 can now be easily integrated into robot software stacks through a standardized ROS 2 wrapper, enabling local vision-language inference on consumer GPUs without cloud dependencies.

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applications · multimodal · efficiency

NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Apr 1, 2026

Prasanjit Dey, Soumyabrata Dev, Angela Meyer et al.

Hybrid physics-neural models can achieve better accuracy and uncertainty calibration than pure data-driven or physics-based approaches alone, especially for spatiotemporal forecasting with known physical constraints.

NeuroDDAF combines physics-informed modeling with neural networks to forecast air quality by integrating wind-driven transport equations, graph attention for spatial patterns, and uncertainty quantification. It outperforms existing methods on urban datasets while providing reliable confidence estimates for predictions.

reasoning · multimodal · applications

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Mar 30, 2026

Omer Dahary, Benaya Koren, Daniel Garibi et al.

You can increase diversity in generated images by applying repulsion forces in the transformer's attention channels during generation, without expensive optimization or visual artifacts.

This paper tackles the problem of text-to-image diffusion models producing visually similar outputs for the same prompt. The authors propose a method that applies 'repulsion' in the attention mechanism during image generation to encourage diverse outputs while maintaining quality and semantic accuracy.

architecture · efficiency · multimodal
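A toy version of the repulsion idea on a batch of feature vectors: each sample is pushed away from the mean of the others. The actual method operates inside the attention channels of a diffusion transformer during sampling; this plain-Python sketch only illustrates the repulsion step itself.

```python
def repel(batch, strength=0.1):
    """One repulsion step: push each sample's features away from the
    batch mean of the other samples, encouraging diverse outputs."""
    n, dim = len(batch), len(batch[0])
    pushed = []
    for i, vec in enumerate(batch):
        others = [batch[j] for j in range(n) if j != i]
        mean = [sum(o[k] for o in others) / (n - 1) for k in range(dim)]
        # Move along the direction away from the other samples' centroid.
        pushed.append([vec[k] + strength * (vec[k] - mean[k]) for k in range(dim)])
    return pushed
```

Because the push is a cheap vector update applied on the fly, it avoids the per-sample optimization loops that earlier diversity methods relied on.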

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Mar 30, 2026

Anuj Diwan, Eunsol Choi, David Harwath

Specialized models for different types of speech style (speaker traits vs. utterance characteristics) outperform single unified models on individual tasks, but a combined model works better when styles need to be understood together.

ParaSpeechCLAP is a dual-encoder model that learns to match speech audio with text descriptions of speaking style (like pitch, emotion, and texture). It maps both modalities into a shared embedding space, enabling applications like finding similar-sounding speech, classifying speaker characteristics, and improving text-to-speech synthesis without retraining.

multimodal · applications
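Once audio and style text live in a shared embedding space, retrieval reduces to cosine similarity. A minimal sketch, with placeholder vectors standing in for real encoder outputs:

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def rank_styles(speech_emb, text_embs, descriptions):
    """Rank style descriptions by similarity to a speech embedding."""
    order = sorted(range(len(text_embs)),
                   key=lambda i: cosine(speech_emb, text_embs[i]),
                   reverse=True)
    return [descriptions[i] for i in order]
```

The same machinery runs in the other direction (audio retrieval from a text query), which is what makes a dual-encoder design reusable across tasks without retraining.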

See it to Place it: Evolving Macro Placements with Vision-Language Models

Mar 30, 2026

Ikechukwu Uchendu, Swati Goel, Karly Hou et al.

Foundation models trained on visual reasoning can solve specialized engineering problems like chip design without fine-tuning, by framing physical constraints as spatial reasoning tasks.

This paper uses Vision-Language Models to improve chip floorplanning—arranging components on a chip to minimize wiring. The approach, called VeoPlace, treats the chip layout as a visual problem, letting a VLM suggest component placements without any training, then iteratively refines these suggestions. It outperforms existing machine learning methods by up to 32% on standard benchmarks.

applications · reasoning · multimodal

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Mar 30, 2026

Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al.

Video-language models can supervise robot learning directly as reward signals if trained with spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

SOLE-R1 is a video-language model that watches robot videos and reasons about task progress step-by-step to provide reward signals for robot learning. Unlike standard vision-language models, it's designed to handle partial views and changing conditions, preventing robots from gaming the reward system.

reasoning · agents · multimodal

PixelSmile: Toward Fine-Grained Facial Expression Editing

Mar 26, 2026

Jiabin Hua, Hengyuan Xu, Aojie Li et al.

Fine-grained facial expression editing is now possible with precise control and identity preservation by disentangling expression semantics through symmetric joint training and contrastive learning.

PixelSmile is a new method for editing facial expressions in images with fine-grained control. It uses a diffusion model trained with a special technique to separate expression changes from identity, allowing smooth blending between different expressions while keeping a person's identity intact.

multimodal · evaluation

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation · safety · multimodal

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Mar 26, 2026

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero et al.

You can teach vision-language models to understand compositional meaning by focusing on concept-level alignment and preserving fine-grained visual information—without custom data or hurting general performance.

This paper improves how vision-language models learn to understand combinations of concepts (like "red car" vs "blue car") without sacrificing their ability to recognize new objects.

training · multimodal · efficiency

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Mar 26, 2026

Zirui Zhang, Haoyu Dong, Kexin Pei et al.

Cross-modal inconsistencies in multimodal models aren't just failures to hide—they're valuable training signals that, when enforced through cycle consistency, improve reasoning accuracy by up to 7.6 points and reduce systematic biases.

This paper introduces R-C2, a reinforcement learning approach that improves multimodal AI models by enforcing consistency between visual and textual understanding. Instead of ignoring when a model gives contradictory answers for the same concept in different modalities, the method uses these conflicts as training signals.

reasoning · multimodal

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Mar 25, 2026

Xinying Guo, Chenxi Jiang, Hyun Bin Kim et al.

For robotic tasks with visual ambiguity, storing rich multimodal memory with geometric grounding outperforms semantic compression—robots need fine-grained context, not just similarity-based retrieval, to handle non-Markovian decision problems.

Chameleon is a memory system for robots that handles situations where the same visual observation could mean different things depending on what happened before. Instead of storing compressed summaries like most systems, it preserves detailed geometric and visual information to disambiguate confusing situations, enabling robots to make better decisions during long, complex manipulation tasks.

agents · multimodal

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Mar 25, 2026

Qijia He, Xunmei Liu, Hammaad Memon et al.

You can now automatically convert flat images of technical figures into editable, scalable vector graphics—matching GPT-5.2 performance—enabling recovery of lost design source files without manual reconstruction.

VFIG converts rasterized images (PNG, JPEG) of technical diagrams back into editable SVG vector graphics using vision-language models. The team created a 66K dataset of figure-SVG pairs and a two-stage training approach (supervised learning for basic shapes, then reinforcement learning for refinement) to reconstruct complex professional diagrams with high fidelity.

multimodal · training · applications

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Mar 25, 2026

Keliang Li, Yansong Li, Hongze Shen et al.

Giving AI agents control over their visual perception—deciding what to look at and when—significantly improves video reasoning accuracy. This active observation approach works as a plug-and-play upgrade for existing vision-language models.

LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them.

agents · multimodal · reasoning

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Mar 24, 2026

Ufaq Khan, Umair Nawaz, L D M S S Teja et al.

Medical VLMs need explicit training on input validation (checking modality, anatomy, orientation) as a separate safety step before diagnosis, not as an afterthought—current models hallucinate plausible reports even on obviously invalid inputs.

This paper reveals a critical blind spot in medical AI: vision-language models can generate fluent medical reports even when given invalid inputs like wrong body parts or upside-down images. MedObvious is a benchmark of 1,880 tasks testing whether models can catch these basic sanity checks before attempting diagnosis—a step human radiologists do automatically but VLMs currently fail at.

safety · evaluation · multimodal

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Mar 24, 2026

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.

You can make vision-language models faster without losing visual detail by being selective about which attention layers process images—use efficient cross-attention for context and add self-attention layers only when the task complexity demands it.

VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.

efficiency · multimodal · architecture

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Mar 24, 2026

Haoyu Huang, Jinfa Huang, Zhongwei Wan et al.

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations—delivering significant speedups without accuracy loss.

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict what tools the main model will need, allowing expensive operations to be skipped or run in parallel. This yields 1.1–3.35× speedups while maintaining accuracy, addressing a key bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.

efficiency · multimodal · agents

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Mar 24, 2026

Haoran Yuan, Weigang Yi, Zhenyu Zhang et al.

Adding tactile (touch) sensing to video-based robot learning models significantly improves performance on tasks requiring precise force control and contact awareness, without needing separate tactile pretraining.

This paper introduces VTAM, a robot learning system that combines video and touch (tactile) sensing to better understand and perform complex physical tasks.

multimodal · applications

3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Mar 24, 2026

Yiping Chen, Jinpeng Li, Wenyu Ke et al.

This work shows how to scale vision-language models from room-sized scenes to entire cities by handling 3D spatial relationships and introducing a large, quality-controlled urban dataset—essential for building AI systems that understand real-world spatial reasoning.

3DCity-LLM extends multimodal AI models to understand entire city-scale 3D environments, not just individual objects. The system uses a three-part approach to analyze objects, their relationships, and overall scenes, trained on a new dataset of 1.2 million urban scenarios covering tasks from object identification to city planning.

multimodal · applications

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Mar 23, 2026

Ziyi Wang, Xinshun Wang, Shuang Chen et al.

Treating motion as a continuous first-class modality rather than discretizing it enables a single model to handle motion-text-image tasks end-to-end, achieving better performance on cross-modal tasks like describing motion or editing poses from text.

UniMotion is the first unified AI system that understands and generates human motion, text, and images all in one model. Instead of converting motion into discrete tokens (which loses information), it treats motion as a continuous stream like video, using a shared language model backbone with special techniques to align motion with visual and text understanding.

multimodal · architecture

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Mar 23, 2026

Haichao Zhang, Yijiang Li, Shwai He et al.

Pairing dense video prediction models with sparse, semantically-rich vision-language reasoning improves long-horizon forecasting—VLMs provide the 'what' and 'why', while dense models provide the 'how'.

This paper combines two approaches to video prediction: dense frame-by-frame modeling (JEPA) for capturing fine-grained motion, and vision-language models (VLMs) for long-horizon semantic understanding. By using both pathways together, the system predicts future video frames better than either approach alone, especially for complex hand manipulation tasks.

multimodal · reasoning · architecture

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Mar 23, 2026

Haoyu Zhen, Xiaolong Li, Yilin Zhao et al.

Structured reasoning over scene graphs helps language models understand and manipulate spatial relationships more reliably than end-to-end approaches, improving layout editing accuracy by 15-20% over baseline methods.

This paper teaches AI models to edit 3D room layouts based on text instructions by having them reason through scene graphs—structured representations of objects and their spatial relationships. Instead of directly generating new layouts, the model updates a graph representation step-by-step, which helps it maintain spatial consistency and understand how objects relate to each other.

reasoning · multimodal · applications

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Mar 23, 2026

Kelly Cui, Nikhil Prakash, Ayush Raina et al.

Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves spatial understanding tasks.

This paper reveals how vision-language models handle spatial reasoning—understanding where objects are and how they relate to each other. The researchers found that VLMs use two mechanisms: the language model processes spatial relations independently, but the vision encoder is actually the dominant source, encoding object layouts across the entire image including background areas.

multimodal · reasoning · evaluation

Greater accessibility can amplify discrimination in generative AI

Mar 23, 2026

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou et al.

Adding voice to language models doesn't just extend text capabilities—it introduces new bias mechanisms tied to speaker identity cues that amplify discrimination beyond text-only versions, requiring fairness safeguards alongside accessibility improvements.

Voice interfaces on AI chatbots amplify gender discrimination more than text-based versions because speech reveals speaker identity through tone and accent. The research shows these models shift toward gender-stereotyped responses based on voice alone, and surveys reveal users worry about hidden attribute inference.

safety · multimodal · alignment

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Mar 23, 2026

Sashuai Zhou, Qiang Zhou, Junpeng Ma et al.

Fine-grained spatial accuracy in generated images requires explicit spatial reward modeling during training; rule-based spatial checks alone miss complex relationships that vision-language models with grounding can catch.

SpatialReward is a reward model that helps text-to-image AI systems generate images with accurate object positioning and spatial relationships. It breaks down image prompts into specific spatial requirements, uses object detection to verify positions, and applies reasoning to check complex spatial relationships—then feeds this feedback into training to improve image generation quality.

evaluation · multimodal · training
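A rule-style spatial check over detector bounding boxes might look like the following. The relation names and the reward aggregation here are hypothetical simplifications; the paper's pipeline also applies VLM-based reasoning for relationships that simple geometry cannot verify.

```python
def center(box):
    """Center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def check_relation(box_a, relation, box_b):
    """Verify a spatial relation between two detected boxes."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by  # image coordinates: y grows downward
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def spatial_reward(checks):
    """Fraction of decomposed spatial requirements that hold (toy reward)."""
    results = [check_relation(a, r, b) for a, r, b in checks]
    return sum(results) / len(results)
```

Feeding a score like this back as a training reward is what makes the spatial checks "verifiable": each requirement either holds in the generated image or it does not.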

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

Mar 20, 2026

Jianan Huang, Rodolfo V. Valentim, Luca Vassio et al.

By aligning payload embeddings with text-based vulnerability descriptions using contrastive learning, you can reduce shortcut learning and improve how well cybersecurity models generalize to unseen threats.

This paper tackles a major problem in cybersecurity AI: models trained in labs fail in the real world because they learn surface-level patterns instead of genuine security concepts.

training · multimodal · safety

Adaptive Greedy Frame Selection for Long Video Understanding

Mar 20, 2026

Yuning Huang, Fengqing Zhu

By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.

This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires too many frames and tokens. The authors propose a smart frame selection method that picks the most important frames by balancing two goals—relevance to the question asked and diversity of visual content—using a greedy algorithm with theoretical guarantees.

efficiency · multimodal · evaluation
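The relevance-plus-diversity balance resembles maximal-marginal-relevance selection. A sketch under that assumption, with a hypothetical trade-off weight `lam` and precomputed relevance scores and a frame-similarity matrix:

```python
def select_frames(relevance, similarity, budget, lam=0.5):
    """Greedy MMR-style frame selection: pick frames relevant to the
    question while penalizing redundancy with already-selected frames.

    relevance:  per-frame relevance scores
    similarity: pairwise frame-similarity matrix
    budget:     number of frames to keep
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < budget:
        def score(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a tight budget, the redundancy penalty is what keeps the selector from spending its whole budget on near-duplicate frames around a single relevant moment.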

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Mar 20, 2026

Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Robots can now autonomously refine their social interactions by using VLMs to evaluate and improve their own behavior plans, eliminating the need for predefined motions or constant human guidance.

This paper presents CRISP, a framework that lets robots automatically improve their social behaviors by critiquing and replanning their own actions. Using a vision-language model as a virtual social critic, the system generates robot motions, evaluates them for social appropriateness, and iteratively refines them—all without human feedback.

agents · reasoning · multimodal

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Mar 19, 2026

Ziyin Zhang, Zihan Liao, Hang Yu et al.

You can now use smaller, faster embedding models for multilingual search and retrieval without sacrificing quality—F2LLM-v2 offers efficient options for resource-constrained deployments while the largest variant ranks first on major benchmarks.

F2LLM-v2 is a family of multilingual embedding models (80M to 14B parameters) trained on 60 million high-quality samples that support 200+ languages, including underserved low-resource ones. Using matryoshka learning and knowledge distillation, these models achieve top performance on benchmarks while being more efficient than previous LLM-based embeddings.

multimodalefficiencytraining

DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Mar 19, 2026

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.

A single tokenizer can efficiently represent multi-view driving scenes in a way that works for both reconstruction tasks (RGB, depth) and understanding tasks (segmentation, 3D occupancy), making it practical for vision-language-action models in autonomous vehicles.

DriveTok creates a unified tokenizer for autonomous driving that converts multi-view camera images into compact 3D scene tokens. Unlike existing tokenizers designed for single images, it handles multiple camera views efficiently while preserving semantic, geometric, and depth information—enabling better reconstruction and understanding of driving scenes.

multimodalarchitectureapplications

DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Mar 19, 2026

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.

Part-aware 3D generation works better when you explicitly model semantic relationships between parts derived from language, not just their geometry—this enables text descriptions to guide both individual part structure and how parts fit together.

DreamPartGen generates 3D objects from text by understanding them as meaningful parts with semantic relationships. Unlike existing methods that focus only on geometry, this approach jointly models each part's shape and appearance while capturing how parts relate to each other based on the text description, resulting in more coherent and interpretable 3D models.

multimodalarchitecturereasoning

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Mar 19, 2026

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

State space models are a viable and more efficient alternative to vision transformers for vision-language models, challenging the assumption that transformers are necessary for this task.

This paper tests whether state space models (SSMs) can replace vision transformers as the visual backbone in vision-language models. The researchers find that SSM-based vision encoders match or outperform transformer-based encoders on VQA and visual grounding tasks, while using fewer parameters. They also identify instability issues in some backbones and propose fixes to improve robustness.

architecturemultimodalefficiency

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Mar 19, 2026

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.

An LLM's text-only auditory knowledge is a strong predictor of how well it will perform in audio tasks—so you can evaluate audio-language models by testing their audio understanding before building them.

This paper investigates how much knowledge about sound and audio LLMs actually have from their text-only training, and whether this predicts how well they work in audio tasks. Researchers tested different LLMs three ways: directly probing their audio knowledge, having them reason about audio descriptions, and fine-tuning them into full audio-language models.

evaluationmultimodaltraining

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Mar 19, 2026

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah et al.

Vision-language models need explicit metric reasoning to ground spatial language in 3D environments—decomposing queries into semantic and spatial components and combining them probabilistically improves grounding accuracy for robot navigation tasks.

This paper tackles the problem of robots understanding natural language commands that mix semantic meaning with precise spatial measurements, like 'go two meters right of the fridge.' It decomposes such queries into semantic and spatial components and combines them probabilistically to ground the target location in the 3D environment.

multimodalagents

On Optimizing Multimodal Jailbreaks for Spoken Language Models

Mar 19, 2026

Aravind Krishnan, Karolina Stańczak, Dietrich Klakow

Multimodal AI systems need safety defenses that account for attacks across all input modalities together—defending text alone or audio alone isn't enough.

This paper shows that spoken language models (which process both speech and text) can be attacked more effectively by perturbing both modalities simultaneously rather than just one. The researchers developed JAMA, a method that jointly optimizes adversarial text and audio to bypass safety guardrails, achieving 1.5x to 10x higher attack success rates than single-modality attacks.

safetymultimodal

CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Mar 19, 2026

Weilin Chen, Jiahao Rao, Wenhao Wang et al.

Reference-image-driven texturing with instance-level control produces sharper, more artifact-free 3D scene textures than text-based approaches, making it practical for professional 3D scene editing.

CustomTex generates high-quality textures for 3D indoor scenes by taking reference images and applying them to specific objects. Unlike text-based methods, it uses a dual-distillation approach to ensure textures match reference images precisely while maintaining visual quality and avoiding artifacts.

multimodalapplications

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Mar 19, 2026

Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Vision-language models' safety decisions are easily manipulated by semantic cues—they rely on learned associations rather than grounded reasoning about actual danger, which is a critical vulnerability for real-world deployment.

This paper reveals that vision-language models make safety decisions based on surface-level visual and textual cues rather than genuine understanding of dangerous situations. Researchers created a benchmark and steering framework showing that simple changes to how a scene is described or presented can flip safety judgments, exposing a vulnerability in how these models assess risk.

safetymultimodalevaluation

Communication-Efficient and Robust Multi-Modal Federated Learning via Latent-Space Consensus

Mar 19, 2026

Mohamed Badi, Chaouki Ben Issaid, Mehdi Bennis

When building federated systems with multi-modal data, you can align different data types in a shared compressed space using learnable projections, reducing both communication overhead and the need for all devices to use identical architectures.

This paper presents CoMFed, a federated learning system that lets multiple devices train together on different types of data (like video and audio) without sharing raw information. It uses compressed representations and alignment techniques to handle the challenge of different devices having different data types and model structures, while keeping communication costs low.

multimodalefficiency

Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Mar 19, 2026

Yikai Zheng, Xin Ding, Yifan Yang et al.

Decoupling semantic understanding from real-time perception—parsing queries once and matching embeddings continuously—solves the efficiency-accuracy tradeoff in proactive video understanding systems.

Em-Garde is a framework for understanding streaming video that responds to user queries efficiently. Instead of checking every frame, it converts user questions into visual proposals and matches them against the video stream using fast embedding comparisons, achieving better accuracy and speed than existing approaches.
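The propose-match loop reduces to a cheap per-frame comparison once the query has been parsed. A hedged sketch, where the function name, embeddings, and threshold are all illustrative stand-ins for the paper's components:

```python
import numpy as np

def watch_stream(frame_embeddings, proposal, threshold=0.8):
    """Propose-match sketch: the user query is parsed once into a visual
    `proposal` embedding; each streamed frame is then checked with a
    cheap cosine comparison instead of a full VLM call per frame.
    Returns the index of the first matching frame, or None."""
    p = np.asarray(proposal, dtype=float)
    p = p / np.linalg.norm(p)
    for i, f in enumerate(frame_embeddings):
        f = np.asarray(f, dtype=float)
        f = f / np.linalg.norm(f)
        if float(f @ p) >= threshold:
            return i
    return None
```

The expensive semantic step runs once per query; the per-frame cost is a dot product, which is where the efficiency-accuracy tradeoff gets resolved.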

multimodalefficiencyreasoning

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Mar 18, 2026

Jianrui Zhang, Yue Yang, Rohun Tripathi et al.

You can prune half of video tokens across both vision and language components without complex mechanisms, gaining significant speed improvements (62%) while maintaining performance—making video VLMs practical for real-world deployment.

This paper introduces a method to speed up video understanding models by removing redundant visual information. The technique scores and removes 50% of unnecessary visual tokens across the entire model architecture, achieving 62% faster processing with minimal accuracy loss on video question-answering tasks.
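The core pruning step — score tokens, keep the top fraction, preserve their order — can be sketched as follows. The scoring function itself is the paper's contribution and is not reproduced here; `scores` is assumed given:

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of visual tokens,
    preserving their original temporal/spatial order. An illustrative
    sketch; the paper's actual unified spatio-temporal scoring is not
    shown here."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])   # top-k indices, re-sorted to keep order
    return [tokens[i] for i in keep]
```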

efficiencymultimodalarchitecture

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Mar 18, 2026

Kevin Qu, Haozhe Qi, Mihai Dusmanu et al.

By explicitly training vision-language models to reconstruct 3D scene geometry and camera position from video, you can dramatically improve their spatial reasoning and localization abilities without changing the model architecture.

Loc3R-VLM adds 3D spatial understanding to vision-language models by training them on video input with two key objectives: reconstructing the overall scene layout and modeling the camera's viewpoint. This approach helps models better understand where things are located in 3D space and answer questions about scenes from different perspectives, outperforming existing 2D and video-based methods.

multimodalreasoning

LoST: Level of Semantics Tokenization for 3D Shapes

Mar 18, 2026

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero et al.

By tokenizing 3D shapes based on semantic importance rather than spatial detail levels, you can train autoregressive 3D generation models that are 10-1000x more token-efficient while maintaining or improving quality.

LoST is a new way to break down 3D shapes into tokens (small pieces) for AI models to process. Instead of using spatial hierarchies like existing methods, it orders tokens by semantic importance—so early tokens capture the main shape, and later tokens add fine details. This makes 3D generation models much more efficient, using 90-99% fewer tokens than previous approaches.

architectureefficiencymultimodal

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mar 18, 2026

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi et al.

By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.

VideoAtlas is a system for understanding long videos efficiently by representing them as a hierarchical grid that can be zoomed into recursively, rather than converting video to text.

efficiencymultimodalagents

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Mar 17, 2026

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati et al.

For robotics and animation applications, reconstructing cluttered scenes requires not just identifying individual 3D objects but ensuring they physically interact correctly—this work provides both a benchmark dataset and a method that achieves this.

This paper tackles 3D scene reconstruction from single images by introducing MessyKitchens, a dataset of cluttered real-world kitchen scenes with precise object shapes, poses, and contact information.

evaluationmultimodalapplications

SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Mar 17, 2026

Jiongze Yu, Xiangbo Gao, Pooja Verlani et al.

Interactive video processing is now practical: users can control AI video enhancement by editing sparse keyframes, and the system intelligently propagates those edits across the full video sequence.

SparkVSR lets users interactively improve low-quality videos by editing a few keyframes, then automatically applies those improvements across the entire video. Instead of treating video enhancement as a black box, users can manually fix specific frames and the system propagates those corrections while keeping the video grounded in the original motion.

multimodalapplicationsefficiency

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Mar 17, 2026

Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.

Models that accurately perceive audio-visual information often fail at generating contextually appropriate conversational responses, showing that perception and interaction are separate skills that need independent evaluation.

SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics—specifically, identifying who's speaking, knowing when to interrupt, and generating natural interruptions. Testing 12 leading models reveals that understanding what's happening in a conversation doesn't automatically translate to responding appropriately in real dialogue.

evaluationmultimodalagents

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Mar 16, 2026

Yibin Liu, Yaxing Lyu, Daqi Gao et al.

Reinforcement learning can transform passive video understanding models into active task evaluators by training them to generate explicit reasoning about progress toward goals—enabling smaller models to outperform much larger ones on robot manipulation tasks.

This paper introduces PRIMO R1, a 7B video AI model that learns to actively evaluate robot manipulation progress by using reinforcement learning to generate step-by-step reasoning. Unlike standard models that passively recognize what's happening, PRIMO R1 compares current robot states to task goals and predicts failures, achieving better accuracy than much larger models on robotic tasks.

reasoningagentsmultimodal

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Mar 16, 2026

Pengjun Fang, Yingqing He, Yazhou Xing et al.

Using audio examples as conditioning signals instead of text prompts gives you finer control over sound synthesis and avoids the ambiguity problems that come with describing acoustic details in words.

AC-Foley generates realistic sound effects for videos by using reference audio as a guide instead of text descriptions. This solves the problem of text being too vague to describe subtle acoustic details, enabling precise control over sound timbre and quality while supporting zero-shot generation of new sounds.

multimodalapplications

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Mar 12, 2026

Fangfu Liu, Diankun Wu, Jiawei Chi et al.

Test-time training—updating model parameters on-the-fly during inference—enables better spatial reasoning from video by letting the model continuously organize and retain 3D spatial information rather than relying on fixed context windows.

This paper introduces Spatial-TTT, a system that helps AI models understand 3D spaces from continuous video streams by adapting and updating their internal parameters during inference. It combines efficient video processing with a spatial prediction mechanism and specialized training data to maintain accurate spatial understanding over long videos.

architecturereasoningmultimodal

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Mar 12, 2026

Xuanlang Dai, Yujie Zhou, Long Xing et al.

Diffusion models can solve complex reasoning tasks better by having the language encoder think iteratively and update its guidance throughout the generation process, rather than encoding instructions once at the start.

This paper improves how diffusion models solve complex reasoning tasks by making the language model encoder think step-by-step. Instead of encoding instructions once, the system iteratively refines the model's internal reasoning and feeds it progressively to the image generation process, achieving 92% accuracy on spatial reasoning tasks like mazes and puzzles.

reasoningmultimodalarchitecture

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodaldataevaluation

Interpreting Contrastive Embeddings in Specific Domains with Fuzzy Rules

Mar 12, 2026

Javier Fumanal-Idocin, Mohammadreza Jamalifard, Javier Andreu-Perez

CLIP embeddings work well for general tasks, but you need domain-specific interpretation tools like fuzzy rules to understand and improve their performance on specialized text like medical or legal documents.

This paper shows how to interpret what CLIP embeddings learn in specific domains like medical records and film reviews. The researchers use fuzzy rules to map domain-specific features into CLIP's vector space, making it easier to understand which text features matter most for classification tasks in specialized fields.

multimodal

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Mar 12, 2026

Jingyang Ke, Weihan Li, Amartya Pradhan et al.

You can leverage pretrained vision-language models for specialized tasks like animal behavior analysis without fine-tuning—just guide them through explicit reasoning steps and let them work with minimal human labels.

BehaviorVLM uses vision-language models to automatically understand animal behavior and estimate body poses without requiring task-specific training or heavy manual labeling. It combines visual reasoning, temporal analysis, and semantic understanding to identify what animals are doing and where their body parts are, making behavioral neuroscience research more scalable and reproducible.

multimodalapplicationsreasoning

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Mar 12, 2026

Zexuan Yan, Jiarui Jin, Yue Ma et al.

You can improve any text-to-image model's ability to render complex text and formulas without retraining—just add an agentic workflow that guides the generation process using glyph templates.

GlyphBanana solves the problem of generating accurate text and mathematical formulas in images by using an agentic workflow that guides existing text-to-image models. Instead of retraining models, it injects glyph templates into the model's internal representations to iteratively improve text rendering quality.

agentsmultimodalapplications

Linking Perception, Confidence and Accuracy in MLLMs

Mar 12, 2026

Yuetian Du, Yucheng Wang, Rongyu Zhang et al.

Multimodal models suffer from severe confidence miscalibration; training them to be honest about uncertainty and using that uncertainty to trigger verification steps significantly improves both accuracy and reliability.

This paper identifies that multimodal AI models are overconfident—they don't reliably know when they're wrong. The authors propose a training method using image noise pairs and confidence-based rewards to fix this, plus a test-time strategy that uses the model's confidence to decide when to double-check answers. Results show 8.8% accuracy improvements across benchmarks.
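The test-time strategy amounts to gating extra compute on the model's own confidence. A minimal sketch, with `verify` as a hypothetical callable standing in for the paper's double-check step and `threshold` an assumed hyperparameter:

```python
def answer_with_verification(model_answer, verify, confidence, threshold=0.7):
    """Confidence-gated verification sketch: trust the model's first
    answer when its stated confidence is high; otherwise spend extra
    compute on a verification pass."""
    if confidence >= threshold:
        return model_answer
    return verify(model_answer)
```

This only pays off if the confidence signal is calibrated, which is exactly what the paper's noise-pair training with confidence-based rewards targets.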

evaluationtrainingmultimodal

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Feb 26, 2026

Simon Roschmann, Paul Krzakala, Sonia Mazelet et al.

You can align vision and language models with 10-100x less paired training data by leveraging unpaired images and text separately.

This paper shows how to align vision and language models using far fewer paired examples than current methods require. Instead of needing millions of image-text pairs, SOTAlign uses a small set of paired data plus lots of unpaired images and text, employing a technique called optimal transport to learn how the two models relate to each other.
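Optimal transport finds a soft matching between two unpaired sets by minimizing a transport cost. This is a generic sketch of the Sinkhorn machinery such methods build on, not SOTAlign itself; `reg` and the uniform marginals are assumptions:

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropic optimal transport via Sinkhorn iterations: given a cost
    matrix between (say) image and text embeddings, return a soft
    matching (transport plan) with uniform marginals."""
    n, m = cost.shape
    K = np.exp(-cost / reg)                 # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):                  # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # plan[i, j] = mass moved i -> j
```

Low-cost pairs receive most of the mass, which is how unpaired images and texts can be softly matched without explicit supervision.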

multimodaltrainingefficiency

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Feb 26, 2026

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

Using specialized experts for different modalities (speech vs. text), plus a cross-modal expert, yields stronger conversational emotion recognition than forcing one shared model to handle everything.

This paper presents MiSTER-E, a system that recognizes emotions in conversations by combining speech and text information. It uses separate AI experts for speech, text, and cross-modal analysis, then intelligently combines their predictions. The system works on real conversations without needing to know who's speaking, and achieves strong results on standard emotion recognition benchmarks.
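The "intelligently combines their predictions" step is a gated mixture. A hedged sketch of generic mixture-of-experts fusion (the gate weights and expert roles are illustrative, not MiSTER-E's actual fusion):

```python
import numpy as np

def combine_experts(expert_probs, gate_weights):
    """Mixture-of-experts fusion sketch: each expert (e.g. speech, text,
    cross-modal) emits a distribution over emotion classes, and a gating
    vector weights their votes into one fused distribution."""
    expert_probs = np.asarray(expert_probs, dtype=float)  # (E, C)
    w = np.asarray(gate_weights, dtype=float)
    w = w / w.sum()                                       # normalize the gate
    return w @ expert_probs                               # (C,) fused distribution
```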

multimodalarchitectureapplications

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Feb 26, 2026

Sungho Park, Jueun Kim, Wook-Shin Han

Current AI models struggle with real-world table-text reasoning; SPARTA exposes this gap with automatically generated, complex multi-hop questions that span both text and tables.

SPARTA is a benchmark for testing AI models on complex questions that require reasoning across both text and tables together.

evaluationreasoningmultimodal

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Feb 26, 2026

Hyungyung Lee, Hangyul Yoon, Edward Choi

AI medical diagnosis becomes more trustworthy when it shows its evidence instead of just giving answers.

This paper presents CXReasonAgent, a system that helps AI diagnose chest X-rays by combining a language model with specialized medical tools. Instead of just guessing answers like typical AI models, it shows its work by pointing to specific evidence in the image.

agentsmultimodalsafety

MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Feb 26, 2026

Yizhi Li, Xiaohan Chen, Miao Jiang et al.

Combining specialized tools with general AI models beats trying to do everything with one model—especially for long videos where context matters.

MovieTeller automatically creates summaries of full-length movies by breaking the task into stages and using face recognition to keep track of which character is which. Instead of retraining models, it combines existing tools (like face detection) with language models to generate accurate, coherent movie synopses that maintain character identity throughout.

multimodalapplications

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Feb 26, 2026

Junhu Fu, Shuyu Liang, Wutong Li et al.

Synthetic colonoscopy videos can now be generated with enough quality and control to help with doctor training and disease diagnosis in data-scarce medical settings.

ColoDiff generates realistic colonoscopy videos using AI to help doctors train and diagnose intestinal diseases when real patient data is limited. It uses a technique called diffusion to create videos with smooth motion and precise control over medical details like disease type and imaging quality.

multimodalapplicationsdata