ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 11 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (15)

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecture · multimodal · agents

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodal

Mar 23 – Mar 29 (20)

Vega: Learning to Drive with Natural Language Instructions

Mar 26, 2026

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng et al.

Language instructions can guide autonomous driving decisions in real-time, enabling personalized driving behaviors beyond fixed rules—this opens the door to more flexible, user-responsive autonomous systems.

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal · agents · reasoning

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Mar 26, 2026

Zehao Wang, Huaide Jiang, Shuaiwu Dong et al.

Autonomous driving systems can be personalized to match individual driver styles by learning user embeddings from driving data and conditioning the driving policy on these embeddings, enabling more human-centered autonomous vehicles.

This paper presents Drive My Way, a personalized autonomous driving system that learns individual driver preferences and adapts to real-time instructions.

Mar 16 – Mar 22 (25)

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Mar 20, 2026

Xinyi Shang, Yi Tang, Jiacheng Cui et al.

Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.

This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.

evaluation · multimodal · safety
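The shift to pixel-level scoring can be illustrated with a toy metric. This is not the paper's actual metric suite, just a minimal sketch: IoU over binary tamper maps, grouped by a hypothetical edit-type label.

```python
def tamper_iou(pred, gt):
    """Pixel-level IoU between a predicted tamper map and ground truth.

    pred, gt: 2-D lists of 0/1 pixel labels (1 = tampered).
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p or g:
                union += 1
            if p and g:
                inter += 1
    # Both maps empty means the detector correctly flagged nothing.
    return 1.0 if union == 0 else inter / union

def per_type_scores(preds, gts, edit_types):
    """Average IoU grouped by semantic edit type (replace, remove, splice, ...)."""
    scores = {}
    for p, g, t in zip(preds, gts, edit_types):
        scores.setdefault(t, []).append(tamper_iou(p, g))
    return {t: sum(v) / len(v) for t, v in scores.items()}
```

A detector that flags untouched pixels is penalized through the union term, and per-type grouping exposes which edit categories it misses, which is exactly what a coarse object-mask score hides.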

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Mar 20, 2026

Jiazheng Xing, Fei Du, Hangjie Yuan et al.

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

Mar 9 – Mar 15 (9)

Visual-ERM: Reward Modeling for Visual Equivalence

Mar 13, 2026

Ziyu Liu, Shengyuan Ding, Xinyu Fang et al.

Fine-grained visual feedback—comparing what code actually renders versus what it should render—is more effective for training vision-to-code models than text-based or embedding-based rewards, and avoids reward hacking.

This paper introduces Visual-ERM, a reward model that judges the quality of vision-to-code outputs by comparing rendered visuals directly rather than using text rules or embeddings.

multimodal · reasoning

Towards Faithful Multimodal Concept Bottleneck Models

Mar 13, 2026

Pierre Moreau, Emeline Pineau Ferrand, Yann Choho et al.

Concept Bottleneck Models can now work reliably across text and images by jointly addressing concept detection and information leakage—enabling interpretable AI without sacrificing accuracy.

This paper introduces f-CBM, a framework for building interpretable multimodal AI models that make predictions through human-understandable concepts. The key innovation is solving two problems simultaneously: accurately detecting concepts and preventing 'leakage' (where irrelevant information sneaks into predictions).

multimodal · architecture

Feb 23 – Mar 1 (8)

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

Feb 27, 2026

Hainan Xu, Vladimir Bataev, Travis M. Bartley et al.

You can make streaming speech-to-text models faster and more accurate by processing audio in fixed chunks instead of one token at a time.

This paper introduces CHAT, an improved version of RNN-T models for converting speech to text in real-time. By processing audio in small chunks and using a smarter attention mechanism, CHAT runs 1.7x faster during inference, uses 46% less memory during training, and produces more accurate transcriptions—especially for translating speech between languages.

efficiency · architecture · multimodal
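The core efficiency idea, processing fixed chunks instead of single frames, can be sketched in a few lines. This is a toy illustration, not CHAT's architecture: `encode_chunk` is a stand-in for the model's per-chunk attention pass.

```python
def chunks(frames, chunk_size):
    """Split a stream of audio frames into fixed-size chunks (last may be shorter)."""
    for i in range(0, len(frames), chunk_size):
        yield frames[i:i + chunk_size]

def stream_decode(frames, encode_chunk, chunk_size=8):
    """Run one encoder pass per chunk instead of one per frame."""
    transcript = []
    for chunk in chunks(frames, chunk_size):
        # One attention pass amortized over chunk_size frames.
        transcript.extend(encode_chunk(chunk))
    return transcript
```

The speedup comes from amortizing the per-step overhead across each chunk, at the cost of a small, bounded latency of up to one chunk of audio.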

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Feb 26, 2026

Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat et al.

AI image generators can now understand and correctly render partially hidden objects when you specify 3D layouts and camera positions.

This paper solves a key problem in AI image generation: when you ask an AI to create a scene with specific 3D positions and camera angles, it often gets confused about which objects should be hidden behind others. SeeThrough3D adds 'occlusion awareness' by representing objects as transparent 3D boxes, letting the model understand what's visible and what's blocked before generating the final image.

architecture · evaluation

VOID: Video Object and Interaction Deletion

Apr 2, 2026

Saman Motamed, William Harvey, Benjamin Klein et al.

Video editing can be improved by treating it as a physics simulation problem: identify what changes when an object is removed, then use diffusion models guided by causal reasoning to generate realistic results.

VOID removes objects from videos while maintaining realistic physics—like correcting how other objects move or collide after removal. It uses a vision-language model to identify affected regions and a diffusion model to generate physically plausible outcomes, trained on synthetic data where physics interactions are carefully controlled.

multimodal · applications · reasoning

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal · architecture · data

BVFLMSP: Bayesian Vertical Federated Learning for Multimodal Survival with Privacy

Apr 2, 2026

Abhilash Kar, Basisth Saha, Tanmay Sen et al.

This framework enables hospitals and clinics to collaboratively build better survival prediction models without sharing raw patient data, while also quantifying prediction confidence—critical for clinical adoption.

BVFLMSP combines Bayesian neural networks with federated learning to predict survival outcomes from sensitive multimodal data distributed across multiple parties. Each organization keeps its data private while contributing predictions to a shared model, with added privacy protections and uncertainty estimates for more reliable medical decision-making.

safety · multimodal · training

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodal · applications · evaluation

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applications · agents · multimodal

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models struggle inconsistently with detecting misleading visualizations; their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluation · multimodal · applications

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Apr 1, 2026

J. E. Domínguez-Vidal

Florence-2 can now be easily integrated into robot software stacks through a standardized ROS 2 wrapper, enabling local vision-language inference on consumer GPUs without cloud dependencies.

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applications · multimodal · efficiency

NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Apr 1, 2026

Prasanjit Dey, Soumyabrata Dev, Angela Meyer et al.

Hybrid physics-neural models can achieve better accuracy and uncertainty calibration than pure data-driven or physics-based approaches alone, especially for spatiotemporal forecasting with known physical constraints.

NeuroDDAF combines physics-informed modeling with neural networks to forecast air quality by integrating wind-driven transport equations, graph attention for spatial patterns, and uncertainty quantification. It outperforms existing methods on urban datasets while providing reliable confidence estimates for predictions.

reasoning · multimodal · applications

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Mar 30, 2026

Omer Dahary, Benaya Koren, Daniel Garibi et al.

You can increase diversity in generated images by applying repulsion forces in the transformer's attention channels during generation, without expensive optimization or visual artifacts.

This paper tackles the problem of text-to-image diffusion models producing visually similar outputs for the same prompt. The authors propose a method that applies 'repulsion' in the attention mechanism during image generation to encourage diverse outputs while maintaining quality and semantic accuracy.

architecture · efficiency · multimodal
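A toy version of the repulsion idea on a batch of feature vectors: each sample is pushed away from the mean of the others. The actual method operates inside the attention channels of a diffusion transformer during sampling; this plain-Python sketch only illustrates the repulsion step itself.

```python
def repel(batch, strength=0.1):
    """One repulsion step: push each sample's features away from the
    batch mean of the other samples, encouraging diverse outputs."""
    n, dim = len(batch), len(batch[0])
    pushed = []
    for i, vec in enumerate(batch):
        others = [batch[j] for j in range(n) if j != i]
        mean = [sum(o[k] for o in others) / (n - 1) for k in range(dim)]
        # Move along the direction away from the other samples' centroid.
        pushed.append([vec[k] + strength * (vec[k] - mean[k]) for k in range(dim)])
    return pushed
```

Because the push is a cheap vector update applied on the fly, it avoids the per-sample optimization loops that earlier diversity methods relied on.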

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Mar 30, 2026

Anuj Diwan, Eunsol Choi, David Harwath

Specialized models for different types of speech style (speaker traits vs. utterance characteristics) outperform single unified models on individual tasks, but a combined model works better when styles need to be understood together.

ParaSpeechCLAP is a dual-encoder model that learns to match speech audio with text descriptions of speaking style (like pitch, emotion, and texture). It maps both modalities into a shared embedding space, enabling applications like finding similar-sounding speech, classifying speaker characteristics, and improving text-to-speech synthesis without retraining.

multimodal · applications
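Once audio and style text live in a shared embedding space, retrieval reduces to cosine similarity. A minimal sketch, with placeholder vectors standing in for real encoder outputs:

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def rank_styles(speech_emb, text_embs, descriptions):
    """Rank style descriptions by similarity to a speech embedding."""
    order = sorted(range(len(text_embs)),
                   key=lambda i: cosine(speech_emb, text_embs[i]),
                   reverse=True)
    return [descriptions[i] for i in order]
```

The same machinery runs in the other direction (audio retrieval from a text query), which is what makes a dual-encoder design reusable across tasks without retraining.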

See it to Place it: Evolving Macro Placements with Vision-Language Models

Mar 30, 2026

Ikechukwu Uchendu, Swati Goel, Karly Hou et al.

Foundation models trained on visual reasoning can solve specialized engineering problems like chip design without fine-tuning, by framing physical constraints as spatial reasoning tasks.

This paper uses Vision-Language Models to improve chip floorplanning—arranging components on a chip to minimize wiring. The approach, called VeoPlace, treats the chip layout as a visual problem, letting a VLM suggest component placements without any training, then iteratively refines these suggestions. It outperforms existing machine learning methods by up to 32% on standard benchmarks.

applications · reasoning · multimodal

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Mar 30, 2026

Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al.

Video-language models can supervise robot learning directly as reward signals if trained with spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

SOLE-R1 is a video-language model that watches robot videos and reasons about task progress step-by-step to provide reward signals for robot learning. Unlike standard vision-language models, it's designed to handle partial views and changing conditions, preventing robots from gaming the reward system.

reasoning · agents · multimodal

PixelSmile: Toward Fine-Grained Facial Expression Editing

Mar 26, 2026

Jiabin Hua, Hengyuan Xu, Aojie Li et al.

Fine-grained facial expression editing is now possible with precise control and identity preservation by disentangling expression semantics through symmetric joint training and contrastive learning.

PixelSmile is a new method for editing facial expressions in images with fine-grained control. It uses a diffusion model trained with a special technique to separate expression changes from identity, allowing smooth blending between different expressions while keeping a person's identity intact.

multimodal · evaluation

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation · safety · multimodal

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Mar 26, 2026

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero et al.

You can teach vision-language models to understand compositional meaning by focusing on concept-level alignment and preserving fine-grained visual information—without custom data or hurting general performance.

This paper improves how vision-language models learn to understand combinations of concepts (like "red car" vs "blue car") without sacrificing their ability to recognize new objects.

training · multimodal · efficiency

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Mar 26, 2026

Zirui Zhang, Haoyu Dong, Kexin Pei et al.

Cross-modal inconsistencies in multimodal models aren't just failures to hide—they're valuable training signals that, when enforced through cycle consistency, improve reasoning accuracy by up to 7.6 points and reduce systematic biases.

This paper introduces R-C2, a reinforcement learning approach that improves multimodal AI models by enforcing consistency between visual and textual understanding. Instead of ignoring when a model gives contradictory answers for the same concept in different modalities, the method uses these conflicts as training signals.

reasoning · multimodal

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Mar 25, 2026

Xinying Guo, Chenxi Jiang, Hyun Bin Kim et al.

For robotic tasks with visual ambiguity, storing rich multimodal memory with geometric grounding outperforms semantic compression—robots need fine-grained context, not just similarity-based retrieval, to handle non-Markovian decision problems.

Chameleon is a memory system for robots that handles situations where the same visual observation could mean different things depending on what happened before. Instead of storing compressed summaries like most systems, it preserves detailed geometric and visual information to disambiguate confusing situations, enabling robots to make better decisions during long, complex manipulation tasks.

agents · multimodal

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Mar 25, 2026

Qijia He, Xunmei Liu, Hammaad Memon et al.

You can now automatically convert flat images of technical figures into editable, scalable vector graphics—matching GPT-5.2 performance—enabling recovery of lost design source files without manual reconstruction.

VFIG converts rasterized images (PNG, JPEG) of technical diagrams back into editable SVG vector graphics using vision-language models. The team created a 66K dataset of figure-SVG pairs and a two-stage training approach (supervised learning for basic shapes, then reinforcement learning for refinement) to reconstruct complex professional diagrams with high fidelity.

multimodal · training · applications

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Mar 25, 2026

Keliang Li, Yansong Li, Hongze Shen et al.

Giving AI agents control over their visual perception—deciding what to look at and when—significantly improves video reasoning accuracy. This active observation approach works as a plug-and-play upgrade for existing vision-language models.

LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them.

agents · multimodal · reasoning

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Mar 24, 2026

Ufaq Khan, Umair Nawaz, L D M S S Teja et al.

Medical VLMs need explicit training on input validation (checking modality, anatomy, orientation) as a separate safety step before diagnosis, not as an afterthought—current models hallucinate plausible reports even on obviously invalid inputs.

This paper reveals a critical blind spot in medical AI: vision-language models can generate fluent medical reports even when given invalid inputs like wrong body parts or upside-down images. MedObvious is a benchmark of 1,880 tasks testing whether models can catch these basic sanity checks before attempting diagnosis—a step human radiologists do automatically but VLMs currently fail at.

safety · evaluation · multimodal

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Mar 24, 2026

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.

You can make vision-language models faster without losing visual detail by being selective about which attention layers process images—use efficient cross-attention for context and add self-attention layers only when the task complexity demands it.

VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.

efficiency · multimodal · architecture

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Mar 24, 2026

Haoyu Huang, Jinfa Huang, Zhongwei Wan et al.

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations—delivering significant speedups without accuracy loss.

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict what tools the main model will need, allowing expensive operations to be skipped or run in parallel. This yields 1.1–3.35× speedups while maintaining accuracy, addressing a key bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.

efficiency · multimodal · agents

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Mar 24, 2026

Haoran Yuan, Weigang Yi, Zhenyu Zhang et al.

Adding tactile (touch) sensing to video-based robot learning models significantly improves performance on tasks requiring precise force control and contact awareness, without needing separate tactile pretraining.

This paper introduces VTAM, a robot learning system that combines video and touch (tactile) sensing to better understand and perform complex physical tasks.

multimodal · applications

3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Mar 24, 2026

Yiping Chen, Jinpeng Li, Wenyu Ke et al.

This work shows how to scale vision-language models from room-sized scenes to entire cities by handling 3D spatial relationships and introducing a large, quality-controlled urban dataset—essential for building AI systems that understand real-world spatial reasoning.

3DCity-LLM extends multimodal AI models to understand entire city-scale 3D environments, not just individual objects. The system uses a three-part approach to analyze objects, their relationships, and overall scenes, trained on a new dataset of 1.2 million urban scenarios covering tasks from object identification to city planning.

multimodal · applications

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Mar 23, 2026

Ziyi Wang, Xinshun Wang, Shuang Chen et al.

Treating motion as a continuous first-class modality rather than discretizing it enables a single model to handle motion-text-image tasks end-to-end, achieving better performance on cross-modal tasks like describing motion or editing poses from text.

UniMotion is the first unified AI system that understands and generates human motion, text, and images all in one model. Instead of converting motion into discrete tokens (which loses information), it treats motion as a continuous stream like video, using a shared language model backbone with special techniques to align motion with visual and text understanding.

multimodal · architecture

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Mar 23, 2026

Haichao Zhang, Yijiang Li, Shwai He et al.

Pairing dense video prediction models with sparse, semantically-rich vision-language reasoning improves long-horizon forecasting—VLMs provide the 'what' and 'why', while dense models provide the 'how'.

This paper combines two approaches to video prediction: dense frame-by-frame modeling (JEPA) for capturing fine-grained motion, and vision-language models (VLMs) for long-horizon semantic understanding. By using both pathways together, the system predicts future video frames better than either approach alone, especially for complex hand manipulation tasks.

multimodal · reasoning · architecture

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Mar 23, 2026

Haoyu Zhen, Xiaolong Li, Yilin Zhao et al.

Structured reasoning over scene graphs helps language models understand and manipulate spatial relationships more reliably than end-to-end approaches, improving layout editing accuracy by 15-20% over baseline methods.

This paper teaches AI models to edit 3D room layouts based on text instructions by having them reason through scene graphs—structured representations of objects and their spatial relationships. Instead of directly generating new layouts, the model updates a graph representation step-by-step, which helps it maintain spatial consistency and understand how objects relate to each other.

reasoning · multimodal · applications

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Mar 23, 2026

Kelly Cui, Nikhil Prakash, Ayush Raina et al.

Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves spatial understanding tasks.

This paper reveals how vision-language models handle spatial reasoning—understanding where objects are and how they relate to each other. The researchers found that VLMs use two mechanisms: the language model processes spatial relations independently, but the vision encoder is actually the dominant source, encoding object layouts across the entire image including background areas.

multimodal · reasoning · evaluation

Greater accessibility can amplify discrimination in generative AI

Mar 23, 2026

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou et al.

Adding voice to language models doesn't just extend text capabilities—it introduces new bias mechanisms tied to speaker identity cues that amplify discrimination beyond text-only versions, requiring fairness safeguards alongside accessibility improvements.

Voice interfaces on AI chatbots amplify gender discrimination more than text-based versions because speech reveals speaker identity through tone and accent. The research shows these models shift toward gender-stereotyped responses based on voice alone, and surveys reveal users worry about hidden attribute inference.

safety · multimodal · alignment

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Mar 23, 2026

Sashuai Zhou, Qiang Zhou, Junpeng Ma et al.

Fine-grained spatial accuracy in generated images requires explicit spatial reward modeling during training; rule-based spatial checks alone miss complex relationships that vision-language models with grounding can catch.

SpatialReward is a reward model that helps text-to-image AI systems generate images with accurate object positioning and spatial relationships. It breaks down image prompts into specific spatial requirements, uses object detection to verify positions, and applies reasoning to check complex spatial relationships—then feeds this feedback into training to improve image generation quality.

evaluation · multimodal · training
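A rule-style spatial check over detector bounding boxes might look like the following. The relation names and the reward aggregation here are hypothetical simplifications; the paper's pipeline also applies VLM-based reasoning for relationships that simple geometry cannot verify.

```python
def center(box):
    """Center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def check_relation(box_a, relation, box_b):
    """Verify a spatial relation between two detected boxes."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by  # image coordinates: y grows downward
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def spatial_reward(checks):
    """Fraction of decomposed spatial requirements that hold (toy reward)."""
    results = [check_relation(a, r, b) for a, r, b in checks]
    return sum(results) / len(results)
```

Feeding a score like this back as a training reward is what makes the spatial checks "verifiable": each requirement either holds in the generated image or it does not.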

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

Mar 20, 2026

Jianan Huang, Rodolfo V. Valentim, Luca Vassio et al.

By aligning payload embeddings with text-based vulnerability descriptions using contrastive learning, you can reduce shortcut learning and improve how well cybersecurity models generalize to unseen threats.

This paper tackles a major problem in cybersecurity AI: models trained in labs fail in the real world because they learn surface-level patterns instead of genuine security concepts.

training · multimodal · safety

Adaptive Greedy Frame Selection for Long Video Understanding

Mar 20, 2026

Yuning Huang, Fengqing Zhu

By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.

This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires too many frames and tokens. The authors propose a smart frame selection method that picks the most important frames by balancing two goals—relevance to the question asked and diversity of visual content—using a greedy algorithm with theoretical guarantees.

efficiency · multimodal · evaluation
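The relevance-plus-diversity balance resembles maximal-marginal-relevance selection. A sketch under that assumption, with a hypothetical trade-off weight `lam` and precomputed relevance scores and a frame-similarity matrix:

```python
def select_frames(relevance, similarity, budget, lam=0.5):
    """Greedy MMR-style frame selection: pick frames relevant to the
    question while penalizing redundancy with already-selected frames.

    relevance:  per-frame relevance scores
    similarity: pairwise frame-similarity matrix
    budget:     number of frames to keep
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < budget:
        def score(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a tight budget, the redundancy penalty is what keeps the selector from spending its whole budget on near-duplicate frames around a single relevant moment.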

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Mar 20, 2026

Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Robots can now autonomously refine their social interactions by using VLMs to evaluate and improve their own behavior plans, eliminating the need for predefined motions or constant human guidance.

This paper presents CRISP, a framework that lets robots automatically improve their social behaviors by critiquing and replanning their own actions. Using a vision-language model as a virtual social critic, the system generates robot motions, evaluates them for social appropriateness, and iteratively refines them—all without human feedback.

agents · reasoning · multimodal

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Mar 19, 2026

Ziyin Zhang, Zihan Liao, Hang Yu et al.

You can now use smaller, faster embedding models for multilingual search and retrieval without sacrificing quality—F2LLM-v2 offers efficient options for resource-constrained deployments while the largest variant ranks first on major benchmarks.

F2LLM-v2 is a family of multilingual embedding models (80M to 14B parameters) trained on 60 million high-quality samples that support 200+ languages, including underserved low-resource ones. Using matryoshka learning and knowledge distillation, these models achieve top performance on benchmarks while being more efficient than previous LLM-based embeddings.

multimodalefficiencytraining

DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Mar 19, 2026

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.

A single tokenizer can efficiently represent multi-view driving scenes in a way that works for both reconstruction tasks (RGB, depth) and understanding tasks (segmentation, 3D occupancy), making it practical for vision-language-action models in autonomous vehicles.

DriveTok creates a unified tokenizer for autonomous driving that converts multi-view camera images into compact 3D scene tokens. Unlike existing tokenizers designed for single images, it handles multiple camera views efficiently while preserving semantic, geometric, and depth information—enabling better reconstruction and understanding of driving scenes.

multimodalarchitectureapplications

DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Mar 19, 2026

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.

Part-aware 3D generation works better when you explicitly model semantic relationships between parts derived from language, not just their geometry—this enables text descriptions to guide both individual part structure and how parts fit together.

DreamPartGen generates 3D objects from text by understanding them as meaningful parts with semantic relationships. Unlike existing methods that focus only on geometry, this approach jointly models each part's shape and appearance while capturing how parts relate to each other based on the text description, resulting in more coherent and interpretable 3D models.

multimodalarchitecturereasoning

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Mar 19, 2026

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

State space models are a viable and more efficient alternative to vision transformers for vision-language models, challenging the assumption that transformers are necessary for this task.

This paper tests whether state space models (SSMs) can replace vision transformers as the visual backbone in vision-language models. The researchers find that SSM-based vision encoders match or outperform transformer-based encoders on VQA and visual grounding tasks, while using fewer parameters. They also identify instability issues in some backbones and propose fixes to improve robustness.

architecturemultimodalefficiency

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Mar 19, 2026

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.

An LLM's text-only auditory knowledge is a strong predictor of how well it will perform in audio tasks—so you can evaluate audio-language models by testing their audio understanding before building them.

This paper investigates how much knowledge about sound and audio LLMs actually have from their text-only training, and whether this predicts how well they work in audio tasks. Researchers tested different LLMs three ways: directly probing their audio knowledge, having them reason about audio descriptions, and fine-tuning them into full audio-language models.

evaluationmultimodaltraining

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Mar 19, 2026

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah et al.

Vision-language models need explicit metric reasoning to ground spatial language in 3D environments—decomposing queries into semantic and spatial components and combining them probabilistically improves grounding accuracy for robot navigation tasks.

This paper tackles the problem of robots understanding natural language commands that mix semantic meaning with precise spatial measurements, like 'go two meters right of the fridge.' It decomposes such queries into semantic and spatial components and combines them probabilistically to ground the target location in the 3D environment.

multimodalagents

On Optimizing Multimodal Jailbreaks for Spoken Language Models

Mar 19, 2026

Aravind Krishnan, Karolina Stańczak, Dietrich Klakow

Multimodal AI systems need safety defenses that account for attacks across all input modalities together—defending text alone or audio alone isn't enough.

This paper shows that spoken language models (which process both speech and text) can be attacked more effectively by perturbing both modalities simultaneously rather than just one. The researchers developed JAMA, a method that jointly optimizes adversarial text and audio to bypass safety guardrails, achieving 1.5x to 10x higher attack success rates than single-modality attacks.

safetymultimodal

CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Mar 19, 2026

Weilin Chen, Jiahao Rao, Wenhao Wang et al.

Reference-image-driven texturing with instance-level control produces sharper, more artifact-free 3D scene textures than text-based approaches, making it practical for professional 3D scene editing.

CustomTex generates high-quality textures for 3D indoor scenes by taking reference images and applying them to specific objects. Unlike text-based methods, it uses a dual-distillation approach to ensure textures match reference images precisely while maintaining visual quality and avoiding artifacts.

multimodalapplications

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Mar 19, 2026

Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Vision-language models' safety decisions are easily manipulated by semantic cues—they rely on learned associations rather than grounded reasoning about actual danger, which is a critical vulnerability for real-world deployment.

This paper reveals that vision-language models make safety decisions based on surface-level visual and textual cues rather than genuine understanding of dangerous situations. Researchers created a benchmark and steering framework showing that simple changes to how a scene is described or presented can flip safety judgments, exposing a vulnerability in how these models assess risk.

safetymultimodalevaluation

Communication-Efficient and Robust Multi-Modal Federated Learning via Latent-Space Consensus

Mar 19, 2026

Mohamed Badi, Chaouki Ben Issaid, Mehdi Bennis

When building federated systems with multi-modal data, you can align different data types in a shared compressed space using learnable projections, reducing both communication overhead and the need for all devices to use identical architectures.

This paper presents CoMFed, a federated learning system that lets multiple devices train together on different types of data (like video and audio) without sharing raw information. It uses compressed representations and alignment techniques to handle the challenge of different devices having different data types and model structures, while keeping communication costs low.

multimodalefficiency

Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Mar 19, 2026

Yikai Zheng, Xin Ding, Yifan Yang et al.

Decoupling semantic understanding from real-time perception—parsing queries once and matching embeddings continuously—solves the efficiency-accuracy tradeoff in proactive video understanding systems.

Em-Garde is a framework for understanding streaming video that responds to user queries efficiently. Instead of checking every frame, it converts user questions into visual proposals and matches them against the video stream using fast embedding comparisons, achieving better accuracy and speed than existing approaches.
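The propose-match loop reduces to a cheap per-frame comparison once the query has been parsed. A hedged sketch, where the function name, embeddings, and threshold are all illustrative stand-ins for the paper's components:

```python
import numpy as np

def watch_stream(frame_embeddings, proposal, threshold=0.8):
    """Propose-match sketch: the user query is parsed once into a visual
    `proposal` embedding; each streamed frame is then checked with a
    cheap cosine comparison instead of a full VLM call per frame.
    Returns the index of the first matching frame, or None."""
    p = np.asarray(proposal, dtype=float)
    p = p / np.linalg.norm(p)
    for i, f in enumerate(frame_embeddings):
        f = np.asarray(f, dtype=float)
        f = f / np.linalg.norm(f)
        if float(f @ p) >= threshold:
            return i
    return None
```

The expensive semantic step runs once per query; the per-frame cost is a dot product, which is where the efficiency-accuracy tradeoff gets resolved.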

multimodalefficiencyreasoning

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Mar 18, 2026

Jianrui Zhang, Yue Yang, Rohun Tripathi et al.

You can prune half of video tokens across both vision and language components without complex mechanisms, gaining significant speed improvements (62%) while maintaining performance—making video VLMs practical for real-world deployment.

This paper introduces a method to speed up video understanding models by removing redundant visual information. The technique scores and removes 50% of unnecessary visual tokens across the entire model architecture, achieving 62% faster processing with minimal accuracy loss on video question-answering tasks.
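The core pruning step — score tokens, keep the top fraction, preserve their order — can be sketched as follows. The scoring function itself is the paper's contribution and is not reproduced here; `scores` is assumed given:

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of visual tokens,
    preserving their original temporal/spatial order. An illustrative
    sketch; the paper's actual unified spatio-temporal scoring is not
    shown here."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])   # top-k indices, re-sorted to keep order
    return [tokens[i] for i in keep]
```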

efficiencymultimodalarchitecture

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Mar 18, 2026

Kevin Qu, Haozhe Qi, Mihai Dusmanu et al.

By explicitly training vision-language models to reconstruct 3D scene geometry and camera position from video, you can dramatically improve their spatial reasoning and localization abilities without changing the model architecture.

Loc3R-VLM adds 3D spatial understanding to vision-language models by training them on video input with two key objectives: reconstructing the overall scene layout and modeling the camera's viewpoint. This approach helps models better understand where things are located in 3D space and answer questions about scenes from different perspectives, outperforming existing 2D and video-based methods.

multimodalreasoning

LoST: Level of Semantics Tokenization for 3D Shapes

Mar 18, 2026

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero et al.

By tokenizing 3D shapes based on semantic importance rather than spatial detail levels, you can train autoregressive 3D generation models that are 10-1000x more token-efficient while maintaining or improving quality.

LoST is a new way to break down 3D shapes into tokens (small pieces) for AI models to process. Instead of using spatial hierarchies like existing methods, it orders tokens by semantic importance—so early tokens capture the main shape, and later tokens add fine details. This makes 3D generation models much more efficient, using 90-99% fewer tokens than previous approaches.

architectureefficiencymultimodal

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mar 18, 2026

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi et al.

By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.

VideoAtlas is a system for understanding long videos efficiently by representing them as a hierarchical grid that can be zoomed into recursively, rather than converting video to text.

efficiencymultimodalagents

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Mar 17, 2026

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati et al.

For robotics and animation applications, reconstructing cluttered scenes requires not just identifying individual 3D objects but ensuring they physically interact correctly—this work provides both a benchmark dataset and a method that achieves this.

This paper tackles 3D scene reconstruction from single images by introducing MessyKitchens, a dataset of cluttered real-world kitchen scenes with precise object shapes, poses, and contact information.

evaluationmultimodalapplications

SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Mar 17, 2026

Jiongze Yu, Xiangbo Gao, Pooja Verlani et al.

Interactive video processing is now practical: users can control AI video enhancement by editing sparse keyframes, and the system intelligently propagates those edits across the full video sequence.

SparkVSR lets users interactively improve low-quality videos by editing a few keyframes, then automatically applies those improvements across the entire video. Instead of treating video enhancement as a black box, users can manually fix specific frames and the system propagates those corrections while keeping the video grounded in the original motion.

multimodalapplicationsefficiency

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Mar 17, 2026

Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.

Models that accurately perceive audio-visual information often fail at generating contextually appropriate conversational responses, showing that perception and interaction are separate skills that need independent evaluation.

SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics—specifically, identifying who's speaking, knowing when to interrupt, and generating natural interruptions. Testing 12 leading models reveals that understanding what's happening in a conversation doesn't automatically translate to responding appropriately in real dialogue.

evaluationmultimodalagents

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Mar 16, 2026

Yibin Liu, Yaxing Lyu, Daqi Gao et al.

Reinforcement learning can transform passive video understanding models into active task evaluators by training them to generate explicit reasoning about progress toward goals—enabling smaller models to outperform much larger ones on robot manipulation tasks.

This paper introduces PRIMO R1, a 7B video AI model that learns to actively evaluate robot manipulation progress by using reinforcement learning to generate step-by-step reasoning. Unlike standard models that passively recognize what's happening, PRIMO R1 compares current robot states to task goals and predicts failures, achieving better accuracy than much larger models on robotic tasks.

reasoningagentsmultimodal

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Mar 16, 2026

Pengjun Fang, Yingqing He, Yazhou Xing et al.

Using audio examples as conditioning signals instead of text prompts gives you finer control over sound synthesis and avoids the ambiguity problems that come with describing acoustic details in words.

AC-Foley generates realistic sound effects for videos by using reference audio as a guide instead of text descriptions. This solves the problem of text being too vague to describe subtle acoustic details, enabling precise control over sound timbre and quality while supporting zero-shot generation of new sounds.

multimodalapplications

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Mar 12, 2026

Fangfu Liu, Diankun Wu, Jiawei Chi et al.

Test-time training—updating model parameters on-the-fly during inference—enables better spatial reasoning from video by letting the model continuously organize and retain 3D spatial information rather than relying on fixed context windows.

This paper introduces Spatial-TTT, a system that helps AI models understand 3D spaces from continuous video streams by adapting and updating their internal parameters during inference. It combines efficient video processing with a spatial prediction mechanism and specialized training data to maintain accurate spatial understanding over long videos.

architecturereasoningmultimodal

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Mar 12, 2026

Xuanlang Dai, Yujie Zhou, Long Xing et al.

Diffusion models can solve complex reasoning tasks better by having the language encoder think iteratively and update its guidance throughout the generation process, rather than encoding instructions once at the start.

This paper improves how diffusion models solve complex reasoning tasks by making the language model encoder think step-by-step. Instead of encoding instructions once, the system iteratively refines the model's internal reasoning and feeds it progressively to the image generation process, achieving 92% accuracy on spatial reasoning tasks like mazes and puzzles.

reasoningmultimodalarchitecture

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodaldataevaluation

Interpreting Contrastive Embeddings in Specific Domains with Fuzzy Rules

Mar 12, 2026

Javier Fumanal-Idocin, Mohammadreza Jamalifard, Javier Andreu-Perez

CLIP embeddings work well for general tasks, but you need domain-specific interpretation tools like fuzzy rules to understand and improve their performance on specialized text like medical or legal documents.

This paper shows how to interpret what CLIP embeddings learn in specific domains like medical records and film reviews. The researchers use fuzzy rules to map domain-specific features into CLIP's vector space, making it easier to understand which text features matter most for classification tasks in specialized fields.

multimodal

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Mar 12, 2026

Jingyang Ke, Weihan Li, Amartya Pradhan et al.

You can leverage pretrained vision-language models for specialized tasks like animal behavior analysis without fine-tuning—just guide them through explicit reasoning steps and let them work with minimal human labels.

BehaviorVLM uses vision-language models to automatically understand animal behavior and estimate body poses without requiring task-specific training or heavy manual labeling. It combines visual reasoning, temporal analysis, and semantic understanding to identify what animals are doing and where their body parts are, making behavioral neuroscience research more scalable and reproducible.

multimodalapplicationsreasoning

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Mar 12, 2026

Zexuan Yan, Jiarui Jin, Yue Ma et al.

You can improve any text-to-image model's ability to render complex text and formulas without retraining—just add an agentic workflow that guides the generation process using glyph templates.

GlyphBanana solves the problem of generating accurate text and mathematical formulas in images by using an agentic workflow that guides existing text-to-image models. Instead of retraining models, it injects glyph templates into the model's internal representations to iteratively improve text rendering quality.

agentsmultimodalapplications

Linking Perception, Confidence and Accuracy in MLLMs

Mar 12, 2026

Yuetian Du, Yucheng Wang, Rongyu Zhang et al.

Multimodal models suffer from severe confidence miscalibration; training them to be honest about uncertainty and using that uncertainty to trigger verification steps significantly improves both accuracy and reliability.

This paper identifies that multimodal AI models are overconfident—they don't reliably know when they're wrong. The authors propose a training method using image noise pairs and confidence-based rewards to fix this, plus a test-time strategy that uses the model's confidence to decide when to double-check answers. Results show 8.8% accuracy improvements across benchmarks.
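The test-time strategy amounts to gating extra compute on the model's own confidence. A minimal sketch, with `verify` as a hypothetical callable standing in for the paper's double-check step and `threshold` an assumed hyperparameter:

```python
def answer_with_verification(model_answer, verify, confidence, threshold=0.7):
    """Confidence-gated verification sketch: trust the model's first
    answer when its stated confidence is high; otherwise spend extra
    compute on a verification pass."""
    if confidence >= threshold:
        return model_answer
    return verify(model_answer)
```

This only pays off if the confidence signal is calibrated, which is exactly what the paper's noise-pair training with confidence-based rewards targets.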

evaluationtrainingmultimodal

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Feb 26, 2026

Simon Roschmann, Paul Krzakala, Sonia Mazelet et al.

You can align vision and language models with 10-100x less paired training data by leveraging unpaired images and text separately.

This paper shows how to align vision and language models using far fewer paired examples than current methods require. Instead of needing millions of image-text pairs, SOTAlign uses a small set of paired data plus lots of unpaired images and text, employing a technique called optimal transport to learn how the two models relate to each other.
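Optimal transport finds a soft matching between two unpaired sets by minimizing a transport cost. This is a generic sketch of the Sinkhorn machinery such methods build on, not SOTAlign itself; `reg` and the uniform marginals are assumptions:

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropic optimal transport via Sinkhorn iterations: given a cost
    matrix between (say) image and text embeddings, return a soft
    matching (transport plan) with uniform marginals."""
    n, m = cost.shape
    K = np.exp(-cost / reg)                 # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):                  # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # plan[i, j] = mass moved i -> j
```

Low-cost pairs receive most of the mass, which is how unpaired images and texts can be softly matched without explicit supervision.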

multimodaltrainingefficiency

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Feb 26, 2026

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

Using specialized experts for different modalities (speech vs. text), plus a cross-modal expert, yields stronger conversational emotion recognition than forcing one shared model to handle everything.

This paper presents MiSTER-E, a system that recognizes emotions in conversations by combining speech and text information. It uses separate AI experts for speech, text, and cross-modal analysis, then intelligently combines their predictions. The system works on real conversations without needing to know who's speaking, and achieves strong results on standard emotion recognition benchmarks.
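The "intelligently combines their predictions" step is a gated mixture. A hedged sketch of generic mixture-of-experts fusion (the gate weights and expert roles are illustrative, not MiSTER-E's actual fusion):

```python
import numpy as np

def combine_experts(expert_probs, gate_weights):
    """Mixture-of-experts fusion sketch: each expert (e.g. speech, text,
    cross-modal) emits a distribution over emotion classes, and a gating
    vector weights their votes into one fused distribution."""
    expert_probs = np.asarray(expert_probs, dtype=float)  # (E, C)
    w = np.asarray(gate_weights, dtype=float)
    w = w / w.sum()                                       # normalize the gate
    return w @ expert_probs                               # (C,) fused distribution
```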

multimodalarchitectureapplications

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Feb 26, 2026

Sungho Park, Jueun Kim, Wook-Shin Han

Current AI models struggle with real-world table-text reasoning; SPARTA exposes this gap with automatically generated, complex multi-hop questions that span both text and tables.

SPARTA is a benchmark for testing AI models on complex questions that require reasoning across both text and tables together.

evaluationreasoningmultimodal

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Feb 26, 2026

Hyungyung Lee, Hangyul Yoon, Edward Choi

AI medical diagnosis becomes more trustworthy when it shows its evidence instead of just giving answers.

This paper presents CXReasonAgent, a system that helps AI diagnose chest X-rays by combining a language model with specialized medical tools. Instead of just guessing answers like typical AI models, it shows its work by pointing to specific evidence in the image.

agentsmultimodalsafety

MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Feb 26, 2026

Yizhi Li, Xiaohan Chen, Miao Jiang et al.

Combining specialized tools with general AI models beats trying to do everything with one model—especially for long videos where context matters.

MovieTeller automatically creates summaries of full-length movies by breaking the task into stages and using face recognition to keep track of which character is which. Instead of retraining models, it combines existing tools (like face detection) with language models to generate accurate, coherent movie synopses that maintain character identity throughout.

multimodalapplications

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Feb 26, 2026

Junhu Fu, Shuyu Liang, Wutong Li et al.

Synthetic colonoscopy videos can now be generated with enough quality and control to help with doctor training and disease diagnosis in data-scarce medical settings.

ColoDiff generates realistic colonoscopy videos using AI to help doctors train and diagnose intestinal diseases when real patient data is limited. It uses a technique called diffusion to create videos with smooth motion and precise control over medical details like disease type and imaging quality.

multimodalapplicationsdata