ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 15 this month · 12 topics
Topics: Efficiency (35) · Reasoning (35) · Multimodal (28) · Applications (28) · Evaluation (27) · Training (26) · Architecture (24) · Agents (24) · Safety (13) · Scaling (5) · Data (5) · Alignment (1)

Mar 30 – Apr 5 (19 papers)

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Apr 2, 2026

Daiwei Chen, Zhoutong Fu, Chengming Jiang et al.

Token initialization is a critical bottleneck when extending language models with new vocabulary—grounding new tokens in semantically meaningful positions before fine-tuning substantially improves downstream task performance.

When language models add new vocabulary tokens for specific tasks like recommendation systems, they typically initialize them as averages of existing embeddings. This paper shows this approach fails because all new tokens collapse into the same subspace, losing their distinctiveness.

training · efficiency · applications
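The collapse is easy to reproduce in a toy sketch (the embedding table, sizes, and grounding heuristic below are all hypothetical): mean-initialization makes every new token identical, while seeding each token from a few semantically related embeddings keeps them distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))   # stand-in embedding table

# Naive approach: every new token starts at the mean of all existing
# embeddings, so the five new tokens are exactly the same vector.
naive_new = np.tile(vocab.mean(axis=0), (5, 1))

# Grounded approach (sketch): seed each new token from the embeddings of a
# few related existing tokens, so new tokens stay distinct from one another.
related = [rng.choice(1000, size=8, replace=False) for _ in range(5)]
grounded_new = np.stack([vocab[idx].mean(axis=0) for idx in related])

print(np.allclose(naive_new[0], naive_new[1]))        # True: collapsed
print(np.allclose(grounded_new[0], grounded_new[1]))  # False: distinct
```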

No Single Best Model for Diversity: Learning a Router for Sample Diversity

Apr 2, 2026

Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar et al.

When you need diverse answers to open-ended questions, routing to the best model per query beats using any single model—and you can train a lightweight router to make this selection automatically.

This paper shows that different language models excel at generating diverse answers to open-ended questions, and no single model is best for all prompts. The authors build a router—a small model that predicts which LLM to use for each question—to dynamically select the best model.
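As a rough illustration of the routing idea (the features, prompts, labels, and model names are all invented, not the paper's setup), even a tiny logistic regression can learn to send each prompt to whichever of two models is labeled as more diverse for it:

```python
import numpy as np

def features(prompt):
    # Toy prompt features: word count, question-mark count, bias term.
    return np.array([len(prompt.split()), prompt.count("?"), 1.0])

# Synthetic labels: 1 means "model_b gave more diverse samples for this prompt".
prompts = ["list ideas", "what is 2+2?", "brainstorm names for a cafe",
           "capital of France?", "suggest story plots", "define entropy?"]
labels = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

X = np.stack([features(p) for p in prompts])
w = np.zeros(3)
for _ in range(500):                      # plain logistic regression via GD
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - labels) / len(labels)

def route(prompt):
    return "model_b" if features(prompt) @ w > 0 else "model_a"

print(route("brainstorm gift ideas"))     # routes to model_b
```

The real router conditions on much richer prompt representations, but the decision it makes per query has this same shape.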

Mar 23 – Mar 29 (15 papers)

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Mar 26, 2026

Zehao Wang, Huaide Jiang, Shuaiwu Dong et al.

Autonomous driving systems can be personalized to match individual driver styles by learning user embeddings from driving data and conditioning the driving policy on these embeddings, enabling more human-centered autonomous vehicles.

This paper presents Drive My Way, a personalized autonomous driving system that learns individual driver preferences and adapts to real-time instructions.

multimodal · agents · applications

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Mar 26, 2026

Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri et al.

General-purpose coding agents can discover hardware optimization patterns automatically by working at scale—using multiple agents to explore different optimization strategies yields significant speedups without domain-specific training.

This paper shows that general-purpose AI coding agents can optimize hardware designs without specialized training. The approach uses multiple agents working together: first decomposing designs into smaller pieces and optimizing each, then launching additional agents to find cross-function improvements.

Mar 16 – Mar 22 (18 papers)

AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Mar 20, 2026

Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak et al.

AI agents are ready to automate the repetitive technical work in experimental physics, letting researchers focus on novel insights and validation rather than coding routine analyses.

AI agents can now autonomously run physics experiments end-to-end, from data analysis to paper writing. Researchers showed that Claude can handle all stages of high-energy physics analysis—selecting events, estimating backgrounds, calculating uncertainties, and drawing conclusions—using only a dataset, code tools, and access to prior research papers.

agents · applications · reasoning

Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case

Mar 20, 2026

H. Sinan Bank, Daniel R. Herber, Thomas H. Bradley

Specification-driven design workflows can extend beyond software to physical engineering systems, enabling better human-AI collaboration by making design decisions explicit and auditable rather than ad hoc.

Design-OS is a structured workflow that helps engineers design physical systems (like control systems) by making requirements explicit and maintaining traceability from intent to final design. It organizes design into five stages with specifications as a shared contract between humans and AI agents, demonstrated on two different inverted pendulum platforms.

Mar 9 – Mar 15 (9 papers)

PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Mar 13, 2026

Yangsong Zhang, Anujith Muraleedharan, Rikhat Akizhanov et al.

By optimizing diffusion models with physics-aware rewards during training, you can generate robot motions that are both realistic and executable on real hardware without post-hoc corrections.

This paper improves AI-generated humanoid robot motions by using preference optimization to make them physically realistic. Instead of manually tweaking physics penalties, the method integrates a physics controller directly into training, teaching the motion model to generate movements that work well when converted to real robot commands.

training · reasoning · applications

Developing and evaluating a chatbot to support maternal health care

Mar 13, 2026

Smriti Jha, Vidhi Jain, Jianyu Xu et al.

Deploying medical chatbots in low-resource, multilingual settings requires multiple layers of safety (triage, retrieval, generation) and multi-method evaluation—no single model or test is sufficient for trustworthy healthcare AI.

Researchers built a phone-based chatbot to answer maternal health questions in India, where users often have limited health literacy and speak multiple languages. The system combines triage (routing urgent cases to experts), retrieval of curated health guidelines, and AI-generated responses.
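A minimal sketch of that layering (the keyword list, retriever, and generator below are stand-ins, not the paper's components):

```python
# Triage first, then retrieval-grounded generation (all pieces are stubs).
URGENT = ("bleeding", "severe pain", "unconscious")

def retrieve(question):                    # stand-in for guideline retrieval
    return "curated-guideline-snippet"

def generate(question, context):           # stand-in for the LLM layer
    return f"answer grounded in {context}"

def answer(question):
    q = question.lower()
    if any(k in q for k in URGENT):        # triage layer: escalate emergencies
        return "escalate-to-expert"
    return generate(question, retrieve(question))

print(answer("I have severe pain"))        # escalate-to-expert
```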

Feb 23 – Mar 1 (20 papers)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Feb 27, 2026

Fan Shu, Yite Wang, Ruofan Wu et al.

LLMs need specialized training data to reliably follow data science workflows; fine-tuning on task-specific benchmarks can improve performance by 8x.

DARE-bench is a benchmark for testing how well AI models can follow data science instructions and complete multi-step ML tasks. It includes 6,300 real Kaggle tasks with verifiable correct answers, making evaluation objective rather than relying on human judges.

evaluation · training · applications

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Feb 27, 2026

Weinan Dai, Hanlin Wu, Qiying Yu et al.

Reinforcement learning can teach AI models to write genuinely optimized GPU code, not just syntactically correct code.

This paper trains an AI agent to write optimized GPU code (CUDA kernels) using reinforcement learning. The system learns from trial-and-error feedback about code performance, achieving faster execution than existing tools like PyTorch's compiler and outperforming top commercial AI models on benchmark tests.

evaluation · applications

VOID: Video Object and Interaction Deletion

Apr 2, 2026

Saman Motamed, William Harvey, Benjamin Klein et al.

Video editing can be improved by treating it as a physics simulation problem: identify what changes when an object is removed, then use diffusion models guided by causal reasoning to generate realistic results.

VOID removes objects from videos while maintaining realistic physics—like correcting how other objects move or collide after removal. It uses a vision-language model to identify affected regions and a diffusion model to generate physically plausible outcomes, trained on synthetic data where physics interactions are carefully controlled.

multimodal · applications · reasoning

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Apr 2, 2026

Andrew Ang, Nazym Azimbayev, Andrey Kim

Agentic AI can shift institutional investing from human execution to human oversight, with autonomous agents handling forecasting, portfolio construction, and self-improvement while staying constrained by policy documents.

This paper demonstrates how AI agents can autonomously manage investment portfolios by having specialized agents forecast market conditions, build portfolios using multiple methods, and critique each other's work—all governed by an Investment Policy Statement that ensures alignment with institutional goals.

agents · applications · reasoning

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Apr 2, 2026

Keerat Guliani, Deepkamal Gill, David Landsman et al.

LLMs can extract structured regulatory rules from legal documents through iterative self-evaluation and repair, achieving 84% preference over prior methods in downstream compliance tasks without human annotation.

De Jure automatically extracts legally binding rules from regulatory documents using LLMs and iterative self-refinement. It converts dense legal text into machine-readable rules through document normalization, semantic decomposition, multi-criteria evaluation, and repair cycles—without requiring human annotation or domain expertise.

applications · reasoning · evaluation

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Apr 2, 2026

Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić et al.

By combining efficient tokenization with geometry-aware attention, you can build crystal generation models that are both faster and more accurate than complex graph neural networks, making generative modeling of materials more practical.

Crystalite is a lightweight diffusion Transformer for generating crystal structures that uses two key innovations: a compact atom representation called Subatomic Tokenization and a Geometry Enhancement Module that encodes crystal geometry directly into the model's attention mechanism.

architecture · efficiency · applications

Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

Apr 2, 2026

Tina J. Jat, T. Ghosh, Karthik Suresh

RAG systems can be deployed locally with open-source models to answer domain-specific technical questions while maintaining data privacy and reducing costs compared to cloud-based alternatives.

Researchers built a question-answering system for nuclear physics using retrieval-augmented generation (RAG) with a local LLaMA model and arXiv articles about the Electron-Ion Collider experiment. This approach keeps sensitive scientific data private while providing a cost-effective alternative to cloud-based solutions.

applications · reasoning

Generative AI Spotlights the Human Core of Data Science: Implications for Education

Apr 2, 2026

Nathan Taback

As AI handles data cleaning, modeling, and reporting, data science education must prioritize teaching human reasoning, problem formulation, and ethical judgment—skills that AI cannot replace.

This paper argues that generative AI automates routine data science tasks but reveals that the most valuable skills remain fundamentally human: problem formulation, causal reasoning, ethics, and judgment. The author proposes that data science education should focus on these irreducibly human competencies while teaching students to work effectively with AI tools.

training · applications

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodal · applications · evaluation

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applications · agents · multimodal

LLM REgression with a Latent Iterative State Head

Apr 1, 2026

Yiheng Su, Matthew Lease

You can make LLMs predict continuous numeric values more efficiently by adding a tiny learned head that works with frozen representations, rather than decoding text or fine-tuning the entire model.

RELISH is a lightweight method for making LLMs predict numeric values directly from their internal representations. Instead of generating numbers as text, it uses a small learned component that iteratively refines a latent state through attention over token representations, then outputs a single number. It outperforms existing approaches while adding minimal parameters (0.01-0.04% overhead).

architecture · efficiency · applications
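The head's structure can be sketched with random, untrained weights (dimensions and the update rule are illustrative; RELISH's exact parameterization may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(12, 32))       # frozen token representations (seq, dim)

q = rng.normal(size=32)             # latent state to be refined
W = rng.normal(size=(32, 32)) * 0.1
w_out = rng.normal(size=32) * 0.1   # tiny head; random stand-ins for learned weights

for _ in range(3):                  # iterative refinement steps
    scores = H @ q                  # attend over token representations
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    q = np.tanh(W @ (attn @ H))     # update latent state from attended context

prediction = float(w_out @ q)       # a single continuous output, not text
print(np.isfinite(prediction))      # True
```

Only `q`, `W`, and `w_out` would be trained, which is why the parameter overhead stays tiny relative to the frozen LLM.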

Embarrassingly Simple Self-Distillation Improves Code Generation

Apr 1, 2026

Ruixiang Zhang, Richard He Bai, Huangjie Zheng et al.

You can improve code generation by sampling from your model's own outputs and fine-tuning on them—no external tools needed. The gains come from balancing precision (removing bad options) with exploration (keeping useful diversity).

A simple technique called self-distillation improves code generation in large language models by having them sample their own outputs and fine-tune on those samples. The method boosts performance significantly (42.4% to 55.3% on benchmarks) without needing external verifiers or teacher models, and works across different model sizes and architectures.

training · efficiency · applications
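The loop really is this simple in outline (the sampler and selection rule below are stand-ins; the actual method samples from and scores with the model itself):

```python
def sample(model, prompt, k):
    # Stand-in sampler; in practice these come from the model's own outputs.
    return [f"{prompt}-candidate{i}" for i in range(k)]

def select(candidates, keep):
    # Precision vs. exploration: drop duplicates, keep a few varied survivors.
    unique = list(dict.fromkeys(candidates))
    return unique[:keep]

prompts = ["p1", "p2"]
dataset = []
for p in prompts:
    dataset.extend((p, c) for c in select(sample(None, p, k=4), keep=2))

# `dataset` now holds (prompt, completion) pairs for a fine-tuning pass.
print(len(dataset))  # 4
```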

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models struggle inconsistently with detecting misleading visualizations; their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluation · multimodal · applications

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Apr 1, 2026

J. E. Domínguez-Vidal

Florence-2 can now be easily integrated into robot software stacks through a standardized ROS 2 wrapper, enabling local vision-language inference on consumer GPUs without cloud dependencies.

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applications · multimodal · efficiency

NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Apr 1, 2026

Prasanjit Dey, Soumyabrata Dev, Angela Meyer et al.

Hybrid physics-neural models can achieve better accuracy and uncertainty calibration than pure data-driven or physics-based approaches alone, especially for spatiotemporal forecasting with known physical constraints.

NeuroDDAF combines physics-informed modeling with neural networks to forecast air quality by integrating wind-driven transport equations, graph attention for spatial patterns, and uncertainty quantification. It outperforms existing methods on urban datasets while providing reliable confidence estimates for predictions.

reasoning · multimodal · applications

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Mar 30, 2026

Anuj Diwan, Eunsol Choi, David Harwath

Specialized models for different types of speech style (speaker traits vs. utterance characteristics) outperform single unified models on individual tasks, but a combined model works better when styles need to be understood together.

ParaSpeechCLAP is a dual-encoder model that learns to match speech audio with text descriptions of speaking style (like pitch, emotion, and texture). It maps both modalities into a shared embedding space, enabling applications like finding similar-sounding speech, classifying speaker characteristics, and improving text-to-speech synthesis without retraining.

multimodal · applications

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

If you're building AI systems, standard software architecture documentation won't capture ML-specific risks like model drift or data dependencies—RAD-AI provides a structured way to document these for both compliance and team understanding.

RAD-AI extends existing architecture documentation frameworks (arc42 and C4 model) to handle AI systems, adding sections for probabilistic behavior, ML lifecycles, and data dependencies. It maps to EU AI Act compliance requirements and shows 93% coverage of regulatory documentation needs versus 36% for standard frameworks.

architecture · safety · applications

See it to Place it: Evolving Macro Placements with Vision-Language Models

Mar 30, 2026

Ikechukwu Uchendu, Swati Goel, Karly Hou et al.

Foundation models trained on visual reasoning can solve specialized engineering problems like chip design without fine-tuning, by framing physical constraints as spatial reasoning tasks.

This paper uses Vision-Language Models to improve chip floorplanning—arranging components on a chip to minimize wiring. The approach, called VeoPlace, treats the chip layout as a visual problem, letting a VLM suggest component placements without any training, then iteratively refines these suggestions. It outperforms existing machine learning methods by up to 32% on standard benchmarks.

applications · reasoning · multimodal

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

LLMs can serve as runtime architectural components to solve schema interoperability problems dynamically, but code generation strategies outperform direct transformation and cost varies dramatically across models without matching accuracy gains.

SAGAI-MID is a middleware system that uses LLMs to automatically fix schema mismatches between different services and APIs at runtime, eliminating the need for manual adapter code. It combines structural analysis with LLM reasoning and includes safety checks to handle real-world integration challenges across REST, GraphQL, and IoT systems.

architecture · agents · applications

Comparing Developer and LLM Biases in Code Evaluation

Mar 25, 2026

Aditya Mittal, Ryan Shar, Zichu Wu et al.

LLMs used as code judges have significant blind spots compared to human developers—they systematically misweight code quality factors like explanation length, meaning you can't rely on them alone for code evaluation in real applications.

This paper introduces TRACE, a framework that compares how LLM judges evaluate code against human developer preferences.

evaluation · applications

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Mar 25, 2026

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur et al.

Better retrieval doesn't guarantee better RAG answers: improving individual components can paradoxically increase confident hallucinations when relevant information isn't in your corpus.

This paper studies retrieval-augmented generation (RAG) systems for answering questions about AI policy documents. The researchers found that improving retrieval quality doesn't always lead to better answers—sometimes better retrieval actually makes the system more confidently wrong when relevant documents are missing.

evaluation · applications

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Mar 25, 2026

Qijia He, Xunmei Liu, Hammaad Memon et al.

You can now automatically convert flat images of technical figures into editable, scalable vector graphics—matching GPT-5.2 performance—enabling recovery of lost design source files without manual reconstruction.

VFIG converts rasterized images (PNG, JPEG) of technical diagrams back into editable SVG vector graphics using vision-language models. The team created a 66K dataset of figure-SVG pairs and a two-stage training approach (supervised learning for basic shapes, then reinforcement learning for refinement) to reconstruct complex professional diagrams with high fidelity.

multimodal · training · applications

Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Mar 25, 2026

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu et al.

Foundation models can effectively predict clinical outcomes from EHR data, but scaling model size alone doesn't improve performance—you need proportionally more training data, and careful handling of repeated events is critical to avoid inflated evaluation metrics.

RAVEN is a foundation model trained on electronic health records (EHRs) from over one million patients to predict what clinical events will happen at a patient's next visit.

applications · scaling

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Mar 25, 2026

Samuel Taiwo, Mohd Amaluddin Yusoff

For enterprise RAG systems with structured documents, preserve document structure when chunking—it improves retrieval quality and reduces costs, but you'll need multimodal AI to handle diagrams and visual content.

This paper tests four different ways to split documents into chunks for RAG systems using oil and gas industry documents. Structure-aware chunking (which respects document layout) works best and costs less than other methods, but all approaches struggle with diagrams and visual content.

evaluation · applications
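A minimal version of structure-aware chunking for heading-delimited text (the document and splitting rule are illustrative, not the paper's pipeline):

```python
import re

DOC = """# Safety
Wear protective gear.
## Valves
Check valve pressure daily.
# Drilling
Log depth every shift."""

# Split at heading boundaries so each chunk keeps its section context,
# instead of cutting at an arbitrary fixed character count.
def chunk_by_headings(text):
    parts = re.split(r"(?m)^(?=#)", text)
    return [p.strip() for p in parts if p.strip()]

chunks = chunk_by_headings(DOC)
print(len(chunks))                      # 3 heading-delimited chunks
print(chunks[0].startswith("# Safety")) # True: heading stays with its body
```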

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Mar 24, 2026

Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

Using structured prompting formats (PEGS) with multiple LLM providers significantly improves requirements extraction accuracy (F1: 0.88 vs 0.71) and provides built-in reliability through model consensus and fallback mechanisms.

ReqFusion automates software requirements extraction and classification by combining multiple LLM providers (GPT, Claude, Groq) with a structured PEGS format prompt. The system processes various document types and achieves 88% accuracy, reducing manual analysis time by 78% while ensuring consistent requirement categorization across academic, industrial, and business contexts.

applications · evaluation

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Mar 24, 2026

Haoran Yuan, Weigang Yi, Zhenyu Zhang et al.

Adding tactile (touch) sensing to video-based robot learning models significantly improves performance on tasks requiring precise force control and contact awareness, without needing separate tactile pretraining.

This paper introduces VTAM, a robot learning system that combines video and touch (tactile) sensing to better understand and perform complex physical tasks.

multimodal · applications

Code Review Agent Benchmark

Mar 24, 2026

Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf et al.

Code review agents currently miss most issues that human reviewers catch, but they often flag different problems—creating opportunities for AI-assisted rather than AI-automated code review in real teams.

This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether code review agents catch the same issues humans do.

evaluation · agents · applications

3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Mar 24, 2026

Yiping Chen, Jinpeng Li, Wenyu Ke et al.

This work shows how to scale vision-language models from room-sized scenes to entire cities by handling 3D spatial relationships and introducing a large, quality-controlled urban dataset—essential for building AI systems that understand real-world spatial reasoning.

3DCity-LLM extends multimodal AI models to understand entire city-scale 3D environments, not just individual objects. The system uses a three-part approach to analyze objects, their relationships, and overall scenes, trained on a new dataset of 1.2 million urban scenarios covering tasks from object identification to city planning.

multimodal · applications

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Mar 23, 2026

Haoyu Zhen, Xiaolong Li, Yilin Zhao et al.

Structured reasoning over scene graphs helps language models understand and manipulate spatial relationships more reliably than end-to-end approaches, improving layout editing accuracy by 15-20% over baseline methods.

This paper teaches AI models to edit 3D room layouts based on text instructions by having them reason through scene graphs—structured representations of objects and their spatial relationships. Instead of directly generating new layouts, the model updates a graph representation step-by-step, which helps it maintain spatial consistency and understand how objects relate to each other.

reasoning · multimodal · applications

TiCo: Time-Controllable Training for Spoken Dialogue Models

Mar 23, 2026

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.

Spoken dialogue models can now follow duration constraints (e.g., 'respond in 15 seconds') by inserting time markers during generation, making them more practical for real-world voice applications.

TiCo is a post-training method that teaches spoken dialogue models to generate responses with specific durations. It uses time markers during generation to help models track elapsed speaking time and adjust content to meet target lengths, improving real-world voice assistant interactions without requiring new training data.

training · applications · agents
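One way to picture the mechanism (the marker format and timing constants below are assumptions, not TiCo's actual scheme): interleave elapsed-time markers into the token stream so the model can compare elapsed speaking time against the target duration.

```python
def with_time_markers(tokens, sec_per_token=0.3, every=5):
    # Insert an elapsed-time marker after every few tokens (format assumed).
    out = []
    for i, tok in enumerate(tokens, 1):
        out.append(tok)
        if i % every == 0:
            out.append(f"<t={i * sec_per_token:.1f}s>")
    return out

stream = with_time_markers(["hi"] * 12)
print(stream[5])    # <t=1.5s>
```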

Characterizing High-Capacity Janus Aminobenzene-Graphene Anode for Sodium-Ion Batteries with Machine Learning

Mar 23, 2026

Claudia Islas-Vargas, L. Ricardo Montoya, Carlos A. Vital-José et al.

Machine learning force fields can accelerate discovery of battery materials by accurately predicting how ions move and store in complex structures, reducing reliance on expensive experiments.

Researchers used machine learning force fields and quantum simulations to design and test a new anode material for sodium-ion batteries made from graphene with amino groups attached. The material shows promising properties: high storage capacity (~400 mAh/g), very fast ion movement, and minimal swelling—making it a strong candidate for practical battery applications.

applications · reasoning

One Model, Two Markets: Bid-Aware Generative Recommendation

Mar 23, 2026

Yanchen Jiang, Zhe Feng, Christopher P. Mah et al.

You can build recommendation systems that serve both users and business goals by treating ad placement as part of the generation process, letting bids influence which items appear at inference time rather than requiring model retraining.

This paper presents GEM-Rec, a recommendation system that balances user satisfaction with platform revenue by integrating ads and bids directly into generative models. Using special control tokens and a bid-aware decoding method, the system learns when to show ads from real user behavior and adjusts which ads appear based on real-time pricing, without needing to retrain the model.

applications · training
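The decoding-time idea can be sketched in a few lines (the scores, bids, and trade-off knob are invented numbers; GEM-Rec's actual decoding is more involved):

```python
import numpy as np

logits = np.array([2.0, 1.5, 0.5])   # recommender's scores for items A, B, C
bids = np.array([0.0, 0.0, 2.0])     # advertiser bids; item C is sponsored
alpha = 1.0                          # revenue vs. relevance trade-off knob

# Shift item logits by the bid term at inference time; no retraining needed.
adjusted = logits + alpha * bids
probs = np.exp(adjusted - adjusted.max())
probs /= probs.sum()

print(int(np.argmax(adjusted)))      # 2: sponsored item C now ranks first
```

Setting `alpha = 0` recovers the purely organic ranking, which is what makes the trade-off tunable at serving time.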

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Mar 19, 2026

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan et al.

LLMs can reason about financial fundamentals with retrieval help, but struggle significantly with trading signals and time-series patterns—a critical gap for real-world financial decision-making.

FinTradeBench is a benchmark with 1,400 questions testing how well AI models reason about financial decisions by combining company fundamentals (from financial reports) and trading signals (from stock price patterns). The benchmark reveals that current AI models struggle with numerical reasoning and time-series data, even when given access to relevant information.

evaluation · reasoning · applications

DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Mar 19, 2026

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.

A single tokenizer can efficiently represent multi-view driving scenes in a way that works for both reconstruction tasks (RGB, depth) and understanding tasks (segmentation, 3D occupancy), making it practical for vision-language-action models in autonomous vehicles.

DriveTok creates a unified tokenizer for autonomous driving that converts multi-view camera images into compact 3D scene tokens. Unlike existing tokenizers designed for single images, it handles multiple camera views efficiently while preserving semantic, geometric, and depth information—enabling better reconstruction and understanding of driving scenes.

multimodal · architecture · applications

cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Mar 19, 2026

Yuyang Liu

GPU acceleration can make general-purpose optimization solvers orders of magnitude faster than traditional solvers, while remaining flexible enough for domain-specific customization through a Python interface.

cuGenOpt is a GPU-accelerated framework for solving combinatorial optimization problems (like routing and scheduling) that balances generality, speed, and ease of use. It uses CUDA to run multiple solution attempts in parallel, lets experts add custom solvers, and includes an AI assistant that converts plain-English problem descriptions into working code.

efficiency · applications

SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data

Mar 19, 2026

Mingxing Zhang, Nicola Rossberg, Simone Innocente et al.

For spectroscopy and similar high-dimensional data, combining PCA with SHAP explanations lets you understand model decisions in terms of the original measurements—critical for clinical adoption where trust and interpretability matter.

SHAPCA combines dimensionality reduction and explainability techniques to make machine learning predictions on spectroscopy data interpretable and trustworthy. It maps explanations back to the original spectral bands rather than abstract features, helping clinicians and researchers understand why models make specific predictions on high-dimensional, correlated data.

evaluation · applications
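The mapping step can be sketched with a hand-rolled PCA and stand-in attributions (real SHAP values would come from the shap library applied to a model trained on the PC scores):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))      # spectra: 200 samples, 50 bands

# PCA by hand: center the data and take top singular vectors as loadings.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:5]                          # loadings: (5 components, 50 bands)

# Stand-in per-component attributions for one prediction (not real SHAP values).
phi_pc = np.array([0.8, -0.3, 0.1, 0.0, 0.05])

# Map attributions back through the loadings so each original spectral band
# gets an interpretable contribution.
phi_bands = phi_pc @ V
print(phi_bands.shape)              # (50,)
```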

Implicit Patterns in LLM-Based Binary Analysis

Mar 19, 2026

Qiang Li, XiangRui Zhang, Haining Wang

LLM-based binary analysis isn't random exploration—models implicitly develop structured reasoning patterns that organize their search process, which can be measured and potentially improved for more reliable vulnerability detection.

This paper analyzes how large language models perform binary vulnerability analysis across hundreds of reasoning steps. Researchers studied 521 binaries and discovered that LLMs implicitly develop four structured patterns—early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization—that organize their exploration without explicit programming.

reasoningevaluationapplications

Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

Mar 19, 2026

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

Treating all market conditions the same hurts prediction accuracy; this framework learns to detect regime shifts automatically and uses specialized models for each, improving performance especially during volatile periods without requiring manual market labeling.

This paper presents an adaptive stock price prediction system that automatically detects market regime changes (stable vs. volatile periods) and routes data through specialized prediction models.

architectureapplications
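The regime-routing idea can be sketched independently of the paper's transformer and RL machinery: estimate volatility from recent returns and dispatch each window to a specialized predictor. A toy sketch with an assumed threshold and placeholder predictors:

```python
import numpy as np

# Toy regime routing: classify a price window as stable or volatile
# from the standard deviation of its returns, then dispatch to a
# regime-specific predictor. Threshold and predictors are assumptions.
def route(window, threshold=0.02):
    returns = np.diff(window) / window[:-1]
    return "volatile" if returns.std() > threshold else "stable"

predictors = {
    "stable":   lambda w: w[-1],                   # persistence forecast
    "volatile": lambda w: float(np.mean(w[-3:])),  # smoothed forecast
}

stable_prices   = np.array([100.0, 100.1, 100.0, 100.2, 100.1])
volatile_prices = np.array([100.0, 104.0, 97.0, 105.0, 96.0])

for window in (stable_prices, volatile_prices):
    regime = route(window)
    print(regime, predictors[regime](window))
```

The paper replaces this hand-set threshold with learned regime detection, but the dispatch structure is the same.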

From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Mar 19, 2026

Zhuofan Li, Hongkun Yang, Zhenyang Chen et al.

When building embodied AI systems, measure what actually matters: task completion time, motion quality, and energy use—not just model size or inference speed. Optimizing the wrong metrics can make robots perform worse in practice.

This paper shows that traditional efficiency metrics (parameters, computation) for vision-language-action robots don't match real-world performance. The researchers measured actual robotic execution—task time, motion smoothness, energy use—and found that methods optimizing for conventional metrics often make robots move worse or take longer, even when task success stays the same.

efficiencyevaluationapplications

CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Mar 19, 2026

Weilin Chen, Jiahao Rao, Wenhao Wang et al.

Reference-image-driven texturing with instance-level control produces sharper, more artifact-free 3D scene textures than text-based approaches, making it practical for professional 3D scene editing.

CustomTex generates high-quality textures for 3D indoor scenes by taking reference images and applying them to specific objects. Unlike text-based methods, it uses a dual-distillation approach to ensure textures match reference images precisely while maintaining visual quality and avoiding artifacts.

multimodalapplications

Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Mar 19, 2026

Qiawen Ella Liu, Marina Dubova, Henry Conklin et al.

LLMs are already highly creative at generating novel ideas, but they don't benefit from the same creative prompting techniques that help humans think outside the box through forced analogies.

Researchers tested whether cross-domain mapping—forcing creators to draw inspiration from random, unrelated sources—boosts creativity in both humans and LLMs. Humans benefited significantly from this technique, but LLMs showed no consistent improvement, though both systems generated more creative ideas when the source domain was more distant from the target.

evaluationreasoningapplications

A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Mar 19, 2026

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.

Automated health literacy detection from clinical notes is now possible with HEALIX, a curated dataset that could help clinicians identify patients needing extra support without adding screening burden.

Researchers created HEALIX, the first public dataset of 589 clinical notes annotated for patient health literacy levels (low, normal, high). Health literacy—a patient's ability to understand medical information—affects treatment outcomes, but current screening tools are impractical.

dataapplicationsevaluation

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Mar 18, 2026

Zhang Zhang, Shuqi Lu, Hongjin Qian et al.

Instead of storing agent experiences as text, storing them as executable code lets agents reuse and improve solutions reliably across different tasks and systems.

AgentFactory is a framework that helps AI agents learn and improve by saving successful task solutions as reusable Python code (subagents) rather than just text descriptions. These saved subagents get refined over time based on how well they work, creating a growing library that makes future similar tasks easier to solve without human help.

agentstrainingapplications
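The difference between storing experience as text and as code can be sketched with a toy registry of callable subagents. All names here are hypothetical, not the paper's API:

```python
# Hypothetical sketch: store successful task solutions as executable
# functions ("subagents") keyed by task type, instead of text notes.
subagents = {}

def save_subagent(task_type, fn):
    """Store a working solution for reuse on similar future tasks."""
    subagents[task_type] = fn

def solve(task_type, *args):
    """Reuse a stored subagent if one exists; otherwise signal a miss."""
    if task_type in subagents:
        return subagents[task_type](*args)
    raise KeyError(f"no subagent for {task_type!r}; solve from scratch")

# An agent succeeds at a column-summing task once, then saves the solution.
save_subagent("sum_column", lambda rows, col: sum(r[col] for r in rows))

rows = [{"price": 3}, {"price": 4}]
print(solve("sum_column", rows, "price"))  # 7, reused without re-deriving
```

Because the stored artifact is executable, reuse is deterministic; a text description would have to be re-interpreted by the model each time.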

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Mar 17, 2026

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati et al.

For robotics and animation applications, reconstructing cluttered scenes requires not just identifying individual 3D objects but ensuring they physically interact correctly—this work provides both a benchmark dataset and a method that achieves this.

This paper tackles 3D scene reconstruction from single images by introducing MessyKitchens, a dataset of cluttered real-world kitchen scenes with precise object shapes, poses, and contact information.

evaluationmultimodalapplications

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Mar 17, 2026

Kaixuan Wang, Tianxing Chen, Jiawei Liu et al.

Having diverse, high-quality 3D assets at scale dramatically improves robot learning in simulation—this dataset removes a major bottleneck for scaling robotic manipulation training.

ManiTwin is an automated pipeline that converts single images into simulation-ready 3D digital objects for robot training. The team created ManiTwin-100K, a dataset of 100,000 annotated 3D assets with physical properties and manipulation instructions, enabling large-scale generation of robot training data in simulation.

dataapplicationstraining

SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Mar 17, 2026

Jiongze Yu, Xiangbo Gao, Pooja Verlani et al.

Interactive video processing is now practical: users can control AI video enhancement by editing sparse keyframes, and the system intelligently propagates those edits across the full video sequence.

SparkVSR lets users interactively improve low-quality videos by editing a few keyframes, then automatically applies those improvements across the entire video. Instead of treating video enhancement as a black box, users can manually fix specific frames and the system propagates those corrections while keeping the video grounded in the original motion.

multimodalapplicationsefficiency

Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Mar 17, 2026

Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler et al.

Incorporating incident severity signals and dynamic road relationships into spatio-temporal models significantly improves long-horizon traffic predictions with calibrated confidence intervals—practical for real-world transportation planning.

This paper improves traffic forecasting by using a Transformer model that understands both spatial patterns (how traffic flows across roads) and temporal patterns (how it changes over time), while accounting for incidents like crashes.

reasoningevaluationapplications
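The "calibrated confidence intervals" here come from conformal prediction. The generic split-conformal recipe, shown with toy residuals and independent of the paper's transformer, looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: a point forecaster with held-out calibration residuals.
y_cal = rng.normal(50, 10, size=500)           # observed traffic speeds
pred_cal = y_cal + rng.normal(0, 5, size=500)  # model predictions

# Split conformal: take a finite-sample-corrected (1 - alpha) quantile
# of the absolute residuals on the calibration set.
alpha = 0.1
scores = np.abs(y_cal - pred_cal)
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

# A new point forecast gets an interval with ~90% coverage guarantee.
pred_new = 47.0
interval = (pred_new - q, pred_new + q)
print(interval)
```

The paper's contribution is making the underlying forecaster incident-aware; the conformal wrapper is what turns its point forecasts into calibrated intervals.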

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Mar 16, 2026

Pengjun Fang, Yingqing He, Yazhou Xing et al.

Using audio examples as conditioning signals instead of text prompts gives you finer control over sound synthesis and avoids the ambiguity problems that come with describing acoustic details in words.

AC-Foley generates realistic sound effects for videos by using reference audio as a guide instead of text descriptions. This solves the problem of text being too vague to describe subtle acoustic details, enabling precise control over sound timbre and quality while supporting zero-shot generation of new sounds.

multimodalapplications

Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Mar 12, 2026

Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur et al.

LLMs can augment creative scientific reasoning by treating interdisciplinary research as a structured exploration process: decompose goals into questions, find analogous problems in other fields, then synthesize insights back into your domain.

Idea-Catalyst is a framework that helps researchers and AI systems discover creative interdisciplinary insights by systematically connecting research challenges across different fields.

reasoningapplicationsagents

Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing

Mar 12, 2026

Pavel Surynek

Running multiple solving strategies in parallel on standard CPUs can solve complex packing problems better than single strategies—a practical way to use modern computing power for real manufacturing optimization.

This paper shows how to pack and schedule objects more efficiently for 3D printing by running multiple arrangement strategies in parallel on modern multi-core computers. Instead of using one fixed strategy, the system tries different approaches (like placing objects toward corners vs. centers) simultaneously and picks the best result, reducing the number of printing plates needed.

applications
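The portfolio idea can be illustrated on plain 1-D bin packing: run several item orderings and keep whichever uses the fewest bins (plates, in the paper's setting). A toy sketch, not the paper's CEGAR-based solver:

```python
# Toy strategy portfolio: try several first-fit orderings and keep
# the packing that uses the fewest bins.
def first_fit(items, capacity):
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

def portfolio_pack(items, capacity):
    strategies = {
        "as_given":   list(items),
        "decreasing": sorted(items, reverse=True),
        "increasing": sorted(items),
    }
    results = {name: first_fit(order, capacity)
               for name, order in strategies.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]

name, packing = portfolio_pack([4, 4, 4, 6, 6, 6], capacity=10)
print(name, len(packing))  # decreasing 3
```

In a real portfolio the strategies run concurrently on separate cores; here they run sequentially, but the winner-selection logic is the same.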

WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

Mar 12, 2026

Taylor Paul, William Regli

Automated planning can solve the joint problem of designing distributed data pipelines and scheduling them on real infrastructure, enabling users to specify workflows declaratively rather than imperatively.

This paper introduces WORKSWORLD, a planning domain for automatically designing and scheduling data pipelines across distributed computer systems. Instead of manually specifying how data flows between processing components, users describe their data sources, available tools, and desired outputs—and an AI planner figures out the optimal workflow and resource allocation.

reasoningagentsapplications
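The declarative idea can be sketched as a search over tool applications: given tools as input-set/output pairs, a planner finds a chain from the available sources to the desired output. A toy breadth-first planner with hypothetical tool names, far simpler than a real numeric planner:

```python
from collections import deque

# Toy declarative pipeline planner: tools map a set of required
# inputs to one produced artifact.
tools = {
    "parse": ({"raw_logs"}, "records"),
    "clean": ({"records"}, "clean_records"),
    "agg":   ({"clean_records"}, "daily_report"),
}

def plan(sources, goal):
    """Breadth-first search over tool applications."""
    queue = deque([(frozenset(sources), [])])
    seen = {frozenset(sources)}
    while queue:
        have, steps = queue.popleft()
        if goal in have:
            return steps
        for tool_name, (needs, produces) in tools.items():
            if needs <= have and produces not in have:
                nxt = frozenset(have | {produces})
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [tool_name]))
    return None

print(plan({"raw_logs"}, "daily_report"))  # ['parse', 'clean', 'agg']
```

WORKSWORLD additionally reasons about numeric resources and scheduling on real infrastructure; this sketch shows only the workflow-synthesis half.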

Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Mar 12, 2026

Abhinaba Basu, Pavan Chakraborty

ML models for materials science need formal safety audits—this work shows single models have severe blind spots, but systematic falsification and confidence bounds can identify reliable predictions and improve discovery by 25%.

Machine-learned models for predicting material properties often fail silently. This paper introduces Proof-Carrying Materials, a system that audits these models through adversarial testing, statistical confidence bounds, and formal verification to identify which predictions are trustworthy.

safetyevaluationapplications

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Mar 12, 2026

Jingyang Ke, Weihan Li, Amartya Pradhan et al.

You can leverage pretrained vision-language models for specialized tasks like animal behavior analysis without fine-tuning—just guide them through explicit reasoning steps and let them work with minimal human labels.

BehaviorVLM uses vision-language models to automatically understand animal behavior and estimate body poses without requiring task-specific training or heavy manual labeling. It combines visual reasoning, temporal analysis, and semantic understanding to identify what animals are doing and where their body parts are, making behavioral neuroscience research more scalable and reproducible.

multimodalapplicationsreasoning

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Mar 12, 2026

Zexuan Yan, Jiarui Jin, Yue Ma et al.

You can improve any text-to-image model's ability to render complex text and formulas without retraining—just add an agentic workflow that guides the generation process using glyph templates.

GlyphBanana solves the problem of generating accurate text and mathematical formulas in images by using an agentic workflow that guides existing text-to-image models. Instead of retraining models, it injects glyph templates into the model's internal representations to iteratively improve text rendering quality.

agentsmultimodalapplications

LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Mar 12, 2026

Feiyu Duan, Xuanjing Huang, Zhongyu Wei

Current LLMs struggle with implicit user intentions and long-term preference modeling—they can handle immediate requests but fail to understand what users really need or remember their preferences over extended interactions.

LifeSim creates realistic simulated users with beliefs, desires, and intentions to test how well AI assistants handle long-term, multi-scenario interactions. The benchmark evaluates whether AI can understand both explicit requests and hidden user needs, maintain accurate user profiles over time, and provide contextually appropriate responses across 1,200 diverse life scenarios.

evaluationagentsapplications

Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Feb 27, 2026

Dake Zhang, Mark D. Smucker, Charles L. A. Clarke

Automated evaluation of RAG systems for news credibility assessment can reliably match human judgment, enabling faster iteration on trustworthiness-assistance tools without a human reviewer for every submission.

This paper describes evaluation tools for AI systems that help readers assess whether news articles are trustworthy. Researchers created benchmarks with human-judged questions and reports about real news, then built an automated system to score new submissions without needing human reviewers each time.

evaluationapplicationsreasoning

Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

Feb 27, 2026

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Use knowledge graph topology to guide web crawling toward undiscovered entities, making supplier discovery more complete with less computational cost.

This paper tackles the problem of finding all small and medium-sized businesses in specialized industries (like semiconductor equipment makers) by combining web crawling, knowledge graphs, and smart coverage estimation.

dataapplications

FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System

Feb 27, 2026

Kriti Thakur, Alivelu Manga Parimi, Mayukha Pal

Transformers can outperform traditional deep learning for time-series fault detection in power systems, especially as grids become more complex with PMU-based real-time sensing.

FaultXformer uses a Transformer model to detect and locate electrical faults in power grids using real-time sensor data. It processes current measurements in two stages—first extracting temporal patterns, then classifying fault types and pinpointing locations—achieving 98%+ accuracy and outperforming traditional deep learning approaches like CNNs and LSTMs.

architectureapplicationsevaluation

Histopathology Image Normalization via Latent Manifold Compaction

Feb 27, 2026

Xiaolong Zhang, Jianwei Zhang, Selim Sevim et al.

Unsupervised learning can remove batch effects from medical images, letting models generalize across hospitals without retraining.

Medical image analysis struggles when microscope slides are stained or scanned differently across hospitals—models trained on one site fail at another. This paper introduces a technique that learns to remove these visual differences automatically, making AI models work reliably across different clinical sites without needing labeled examples.

dataapplicationstraining

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Feb 27, 2026

Javier Pulido, Filipe Rodrigues

Foundation models trained on diverse time-series data can forecast transportation metrics without task-specific tuning, making them practical baselines for new forecasting tasks.

This paper tests whether a general-purpose time-series AI model (Chronos-2) can forecast transportation data like traffic volume and bike-sharing demand without any custom training. The model works surprisingly well out-of-the-box, often beating specialized models built just for these tasks, and also provides useful uncertainty estimates.

evaluationapplicationsefficiency

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Feb 26, 2026

Dany Haddad, Dan Bareket, Joseph Chee Chang et al.

Scientists use AI research tools as collaborative partners, not search engines: they write complex queries, reuse outputs, and dig into citations rather than treating responses as one-off answers.

Researchers analyzed how scientists actually use AI-powered research tools by studying over 200,000 real queries and interactions. They found that scientists write longer, more complex questions than traditional search, treat AI as a research partner for drafting and brainstorming, and revisit AI responses like documents rather than one-off answers.

applicationsevaluationagents

Utilizing LLMs for Industrial Process Automation

Feb 26, 2026

Salim Fares

LLMs can accelerate industrial automation development despite being trained on little specialized domain code, opening new productivity gains in manufacturing.

This paper explores how large language models can help developers write code for industrial automation systems—like programming robotic arms in manufacturing. Most LLM research focuses on common languages like Python, but industrial systems use specialized proprietary languages that LLMs rarely see in training data.

applicationstrainingefficiency

Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Feb 26, 2026

Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts et al.

Breaking complex financial tasks into specific subtasks for AI agents produces better trading returns than giving them broad instructions.

This paper builds a trading system using multiple AI agents that work together like an investment team. Instead of giving agents vague instructions, the researchers break down stock analysis into specific, detailed tasks—like analyzing financial statements separately from news.

agentsapplicationsreasoning

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Feb 26, 2026

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus et al.

LLMs dramatically amplify what untrained people can accomplish in specialized fields like biology, raising both opportunity and safety concerns.

Researchers tested whether LLMs actually help non-experts do biology tasks better than using the internet alone. They found novices with LLM access were 4x more accurate than those without, and sometimes outperformed trained experts. However, users weren't always getting the best results from the models, and most found it easy to get sensitive biosecurity information despite safeguards.

evaluationsafetyapplications

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Feb 26, 2026

Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy et al.

Small language models can handle real-time role classification in robotics with fine-tuning, but adding more context in longer conversations degrades their accuracy.

This paper tests whether small language models can quickly learn to identify leader and follower roles in human-robot conversations without needing large models. Researchers fine-tuned a tiny 0.5B model on robot interaction data and found it achieved 86% accuracy while running fast enough for robots to use locally, but struggled when conversations got longer.

efficiencyevaluationapplications

A Proper Scoring Rule for Virtual Staining

Feb 26, 2026

Samuel Tonks, Steve Hood, Ryan Musso et al.

Use information gain to evaluate generative models on their ability to estimate uncertainty correctly, not just prediction accuracy.

This paper introduces a better way to evaluate AI models that generate synthetic biological images (virtual staining). Instead of just checking if the overall results look right, it measures whether the model correctly estimates uncertainty about what it's predicting for each individual cell.

evaluationapplications
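Why a proper score rewards honest uncertainty rather than just accurate means can be shown with the Gaussian log score: two predictions with the same mean but different claimed uncertainty get very different scores. A generic illustration, not the paper's virtual-staining metric:

```python
import math

def log_score(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2); lower is better."""
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# Both predictions share the correct mean; the observation sits 0.5 away.
y_obs, mu = 1.5, 1.0
calibrated    = log_score(y_obs, mu, sigma=0.5)  # honest uncertainty
overconfident = log_score(y_obs, mu, sigma=0.1)  # same mean, tiny sigma

print(calibrated < overconfident)  # True: the proper score penalizes
                                   # overconfidence at equal accuracy
```

A metric that only compared means would score both predictions identically, which is exactly the failure mode the paper targets.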

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Feb 26, 2026

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

Using specialized experts for different modalities (speech vs. text) and intelligently fusing their predictions improves emotion recognition in conversations.

This paper presents MiSTER-E, a system that recognizes emotions in conversations by combining speech and text information. It uses separate AI experts for speech, text, and cross-modal analysis, then intelligently combines their predictions. The system works on real conversations without needing to know who's speaking, and achieves strong results on standard emotion recognition benchmarks.

multimodalarchitectureapplications
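The gating idea can be sketched generically: softmax weights combine per-modality expert distributions into one prediction. A toy example with assumed class counts and gate logits, not MiSTER-E's actual architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Each expert outputs a distribution over 4 emotion classes.
expert_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],  # speech expert
    [0.2, 0.6, 0.1, 0.1],  # text expert
    [0.5, 0.3, 0.1, 0.1],  # cross-modal expert
])

# Gate logits would be predicted per input in practice; fixed here.
gate_logits = np.array([1.0, 0.2, 0.5])
weights = softmax(gate_logits)

combined = weights @ expert_probs  # weighted mixture, still sums to 1
print(round(combined.sum(), 6))    # 1.0
```

Because the gate is input-dependent, the system can lean on the speech expert when prosody is informative and on the text expert otherwise.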

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Feb 26, 2026

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga et al.

A smaller, specialized AI model can generate better training data than a giant pre-trained one, unlocking real improvements in production systems.

The authors used fine-tuned AI models to generate millions of relevance labels for app search results, solving a shortage of human-labeled training data. By combining these AI-generated labels with user behavior signals, they improved their app store ranking system—especially for unpopular searches where user clicks are rare.

trainingapplicationsdata

MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Feb 26, 2026

Yizhi Li, Xiaohan Chen, Miao Jiang et al.

Combining specialized tools with general AI models beats trying to do everything with one model—especially for long videos where context matters.

MovieTeller automatically creates summaries of full-length movies by breaking the task into stages and using face recognition to keep track of which character is which. Instead of retraining models, it combines existing tools (like face detection) with language models to generate accurate, coherent movie synopses that maintain character identity throughout.

multimodalapplications

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Feb 26, 2026

Chenhe Du, Xuanyu Tian, Qing Wu et al.

Adding historical tracking to diffusion-based medical image reconstruction eliminates the bias-hallucination tradeoff and guarantees convergence to reconstructions consistent with the measured data.

This paper fixes a problem with using AI image generators to reconstruct medical scans from incomplete data. Previous methods lose track of what they've already tried, causing them to either ignore measurement constraints or hallucinate fake details. The solution adds memory to the optimization process and cleans up noise patterns so the AI generator works correctly.

applicationsefficiencyarchitecture

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Feb 26, 2026

Junhu Fu, Shuyu Liang, Wutong Li et al.

Synthetic colonoscopy videos can now be generated with enough quality and control to help with doctor training and disease diagnosis in data-scarce settings.

ColoDiff generates realistic colonoscopy videos using AI to help doctors train and diagnose intestinal diseases when real patient data is limited. It uses a technique called diffusion to create videos with smooth motion and precise control over medical details like disease type and imaging quality.

multimodalapplicationsdata

SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

Feb 26, 2026

Jiahao Zhao, Feng Jiang, Shaowei Qin et al.

Current AI models struggle with biology tasks requiring causal reasoning, and you need domain-aware evaluation metrics to properly assess them.

SC-Arena is a benchmark for testing how well AI language models understand single-cell biology. Instead of multiple-choice questions, it uses real-world tasks like predicting what happens when genes are modified. It also introduces smarter evaluation that checks answers against biological databases and scientific literature, rather than just matching text strings.

evaluationapplicationsreasoning

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Feb 26, 2026

Elzo Brito dos Santos Filho

Separate agent planning from execution: agents output intentions, a deterministic system executes them and logs everything, preventing state loss and making every change auditable.

This paper solves a critical problem with AI agents: they lose track of what they're doing over long tasks and can't reliably execute code changes. ESAA is an architecture that separates what an agent *intends* to do from what actually *happens* in your codebase.

agentsarchitectureapplications
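The event-sourcing pattern behind this can be sketched in a few lines: the agent only proposes intents, a deterministic executor applies them and appends to an immutable log, and state is rebuilt by replay. All names here are hypothetical:

```python
# Hypothetical sketch of the event-sourcing pattern: state is never
# stored directly; it is rebuilt by replaying an append-only log.
event_log = []

def execute(intent, files):
    """Deterministically apply an agent intent and record what happened."""
    if intent["op"] == "write":
        files[intent["path"]] = intent["content"]
        event_log.append({"op": "write", "path": intent["path"],
                          "content": intent["content"]})
    return files

def replay(log):
    """Rebuild the full state from the event log alone."""
    files = {}
    for event in log:
        files[event["path"]] = event["content"]
    return files

files = {}
files = execute({"op": "write", "path": "a.py", "content": "x = 1"}, files)
files = execute({"op": "write", "path": "a.py", "content": "x = 2"}, files)

# Even if the agent loses track of state, the log recovers it exactly.
print(replay(event_log) == files)  # True
```

The log also doubles as an audit trail: every change the agent ever made to the codebase is recorded with its originating intent.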