ThinkLLM
Models · Capabilities · Use Cases · Benchmarks · Papers · Glossary


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 3 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (3)

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal · architecture · data

CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Apr 2, 2026

Youssef Saidi, Haroun Elleuch, Fethi Bougares

End-to-end speech-to-entity models substantially outperform cascaded ASR+NER pipelines for Arabic, and multilingual pretraining transfers better than Arabic-specific pretraining for this low-resource task.

This paper introduces CV-18 NER, the first dataset for extracting named entities directly from Arabic speech. The researchers annotated the Arabic Common Voice corpus with 21 entity types, then compared end-to-end speech models (Whisper, AraBEST-RQ) against traditional pipelines that first transcribe speech and then extract entities.

Mar 23 – Mar 29 (2)

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Mar 26, 2026

Yuxing Lu, Xukai Zhao, Wei Wu et al.

You can improve RAG systems by preprocessing your corpus once to add distilled, compact versions of relevant documents—this works with any retrieval method and shows consistent gains without changing your pipeline.

This paper proposes WriteBack-RAG, a method that improves retrieval-augmented generation (RAG) systems by treating the knowledge base as trainable. Using labeled examples, the system identifies relevant documents, distills them into compact knowledge units, and adds these to the corpus.

data · training
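The write-back loop above can be sketched in a few lines. This is a hypothetical sketch only: `distill` here is keyword overlap standing in for the paper's evidence-distillation step, which would use an LLM, and `write_back` is an invented name.

```python
# Hypothetical sketch of WriteBack-RAG-style corpus enrichment.
# `distill` is a stand-in for the paper's evidence-distillation step,
# which would normally use an LLM, not keyword overlap.

def distill(doc: str, query: str) -> str:
    """Keep only the sentences of `doc` that share a word with the query."""
    terms = set(query.lower().split())
    kept = [s for s in doc.split(". ") if terms & set(s.lower().split())]
    return ". ".join(kept)

def write_back(corpus: list[str], labeled_queries: list[str]) -> list[str]:
    """One-off preprocessing: append compact knowledge units to the corpus."""
    enriched = list(corpus)
    for query in labeled_queries:
        for doc in corpus:
            unit = distill(doc, query)
            if unit and unit not in enriched:
                enriched.append(unit)
    return enriched

corpus = ["Paris is the capital of France. It hosts the Louvre."]
print(write_back(corpus, ["capital of France"])[1])  # the distilled unit
```

Because enrichment happens once at preprocessing time, any retriever can then index the enlarged corpus unchanged.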

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.

Mar 16 – Mar 22 (8)

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Mar 20, 2026

Jiazheng Xing, Fei Du, Hangjie Yuan et al.

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

multimodal · architecture · data

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Mar 19, 2026

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data from diffusion models may not be as privacy-safe as assumed—membership inference attacks can still reveal whether specific records were in the training data, even with synthetic tabular outputs.

This challenge evaluates how well synthetic tabular data generated by diffusion models protects privacy against membership inference attacks. Researchers tested whether synthetic data truly hides information about individuals in the original dataset, developing new attack methods to measure privacy risks across different types of tabular data structures.
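A feel for the threat model: the classic distance-to-closest-record baseline flags a candidate as a training member when some synthetic row sits suspiciously close to it. This is a generic illustration, not one of the challenge's attack methods.

```python
# Distance-to-closest-record baseline for membership inference on
# synthetic tabular data. Illustrative only; the challenge's attacks on
# diffusion-generated tables are far more sophisticated.

def nearest_distance(record, synthetic):
    """Euclidean distance from `record` to its closest synthetic row."""
    return min(
        sum((a - b) ** 2 for a, b in zip(record, row)) ** 0.5
        for row in synthetic
    )

def infer_membership(candidates, synthetic, threshold):
    """Guess 'member' when a synthetic row sits closer than `threshold`."""
    return [nearest_distance(c, synthetic) < threshold for c in candidates]

synthetic = [(0.0, 0.0), (10.0, 10.0)]
print(infer_membership([(0.1, 0.0), (5.0, 5.0)], synthetic, threshold=1.0))
# → [True, False]
```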

Mar 9 – Mar 15 (5)

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Mar 13, 2026

Xin Chen, Junchao Wu, Shu Yang et al.

You can train better LLMs on less data by selecting instruction examples that activate the same neurons as your target task—this beats using all data or relying on external models to score examples.

This paper introduces NAIT, a method for selecting the most useful instruction-tuning data for large language models by analyzing which neurons activate when processing different types of tasks. Instead of using all available training data, NAIT identifies a small subset (10% of data) that produces better results by matching neuron activation patterns to target capabilities.

training · data · efficiency
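The selection idea reduces to ranking examples by how closely their activation pattern matches the target task's. A toy version with mock activation vectors (the `act` field and the cosine scoring are illustrative, not NAIT's actual procedure):

```python
import math

# Toy neuron-aware selection: each example carries a (mock) activation
# vector; keep the top 10% most similar to the target task's pattern.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_by_activation(examples, target_pattern, frac=0.1):
    ranked = sorted(
        examples,
        key=lambda ex: cosine(ex["act"], target_pattern),
        reverse=True,
    )
    return ranked[: max(1, int(len(ranked) * frac))]

pool = [
    {"id": "math", "act": [0.9, 0.1, 0.0]},
    {"id": "chat", "act": [0.0, 0.2, 0.9]},
    {"id": "code", "act": [0.8, 0.2, 0.1]},
]
print(select_by_activation(pool, target_pattern=[1.0, 0.0, 0.0])[0]["id"])  # → math
```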

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Mar 13, 2026

Haonan Huang

AI agents performing scientific research need memory and reflection, not just execution capability. Knowledge consolidation between runs dramatically improves efficiency and accuracy in computational science workflows.

QMatSuite is a platform that helps AI agents learn from computational materials science experiments by storing findings, retrieving past knowledge, and reflecting on results.

Feb 23 – Mar 1 (8)

Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

Feb 27, 2026

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Use knowledge graph topology to guide web crawling toward undiscovered entities, making supplier discovery more complete with less computational cost.

This paper tackles the problem of finding all small and medium-sized businesses in specialized industries (like semiconductor equipment makers) by combining web crawling, knowledge graphs, and smart coverage estimation.

data · applications
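The core intuition can be shown with a toy heuristic: entities with the fewest known relations mark under-covered regions of the knowledge graph, so crawl around them first. This is a sketch of the idea, not the paper's coverage estimator.

```python
# Knowledge-graph-guided crawl scheduling: rank entities by ascending
# known-link count so sparse entities are visited first.
# The graph contents below are invented for illustration.

def next_crawl_entities(graph: dict[str, list[str]], k: int = 2) -> list[str]:
    """Return the k entities with the fewest known relations."""
    return sorted(graph, key=lambda entity: len(graph[entity]))[:k]

kg = {
    "ASML": ["supplies:TSMC", "located_in:NL"],
    "SmallCo": [],
    "MidCo": ["located_in:DE"],
}
print(next_crawl_entities(kg))  # → ['SmallCo', 'MidCo']
```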

Histopathology Image Normalization via Latent Manifold Compaction

Feb 27, 2026

Xiaolong Zhang, Jianwei Zhang, Selim Sevim et al.

Unsupervised learning can remove batch effects from medical images, letting models generalize across hospitals without retraining.

Medical image analysis struggles when microscope slides are stained or scanned differently across hospitals—models trained on one site fail at another. This paper introduces a technique that learns to remove these visual differences automatically, making AI models work reliably across different clinical sites without needing labeled examples.

data

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

data · training · agents

A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Mar 19, 2026

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.

Automated health literacy detection from clinical notes is now possible with HEALIX, a curated dataset that could help clinicians identify patients needing extra support without adding screening burden.

Researchers created HEALIX, the first public dataset of 589 clinical notes annotated for patient health literacy levels (low, normal, high). Health literacy—a patient's ability to understand medical information—affects treatment outcomes, but current screening tools are impractical.

data · applications · evaluation

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Mar 18, 2026

Amine Lbath

Automated vulnerability injection with proof-of-concept exploits can scale up realistic training datasets for repository-level security detection, moving beyond function-level benchmarks to test how AI handles real-world code complexity.

This research creates an automated system to generate large-scale datasets for training AI models to detect software vulnerabilities in real code repositories.

data · safety · agents

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Mar 18, 2026

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

Machine translation systems have systematic gender bias—they default to masculine forms when translating from English to gendered languages. This paper provides annotation guidelines and a benchmark dataset to measure and fix this problem.

This paper introduces ConGA, a framework for annotating gender in machine translation to address how systems handle gender when translating from gender-neutral languages (like English) to gendered ones (like Italian).

data · evaluation · alignment

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Mar 17, 2026

Kaixuan Wang, Tianxing Chen, Jiawei Liu et al.

Having diverse, high-quality 3D assets at scale dramatically improves robot learning in simulation—this dataset removes a major bottleneck for scaling robotic manipulation training.

ManiTwin is an automated pipeline that converts single images into simulation-ready 3D digital objects for robot training. The team created ManiTwin-100K, a dataset of 100,000 annotated 3D assets with physical properties and manipulation instructions, enabling large-scale generation of robot training data in simulation.

data · applications · training

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026

Sahil Sen, Elias Lumer, Anmol Gulati et al.

Structuring long conversation histories as timestamped events with intelligent retrieval guidance lets AI agents accurately answer complex questions about what happened weeks or months ago—critical for building chatbots that remember user preferences and history over extended periods.

Chronos is a memory system for AI chatbots that tracks conversations over months by breaking down dialogue into timestamped events and organizing them in structured calendars. When answering questions about past conversations, it uses dynamic prompts to guide retrieval across time ranges and handle complex multi-step reasoning, achieving 95.6% accuracy on long-term memory tasks.

agents · reasoning · data

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Mar 16, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

You can now build frontier-level search agents without proprietary data—OpenSeeker proves that smart data synthesis (not scale) is the bottleneck, and releases everything needed to replicate it.

OpenSeeker is a fully open-source search agent that achieves state-of-the-art performance by synthesizing high-quality training data through two techniques: generating complex multi-hop reasoning tasks by reverse-engineering web graphs, and denoising agent trajectories using summarization.

agents · data · reasoning

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodal · data · evaluation

STAMP: Selective Task-Aware Mechanism for Text Privacy

Mar 12, 2026

Fengwei Tian, Payel Bhattacharjee, Heidi Hanson et al.

By combining task-aware importance scoring with privacy sensitivity detection, STAMP achieves better privacy-utility trade-offs than uniform noise approaches—meaning you can protect sensitive data without sacrificing model performance.

STAMP is a privacy framework that protects sensitive information in text while keeping it useful for AI tasks. It smartly decides which parts of text need more protection (like names and dates) versus which parts are less sensitive, then applies targeted noise to embeddings using a novel 'polar mechanism' that preserves semantic meaning better than traditional approaches.

safety · data · efficiency
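Selective noising in miniature: tokens flagged as sensitive get a larger perturbation on their embeddings. Plain Gaussian noise is a stand-in here; STAMP's actual 'polar mechanism' and its sensitivity scoring are different, and all names below are illustrative.

```python
import random

# Sketch of selective noising: sensitive tokens receive a boosted
# Gaussian perturbation, other tokens a mild one.

def selective_noise(tokens, embeddings, sensitive, base_scale=0.05, boost=10.0):
    noised = []
    for token, vec in zip(tokens, embeddings):
        scale = base_scale * (boost if token in sensitive else 1.0)
        noised.append([x + random.gauss(0.0, scale) for x in vec])
    return noised

random.seed(0)
out = selective_noise(
    ["John", "visited", "Paris"],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    sensitive={"John"},
)
```

The privacy-utility knob is explicit: raising `boost` protects flagged tokens harder while leaving the rest of the text nearly intact.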

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Mar 12, 2026

Jiayin Lei, Ming Ma, Yunxi Duan et al.

When training on synthetic code data, filtering by reverse semantic coherence (can the answer predict the question?) is more effective at removing noise than forward metrics, letting you use 75% less data without losing model quality.

This paper introduces QAQ, a method for filtering noisy synthetic code training data by measuring bidirectional semantic coherence—checking not just if a model can generate answers from questions, but also if answers can predict back to questions. By selecting only 25% of data with the highest quality scores, the approach matches full-dataset performance while cutting computational costs.

data · training
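The reverse-direction check can be mocked with lexical overlap standing in for the model-based reverse likelihood QAQ actually uses: score each (question, answer) pair by how well the answer points back to the question, and keep the top 25%.

```python
# Toy bidirectional-coherence filter. Lexical overlap is a stand-in for
# the model-based reverse likelihood; function names are illustrative.

def reverse_score(question: str, answer: str) -> float:
    """Fraction of the question's words recoverable from the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(1, len(q_words))

def qaq_filter(pairs, keep_frac=0.25):
    ranked = sorted(pairs, key=lambda p: reverse_score(p[0], p[1]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]

pairs = [
    ("reverse a list in python", "use list.reverse() to reverse a python list"),
    ("reverse a list in python", "try turning it off and on"),
    ("sort numbers", "bubble sort compares neighbours"),
    ("parse json", "open the window"),
]
print(qaq_filter(pairs)[0][1])  # → 'use list.reverse() to reverse a python list'
```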

A Dataset is Worth 1 MB

Feb 26, 2026

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

You can teach models new tasks by transmitting just labels instead of data, if clients have a generic reference dataset pre-loaded.

Instead of sending large datasets over the network, this paper proposes sending only class labels for images from a reference dataset that clients already have locally. A smart filtering mechanism picks which images are most relevant to the new task, reducing communication to under 1 MB while maintaining accuracy.

efficiency · data · training
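The transmission side reduces to: rank reference-set image ids by relevance to the new task, then pack (id, label) pairs until a byte budget is hit. The relevance scores, labels, and JSON encoding below are invented for illustration, not the paper's filtering mechanism.

```python
import json

# Label-only transfer sketch: the client already holds the reference
# images, so the server ships only (id, label) pairs within a budget.

def build_payload(relevance: dict, labels: dict, budget_bytes: int = 1_000_000):
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    payload = []
    for img_id in ranked:
        candidate = payload + [[img_id, labels[img_id]]]
        if len(json.dumps(candidate).encode()) > budget_bytes:
            break
        payload = candidate
    return payload

relevance = {0: 0.9, 1: 0.1, 2: 0.5}
labels = {0: "cat", 1: "dog", 2: "cat"}
print(build_payload(relevance, labels))  # → [[0, 'cat'], [2, 'cat'], [1, 'dog']]
```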

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 26, 2026

Amita Kamath, Jack Hessel, Khyathi Chandu et al.

Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data curation.

Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images—omitting obvious details.

data · evaluation · reasoning

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Feb 26, 2026

Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty et al.

You can create smaller datasets that preserve large dataset knowledge using pre-trained diffusion models with geometric guidance—no retraining needed.

This paper introduces ManifoldGD, a method to create smaller, representative datasets from large ones using diffusion models without any training. Instead of simple guidance, it uses geometric manifold structures to ensure generated synthetic data captures both broad concepts and fine details, resulting in better quality distilled datasets with fewer images.

data · efficiency

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Feb 26, 2026

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga et al.

A smaller, specialized AI model can generate better training data than a giant pre-trained one, unlocking real improvements in production systems.

Google used fine-tuned AI models to generate millions of relevance labels for app search results, solving a shortage of human-labeled training data. By combining these AI-generated labels with user behavior signals, they improved their App Store ranking system—especially for unpopular searches where user clicks are rare.

training · applications · data

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Feb 26, 2026

Pengxiang Li, Dilxat Muhtar, Lu Yin et al.

Training data structure, not model architecture, is why parallel language models revert to sequential generation—fix the training data to unlock truly parallel decoding.

Diffusion language models promise faster parallel text generation, but they often end up generating tokens one-at-a-time like traditional models. This paper shows the problem is how models are trained—sequential training data pushes them toward sequential generation.

training · efficiency · data

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Feb 26, 2026

Junhu Fu, Shuyu Liang, Wutong Li et al.

Synthetic colonoscopy videos can now be generated with enough quality and control to help with doctor training and disease diagnosis in data-scarce settings.

ColoDiff generates realistic colonoscopy videos using AI to help doctors train and diagnose intestinal diseases when real patient data is limited. It uses a technique called diffusion to create videos with smooth motion and precise control over medical details like disease type and imaging quality.

multimodal · applications · data