ThinkLLM
Models · Capabilities · Use Cases · Benchmarks · Papers · Glossary


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 3 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (3)

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal · architecture · data

CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Apr 2, 2026

Youssef Saidi, Haroun Elleuch, Fethi Bougares

End-to-end speech-to-entity models substantially outperform cascaded ASR+NER pipelines for Arabic, and multilingual pretraining transfers better than Arabic-specific pretraining for this low-resource task.

This paper introduces CV-18 NER, the first dataset for extracting named entities directly from Arabic speech. The researchers annotated the Arabic Common Voice corpus with 21 entity types, then compared end-to-end speech models (Whisper, AraBEST-RQ) against traditional pipelines that first transcribe speech and then extract entities.

Mar 23 – Mar 29 (2)

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Mar 26, 2026

Yuxing Lu, Xukai Zhao, Wei Wu et al.

You can improve RAG systems by preprocessing your corpus once to add distilled, compact versions of relevant documents—this works with any retrieval method and shows consistent gains without changing your pipeline.

This paper proposes WriteBack-RAG, a method that improves retrieval-augmented generation (RAG) systems by treating the knowledge base as trainable. Using labeled examples, the system identifies relevant documents, distills them into compact knowledge units, and adds these to the corpus.

data · training
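The write-back loop above can be sketched in a few lines. This is a hypothetical sketch only: `distill` here is keyword overlap standing in for the paper's evidence-distillation step, which would use an LLM, and `write_back` is an invented name.

```python
# Hypothetical sketch of WriteBack-RAG-style corpus enrichment.
# `distill` is a stand-in for the paper's evidence-distillation step,
# which would normally use an LLM, not keyword overlap.

def distill(doc: str, query: str) -> str:
    """Keep only the sentences of `doc` that share a word with the query."""
    terms = set(query.lower().split())
    kept = [s for s in doc.split(". ") if terms & set(s.lower().split())]
    return ". ".join(kept)

def write_back(corpus: list[str], labeled_queries: list[str]) -> list[str]:
    """One-off preprocessing: append compact knowledge units to the corpus."""
    enriched = list(corpus)
    for query in labeled_queries:
        for doc in corpus:
            unit = distill(doc, query)
            if unit and unit not in enriched:
                enriched.append(unit)
    return enriched

corpus = ["Paris is the capital of France. It hosts the Louvre."]
print(write_back(corpus, ["capital of France"])[1])  # the distilled unit
```

Because enrichment happens once at preprocessing time, any retriever can then index the enlarged corpus unchanged.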

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.

Mar 16 – Mar 22 (8)

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Mar 20, 2026

Jiazheng Xing, Fei Du, Hangjie Yuan et al.

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

multimodal · architecture · data

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Mar 19, 2026

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data from diffusion models may not be as privacy-safe as assumed—membership inference attacks can still reveal whether specific records were in the training data, even with synthetic tabular outputs.

This challenge evaluates how well synthetic tabular data generated by diffusion models protects privacy against membership inference attacks. Researchers tested whether synthetic data truly hides information about individuals in the original dataset, developing new attack methods to measure privacy risks across different types of tabular data structures.
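A feel for the threat model: the classic distance-to-closest-record baseline flags a candidate as a training member when some synthetic row sits suspiciously close to it. This is a generic illustration, not one of the challenge's attack methods.

```python
# Distance-to-closest-record baseline for membership inference on
# synthetic tabular data. Illustrative only; the challenge's attacks on
# diffusion-generated tables are far more sophisticated.

def nearest_distance(record, synthetic):
    """Euclidean distance from `record` to its closest synthetic row."""
    return min(
        sum((a - b) ** 2 for a, b in zip(record, row)) ** 0.5
        for row in synthetic
    )

def infer_membership(candidates, synthetic, threshold):
    """Guess 'member' when a synthetic row sits closer than `threshold`."""
    return [nearest_distance(c, synthetic) < threshold for c in candidates]

synthetic = [(0.0, 0.0), (10.0, 10.0)]
print(infer_membership([(0.1, 0.0), (5.0, 5.0)], synthetic, threshold=1.0))
# → [True, False]
```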

Mar 9 – Mar 15 (5)

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Mar 13, 2026

Xin Chen, Junchao Wu, Shu Yang et al.

You can train better LLMs on less data by selecting instruction examples that activate the same neurons as your target task—this beats using all data or relying on external models to score examples.

This paper introduces NAIT, a method for selecting the most useful instruction-tuning data for large language models by analyzing which neurons activate when processing different types of tasks. Instead of using all available training data, NAIT identifies a small subset (10% of data) that produces better results by matching neuron activation patterns to target capabilities.

training · data · efficiency
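The selection idea reduces to ranking examples by how closely their activation pattern matches the target task's. A toy version with mock activation vectors (the `act` field and the cosine scoring are illustrative, not NAIT's actual procedure):

```python
import math

# Toy neuron-aware selection: each example carries a (mock) activation
# vector; keep the top 10% most similar to the target task's pattern.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_by_activation(examples, target_pattern, frac=0.1):
    ranked = sorted(
        examples,
        key=lambda ex: cosine(ex["act"], target_pattern),
        reverse=True,
    )
    return ranked[: max(1, int(len(ranked) * frac))]

pool = [
    {"id": "math", "act": [0.9, 0.1, 0.0]},
    {"id": "chat", "act": [0.0, 0.2, 0.9]},
    {"id": "code", "act": [0.8, 0.2, 0.1]},
]
print(select_by_activation(pool, target_pattern=[1.0, 0.0, 0.0])[0]["id"])  # → math
```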

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Mar 13, 2026

Haonan Huang

AI agents performing scientific research need memory and reflection, not just execution capability. Knowledge consolidation between runs dramatically improves efficiency and accuracy in computational science workflows.

QMatSuite is a platform that helps AI agents learn from computational materials science experiments by storing findings, retrieving past knowledge, and reflecting on results.

Feb 23 – Mar 1 (8)

Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

Feb 27, 2026

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Use knowledge graph topology to guide web crawling toward undiscovered entities, making supplier discovery more complete with less computational cost.

This paper tackles the problem of finding all small and medium-sized businesses in specialized industries (like semiconductor equipment makers) by combining web crawling, knowledge graphs, and smart coverage estimation.

data · applications
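The core intuition can be shown with a toy heuristic: entities with the fewest known relations mark under-covered regions of the knowledge graph, so crawl around them first. This is a sketch of the idea, not the paper's coverage estimator.

```python
# Knowledge-graph-guided crawl scheduling: rank entities by ascending
# known-link count so sparse entities are visited first.
# The graph contents below are invented for illustration.

def next_crawl_entities(graph: dict[str, list[str]], k: int = 2) -> list[str]:
    """Return the k entities with the fewest known relations."""
    return sorted(graph, key=lambda entity: len(graph[entity]))[:k]

kg = {
    "ASML": ["supplies:TSMC", "located_in:NL"],
    "SmallCo": [],
    "MidCo": ["located_in:DE"],
}
print(next_crawl_entities(kg))  # → ['SmallCo', 'MidCo']
```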

Histopathology Image Normalization via Latent Manifold Compaction

Feb 27, 2026

Xiaolong Zhang, Jianwei Zhang, Selim Sevim et al.

Unsupervised learning can remove batch effects from medical images, letting models generalize across hospitals without retraining.

Medical image analysis struggles when microscope slides are stained or scanned differently across hospitals—models trained on one site fail at another. This paper introduces a technique that learns to remove these visual differences automatically, making AI models work reliably across different clinical sites without needing labeled examples.

data

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

data · training · agents

A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Mar 19, 2026

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.

Automated health literacy detection from clinical notes is now possible with HEALIX, a curated dataset that could help clinicians identify patients needing extra support without adding screening burden.

Researchers created HEALIX, the first public dataset of 589 clinical notes annotated for patient health literacy levels (low, normal, high). Health literacy—a patient's ability to understand medical information—affects treatment outcomes, but current screening tools are impractical.

data · applications · evaluation

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Mar 18, 2026

Amine Lbath

Automated vulnerability injection with proof-of-concept exploits can scale up realistic training datasets for repository-level security detection, moving beyond function-level benchmarks to test how AI handles real-world code complexity.

This research creates an automated system to generate large-scale datasets for training AI models to detect software vulnerabilities in real code repositories.

data · safety · agents

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Mar 18, 2026

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

Machine translation systems have systematic gender bias—they default to masculine forms when translating from English to gendered languages. This paper provides annotation guidelines and a benchmark dataset to measure and fix this problem.

This paper introduces ConGA, a framework for annotating gender in machine translation to address how systems handle gender when translating from gender-neutral languages (like English) to gendered ones (like Italian).

data · evaluation · alignment

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Mar 17, 2026

Kaixuan Wang, Tianxing Chen, Jiawei Liu et al.

Having diverse, high-quality 3D assets at scale dramatically improves robot learning in simulation—this dataset removes a major bottleneck for scaling robotic manipulation training.

ManiTwin is an automated pipeline that converts single images into simulation-ready 3D digital objects for robot training. The team created ManiTwin-100K, a dataset of 100,000 annotated 3D assets with physical properties and manipulation instructions, enabling large-scale generation of robot training data in simulation.

data · applications · training

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026

Sahil Sen, Elias Lumer, Anmol Gulati et al.

Structuring long conversation histories as timestamped events with intelligent retrieval guidance lets AI agents accurately answer complex questions about what happened weeks or months ago—critical for building chatbots that remember user preferences and history over extended periods.

Chronos is a memory system for AI chatbots that tracks conversations over months by breaking down dialogue into timestamped events and organizing them in structured calendars. When answering questions about past conversations, it uses dynamic prompts to guide retrieval across time ranges and handle complex multi-step reasoning, achieving 95.6% accuracy on long-term memory tasks.

agents · reasoning · data

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Mar 16, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

You can now build frontier-level search agents without proprietary data—OpenSeeker proves that smart data synthesis (not scale) is the bottleneck, and releases everything needed to replicate it.

OpenSeeker is a fully open-source search agent that achieves state-of-the-art performance by synthesizing high-quality training data through two techniques: generating complex multi-hop reasoning tasks by reverse-engineering web graphs, and denoising agent trajectories using summarization.

agents · data · reasoning

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodal · data · evaluation

STAMP: Selective Task-Aware Mechanism for Text Privacy

Mar 12, 2026

Fengwei Tian, Payel Bhattacharjee, Heidi Hanson et al.

By combining task-aware importance scoring with privacy sensitivity detection, STAMP achieves better privacy-utility trade-offs than uniform noise approaches—meaning you can protect sensitive data without sacrificing model performance.

STAMP is a privacy framework that protects sensitive information in text while keeping it useful for AI tasks. It smartly decides which parts of text need more protection (like names and dates) versus which parts are less sensitive, then applies targeted noise to embeddings using a novel 'polar mechanism' that preserves semantic meaning better than traditional approaches.

safety · data · efficiency
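Selective noising in miniature: tokens flagged as sensitive get a larger perturbation on their embeddings. Plain Gaussian noise is a stand-in here; STAMP's actual 'polar mechanism' and its sensitivity scoring are different, and all names below are illustrative.

```python
import random

# Sketch of selective noising: sensitive tokens receive a boosted
# Gaussian perturbation, other tokens a mild one.

def selective_noise(tokens, embeddings, sensitive, base_scale=0.05, boost=10.0):
    noised = []
    for token, vec in zip(tokens, embeddings):
        scale = base_scale * (boost if token in sensitive else 1.0)
        noised.append([x + random.gauss(0.0, scale) for x in vec])
    return noised

random.seed(0)
out = selective_noise(
    ["John", "visited", "Paris"],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    sensitive={"John"},
)
```

The privacy-utility knob is explicit: raising `boost` protects flagged tokens harder while leaving the rest of the text nearly intact.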

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Mar 12, 2026

Jiayin Lei, Ming Ma, Yunxi Duan et al.

When training on synthetic code data, filtering by reverse semantic coherence (can the answer predict the question?) is more effective at removing noise than forward metrics, letting you use 75% less data without losing model quality.

This paper introduces QAQ, a method for filtering noisy synthetic code training data by measuring bidirectional semantic coherence—checking not just if a model can generate answers from questions, but also if answers can predict back to questions. By selecting only 25% of data with the highest quality scores, the approach matches full-dataset performance while cutting computational costs.

data · training
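The reverse-direction check can be mocked with lexical overlap standing in for the model-based reverse likelihood QAQ actually uses: score each (question, answer) pair by how well the answer points back to the question, and keep the top 25%.

```python
# Toy bidirectional-coherence filter. Lexical overlap is a stand-in for
# the model-based reverse likelihood; function names are illustrative.

def reverse_score(question: str, answer: str) -> float:
    """Fraction of the question's words recoverable from the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(1, len(q_words))

def qaq_filter(pairs, keep_frac=0.25):
    ranked = sorted(pairs, key=lambda p: reverse_score(p[0], p[1]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]

pairs = [
    ("reverse a list in python", "use list.reverse() to reverse a python list"),
    ("reverse a list in python", "try turning it off and on"),
    ("sort numbers", "bubble sort compares neighbours"),
    ("parse json", "open the window"),
]
print(qaq_filter(pairs)[0][1])  # → 'use list.reverse() to reverse a python list'
```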

A Dataset is Worth 1 MB

Feb 26, 2026

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

You can teach models new tasks by transmitting just labels instead of data, if clients have a generic reference dataset pre-loaded.

Instead of sending large datasets over the network, this paper proposes sending only class labels for images from a reference dataset that clients already have locally. A smart filtering mechanism picks which images are most relevant to the new task, reducing communication to under 1 MB while maintaining accuracy.

efficiency · data · training
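The transmission side reduces to: rank reference-set image ids by relevance to the new task, then pack (id, label) pairs until a byte budget is hit. The relevance scores, labels, and JSON encoding below are invented for illustration, not the paper's filtering mechanism.

```python
import json

# Label-only transfer sketch: the client already holds the reference
# images, so the server ships only (id, label) pairs within a budget.

def build_payload(relevance: dict, labels: dict, budget_bytes: int = 1_000_000):
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    payload = []
    for img_id in ranked:
        candidate = payload + [[img_id, labels[img_id]]]
        if len(json.dumps(candidate).encode()) > budget_bytes:
            break
        payload = candidate
    return payload

relevance = {0: 0.9, 1: 0.1, 2: 0.5}
labels = {0: "cat", 1: "dog", 2: "cat"}
print(build_payload(relevance, labels))  # → [[0, 'cat'], [2, 'cat'], [1, 'dog']]
```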

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 26, 2026

Amita Kamath, Jack Hessel, Khyathi Chandu et al.

Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data curation.

Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images—omitting obvious details.

data · evaluation · reasoning

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Feb 26, 2026

Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty et al.

You can create smaller datasets that preserve large dataset knowledge using pre-trained diffusion models with geometric guidance—no retraining needed.

This paper introduces ManifoldGD, a method to create smaller, representative datasets from large ones using diffusion models without any training. Instead of simple guidance, it uses geometric manifold structures to ensure generated synthetic data captures both broad concepts and fine details, resulting in better quality distilled datasets with fewer images.

data · efficiency

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Feb 26, 2026

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga et al.

A smaller, specialized AI model can generate better training data than a giant pre-trained one, unlocking real improvements in production systems.

Google used fine-tuned AI models to generate millions of relevance labels for app search results, solving a shortage of human-labeled training data. By combining these AI-generated labels with user behavior signals, they improved their App Store ranking system—especially for unpopular searches where user clicks are rare.

training · applications · data

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Feb 26, 2026

Pengxiang Li, Dilxat Muhtar, Lu Yin et al.

Training data structure, not model architecture, is why parallel language models revert to sequential generation—fix the training data to unlock truly parallel decoding.

Diffusion language models promise faster parallel text generation, but they often end up generating tokens one-at-a-time like traditional models. This paper shows the problem is how models are trained—sequential training data pushes them toward sequential generation.

training · efficiency · data

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Feb 26, 2026

Junhu Fu, Shuyu Liang, Wutong Li et al.

Synthetic colonoscopy videos can now be generated with enough quality and control to help with doctor training and disease diagnosis in data-scarce settings.

ColoDiff generates realistic colonoscopy videos using AI to help doctors train and diagnose intestinal diseases when real patient data is limited. It uses a technique called diffusion to create videos with smooth motion and precise control over medical details like disease type and imaging quality.

multimodal · applications · data