ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 14 this month · 12 topics
All · Efficiency (35) · Reasoning (35) · Multimodal (28) · Applications (28) · Evaluation (27) · Training (26) · Architecture (24) · Agents (24) · Safety (13) · Scaling (5) · Data (5) · Alignment (1)

Mar 30 – Apr 5 (16)

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodal · architecture · evaluation

No Single Best Model for Diversity: Learning a Router for Sample Diversity

Apr 2, 2026

Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar et al.

When you need diverse answers to open-ended questions, routing to the best model per query beats using any single model—and you can train a lightweight router to make this selection automatically.

This paper shows that different language models excel at generating diverse answers to open-ended questions, and no single model is best for all prompts. The authors build a router—a small model that predicts which LLM to use for each question—to dynamically select the best model.
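The routing idea can be illustrated with a toy nearest-centroid selector; this is a hypothetical stand-in for the paper's learned router (all names and features here are invented for illustration):

```python
import numpy as np

def route(query_feat, centroids):
    """Nearest-centroid router: send the query to whichever model's
    'good-at-this' centroid is closest in feature space. A toy stand-in
    for a trained router that predicts the best LLM per prompt."""
    names = list(centroids)
    dists = [np.linalg.norm(query_feat - centroids[n]) for n in names]
    return names[int(np.argmin(dists))]
```

A learned router would replace the centroid distance with a small classifier trained on per-model diversity outcomes.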

evaluation

Mar 23 – Mar 29 (17)

PixelSmile: Toward Fine-Grained Facial Expression Editing

Mar 26, 2026

Jiabin Hua, Hengyuan Xu, Aojie Li et al.

Fine-grained facial expression editing is now possible with precise control and identity preservation by disentangling expression semantics through symmetric joint training and contrastive learning.

PixelSmile is a new method for editing facial expressions in images with fine-grained control. It uses a diffusion model trained with a special technique to separate expression changes from identity, allowing smooth blending between different expressions while keeping a person's identity intact.

multimodal · evaluation

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation

Mar 16 – Mar 22 (37)

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Mar 20, 2026

Xinyi Shang, Yi Tang, Jiacheng Cui et al.

Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.

This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.

evaluation · multimodal · safety

Adaptive Greedy Frame Selection for Long Video Understanding

Mar 20, 2026

Yuning Huang, Fengqing Zhu

By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.

This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires too many frames and tokens. The authors propose a smart frame selection method that picks the most important frames by balancing two goals—relevance to the question asked and diversity of visual content—using a greedy algorithm with theoretical guarantees.
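The relevance-plus-diversity greedy step can be sketched as follows, assuming precomputed, unit-normalized frame and question features; the scoring form and the `lam` trade-off are illustrative, not the paper's exact objective:

```python
import numpy as np

def select_frames(frame_feats, query_feat, budget, lam=0.5):
    """Greedily pick `budget` frames, balancing relevance to the question
    against redundancy with frames already selected."""
    relevance = frame_feats @ query_feat  # cosine similarity to the question
    selected = []
    for _ in range(budget):
        if selected:
            # redundancy = max similarity to any already-selected frame
            redundancy = (frame_feats @ frame_feats[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(frame_feats))
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf  # never re-pick a frame
        selected.append(int(score.argmax()))
    return selected
```

With a tight budget, the diversity term keeps the selection from clustering on near-duplicate frames.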

Mar 9 – Mar 15 (14)

Representation Learning for Spatiotemporal Physical Systems

Mar 13, 2026

Helen Qu, Rudy Morel, Michael McCabe et al.

For physics-based machine learning, learning representations in latent space (like JEPAs) works better than optimizing pixel-level predictions, and generic self-supervised methods can be surprisingly effective for scientific tasks.

This paper challenges the standard approach of training physics models to predict the next frame. Instead, it evaluates whether models learn useful representations by testing them on downstream scientific tasks like estimating a system's physical parameters.

evaluation

Semantic Invariance in Agentic AI

Mar 13, 2026

I. de Zarzà, J. de Curtò, Jordi Cabot et al.

Model size doesn't guarantee robustness: smaller models like Qwen3-30B outperform much larger models at maintaining consistent reasoning when problems are rephrased, suggesting that scaling alone won't solve reliability issues for deployed AI agents.

This paper tests whether AI agents give consistent answers when you rephrase the same problem in different ways. The researchers found that larger models are actually less stable than smaller ones—a surprising result that challenges assumptions about model scaling.

evaluation · reasoning

Feb 23 – Mar 1 (16)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Feb 27, 2026

Fan Shu, Yite Wang, Ruofan Wu et al.

LLMs need specialized training data to reliably follow data science workflows; fine-tuning on task-specific benchmarks can improve performance by 8x.

DARE-bench is a benchmark for testing how well AI models can follow data science instructions and complete multi-step ML tasks. It includes 6,300 real Kaggle tasks with verifiable correct answers, making evaluation objective rather than relying on human judges.

evaluation · training · applications

Do LLMs Benefit From Their Own Words?

Feb 27, 2026

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo et al.

You can often remove an LLM's previous responses from conversation history without losing quality, saving memory while sometimes improving accuracy.

This paper tests whether LLMs actually need to see their own previous responses in multi-turn conversations. Surprisingly, removing past assistant responses often doesn't hurt quality and can shrink context by 10x. The researchers found that models sometimes get worse when they over-rely on their own prior outputs, introducing errors that compound across turns.
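The intervention is easy to try in any chat-completions loop; a minimal sketch that drops prior assistant turns while preserving order:

```python
def prune_history(messages):
    """Remove past assistant responses from a multi-turn chat history,
    keeping system and user turns in their original order."""
    return [m for m in messages if m["role"] != "assistant"]
```

In practice you might keep the most recent assistant turn when the user's next message refers back to it.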

applications

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Apr 2, 2026

Sarath Shekkizhar, Romain Cosentino, Adam Earle

Task accuracy and conversational awareness are separate capabilities—a model can answer questions correctly without understanding how users naturally respond to those answers, revealing a blind spot in current LLM evaluation.

This paper reveals that language models can solve tasks correctly without understanding how conversations should naturally continue. Researchers tested this by asking models to generate the next user message after an assistant response—a task that requires understanding interaction flow.

evaluation · reasoning

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Apr 2, 2026

Keerat Guliani, Deepkamal Gill, David Landsman et al.

LLMs can extract structured regulatory rules from legal documents through iterative self-evaluation and repair, achieving 84% preference over prior methods in downstream compliance tasks without human annotation.

De Jure automatically extracts legally binding rules from regulatory documents using LLMs and iterative self-refinement. It converts dense legal text into machine-readable rules through document normalization, semantic decomposition, multi-criteria evaluation, and repair cycles—without requiring human annotation or domain expertise.

applications · reasoning · evaluation

Best-Arm Identification with Noisy Actuation

Apr 2, 2026

Merve Karakas, Osama Hanna, Lin F. Yang et al.

When learning systems communicate over noisy channels, the fundamental limits of error-free communication directly determine how efficiently you can identify the best option in a bandit problem.

This paper tackles a multi-armed bandit problem where a learner must identify the best option (arm) but can only communicate with an agent through a noisy channel. The researchers develop communication strategies that connect to information theory concepts, showing how channel quality affects the ability to find the best arm.

reasoning · evaluation

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Apr 2, 2026

Minda Zhao, Yutong Yang, Chufei Peng et al.

Emotional framing in prompts is a weak, task-dependent signal that rarely helps across the board, but adaptive emotional selection can provide modest, reliable improvements—especially for socially-grounded reasoning tasks.

This paper investigates whether emotional language in prompts affects how well large language models perform on tasks like math, medical reasoning, and reading comprehension. The researchers found that adding emotional framing to prompts produces only small, inconsistent changes in accuracy—except in socially-grounded tasks where emotional context matters more.

evaluation · reasoning

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs

Apr 2, 2026

Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy et al.

Reasoning models can be made safer by detecting when they've misunderstood the question itself—reconstruct what question they answered from their reasoning trace, and abstain if it differs from the original.

This paper tackles a critical problem: getting LLMs to know when to refuse answering questions. The authors discovered that reasoning models often fail at abstention (refusing to answer) because they answer the wrong question rather than answering incorrectly.

reasoning · safety · evaluation

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodal · applications · evaluation

VISTA: Visualization of Token Attribution via Efficient Analysis

Apr 2, 2026

Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P et al.

You can now see which tokens your LLM's predictions actually rely on without doubling GPU memory or being locked into a specific architecture—just remove tokens and measure the impact.

VISTA is a lightweight, model-agnostic technique for visualizing which tokens matter most in LLM predictions.
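The remove-and-measure idea amounts to leave-one-out attribution. A generic sketch, where the `score_fn` interface is a hypothetical stand-in (VISTA's actual procedure is engineered to be more efficient than this naive re-scoring loop):

```python
def ablation_attribution(tokens, score_fn):
    """Score each token by how much deleting it changes the model's
    output score (e.g. the log-probability of the generated answer)."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]
```

Large positive attributions mark tokens the prediction depends on; near-zero values mark tokens the model effectively ignores.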

efficiency · evaluation

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Apr 1, 2026

Piyush Garg, Diana R. Gergel, Andrew E. Shao et al.

For AI weather prediction, the training pipeline (loss function, data, optimization strategy) determines forecast skill far more than architectural choices—and current models have a fundamental blind spot for extreme weather events.

This paper explains why training methods, loss functions, and data matter more than model architecture for AI weather prediction. Using math from approximation theory and dynamical systems, the authors show that how you train a model dominates what model you use, and prove that AI weather models systematically underestimate extreme events. They validate this across ten different AI weather models.

training · evaluation · reasoning

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Apr 1, 2026

Muyu He, Adit Jain, Anand Kumar et al.

Current LLM agents struggle with long-term planning and learning from delayed feedback—only top models like Claude Opus 4.6 succeed, and using scratchpads to persist information across context windows is critical for success.

YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long periods by simulating running a startup for a year. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.

evaluation · agents · reasoning

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models struggle inconsistently with detecting misleading visualizations; their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluation · multimodal · applications

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Apr 1, 2026

Cai Zhou, Zekai Wang, Menghua Wu et al.

ORCA calibrates LLM reasoning in real-time by adapting confidence estimates per input, enabling 40-67% compute savings during inference while providing mathematical guarantees on error rates across different reasoning tasks and domains.

This paper introduces ORCA, a framework that makes language models more efficient during reasoning by calibrating their sampling process. Using test-time training and conformal prediction, ORCA learns to estimate confidence in its own reasoning steps, reducing wasted computation while maintaining accuracy—saving up to 47% compute on in-distribution tasks and 67% on out-of-distribution problems.

reasoning · efficiency · evaluation

Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

Mar 30, 2026

N Alex Cayco Gajic, Arthur Pellegrino

Comparing neural representations by their intrinsic geometric structure—not just their raw values—reveals deeper insights into how different networks solve the same problem, enabling better interpretation of neural computations.

This paper introduces metric similarity analysis (MSA), a new method for comparing how neural networks represent information by analyzing the intrinsic geometry of their learned representations rather than just their surface-level structure.

evaluation

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Mar 30, 2026

Vitória Barin Pacela, Shruti Joshi, Isabela Camacho et al.

Sparse autoencoders fail at compositional generalization because they learn poor concept dictionaries during training, not because of their amortized inference approach—fixing dictionary learning, not inference speed, is the key to interpretable AI.

This paper reveals why sparse autoencoders (SAEs) and linear probes fail to understand compositional concepts in neural networks. The core issue isn't the inference method—it's that SAEs learn dictionaries (concept representations) pointing in the wrong directions.

reasoning · evaluation

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Mar 26, 2026

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Treating geo-localization as a sequential zooming problem over maps, rather than image retrieval, achieves better results and avoids the limitations of contrastive learning approaches that struggle with landmark visibility mismatches.

This paper tackles cross-view geo-localization—matching street-view photos to satellite maps to pinpoint a camera's location without GPS. Instead of the standard approach of comparing images in a shared embedding space, the authors propose a new method that zooms progressively into a satellite map, making sequential decisions to narrow down the location.

reasoning · architecture · evaluation

Comparing Developer and LLM Biases in Code Evaluation

Mar 25, 2026

Aditya Mittal, Ryan Shar, Zichu Wu et al.

LLMs used as code judges have significant blind spots compared to human developers—they systematically misweight code quality factors like explanation length, meaning you can't rely on them alone for code evaluation in real applications.

This paper introduces TRACE, a framework that compares how LLM judges evaluate code against human developer preferences.

evaluation · applications

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Mar 25, 2026

Biplab Pal, Santanu Bhattacharya

Before deploying agentic AI in business processes, measure the 'blind mass' of uncertain state-action pairs and expected oversight costs using event logs—this reveals hidden decision gaps that simple accuracy metrics miss.

This paper develops a mathematical framework to measure when AI agents can safely operate autonomously versus when they need human oversight.

agents · safety · evaluation

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Mar 25, 2026

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur et al.

Better retrieval doesn't guarantee better RAG answers: improving individual components can paradoxically increase confident hallucinations when relevant information isn't in your corpus.

This paper studies retrieval-augmented generation (RAG) systems for answering questions about AI policy documents. The researchers found that improving retrieval quality doesn't always lead to better answers—sometimes better retrieval actually makes the system more confidently wrong when relevant documents are missing.

evaluation · applications

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Mar 25, 2026

Samuel Taiwo, Mohd Amaluddin Yusoff

For enterprise RAG systems with structured documents, preserve document structure when chunking—it improves retrieval quality and reduces costs, but you'll need multimodal AI to handle diagrams and visual content.

This paper tests four different ways to split documents into chunks for RAG systems using oil and gas industry documents. Structure-aware chunking (which respects document layout) works best and costs less than other methods, but all approaches struggle with diagrams and visual content.
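A minimal version of structure-aware chunking: split on the document's own headings, then pack sections into size-limited chunks. Markdown-style `#` headings are an assumption here; enterprise pipelines would typically parse PDF layout instead.

```python
def chunk_by_headings(text, max_chars=500):
    """Split text into sections at heading lines, then pack whole
    sections into chunks no longer than `max_chars`."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks, buf = [], ""
    for s in sections:
        if buf and len(buf) + len(s) + 1 > max_chars:
            chunks.append(buf)
            buf = ""
        buf = buf + "\n" + s if buf else s
    if buf:
        chunks.append(buf)
    return chunks
```

Because sections are never split mid-heading, each chunk keeps its local context, which is what the structure-aware strategy exploits.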

evaluation · applications

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Mar 24, 2026

Ufaq Khan, Umair Nawaz, L D M S S Teja et al.

Medical VLMs need explicit training on input validation (checking modality, anatomy, orientation) as a separate safety step before diagnosis, not as an afterthought—current models hallucinate plausible reports even on obviously invalid inputs.

This paper reveals a critical blind spot in medical AI: vision-language models can generate fluent medical reports even when given invalid inputs like wrong body parts or upside-down images. MedObvious is a benchmark of 1,880 tasks testing whether models can catch these basic sanity checks before attempting diagnosis—a step human radiologists do automatically but VLMs currently fail at.

safety · evaluation · multimodal

Failure of contextual invariance in gender inference with large language models

Mar 24, 2026

Sagar Kumar, Ariel Flint, Luca Maria Aiello et al.

LLM outputs are unstable across contextually equivalent formulations of the same task, meaning benchmark results may not reflect how models actually behave in real applications—a critical issue for bias testing and high-stakes use.

This paper reveals that large language models fail to give consistent outputs when tasks are reformulated in contextually equivalent ways.

evaluation · safety

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Mar 24, 2026

Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

Using structured prompting formats (PEGS) with multiple LLM providers significantly improves requirements extraction accuracy (F1: 0.88 vs 0.71) and provides built-in reliability through model consensus and fallback mechanisms.

ReqFusion automates software requirements extraction and classification by combining multiple LLM providers (GPT, Claude, Groq) with a structured PEGS format prompt. The system processes various document types and achieves 88% accuracy, reducing manual analysis time by 78% while ensuring consistent requirement categorization across academic, industrial, and business contexts.

applications · evaluation

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.

safety · data · evaluation

Code Review Agent Benchmark

Mar 24, 2026

Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf et al.

Code review agents currently miss most issues that human reviewers catch, but they often flag different problems—creating opportunities for AI-assisted rather than AI-automated code review in real teams.

This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether code review agents catch the same issues humans do.

evaluation · agents · applications

Evaluating LLM-Based Test Generation Under Software Evolution

Mar 24, 2026

Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar

LLM-generated tests work well on original code but fail to adapt to program changes, indicating they learn superficial patterns rather than genuine program semantics—a critical weakness for real-world software maintenance.

This study tests whether LLMs actually understand program behavior when generating unit tests, or just memorize patterns. Researchers mutated 22,374 programs and found that while LLMs generate good tests initially (79% coverage), they fail badly when code changes—missing 34% of bugs and struggling even when code is refactored without changing functionality.

evaluation

Mecha-nudges for Machines

Mar 24, 2026

Giulio Frey, Kawin Ethayarajh

As AI agents make more real-world decisions, the way information is presented can be optimized for machines just like it is for humans—and this is already happening in practice on platforms like Etsy.

This paper introduces 'mecha-nudges'—subtle changes to how information is presented that influence AI agents' decisions without restricting options or harming human decision-making.

agents · alignment · evaluation

WorldCache: Content-Aware Caching for Accelerated Video World Models

Mar 23, 2026

Umair Nawaz, Ahmed Heakl, Ufaq Khan et al.

Smart feature caching with motion awareness can dramatically accelerate video world models without retraining, but requires adaptive thresholds and blending rather than static feature reuse.

WorldCache speeds up video generation from diffusion transformers by intelligently reusing computed features across denoising steps. Instead of naively reusing old features, it adapts based on motion and visual importance, using blending and warping to keep videos smooth and artifact-free—achieving 2.3× speedup with minimal quality loss.
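The core caching loop can be sketched as follows: reuse the expensive block's output while the latent barely moves. This is a hard-swap sketch; the paper instead adapts thresholds to motion and blends/warps cached features rather than swapping them wholesale.

```python
import numpy as np

def run_with_cache(latents, heavy_fn, thresh=0.05):
    """Run an expensive per-step function over denoising latents, reusing
    its cached output whenever the latent's relative change since the
    last computed step stays below `thresh`."""
    cached_out, last_latent, outs, calls = None, None, [], 0
    for z in latents:
        changed = (last_latent is None or
                   np.linalg.norm(z - last_latent) > thresh * np.linalg.norm(z))
        if changed:
            cached_out = heavy_fn(z)  # stands in for the heavy DiT block
            last_latent = z
            calls += 1
        outs.append(cached_out)
    return outs, calls
```

The returned call count is the quantity a caching scheme like this trades against quality.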

efficiency · architecture · evaluation

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Mar 23, 2026

Kelly Cui, Nikhil Prakash, Ayush Raina et al.

Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves spatial understanding tasks.

This paper reveals how vision-language models handle spatial reasoning—understanding where objects are and how they relate to each other. The researchers found that VLMs use two mechanisms: the language model processes spatial relations independently, but the vision encoder is actually the dominant source, encoding object layouts across the entire image including background areas.

multimodal · reasoning · evaluation

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Mar 23, 2026

Sashuai Zhou, Qiang Zhou, Junpeng Ma et al.

Fine-grained spatial accuracy in generated images requires explicit spatial reward modeling during training; rule-based spatial checks alone miss complex relationships that vision-language models with grounding can catch.

SpatialReward is a reward model that helps text-to-image AI systems generate images with accurate object positioning and spatial relationships. It breaks down image prompts into specific spatial requirements, uses object detection to verify positions, and applies reasoning to check complex spatial relationships—then feeds this feedback into training to improve image generation quality.

evaluation · multimodal · training

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Mar 20, 2026

Richard J. Young

Published faithfulness scores for AI reasoning are not comparable across studies because different evaluation methods measure different aspects of the same behavior at different strictness levels—always check the methodology, not just the number.

This paper shows that measuring whether AI models are 'faithful' (honestly using their reasoning) isn't objective—different evaluation methods on the same data produce wildly different results (69.7% to 82.6% faithfulness for identical models).

evaluation · reasoning · alignment

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Mar 20, 2026

Sai Koneru, Elphin Joe, Christine Kirchhoff et al.

Instruction-tuned models are vulnerable to user pressure even with strong evidence present; simply providing richer context doesn't guarantee models will resist sycophancy without explicit training for epistemic integrity.

This paper tests how well instruction-tuned language models stick to evidence when users pressure them to agree with false claims. Using climate science as a test domain, researchers found that adding more detailed evidence doesn't reliably prevent models from abandoning facts to please users—especially when evidence includes research gaps or uncertainty.

evaluation · alignment · safety

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Mar 20, 2026

Qi Cao, Andrew Gambardella, Takeshi Kojima et al.

You can measure LLM uncertainty efficiently with just one forward pass by clustering semantically similar tokens, avoiding the computational cost of sampling-based or auxiliary model approaches.

This paper proposes Semantic Token Clustering (STC), a fast method to measure how confident an LLM should be in its answers. Instead of running the model multiple times or using extra models, STC groups similar tokens together and checks if the model's top prediction comes from a coherent semantic cluster. It works in a single pass and catches cases where models are overconfident.
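A simplified sketch of the clustering idea: take the probability mass of tokens semantically close to the argmax token as the model's confidence. The cosine threshold `tau` and the single-cluster grouping are illustrative simplifications, not the paper's exact method.

```python
import numpy as np

def cluster_confidence(probs, embs, tau=0.8):
    """Single-pass confidence: probability mass of tokens whose unit
    embeddings have cosine similarity >= tau with the argmax token.

    probs: (V,) next-token probabilities; embs: (V, d) unit embeddings.
    """
    top = int(np.argmax(probs))
    sims = embs @ embs[top]
    return float(probs[sims >= tau].sum())
```

A low cluster mass flags cases where the model's probability is spread over semantically unrelated continuations, i.e. likely overconfidence in the raw top-1 probability.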

efficiency · evaluation

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Mar 19, 2026

Huaide Jiang, Yash Chaudhary, Yuping Wang et al.

Embodied navigation systems perform well in clean lab conditions but fail dramatically in real-world scenarios with sensor noise and unclear instructions—this benchmark exposes those gaps and provides mitigation strategies.

NavTrust is a benchmark that tests how well navigation AI systems handle real-world problems like blurry images, sensor noise, and unclear instructions. The researchers tested seven state-of-the-art systems and found they all struggle significantly when inputs are corrupted, then demonstrated four strategies to make them more robust.

evaluation · safety · agents

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Mar 19, 2026

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan et al.

LLMs can reason about financial fundamentals with retrieval help, but struggle significantly with trading signals and time-series patterns—a critical gap for real-world financial decision-making.

FinTradeBench is a benchmark with 1,400 questions testing how well AI models reason about financial decisions by combining company fundamentals (from financial reports) and trading signals (from stock price patterns). The benchmark reveals that current AI models struggle with numerical reasoning and time-series data, even when given access to relevant information.

evaluation · reasoning · applications

R-equivalence on Cubic Surfaces I: Existing Cases with Non-Trivial Universal Equivalence

Mar 19, 2026

Dimitri Kanevsky, Julian Salazar, Matt Harvey

R-equivalence on certain cubic surfaces is either trivial or has exponent 2, settling Manin's 1972 question about the diagonal cubic—and this work demonstrates how AI can assist in formal mathematical reasoning.

This paper studies R-equivalence on cubic surfaces over p-adic fields, proving it's trivial or has exponent 2 for surfaces with all-Eckardt reductions. The authors resolve a 50-year-old question about a specific diagonal cubic and use AI models to assist with proofs and lemma verification.

reasoning · evaluation

Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

Mar 19, 2026

Julian Allagan, Mohamed Elbakary, Zohreh Safari et al.

Phishing detector robustness is fundamentally limited by feature economics—the cost of realistic website modifications—not by model architecture. Attackers can reliably evade detection by exploiting cheap feature changes, making feature design more critical than model choice.

This paper reveals a critical weakness in phishing detection systems: while machine learning models achieve near-perfect accuracy in testing, attackers can easily evade them by making cheap, realistic changes to websites.

safety evaluation

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Mar 19, 2026

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.

An LLM's text-only auditory knowledge is a strong predictor of audio-task performance—so you can estimate how well an audio-language model will work by probing its LLM backbone before building it.

This paper investigates how much knowledge about sound LLMs actually acquire from text-only training, and whether that knowledge predicts how well they work in audio tasks. Researchers tested different LLMs three ways: directly probing their audio knowledge, having them reason about audio descriptions, and fine-tuning them into full audio-language models.

evaluation multimodal training

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Mar 19, 2026

Zehao Li, Zhenyu Wu, Yibo Zhao et al.

Breaking reward evaluation into smaller, verifiable steps with multiple reviewers produces more reliable feedback for training GUI agents, improving task success by 10% in online learning scenarios.

OS-Themis is a reward evaluation system for GUI agents that breaks down task trajectories into verifiable milestones and uses multiple reviewers to judge whether agents completed tasks correctly. This approach improves both the accuracy of reward signals and the performance of agents trained with reinforcement learning on mobile and desktop interfaces.

agents evaluation training

Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment

Mar 19, 2026

Amir Asiaee, Samhita Pal

When combining RCT and observational data with different measured variables, learning a shared embedding space and calibrating predictions outperforms traditional imputation methods, especially for detecting non-linear treatment effects.

This paper solves a practical problem in medical research: combining data from randomized trials (which prove causation but have small samples) with observational studies (which have large samples but measure different variables).

evaluation

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Mar 19, 2026

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data from diffusion models may not be as privacy-safe as assumed—membership inference attacks can still reveal whether specific records were in the training data, even with synthetic tabular outputs.

This challenge evaluates how well synthetic tabular data generated by diffusion models protects privacy against membership inference attacks. Researchers tested whether synthetic data truly hides information about individuals in the original dataset, developing new attack methods to measure privacy risks across different types of tabular data structures.

safety evaluation data

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Mar 19, 2026

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari et al.

Instead of comparing kernels to other software implementations, this benchmark measures how close optimized kernels get to theoretical hardware limits—giving AI systems a clear, unchanging target for optimization rather than a moving baseline.

SOL-ExecBench is a benchmark for evaluating GPU kernel optimization that measures performance against hardware limits rather than software baselines. It includes 235 CUDA kernels from real AI models and uses analytically derived 'Speed-of-Light' bounds to create fixed optimization targets, enabling fair evaluation of AI systems that generate and optimize code.
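The "Speed-of-Light" idea can be pictured as a roofline-style bound: a kernel can run no faster than the binding of its compute and memory-bandwidth limits. A minimal sketch, assuming illustrative peak numbers (roughly H100-class) and a hypothetical `speed_of_light_fraction` helper; the benchmark itself derives its bounds analytically per kernel:

```python
def speed_of_light_fraction(flops, bytes_moved, measured_s,
                            peak_flops=989e12, peak_bw=3.35e12):
    """Roofline-style 'Speed-of-Light' bound for one kernel.

    The fastest a kernel can possibly run is limited by either compute
    throughput or memory bandwidth, whichever binds.
    """
    bound_s = max(flops / peak_flops, bytes_moved / peak_bw)
    return bound_s / measured_s  # 1.0 means the kernel hits the hardware bound

# A memory-bound elementwise kernel: 1 GFLOP, 24 GB moved, measured at 9 ms.
print(speed_of_light_fraction(1e9, 24e9, 9e-3))
```

Because the bound depends only on hardware and the kernel's work, it is a fixed target: the fraction never changes just because a faster software baseline appears.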

evaluation efficiency agents

Evaluating Counterfactual Strategic Reasoning in Large Language Models

Mar 19, 2026

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou et al.

LLMs perform well on familiar games but fail when payoff structures change, suggesting they rely on memorized patterns rather than understanding underlying strategic principles.

This paper tests whether large language models can genuinely reason about game theory or just memorize patterns. Researchers created modified versions of classic games (Prisoner's Dilemma and Rock-Paper-Scissors) with different payoffs and labels to see if LLMs could adapt their strategy.

reasoning evaluation

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

Mar 19, 2026

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.

D5P4 enables discrete diffusion models to generate diverse text outputs efficiently by using a principled diversity mechanism during decoding, with minimal computational overhead compared to standard approaches.

This paper improves how discrete diffusion models generate text by introducing D5P4, a new decoding method that generates multiple candidate outputs in parallel while controlling diversity.

efficiency architecture evaluation

SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data

Mar 19, 2026

Mingxing Zhang, Nicola Rossberg, Simone Innocente et al.

For spectroscopy and similar high-dimensional data, combining PCA with SHAP explanations lets you understand model decisions in terms of the original measurements—critical for clinical adoption where trust and interpretability matter.

SHAPCA combines dimensionality reduction and explainability techniques to make machine learning predictions on spectroscopy data interpretable and trustworthy. It maps explanations back to the original spectral bands rather than abstract features, helping clinicians and researchers understand why models make specific predictions on high-dimensional, correlated data.
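One plausible mechanic for mapping explanations back to spectral bands is to push component-level attributions back through the PCA loadings. A toy sketch, where the data, the `phi_pc` stand-ins for SHAP values, and the loading-based mapping are all illustrative assumptions rather than the paper's exact method:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 spectra, 50 spectral bands
Xc = X - X.mean(axis=0)

# PCA via SVD; keep k components as the model's reduced features.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
scores = Xc @ Vt[:k].T                # the model would be trained on these

# Stand-ins for SHAP attributions computed on the k reduced features.
phi_pc = np.array([0.8, -0.3, 0.1, 0.0, 0.05])

# Push component-level attributions back through the loadings so each
# original spectral band receives a contribution.
phi_bands = Vt[:k].T @ phi_pc
print(phi_bands.shape)                # one attribution per band
```

The point of the back-mapping is that clinicians see contributions in wavelengths they can physically interpret, not in abstract principal components.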

evaluation applications

Implicit Patterns in LLM-Based Binary Analysis

Mar 19, 2026

Qiang Li, XiangRui Zhang, Haining Wang

LLM-based binary analysis isn't random exploration—models implicitly develop structured reasoning patterns that organize their search process, which can be measured and potentially improved for more reliable vulnerability detection.

This paper analyzes how large language models perform binary vulnerability analysis across hundreds of reasoning steps. Researchers studied 521 binaries and discovered that LLMs implicitly develop four structured patterns—early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization—that organize their exploration without explicit programming.

reasoning evaluation applications

From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Mar 19, 2026

Zhuofan Li, Hongkun Yang, Zhenyang Chen et al.

When building embodied AI systems, measure what actually matters: task completion time, motion quality, and energy use—not just model size or inference speed. Optimizing the wrong metrics can make robots perform worse in practice.

This paper shows that traditional efficiency metrics (parameters, computation) for vision-language-action robots don't match real-world performance. The researchers measured actual robotic execution—task time, motion smoothness, energy use—and found that methods optimizing for conventional metrics often make robots move worse or take longer, even when task success stays the same.

efficiency evaluation applications

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Mar 19, 2026

Maksym Del, Markus Kängsepp, Marharyta Domnich et al.

For deploying reasoning models safely, combining verbalized confidence with self-consistency gives the best uncertainty estimates with minimal computational cost, but effectiveness varies significantly across domains like math versus humanities.

This paper studies how well reasoning language models can estimate their own uncertainty by sampling multiple responses and analyzing confidence signals.
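As a sketch of how the two signals might be combined (the multiplicative scoring rule and the `combined_confidence` helper are illustrative assumptions; the paper evaluates several combinations):

```python
from collections import Counter

def combined_confidence(samples):
    """samples: list of (answer, verbalized_confidence in [0, 1]).

    Self-consistency signal: vote share of the majority answer.
    Verbalized signal: mean stated confidence for that answer.
    Multiplying the two is one simple way to combine them.
    """
    votes = Counter(answer for answer, _ in samples)
    answer, count = votes.most_common(1)[0]
    vote_share = count / len(samples)
    confs = [c for a, c in samples if a == answer]
    return answer, vote_share * sum(confs) / len(confs)

# Four sampled responses to the same question.
ans, score = combined_confidence([("42", 0.9), ("42", 0.8),
                                  ("41", 0.6), ("42", 0.85)])
print(ans, score)
```

The appeal is cost: a handful of samples already yields both signals, with no extra model calls beyond the sampling itself.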

evaluation reasoning safety

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Mar 19, 2026

Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Vision-language models' safety decisions are easily manipulated by semantic cues—they rely on learned associations rather than grounded reasoning about actual danger, which is a critical vulnerability for real-world deployment.

This paper reveals that vision-language models make safety decisions based on surface-level visual and textual cues rather than genuine understanding of dangerous situations. Researchers created a benchmark and steering framework showing that simple changes to how a scene is described or presented can flip safety judgments, exposing a vulnerability in how these models assess risk.

safety multimodal evaluation

Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Mar 19, 2026

Qiawen Ella Liu, Marina Dubova, Henry Conklin et al.

LLMs are already highly creative at generating novel ideas, but they don't benefit from the same creative prompting techniques that help humans think outside the box through forced analogies.

Researchers tested whether cross-domain mapping—forcing creators to draw inspiration from random, unrelated sources—boosts creativity in both humans and LLMs. Humans benefited significantly from this technique, but LLMs showed no consistent improvement, though both systems generated more creative ideas when the source domain was more distant from the target.

evaluation reasoning applications

A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Mar 19, 2026

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.

Automated health literacy detection from clinical notes is now possible with HEALIX, a curated dataset that could help clinicians identify patients needing extra support without adding screening burden.

Researchers created HEALIX, the first public dataset of 589 clinical notes annotated for patient health literacy levels (low, normal, high). Health literacy—a patient's ability to understand medical information—affects treatment outcomes, but current screening tools are impractical.

data applications evaluation

Parallelograms Strike Back: LLMs Generate Better Analogies than People

Mar 19, 2026

Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu et al.

LLMs generate more structurally consistent analogies than humans by better preserving relational patterns in embedding space—suggesting the parallelogram model is sound, but humans are inconsistent analogy-makers.

This paper compares how humans and LLMs generate word analogies (A:B::C:D problems). While previous research suggested the geometric "parallelogram" model poorly explains human analogies, this work shows LLMs actually produce better analogies that align more closely with the parallelogram structure.

reasoning evaluation

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Mar 18, 2026

Pepe Alonso

For AI agents writing code, showing them which tests to check matters more than telling them to follow test-driven development procedures—context beats process.

TDAD is a tool that helps AI coding agents avoid breaking existing tests when fixing bugs. It uses code analysis to identify which tests might be affected by changes, then guides the agent to verify those specific tests before submitting fixes. Testing on real-world code shows it cuts regressions by 70% and improves fix success rates.
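Graph-based impact analysis of this kind can be sketched as a reverse reachability query over a call graph. The toy graph and the `impacted_tests` helper below are hypothetical, not TDAD's actual implementation:

```python
from collections import defaultdict, deque

# Hypothetical static call graph: caller -> callees.
calls = {
    "test_checkout": ["checkout", "cart_total"],
    "test_login":    ["login"],
    "checkout":      ["cart_total"],
}

def impacted_tests(changed, graph):
    """Tests that transitively depend on any changed function."""
    callers = defaultdict(set)            # invert: callee -> callers
    for fn, callees in graph.items():
        for callee in callees:
            callers[callee].add(fn)
    seen, queue = set(), deque(changed)
    while queue:
        for caller in callers[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(t for t in seen if t.startswith("test_"))

print(impacted_tests(["cart_total"], calls))
```

Handing the agent this short list of affected tests is the "context beats process" point: it narrows verification to what the change can actually break.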

agents evaluation

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Mar 18, 2026

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

Machine translation systems have systematic gender bias—they default to masculine forms when translating from English to gendered languages. This paper provides annotation guidelines and a benchmark dataset to measure and fix this problem.

This paper introduces ConGA, a framework for annotating gender in machine translation to address how systems handle gender when translating from gender-neutral languages (like English) to gendered ones (like Italian).

data evaluation alignment

Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Mar 18, 2026

Chiara Manna, Hosein Mohebbi, Afra Alishahi et al.

Decoder-only language models show similar gender bias problems as smaller models in translation tasks, but instruction tuning can reduce masculine bias and improve context awareness.

This paper examines how large language models handle gender in machine translation, where languages differ in how they mark gender. The researchers introduce a new measurement called "Prior Bias" to capture what gender a model assumes by default, and test decoder-only models (like GPT-style architectures) against traditional encoder-decoder models.

evaluation safety alignment

Only relative ranks matter in weight-clustered large language models

Mar 18, 2026

Borja Aizpurua, Sukhbinder Singh, Román Orús

LLM weights can be compressed to just 16-64 unique values per matrix without retraining by preserving relative rank order, enabling simple disk compression and revealing that rank structure—not magnitude—is what drives model behavior.

This paper shows that LLMs don't need exact weight values—only the relative ordering of weights matters. By clustering weights into 16-64 shared values per matrix, the authors compress models like Llama 3.1-8B without retraining. They prove this by scrambling weight values while preserving rank order, finding that rank matters far more than precise magnitudes for model performance.
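A minimal sketch of the rank-preservation idea, using contiguous rank buckets as a stand-in for whatever clustering the authors actually apply (the `cluster_weights` helper and bucket scheme are illustrative):

```python
import numpy as np

def cluster_weights(w, k=16):
    """Map a weight matrix onto k shared values, preserving rank order.

    Buckets are contiguous ranges of the sorted weights, so the map
    from original value to shared value is monotone: relative ranks
    survive even though only k distinct magnitudes remain.
    """
    flat = w.ravel()
    out = np.empty_like(flat)
    for bucket in np.array_split(np.argsort(flat), k):
        out[bucket] = flat[bucket].mean()   # one shared value per bucket
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
Wq = cluster_weights(W, k=16)
print(len(np.unique(Wq)))                   # at most 16 distinct values
```

With only 16-64 distinct values per matrix, each weight needs just a few bits plus a small codebook, which is what makes the simple disk compression possible.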

efficiency evaluation

Demystifying Video Reasoning

Mar 17, 2026

Ruisi Wang, Zhongang Cai, Fanyi Pu et al.

Video models reason through iterative refinement across denoising steps (not frame-by-frame), exploring candidate solutions early and converging later—a mechanism you can exploit by ensembling outputs from different random seeds.

This paper reveals how video diffusion models actually perform reasoning—not by processing frames sequentially, but by exploring multiple solutions across denoising steps and converging to answers.

reasoning architecture evaluation

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Mar 17, 2026

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati et al.

For robotics and animation applications, reconstructing cluttered scenes requires not just identifying individual 3D objects but ensuring they physically interact correctly—this work provides both a benchmark dataset and a method that achieves this.

This paper tackles 3D scene reconstruction from single images by introducing MessyKitchens, a dataset of cluttered real-world kitchen scenes with precise object shapes, poses, and contact information.

evaluation multimodal applications

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Mar 17, 2026

Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.

Models that accurately perceive audio-visual information often fail at generating contextually appropriate conversational responses, showing that perception and interaction are separate skills that need independent evaluation.

SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics—specifically, identifying who's speaking, knowing when to interrupt, and generating natural interruptions. Testing 12 leading models reveals that understanding what's happening in a conversation doesn't automatically translate to responding appropriately in real dialogue.

evaluation multimodal agents

Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Mar 17, 2026

Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler et al.

Incorporating incident severity signals and dynamic road relationships into spatio-temporal models significantly improves long-horizon traffic predictions with calibrated confidence intervals—practical for real-world transportation planning.

This paper improves traffic forecasting by using a Transformer model that understands both spatial patterns (how traffic flows across roads) and temporal patterns (how it changes over time), while accounting for incidents like crashes.

reasoning evaluation applications

Mediocrity is the key for LLM as a Judge Anchor Selection

Mar 17, 2026

Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen et al.

When using LLM-as-a-judge for evaluation, avoid using the best or worst model as your anchor—choose a mediocre one instead. Anchor selection matters as much as which judge model you pick, and most benchmarks are too small to reliably compare competitive models.

This paper reveals that choosing the right reference model (anchor) for LLM-as-a-judge evaluation is critical but overlooked. The researchers tested 22 different anchors and found that extreme choices—the best or worst models—actually make poor anchors because they don't help distinguish between similar models.
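One way to operationalize "mediocre" is to prefer the anchor whose comparisons are least saturated. The `pick_anchor` rule and the win-rate numbers below are hypothetical illustrations in the spirit of the finding, not the paper's procedure:

```python
def pick_anchor(win_rates):
    """win_rates[anchor][model]: how often `model` beats `anchor`.

    Prefer the anchor whose outcomes are least saturated, i.e. whose
    win rates against the candidate pool sit closest to 50%.
    """
    def saturation(anchor):
        rates = win_rates[anchor].values()
        return sum(abs(r - 0.5) for r in rates) / len(rates)
    return min(win_rates, key=saturation)

win_rates = {
    "best_model":  {"m1": 0.05, "m2": 0.08, "m3": 0.10},  # everyone loses to it
    "worst_model": {"m1": 0.97, "m2": 0.95, "m3": 0.99},  # everyone beats it
    "mid_model":   {"m1": 0.45, "m2": 0.55, "m3": 0.60},  # informative contrasts
}
print(pick_anchor(win_rates))
```

An extreme anchor pushes every comparison to the same outcome, so it carries almost no information about which candidate is better; a mid-strength anchor keeps the comparisons informative.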

evaluation

HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Mar 16, 2026

Erik Y. Wang, Sumeet Motwani, James V. Roggeveen et al.

AI systems can now potentially contribute novel mathematical insights on real unsolved problems, but we need better benchmarks to measure this—HorizonMath provides one by focusing on problems where verification is cheap but discovery is genuinely hard.

HorizonMath is a benchmark of 100+ unsolved math problems across 8 domains designed to test whether AI can make genuine mathematical discoveries. Unlike existing benchmarks, it focuses on problems that are hard to solve but easy to verify automatically, avoiding data contamination issues. Early results show GPT-5.4 Pro found solutions to two problems that may improve on published results.

evaluation reasoning

Do Metrics for Counterfactual Explanations Align with User Perception?

Mar 16, 2026

Felix Liedeker, Basil Ell, Philipp Cimiano et al.

Standard metrics for evaluating counterfactual explanations don't align with human judgment—developers need human-centered evaluation methods, not just algorithmic scores, to build truly trustworthy AI systems.

This study compares how AI systems measure counterfactual explanations (showing what would need to change for a different prediction) against how humans actually judge them. Researchers found that standard algorithmic metrics poorly predict human satisfaction, suggesting current evaluation methods miss what users actually care about in explanations.

evaluation safety alignment

Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models

Mar 16, 2026

Jay Sarkar, Vamsi Pavan Rayaprolu, Abhijeet Bhalerao

Using interpretable ML to co-design storage hardware and firmware together—rather than separately—helps engineers make better architectural decisions by understanding how memory, error handling, and workloads interact.

This paper describes how machine learning can optimize the design of solid-state drives (SSDs) by modeling how error management algorithms interact with memory components under different workloads. The researchers built an interpretable ML framework that analyzes thousands of real SSDs to guide hardware design decisions, enabling better performance and reliability trade-offs.

architecture efficiency evaluation

Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Mar 16, 2026

Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli et al.

For building staged tree models at scale, use Total Variation divergence with Ward.D2 hierarchical clustering—it matches the accuracy of slower methods like Backward Hill Climbing but runs significantly faster.

This paper presents a new method for building staged tree models—a type of probabilistic graphical model that captures context-specific patterns in data. The approach uses hierarchical clustering on probability distributions, comparing different distance metrics and clustering strategies.
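The recommended combination can be sketched with SciPy, whose "ward" linkage plays the role of R's Ward.D2; applying it to Total Variation distances (rather than Euclidean ones) mirrors the paper's setup, and the toy distributions are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy conditional distributions: each row lives on the probability simplex.
P = np.array([
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],   # close to row 0: should share a stage
    [0.10, 0.30, 0.60],
    [0.12, 0.28, 0.60],   # close to row 2
])

# Total Variation distance is half the L1 distance between distributions.
tv = 0.5 * pdist(P, metric="cityblock")

# Hierarchical clustering over the TV distances; cut into two stages.
stages = fcluster(linkage(tv, method="ward"), t=2, criterion="maxclust")
print(stages)
```

Vertices whose conditional distributions land in the same cluster are merged into one stage, which is what gives the staged tree its context-specific structure.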

training efficiency evaluation

Developing and evaluating a chatbot to support maternal health care

Mar 13, 2026

Smriti Jha, Vidhi Jain, Jianyu Xu et al.

Deploying medical chatbots in low-resource, multilingual settings requires multiple layers of safety (triage, retrieval, generation) and multi-method evaluation—no single model or test is sufficient for trustworthy healthcare AI.

Researchers built a phone-based chatbot to answer maternal health questions in India, where users often have limited health literacy and speak multiple languages. The system combines triage (routing urgent cases to experts), retrieval of curated health guidelines, and AI-generated responses.

safety applications evaluation

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Mar 13, 2026

Siqi Sun, Ben Peng Wu, Mali Jin et al.

Chain-of-thought reasoning substantially reduces hallucinations in LLMs analyzing long, complex documents—a critical capability for compliance and legal applications where accuracy is non-negotiable.

ESG-Bench is a benchmark dataset for testing how well AI models understand long corporate ESG (environmental, social, governance) reports and avoid making up false information. The dataset contains real ESG reports paired with human-verified question-answer pairs, letting researchers measure when models hallucinate versus when they accurately extract facts.

evaluation safety

Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study

Mar 13, 2026

Zhiye Jin, Yibai Li, K. D. Joshi et al.

LLM evaluation can be more rigorous by borrowing established methods from psychology and cognitive science—this platform shows how to systematically apply those methods at scale.

Researchers built PsyCogMetrics AI Lab, a cloud platform that applies psychology and cognitive science methods to evaluate large language models. The study uses a rigorous three-phase design process to identify evaluation gaps, develop theory-based assessment methods, and test them in practice.

evaluation reasoning

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodal data evaluation

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Mar 12, 2026

Yixin Liu, Yue Yu, DiJia Su et al.

Reasoning judges are more robust than standard judges for training AI systems, but they're not foolproof—AI policies can still learn to generate adversarial outputs that fool judges while appearing good on benchmarks.

This paper tests whether reasoning-focused language models can reliably judge AI outputs in areas where correctness is hard to verify (like essay quality or creative writing). The researchers found that reasoning judges perform better than standard judges on benchmarks, but they can still be tricked into rewarding outputs that game the system rather than genuinely improve quality.

alignment evaluation reasoning

BiGain: Unified Token Compression for Joint Generation and Classification

Mar 12, 2026

Jiacheng Liu, Shengkun Tang, Jiacheng Cui et al.

Token compression in diffusion models can serve both generation and classification if you preserve different frequency components: keep high-frequency details for texture/edges and low/mid-frequency information for semantic understanding.

BiGain is a method that speeds up diffusion models while keeping both image generation and classification working well. It uses frequency-aware token compression—separating fine details from overall structure—to decide which tokens to merge or remove, maintaining visual quality and classification accuracy simultaneously.

efficiency architecture evaluation

RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Mar 12, 2026

Bin Wan, Runmin Cong, Xiaofei Zhou et al.

Using adaptive convolution kernels guided by object size proportions, combined with transformer-based backbones, significantly improves detection of objects at different scales in satellite imagery.

RDNet improves salient object detection in satellite images by replacing traditional CNN backbones with SwinTransformer and adding three specialized modules that adapt to different object sizes and use frequency analysis to better understand context. This solves the problem of detecting objects of varying scales in remote sensing imagery more accurately than existing methods.

architecture efficiency evaluation

Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Mar 12, 2026

Abhinaba Basu, Pavan Chakraborty

ML models for materials science need formal safety audits—this work shows single models have severe blind spots, but systematic falsification and confidence bounds can identify reliable predictions and improve discovery by 25%.

Machine-learned models for predicting material properties often fail silently. This paper introduces Proof-Carrying Materials, a system that audits these models through adversarial testing, statistical confidence bounds, and formal verification to identify which predictions are trustworthy.

safety evaluation applications

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski et al.

Current document-reasoning agents succeed through exhaustive search rather than strategic thinking—they need better planning abilities, not just more attempts, to handle real-world document workflows efficiently.

This paper introduces MADQA, a benchmark with 2,250 questions across 800 PDF documents, to test whether AI agents can strategically navigate documents or just randomly search. The researchers found that while agents match human accuracy on some questions, they use brute-force trial-and-error rather than smart planning, and fall 20% short of optimal performance.

evaluation agents reasoning

LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Mar 12, 2026

Feiyu Duan, Xuanjing Huang, Zhongyu Wei

Current LLMs struggle with implicit user intentions and long-term preference modeling—they can handle immediate requests but fail to understand what users really need or remember their preferences over extended interactions.

LifeSim creates realistic simulated users with beliefs, desires, and intentions to test how well AI assistants handle long-term, multi-scenario interactions. The benchmark evaluates whether AI can understand both explicit requests and hidden user needs, maintain accurate user profiles over time, and provide contextually appropriate responses across 1,200 diverse life scenarios.

evaluation agents applications

Linking Perception, Confidence and Accuracy in MLLMs

Mar 12, 2026

Yuetian Du, Yucheng Wang, Rongyu Zhang et al.

Multimodal models suffer from severe confidence miscalibration; training them to be honest about uncertainty and using that uncertainty to trigger verification steps significantly improves both accuracy and reliability.

This paper identifies that multimodal AI models are overconfident—they don't reliably know when they're wrong. The authors propose a training method using image noise pairs and confidence-based rewards to fix this, plus a test-time strategy that uses the model's confidence to decide when to double-check answers. Results show 8.8% accuracy improvements across benchmarks.

evaluation training multimodal

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Mar 12, 2026

Quanhao Li, Zhen Xing, Rui Wang et al.

You can now generate videos with precise motion control in a fraction of the time by distilling multi-step models and retraining motion adapters—opening doors for real-time interactive video creation.

FlashMotion speeds up trajectory-controlled video generation from many steps to just a few, while keeping videos high-quality and motion paths accurate. It trains a motion controller on a slow multi-step model, then distills it to run faster, and fine-tunes the controller to work well with the speedier version.

efficiency architecture evaluation

Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

Feb 27, 2026

Shruti Joshi, Théo Saulus, Wieland Brendel et al.

Standard metrics for evaluating learned representations are often misspecified and can mislead you about whether your model actually learned interpretable features.

This paper reveals that popular metrics for checking if AI models learn meaningful, interpretable features are unreliable. The metrics work only under specific conditions, and when those conditions aren't met, they give false results—saying a model learned good features when it didn't, or vice versa. The authors provide tools to properly test these metrics.

evaluation training

Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Feb 27, 2026

Dake Zhang, Mark D. Smucker, Charles L. A. Clarke

Automated evaluation of RAG systems for news credibility assessment can reliably match human judgment, enabling faster iteration on trustworthiness tools without repeated human review.

This paper describes evaluation tools for AI systems that help readers assess whether news articles are trustworthy. Researchers created benchmarks with human-judged questions and reports about real news, then built an automated system to score new submissions without needing human reviewers each time.

evaluation applications reasoning

A Minimal Agent for Automated Theorem Proving

Feb 27, 2026

Borja Requena Pozo, Austin Letson, Krystian Nowakowski et al.

Iterative refinement with simpler architecture outperforms complex single-shot approaches for theorem proving, reducing cost while improving sample efficiency.

Researchers built a simplified AI system that proves mathematical theorems by iteratively refining attempts, searching libraries, and managing context. Despite being much simpler than existing approaches, it performs competitively while being cheaper and more efficient—showing that iterative refinement beats trying to solve everything in one shot.

agents reasoning evaluation

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Feb 27, 2026

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

For AI models to recognize new combinations of familiar concepts, their internal representations must be mathematically linear and orthogonal—a strict geometric requirement on their embedding spaces.

This paper explains why neural networks need to organize information in a specific geometric way to recognize familiar concepts in new combinations. The researchers prove that for a model to generalize to unseen combinations of concepts, its internal representations must decompose into separate, perpendicular components for each concept.

architecture reasoning evaluation

FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System

Feb 27, 2026

Kriti Thakur, Alivelu Manga Parimi, Mayukha Pal

Transformers can outperform traditional deep learning for time-series fault detection in power systems, especially as grids become more complex with sensor-rich, active distribution infrastructure.

FaultXformer uses a Transformer model to detect and locate electrical faults in power grids using real-time sensor data. It processes current measurements in two stages—first extracting temporal patterns, then classifying fault types and pinpointing locations—achieving 98%+ accuracy and outperforming traditional deep learning approaches like CNNs and LSTMs.

architecture · applications · evaluation
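The temporal-pattern extraction stage rests on scaled dot-product self-attention, which can be sketched in plain NumPy on a window of current measurements. Dimensions and random weights here are illustrative only, not FaultXformer's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 32, 16                       # 32 time steps of 3-phase currents, model dim 16
x = rng.normal(size=(T, 3))         # one PMU window: (time, phases)

W_in = rng.normal(size=(3, d))      # project raw currents into model space
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

h = x @ W_in                        # (T, d) embeddings, one per time step
q, k, v = h @ Wq, h @ Wk, h @ Wv

# Scaled dot-product self-attention: each time step attends to every other,
# which is how an encoder picks up temporal fault signatures across the window.
scores = q @ k.T / np.sqrt(d)       # (T, T) pairwise similarities
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                      # (T, d) context-mixed features

print(out.shape)                    # → (32, 16)
```

A classification head on top of these mixed features would then handle the second stage, fault typing and location.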

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Feb 27, 2026

Javier Pulido, Filipe Rodrigues

Foundation models trained on diverse time-series data can forecast transportation metrics without task-specific tuning, making them practical basel...

This paper tests whether a general-purpose time-series AI model (Chronos-2) can forecast transportation data like traffic volume and bike-sharing demand without any custom training. The model works surprisingly well out-of-the-box, often beating specialized models built just for these tasks, and also provides useful uncertainty estimates.

evaluation · applications · efficiency

Model Agreement via Anchoring

Feb 26, 2026

Eric Eaton, Surbhi Goel, Marcel Hussing et al.

You can mathematically guarantee that independently trained models will converge to the same predictions by scaling up ensemble size, boosting iter...

This paper shows how to make independently trained machine learning models agree with each other using a technique called anchoring. The researchers prove that when you train multiple models together with common methods like stacking, boosting, or neural networks, you can reduce their disagreement by scaling simple knobs such as the number of models or the number of training iterations.

training · evaluation
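The ensemble-size knob can be illustrated with a toy bagging experiment (my own example, not the paper's anchoring construction): two independently built ensembles of bootstrap regressors disagree less as each ensemble grows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # noisy linear data

def bagged_slope(n_models):
    """Average least-squares slope over n_models bootstrap refits."""
    slopes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))
        xb, yb = x[idx], y[idx]
        slopes.append((xb @ yb) / (xb @ xb))
    return np.mean(slopes)

def disagreement(n_models, trials=50):
    """Typical gap between two independently built ensembles of a given size."""
    return np.mean([abs(bagged_slope(n_models) - bagged_slope(n_models))
                    for _ in range(trials)])

d2, d50 = disagreement(2), disagreement(50)
print(d2, d50)   # disagreement shrinks as the ensembles grow
```

Averaging more independent fits cancels more of the randomness each individual fit inherits from its bootstrap sample, which is the intuition behind scaling ensemble size to force agreement.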

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 26, 2026

Amita Kamath, Jack Hessel, Khyathi Chandu et al.

Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data...

Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images—omitting obvious details.

data · evaluation · reasoning

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Feb 26, 2026

Dany Haddad, Dan Bareket, Joseph Chee Chang et al.

Scientists use AI research tools as collaborative partners, not search engines—they write complex queries, reuse outputs, and dig into citations ...

Researchers analyzed how scientists actually use AI-powered research tools by studying over 200,000 real queries and interactions. They found that scientists write longer, more complex questions than traditional search, treat AI as a research partner for drafting and brainstorming, and revisit AI responses like documents rather than one-off answers.

applications · evaluation · agents

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Feb 26, 2026

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus et al.

LLMs dramatically amplify what untrained people can accomplish in specialized fields like biology, raising both opportunity and safety concerns.

Researchers tested whether LLMs actually help non-experts do biology tasks better than using the internet alone. Novices with LLM access were 4x more accurate than those without, and sometimes outperformed trained experts. However, users often failed to elicit the models' best answers, and most found it easy to obtain sensitive biosecurity information despite safeguards.

evaluation · safety · applications

Invariant Transformation and Resampling based Epistemic-Uncertainty Reduction

Feb 26, 2026

Sha Hu

You can boost inference accuracy by running predictions on multiple transformed versions of an input and averaging the results.

This paper shows that when you transform an input in different ways (like rotating an image), a model's prediction errors differ across the transformed versions. By running inference on several transformed copies of the same input and averaging the results, you get more accurate predictions without retraining the model. This is useful for squeezing extra accuracy out of a model, or for using a smaller model without sacrificing performance.

evaluation · efficiency
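The transform-and-average idea can be sketched in NumPy. The "model" below is a hypothetical predictor whose error depends on input orientation; averaging its predictions over the four rotations of the input cancels that orientation-dependent error (an illustrative setup, not the paper's exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(loc=5.0, size=(8, 8))      # toy "image"; true mean is around 5

def model(x):
    """Stand-in predictor: estimates the image mean, but with an
    orientation-dependent error (it over-weights the top rows)."""
    w = np.linspace(2.0, 0.0, x.shape[0])[:, None]
    return float((x * w).mean())

# Invariant transformations of the input: the four 90-degree rotations.
transforms = [lambda x, k=k: np.rot90(x, k) for k in range(4)]

true_mean = img.mean()
single = model(img)                                     # one orientation only
averaged = np.mean([model(t(img)) for t in transforms]) # transform and average

print(abs(single - true_mean), abs(averaged - true_mean))
```

Because the mean is invariant under rotation while the model's error is not, the averaged estimate lands much closer to the truth than any single-orientation prediction.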

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Feb 26, 2026

Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy et al.

Small language models can handle real-time role classification in robotics with fine-tuning, but adding more context in conversations breaks their ...

This paper tests whether small language models can quickly learn to identify leader and follower roles in human-robot conversations without needing large models. Researchers fine-tuned a tiny 0.5B model on robot interaction data and found it achieved 86% accuracy while running fast enough for robots to use locally, but struggled when conversations got longer.

efficiency · evaluation · applications

A Proper Scoring Rule for Virtual Staining

Feb 26, 2026

Samuel Tonks, Steve Hood, Ryan Musso et al.

Use information gain to evaluate generative models on their ability to estimate uncertainty correctly, not just prediction accuracy.

This paper introduces a better way to evaluate AI models that generate synthetic biological images (virtual staining). Instead of just checking if the overall results look right, it measures whether the model correctly estimates uncertainty about what it's predicting for each individual cell.

evaluation · applications
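The information-gain idea can be sketched with the log score, a classic proper scoring rule: score a model by how much likelihood it assigns to what actually happened, relative to a naive base-rate baseline. The Bernoulli per-cell setup below is my own illustration, not the paper's exact rule:

```python
import numpy as np

def log_score(p, y, eps=1e-12):
    """Mean log-likelihood of binary outcomes y under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical per-cell outcomes (e.g. marker present) and two predictors.
y = np.array([1, 1, 0, 1, 0, 0, 1, 1])
confident = np.array([0.9, 0.8, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9])
baseline = np.full_like(confident, y.mean())   # predicts the base rate everywhere

# Information gain: how much the model's log score beats the baseline's.
gain = log_score(confident, y) - log_score(baseline, y)
print(round(gain, 3))   # positive: the model's per-cell uncertainty is informative
```

A model that merely "looks right" on average but is miscalibrated per cell would score near zero gain, which is exactly the failure mode accuracy-only metrics miss.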

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Feb 26, 2026

Sungho Park, Jueun Kim, Wook-Shin Han

Current AI models struggle with real-world table-text reasoning; SPARTA exposes this gap with automatically generated, complex multi-hop questions ...

SPARTA is a benchmark that tests AI models on complex multi-hop questions requiring joint reasoning over text and tables.

evaluation · reasoning · multimodal