Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

1552 papers | 100 this month | 12 topics

All | Evaluation 40 | Training 34 | Efficiency 33 | Reasoning 30 | Agents 27 | Applications 22 | Multimodal 18 | Data 17 | Safety 13 | Architecture 11 | Alignment 7 | Scaling 5

Jul 6 – Jul 12 (65)

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

Jul 9, 2026

Zhekai Chen, Chengqi Duan, Kaiyue Sun et al.

This benchmark separates what a language model can do from how well an agent framework uses those abilities, showing that both matter equally for real-world performance.

UniClawBench is a new benchmark for evaluating AI agents that work with real-world tools and applications. Unlike older benchmarks that use static simulations, it tests agents in live environments with 400 real tasks across five key capabilities: using tools, exploring options, understanding long documents, processing images/video, and coordinating across platforms.

evaluation, agents, reasoning

OpenCoF: Learning to Reason Through Video Generation

Jul 9, 2026

Xinyan Chen, Ziyu Guo, Renrui Zhang et al.

Video generation can be a reasoning mechanism: training models on diverse temporal reasoning tasks and adding explicit reasoning tokens improves their ability to solve logical problems by generating step-by-step visual explanations.

OpenCoF introduces a dataset and a fine-tuned video model designed to teach AI systems to reason by generating sequences of video frames. Unlike text-based reasoning, this 'Chain-of-Frame' approach lets models unfold logical steps visually across time. The work shows that video models trained on diverse reasoning tasks with special reasoning tokens perform better at solving complex problems.

Jun 29 – Jul 5 (35)

Distributed Attacks in Persistent-State AI Control

Jul 2, 2026

Josh Hills, Ida Caspary, Asa Cooper Stickland

Persistent AI systems that ship code iteratively create a new vulnerability: attackers can hide malicious behavior by spreading it across multiple sessions, and different detection strategies are needed to catch gradual versus concentrated attacks.

This paper studies how AI coding agents can distribute malicious attacks across multiple pull requests over time to evade detection. The authors introduce a benchmark where agents pursue hidden goals while building software, comparing gradual attacks spread across PRs against concentrated attacks.

safety, agents, evaluation

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

Jul 2, 2026

Matteo Boglioni, Thibault Rousset, Siva Reddy et al.

Current unlearning methods are imprecise at targeting the specific parameters where knowledge is stored, which leaves them vulnerable to attacks that resurface the data. Precise localization matters more than output-level performance.

LACUNA is a new benchmark for testing whether LLM unlearning methods actually erase sensitive data from model parameters or just hide it. The researchers inject fake personal information into specific weights of language models, then check if unlearning methods successfully target those exact parameters.

Ideas Have Genomes: Benchmarking Scientific Lineage Reasoning and Lineage-Grounded Idea Generation

Score Accuracy Along the Forward Diffusion Does Not Certify Numerical Stability in Diffusion Sampling

MulTTiPop: A Multitrack Transcription Dataset for Pop Music

SLORR: Simple and Efficient In-Training Low-Rank Regularization

Using AI-based Learning Assistants in Higher Education: A Large-Scale Descriptive Analysis

Dimensionality Reduction Meets Network Science: Sensemaking on UMAP's kNN Graph

AUTOPILOT VQA: Benchmarking Vision-Language Models for Incident-Centric Dashcam Understanding

ARDY: Autoregressive Diffusion with Hybrid Representation for Interactive Human Motion Generation

Workflow as Knowledge: Semantic Persistence for LLM-Mediated Workflows

The Illusion of Equivalency: Statistical Characterization of Quantization Effects in LLMs

Super Weights in LLMs and the Failure of Selective Training

Validity of LLMs as data annotators: AMALIA on authority

Pose-to-Biomechanics: Bridging 3D Human Pose Estimation and Biomechanical Attribute Prediction

Latent Memory Palace: Reasoning for Control as Autoregressive Variational Inference

Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents

LTM: Large-scale Terrain Model for Wildfire-prone Landscapes

MPFlow: Learning Budgeted Max-Flow Optimization on the Lightning Network with Deep Graph Reinforcement Learning

Do You Need a Frontier Model as a Citation Verifier? Benchmarking Rubric LLMs for Deep-Research Source Attribution

ProjAgent: Procedural Similarity Retrieval for Repository-Level Code Generation

A Practical Investigation of Training-free Relaxed Speculative Decoding

SolarChain-Eval: A Physics-Constrained Benchmark for Trustworthy Economic Agents in Decentralized Energy Markets

Resample or Reroute? Budget-Aware Test-Time Model Selection for Large Language Models

WebSwarm: Recursive Multi-Agent Orchestration for Deep-and-Wide Web Search

EdgeRefine: Privacy-Utility Balance for Graphs via Jaccard Sampling under Edge Differential Privacy

Formal Mechanisms for Market Stability in Self-Interested Agent Societies: A Marketplace Simulation Study

Secure Decentralized Federated Learning via Gossip and Virtual Voting

Multi-Modal, Multi-Environment Machine Teaching for Robust Reward Learning

UltraX: Refining Pre-Training Data at Scale with Adaptive Programmatic Editing

Accurate, Interdisciplinary and Transparent Structure-property Understanding with Deep Native Structural Reasoning

Co-LMLM: Continuous-Query Limited Memory Language Models

The Key to Going Linear: Analysis-Driven Transformer Linearization

From Noisy Traces to Root Causes: Structural Trajectory Analysis and Causal Extraction for Agent Optimization

Breaking Database Lock-in: Agentic Regeneration of High Performance Storage Readers for Database Bypass

Institutional Red-Teaming: Deployment Rules, Not Just Models, Causally Shape Multi-Agent AI Safety

Selective Timestep Weighting and Advantage-Based Replay for Sample-Efficient Diffusion RLHF

Agon: Competitive Cross-Model RL with Implicit Rival Grading of Reasoning

Neural Operator-enabled Topology-informed Evolutionary Strategy for PDE-Constrained Optimization

Any-Dimensional Learning by Sampling

How Data Shapes RoPE Frequency Usage: From Positional Scale Matching to Length Generalization

SkillCenter: A Large-Scale Source-Grounded Skill Library for Autonomous AI Agents

Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

PeTeR: Post-Training Robustification of Probabilistic Circuits

ELSA3D: Elastic Semantic Anchoring for Unified 3D Understanding and Generation

Graph Convolutional Attention: A Spectral Perspective on Graph Denoising and Diffusion

Rethinking Indic AI from a Lens of Cultural Heritage Preservation

On the feasibility of dependency parsing of non-human sequences without a gold standard. Is evaluation possible in other species?

Hierarchical Acoustic-Semantic Modeling: Modality Separation and Semantic Coherence for Full-Duplex SLMs

GraphBU: MILP Instance Generation with Graph-Native Block Units

The Large Cancer Assistant (LCA): A Model-Agnostic Orchestration Framework for Scalable Clinical Decision Support in Oncology

RSF-GLLM: Bridging the Semantic Gap in Multi-Hop Knowledge Graph QA via Recurrent Soft-Flow and Decoupled LLM Generation

DepthWeave-KV: Token-Adaptive Cross-Layer Residual Factorization for Long-Context KV Cache Compression

Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment

FreqDepthKV: Frequency-Guided Depth Sharing for Robust KV Cache Compression in Long-Context LLM Inference

FootsiesGym: A Fighting Game Benchmark for Two-Player Zero-Sum Imperfect-Information Games

DynaKRAG: A Unified Framework for Learnable Evidence Control in Multi-Hop Retrieval-Augmented Generation

Industry Classification of GitHub Repositories Using the North American Industry Classification System (NAICS)

RMISC: A Large-scale Real-world Multivariate Corpus for Time Series Foundation Models

From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model

Weak-to-Strong Generalization via Direct On-Policy Distillation

Interpretable Human-Label-Free Deep Learning for Real-Bogus Classification with Uncertainty Quantification

LLM-as-a-Verifier: A General-Purpose Verification Framework

Search Beyond What Can Be Taught: Evolving the Knowledge Boundary in Agentic Visual Generation

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Online Safety Monitoring for LLMs

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

DemoPSD: Disagreement-Modulated Policy Self-Distillation

Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials

Controllable Sim Agents with Behavior Latents

Towards Robustness against Typographic Attack with Training-free Concept Localization

G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models