Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu et al. | March 13, 2026 | arXiv

Key Takeaway

You can train better LLMs on less data by selecting instruction examples that activate the same neurons as your target task. This approach beats both training on all the data and relying on external models to score examples.

Summary

This paper introduces NAIT, a method for selecting the most useful instruction-tuning data for large language models by analyzing which neurons activate when the model processes different types of tasks. Rather than training on all available data, NAIT picks a small subset (10% of the data) whose neuron activation patterns match those of the target capabilities, and tuning on that subset produces better results than tuning on the full set.
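The selection idea can be illustrated with a minimal sketch. This is not the paper's actual scoring procedure, which the summary does not spell out. It assumes each example has already been reduced to a per-neuron activation vector, and it uses cosine similarity to the mean target activation as a stand-in matching rule. The function names `cosine`, `mean_vector`, and `select_by_activation` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_by_activation(candidates, target_activations, fraction=0.10):
    """Return indices of the candidates whose activations best match the target.

    candidates: one activation vector per instruction example (assumed input;
        extracting these from a model is outside this sketch).
    target_activations: activation vectors from examples of the target capability.
    fraction: share of the pool to keep; 0.10 mirrors the 10% figure in the summary.
    """
    target = mean_vector(target_activations)
    # Assumption: cosine similarity to the mean target pattern is the score.
    scores = [cosine(vec, target) for vec in candidates]
    k = max(1, round(fraction * len(candidates)))
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Given a pool of ten candidate vectors and a handful of target vectors, `select_by_activation(pool, target)` returns the index of the single best-matching example, and raising `fraction` widens the selection. Averaging the target activations into one reference pattern keeps the sketch short; a real pipeline might compare against richer statistics.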

training data efficiency

Key Terms

instruction-tuning, neuron-activation, data-selection, activation-pattern