Parallel components of the attention mechanism, each of which learns its own pattern for relating tokens and retrieving information from the context; fewer heads generally mean reduced model capacity but faster computation.
Performance retention over long documents and conversations
Multi-step reasoning, logic puzzles, mathematical problem-solving
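
To make the attention-head trade-off above concrete, here is a rough sketch that counts weights and multiply-adds for a single attention layer. It assumes each head has a fixed width (`head_dim`), so that removing heads actually shrinks the layer; in designs where the model width is instead divided evenly among heads, the totals would stay roughly constant. The function name `attention_cost` and all sizes are hypothetical, chosen only for illustration.

```python
def attention_cost(num_heads: int, head_dim: int, d_model: int, seq_len: int) -> dict:
    """Rough weight and multiply-add counts for one multi-head attention layer.

    Assumption: per-head width is fixed, so fewer heads means a narrower layer.
    Biases, softmax, and normalization are ignored.
    """
    inner = num_heads * head_dim
    # Q, K, V projections (d_model -> inner each) plus the output projection (inner -> d_model).
    params = 4 * d_model * inner
    # Projections: one multiply-add per weight for every token.
    proj_macs = seq_len * params
    # Attention scores (Q K^T) and the weighted sum over V: seq_len^2 * head_dim each, per head.
    attn_macs = 2 * num_heads * seq_len * seq_len * head_dim
    return {"params": params, "macs": proj_macs + attn_macs}


if __name__ == "__main__":
    full = attention_cost(num_heads=12, head_dim=64, d_model=768, seq_len=1024)
    half = attention_cost(num_heads=6, head_dim=64, d_model=768, seq_len=1024)
    print("12 heads:", full)
    print(" 6 heads:", half)
```

Under this fixed-width assumption both counts scale linearly with the number of heads, so halving the heads halves the layer's weights (capacity) and its arithmetic cost.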