The number of key-value heads in a transformer model's attention mechanism, i.e., the heads that produce the key and value projections used to store and retrieve attention information. When this is smaller than the number of query heads, several query heads share each key-value head.
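The sharing described above can be sketched with a minimal NumPy example, assuming this parameter corresponds to a `num_key_value_heads`-style setting as used in grouped-query attention; all names and sizes here are illustrative, not taken from any specific model:

```python
import numpy as np

# Illustrative configuration values (hypothetical, for demonstration only).
num_attention_heads = 8   # query heads
num_key_value_heads = 2   # key-value heads, each shared by a group of query heads
head_dim = 4
seq_len = 3

rng = np.random.default_rng(0)
q = rng.standard_normal((num_attention_heads, seq_len, head_dim))
k = rng.standard_normal((num_key_value_heads, seq_len, head_dim))
v = rng.standard_normal((num_key_value_heads, seq_len, head_dim))

# Repeat K and V along the head axis so each group of query heads
# attends over its shared key-value head (grouped-query attention).
group_size = num_attention_heads // num_key_value_heads
k_rep = np.repeat(k, group_size, axis=0)
v_rep = np.repeat(v, group_size, axis=0)

# Standard scaled dot-product attention per head.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep

print(out.shape)  # one output per query head: (8, 3, 4)
```

Setting `num_key_value_heads` equal to the number of query heads recovers standard multi-head attention, while setting it to 1 recovers multi-query attention; intermediate values trade memory for quality.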