You can make streaming speech-to-text models faster and more accurate by processing audio in fixed chunks instead of one token at a time.
This paper introduces CHAT, an improved RNN-T model for converting speech to text in real time. By processing audio in small chunks and using a smarter attention mechanism, CHAT runs 1.7x faster at inference, uses 46% less memory during training, and produces more accurate transcriptions, especially for translating speech between languages.
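The chunk-wise idea can be illustrated with a block-causal attention mask, where each audio frame attends to every frame in its own chunk and all earlier chunks, so attention is computed once per chunk rather than once per frame. This is a generic sketch of chunked streaming attention, not the paper's exact mechanism; all function and parameter names here are illustrative assumptions:

```python
import numpy as np

def chunked_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if frame i may attend to frame j.

    Frames see their whole chunk plus all preceding chunks (block-causal),
    instead of a strict frame-by-frame causal mask. Illustrative sketch,
    not taken from the CHAT paper.
    """
    chunk_idx = np.arange(num_frames) // chunk_size
    # Frame j is visible to frame i when j's chunk is not later than i's chunk.
    return chunk_idx[None, :] <= chunk_idx[:, None]

mask = chunked_causal_mask(num_frames=8, chunk_size=4)
print(mask[0, 3])  # True: frame 3 is in frame 0's own chunk
print(mask[3, 4])  # False: frame 4 belongs to a future chunk
```

Compared with a strict causal mask, this lets the encoder batch attention over `chunk_size` frames at once, which is one common way chunked streaming models trade a small, fixed latency for throughput.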