HCCS replaces softmax's expensive exponential computation with a lightweight clipped-linear approximation calibrated per attention head, enabling 8-bit integer inference on edge hardware without sacrificing model accuracy.
This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation of softmax designed for edge devices running small quantized AI models.
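
To make the idea concrete, here is a minimal sketch of what a clipped-linear stand-in for softmax might look like. It is not the paper's exact formulation: the ramp shape, the function names, and the single per-head parameter `width` are assumptions chosen for illustration, and the arithmetic is done in floating point rather than 8-bit integers to keep the example short.

```python
import math


def softmax(scores):
    """Reference softmax, using the exponential that HCCS aims to avoid."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_linear_softmax(scores, width):
    """Softmax-like weights with exp replaced by a clipped linear ramp.

    `width` is a hypothetical per-head calibration parameter (must be > 0):
    any score more than `width` below the row maximum gets zero weight.
    """
    m = max(scores)
    # The ramp equals 1 at the maximum and falls linearly to 0 at (m - width);
    # max(0.0, ...) is the clipping step.
    ramp = [max(0.0, 1.0 + (s - m) / width) for s in scores]
    # The maximum always contributes exactly 1, so the sum is never zero.
    total = sum(ramp)
    return [r / total for r in ramp]


if __name__ == "__main__":
    row = [2.0, 1.0, -3.0]
    print("softmax:       ", softmax(row))
    print("clipped-linear:", clipped_linear_softmax(row, width=4.0))
```

In this sketch the only operations per score are a subtraction, a scale, a clip, and a final normalization, which is the kind of arithmetic that maps naturally onto integer hardware. Calibrating per head would, under these assumptions, amount to choosing a separate `width` for each attention head.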