Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini | April 2, 2026 | arXiv

Key Takeaway

Head-Calibrated Clipped-Linear Softmax (HCCS) replaces softmax's expensive exponential with a lightweight clipped-linear approximation calibrated per attention head, enabling 8-bit integer inference on edge hardware without sacrificing model accuracy.

Summary

This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation of softmax designed for edge devices running small quantized AI models.
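
To make the idea concrete, here is a minimal floating-point sketch of a clipped-linear softmax surrogate. This summary does not give the paper's exact formula, so the ramp shape, the per-head `slope` parameter, and the function name `clipped_linear_softmax` are all assumptions made for illustration. The integer (int8) arithmetic and the calibration procedure are not modelled.

```python
def clipped_linear_softmax(scores, slope):
    """Hypothetical clipped-linear surrogate for softmax over one row of scores.

    Assumption: exp(x - max) is replaced by the ramp max(0, 1 + slope * (x - max)),
    where `slope` stands in for a per-head calibrated constant.
    """
    m = max(scores)
    # The ramp equals 1 at the maximum score and is clipped to 0 below m - 1/slope.
    weights = [max(0.0, 1.0 + slope * (s - m)) for s in scores]
    # The maximum element always has weight 1, so the sum is never zero.
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative per-head slopes (made-up values, not taken from the paper).
head_slopes = [0.5, 0.25]
rows = [[2.0, 1.0, -3.0], [2.0, 1.0, -3.0]]
probs = [clipped_linear_softmax(row, a) for row, a in zip(rows, head_slopes)]
```

Each output row still sums to 1 and preserves the ordering of the input scores, but scores far below the maximum receive exactly zero weight rather than a small positive one. A smaller slope widens the range of scores that stay non-zero, which is one plausible reason to tune it separately for each head.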

efficiency architecture

Key Terms

softmax attention, attention head, quantization-aware retraining, INT8 precision, edge device