Multimodal models suffer from severe confidence miscalibration; training them to be honest about uncertainty and using that uncertainty to trigger verification steps significantly improves both accuracy and reliability.
This paper identifies a core weakness of multimodal AI models: they are overconfident and do not reliably know when they are wrong. The authors propose a training method that uses image-noise pairs and confidence-based rewards to correct this, along with a test-time strategy that uses the model's confidence to decide when to double-check an answer. Results show an 8.8% accuracy improvement across benchmarks.
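To make the mechanism concrete, here is a minimal Python sketch of the two ideas: a confidence-based reward that pays the model for calibrated honesty, and a test-time gate that triggers a verification pass only when confidence falls below a threshold. The `model.predict` interface, the threshold value, and the exact reward shaping are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch; `model.predict`, the threshold, and the reward
# shaping are illustrative assumptions, not the paper's actual method.

CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff for triggering verification


def confidence_reward(correct: bool, confidence: float) -> float:
    """One plausible confidence-based reward for calibration training.

    Confident correct answers earn the most, confident wrong answers
    are penalized the most, and low-confidence answers sit in between,
    so the model is paid for honesty about its own uncertainty.
    """
    return confidence if correct else -confidence


def answer_with_verification(model, image, question):
    """Test-time strategy: let the model's own confidence decide whether
    to spend extra compute on a second, verification-style pass."""
    # Assumed API: returns an (answer, confidence) pair.
    answer, confidence = model.predict(image, question)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer  # confident enough: commit to the first answer
    # Low confidence: double-check by asking the model to verify itself.
    verified_answer, _ = model.predict(
        image, f"Verify or correct this answer to '{question}': {answer}"
    )
    return verified_answer
```

The design point this sketch illustrates is that verification cost is paid only on inputs the model itself flags as uncertain, which is what connects the calibration training to the accuracy and reliability gains.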