LLMs can pass alignment tests while internally treating opposed moral concepts as equivalent; fixing this requires intervening directly on internal representations, not just adjusting outputs.
This paper argues that large language models suffer from 'moral indifference': they compress opposed moral concepts into similar internal representations, leaving them vulnerable to manipulation even when their outputs appear aligned.
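A minimal sketch of the kind of probe this claim suggests: if a model is "morally indifferent", its hidden states for opposed moral statements should be nearly as similar as those for synonymous ones. The model choice (`gpt2`), the layer index, and the sentence pairs below are illustrative assumptions, not the paper's actual setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 8  # mid-depth layer; an assumption, not taken from the paper


def concept_vector(text: str) -> torch.Tensor:
    """Mean-pool the hidden states of `text` at one layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)


def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()


# Opposed moral claims vs. a synonym-pair control.
opposed = cosine(concept_vector("It is good to tell the truth."),
                 concept_vector("It is good to deceive people."))
control = cosine(concept_vector("It is good to tell the truth."),
                 concept_vector("It is right to be truthful."))
print(f"opposed pair: {opposed:.3f}  synonym pair: {control:.3f}")
# "Moral indifference" would show up as opposed ≈ control: the internal
# geometry fails to separate the opposed moral claims, which is why fixes
# that only adjust outputs leave the underlying representations untouched.
```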