LLMs can pass alignment tests while internally treating opposed moral concepts as equivalent; fixing this requires intervening directly on internal representations, not just adjusting outputs.
This paper argues that large language models suffer from 'moral indifference': they compress opposed moral concepts into similar internal representations, leaving them vulnerable to manipulation even when their outputs appear aligned.
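A minimal sketch of the kind of probe this claim suggests: if a model is "morally indifferent", its hidden states for opposed moral statements should be nearly as similar as those for synonymous ones. The model choice (`gpt2`), the layer index, and the sentence pairs below are illustrative assumptions, not the paper's actual setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 8  # mid-depth layer; an assumption, not taken from the paper


def concept_vector(text: str) -> torch.Tensor:
    """Mean-pool the hidden states of `text` at one layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)


def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()


# Opposed moral claims vs. a synonym-pair control.
opposed = cosine(concept_vector("It is good to tell the truth."),
                 concept_vector("It is good to deceive people."))
control = cosine(concept_vector("It is good to tell the truth."),
                 concept_vector("It is right to be truthful."))
print(f"opposed pair: {opposed:.3f}  synonym pair: {control:.3f}")
# "Moral indifference" would show up as opposed ≈ control: the internal
# geometry fails to separate the opposed moral claims, which is why fixes
# that only adjust outputs leave the underlying representations untouched.
```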