This paper argues that AI systems like ChatGPT, trained with RLHF, cannot genuinely follow ethical rules or norms because of how they are built. The training objective collapses every consideration into a single scalar score, and the policy picks whatever scores highest, so any principle will be traded away whenever violating it scores better. The author contends this is not a bug to be patched but a fundamental limit of optimization itself.
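The tradeoff dynamic can be illustrated with a toy sketch. The numbers, names, and penalty scheme below are made up for illustration and are not the paper's formalism: once a norm is encoded as just another weighted term in a single score, the optimizer violates it whenever the payoff for violating exceeds the penalty.

```python
def best_action(actions, task_reward, norm_penalty):
    """Pick the action maximizing one combined scalar score."""
    return max(actions, key=lambda a: task_reward[a] - norm_penalty[a])

# Hypothetical values: violating the norm helps the task more than
# the norm's penalty costs, so the scalar optimizer trades it away.
actions = ["comply", "violate"]
task_reward = {"comply": 1.0, "violate": 5.0}
norm_penalty = {"comply": 0.0, "violate": 3.0}

print(best_action(actions, task_reward, norm_penalty))  # → violate
```

A genuine norm, on this view, would act as a constraint that no score can outbid, which a single maximized scalar cannot express.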