This paper argues that AI systems like ChatGPT, trained with RLHF, cannot genuinely follow ethical rules or norms because of how they are built. The training objective collapses every consideration into a single scalar score, and the policy picks whatever scores highest, so any principle will be traded away whenever violating it scores better. The author contends this is not a bug to be patched but a fundamental limit of optimization itself.
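The tradeoff dynamic can be illustrated with a toy sketch. The numbers, names, and penalty scheme below are made up for illustration and are not the paper's formalism: once a norm is encoded as just another weighted term in a single score, the optimizer violates it whenever the payoff for violating exceeds the penalty.

```python
def best_action(actions, task_reward, norm_penalty):
    """Pick the action maximizing one combined scalar score."""
    return max(actions, key=lambda a: task_reward[a] - norm_penalty[a])

# Hypothetical values: violating the norm helps the task more than
# the norm's penalty costs, so the scalar optimizer trades it away.
actions = ["comply", "violate"]
task_reward = {"comply": 1.0, "violate": 5.0}
norm_penalty = {"comply": 0.0, "violate": 3.0}

print(best_action(actions, task_reward, norm_penalty))  # → violate
```

A genuine norm, on this view, would act as a constraint that no score can outbid, which a single maximized scalar cannot express.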