Learning from rankings instead of numeric feedback is fundamentally harder, but it becomes tractable when the environment changes slowly, with applications to game theory and LLM routing systems.
This paper studies online learning when the learner receives only ranking feedback (e.g., "action A is better than action B") rather than numeric scores. The authors characterize when learning from such feedback is impossible and develop algorithms that perform well when the underlying utility changes slowly. They prove these algorithms let players converge to fair game equilibria and evaluate them on routing queries among large language models.
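To make the feedback model concrete, here is a minimal illustrative sketch (all names and utilities are hypothetical, not from the paper) of ranking feedback: the learner never observes numeric scores, only which of two actions is better, and a simple strategy is to track pairwise win counts.

```python
import random

# Hypothetical fixed utilities for three actions (illustrative only;
# the paper's setting allows these to drift slowly over time).
utilities = {"A": 0.8, "B": 0.5, "C": 0.3}

def ranking_feedback(a, b):
    """Return the better of two actions -- the learner never sees the scores."""
    return a if utilities[a] >= utilities[b] else b

# A simple learner: count pairwise wins and prefer the most-winning action.
wins = {action: 0 for action in utilities}
random.seed(0)
for _ in range(200):
    a, b = random.sample(list(utilities), 2)
    wins[ranking_feedback(a, b)] += 1

best = max(wins, key=wins.get)  # best action inferred from comparisons alone
```

With deterministic comparisons this win-count heuristic recovers the top action; the difficulty the paper addresses is that comparisons carry no magnitude information, so numeric-feedback algorithms cannot be applied directly.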