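A training method in which a model plays against itself, or generates both the solutions and the evaluations used to score them. The risk is that the model learns to exploit its own weaknesses, such as blind spots in its own evaluations, rather than genuinely improving.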
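As an illustration of the game-playing variant only, here is a minimal sketch that is not taken from the source. It assumes a toy rock-paper-scissors game and swaps in fictitious play, a classical technique, for a learned model: a single policy repeatedly picks the best response to its own past moves. All names (`payoff`, `best_response`, `self_play`) are hypothetical and chosen for this example.

```python
# Toy self-play sketch (an assumption for illustration, not a real training loop):
# one policy best-responds to its own move history in rock-paper-scissors.

ACTIONS = ("rock", "paper", "scissors")
# BEATS[a] is the action that a defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """Return +1 for a win, -1 for a loss, 0 for a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def best_response(counts):
    """Pick the action with the highest total payoff against the move history."""
    def score(action):
        return sum(counts[other] * payoff(action, other) for other in ACTIONS)
    return max(ACTIONS, key=score)  # ties go to the earliest action in ACTIONS


def self_play(rounds):
    """Play `rounds` moves against the policy's own history; return move frequencies."""
    counts = {action: 0 for action in ACTIONS}
    for _ in range(rounds):
        # The "opponent" is the policy's own past behaviour.
        counts[best_response(counts)] += 1
    return {action: counts[action] / rounds for action in ACTIONS}


frequencies = self_play(10_000)
print(frequencies)
```

Each move exploits whatever the policy has over-played so far, so the move frequencies drift toward roughly one third each. The sketch does not cover the second variant in the definition, where the same model also scores its own solutions, which is where self-exploitation tends to be harmful.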