Fine-grained visual feedback, comparing what the code actually renders against what it should render, is more effective for training vision-to-code models than text-based or embedding-based rewards, and it avoids the reward hacking those proxies invite.
This paper introduces Visual-ERM, a reward model that judges the quality of vision-to-code outputs by directly comparing the rendered result against the target visual, rather than relying on text rules or embedding similarity.
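To make the distinction concrete, here is a minimal toy sketch of a reward computed in pixel space rather than text or embedding space. Visual-ERM itself is a learned judge, not this heuristic; the function name, the tolerance value, and the use of raw pixel arrays are illustrative assumptions only.

```python
import numpy as np

def pixel_match_reward(rendered, target):
    """Toy visual reward: the fraction of pixel values that match
    between the candidate's rendered output and the target rendering.
    Illustrates scoring in visual space; the paper's Visual-ERM is a
    learned model, not a pixel heuristic."""
    rendered = np.asarray(rendered, dtype=np.float32)
    target = np.asarray(target, dtype=np.float32)
    assert rendered.shape == target.shape, "renders must share a resolution"
    # Tolerate small colour drift so trivially different renders
    # are not punished as hard as structural mistakes.
    close = np.isclose(rendered, target, atol=8.0)
    return float(close.mean())

# A 4x4 RGB target and a candidate render with one wrong pixel.
target = np.zeros((4, 4, 3))
candidate = target.copy()
candidate[0, 0] = 200  # one pixel rendered the wrong colour
print(round(pixel_match_reward(candidate, target), 3))  # → 0.938
```

Unlike a text rule ("the CSS mentions `background: blue`") or an embedding distance, this signal is grounded in what the code actually draws, which is the property the paper argues matters for avoiding reward hacking.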