Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data...
Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images: when describing a picture, people tend to leave out details that seem obvious.
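
One rough way to check for this kind of blind spot is to measure how often the text paired with images mentions counts or spatial relations at all. The sketch below is only an illustration under stated assumptions: the keyword lists, the `coverage` function, and the sample captions are all hypothetical, and plain keyword matching is a crude proxy (for example, "one" is not always a count).

```python
import re

# Hypothetical keyword lists; a real audit would need a much broader vocabulary.
COUNT_TERMS = {"one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine", "ten"}
SPATIAL_TERMS = {"left", "right", "above", "below",
                 "behind", "under", "beside", "between"}


def tokenize(caption):
    """Lowercase a caption and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", caption.lower()))


def coverage(captions):
    """Return the fraction of captions that mention a count or a spatial relation."""
    total = len(captions)
    if total == 0:
        return {"count": 0.0, "spatial": 0.0}
    count_hits = 0
    spatial_hits = 0
    for caption in captions:
        tokens = tokenize(caption)
        # A caption counts as numeric if it has a number word or a digit token.
        if tokens & COUNT_TERMS or any(t.isdigit() for t in tokens):
            count_hits += 1
        if tokens & SPATIAL_TERMS:
            spatial_hits += 1
    return {"count": count_hits / total, "spatial": spatial_hits / total}


# Made-up captions for illustration only, not real data.
captions = [
    "A dog playing in the park",
    "Two cats sleeping on a sofa",
    "A cup to the left of a laptop",
    "Sunset over the mountains",
]
stats = coverage(captions)
print(stats)
```

If the fractions come out low on a real caption set, that would be consistent with the bias described above, though it would not by itself show that the bias causes the reasoning failures.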