LLMs need specialized training data to follow data science workflows reliably; fine-tuning on task-specific benchmarks can improve performance by as much as 8x.
DARE-bench is a benchmark for evaluating how well AI models follow data science instructions and complete multi-step ML tasks. It comprises 6,300 real Kaggle tasks, each with a verifiable correct answer, so evaluation is objective rather than dependent on human judges.
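The "verifiable correct answers" design can be sketched in a few lines: each task carries a ground-truth answer, and a submission is graded by deterministic comparison rather than by a human or LLM judge. The names below (`Task`, `grade`) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of judge-free, objective scoring: every task ships a
# verifiable ground-truth answer, and grading is a deterministic comparison.
# `Task` and `grade` are illustrative names, not DARE-bench's real interface.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    answer: str  # verifiable ground truth, e.g. a metric value or a label


def grade(task: Task, submission: str, rel_tol: float = 1e-4) -> bool:
    """Return True iff the submission matches the ground truth.

    Numeric answers are compared with a relative tolerance so harmless
    formatting differences (0.8731 vs 0.87310) still pass; everything
    else falls back to a normalized exact-match comparison.
    """
    try:
        truth, pred = float(task.answer), float(submission)
        return abs(pred - truth) <= rel_tol * max(1.0, abs(truth))
    except ValueError:
        return submission.strip().lower() == task.answer.strip().lower()


tasks = [Task("t1", "0.8731"), Task("t2", "gradient boosting")]
preds = {"t1": "0.87310", "t2": "Gradient Boosting"}
accuracy = sum(grade(t, preds[t.task_id]) for t in tasks) / len(tasks)
print(accuracy)  # both submissions match -> 1.0
```

Because every comparison is deterministic, two independent runs of the harness produce identical scores, which is what makes the benchmark's evaluation objective.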