Current AI language models struggle with biology tasks that require causal reasoning, and assessing them properly requires domain-aware evaluation metrics.
SC-Arena is a benchmark for testing how well AI language models understand single-cell biology. Instead of multiple-choice questions, it uses real-world tasks such as predicting what happens when genes are modified. It also introduces a knowledge-grounded evaluation that checks answers against biological databases and scientific literature, rather than just matching text strings.
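
To make the difference between string matching and knowledge-aware checking concrete, here is a minimal sketch. It is not SC-Arena's actual evaluator: the function names and the tiny synonym table are hypothetical, and the table only stands in for a lookup against a real biological database or ontology.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse hyphens and whitespace so formatting differences do not matter."""
    return " ".join(label.lower().replace("-", " ").split())


# Hypothetical synonym table standing in for a biological database lookup.
# A real evaluator would query a curated resource instead of a hand-written dict.
TOY_SYNONYMS = {
    "t cell": {"t cell", "t lymphocyte"},
    "b cell": {"b cell", "b lymphocyte"},
}


def exact_match(prediction: str, reference: str) -> bool:
    """Baseline: the answer counts only if the strings are identical."""
    return prediction == reference


def knowledge_match(prediction: str, reference: str, synonyms=TOY_SYNONYMS) -> bool:
    """Count the answer as correct if both names refer to the same entry in the table."""
    pred, ref = normalize(prediction), normalize(reference)
    if pred == ref:
        return True
    # Both names must appear in the same synonym set to be treated as equivalent.
    return any(pred in names and ref in names for names in synonyms.values())
```

In this toy setup, a model that answers "T lymphocyte" when the reference says "T cell" fails the exact string comparison but passes the knowledge-aware check, while a genuinely different answer such as "B cell" fails both.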