Current LLM agents struggle with long-term planning and learning from delayed feedback. Only frontier models such as Claude Opus 4.6 succeed, and using a scratchpad to persist information across context windows is critical to that success.
YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long horizons by simulating a year of running a startup. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.
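The scratchpad mechanism can be as simple as a file the agent reads at the start of each context window and rewrites before the window ends. Below is a minimal sketch of that pattern, assuming a JSON file and hypothetical note fields; none of these names or structures come from YC-Bench itself.

```python
import json
from pathlib import Path

# Hypothetical persistence location; YC-Bench may use a different mechanism.
SCRATCHPAD = Path("scratchpad.json")

def load_scratchpad() -> dict:
    """Restore notes written by earlier context windows, if any."""
    if SCRATCHPAD.exists():
        return json.loads(SCRATCHPAD.read_text())
    return {"year_plan": "", "lessons": [], "open_commitments": []}

def save_scratchpad(notes: dict) -> None:
    """Persist the agent's notes so the next context window can resume."""
    SCRATCHPAD.write_text(json.dumps(notes, indent=2))

def build_prompt(notes: dict, observation: str) -> str:
    """Prepend the scratchpad to the new observation so decisions made
    before the context was truncated still inform the current step."""
    return (
        "Your notes from previous sessions:\n"
        f"{json.dumps(notes, indent=2)}\n\n"
        f"Current state of the simulation:\n{observation}\n\n"
        "Decide the next action, then update your notes."
    )

# One simulated step: the agent reads its old notes, acts, and records
# a new lesson for future context windows to build on.
notes = load_scratchpad()
prompt = build_prompt(notes, "Q2: revenue down 12%, two contract offers pending.")
# response = llm(prompt)  # call to the underlying model, elided here
notes["lessons"].append("Q1 hire was premature; keep 3 months of runway.")
save_scratchpad(notes)
```

The key design point is that the scratchpad survives context truncation: a decision made early in the simulated year can still shape behavior months later, which is exactly the delayed-feedback loop the benchmark stresses.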