Current LLM agents struggle with long-term planning and learning from delayed feedback. Only frontier models such as Claude Opus 4.6 succeed, and using a scratchpad to persist information across context windows is critical to that success.
YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long horizons by simulating a year of running a startup. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.
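The scratchpad mechanism can be as simple as a file the agent reads at the start of each context window and rewrites before the window ends. Below is a minimal sketch of that pattern, assuming a JSON file and hypothetical note fields; none of these names or structures come from YC-Bench itself.

```python
import json
from pathlib import Path

# Hypothetical persistence location; YC-Bench may use a different mechanism.
SCRATCHPAD = Path("scratchpad.json")

def load_scratchpad() -> dict:
    """Restore notes written by earlier context windows, if any."""
    if SCRATCHPAD.exists():
        return json.loads(SCRATCHPAD.read_text())
    return {"year_plan": "", "lessons": [], "open_commitments": []}

def save_scratchpad(notes: dict) -> None:
    """Persist the agent's notes so the next context window can resume."""
    SCRATCHPAD.write_text(json.dumps(notes, indent=2))

def build_prompt(notes: dict, observation: str) -> str:
    """Prepend the scratchpad to the new observation so decisions made
    before the context was truncated still inform the current step."""
    return (
        "Your notes from previous sessions:\n"
        f"{json.dumps(notes, indent=2)}\n\n"
        f"Current state of the simulation:\n{observation}\n\n"
        "Decide the next action, then update your notes."
    )

# One simulated step: the agent reads its old notes, acts, and records
# a new lesson for future context windows to build on.
notes = load_scratchpad()
prompt = build_prompt(notes, "Q2: revenue down 12%, two contract offers pending.")
# response = llm(prompt)  # call to the underlying model, elided here
notes["lessons"].append("Q1 hire was premature; keep 3 months of runway.")
save_scratchpad(notes)
```

The key design point is that the scratchpad survives context truncation: a decision made early in the simulated year can still shape behavior months later, which is exactly the delayed-feedback loop the benchmark stresses.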