Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.
HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.