Current document-reasoning agents succeed through exhaustive search rather than strategic thinking; to handle real-world document workflows efficiently, they need better planning abilities, not just more attempts.
This paper introduces MADQA, a benchmark of 2,250 questions over 800 PDF documents, designed to test whether AI agents navigate documents strategically or merely search at random. The researchers found that while agents match human accuracy on some questions, they rely on brute-force trial-and-error rather than deliberate planning, falling 20% short of optimal performance.