Agents learn to use tools more judiciously when accuracy and efficiency are trained as separate optimization objectives, rather than folded into a single reward signal that creates conflicting incentives.
This paper addresses a common failure mode in AI agents: they call external tools even when they could solve the problem from their own knowledge. The authors propose HDPO, a training framework that teaches agents when tool use is actually warranted by splitting the optimization into two independent channels, one for accuracy and one for efficiency.
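To make the contrast between one mixed reward and two separate channels concrete, here is a minimal sketch. It is not the authors' implementation: the summary does not say how HDPO computes or combines its channels, so the group-wise normalization, the function names, and the `penalty` and `efficiency_weight` parameters are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def normalize(values):
    """Center and scale rewards within one group of rollouts."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every rollout scored the same, so there is no learning signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def coupled_advantages(accuracy, tool_calls, penalty=0.1):
    """Baseline (assumed): subtract a tool penalty from the accuracy reward,
    then normalize the single mixed scalar."""
    mixed = [a - penalty * t for a, t in zip(accuracy, tool_calls)]
    return normalize(mixed)


def decoupled_advantages(accuracy, tool_calls, efficiency_weight=0.5):
    """Sketch (assumed): normalize each channel on its own, then combine
    the two advantages instead of the two raw rewards."""
    acc_adv = normalize(accuracy)
    # Fewer tool calls means a higher efficiency reward.
    eff_adv = normalize([-t for t in tool_calls])
    combined = [a + efficiency_weight * e for a, e in zip(acc_adv, eff_adv)]
    return acc_adv, eff_adv, combined


# Four hypothetical rollouts for the same prompt.
accuracy = [1.0, 1.0, 0.0, 0.0]  # 1.0 = correct answer
tool_calls = [0, 3, 0, 3]        # number of external tool invocations

coupled = coupled_advantages(accuracy, tool_calls)
acc_adv, eff_adv, combined = decoupled_advantages(accuracy, tool_calls)

print("coupled:   ", coupled)
print("accuracy:  ", acc_adv)
print("efficiency:", eff_adv)
print("combined:  ", combined)
```

In the decoupled version the accuracy channel depends only on correctness, so the two correct rollouts receive the same accuracy advantage regardless of how many tools they called. In the coupled version the tool penalty shifts the reward itself, so the accuracy signal and the efficiency signal can no longer be inspected or weighted independently.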