Separating exploration from policy optimization using uncertainty-guided tree search is dramatically more efficient than standard RL approaches on hard-exploration problems, and the discovered trajectories can afterward be distilled into deployable policies.
This paper proposes a new approach to exploration in reinforcement learning that separates the exploration phase from policy optimization. Rather than running RL with intrinsic-motivation reward bonuses, the method uses tree search guided by uncertainty estimates to efficiently discover new states, and then distills the discovered trajectories into policies.
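To make the two-phase idea concrete, here is a minimal toy sketch, not the paper's actual algorithm: a best-first tree search over a tiny deterministic chain environment, where a node's priority is an uncertainty proxy (here, inverse visit count standing in for an uncertainty estimate), followed by a tabular "distillation" of the best discovered trajectory into a state-to-action policy. All function and variable names (`step`, `tree_search_explore`, `distill`) are illustrative assumptions.

```python
import heapq

def step(state, action):
    # Toy deterministic environment: states 0..9 on a chain; action 1 moves
    # right, action 0 moves left (clipped at the ends). Reward only at state 9.
    n_states = 10
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

def tree_search_explore(start=0, budget=50):
    # Exploration phase: best-first search where priority is an uncertainty
    # proxy -- rarely visited states score high and get expanded first.
    visits = {}
    frontier = [(0.0, 0, start, [])]  # (neg_uncertainty, tiebreak, state, trajectory)
    best_traj, best_reward = [], -1.0
    tiebreak = 1
    for _ in range(budget):
        if not frontier:
            break
        _, _, state, traj = heapq.heappop(frontier)
        for action in (0, 1):
            nxt, reward = step(state, action)
            new_traj = traj + [(state, action)]
            if reward > best_reward:
                best_reward, best_traj = reward, new_traj
            visits[nxt] = visits.get(nxt, 0) + 1
            uncertainty = 1.0 / visits[nxt]  # novel state => uncertainty 1.0
            heapq.heappush(frontier, (-uncertainty, tiebreak, nxt, new_traj))
            tiebreak += 1
    return best_traj, best_reward

def distill(trajectory):
    # Distillation phase, reduced to its tabular form: clone the discovered
    # trajectory into a lookup-table policy mapping each state to its action.
    return {s: a for s, a in trajectory}

traj, reward = tree_search_explore()
policy = distill(traj)
print(reward, policy)  # reaches the rewarding end state; policy moves right
```

In a real instantiation the visit-count proxy would be replaced by a learned uncertainty estimate (e.g. ensemble disagreement) and the tabular lookup by supervised training of a policy network on the discovered trajectories; the point of the sketch is only the separation of the search phase from the distillation phase.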