Learning from large batches of autonomously collected data for policy improvement, a paradigm we refer to as batch online reinforcement learning, holds the promise of enabling truly scalable robot learning: it significantly reduces the human effort required for data collection while reaping the benefits of self-improvement. Yet despite this promise, the paradigm remains challenging to realize because existing algorithms do not learn effectively from autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to improve efficiently from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online reinforcement learning in robotics. Motivated by this question, we perform a systematic empirical study of three axes, (i) algorithm class, (ii) policy extraction method, and (iii) policy expressivity, and analyze how these axes affect performance and scaling with the amount of autonomously collected data. Through our analysis, we make several observations. First, we observe that using Q-functions to guide batch online RL significantly improves performance over imitation-based methods. Building on this, we show that an implicit method of policy extraction, which selects the best action among those in the policy's distribution, is preferred over the traditional explicit policy extraction methods from offline RL. Next, we show that an expressive policy class outperforms less expressive policy classes. Based on this analysis, we propose a general recipe for effective batch online RL. We then show that a simple addition to the recipe, namely using temporally-correlated noise to obtain more diversity, results in further performance gains. Our recipe obtains significantly better performance and scaling compared to prior methods.
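The abstract mentions temporally-correlated noise as a way to obtain more diverse autonomous data, but this page does not specify the noise process. As a rough illustration only, the sketch below assumes a first-order autoregressive (AR(1)) process; the function name `correlated_noise` and the parameters `beta`, `sigma`, and `seed` are hypothetical and not taken from the paper.

```python
import math
import random
from typing import List


def correlated_noise(horizon: int, action_dim: int, beta: float = 0.9,
                     sigma: float = 0.1, seed: int = 0) -> List[List[float]]:
    """Return `horizon` noise vectors that are correlated across time steps.

    Assumption: an AR(1) process, eps_t = beta * eps_{t-1} + sqrt(1 - beta^2) * xi_t,
    with xi_t standard Gaussian. This keeps the per-step standard deviation at
    `sigma` while `beta` controls how slowly the noise changes.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - beta ** 2)
    eps = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    noise = []
    for _ in range(horizon):
        noise.append([sigma * e for e in eps])
        # Blend the previous noise with fresh Gaussian noise.
        eps = [beta * e + scale * rng.gauss(0.0, 1.0) for e in eps]
    return noise


# Hypothetical usage: perturb a rollout of 5 two-dimensional actions.
noise = correlated_noise(horizon=5, action_dim=2)
```

With `beta = 0` this reduces to independent per-step Gaussian noise, and with `beta` close to 1 the perturbation drifts slowly, so a rollout explores a consistent direction rather than jittering around the policy's mean action.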
Key Finding 1: Algorithm Class. Value-based RL is necessary for overcoming the suboptimal convergence
of filtered imitation learning methods, because it can better leverage the diversity in autonomous data.
Further, value-based RL scales better with larger batches of autonomous data.
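One way to picture the difference in how the two algorithm classes use autonomous data is sketched below. It assumes that filtered imitation keeps only rollouts flagged as successful, while a value-based method trains on every transition; the trajectory layout (`"success"`, `"transitions"`) and both function names are hypothetical and chosen for illustration, not taken from the paper.

```python
from typing import Dict, List

# Hypothetical trajectory layout: {"success": bool, "transitions": [...]}.
Trajectory = Dict[str, object]


def filtered_imitation_data(trajectories: List[Trajectory]) -> list:
    """Keep only transitions from successful rollouts (assumed filter rule)."""
    return [tr for t in trajectories if t["success"] for tr in t["transitions"]]


def value_based_data(trajectories: List[Trajectory]) -> list:
    """Keep every transition, including those from failed rollouts."""
    return [tr for t in trajectories for tr in t["transitions"]]


# Toy batch: one successful rollout (2 steps) and one failed rollout (3 steps).
batch = [
    {"success": True, "transitions": ["s0", "s1"]},
    {"success": False, "transitions": ["f0", "f1", "f2"]},
]
print(len(filtered_imitation_data(batch)), len(value_based_data(batch)))
```

Under this assumed filter, the failed rollout contributes nothing to the imitation dataset, whereas the value-based dataset retains all five transitions.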
Key Finding 2: Policy Extraction Method. Implicit policy extraction—wherein the best action in the
distribution of the policy is selected—significantly outperforms explicit policy extraction in batch online RL settings.
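A minimal sketch of implicit policy extraction as described above: draw several candidate actions from the policy and keep the one the Q-function scores highest. The names `implicit_extract`, `sample_action`, `q_value`, and `num_samples` are hypothetical, and the toy policy and Q-function are stand-ins for learned models rather than the paper's implementation.

```python
import random
from typing import Callable, List, Sequence

Action = List[float]


def implicit_extract(
    state: Sequence[float],
    sample_action: Callable[[Sequence[float]], Action],
    q_value: Callable[[Sequence[float], Action], float],
    num_samples: int = 16,
) -> Action:
    """Sample candidate actions from the policy and return the highest-Q one."""
    candidates = [sample_action(state) for _ in range(num_samples)]
    return max(candidates, key=lambda a: q_value(state, a))


# Toy stand-ins (assumptions): a Gaussian "policy" and a Q-function peaked at 0.5.
rng = random.Random(0)


def toy_policy(state: Sequence[float]) -> Action:
    return [rng.gauss(0.0, 1.0)]


def toy_q(state: Sequence[float], action: Action) -> float:
    return -(action[0] - 0.5) ** 2


best = implicit_extract([0.0], toy_policy, toy_q, num_samples=32)
```

Because the returned action is always one of the policy's own samples, the selection stays within the policy's distribution; the Q-function only re-ranks the candidates.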
Key Finding 3: Policy Expressivity. Expressive policy classes outperform less expressive policy classes
with either explicit or implicit policy extraction methods.
Real-Robot Experiment. We validate the recipe for batch online RL on a challenging real-world manipulation task of hanging
a tape roll on a hook.
@article{dong2025batch,
title = {What Matters for Batch Online Reinforcement Learning in Robotics?},
author = {Perry Dong and Suvir Mirchandani and Dorsa Sadigh and Chelsea Finn},
journal = {arXiv},
year = {2025},
}