Jiajun Fan, Yuzheng Zhuang, Yuecheng Liu, Jianye HAO, Bin Wang, Jiangcheng Zhu, Hao Wang, Shu-Tao Xia
Abstract
Exploration remains one of the main challenges in deep reinforcement learning (RL). Recent promising works have tried to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies, and adaptive policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which in turn limits behavior diversity. In this paper, we propose a general framework called Learnable Behavioral Control (LBC) to address this limitation. LBC a) enables a significantly enlarged behavior selection space by formulating a hybrid behavior mapping from all policies, and b) constructs a unified learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control by optimizing the selection of behavior mappings with bandit-based meta-controllers. Our agents achieve a 10077.52% mean human-normalized score and surpass 24 human world records within 1B training frames in the Arcade Learning Environment, demonstrating significant state-of-the-art (SOTA) performance without degrading sample efficiency.
1. Introduction
Reinforcement learning (RL) has led to tremendous progress in a variety of domains ranging from video games [1] to robotics [4][5]. However, efficient exploration remains a significant challenge. Recent prominent works try to address the problem with population-based training [6], wherein a population of policies with different degrees of exploration is jointly trained to maintain both long-term and short-term exploration capabilities throughout the learning process, and a set of actors collects diverse behaviors derived from the policy population [3]. Despite significant performance improvements, these methods suffer from high sample complexity because the whole population must be trained jointly while its diversity is maintained. To acquire diverse behaviors, NGU uniformly selects policies from the population regardless of their contribution to the learning progress [7]. As an improvement, Agent57 adopts an adaptive policy selection mechanism in which each behavior used for sampling is periodically selected from the population by a goal-directed meta-controller [3]. Although Agent57 achieves significantly better results on the Arcade Learning Environment (ALE) benchmark, it requires tens of billions of environment interactions, as many as NGU. In short, prior PBT-based methods boost behavior diversity by expanding the policy population, which is particularly data-consuming since policies with heterogeneous training objectives require more data to converge.
2. Learnable Behavioral Control
In this paper, we aim to improve the sample efficiency of population-based reinforcement learning methods by taking on a more challenging setting: controlling behaviors within a significantly enlarged behavior space without increasing the size of the policy population. We formulate the process of deriving sampling behaviors from all policies as a hybrid behavior mapping, so that the behavior control problem is transformed into selecting appropriate mapping functions (see Fig. 1). By combining all policies, the behavior selection space grows exponentially with the population size. Even in the special case where the population size degrades to one, diverse behaviors can still be obtained by choosing different behavior mappings. This two-fold mechanism enables a tremendously larger space for behavior selection. By properly parameterizing the mapping functions, our method enables a unified learnable process, and we call this general framework Learnable Behavior Control (LBC):
Figure 1: A General Framework of LBC
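To make the framework concrete, below is a minimal sketch of the LBC control loop under our own simplifying assumptions: a bandit-style meta-controller picks the index psi of a candidate behavior mapping, an actor collects an episode with the corresponding behavior, and the episodic return is fed back to the bandit. The names (`make_behavior`, `run_episode`) and the simple UCB1 update are illustrative placeholders, not the exact implementation described in the paper.

```python
import math

class BanditMetaController:
    """Illustrative UCB1-style meta-controller over candidate behavior mappings."""

    def __init__(self, num_mappings):
        self.counts = [0] * num_mappings    # times each mapping index psi was selected
        self.values = [0.0] * num_mappings  # running mean of episodic returns per psi

    def select(self):
        # Try every mapping once, then trade off value vs. uncertainty (UCB1).
        for psi, n in enumerate(self.counts):
            if n == 0:
                return psi
        total = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda psi: self.values[psi]
            + math.sqrt(2.0 * math.log(total) / self.counts[psi]),
        )

    def update(self, psi, episodic_return):
        # Incremental mean update of the selected arm.
        self.counts[psi] += 1
        self.values[psi] += (episodic_return - self.values[psi]) / self.counts[psi]


def actor_loop(policies, candidate_mappings, make_behavior, run_episode, steps=1000):
    """One actor: repeatedly ask the meta-controller for a mapping index psi,
    build the corresponding hybrid behavior, and report the episodic return."""
    controller = BanditMetaController(len(candidate_mappings))
    for _ in range(steps):
        psi = controller.select()
        behavior = make_behavior(policies, candidate_mappings[psi])
        episodic_return = run_episode(behavior)
        controller.update(psi, episodic_return)
```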
2.1 How to Build a Hybrid Behavior Mapping? (A Straightforward Example)
1. Generalized Policy Selection. Adjust the proportion that each learned policy contributes to the behavior via an importance weight w.
2. Policy-Wise Entropy Control. Control the entropy of each policy via an entropy control function f.
3. Behavior Distillation from Multiple Policies. Distill the entropy-controlled policies into a behavior policy according to their contribution proportions and a behavior distillation function g.
The behavioral space can be constructed as follows:
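A sketch of one possible formalization (our notation, not necessarily the paper's): let $\Pi = \{\pi_1, \dots, \pi_n\}$ be the policy population, $w = (w_1, \dots, w_n)$ the importance weights, $f(\cdot\,; h_i)$ the entropy control function with hyperparameters $h_i \in H$, and $g$ the behavior distillation function. A hybrid behavior mapping then produces the behavior

$$
\mu_{\psi}(\cdot \mid s) \;=\; g\big(\{\, w_i,\; f(\pi_i(\cdot \mid s);\, h_i) \,\}_{i=1}^{n}\big),
\qquad \psi = (w, h_1, \dots, h_n),
$$

and the behavior space is the set $\{\mu_{\psi} : \psi \in \Psi\}$ induced by all admissible indices $\psi$.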
When the behavior distillation function g, the entropy control function f, and the hyperparameter set H are determined, each behavior mapping can be indexed by ψ, which can then be used to optimize behaviors across training. More ways to build a hybrid behavior mapping can be found in App. D of our paper.
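As a concrete illustration, here is a minimal sketch, under our own assumptions, of one way to realize these three steps for discrete-action policies: entropy is controlled with a per-policy temperature (one simple choice of f), and distillation is a weighted mixture of the resulting distributions (one simple choice of g). The index ψ then consists of the mixture weights and temperatures; none of these concrete choices are claimed to be the paper's exact ones.

```python
import numpy as np

def entropy_control(logits, temperature):
    """A simple entropy control function f: softmax with a temperature.
    Higher temperature -> higher-entropy (more exploratory) action distribution."""
    z = logits / max(temperature, 1e-8)
    z = z - z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def hybrid_behavior(policy_logits, weights, temperatures):
    """A simple distillation function g: weighted mixture of the
    entropy-controlled policies. `policy_logits` is a list of logit vectors,
    one per policy in the population, for the current state."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()     # importance weights w, normalized
    mixture = np.zeros_like(policy_logits[0], dtype=np.float64)
    for logits, w, tau in zip(policy_logits, weights, temperatures):
        mixture += w * entropy_control(np.asarray(logits, dtype=np.float64), tau)
    return mixture                        # behavior distribution mu_psi(.|s)

# Example: a population of two policies over 4 actions; psi = (weights, temperatures).
logits_per_policy = [np.array([2.0, 0.5, 0.1, -1.0]), np.array([-0.5, 1.5, 0.0, 0.3])]
mu = hybrid_behavior(logits_per_policy, weights=[0.7, 0.3], temperatures=[1.0, 2.0])
action = np.random.choice(len(mu), p=mu)
```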
3. Experiment
We use the Arcade Learning Environment (ALE) to evaluate the proposed method; it is an important testing ground that requires a broad set of skills such as perception, exploration, and control [3]. Previous works summarize performance on ALE with the human-normalized score and claim superhuman performance [9]. However, this human baseline is far from representative of the best human players and therefore greatly underestimates human ability. In this paper, we introduce a more challenging baseline: the human world records baseline (see [8] for more information on Atari human world records). We report the number of games in which agents outperform the human world record to claim genuinely superhuman performance in those games, yielding a more challenging and fairer comparison with human intelligence. Experimental results show that the sample efficiency of our method also outperforms the concurrent work MEME [2], which is itself 200x faster than Agent57.
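For reference, the snippet below shows how these two summary statistics are commonly computed: the human-normalized score of a game uses the standard formula HNS = (score_agent - score_random) / (score_human - score_random), and the world-record count simply compares raw agent scores against per-game record scores. The score dictionaries are placeholders, not data from the paper.

```python
import statistics

def human_normalized_score(agent, random_baseline, human_baseline):
    """Standard human-normalized score (often reported as a percentage)."""
    return (agent - random_baseline) / (human_baseline - random_baseline)

def summarize(agent_scores, random_scores, human_scores, world_records):
    """Each argument is a dict mapping game name -> raw score for the 57 ALE games."""
    hns = {
        g: human_normalized_score(agent_scores[g], random_scores[g], human_scores[g])
        for g in agent_scores
    }
    mean_hns = 100.0 * statistics.mean(hns.values())     # e.g. reported as "10077.52%"
    median_hns = 100.0 * statistics.median(hns.values())
    # Number of games where the agent's raw score beats the human world record.
    records_surpassed = sum(agent_scores[g] > world_records[g] for g in agent_scores)
    return mean_hns, median_hns, records_surpassed
```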
3.1 How does LBC perform in ALE?
Figure 2: Atari Learning Curve
Among algorithms with comparable final performance, our agents achieve the best mean HNS and surpass the most human world records across the 57 games of the Atari benchmark with the fewest training frames, leading to the best learning efficiency. Note that Agent57 reports the maximum score across training as the final score; if we report our performance in the same manner, our median HNS is 1934%, which is higher than Agent57's and demonstrates our SOTA performance.
3.2 How does LBC control the behavior?
Figure 3: Learning Curve with Entropy
To further explore the mechanisms underlying the success of our method, we present a case study of the behavior control process. In most tasks, our agents prefer exploratory behaviors first (i.e., highly stochastic policies with high entropy), and as training progresses, they shift toward producing experience from more exploitative behaviors. As performance approaches its peak, the entropy of the behaviors is maintained at a certain (task-dependent) level instead of collapsing swiftly to zero, which avoids premature convergence to sub-optimal policies.
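The entropy curve in Fig. 3 can be tracked with the usual Shannon-entropy formula; the short helper below is a generic sketch (not the paper's code) for measuring the mean entropy of a behavior's action distributions over a batch of states, which is how such a curve is typically produced.

```python
import numpy as np

def mean_behavior_entropy(action_probs_batch):
    """Mean Shannon entropy (in nats) of a batch of action distributions.
    `action_probs_batch` has shape (batch_size, num_actions); rows sum to 1."""
    p = np.clip(np.asarray(action_probs_batch, dtype=np.float64), 1e-12, 1.0)
    entropies = -np.sum(p * np.log(p), axis=1)
    return float(entropies.mean())
```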
3.3 How does LBC perform compared to MuZero?
Figure 4: Comparison with MuZero. Human-normalized scores per game at different interaction budgets, sorted from highest to lowest.
We provide a comprehensive per-game comparison of LBC and MuZero in Fig. 4. We can observe that:
1. In most tasks, LBC (1B) achieves better performance than MuZero (20B).
2. In a small number of tasks, LBC (1B) is not as good as MuZero (20B).
3.4 How does LBC perform compared to MEME?
Figure 5: Comparison with MEME. Human-normalized scores per game at different interaction budgets, sorted from highest to lowest.
We provide a comprehensive per-game comparison of LBC and MEME in Fig. 5. We can observe that LBC achieves better performance than MEME on most tasks but is not as good as MEME on a few tasks [2].
To better explain the differences, we classify the tasks in which LBC shows significant improvements according to the main challenge of each task:
Hard-exploration problems. Alien (234.24%), Zaxxon (209.94%), and Wizard of Wor (65.27%) [10].
Long-term credit assignment. Beam Rider (781.66%) [3].
Adaptive exploration-exploitation trade-off. Beam Rider (781.66%), Jamesbond (432.69%) [3], and Demon Attack (533.14%).
In conclusion, LBC has made great improvements in hard-exploration tasks with relatively dense rewards and in tasks requiring adaptive behavior control throughout training. This is because the large behavior space in LBC allows an appropriate behavior to be found throughout the training process, as long as the reward signal is not so sparse that it cannot be learned at all (e.g., Private Eye and Montezuma's Revenge). In addition, LBC shows a significant improvement in some tasks requiring long-term credit assignment [3].
As the behavior space becomes larger (via the general constructions proposed in LBC), how to make full use of its capacity and find potential policy solutions more efficiently becomes another focus of our future research (e.g., designing the optimization target of behavior control).
3.5 What have LBC's agents learned?
Gameplay highlights (panel captions): reflect the ball to the upper side; game master; perfect bowling; a shortcut to a high score.
Breakout. In Breakout, our agents have found an easier way to remove bricks quickly: 1) concentrate on eliminating the bricks on the right to open a gap leading to the space above the bricks; 2) launch the ball through the gap so it bounces along the upper side of the bricks, using the reflections to eliminate many more bricks.
Bowling. In Bowling, our agents have found a way to knock down all the pins, i.e., a perfect bowling game.
NameThisGame. Our agents have found a nearly optimal strategy for this game, so that the score can be increased continuously.
Qbert. In Qbert, our agents find a policy that obtains the maximum reward in the shortest time, i.e., a shortcut, which ensures the score can be increased continuously.
References
[1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[2] Kapturowski, Steven, et al. "Human-level Atari 200x faster." arXiv preprint arXiv:2209.07550 (2022).
[3] Badia, Adrià Puigdomènech, et al. "Agent57: Outperforming the Atari human benchmark." International Conference on Machine Learning. PMLR, 2020.
[4] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[5] Schulman, John, et al. "Trust region policy optimization." International conference on machine learning. PMLR, 2015.
[6] Jaderberg, Max, et al. "Population based training of neural networks." arXiv preprint arXiv:1711.09846 (2017).
[7] Badia, Adrià Puigdomènech, et al. "Never give up: Learning directed exploration strategies." arXiv preprint arXiv:2002.06038 (2020).
[8] Toromanoff, Marin, Emilie Wirbel, and Fabien Moutarde. "Is deep reinforcement learning really superhuman on Atari? Leveling the playing field." arXiv preprint arXiv:1908.04683 (2019).
[9] Machado, Marlos C., et al. "Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.
[10] Bellemare, Marc, et al. "Unifying count-based exploration and intrinsic motivation." Advances in neural information processing systems 29 (2016).