How can we teach a Deep RL agent to become good at a skill over a wide range of diverse tasks? To address this problem, we propose to rely on teacher algorithms that use Learning Progress (LP) as a signal to optimize the sequence of tasks proposed to their DRL student. To study our proposed teachers, we design two parameterized BipedalWalker environments. A teacher samples a parameter vector that maps to a distribution of tasks, from which a task is drawn and proposed to the student. The teacher then observes the episodic performance of its student and uses this information to adapt its sampling distribution (see the figure below for a simple workflow diagram).
In this work we present ALP-GMM, a new algorithm that models Absolute Learning Progress with Gaussian Mixture Models, and compare it to existing LP-based algorithms. Using our BipedalWalker environments, we study their efficiency in personalizing a learning curriculum for different learners (embodiments), their robustness to the ratio of learnable/unlearnable tasks, and their scalability to non-linear and high-dimensional parameter spaces. ALP-GMM, which is conceptually simple and has very few crucial hyperparameters, opens up exciting perspectives for various curriculum learning challenges within DRL (domain randomization for Sim2Real transfer, curriculum learning within autonomously discovered task spaces, ...).
Paper: R. Portelas, C. Colas, K. Hofmann, and P.-Y. Oudeyer. Teacher Algorithms For Curriculum Learning of Deep RL in Continuously Parameterized Environments. CoRL 2019. https://arxiv.org/abs/1910.07224
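For intuition, below is a minimal sketch of the core ALP-GMM loop, not the repository's implementation: fit a GMM on recently sampled parameters concatenated with their Absolute Learning Progress (ALP), then bias new samples toward the Gaussian with the highest mean ALP. The `episodic_return` function and the coarse binning used to estimate ALP are toy stand-ins; the released version selects the number of components via AIC, computes ALP with a nearest-neighbor lookup over the sampling history, and samples components proportionally to their mean ALP.

```python
# Minimal, illustrative ALP-GMM sketch (see teachDRL/teachers/ for the actual code).
import numpy as np
from sklearn.mixture import GaussianMixture

returns = {}   # last observed return per coarsely binned parameter (toy ALP estimate)
history = []   # (parameter, ALP) pairs

def episodic_return(p):
    # Toy stand-in for a student's episodic reward; replace with a real rollout.
    return float(-p[0] + 0.1 * p[1] + 0.1 * np.random.randn())

def alp(p, r):
    # Absolute Learning Progress: |new return - previous return on a nearby task|.
    key = tuple(np.round(p, 0))
    old = returns.get(key, r)
    returns[key] = r
    return abs(r - old)

low, high = np.array([0.0, 0.0]), np.array([3.0, 6.0])  # e.g. (stump_height, obstacle_spacing)
for step in range(500):
    if step < 50 or np.random.rand() < 0.2:   # bootstrap phase + residual random exploration
        p = np.random.uniform(low, high)
    else:
        X = np.array([np.append(pp, a) for pp, a in history[-250:]])
        gmm = GaussianMixture(n_components=3).fit(X)  # fit on (parameter, ALP) points
        best = int(np.argmax(gmm.means_[:, -1]))      # component with highest mean ALP
        std = np.sqrt(np.diag(gmm.covariances_[best])[:-1])
        p = np.clip(gmm.means_[best, :-1] + np.random.randn(2) * std, low, high)
    r = episodic_return(p)
    history.append((p, alp(p, r)))
```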
This GitHub repository provides implementations of the following teacher algorithms:
- ALP-GMM, our proposed teacher algorithm
- Robust Intelligent Adaptive Curiosity (RIAC), from Baranes and Oudeyer (2009). This is also an implementation of the SAGG-RIAC algorithm from Baranes and Oudeyer (2013), in which RIAC incrementally segments the goal space into sub-regions and a bandit algorithm samples tasks/goals in regions of high absolute learning progress (see the sketch after this list).
- Covar-GMM, from Moulin-Frier et al. (2013)
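Below is a deliberately simplified sketch of the region-splitting idea behind RIAC, not the repository's implementation: leaf regions of the parameter space accumulate episodic returns, split in half once they hold enough samples (real RIAC optimizes the split point instead), and are sampled like bandit arms proportionally to their absolute learning progress. `episodic_return` is a hypothetical stand-in for a student rollout.

```python
# Simplified RIAC-style region sampler (illustrative; see teachDRL/teachers/ for the real one).
import numpy as np

class Region:
    """A hyper-rectangular sub-region of the parameter space."""
    def __init__(self, low, high, capacity=30):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)
        self.capacity = capacity
        self.rewards = []      # episodic returns observed in this region
        self.children = None

    def lp(self):
        # Absolute learning progress: |mean of recent returns - mean of older returns|.
        if len(self.rewards) < 4:
            return 0.0
        h = len(self.rewards) // 2
        return abs(float(np.mean(self.rewards[h:])) - float(np.mean(self.rewards[:h])))

    def leaves(self):
        return [self] if self.children is None else [l for c in self.children for l in c.leaves()]

    def add(self, p, r):
        if self.children is not None:   # route the sample down to the right leaf
            for c in self.children:
                if np.all(p >= c.low) and np.all(p <= c.high):
                    c.add(p, r)
                    return
        self.rewards.append(r)
        if self.children is None and len(self.rewards) > self.capacity:
            d = int(np.argmax(self.high - self.low))  # split widest dimension in half
            mid = (self.low[d] + self.high[d]) / 2    # (real RIAC optimizes this split)
            hi1, lo2 = self.high.copy(), self.low.copy()
            hi1[d], lo2[d] = mid, mid
            self.children = [Region(self.low, hi1, self.capacity),
                             Region(lo2, self.high, self.capacity)]

def sample_task(root, eps=0.2):
    leaves = root.leaves()
    if np.random.rand() > eps:          # bandit step: favor regions with high |LP|
        lps = np.array([l.lp() for l in leaves])
        if lps.sum() > 0:
            region = leaves[np.random.choice(len(leaves), p=lps / lps.sum())]
            return np.random.uniform(region.low, region.high)
    region = leaves[np.random.randint(len(leaves))]  # residual uniform exploration
    return np.random.uniform(region.low, region.high)

# Usage: root = Region([0, 0], [3, 6]); p = sample_task(root); root.add(p, episodic_return(p))
```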
Full references:
- A. Baranes and P.-Y. Oudeyer (2009). R-IAC: robust intrinsically motivated exploration and active learning. IEEE Trans. Autonomous Mental Development.
- A. Baranes and P.-Y. Oudeyer (2013). Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems.
- C. Moulin-Frier, S. M. Nguyen, and P.-Y. Oudeyer (2013). Self-organization of early vocal development in infants and machines: The role of intrinsic motivation. Frontiers in Psychology (Cognitive Science).
- Installation
- Testing teachers
- Using the Parameterized BW environment
- Launching experiments on Parameterized BW environment
- Visualizations
1- Get the repository:

```
git clone https://github.com/flowersteam/teachDeepRL
cd teachDeepRL/
```
2- Install it, using Conda for example (use Python >= 3.6):

```
conda create --name teachDRL python=3.6
conda activate teachDRL
pip install -e .
```
Test the Random, RIAC, ALP-GMM and Covar-GMM teachers on a simple toy environment:

```
cd teachDRL/
python3 toy_env/toy_env.py
```

GIFs of the teachers' parameter sampling dynamics will be created in `graphics/toy_env_gifs/`.
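The loop driving this test can be reproduced in a few lines. The sketch below assumes the teacher classes expose a `sample_task()` / `update(task, reward)` interface and that `ALPGMM` lives in `teachDRL/teachers/algos/alp_gmm.py`; check the repository for the exact names and signatures. A toy competence function stands in for a real DRL student:

```python
# Hedged sketch of a teacher-student loop in the spirit of toy_env.py.
# The ALPGMM import path and its sample_task()/update() signatures are
# assumptions -- verify them against teachDRL/teachers/ before use.
import numpy as np
from teachDRL.teachers.algos.alp_gmm import ALPGMM

teacher = ALPGMM(mins=[0, 0], maxs=[1, 1])  # 2D toy parameter space

def toy_competence(p, step):
    # Toy student: competence slowly grows over time, but only on easy tasks (small p[0]).
    return float(np.clip(step / 500 - p[0], 0.0, 1.0))

for step in range(1000):
    task = teacher.sample_task()         # teacher proposes a parameter vector
    reward = toy_competence(task, step)  # stand-in for the student's episodic return
    teacher.update(task, reward)         # teacher adapts its sampling distribution
```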
In case you want to use our Parameterized BW environment in your own projects, the following code snippet shows how to interact with it:
```python
import gym
import teachDRL.gym_flowers
import numpy as np

env = gym.make('bipedal-walker-continuous-v0')  # create environment
env.env.my_init({'leg_size': 'default'})  # set walker type within environment, do it once

for nb_episode in range(10):
    # Set the parameters of the procedural generation.
    # Make sure to do it for each new episode, before reset.
    env.set_environment(stump_height=np.random.uniform(0, 3),
                        obstacle_spacing=np.random.uniform(0, 6))  # Stump Tracks
    # env.set_environment(poly_shape=np.random.uniform(0, 4, 12))  # Hexagon Tracks
    obs = env.reset()
    for i in range(2000):
        obs, rew, done, _ = env.step(env.env.action_space.sample())  # random actions
        env.render()
        if done:  # stop the episode once the walker falls or reaches the end
            break
```
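The sampling bounds above match the experiment flags used below (`--max_stump_h 3.0`, `--max_obstacle_spacing 6.0`); Hexagon Tracks is instead controlled by a single 12-dimensional `poly_shape` vector.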
We implemented two additional walker morphologies, which can be visualized along with the parametric variations of the environment (Stump Tracks and Hexagon Tracks):
```
python3 test_bipedal_walker_continuous.py
```
In the Parameterized BW environment, we use Soft Actor-Critic (OpenAI Spinning Up implementation) as our Deep RL student.
Then you can launch a Teacher-Student run in Stump Tracks:
```
python3 run.py --exp_name test_alpgmm --teacher ALP-GMM --seed 42 --leg_size default --max_stump_h 3.0 --max_obstacle_spacing 6.0
```
Available teachers (`--teacher`): ALP-GMM, RIAC, Oracle, Random, Covar-GMM

Available walker morphologies (`--leg_size`): short, default, quadru
You can also test the quadrupedal walker on Hexagon Tracks:
```
python3 run.py --exp_name test_alpgmm_hexa --teacher ALP-GMM --seed 42 --leg_size quadru -hexa
```
To run multiple seeds, we recommend using taskset to bind each process to a single CPU thread, like so:
```
taskset -c 0 python3 run.py --exp_name test_alpgmm_hexa --teacher ALP-GMM --seed 42 --leg_size quadru -hexa &
taskset -c 1 python3 run.py --exp_name test_alpgmm_hexa --teacher ALP-GMM --seed 43 --leg_size quadru -hexa &
```
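If you prefer launching a batch of seeds from Python, a small helper along these lines (hypothetical, not part of the repository) can handle the CPU pinning:

```python
# Hypothetical launcher: pins each seed's run to its own CPU with taskset.
import subprocess

for cpu, seed in enumerate(range(42, 46)):  # 4 seeds on CPUs 0-3
    cmd = ["taskset", "-c", str(cpu), "python3", "run.py",
           "--exp_name", "test_alpgmm_hexa", "--teacher", "ALP-GMM",
           "--seed", str(seed), "--leg_size", "quadru", "-hexa"]
    subprocess.Popen(cmd)  # non-blocking, equivalent to the trailing '&' above
```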
We tested the ALP-GMM teacher paired with 3 different walker morphologies. For each of these walkers we show the walking gaits learned after training on a curriculum generated by ALP-GMM. A single ALP-GMM run is enough to train Soft Actor-Critic controllers to master a wide range of track distributions.
The following videos show the evolution of ALP-GMM's parameter sampling for the short, default, and quadrupedal walkers. The learning curricula generated by ALP-GMM are tailored to the capacities of the student they are paired with.
To assess whether ALP-GMM is able to scale to parameter spaces of higher dimensionality, containing irrelevant dimensions, and whose difficulty gradients are non-linear, we performed experiments with quadrupedal walkers on Hexagon Tracks, our 12-dimensional parameterized BipedalWalker environment. The following videos show walking gaits learned in a single ALP-GMM run.
In comparison, when using a Random curriculum, most runs end up with poor performance, like this one: