Add curriculum learning example using simple adder (#47)
* Add curriculum learning example

* fix bug in sim for spaces.Dict

---------

Co-authored-by: Jazmia Henry <48301423+jazmiahenry@users.noreply.github.com>
jillianmclements and jazmiahenry authored Aug 29, 2023
1 parent be868f9 commit 2702410
Showing 5 changed files with 368 additions and 0 deletions.
66 changes: 66 additions & 0 deletions examples/curriculum-learning/README.md
@@ -0,0 +1,66 @@
# Curriculum Learning

In this example, we show how to use curriculum learning to train an RL agent on Azure ML with a custom Gymnasium environment (“Simple Adder”). Curriculum learning is a technique that orders training tasks according to some measure of difficulty and gradually exposes the agent to harder episodes as it learns.
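
As a toy illustration of the idea (not part of the sample code), the snippet below orders a handful of hypothetical starting states for the Simple Adder by their distance from the target value 50, which is exactly the notion of difficulty this example uses:

```python
# Toy illustration of the core idea: order training tasks from easy to hard.
# Here a task's "difficulty" is the distance of its starting state from the
# target value 50, matching the Simple Adder environment in this example.
tasks = [41, 58, 49, 52, 33, 50, 64]  # hypothetical initial state values
curriculum = sorted(tasks, key=lambda start: abs(start - 50))
print(curriculum)  # easiest (closest to 50) first
```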

### What this sample covers
- How to modify a custom Gymnasium simulation environment to use curriculum learning with RLlib
- How to implement curriculum learning on your local machine and on Azure ML

### What this sample does not cover
- How to create an optimized curriculum for best performance
- How to evaluate the agent
- How to deploy the agent

## Prerequisites

- Install the Azure CLI on your machine:
```bash
pip install azure-cli
```
- Add the ML extension:
```bash
az extension add -n ml
```
- [Create an AML workspace and compute cluster](https://azure.github.io/plato/#create-azure-resources)
- Create an AML environment using the conda file provided (`conda.yml`) by running the following command:
```bash
az ml environment create --name curriculum-learning-env --conda-file conda.yml --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04 --resource-group $YOUR_RESOURCE_GROUP --workspace-name $YOUR_WORKSPACE
```

## Example Overview
The simulation environment in this sample is "Simple Adder" (`src/sim_curriculum_capable.py`), where the agent chooses a number between -10 and 10 to add to a state value. The goal is to make the state value equal to 50 within 10 time steps. The difficulty depends on how far the state value starts from 50, as larger distances make the task harder. The curriculum learning strategy is to gradually expand the range of possible initial state values around 50, increasing the level of difficulty. The agent starts training on the easiest range (smallest distances) and progresses to the next, wider range after reaching a certain average reward threshold. It repeats this process until it reaches the hardest range (largest distances) or hits the maximum number of training iterations.
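
To make the schedule concrete, here is a small illustrative snippet (not part of the sample code) that prints the initial-state window each curriculum task allows. It mirrors the `2**exponent` logic in `src/sim_curriculum_capable.py`; the upper bound is exclusive because the environment samples with `np.random.randint`:

```python
# Width of the initial-state window for each curriculum task in this example:
# task "exponent" k starts the state within 50 +/- 2**k (upper bound exclusive).
for exponent in range(1, 6):
    half_width = 2**exponent
    print(f"task exponent {exponent}: start value in [{50 - half_width}, {50 + half_width})")
```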

## Run Locally
Before launching on AML, check that the simulation works on your local machine; this saves precious development time. The `main.py` script in the `src` folder lets you test locally with the following command:

```bash
python main.py --test-local
```

## Tutorial: Run on AML
After you have checked that the simulation works properly, follow these steps to train an RL agent on AML using curriculum learning:

1. Open the `job.yml` file and set the `compute` field to the name of the compute cluster you created in the prerequisites section. The workspace and resource group are passed on the command line when you submit the job in step 4, and the subscription comes from your Azure CLI defaults (or the `--subscription` flag).

2. Open the `src/main.py` file and do the following:
    - Modify the curriculum learning function (`curriculum_fn()`) to return a new task (or difficulty level) for your environment based on some criteria. For example, you can set a threshold on the average episode reward as a measure of difficulty.
    - Adjust the `train()` function parameters, such as `trainable`, `rollouts`, and `stopping_criteria`, according to your desired strategy.
    - Modify the `CurriculumCallback()` class to log the current task of the environment to TensorBoard. This class can also implement other methods to customize the training behavior, such as `on_train_result`, `on_episode_end`, etc. For example, you can log other metrics, save checkpoints, or update hyperparameters based on the curriculum learning progress.

3. Open your custom simulation environment file (`src/sim_curriculum_capable.py`) and make sure it inherits from the `TaskSettableEnv` class from Ray RLlib and implements its methods, such as `get_task()` and `set_task()`. These methods are used by the curriculum learning function and callback to get and set the current task of the environment. The task should be a dictionary containing whatever information defines the difficulty of the environment, such as the number of obstacles, the size of the grid, or the speed of the agent. A minimal skeleton of such an environment is sketched after this list.

4. Launch the job using the Azure CLI:
```bash
az ml job create -f job.yml --workspace-name $YOUR_WORKSPACE --resource-group $YOUR_RESOURCE_GROUP
```

5. Check that the job is running by finding it in AML studio. You should see Ray writing logs under the Outputs + logs tab, in the user_logs folder.

6. Monitor the curriculum learning progress and results in AML studio or with TensorBoard on your local machine. You should see a custom metric called `task` that shows the current difficulty level of the environment for each episode.

7. Once the job has completed, download the model checkpoints from AML studio under the Outputs + logs tab of your job, in the outputs folder.
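
As referenced in step 3, the sketch below shows the minimal shape of a curriculum-capable environment: a `TaskSettableEnv` subclass whose `get_task()`/`set_task()` expose the current difficulty and whose `reset()` uses it. It is a trimmed illustration modeled on `src/sim_curriculum_capable.py`, with placeholder dynamics you would replace with your own; the class name `MyCurriculumEnv` is just an example.

```python
import numpy as np
from gymnasium.spaces import Box, Dict
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv


class MyCurriculumEnv(TaskSettableEnv):
    """Minimal curriculum-capable environment (illustrative placeholder logic)."""

    def __init__(self, env_config):
        self.observation_space = Dict({"value": Box(low=-np.inf, high=np.inf)})
        self.action_space = Dict({"addend": Box(low=-10, high=10, dtype=np.int32)})
        self.task = {"exponent": 1}  # the task dict encodes the current difficulty

    def get_task(self):
        # Read by the curriculum function (env_task_fn) after each training iteration
        return self.task

    def set_task(self, task):
        # Called by RLlib when the curriculum function returns a new task
        self.task = task

    def reset(self, *, seed=None, options=None):
        # Use the task to shape the initial conditions: harder tasks start farther from 50
        spread = 2 ** self.task["exponent"]
        self.value = 50 + np.random.randint(-spread, spread)
        self.steps = 0
        return {"value": np.array([self.value], dtype=np.float32)}, {}

    def step(self, action):
        self.value += action["addend"].item()
        self.steps += 1
        obs = {"value": np.array([self.value], dtype=np.float32)}
        reward = -abs(self.value - 50)  # placeholder reward: negative distance to target
        terminated = self.value == 50
        truncated = self.steps >= 10
        return obs, reward, terminated, truncated, {}
```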

## Next Steps
Now that you've successfully trained an agent using curriculum learning, you can experiment with different ways to design and evaluate the curriculum. For example, you can use reward, entropy, uncertainty, or diversity as the measure of difficulty.
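
For instance, here is a hedged sketch of an alternative promotion rule you could drop into `src/main.py` in place of the existing `curriculum_fn`: it keeps the same signature RLlib expects, but uses a per-task reward threshold table and caps the maximum difficulty. The threshold values are made up for illustration and would need tuning for your environment.

```python
# Variant of curriculum_fn: per-task reward thresholds plus a difficulty cap.
# The threshold values below are illustrative only, not tuned.
PROMOTION_THRESHOLDS = {1: 5.0, 2: 3.0, 3: 1.0}  # task exponent -> required mean reward
MAX_EXPONENT = 5


def curriculum_fn(train_results, task_settable_env, env_ctx):
    task = task_settable_env.get_task()
    exponent = task["exponent"]
    mean_reward = train_results["episode_reward_mean"]
    threshold = PROMOTION_THRESHOLDS.get(exponent, 0.0)
    if mean_reward >= threshold and exponent < MAX_EXPONENT:
        # Promote to the next, wider range of initial states
        return {"exponent": exponent + 1}
    # Otherwise keep practicing the current range
    return task
```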

To learn more about how to use your trained agent, check out our [deploy-agent sample](https://github.com/Azure/plato/tree/main/examples/deploy-agent), which shows you how to deploy a trained agent and interact with it.
18 changes: 18 additions & 0 deletions examples/curriculum-learning/conda.yml
@@ -0,0 +1,18 @@
channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.10.11
  - pip=23.0.1
  - pip:
      # Dependencies for Ray on AML
      - azureml-mlflow
      - azureml-defaults
      - ray-on-aml
      - ray[data,rllib]==2.5.0
      # Deps for RLlib
      - torch==2.0.1
      - tensorflow_probability==0.19.0
      # Dependencies for the Simple Adder
      - gymnasium==0.26.3
      - numpy==1.24.3
16 changes: 16 additions & 0 deletions examples/curriculum-learning/job.yml
@@ -0,0 +1,16 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python main.py
environment: azureml:curriculum-learning-env@latest
compute: azureml:env-medium
display_name: curriculum-learning
experiment_name: curriculum-learning
description: Run curriculum learning and log metrics.
# Needed for using Ray on AML
distribution:
  type: mpi
# Modify the following and num_workers in main.py to use more workers
resources:
  instance_count: 1
177 changes: 177 additions & 0 deletions examples/curriculum-learning/src/main.py
@@ -0,0 +1,177 @@
"""
Adapted from
https://github.com/ray-project/ray/blob/master/rllib/examples/curriculum_learning.py
to use the simple adder simulation environment.
Example of a curriculum learning setup using the `TaskSettableEnv` API
and the env_task_fn config.
This example shows:
- Writing your own curriculum-capable environment using gym.Env.
- Defining an env_task_fn that determines whether and which new task
the env(s) should be set to (using the TaskSettableEnv API).
- Using Tune and RLlib to curriculum-learn this env.
You can visualize experiment results in ~/ray_results using TensorBoard locally,
or via AML performance metrics if you run this script on AML.
"""
import argparse
import os
import sys

from azureml.core import Run
from ray import air, tune
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv, TaskType
from ray.rllib.env.env_context import EnvContext
from ray.tune.registry import register_env
from ray_on_aml.core import Ray_On_AML

# IMPORTANT: Remember to change this import to point at your own simulation environment
from sim_curriculum_capable import SimpleAdder as CurriculumCapableEnv

register_env("curriculum_env", lambda config: CurriculumCapableEnv(config))


# Define an env_task_fn that returns a new task based on some criteria
def curriculum_fn(
    train_results: dict, task_settable_env: TaskSettableEnv, env_ctx: EnvContext
) -> TaskType:
    """Function returning a possibly new task to set `task_settable_env` to.

    Args:
        train_results (dict): The train results returned by Algorithm.train().
        task_settable_env (TaskSettableEnv): A single TaskSettableEnv object
            used inside any worker and at any vector position. Use `env_ctx`
            to get the worker_index, vector_index, and num_workers.
        env_ctx (EnvContext): The env context object (i.e. env's config dict plus
            properties worker_index, vector_index and num_workers) used to setup
            the `task_settable_env`.

    Returns:
        TaskType: The task to set the env to. This may be the same as the
        current one.
    """
    # With each task, the initial state value will be between (50 - 2**exponent)
    # and (50 + 2**exponent):
    # Task 1: randomly sample a number between 48 and 52
    # Task 2: randomly sample a number between 46 and 54
    # We will thus increase the task number each time we hit the reward threshold.
    # Define the reward threshold for advancing to the next task
    reward_threshold = 0
    # Get the current task level
    task_exponent = task_settable_env.get_task()["exponent"]
    # Get the average episode reward over the last training iteration
    avg_reward = train_results["episode_reward_mean"]
    # If the average reward is above or equal to the threshold, increase the
    # task's exponent
    if avg_reward >= reward_threshold:
        # Increase the task level by 1
        return {"exponent": task_exponent + 1}
    else:
        # Keep the same task level
        return task_settable_env.get_task()


class CurriculumCallback(DefaultCallbacks):
    """A custom callback class that logs the current task of the environment to
    TensorBoard and Azure ML.

    This class inherits from the DefaultCallbacks class provided by RLlib and
    overrides the on_episode_start and on_train_result methods to access the
    curriculum "task" information from the base environment and the episode
    object, and log it to both TensorBoard and Azure ML.
    """

    def __init__(self):
        super().__init__()
        self.run = Run.get_context()

    def on_episode_start(
        self, *, worker, base_env, policies, episode, env_index, **kwargs
    ):
        # Get the current task of the sim
        task = base_env.get_sub_environments()[env_index].get_task()
        # Log the task to TensorBoard
        episode.custom_metrics["task"] = task["exponent"]

    def on_train_result(self, *, algorithm, result: dict, **kwargs):
        """Called at the end of Algorithm.train().

        Args:
            algorithm: Current Algorithm instance.
            result: Dict of results returned from Algorithm.train() call.
                You can mutate this object to add additional metrics.
            kwargs: Forward compatibility placeholder.
        """
        print(
            "Algorithm.train() result: {} -> {} episodes".format(
                algorithm, result["episodes_this_iter"]
            )
        )
        # Log metrics to TensorBoard
        super().on_train_result(algorithm=algorithm, result=result, **kwargs)

        # Filter the results dictionary to only log metrics with the substring "episode"
        to_log = {
            k: v for k, v in result.items() if "episode" in k and "media" not in k
        }
        # Add the curriculum task to the dictionary
        to_log["task"] = result["custom_metrics"]["task_mean"]
        # Log metrics to Azure ML
        for k, v in to_log.items():
            self.run.log(name=k, value=v)


def train():
    # Define a config object with the desired parameters
    param_space = {
        "env": "curriculum_env",
        "env_task_fn": curriculum_fn,
        "framework": "torch",
        # IMPORTANT: Change num_workers to scale training
        "num_workers": 1,
        # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "callbacks": CurriculumCallback,
    }

    stopping_criteria = {
        "training_iteration": 300,
        "timesteps_total": 100000,
        # "episode_reward_mean": 0,
    }

    # Build the algorithm from the config and pass it to the tune.Tuner constructor
    tuner = tune.Tuner(
        "PPO",
        param_space=param_space,
        run_config=air.RunConfig(
            stop=stopping_criteria,
            verbose=2,
        ),
    )

    results = tuner.fit()

    return results


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Pass --test-local to run a quick test on your local machine;
    # without it, the script expects to run on an AML Ray cluster.
    parser.add_argument("--test-local", action="store_true")
    args = parser.parse_args()

    if args.test_local:
        train()
        sys.exit()

    ray_on_aml = Ray_On_AML()
    ray = ray_on_aml.getRay()

    if ray:
        print("head node detected")
        ray.init(address="auto")
        print(ray.cluster_resources())
        train()
        ray.shutdown()
    else:
        print("in worker node")
91 changes: 91 additions & 0 deletions examples/curriculum-learning/src/sim_curriculum_capable.py
@@ -0,0 +1,91 @@
"""Implementation of a simple simulation/environment in AML."""
import numpy as np

# from gymnasium import Env
from gymnasium.spaces import Box, Dict

# Import TaskSettableEnv from RLlib
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv
from ray.rllib.utils.annotations import override


class SimpleAdder(TaskSettableEnv):
    """
    Implement a SimpleAdder as a custom Gymnasium environment.

    Details on which attributes and methods are required for the integration
    can be found in the docs.

    The environment has a pretty simple state and action space. The state is
    composed of a single integer number. The action is an integer number
    between -10 and 10. At the start of each episode, the state is initialized
    within +/- 2**exponent of the target value 50, where the exponent is given
    by the current curriculum task. At each iteration the agent chooses a
    number between -10 and 10, which is added to the state. The purpose of the
    simulation is to get the state equal to 50, at which point the episode
    terminates. The episode duration is limited to 10 iterations.
    """

    def __init__(self, env_config):
        self.observation_space = Dict(
            {"value": Box(low=-float("inf"), high=float("inf"))}
        )
        self.action_space = Dict({"addend": Box(low=-10, high=10, dtype=np.int32)})

        # Initialize the task exponent attribute to 1
        self.exponent = 1

    def _get_obs(self):
        """Get the observable state."""
        return {"value": np.array([self.state["value"]])}

    def _get_info(self):
        """Get additional info not needed by the agent's decision."""
        return {}

    def reward(self, state):
        """
        Return the reward value.

        For this simple example this is just the negative distance to the
        number 50. We add 10 (maximum steps per episode) to the reward and
        subtract the current step to encourage finishing the episode as fast
        as possible.
        """
        return -abs(state["value"] - 50) + 10 - self.iter

    def reset(self, *, seed=None, options=None):
        """Start a new episode."""
        self.iter = 0
        # Get the current task (curriculum level)
        task = self.get_task()
        # Get the exponent of 2 for the task
        exponent = task["exponent"]
        # Initialize the state value randomly within +/- 2**exponent of the target 50
        self.state = {"value": 50 + np.random.randint(-(2**exponent), 2**exponent)}
        return self._get_obs(), self._get_info()

    def step(self, action):
        """Advance one iteration by applying the given ``action``."""
        self.state["value"] += action["addend"].item()
        self.iter += 1
        reward = self.reward(self.state)
        terminated = self.state["value"] == 50
        truncated = self.iter >= 10
        return (
            self._get_obs(),
            reward,
            terminated,
            truncated,
            self._get_info(),
        )

    @override(TaskSettableEnv)
    def get_task(self):
        """Get the current task (curriculum level)."""
        # Return the current exponent value as the task
        return {"exponent": self.exponent}

    @override(TaskSettableEnv)
    def set_task(self, task):
        """Set a new task for this sim env."""
        # Set the exponent value based on the task
        self.exponent = task["exponent"]
