Add curriculum learning example using simple adder (#47)
* Add curriculum learning example
* fix bug in sim for spaces.Dict

Co-authored-by: Jazmia Henry <48301423+jazmiahenry@users.noreply.github.com>
1 parent be868f9 · commit 2702410
Showing 5 changed files with 368 additions and 0 deletions.
examples/curriculum-learning/README.md (66 additions)
# Curriculum Learning

In this example, we show how to use curriculum learning to train an RL agent on Azure ML with a custom Gymnasium environment (“Simple Adder”). Curriculum learning is a technique that orders training tasks according to some measure of difficulty and gradually exposes the agent to harder episodes as it learns.
### What this sample covers
- How to modify a custom Gymnasium simulation environment to use curriculum learning with RLlib
- How to implement curriculum learning on your local machine and on Azure ML

### What this sample does not cover
- How to create an optimized curriculum for best performance
- How to evaluate the agent
- How to deploy the agent

## Prerequisites

- Install the Azure CLI on your machine:
  ```
  pip install azure-cli
  ```
- Add the ML extension:
  ```
  az extension add -n ml
  ```
- [Create an AML workspace and compute cluster](https://azure.github.io/plato/#create-azure-resources)
- Create an AML environment using the conda file provided (``conda.yml``) by running the following command:
  ```bash
  az ml environment create --name curriculum-learning-env --conda-file conda.yml --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04 --resource-group $YOUR_RESOURCE_GROUP --workspace-name $YOUR_WORKSPACE
  ```
## Example Overview
The simulation environment in this sample is "Simple Adder" (`src/sim_curriculum_capable.py`), where the agent has to choose a number between -10 and 10 to add to a state value. The goal is to make the state value equal to 50 within 10 time steps. The difficulty depends on how far the state value starts from 50: larger starting distances make the task harder. The curriculum learning strategy is to gradually expand the range of possible starting values around 50, increasing the level of difficulty. The agent starts training on the easiest range (smallest distances) and, after reaching a certain average reward threshold, progresses to the next range, which includes the previous one. This repeats until the agent reaches the hardest range (largest distances) or the maximum number of training iterations.
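To make this concrete, the difficulty levels in this sample widen the band of possible starting values around 50 as powers of two: the curriculum "task" is an exponent, and level `n` starts episodes within roughly `2**n` of the target. The short sketch below is for illustration only and is not part of the sample files:

```python
# Illustration only: how the curriculum "exponent" widens the band of
# initial state values around the target of 50.
for exponent in range(1, 6):
    low, high = 50 - 2**exponent, 50 + 2**exponent
    # The sim draws the start value with numpy's randint, whose upper
    # bound is exclusive.
    print(f"task exponent {exponent}: start value in [{low}, {high})")
```

Level 1 therefore starts episodes at most 2 away from the target, while level 5 allows starts roughly 32 away.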
## Run Locally
As a preliminary step, you should check that the simulation works on your local machine to save precious development time. The `main.py` script in the `src` folder allows you to test locally with the following command:

```
python main.py --test-local
```
## Tutorial: Run on AML
After you have checked that the simulation works properly, follow these steps to train an RL agent on AML using curriculum learning:

1. Open the `conda.yaml` file and fill in the values for the AML `workspace_name`, `resource_group`, `subscription_id`, and `compute_target_name` with the ones you created in the prerequisites section. These values are used to connect to your AML workspace and compute cluster.
2. Open the `src/main.py` file and do the following:
   - Modify the curriculum learning function (`curriculum_fn()`) to return a new task (or difficulty level) for your environment based on some criteria. For example, you can set a threshold on the average episode reward as a measure of difficulty; a minimal variant is sketched after this list.
   - Adjust the settings in the `train()` function, such as the trainable (`"PPO"`), the number of rollout workers (`num_workers`), and the `stopping_criteria`, according to your desired strategy.
   - Modify the `CurriculumCallback()` class to log the current task of the environment to TensorBoard. This class can also implement other methods to customize the training behavior, such as `on_train_result`, `on_episode_end`, etc. For example, you can log other metrics, save checkpoints, or update hyperparameters based on the curriculum learning progress.
3. Open your custom simulation environment file (`src/sim_curriculum_capable.py`) and make sure it inherits from the `TaskSettableEnv` class from Ray RLlib and implements its methods, such as `get_task()` and `set_task()`. These methods are used by the curriculum learning function and callback to get and set the current task of the environment. The task should be a dictionary that contains any information that defines the difficulty of the environment, such as the number of obstacles, the size of the grid, the speed of the agent, etc.
4. Launch the job using the Azure CLI:
   ```
   az ml job create -f job.yml --workspace-name $YOUR_WORKSPACE --resource-group $YOUR_RESOURCE_GROUP
   ```
5. Check that the job is running by finding it in AML studio. You should see Ray writing logs to the `user_logs` folder under the Outputs + logs tab.

6. Monitor the curriculum learning progress and results in AML studio or with TensorBoard on your local machine. You should see a custom metric called “task” that shows the current difficulty level of the environment for each episode.
7. Once the job is completed, download the model checkpoints from AML studio: they are in the `outputs` folder under the Outputs + logs tab of your job.
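As referenced in step 2, here is a minimal sketch of a `curriculum_fn` variant. It keeps the sample's reward-threshold criterion but also caps the exponent so the task stops growing after a chosen hardest level; the threshold and cap values are illustrative assumptions, not part of the sample:

```python
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv, TaskType
from ray.rllib.env.env_context import EnvContext


def capped_curriculum_fn(
    train_results: dict, task_settable_env: TaskSettableEnv, env_ctx: EnvContext
) -> TaskType:
    """Advance one difficulty level on hitting a reward threshold, up to a cap."""
    reward_threshold = 0  # illustrative value
    max_exponent = 5  # stop once start values span roughly 50 +/- 32
    task = task_settable_env.get_task()
    if (
        train_results["episode_reward_mean"] >= reward_threshold
        and task["exponent"] < max_exponent
    ):
        return {"exponent": task["exponent"] + 1}
    return task
```

To try it, pass `capped_curriculum_fn` as the `env_task_fn` entry of the parameter space in `train()`.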
## Next Steps
Now that you've successfully trained an agent using curriculum learning, you can experiment with different ways to design and evaluate curriculum learning. For example, you can use reward, entropy, uncertainty, or diversity as measures of difficulty.
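For instance, a curriculum criterion based on policy entropy rather than mean reward could look like the sketch below. The location of the entropy statistic inside RLlib's result dict is an assumption for the setup used here (PPO on Ray 2.5) and may differ across versions, which is why the lookup is defensive:

```python
def entropy_curriculum_fn(train_results, task_settable_env, env_ctx):
    """Advance the task once the policy has become confident (low entropy)."""
    entropy_threshold = 0.5  # illustrative value, tune for your environment
    task = task_settable_env.get_task()
    # Assumed path to PPO's mean entropy in the training results.
    learner_stats = (
        train_results.get("info", {})
        .get("learner", {})
        .get("default_policy", {})
        .get("learner_stats", {})
    )
    entropy = learner_stats.get("entropy")
    if entropy is not None and entropy < entropy_threshold:
        return {"exponent": task["exponent"] + 1}
    return task
```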
To learn more about how to use your trained agent, check out our [deploy-agent sample](https://github.com/Azure/plato/tree/main/examples/deploy-agent), which shows you how to deploy a trained agent and interact with it.
examples/curriculum-learning/conda.yml (18 additions)
channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.10.11
  - pip=23.0.1
  - pip:
      # Dependencies for Ray on AML
      - azureml-mlflow
      - azureml-defaults
      - ray-on-aml
      - ray[data,rllib]==2.5.0
      # Dependencies for RLlib
      - torch==2.0.1
      - tensorflow_probability==0.19.0
      # Dependencies for the Simple Adder
      - gymnasium==0.26.3
      - numpy==1.24.3
examples/curriculum-learning/job.yml (16 additions)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python main.py
environment: azureml:curriculum-learning-env@latest
compute: azureml:env-medium
display_name: curriculum-learning
experiment_name: curriculum-learning
description: Run curriculum learning and log metrics.
# Needed for using Ray on AML
distribution:
  type: mpi
# Modify the following and num_workers in main.py to use more workers
resources:
  instance_count: 1
examples/curriculum-learning/src/main.py (177 additions)
""" | ||
Adapted from | ||
https://github.com/ray-project/ray/blob/master/rllib/examples/curriculum_learning.py | ||
to use the simple adder simulation environment. | ||
Example of a curriculum learning setup using the `TaskSettableEnv` API | ||
and the env_task_fn config. | ||
This example shows: | ||
- Writing your own curriculum-capable environment using gym.Env. | ||
- Defining an env_task_fn that determines whether and which new task | ||
the env(s) should be set to (using the TaskSettableEnv API). | ||
- Using Tune and RLlib to curriculum-learn this env. | ||
You can visualize experiment results in ~/ray_results using TensorBoard locally, | ||
or via AML performance metrics if you run this script on AML. | ||
""" | ||
import argparse | ||
import os | ||
import sys | ||
|
||
from azureml.core import Run | ||
from ray import air, tune | ||
from ray.rllib.algorithms.callbacks import DefaultCallbacks | ||
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv, TaskType | ||
from ray.rllib.env.env_context import EnvContext | ||
from ray.tune.registry import register_env | ||
from ray_on_aml.core import Ray_On_AML | ||
|
||
# IMPORTANT: Remember to change it for your own simulation environment | ||
from sim_curriculum_capable import SimpleAdder as CurriculumCapableEnv | ||
|
||
register_env("curriculum_env", lambda config: CurriculumCapableEnv(config)) | ||
|
||
|
||
# Define an env_task_fn that returns a new task based on some criteria
def curriculum_fn(
    train_results: dict, task_settable_env: TaskSettableEnv, env_ctx: EnvContext
) -> TaskType:
    """Function returning a possibly new task to set `task_settable_env` to.

    Args:
        train_results (dict): The train results returned by Algorithm.train().
        task_settable_env (TaskSettableEnv): A single TaskSettableEnv object
            used inside any worker and at any vector position. Use `env_ctx`
            to get the worker_index, vector_index, and num_workers.
        env_ctx (EnvContext): The env context object (i.e. env's config dict plus
            properties worker_index, vector_index and num_workers) used to setup
            the `task_settable_env`.

    Returns:
        TaskType: The task to set the env to. This may be the same as the current one.
    """
    # With each task, the initial state value will be between (50 - 2**exponent)
    # and (50 + 2**exponent):
    #   Task 1: randomly sample a number between 48 and 52
    #   Task 2: randomly sample a number between 46 and 54
    # We thus increase the task exponent each time we hit the reward threshold.
    # Reward threshold for advancing to the next task
    reward_threshold = 0
    # Get the current task level
    task_exponent = task_settable_env.get_task()["exponent"]
    # Get the average episode reward over the last training iteration
    avg_reward = train_results["episode_reward_mean"]
    # If the average reward is above or equal to the threshold, increase the task's
    # exponent
    if avg_reward >= reward_threshold:
        # Increase the task level by 1
        return {"exponent": task_exponent + 1}
    else:
        # Keep the same task level
        return task_settable_env.get_task()

class CurriculumCallback(DefaultCallbacks):
    """A custom callback class that logs the current task of the environment to
    tensorboard and Azure ML.

    This class inherits from the DefaultCallbacks class provided by RLlib and
    overrides the on_episode_start and on_train_result methods to access the
    curriculum "task" information from the base environment and the episode object,
    and log it to both tensorboard and Azure ML.
    """

    def __init__(self):
        super().__init__()
        self.run = Run.get_context()

    def on_episode_start(
        self, *, worker, base_env, policies, episode, env_index, **kwargs
    ):
        # Get the current task of the sim
        task = base_env.get_sub_environments()[env_index].get_task()
        # Log the task to tensorboard
        episode.custom_metrics["task"] = task["exponent"]

    def on_train_result(self, *, algorithm, result: dict, **kwargs):
        """Called at the end of Algorithm.train().

        Args:
            algorithm: Current Algorithm instance.
            result: Dict of results returned from Algorithm.train() call.
                You can mutate this object to add additional metrics.
            kwargs: Forward compatibility placeholder.
        """
        print(
            "Algorithm.train() result: {} -> {} episodes".format(
                algorithm, result["episodes_this_iter"]
            )
        )
        # Log metrics to TensorBoard
        super().on_train_result(algorithm=algorithm, result=result, **kwargs)

        # Filter the results dictionary to only log metrics with the substring
        # "episode"
        to_log = {
            k: v for k, v in result.items() if "episode" in k and "media" not in k
        }
        # Add the curriculum task to the dictionary
        to_log["task"] = result["custom_metrics"]["task_mean"]
        # Log metrics to Azure ML
        for k, v in to_log.items():
            self.run.log(name=k, value=v)

def train():
    # Define a config object with the desired parameters
    param_space = {
        "env": "curriculum_env",
        "env_task_fn": curriculum_fn,
        "framework": "torch",
        # IMPORTANT: Change num_workers to scale training
        "num_workers": 1,
        # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "callbacks": CurriculumCallback,
    }

    stopping_criteria = {
        "training_iteration": 300,
        "timesteps_total": 100000,
        # "episode_reward_mean": 0,
    }

    # Set up a Tuner that runs PPO with the parameter space and stopping criteria above
    tuner = tune.Tuner(
        "PPO",
        param_space=param_space,
        run_config=air.RunConfig(
            stop=stopping_criteria,
            verbose=2,
        ),
    )

    results = tuner.fit()

    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Pass --test-local to run a quick training test on your local machine
    # instead of setting up Ray on AML.
    parser.add_argument("--test-local", action="store_true", default=False)
    args = parser.parse_args()

    if args.test_local:
        train()
        sys.exit()

    ray_on_aml = Ray_On_AML()
    ray = ray_on_aml.getRay()

    if ray:
        print("head node detected")
        ray.init(address="auto")
        print(ray.cluster_resources())
        train()
        ray.shutdown()
    else:
        print("in worker node")
examples/curriculum-learning/src/sim_curriculum_capable.py (91 additions)
"""Implementation of a simple simulation/environment in AML.""" | ||
import numpy as np | ||
|
||
# from gymnasium import Env | ||
from gymnasium.spaces import Box, Dict | ||
|
||
# Import TaskSettableEnv from RLlib | ||
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv | ||
from ray.rllib.utils.annotations import override | ||
|
||
|
||
class SimpleAdder(TaskSettableEnv): | ||
""" | ||
Implement a SimpleAdder as a custom Gymnasium environment. | ||
Details on which attributes and methods are required for the integration | ||
can be found in the docs. | ||
The environment has a pretty simple state and action space. The state is | ||
composed of an integer numbers. The action is composed of an integer number | ||
between -10 and 10. At each episode, the state number is initialized between | ||
0 and 100, and at each iteration the agent chooses a number between -10 and 10. | ||
The chosen number is added to the state. The purpose of the simulation is to | ||
get the state equal to 50, at which point the episode terminates. The episode | ||
duration is limited to 10 iterations. | ||
""" | ||
|
||
def __init__(self, env_config): | ||
self.observation_space = Dict( | ||
{"value": Box(low=-float("inf"), high=float("inf"))} | ||
) | ||
self.action_space = Dict({"addend": Box(low=-10, high=10, dtype=np.int32)}) | ||
|
||
# Initialize the task exponent attribute to 1 | ||
self.exponent = 1 | ||
|
||
def _get_obs(self): | ||
"""Get the observable state.""" | ||
return {"value": np.array([self.state["value"]])} | ||
|
||
def _get_info(self): | ||
"""Get additional info not needed by the agent's decision.""" | ||
return {} | ||
|
||
def reward(self, state): | ||
""" | ||
Return the reward value. | ||
For this simple example this is just the distance to the number 50. | ||
We add 10 (maximum steps per episode) to the reward and subtract the | ||
current step to encourage to finish the episode as fast as possible. | ||
""" | ||
return -abs(state["value"] - 50) + 10 - self.iter | ||
|
||
def reset(self, *, seed=None, options=None): | ||
"""Start a new episode.""" | ||
self.iter = 0 | ||
# Get the current task (curriculum level) | ||
task = self.get_task() | ||
# Get the exponent of 2 for the task | ||
exponent = task["exponent"] | ||
# Initialize the state value randomly between +/- 2**exponent from target of 50 | ||
self.state = {"value": 50 + np.random.randint(-(2**exponent), 2**exponent)} | ||
return self._get_obs(), self._get_info() | ||
|
||
def step(self, action): | ||
"""Advance one iteration by applying the given ``action``.""" | ||
self.state["value"] += action["addend"].item() | ||
self.iter += 1 | ||
reward = self.reward(self.state) | ||
terminated = self.state["value"] == 50 | ||
truncated = self.iter >= 10 | ||
return ( | ||
self._get_obs(), | ||
reward, | ||
terminated, | ||
truncated, | ||
self._get_info(), | ||
) | ||
|
||
@override(TaskSettableEnv) | ||
def get_task(self): | ||
"""Implement this to get the current task (curriculum level).""" | ||
# Return the current exponent value as the task | ||
return {"exponent": self.exponent} | ||
|
||
@override(TaskSettableEnv) | ||
def set_task(self, task): | ||
"""Set a new task for this sim env.""" | ||
# Set the exponent value based on the task | ||
self.exponent = task["exponent"] |