
Develop generalization training #2232

Merged (70 commits) on Jul 25, 2019

Conversation

sankalp04 (Contributor):

Reopening a new PR since I accidentally closed the other one. This PR consists of all the edits from the previous PR and omits the lesson_controller class.

def __init__(self, intervals, **kwargs):
    self.intervals = intervals
    # Measure the length of the intervals
    self.interval_lengths = list(map(lambda x: abs(x[1] - x[0]), self.intervals))
Contributor:

I find list comprehensions more readable than maps and reduces (and `reduce` was even dropped from the builtins in Python 3). I think this works:

self.interval_lengths = [abs(x[1] - x[0]) for x in self.intervals]
self.cum_interval_length = sum(self.interval_lengths)
self.interval_weights = [x / self.cum_interval_length for x in self.interval_lengths]

and I also think you can keep interval_lengths and cum_interval_length as local variables.

Contributor Author:

Converted them to local variables
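
For reference, a minimal sketch of the resulting `__init__` after both suggestions, keeping only the weights on `self` (the surrounding class is assumed):

```python
def __init__(self, intervals, **kwargs):
    self.intervals = intervals
    # Measure the length of each interval, then normalize into weights.
    interval_lengths = [abs(x[1] - x[0]) for x in self.intervals]
    cum_interval_length = sum(interval_lengths)
    self.interval_weights = [x / cum_interval_length for x in interval_lengths]
```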

    If self.samplers is empty, then bool of it returns false, indicating
    there is no sampler manager.
    """
    return not bool(self.sampler_manager.samplers)
Contributor:

How about making this a method or property on SamplerManager instead?

Contributor:

+1 on making check_empty_sampler_manager part of SamplerManager

Contributor Author:

Made the check part of the SamplerManager class
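
For reference, a minimal sketch of what that might look like on the class (the exact method name is an assumption):

```python
class SamplerManager:
    def __init__(self, samplers):
        # samplers: dict mapping reset-parameter names to Sampler instances
        self.samplers = samplers

    def is_empty(self) -> bool:
        # If self.samplers is empty, bool() of it returns False, indicating
        # there is no sampler manager.
        return not bool(self.samplers)
```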

if changed:
    self.trainers[brain_name].reward_buffer.clear()

if ( ((self.meta_curriculum) and any(lessons_incremented.values()))
Contributor:

This is way too complicated. If you're actually checking what the comment says, then split it up into something like

lessons_were_incremented = ...
ready_for_reset = ...
if lessons_were_incremented or ready_for_reset:
    # do stuff

(also watch the modulo by zero)

Contributor Author:

Added a check that `global_step` isn't 0, to ensure modulo safety
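
For reference, a sketch completing the reviewer's outline, with the zero-step guard, written as a standalone helper (the helper name and exact arguments are assumptions):

```python
def ready_to_reset(meta_curriculum, lessons_incremented, resampling_interval, global_step):
    """Split-up version of the reset condition, with a modulo-by-zero guard."""
    lessons_were_incremented = meta_curriculum is not None and any(
        lessons_incremented.values()
    )
    ready_for_resample = (
        resampling_interval is not None
        and global_step != 0
        and global_step % resampling_interval == 0
    )
    return lessons_were_incremented or ready_for_resample
```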

@@ -37,6 +39,8 @@ def __init__(
    external_brains: Dict[str, BrainParameters],
    training_seed: int,
    fast_simulation: bool,
    sampler_manager,
@chriselion (Contributor), Jul 8, 2019:

type annotation here: sampler_manager: SamplerManager,

Contributor Author:

Added type annotation

@@ -73,6 +76,21 @@ def run_training(sub_id: int, run_seed: int, run_options, process_queue):
        docker_target_name=docker_target_name
    )

sampler = None
Contributor:

nit: I think sampler_config is a better name for this. sampler implies it's an instance of a Sampler


def sample_all(self):
    res = {}
    if self.samplers == {}:
Contributor:

I'm not sure this is what you want here; it's definitely more pythonic to do `if not self.samplers:`. But even then, you don't need this `if`, since a for loop will just iterate over 0 items.

if self.samplers == {}:
    pass
else:
    for param_name, param_sampler in list(self.samplers.items()):
Contributor:

Don't need the list() here (only if we were trying for some sort of python2+3 compatibility).
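
Folding in both suggestions, a minimal sketch of the simplified method (assuming `sample_parameter()` is the Sampler interface):

```python
def sample_all(self):
    res = {}
    # Iterating the dict directly handles the empty case (zero iterations),
    # and items() needs no list() wrapper in Python 3.
    for param_name, param_sampler in self.samplers.items():
        res[param_name] = param_sampler.sample_parameter()
    return res
```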

@@ -203,20 +215,30 @@ def _create_model_path(model_path):
            "permissions are set correctly.".format(model_path)
        )

    @staticmethod
    def _check_reset_params(reset_params, new_config):
Contributor:

Just checking, where is this method called?

Contributor Author:

Leftover code from the use of lesson_controller; it's not needed anymore, so it's removed in the next update.


from .exception import SamplerException

class SamplerException(Exception):
Contributor:

BTW this should inherit from UnityException as with all of the other exception classes

Contributor Author:

This was dead code I was using for testing; it's removed in the next update.
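
For reference, a hedged sketch of the suggested inheritance fix; the import path for UnityException is an assumption based on the package layout at the time:

```python
# Import path is an assumption; UnityException is the common base used by
# the other exception classes in ml-agents.
from mlagents.envs.exception import UnityException


class SamplerException(UnityException):
    """Raised for errors in the sampler configuration or usage."""
    pass
```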


Reinforcement learning has a rather unique setup as opposed to supervised and
unsupervised learning. Agents here are trained and tested on the same exact environment,
which is analogous to a model being trained and tested on an identical dataset in supervised learning! This setting results in overfitting; the inability of the agent to generalize to slight tweaks or variations in the environment. This is problematic in instances when environments are randomly instantiated with varying properties. To make agents more robust, we train an agent over multiple variations of the environment. The agent is trained with the intent that it learns to maintain a minimum performance regardless of the environment variant and that it generalizes to maintain this in unseen future variants of the environment.
Contributor:

Lines 3 and 4 are wrapped but line 5 is not. You need to make sure that the number of characters per line is consistent throughout the document.


_Variations of the 3D Ball environment._

To vary the environments, we first decided what the reset parameters for the environment. These parameters are known as `Reset Parameters`, introduced in [Curriculum Learning](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-Curriculum-Learning.md). In the 3D ball environment example displayed in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`.
Contributor:

You say "we first decided what the reset parameters are" and then introduce the Reset Parameters in the following sentence; this is confusing. Also, reset parameters were not introduced in Curriculum Learning, so I would say Curriculum Learning also uses reset parameters rather than introduced them.


Contributor:

"To vary the environments, we first decided what the reset parameters for the environment." This is grammatically incorrect, I think.

## In-practice

To test the effectiveness of this training procedure, we train 3 models over 50000 steps:
1. Model Trained on Default reset parameter values (Default model).
Contributor:

This numbering does not look good in the markdown. I think you need to skip a line somewhere for it to be a numbered list.


2. Model Trained on a range of reset parameter values (Random model). Reset parameter values are picked uniformly over the range and the model is trained on each configuration for 5000 steps before they are randomly sampled again. The range consists of the default values.
Contributor:

"The range consists of the default values" --> What does this mean?


For generalization training, we need to provide a way to modify the environment by supplying a set of reset parameters. This provision can be either deterministic or randomized. Each reset parameter is assigned a sampler. If a sampler isn't provided for a reset parameter, the parameter maintains the default value throughout the training, remaining unchanged. The samplers for all the reset parameters are handled by a **Sampler Manager**, which is also responsible for generating a new set of values for the reset parameters when needed.

To setup the Sampler Manager, we setup a YAML file that specifies how we wish to generate new samples. In this file, we specify the samplers and the `resampling-duration` (number of training steps after which reset parameters are resampled). Below is an example of a sampler file for the 3D ball environment.
Contributor:

resampling duration does not exist in the YAML provided.


Contributor:

"number of training steps after which reset parameters are resampled" --> these are not necessarily training steps, they are just simulation steps.

```yaml
mass:
    sampler-type: "custom-sampler"
    argA: 1
```
Contributor:

Is the ordering of the args relevant, or only their names? This should be specified.
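
One plausible answer, sketched: if each sampler's YAML keys (minus `sampler-type`) are passed to the sampler as keyword arguments, only their names matter, not their order. `CustomSampler` here is hypothetical:

```python
class CustomSampler:
    def __init__(self, argA, argB=0, **kwargs):
        self.argA, self.argB = argA, argB


config = {"sampler-type": "custom-sampler", "argA": 1, "argB": 2}
# Strip the type tag, then bind the rest by keyword: ordering is irrelevant.
kwargs = {k: v for k, v in config.items() if k != "sampler-type"}
sampler = CustomSampler(**kwargs)
```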

We first begin with setting up the sampler file. After the sampler file is defined and configured, we proceed by launching `mlagents-learn` and specify our configured sampler file with the `--sampler` flag. To demonstrate, if we wanted to train a 3D ball agent with generalization using the `generalization-test.yaml` sampling setup, we can run

```sh
mlagents-learn config/trainer_config.yaml --sampler=config/generalize_test.yaml --run-id=3D-Ball-generalization --train
```
Contributor:

generalization-test.yaml =/= config/generalize_test.yaml

Contributor:

In the sample code you use config/generalize_test.yaml but in the text you use config/generalization-test.yaml

episode-length: 5000

mass:
    sampler-type: "uniform"
Contributor:

You provide multi-range, normal, and uniform sampling; you need to explain what they are and what their arguments correspond to.
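
For reference, a hedged sketch of what the three sampler types plausibly do; the uniform sampler's arguments follow the docstring quoted later in this review, the rest are assumptions:

```python
import random


class UniformSampler:
    """Uniformly draws a single sample in the range [min_value, max_value)."""

    def __init__(self, min_value, max_value, **kwargs):
        self.min_value, self.max_value = min_value, max_value

    def sample_parameter(self):
        return random.uniform(self.min_value, self.max_value)


class GaussianSampler:
    """Draws a sample from a normal distribution (argument names assumed)."""

    def __init__(self, mean, st_dev, **kwargs):
        self.mean, self.st_dev = mean, st_dev

    def sample_parameter(self):
        return random.gauss(self.mean, self.st_dev)


class MultiRangeUniformSampler:
    """Picks one of several [min, max) intervals, weighted by interval
    length, then samples uniformly within the chosen interval."""

    def __init__(self, intervals, **kwargs):
        self.intervals = intervals
        interval_lengths = [abs(x[1] - x[0]) for x in intervals]
        cum_interval_length = sum(interval_lengths)
        self.interval_weights = [x / cum_interval_length for x in interval_lengths]

    def sample_parameter(self):
        lo, hi = random.choices(self.intervals, weights=self.interval_weights)[0]
        return random.uniform(lo, hi)
```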

identical dataset in supervised learning! This setting results in overfitting;
the inability of the agent to generalize to slight tweaks or variations in the
environment. This is problematic in instances when environments are randomly
instantiated with varying properties. To make agents more robust, we train an
Contributor:

Would rephrase: "To make agents more robust, one approach is to train over multiple variations of the environment."

@@ -0,0 +1,123 @@
# Training Generalized Reinforcement Learning agents
@ervteng (Contributor), Jul 24, 2019:

nit: capitalize agents

instantiated with varying properties. To make agents more robust, we train an
agent over multiple variations of the environment. The agent is trained with
the intent that it learns to maintain a minimum performance regardless of the
environment variant and that it generalizes to maintain this in unseen future
Contributor:

Would rephrase: "... and that it generalizes to be robust to future unseen variants of the environment."

## How-to

For generalization training, we need to provide a way to modify the environment
by supplying a set of reset parameters. This provision can be either
Contributor:

To make this clearer, I'd break this into two paragraphs. First: "For generalization training, we need to provide a way to ... reset parameters, and vary them over time. The parameters could be chosen either deterministically or randomly."

2nd paragraph: "This is done by assigning each reset parameter a sampler, which (insert description of what a sampler does). ..."

resampled). Below is an example of a sampler file for the 3D ball environment.

```yaml
episode-length: 5000
```
Contributor:

Which one is it? `resampling-duration` or `episode-length`?


@@ -197,6 +200,7 @@ are conducting, see:
* [Training with PPO](Training-PPO.md)
* [Using Recurrent Neural Networks](Feature-Memory.md)
* [Training with Curriculum Learning](Training-Curriculum-Learning.md)
* [Training with Generalization](Training-Generalization-Learning.md)
Contributor:

Training with Generalization sounds weird to me. I would say Training with Environment Parameters Sampling

    Uniformly draws a single sample in the range [min_value, max_value).
    """

    def __init__(
Contributor:

You need a docstring on the init methods so we know what the arguments correspond to.
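
For example, a minimal sketch of the requested docstring on the uniform sampler's `__init__` (signature inferred from the excerpt above):

```python
def __init__(self, min_value: float, max_value: float, **kwargs) -> None:
    """
    :param min_value: inclusive lower bound of the sampling range.
    :param max_value: exclusive upper bound of the sampling range.
    """
    self.min_value = min_value
    self.max_value = max_value
```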

@@ -23,3 +23,5 @@ class MetaCurriculumError(TrainerError):
    """
    Any error related to the configuration of a metacurriculum.
    """

Contributor:

Extra line?

@@ -99,6 +107,8 @@ def run_training(
    lesson,
    run_seed,
    fast_simulation,
    sampler_manager,
    resampling_interval,
Contributor:

Is resampling duration the same as resampling interval? The naming needs to be consistent if this is the case.

@@ -108,6 +118,28 @@ def run_training(
    tc.start_learning(env, trainer_config)


def create_sampler_manager(sampler_file_path, env_reset_params):
Contributor:

Docstring?
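
For reference, a hedged sketch of the requested docstring; the described behavior is inferred from the rest of this thread:

```python
def create_sampler_manager(sampler_file_path, env_reset_params):
    """
    Load the sampler YAML at sampler_file_path, build a sampler for each
    reset parameter named in it, and return a SamplerManager wrapping them
    (along with the resampling interval, if one is specified). Reset
    parameters not named in the file keep their env_reset_params defaults.
    """
    ...
```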

@ervteng (Contributor) left a comment:

Approved - docs need a couple of fixes; we'll submit a separate PR for that. Remaining changes (they won't affect functionality):

  • Fix doc references to resampling_interval
  • Rename generalize_test.yaml to 3dball_generalize.yaml
  • Fix references to yaml in docs

@ervteng merged commit 5d7dd57 into develop on Jul 25, 2019
@github-actions bot locked as resolved and limited conversation to collaborators on May 18, 2021