Feature/mla 2205 separate schedule lr beta epsilon #5538

Merged
14 changes: 13 additions & 1 deletion com.unity.ml-agents/CHANGELOG.md
@@ -8,33 +8,45 @@ and this project adheres to

## [Unreleased]
### Major Changes

#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)

#### ml-agents / ml-agents-envs / gym-unity (Python)

### Minor Changes

#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
- Added the capacity to initialize behaviors from any checkpoint and not just the latest one (#5525)

#### ml-agents / ml-agents-envs / gym-unity (Python)
- Set gym version in gym-unity to gym release 0.20.0
- Added support for having `beta`, `epsilon`, and `learning rate` on separate schedules (affects only PPO and POCA). (#5538)

### Bug Fixes

#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)

#### ml-agents / ml-agents-envs / gym-unity (Python)
- Fixed a bug in multi-agent cooperative training where agents might not receive all of the states of
terminated teammates. (#5441)
- Fixed wrong attribute name in argparser for torch device option (#5433)(#5467)
- Fixed conflicting CLI and yaml options regarding resume & initialize_from (#5495)
- Fixed failing tests for gym-unity due to gym 0.20.0 release
- Fixed failing tests for gym-unity due to gym 0.20.0 release (#5540)
- Fixed a bug in VAIL where the variational bottleneck was not properly passing gradients (#5546)
- Added minimal analytics collection to LL-API (#5511)

## [2.1.0-exp.1] - 2021-06-09
### Minor Changes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
- update Barracuda to 2.0.0-pre.3. (#5385)
- Fixed NullReferenceException when adding Behavior Parameters with no Agent. (#5382)
- Add stacking option in Editor for `VectorSensorComponent`. (#5376)

#### ml-agents / ml-agents-envs / gym-unity (Python)
- Lock cattrs dependency version to 1.6. (#5397)
- Added a fully connected visual encoder for environments with very small image inputs. (#5351)
- Colab notebooks illustrating the use of the Python API are now part of the repository. (#5399)

### Bug Fixes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
- RigidBodySensorComponent now displays a warning if it's used in a way that won't generate useful observations. (#5387)
2 changes: 2 additions & 0 deletions docs/Learning-Environment-Create-New.md
@@ -417,6 +417,8 @@ behaviors:
lambd: 0.99
num_epoch: 3
learning_rate_schedule: linear
beta_schedule: constant
epsilon_schedule: linear
network_settings:
normalize: false
hidden_units: 128
2 changes: 2 additions & 0 deletions docs/Training-Configuration-File.md
@@ -59,6 +59,8 @@ the `trainer` setting above).
| :---------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `hyperparameters -> beta` | (default = `5.0e-3`) Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. <br><br>Typical range: `1e-4` - `1e-2` |
| `hyperparameters -> epsilon` | (default = `0.2`) Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. <br><br>Typical range: `0.1` - `0.3` |
| `hyperparameters -> beta_schedule` | (default = `learning_rate_schedule`) Determines how beta changes over time. <br><br>`linear` decays beta linearly, reaching 0 at max_steps, while `constant` keeps beta constant for the entire training run. If not explicitly set, the default beta schedule will be set to `hyperparameters -> learning_rate_schedule`. |
| `hyperparameters -> epsilon_schedule` | (default = `learning_rate_schedule`) Determines how epsilon changes over time (PPO only). <br><br>`linear` decays epsilon linearly, reaching 0 at max_steps, while `constant` keeps epsilon constant for the entire training run. If not explicitly set, the default epsilon schedule will be set to `hyperparameters -> learning_rate_schedule`. |
| `hyperparameters -> lambd` | (default = `0.95`) Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. <br><br>Typical range: `0.9` - `0.95` |
| `hyperparameters -> num_epoch` | (default = `3`) Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the batch_size, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. <br><br>Typical range: `3` - `10` |
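The two new `*_schedule` rows above reduce to a simple fallback rule: a schedule that is not set explicitly inherits `learning_rate_schedule`. A minimal sketch of the resulting behavior, using the `TrainerSettings.structure` call exercised by the new test at the end of this PR (the config dict below is hypothetical):

```python
from mlagents.trainers.settings import ScheduleType, TrainerSettings

# Hypothetical PPO config: epsilon_schedule is explicit, beta_schedule is omitted.
config = {
    "trainer_type": "ppo",
    "hyperparameters": {
        "learning_rate_schedule": "constant",
        "epsilon_schedule": "linear",
    },
}

settings = TrainerSettings.structure(config, TrainerSettings)
assert settings.hyperparameters.epsilon_schedule == ScheduleType.LINEAR  # explicit value kept
assert settings.hyperparameters.beta_schedule == ScheduleType.CONSTANT   # inherited from learning_rate_schedule
```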

2 changes: 2 additions & 0 deletions docs/Training-ML-Agents.md
@@ -269,7 +269,9 @@ behaviors:
# PPO-specific hyperparameters
# Replaces the "PPO-specific hyperparameters" section above
beta: 5.0e-3
beta_schedule: constant
epsilon: 0.2
epsilon_schedule: linear
lambd: 0.95
num_epoch: 3

4 changes: 2 additions & 2 deletions ml-agents/mlagents/trainers/poca/optimizer_torch.py
@@ -172,13 +172,13 @@ def __init__(self, policy: TorchPolicy, trainer_settings: TrainerSettings):
self.trainer_settings.max_steps,
)
self.decay_epsilon = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.epsilon_schedule,
self.hyperparameters.epsilon,
0.1,
self.trainer_settings.max_steps,
)
self.decay_beta = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.beta_schedule,
self.hyperparameters.beta,
1e-5,
self.trainer_settings.max_steps,
4 changes: 2 additions & 2 deletions ml-agents/mlagents/trainers/ppo/optimizer_torch.py
@@ -50,13 +50,13 @@ def __init__(self, policy: TorchPolicy, trainer_settings: TrainerSettings):
self.trainer_settings.max_steps,
)
self.decay_epsilon = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.epsilon_schedule,
self.hyperparameters.epsilon,
0.1,
self.trainer_settings.max_steps,
)
self.decay_beta = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.beta_schedule,
self.hyperparameters.beta,
1e-5,
self.trainer_settings.max_steps,
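Both optimizers now build `decay_epsilon` and `decay_beta` from their own schedules instead of reusing `learning_rate_schedule`. As a rough sketch of what a decayed value looks like over training (assumed behavior: `linear` interpolates from the initial value down to the floor passed above by `max_steps`, `constant` never changes; the real logic lives in `ModelUtils.DecayedValue`):

```python
def decayed_value(schedule: str, initial: float, floor: float, max_steps: int, step: int) -> float:
    # Assumed decay behavior: "constant" keeps the initial value,
    # "linear" interpolates from initial down to floor over max_steps.
    if schedule == "constant":
        return initial
    frac = min(step / float(max_steps), 1.0)
    return max(initial + frac * (floor - initial), floor)

# With epsilon_schedule=linear and beta_schedule=constant, halfway through a 500k-step run:
decayed_value("linear", 0.2, 0.1, 500_000, 250_000)        # epsilon -> 0.15
decayed_value("constant", 5.0e-3, 1e-5, 500_000, 250_000)  # beta    -> 0.005 (unchanged)
```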
48 changes: 34 additions & 14 deletions ml-agents/mlagents/trainers/settings.py
@@ -44,6 +44,33 @@ def check_and_structure(key: str, value: Any, class_type: type) -> Any:
return cattr.structure(value, attr_fields_dict[key].type)


class TrainerType(Enum):
PPO: str = "ppo"
SAC: str = "sac"
POCA: str = "poca"

def to_settings(self) -> type:
_mapping = {
TrainerType.PPO: PPOSettings,
TrainerType.SAC: SACSettings,
TrainerType.POCA: POCASettings,
}
return _mapping[self]
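For orientation, a small usage sketch of the mapping above (the real call site is in `TrainerSettings.structure`, further down in this file):

```python
settings_cls = TrainerType("ppo").to_settings()  # -> PPOSettings
defaults = settings_cls()                        # defaults.beta_schedule == ScheduleType.LINEAR
```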


def check_hyperparam_schedules(val: Dict, trainer_type: TrainerType) -> Dict:
# Check if beta and epsilon are set. If not, set to match learning rate schedule.
if trainer_type is TrainerType.PPO or trainer_type is TrainerType.POCA:
if "beta_schedule" not in val.keys() and "learning_rate_schedule" in val.keys():
val["beta_schedule"] = val["learning_rate_schedule"]
if (
"epsilon_schedule" not in val.keys()
and "learning_rate_schedule" in val.keys()
):
val["epsilon_schedule"] = val["learning_rate_schedule"]
return val
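Taken on its own, the helper above only fills missing keys in the raw hyperparameters dict. A minimal, hypothetical example of its effect:

```python
hp = {"learning_rate_schedule": "constant", "epsilon_schedule": "linear"}
hp = check_hyperparam_schedules(hp, TrainerType.PPO)
# hp is now:
# {"learning_rate_schedule": "constant",
#  "epsilon_schedule": "linear",     # explicit value untouched
#  "beta_schedule": "constant"}      # copied from learning_rate_schedule
```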


def strict_to_cls(d: Mapping, t: type) -> Any:
if not isinstance(d, Mapping):
raise TrainerConfigError(f"Unsupported config {d} for {t.__name__}.")
@@ -91,6 +118,8 @@ class EncoderType(Enum):
class ScheduleType(Enum):
CONSTANT = "constant"
LINEAR = "linear"
# TODO add support for lesson based scheduling
# LESSON = "lesson"


class ConditioningType(Enum):
@@ -151,6 +180,8 @@ class PPOSettings(HyperparamSettings):
lambd: float = 0.95
num_epoch: int = 3
learning_rate_schedule: ScheduleType = ScheduleType.LINEAR
beta_schedule: ScheduleType = ScheduleType.LINEAR
epsilon_schedule: ScheduleType = ScheduleType.LINEAR


@attr.s(auto_attribs=True)
@@ -608,20 +639,6 @@ def _team_change_default(self):
initial_elo: float = 1200.0


class TrainerType(Enum):
PPO: str = "ppo"
SAC: str = "sac"
POCA: str = "poca"

def to_settings(self) -> type:
_mapping = {
TrainerType.PPO: PPOSettings,
TrainerType.SAC: SACSettings,
TrainerType.POCA: POCASettings,
}
return _mapping[self]


@attr.s(auto_attribs=True)
class TrainerSettings(ExportableSettings):
default_override: ClassVar[Optional["TrainerSettings"]] = None
@@ -700,6 +717,9 @@ def structure(d: Mapping, t: type) -> Any:
"Hyperparameters were specified but no trainer_type was given."
)
else:
d_copy[key] = check_hyperparam_schedules(
val, d_copy["trainer_type"]
)
d_copy[key] = strict_to_cls(
d_copy[key], TrainerType(d_copy["trainer_type"]).to_settings()
)
2 changes: 2 additions & 0 deletions ml-agents/mlagents/trainers/tests/test_config_conversion.py
@@ -27,6 +27,8 @@
lambd: 0.95
learning_rate: 3.0e-4
learning_rate_schedule: linear
beta_schedule: constant
epsilon_schedule: linear
max_steps: 5.0e5
memory_size: 256
normalize: false
21 changes: 21 additions & 0 deletions ml-agents/mlagents/trainers/tests/test_settings.py
@@ -24,6 +24,7 @@
TrainerType,
deep_update_dict,
strict_to_cls,
ScheduleType,
)
from mlagents.trainers.exception import TrainerConfigError

@@ -160,6 +161,26 @@ def test_trainersettings_structure():
TrainerSettings.structure(trainersettings_dict, TrainerSettings)


def test_trainersettingsschedules_structure():
"""
Test structuring method for Trainer Settings Schedule
"""
trainersettings_dict = {
"trainer_type": "ppo",
"hyperparameters": {
"learning_rate_schedule": "linear",
"beta_schedule": "constant",
},
}
trainer_settings = TrainerSettings.structure(trainersettings_dict, TrainerSettings)
assert isinstance(trainer_settings.hyperparameters, PPOSettings)
assert (
trainer_settings.hyperparameters.learning_rate_schedule == ScheduleType.LINEAR
)
assert trainer_settings.hyperparameters.beta_schedule == ScheduleType.CONSTANT
assert trainer_settings.hyperparameters.epsilon_schedule == ScheduleType.LINEAR


def test_reward_signal_structure():
"""
Tests the RewardSignalSettings structure method. This one is special b/c