Feature/mla 2205 separate schedule lr beta epsilon #5538

Merged
14 changes: 13 additions & 1 deletion com.unity.ml-agents/CHANGELOG.md
@@ -8,33 +8,45 @@ and this project adheres to

## [Unreleased]
### Major Changes

#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)

#### ml-agents / ml-agents-envs / gym-unity (Python)

### Minor Changes

#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
- Added the capacity to initialize behaviors from any checkpoint and not just the latest one (#5525)

#### ml-agents / ml-agents-envs / gym-unity (Python)
- Set gym version in gym-unity to gym release 0.20.0
- Added support for having `beta`, `epsilon`, and `learning rate` on separate schedules (affects only PPO and POCA). (#5538)

### Bug Fixes

#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)

#### ml-agents / ml-agents-envs / gym-unity (Python)
- Fixed a bug in multi-agent cooperative training where agents might not receive all of the states of
terminated teammates. (#5441)
- Fixed wrong attribute name in argparser for torch device option (#5433)(#5467)
- Fixed conflicting CLI and yaml options regarding resume & initialize_from (#5495)
- Fixed failing tests for gym-unity due to gym 0.20.0 release
- Fixed failing tests for gym-unity due to gym 0.20.0 release (#5540)
- Fixed a bug in VAIL where the variational bottleneck was not properly passing gradients (#5546)
- Added minimal analytics collection to LL-API (#5511)

## [2.1.0-exp.1] - 2021-06-09
### Minor Changes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
- update Barracuda to 2.0.0-pre.3. (#5385)
- Fixed NullReferenceException when adding Behavior Parameters with no Agent. (#5382)
- Add stacking option in Editor for `VectorSensorComponent`. (#5376)

#### ml-agents / ml-agents-envs / gym-unity (Python)
- Lock cattrs dependency version to 1.6. (#5397)
- Added a fully connected visual encoder for environments with very small image inputs. (#5351)
- Colab notebooks illustrating the use of the Python API are now part of the repository. (#5399)

### Bug Fixes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
- RigidBodySensorComponent now displays a warning if it's used in a way that won't generate useful observations. (#5387)
2 changes: 2 additions & 0 deletions docs/Learning-Environment-Create-New.md
@@ -417,6 +417,8 @@ behaviors:
lambd: 0.99
num_epoch: 3
learning_rate_schedule: linear
beta_schedule: constant
epsilon_schedule: linear
network_settings:
normalize: false
hidden_units: 128
2 changes: 2 additions & 0 deletions docs/Training-Configuration-File.md
@@ -59,6 +59,8 @@ the `trainer` setting above).
| :---------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `hyperparameters -> beta` | (default = `5.0e-3`) Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. <br><br>Typical range: `1e-4` - `1e-2` |
| `hyperparameters -> epsilon` | (default = `0.2`) Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. <br><br>Typical range: `0.1` - `0.3` |
| `hyperparameters -> beta_schedule` | (default = `learning_rate_schedule`) Determines how beta changes over time. <br><br>`linear` decays beta linearly, reaching 0 at max_steps, while `constant` keeps beta constant for the entire training run. If not explicitly set, the default beta schedule will be set to `hyperparameters -> learning_rate_schedule`. |
| `hyperparameters -> epsilon_schedule` | (default = `learning_rate_schedule`) Determines how epsilon changes over time (PPO only). <br><br>`linear` decays epsilon linearly, reaching 0 at max_steps, while `constant` keeps epsilon constant for the entire training run. If not explicitly set, the default epsilon schedule will be set to `hyperparameters -> learning_rate_schedule`. |
| `hyperparameters -> lambd` | (default = `0.95`) Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. <br><br>Typical range: `0.9` - `0.95` |
| `hyperparameters -> num_epoch` | (default = `3`) Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the batch_size, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. <br><br>Typical range: `3` - `10` |
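The two new `*_schedule` rows above reduce to a simple fallback rule: a schedule that is not set explicitly inherits `learning_rate_schedule`. A minimal sketch of the resulting behavior, using the `TrainerSettings.structure` call exercised by the new test at the end of this PR (the config dict below is hypothetical):

```python
from mlagents.trainers.settings import ScheduleType, TrainerSettings

# Hypothetical PPO config: epsilon_schedule is explicit, beta_schedule is omitted.
config = {
    "trainer_type": "ppo",
    "hyperparameters": {
        "learning_rate_schedule": "constant",
        "epsilon_schedule": "linear",
    },
}

settings = TrainerSettings.structure(config, TrainerSettings)
assert settings.hyperparameters.epsilon_schedule == ScheduleType.LINEAR  # explicit value kept
assert settings.hyperparameters.beta_schedule == ScheduleType.CONSTANT   # inherited from learning_rate_schedule
```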

2 changes: 2 additions & 0 deletions docs/Training-ML-Agents.md
@@ -269,7 +269,9 @@ behaviors:
# PPO-specific hyperparameters
# Replaces the "PPO-specific hyperparameters" section above
beta: 5.0e-3
beta_schedule: constant
epsilon: 0.2
epsilon_schedule: linear
lambd: 0.95
num_epoch: 3

4 changes: 2 additions & 2 deletions ml-agents/mlagents/trainers/poca/optimizer_torch.py
@@ -172,13 +172,13 @@ def __init__(self, policy: TorchPolicy, trainer_settings: TrainerSettings):
self.trainer_settings.max_steps,
)
self.decay_epsilon = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.epsilon_schedule,
self.hyperparameters.epsilon,
0.1,
self.trainer_settings.max_steps,
)
self.decay_beta = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.beta_schedule,
self.hyperparameters.beta,
1e-5,
self.trainer_settings.max_steps,
4 changes: 2 additions & 2 deletions ml-agents/mlagents/trainers/ppo/optimizer_torch.py
@@ -50,13 +50,13 @@ def __init__(self, policy: TorchPolicy, trainer_settings: TrainerSettings):
self.trainer_settings.max_steps,
)
self.decay_epsilon = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.epsilon_schedule,
self.hyperparameters.epsilon,
0.1,
self.trainer_settings.max_steps,
)
self.decay_beta = ModelUtils.DecayedValue(
self.hyperparameters.learning_rate_schedule,
self.hyperparameters.beta_schedule,
self.hyperparameters.beta,
1e-5,
self.trainer_settings.max_steps,
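Both optimizers now build `decay_epsilon` and `decay_beta` from their own schedules instead of reusing `learning_rate_schedule`. As a rough sketch of what a decayed value looks like over training (assumed behavior: `linear` interpolates from the initial value down to the floor passed above by `max_steps`, `constant` never changes; the real logic lives in `ModelUtils.DecayedValue`):

```python
def decayed_value(schedule: str, initial: float, floor: float, max_steps: int, step: int) -> float:
    # Assumed decay behavior: "constant" keeps the initial value,
    # "linear" interpolates from initial down to floor over max_steps.
    if schedule == "constant":
        return initial
    frac = min(step / float(max_steps), 1.0)
    return max(initial + frac * (floor - initial), floor)

# With epsilon_schedule=linear and beta_schedule=constant, halfway through a 500k-step run:
decayed_value("linear", 0.2, 0.1, 500_000, 250_000)        # epsilon -> 0.15
decayed_value("constant", 5.0e-3, 1e-5, 500_000, 250_000)  # beta    -> 0.005 (unchanged)
```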
48 changes: 34 additions & 14 deletions ml-agents/mlagents/trainers/settings.py
@@ -44,6 +44,33 @@ def check_and_structure(key: str, value: Any, class_type: type) -> Any:
return cattr.structure(value, attr_fields_dict[key].type)


class TrainerType(Enum):
PPO: str = "ppo"
SAC: str = "sac"
POCA: str = "poca"

def to_settings(self) -> type:
_mapping = {
TrainerType.PPO: PPOSettings,
TrainerType.SAC: SACSettings,
TrainerType.POCA: POCASettings,
}
return _mapping[self]
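For orientation, a small usage sketch of the mapping above (the real call site is in `TrainerSettings.structure`, further down in this file):

```python
settings_cls = TrainerType("ppo").to_settings()  # -> PPOSettings
defaults = settings_cls()                        # defaults.beta_schedule == ScheduleType.LINEAR
```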


def check_hyperparam_schedules(val: Dict, trainer_type: TrainerType) -> Dict:
# Check if beta and epsilon are set. If not, set to match learning rate schedule.
if trainer_type is TrainerType.PPO or trainer_type is TrainerType.POCA:
if "beta_schedule" not in val.keys() and "learning_rate_schedule" in val.keys():
val["beta_schedule"] = val["learning_rate_schedule"]
if (
"epsilon_schedule" not in val.keys()
and "learning_rate_schedule" in val.keys()
):
val["epsilon_schedule"] = val["learning_rate_schedule"]
return val
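Taken on its own, the helper above only fills missing keys in the raw hyperparameters dict. A minimal, hypothetical example of its effect:

```python
hp = {"learning_rate_schedule": "constant", "epsilon_schedule": "linear"}
hp = check_hyperparam_schedules(hp, TrainerType.PPO)
# hp is now:
# {"learning_rate_schedule": "constant",
#  "epsilon_schedule": "linear",     # explicit value untouched
#  "beta_schedule": "constant"}      # copied from learning_rate_schedule
```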


def strict_to_cls(d: Mapping, t: type) -> Any:
if not isinstance(d, Mapping):
raise TrainerConfigError(f"Unsupported config {d} for {t.__name__}.")
@@ -91,6 +118,8 @@ class EncoderType(Enum):
class ScheduleType(Enum):
CONSTANT = "constant"
LINEAR = "linear"
# TODO add support for lesson based scheduling
# LESSON = "lesson"


class ConditioningType(Enum):
@@ -151,6 +180,8 @@ class PPOSettings(HyperparamSettings):
lambd: float = 0.95
num_epoch: int = 3
learning_rate_schedule: ScheduleType = ScheduleType.LINEAR
beta_schedule: ScheduleType = ScheduleType.LINEAR
epsilon_schedule: ScheduleType = ScheduleType.LINEAR


@attr.s(auto_attribs=True)
@@ -608,20 +639,6 @@ def _team_change_default(self):
initial_elo: float = 1200.0


class TrainerType(Enum):
PPO: str = "ppo"
SAC: str = "sac"
POCA: str = "poca"

def to_settings(self) -> type:
_mapping = {
TrainerType.PPO: PPOSettings,
TrainerType.SAC: SACSettings,
TrainerType.POCA: POCASettings,
}
return _mapping[self]


@attr.s(auto_attribs=True)
class TrainerSettings(ExportableSettings):
default_override: ClassVar[Optional["TrainerSettings"]] = None
@@ -700,6 +717,9 @@ def structure(d: Mapping, t: type) -> Any:
"Hyperparameters were specified but no trainer_type was given."
)
else:
d_copy[key] = check_hyperparam_schedules(
val, d_copy["trainer_type"]
)
d_copy[key] = strict_to_cls(
d_copy[key], TrainerType(d_copy["trainer_type"]).to_settings()
)
2 changes: 2 additions & 0 deletions ml-agents/mlagents/trainers/tests/test_config_conversion.py
@@ -27,6 +27,8 @@
lambd: 0.95
learning_rate: 3.0e-4
learning_rate_schedule: linear
beta_schedule: constant
epsilon_schedule: linear
max_steps: 5.0e5
memory_size: 256
normalize: false
21 changes: 21 additions & 0 deletions ml-agents/mlagents/trainers/tests/test_settings.py
@@ -24,6 +24,7 @@
TrainerType,
deep_update_dict,
strict_to_cls,
ScheduleType,
)
from mlagents.trainers.exception import TrainerConfigError

@@ -160,6 +161,26 @@ def test_trainersettings_structure():
TrainerSettings.structure(trainersettings_dict, TrainerSettings)


def test_trainersettingsschedules_structure():
"""
Test structuring method for Trainer Settings Schedule
"""
trainersettings_dict = {
"trainer_type": "ppo",
"hyperparameters": {
"learning_rate_schedule": "linear",
"beta_schedule": "constant",
},
}
trainer_settings = TrainerSettings.structure(trainersettings_dict, TrainerSettings)
assert isinstance(trainer_settings.hyperparameters, PPOSettings)
assert (
trainer_settings.hyperparameters.learning_rate_schedule == ScheduleType.LINEAR
)
assert trainer_settings.hyperparameters.beta_schedule == ScheduleType.CONSTANT
assert trainer_settings.hyperparameters.epsilon_schedule == ScheduleType.LINEAR


def test_reward_signal_structure():
"""
Tests the RewardSignalSettings structure method. This one is special b/c