Set ignore done=False in GAIL #4971

Merged (8 commits) on Feb 22, 2021

Changes from 3 commits
2 changes: 1 addition & 1 deletion com.unity.ml-agents/CHANGELOG.md
@@ -18,7 +18,7 @@ and this project adheres to
### Bug Fixes
#### com.unity.ml-agents (C#)
#### ml-agents / ml-agents-envs / gym-unity (Python)

- An issue that caused `GAIL` to fail for environments where agents can terminate episodes by self-sacrifice has been fixed. (#4971)

## [1.8.0-preview] - 2021-02-17
### Major Changes
12 changes: 10 additions & 2 deletions config/imitation/PushBlock.yaml
@@ -16,16 +16,24 @@ behaviors:
       num_layers: 2
       vis_encode_type: simple
     reward_signals:
-      gail:
+      extrinsic:
+        gamma: 0.99
+        strength: 1.0
+      gail:
         gamma: 0.99
         strength: 0.01
         encoding_size: 128
         learning_rate: 0.0003
         use_actions: false
         use_vail: false
         demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
     keep_checkpoints: 5
-    max_steps: 15000000
+    max_steps: 1000000
     time_horizon: 64
     summary_freq: 60000
     threaded: true
+    behavioral_cloning:
+      demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
+      steps: 50000
+      strength: 1.0
+      samples_per_update: 0
24 changes: 18 additions & 6 deletions docs/ML-Agents-Overview.md
@@ -472,12 +472,23 @@ Learning (GAIL). In most scenarios, you can combine these two features:
 - If you want to help your agents learn (especially with environments that have
   sparse rewards) using pre-recorded demonstrations, you can generally enable
   both GAIL and Behavioral Cloning at low strengths in addition to having an
-  extrinsic reward. An example of this is provided for the Pyramids example
-  environment under `PyramidsLearning` in `config/gail_config.yaml`.
-- If you want to train purely from demonstrations, GAIL and BC _without_ an
-  extrinsic reward signal is the preferred approach. An example of this is
-  provided for the Crawler example environment under `CrawlerStaticLearning` in
-  `config/gail_config.yaml`.
+  extrinsic reward. An example of this is provided for the PushBlock example
+  environment in `config/imitation/PushBlock.yaml`.
+- If you want to train purely from demonstrations with GAIL and BC _without_ an
+  extrinsic reward signal, please see the CrawlerStatic example environment in
+  `config/imitation/CrawlerStatic.yaml`.
+
+***Note:*** GAIL introduces a [_survivor bias_](https://arxiv.org/pdf/1809.02925.pdf)
+to the learning process. That is, by giving positive rewards based on similarity
+to the expert, the agent is incentivized to remain alive for as long as possible.
+This can directly conflict with goal-oriented tasks like our PushBlock or Pyramids
+example environments, where an agent must reach a goal state, thus ending the
+episode, as quickly as possible. In these cases, we strongly recommend that you
+use a low-strength GAIL reward signal and a sparse extrinsic signal when
+the agent achieves the task. This way, the GAIL reward signal will guide the
+agent until it discovers the extrinsic signal and will not overpower it. If the
+agent appears to be ignoring the extrinsic reward signal, you should reduce
+the strength of GAIL.
 
 #### GAIL (Generative Adversarial Imitation Learning)
 
@@ -504,6 +515,7 @@ actions. In addition to learning purely from demonstrations, the GAIL reward
 signal can be mixed with an extrinsic reward signal to guide the learning
 process.
 
+
 #### Behavioral Cloning (BC)
 
 BC trains the Agent's policy to exactly mimic the actions shown in a set of
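The note added above recommends pairing a low-strength GAIL signal with a sparse extrinsic reward so that imitation guides exploration without overpowering the task reward. As a rough, standalone illustration of that scaling argument (an editor's sketch, not ML-Agents trainer code; the strengths mirror the PushBlock config in this PR, and the +1 goal reward is an assumed sparse task reward):

```python
# Illustrative sketch only: strength-weighted mixing of reward signals.
# The strengths mirror config/imitation/PushBlock.yaml in this PR
# (extrinsic 1.0, GAIL 0.01); the +1 goal reward is an assumption.

def combined_reward(
    extrinsic: float,
    gail: float,
    extrinsic_strength: float = 1.0,
    gail_strength: float = 0.01,
) -> float:
    """Scale each signal by its configured strength and sum the results."""
    return extrinsic_strength * extrinsic + gail_strength * gail


# Ordinary step: no task reward yet, only the (bounded) GAIL imitation reward.
print(combined_reward(extrinsic=0.0, gail=0.8))  # ~0.008

# Goal step: the sparse extrinsic reward dwarfs the imitation term, so ending
# the episode by reaching the goal remains clearly worthwhile.
print(combined_reward(extrinsic=1.0, gail=0.8))  # ~1.008
```

With strength 0.01, even a near-maximal imitation reward contributes on the order of 0.01 per step, so a discovered +1 goal reward still dominates the return; if it does not, the note suggests lowering the GAIL strength further.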
@@ -21,12 +21,13 @@
 class GAILRewardProvider(BaseRewardProvider):
     def __init__(self, specs: BehaviorSpec, settings: GAILSettings) -> None:
         super().__init__(specs, settings)
-        self._ignore_done = True
+        self._ignore_done = False
         self._discriminator_network = DiscriminatorNetwork(specs, settings)
         self._discriminator_network.to(default_device())
         _, self._demo_buffer = demo_to_buffer(
             settings.demo_path, 1, specs
         ) # This is supposed to be the sequence length but we do not have access here
+        self._discriminator_network.encoder.update_normalization(self._demo_buffer)
Review comment (Contributor): How long will this take for large demo sets? Can this become prohibitive, and if yes, should we use a sample of the demo buffer?

Reply (Author): I replaced this with samples from the expert demos in the update.
         params = list(self._discriminator_network.parameters())
         self.optimizer = torch.optim.Adam(params, lr=settings.learning_rate)
 
@@ -44,6 +45,7 @@ def evaluate(self, mini_batch: AgentBuffer) -> np.ndarray:
         )
 
     def update(self, mini_batch: AgentBuffer) -> Dict[str, np.ndarray]:
+
         expert_batch = self._demo_buffer.sample_mini_batch(
             mini_batch.num_experiences, 1
         )
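The review exchange above concerns the cost of touching the entire demonstration buffer, and the author notes it was replaced by expert samples drawn during the update. The standalone sketch below (an editor's illustration, not the ML-Agents implementation) shows that per-update sampling pattern for a GAIL-style discriminator: each update draws a fresh expert mini-batch of the same size as the policy mini-batch and trains the discriminator to separate the two.

```python
# Editor's sketch, not ML-Agents code: a GAIL-style discriminator update that
# samples a fresh expert mini-batch on every step instead of consuming the
# whole demonstration set up front.
import torch

torch.manual_seed(0)
obs_size, batch_size = 8, 64

discriminator = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
optimizer = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
bce = torch.nn.BCEWithLogitsLoss()

expert_obs = torch.randn(10_000, obs_size) + 1.0  # stand-in for the demo buffer
policy_obs = torch.randn(10_000, obs_size)        # stand-in for agent rollouts

for _ in range(100):
    # Match the number of expert experiences to the policy mini-batch size,
    # mirroring sample_mini_batch(mini_batch.num_experiences, 1) above.
    expert_batch = expert_obs[torch.randint(0, len(expert_obs), (batch_size,))]
    policy_batch = policy_obs[torch.randint(0, len(policy_obs), (batch_size,))]

    # The discriminator learns to score expert data as 1 and policy data as 0.
    logits = discriminator(torch.cat([expert_batch, policy_batch]))
    labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, 1)])
    loss = bce(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# A GAIL reward for a policy observation can then be derived from the
# discriminator's score, for example -log(1 - sigmoid(logit)).
```

Refreshing the encoder's normalization statistics from these sampled expert batches, as the author describes, follows the same pattern and avoids a full pass over a large demo set.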
@@ -73,7 +75,7 @@ def __init__(self, specs: BehaviorSpec, settings: GAILSettings) -> None:
         self._settings = settings
 
         encoder_settings = NetworkSettings(
-            normalize=False,
+            normalize=True,
             hidden_units=settings.encoding_size,
             num_layers=2,
             vis_encode_type=EncoderType.SIMPLE,
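The final hunk turns on observation normalization for the discriminator's encoder (`normalize=True`). As a standalone illustration of what such a setting generally implies (ML-Agents' own normalizer differs in detail; this is only a sketch), here is a running mean and variance estimate that is updated batch by batch, the way rollout or demonstration observations would feed it:

```python
# Editor's sketch of a batch-updated running normalizer, the kind of statistic
# a `normalize=True` encoder setting generally maintains. ML-Agents' own
# implementation differs in detail; this only shows the idea.
import numpy as np


class RunningNormalizer:
    def __init__(self, size: int, eps: float = 1e-8) -> None:
        self.mean = np.zeros(size)
        self.var = np.ones(size)
        self.count = eps  # avoids division by zero before the first update

    def update(self, batch: np.ndarray) -> None:
        """Fold a batch of observations into the running mean/variance."""
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Chan et al. parallel update for combining two sets of statistics.
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, obs: np.ndarray) -> np.ndarray:
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)


normalizer = RunningNormalizer(size=8)
for _ in range(50):
    normalizer.update(np.random.normal(loc=5.0, scale=3.0, size=(256, 8)))

sample = np.random.normal(loc=5.0, scale=3.0, size=(1024, 8))
normalized = normalizer.normalize(sample)
print(normalized.mean(), normalized.std())  # close to 0 and 1
```

Keeping the discriminator's inputs on a comparable scale for policy and expert data matters once the expert demos are only seen through sampled mini-batches, which is the pattern described in the review thread above.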