Set ignore done=False in GAIL #4971

Merged (8 commits) on Feb 22, 2021

Changes from 3 commits
2 changes: 1 addition & 1 deletion com.unity.ml-agents/CHANGELOG.md
@@ -18,7 +18,7 @@ and this project adheres to
### Bug Fixes
#### com.unity.ml-agents (C#)
#### ml-agents / ml-agents-envs / gym-unity (Python)

- An issue that caused `GAIL` to fail for environments where agents can terminate episodes by self-sacrifice has been fixed. (#4971)

## [1.8.0-preview] - 2021-02-17
### Major Changes
12 changes: 10 additions & 2 deletions config/imitation/PushBlock.yaml
@@ -16,16 +16,24 @@ behaviors:
       num_layers: 2
       vis_encode_type: simple
     reward_signals:
-      gail:
+      extrinsic:
+        gamma: 0.99
+        strength: 1.0
+      gail:
         gamma: 0.99
         strength: 0.01
         encoding_size: 128
         learning_rate: 0.0003
         use_actions: false
         use_vail: false
         demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
     keep_checkpoints: 5
-    max_steps: 15000000
+    max_steps: 1000000
     time_horizon: 64
     summary_freq: 60000
     threaded: true
+    behavioral_cloning:
+      demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
+      steps: 50000
+      strength: 1.0
+      samples_per_update: 0
24 changes: 18 additions & 6 deletions docs/ML-Agents-Overview.md
@@ -472,12 +472,23 @@ Learning (GAIL). In most scenarios, you can combine these two features:
 - If you want to help your agents learn (especially with environments that have
   sparse rewards) using pre-recorded demonstrations, you can generally enable
   both GAIL and Behavioral Cloning at low strengths in addition to having an
-  extrinsic reward. An example of this is provided for the Pyramids example
-  environment under `PyramidsLearning` in `config/gail_config.yaml`.
-- If you want to train purely from demonstrations, GAIL and BC _without_ an
-  extrinsic reward signal is the preferred approach. An example of this is
-  provided for the Crawler example environment under `CrawlerStaticLearning` in
-  `config/gail_config.yaml`.
+  extrinsic reward. An example of this is provided for the PushBlock example
+  environment in `config/imitation/PushBlock.yaml`.
+- If you want to train purely from demonstrations with GAIL and BC _without_ an
+  extrinsic reward signal, please see the CrawlerStatic example environment in
+  `config/imitation/CrawlerStatic.yaml`.
+
+***Note:*** GAIL introduces a [_survivor bias_](https://arxiv.org/pdf/1809.02925.pdf)
+to the learning process. That is, by giving positive rewards based on similarity
+to the expert, the agent is incentivized to remain alive for as long as possible.
+This can directly conflict with goal-oriented tasks like our PushBlock or Pyramids
+example environments, where an agent must reach a goal state, thus ending the
+episode, as quickly as possible. In these cases, we strongly recommend that you
+use a low-strength GAIL reward signal and a sparse extrinsic signal when
+the agent achieves the task. This way, the GAIL reward signal will guide the
+agent until it discovers the extrinsic signal and will not overpower it. If the
+agent appears to be ignoring the extrinsic reward signal, you should reduce
+the strength of GAIL.
 
 #### GAIL (Generative Adversarial Imitation Learning)
 
@@ -504,6 +515,7 @@ actions. In addition to learning purely from demonstrations, the GAIL reward
 signal can be mixed with an extrinsic reward signal to guide the learning
 process.
 
+
 #### Behavioral Cloning (BC)
 
 BC trains the Agent's policy to exactly mimic the actions shown in a set of
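The note added above recommends pairing a low-strength GAIL signal with a sparse extrinsic reward so that imitation guides exploration without overpowering the task reward. As a rough, standalone illustration of that scaling argument (an editor's sketch, not ML-Agents trainer code; the strengths mirror the PushBlock config in this PR, and the +1 goal reward is an assumed sparse task reward):

```python
# Illustrative sketch only: strength-weighted mixing of reward signals.
# The strengths mirror config/imitation/PushBlock.yaml in this PR
# (extrinsic 1.0, GAIL 0.01); the +1 goal reward is an assumption.

def combined_reward(
    extrinsic: float,
    gail: float,
    extrinsic_strength: float = 1.0,
    gail_strength: float = 0.01,
) -> float:
    """Scale each signal by its configured strength and sum the results."""
    return extrinsic_strength * extrinsic + gail_strength * gail


# Ordinary step: no task reward yet, only the (bounded) GAIL imitation reward.
print(combined_reward(extrinsic=0.0, gail=0.8))  # ~0.008

# Goal step: the sparse extrinsic reward dwarfs the imitation term, so ending
# the episode by reaching the goal remains clearly worthwhile.
print(combined_reward(extrinsic=1.0, gail=0.8))  # ~1.008
```

With strength 0.01, even a near-maximal imitation reward contributes on the order of 0.01 per step, so a discovered +1 goal reward still dominates the return; if it does not, the note suggests lowering the GAIL strength further.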
@@ -21,12 +21,13 @@
 class GAILRewardProvider(BaseRewardProvider):
     def __init__(self, specs: BehaviorSpec, settings: GAILSettings) -> None:
         super().__init__(specs, settings)
-        self._ignore_done = True
+        self._ignore_done = False
         self._discriminator_network = DiscriminatorNetwork(specs, settings)
         self._discriminator_network.to(default_device())
         _, self._demo_buffer = demo_to_buffer(
             settings.demo_path, 1, specs
         ) # This is supposed to be the sequence length but we do not have access here
+        self._discriminator_network.encoder.update_normalization(self._demo_buffer)
Review comment (Contributor): How long will this take for large demo sets? Can this become prohibitive, and if yes, should we use a sample of the demo buffer?

Reply (Author): I replaced this with samples from the expert demos in the update.
         params = list(self._discriminator_network.parameters())
         self.optimizer = torch.optim.Adam(params, lr=settings.learning_rate)
 
@@ -44,6 +45,7 @@ def evaluate(self, mini_batch: AgentBuffer) -> np.ndarray:
         )
 
     def update(self, mini_batch: AgentBuffer) -> Dict[str, np.ndarray]:
+
         expert_batch = self._demo_buffer.sample_mini_batch(
             mini_batch.num_experiences, 1
         )
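The review exchange above concerns the cost of touching the entire demonstration buffer, and the author notes it was replaced by expert samples drawn during the update. The standalone sketch below (an editor's illustration, not the ML-Agents implementation) shows that per-update sampling pattern for a GAIL-style discriminator: each update draws a fresh expert mini-batch of the same size as the policy mini-batch and trains the discriminator to separate the two.

```python
# Editor's sketch, not ML-Agents code: a GAIL-style discriminator update that
# samples a fresh expert mini-batch on every step instead of consuming the
# whole demonstration set up front.
import torch

torch.manual_seed(0)
obs_size, batch_size = 8, 64

discriminator = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
optimizer = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
bce = torch.nn.BCEWithLogitsLoss()

expert_obs = torch.randn(10_000, obs_size) + 1.0  # stand-in for the demo buffer
policy_obs = torch.randn(10_000, obs_size)        # stand-in for agent rollouts

for _ in range(100):
    # Match the number of expert experiences to the policy mini-batch size,
    # mirroring sample_mini_batch(mini_batch.num_experiences, 1) above.
    expert_batch = expert_obs[torch.randint(0, len(expert_obs), (batch_size,))]
    policy_batch = policy_obs[torch.randint(0, len(policy_obs), (batch_size,))]

    # The discriminator learns to score expert data as 1 and policy data as 0.
    logits = discriminator(torch.cat([expert_batch, policy_batch]))
    labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, 1)])
    loss = bce(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# A GAIL reward for a policy observation can then be derived from the
# discriminator's score, for example -log(1 - sigmoid(logit)).
```

Refreshing the encoder's normalization statistics from these sampled expert batches, as the author describes, follows the same pattern and avoids a full pass over a large demo set.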
@@ -73,7 +75,7 @@ def __init__(self, specs: BehaviorSpec, settings: GAILSettings) -> None:
         self._settings = settings
 
         encoder_settings = NetworkSettings(
-            normalize=False,
+            normalize=True,
             hidden_units=settings.encoding_size,
             num_layers=2,
             vis_encode_type=EncoderType.SIMPLE,
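The final hunk turns on observation normalization for the discriminator's encoder (`normalize=True`). As a standalone illustration of what such a setting generally implies (ML-Agents' own normalizer differs in detail; this is only a sketch), here is a running mean and variance estimate that is updated batch by batch, the way rollout or demonstration observations would feed it:

```python
# Editor's sketch of a batch-updated running normalizer, the kind of statistic
# a `normalize=True` encoder setting generally maintains. ML-Agents' own
# implementation differs in detail; this only shows the idea.
import numpy as np


class RunningNormalizer:
    def __init__(self, size: int, eps: float = 1e-8) -> None:
        self.mean = np.zeros(size)
        self.var = np.ones(size)
        self.count = eps  # avoids division by zero before the first update

    def update(self, batch: np.ndarray) -> None:
        """Fold a batch of observations into the running mean/variance."""
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Chan et al. parallel update for combining two sets of statistics.
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, obs: np.ndarray) -> np.ndarray:
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)


normalizer = RunningNormalizer(size=8)
for _ in range(50):
    normalizer.update(np.random.normal(loc=5.0, scale=3.0, size=(256, 8)))

sample = np.random.normal(loc=5.0, scale=3.0, size=(1024, 8))
normalized = normalizer.normalize(sample)
print(normalized.mean(), normalized.std())  # close to 0 and 1
```

Keeping the discriminator's inputs on a comparable scale for policy and expert data matters once the expert demos are only seen through sampled mini-batches, which is the pattern described in the review thread above.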