[bug-fix] Fix POCA LSTM, pad sequences in the back #5206
Conversation
  # Evaluate other trajectories, carrying over _mem after each
  # trajectory
  for seq_num in range(
-     1, math.ceil((num_experiences) / (self.policy.sequence_length))
+     0, math.floor((num_experiences) / (self.policy.sequence_length))
Suggested change:
-     0, math.floor((num_experiences) / (self.policy.sequence_length))
+     math.floor((num_experiences) / (self.policy.sequence_length))
Suggested change:
-     0, math.floor((num_experiences) / (self.policy.sequence_length))
+     num_experiences // self.policy.sequence_length
  # For the last sequence, the initial memory should be the one at the
  # end of this trajectory.
  for _ in range(last_seq_len):
      all_next_memories.append(ModelUtils.to_numpy(_mem.squeeze()))
Why does it matter what the memory in the padding is? Can we make this all zeros?
This isn't in the padding, as the padding isn't added to the trajectory until later. This method only has unpadded data. Added a comment
I don't understand that comment:

# For the last sequence, the initial memory should be the one at the
# end of this trajectory.

Can you give more info?
@@ -381,116 +382,110 @@ def _evaluate_by_sequence_team(
      num_experiences = self_obs[0].shape[0]
      all_next_value_mem = AgentBufferField()
      all_next_baseline_mem = AgentBufferField()
-     # In the buffer, the 1st sequence are the ones that are padded. So if seq_len = 3 and
-     # trajectory is of length 10, the 1st sequence is [pad,pad,obs].
+     # In the buffer, the last sequence are the ones that are padded. So if seq_len = 3 and
Mention in this comment that this is about LSTM and memories.
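For reference, a minimal sketch of the back-padding scheme that comment describes (illustrative values only, not the actual AgentBuffer code):

```python
# Chunk a 10-step trajectory with sequence_length = 3.
# Only the final sequence is padded, and the padding goes at the back.
seq_len = 3
traj = [f"s{i}" for i in range(10)]

sequences = []
for start in range(0, len(traj), seq_len):
    seq = traj[start:start + seq_len]
    seq += ["pad"] * (seq_len - len(seq))  # pad the back of the last chunk
    sequences.append(seq)

print(sequences)
# [['s0', 's1', 's2'], ['s3', 's4', 's5'], ['s6', 's7', 's8'], ['s9', 'pad', 'pad']]
```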
  # Evaluate other trajectories, carrying over _mem after each
  # trajectory
  for seq_num in range(
-     1, math.ceil((num_experiences) / (self.policy.sequence_length))
+     0, math.floor((num_experiences) / (self.policy.sequence_length))
This can be simplified now.
  end = (seq_num + 1) * self.policy.sequence_length - (
      self.policy.sequence_length - leftover
  )
  start = seq_num * self.policy.sequence_length
This feels like there is a bunch of duplicate code with this part: https://github.com/Unity-Technologies/ml-agents/pull/5206/files#diff-01b42574a4de05de250e29039fc1bbc67b200f3a7c5ed0676b2eee549b943c11R95
Is it possible to combine some of it?
It's annoyingly similar but just slightly different enough to make it really hard to combine. The POCA version has to operate on group obs and actions as well 🤔
  all_next_value_mem.append(ModelUtils.to_numpy(_value_mem.squeeze()))
  all_next_baseline_mem.append(
      ModelUtils.to_numpy(_baseline_mem.squeeze())
Why are we padding with this value and not NaN or zeros?
The sequence at this point isn't padded - it's just the leftover bit that would go into a new sequence. E.g. if there were three sequences of length 2, `[[s0, s1], [s2, s3], [s4]]`, at this point this is just `s4`. When we pad in the buffer, it will become `[s4, 0]`, but not at this point.

With that said, only the 1st memory of each sequence is used. I'm repeating the memory rather than using zeros or ones so we don't need to allocate a new numpy array every time.
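To illustrate the leftover handling (a minimal standalone sketch with made-up numbers, not the optimizer code itself):

```python
# With 5 experiences and sequence_length = 2, the loop only evaluates full
# sequences; the leftover experience is handled separately and is NOT padded here.
num_experiences = 5
sequence_length = 2

num_full_seqs = num_experiences // sequence_length  # 2 full sequences: [s0, s1], [s2, s3]
leftover = num_experiences % sequence_length        # 1 leftover experience: [s4]

# [s4] only becomes [s4, 0] later, when the trajectory is padded in the buffer.
print(num_full_seqs, leftover)  # 2 1
```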
  last_seq_len = leftover
  if last_seq_len > 0:
Suggested change:
-     last_seq_len = leftover
-     if last_seq_len > 0:
+     if leftover > 0:
  last_seq_len = leftover
  if last_seq_len > 0:
      for _obs in tensor_obs:
          last_seq_obs = _obs[-last_seq_len:]
nit: why use `last_seq_len` instead of `leftover`? Or rename `leftover` when it's first created.
  # For the last sequence, the initial memory should be the one at the
  # end of this trajectory.
  for _ in range(last_seq_len):
Why do we add the last `_mem` for each extra leftover obs? Why isn't it enough to just append this once?
We need the mem array to be the same length as all the other ones in the buffer. But you're right, this isn't needed. To address @vincentpierre's comment above, we could also add `_mem` once and add a bunch of zeros/NaNs, but this saves us from having to create/allocate those empty arrays.
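A rough sketch of the bookkeeping being described here (the name `last_seq_len` follows the PR, but the shapes are placeholders):

```python
import numpy as np

# Each AgentBuffer field needs one entry per experience, so the memory field for
# the leftover chunk is filled by repeating the current LSTM state rather than
# allocating a fresh zero/NaN array for every slot.
_mem = np.zeros((1, 1, 4), dtype=np.float32)  # placeholder for the LSTM state tensor
last_seq_len = 3                              # leftover experiences in the last chunk

all_next_memories = [_mem.squeeze() for _ in range(last_seq_len)]

# Only the first memory of each sequence is read back during the update,
# so the repeated entries are just length-matching filler.
assert len(all_next_memories) == last_seq_len
```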
@@ -5,7 +5,7 @@ behaviors:
      batch_size: 128
      buffer_size: 1024
      learning_rate: 0.0003
-     beta: 0.01
+     beta: 0.03
Should we add in the documentation that it may be good practice to increase beta with LSTM? Or do we think that this is a peculiarity of Hallway?
I think it's because of Hallway's sparse reward and a reward structure that has a bunch of local maxima (at least 2, given the reward curve). Hard to say if we'd need to increase beta for other envs.
Our betas might just be too low across the board.
      for signal_name in init_values.keys()
  }

  # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
Suggested change:
-     # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
+     # When using LSTM, we need to divide the trajectory into sequences of equal length. Sometimes,
      signal_name: [init_values[signal_name]]
      for signal_name in init_values.keys()
  }
  # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
Suggested change:
-     # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
+     # When using LSTM, we need to divide the trajectory into sequences of equal length. Sometimes,
-     for seq_num in range(
-         1, math.ceil((num_experiences) / (self.policy.sequence_length))
-     ):
+     for seq_num in range(num_experiences // (self.policy.sequence_length)):
Suggested change:
-     for seq_num in range(num_experiences // (self.policy.sequence_length)):
+     for seq_num in range(num_experiences // self.policy.sequence_length):
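For what it's worth, a quick standalone check of the simplification (not the trainer code): for non-negative integers, `math.floor(a / b)` and `a // b` give the same range bounds.

```python
import math

sequence_length = 3
for num_experiences in (2, 9, 10):
    old_style = list(range(0, math.floor(num_experiences / sequence_length)))
    new_style = list(range(num_experiences // sequence_length))
    assert old_style == new_style  # same number of full sequences either way
```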
* Pad buffer at the end
* Fix padding in optimizer value estimate
* Fix additional bugs and POCA
* Fix groupmate obs, add tests
* Update changelog
* Improve tests
* Address comments
* Fix poca test
* Fix buffer test
* Increase entropy for Hallway
* Add EOF newline
* Fix Behavior Name
* Address comments

(cherry picked from commit 2ce6810)
Proposed change(s)
This PR does two things:

1. Fixes an issue with POCA's LSTM when `sequence_length` < `time_horizon`.
2. Moves padding of sequences shorter than `sequence_length` to the back. This is because we store the initial memories for each sequence (both policy and critic), but since we pad from the front, short sequences will be initialized, then run through some zeros. For instance, given a trajectory `[s0, s1, s2]`, we will store the initial memory `m0`. Then, when we update the model, we might have padded the trajectory to `[pad0, pad1, s0, s1, s2]`, but will put the memory `m0` in the LSTM with `pad0`. This PR will change the padding to `[s0, s1, s2, pad0, pad1]`, so that `m0` goes with `s0` (see the sketch below). This will also let us implement burn-in very easily by masking the loss at the start of the sequence.

Pending: experiment results (currently running Hallway).
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
JIRA MLA-1873
Types of change(s)
Checklist
Other comments