[bug-fix] Fix POCA LSTM, pad sequences in the back #5206
Conversation
  # Evaluate other trajectories, carrying over _mem after each
  # trajectory
  for seq_num in range(
-     1, math.ceil((num_experiences) / (self.policy.sequence_length))
+     0, math.floor((num_experiences) / (self.policy.sequence_length))
Suggested change:
-     0, math.floor((num_experiences) / (self.policy.sequence_length))
+     math.floor((num_experiences) / (self.policy.sequence_length))
Suggested change:
-     0, math.floor((num_experiences) / (self.policy.sequence_length))
+     num_experiences // self.policy.sequence_length
  # For the last sequence, the initial memory should be the one at the
  # end of this trajectory.
  for _ in range(last_seq_len):
      all_next_memories.append(ModelUtils.to_numpy(_mem.squeeze()))
Why does it matter what the memory in the padding is? Can we make this all zeros?
This isn't in the padding, as the padding isn't added to the trajectory until later. This method only has unpadded data. Added a comment
I don't understand that comment:

# For the last sequence, the initial memory should be the one at the
# end of this trajectory.

Can you give more info?
@@ -381,116 +382,110 @@ def _evaluate_by_sequence_team(
      num_experiences = self_obs[0].shape[0]
      all_next_value_mem = AgentBufferField()
      all_next_baseline_mem = AgentBufferField()
-     # In the buffer, the 1st sequence are the ones that are padded. So if seq_len = 3 and
-     # trajectory is of length 10, the 1st sequence is [pad,pad,obs].
+     # In the buffer, the last sequence are the ones that are padded. So if seq_len = 3 and
Mention in this comment that this is about LSTM and memories.
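For reference, a minimal sketch of the back-padding scheme that comment describes (illustrative values only, not the actual AgentBuffer code):

```python
# Chunk a 10-step trajectory with sequence_length = 3.
# Only the final sequence is padded, and the padding goes at the back.
seq_len = 3
traj = [f"s{i}" for i in range(10)]

sequences = []
for start in range(0, len(traj), seq_len):
    seq = traj[start:start + seq_len]
    seq += ["pad"] * (seq_len - len(seq))  # pad the back of the last chunk
    sequences.append(seq)

print(sequences)
# [['s0', 's1', 's2'], ['s3', 's4', 's5'], ['s6', 's7', 's8'], ['s9', 'pad', 'pad']]
```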
  # Evaluate other trajectories, carrying over _mem after each
  # trajectory
  for seq_num in range(
-     1, math.ceil((num_experiences) / (self.policy.sequence_length))
+     0, math.floor((num_experiences) / (self.policy.sequence_length))
This can be simplified now.
  end = (seq_num + 1) * self.policy.sequence_length - (
      self.policy.sequence_length - leftover
  )
  start = seq_num * self.policy.sequence_length
This feels like there is a bunch of duplicate code with this part: https://github.com/Unity-Technologies/ml-agents/pull/5206/files#diff-01b42574a4de05de250e29039fc1bbc67b200f3a7c5ed0676b2eee549b943c11R95
Is it possible to combine some of it?
It's annoyingly similar but just slightly different enough to make it really hard to combine. The POCA version has to operate on group obs and actions as well 🤔
  all_next_value_mem.append(ModelUtils.to_numpy(_value_mem.squeeze()))
  all_next_baseline_mem.append(
      ModelUtils.to_numpy(_baseline_mem.squeeze())
Why are we padding with this value and not NaN or zeros?
The sequence at this point isn't padded - it's just the leftover bit that would go into a new sequence. E.g. if there were three sequences of length 2, `[[s0, s1], [s2, s3], [s4]]`, at this point this is just `s4`. When we pad in the buffer, it will become `[s4, 0]`, but not at this point.

With that said, only the 1st memory of each sequence is used. I'm repeating the memory rather than using zeros or ones so we don't need to allocate a new numpy array every time.
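To illustrate the leftover handling (a minimal standalone sketch with made-up numbers, not the optimizer code itself):

```python
# With 5 experiences and sequence_length = 2, the loop only evaluates full
# sequences; the leftover experience is handled separately and is NOT padded here.
num_experiences = 5
sequence_length = 2

num_full_seqs = num_experiences // sequence_length  # 2 full sequences: [s0, s1], [s2, s3]
leftover = num_experiences % sequence_length        # 1 leftover experience: [s4]

# [s4] only becomes [s4, 0] later, when the trajectory is padded in the buffer.
print(num_full_seqs, leftover)  # 2 1
```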
  last_seq_len = leftover
  if last_seq_len > 0:
Suggested change:
-     last_seq_len = leftover
-     if last_seq_len > 0:
+     if leftover > 0:
  last_seq_len = leftover
  if last_seq_len > 0:
      for _obs in tensor_obs:
          last_seq_obs = _obs[-last_seq_len:]
nit: why use `last_seq_len` instead of `leftover`? Or rename `leftover` when it's first created.
  # For the last sequence, the initial memory should be the one at the
  # end of this trajectory.
  for _ in range(last_seq_len):
Why do we add the last `_mem` for each extra leftover obs? Why isn't it enough to just append this once?
We need the mem array to be the same length as all the other ones in the buffer. But you're right, this isn't needed. To address @vincentpierre's comment above, we could also add `_mem` once and add a bunch of zeros/NaNs, but this saves us from having to create/allocate those empty arrays.
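A rough sketch of the bookkeeping being described here (the name `last_seq_len` follows the PR, but the shapes are placeholders):

```python
import numpy as np

# Each AgentBuffer field needs one entry per experience, so the memory field for
# the leftover chunk is filled by repeating the current LSTM state rather than
# allocating a fresh zero/NaN array for every slot.
_mem = np.zeros((1, 1, 4), dtype=np.float32)  # placeholder for the LSTM state tensor
last_seq_len = 3                              # leftover experiences in the last chunk

all_next_memories = [_mem.squeeze() for _ in range(last_seq_len)]

# Only the first memory of each sequence is read back during the update,
# so the repeated entries are just length-matching filler.
assert len(all_next_memories) == last_seq_len
```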
@@ -5,7 +5,7 @@ behaviors:
      batch_size: 128
      buffer_size: 1024
      learning_rate: 0.0003
-     beta: 0.01
+     beta: 0.03
Should we add in the documentation that it may be good practice to increase beta with LSTM? Or do we think that this is a peculiarity of Hallway?
I think it's because of Hallway's sparse reward and a reward structure that has a bunch of local maxima (at least 2, given the reward curve). Hard to say if we'd need to increase beta for other envs.
Our betas might just be too low across the board.
      for signal_name in init_values.keys()
  }

  # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
Suggested change:
-     # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
+     # When using LSTM, we need to divide the trajectory into sequences of equal length. Sometimes,
      signal_name: [init_values[signal_name]]
      for signal_name in init_values.keys()
  }
  # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
Suggested change:
-     # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
+     # When using LSTM, we need to divide the trajectory into sequences of equal length. Sometimes,
-     for seq_num in range(
-         1, math.ceil((num_experiences) / (self.policy.sequence_length))
-     ):
+     for seq_num in range(num_experiences // (self.policy.sequence_length)):
Suggested change:
-     for seq_num in range(num_experiences // (self.policy.sequence_length)):
+     for seq_num in range(num_experiences // self.policy.sequence_length):
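For what it's worth, a quick standalone check of the simplification (not the trainer code): for non-negative integers, `math.floor(a / b)` and `a // b` give the same range bounds.

```python
import math

sequence_length = 3
for num_experiences in (2, 9, 10):
    old_style = list(range(0, math.floor(num_experiences / sequence_length)))
    new_style = list(range(num_experiences // sequence_length))
    assert old_style == new_style  # same number of full sequences either way
```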
* Pad buffer at the end
* Fix padding in optimizer value estimate
* Fix additional bugs and POCA
* Fix groupmate obs, add tests
* Update changelog
* Improve tests
* Address comments
* Fix poca test
* Fix buffer test
* Increase entropy for Hallway
* Add EOF newline
* Fix Behavior Name
* Address comments

(cherry picked from commit 2ce6810)
Proposed change(s)
This PR does two things:

1. Fixes an issue with POCA's LSTM when `sequence_length` < `time_horizon`.
2. Moves padding of sequences shorter than `sequence_length` to the back. This is because we store the initial memories for each sequence (both policy and critic), but since we pad from the front, short sequences will be initialized, then run through some zeros. For instance, given a trajectory `[s0, s1, s2]`, we will store the initial memory `m0`. Then, when we update the model, we might have padded the trajectory to `[pad0, pad1, s0, s1, s2]`, but will put the memory `m0` in the LSTM with `pad0`. This PR will change the padding to `[s0, s1, s2, pad0, pad1]`, so that `m0` goes with `s0` (see the sketch below). This will also let us implement burn-in very easily by masking the loss at the start of the sequence.

Pending: experiment results (currently running Hallway).
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
JIRA MLA-1873
Types of change(s)
Checklist
Other comments