Conversation

@ericl ericl (Contributor) commented Jul 7, 2018

What do these changes do?

Make PPO work with the sync samples optimizer too (this enables LSTM support), and add back the debug stats.

     for k, v in iter_extra_fetches.items():
-        all_extra_fetches[k] += [v]
+        iter_extra_fetches[k].append(v)
     print(i, _averaged(iter_extra_fetches))
@ericl (Contributor, Author) replied:

This was incorrect before; you want to return the last epoch's stats, not the mean across all epochs.

@richardliaw (Contributor) replied:

The previous impl returned the stats of all epochs, not just an average; it was no less "incorrect".

The reason you need the "last epoch" is PPO-specific.

@ericl (Contributor, Author) replied:

Eh, I'm not sure why returning all epoch stats makes sense in any scenario, unless your epochs were very small and the values noisy.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6532/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6543/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6550/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6542/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6567/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6570/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6571/

@richardliaw richardliaw left a comment (Contributor)

(comments were made from a partial review a couple of days ago)

            self.local_evaluator, self.remote_evaluators)

    def _train(self):
        def postprocess_samples(batch):
@richardliaw (Contributor) commented:

remove

    # Which observation filter to apply to the observation
    "observation_filter": "MeanStdFilter",
    # Debug only: use the sync samples optimizer instead of the multi-gpu one
    "debug_use_simple_optimizer": False,
@richardliaw (Contributor) commented:

why is this debug_only?


        all_extra_fetches = defaultdict(list)
        num_batches = (
            int(tuples_per_device) // int(self.per_device_batch_size))
        print("== sgd epochs ==")
@richardliaw (Contributor) commented:

can you flag this off with a verbose option?

@richardliaw (Contributor) added:

(flag off all of the prints in the optimizers with a verbose option?)

@ericl (Contributor, Author) replied:

Let's do this with log levels in the future. For now, we've always printed by default, so this just restores the 0.4 behavior.

        rewards_plus_v = np.concatenate(
            [rollout["rewards"], np.array([last_r])])
        traj["advantages"] = discount(rewards_plus_v, gamma)[:-1]
        traj["value_targets"] = np.zeros_like(traj["advantages"])
@richardliaw (Contributor) commented:

You can use a critic without using GAE, but this does not allow that functionality. Can you add documentation noting this?

@ericl (Contributor, Author) replied:

Done

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6591/
