
[RLlib] PPO torch RLTrainer #31801

Merged · 134 commits · Feb 8, 2023
Conversation

@kouroshHakha (Contributor) commented Jan 20, 2023

Why are these changes needed?

This PR creates a PPO torch RLTrainer.
PRs that have to merge first:

  1. [RLlib] Torch trainer #31628
  2. [RLlib] Add algorithm config to RLTrainer #31800
  3. [RLlib] Single agent RLTrainer made easy #31802

PRs that clean things up and make this less hacky:

  1. [RLlib] Reparameterize the construction of TrainerRunner and RLTrainers #31991
  2. [RLlib] Error out if action_dict is empty in MultiAgentEnv.  #32129
  3. [RLlib] Exclude gpu tag from Examples test suite in RLlib #32141
  4. [RLlib] Fix revert of trainer runner #32146

Here are the learning curves for CartPole-v1. Blue is the old training stack with RLModules; red is the new training stack on one CPU. Throughput is also relatively good compared to before.
[Learning-curve and throughput plots for CartPole-v1]

Let's check it out in the multi-GPU case: blue is zero GPUs, red is 1 GPU, and cyan is 2 GPUs.

[Multi-GPU comparison plots and a screenshot dated Jan 31, 2023]

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kouroshHakha (Contributor, Author):

I have to rebase once the pre-reqs are merged.

@gjoliver (Member) left a comment:

I have some comments.

trainer_bundle = [
{
"CPU": cf.num_cpus_per_trainer_worker,
"GPU": int(cf.num_gpus_per_trainer_worker > 0),
@gjoliver (Member):

Overriding user configuration is confusing. Maybe consider validating and raising an error instead?

@kouroshHakha (Contributor, Author):

It just implements what's in the docs: if num_gpus_per_trainer_worker > 1 and num_trainer_workers = 0, we will just use one GPU. This is just enforcing that from Tune's perspective.

@gjoliver (Member):

OK, I just want to offer my personal perspective here.
I feel like we always try to "auto-correct" for our users: for example, we compute and override some parameters based on the values of other parameters, and then we put up a lot of explanations and docs around the "corrections" we may potentially apply.
If it were 100% up to me, I would instead just raise an error telling users that the configs are contradictory, because num_trainer_workers=0 doesn't work with num_gpus_per_trainer_worker > 0. As it stands, we have no idea what the user actually intended: did they mis-specify num_trainer_workers or num_gpus_per_trainer_worker? The only one who can fix this is the user themselves.
These are just my thoughts, and I can't enforce this, so hopefully you understand what I mean.

@kouroshHakha (Contributor, Author):

Got it, makes sense.

@kouroshHakha (Contributor, Author):

done
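For readers following along, here is a minimal sketch of the "validate and raise" approach discussed in this thread. It is not the code that landed in this PR; the function name and the error message are illustrative only.

```python
# Hypothetical validation helper (not the actual RLlib code): reject
# contradictory resource settings instead of silently overriding them.
def validate_trainer_resources(
    num_trainer_workers: int, num_gpus_per_trainer_worker: int
) -> None:
    if num_trainer_workers == 0 and num_gpus_per_trainer_worker > 1:
        raise ValueError(
            "Contradictory config: num_trainer_workers=0 supports at most one "
            f"GPU, but num_gpus_per_trainer_worker={num_gpus_per_trainer_worker} "
            "was given. Either set num_trainer_workers >= 1 or reduce "
            "num_gpus_per_trainer_worker."
        )
```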

@@ -201,12 +216,16 @@ def training(
self.lr_schedule = lr_schedule
if use_critic is not NotProvided:
self.use_critic = use_critic
# TODO (Kourosh) This is experimental. Set rl_trainer_hps parameters as
# well. Don't forget to remote .use_critic from algorithm config.
@gjoliver (Member):

what does remote .use_critic mean?

@kouroshHakha (Contributor, Author) commented Feb 6, 2023:

typo: remove :) not remote

# subtract that to get the total set of pids to update.
# TODO (Kourosh): We need to make a better design for the hierarchy of the
# train results, so that all the policy ids end up in the same level.
policies_to_update = set(train_results["loss"].keys()) - {"total_loss"}
@gjoliver (Member):

I actually think you should be explicit, and not rely on certain keys in the result dict to tell you which policies need to be updated.
Basically, train_results should not be used as a control message.

@kouroshHakha (Contributor, Author):

I'd like to hear more about what you mean by being more explicit. I was planning to revisit the train_results structure to remove these requirements in the next round of updates, but I'd love to hear your thoughts on how it should ideally look.

@gjoliver (Member):

I am not opinionated about how the result dict should look.
But I do think we shouldn't use it as a control message, meaning my policies only get updated if I add something to the result dict.
Those two things probably shouldn't go together.

@kouroshHakha (Contributor, Author) commented Feb 6, 2023:

I see your point, and I actually have a better idea. Right now trainer_runner doesn't even restrict updates to the policies that are allowed (via policies_to_train). With that variable around, I can infer the policies to update, and then the returned results won't be used as the message-passing medium. I'll add a TODO with better design guidelines in the next PR.
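As a hedged sketch of the direction described above, deriving the set of modules to update from an explicit policies_to_train spec rather than from result-dict keys could look something like this; the helper name and signature are illustrative, not code from this PR:

```python
from typing import Dict, Iterable, Optional, Set

def get_policies_to_update(
    train_results: Dict[str, dict],
    policies_to_train: Optional[Iterable[str]] = None,
) -> Set[str]:
    # Preferred: an explicit spec says which policies may be updated, so the
    # result dict is never used as a control message.
    if policies_to_train is not None:
        return set(policies_to_train)
    # Fallback mirroring the current behavior: infer from the loss dict,
    # excluding the aggregate "total_loss" entry.
    return set(train_results["loss"].keys()) - {"total_loss"}
```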

rllib/core/rl_trainer/trainer_runner.py (outdated thread, resolved)

samples_to_concat = []
# cycle through the batch until we have enough samples
while e >= len(module_batch):
@gjoliver (Member):

You duplicate module_batches multiple times to make a mini_batch large enough?
This block of code is actually very confusing...

@kouroshHakha (Contributor, Author):

Yes. Generally speaking, if the number of samples in each policy batch is uneven and skewed, you need to make sure they are sharded roughly equally when passed to the RLTrainer. Say policy1 has 100 samples and policy2 has 20 samples, and I want to pass a minibatch size of 40: from policy1 I select rows 0-39, but from policy2 I select rows 0-19 plus 0-19 again to make up 40 samples. This is the problem of sharding across policies, and I couldn't figure out an easier-to-understand sharding strategy. But I want to think about this more when I think about the double-batch communication overhead, since the two are very related.
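To make the sharding example above concrete, here is a tiny sketch (an assumed helper, not the PR's actual code) that slices a policy's batch with wrap-around, so a short batch like policy2's still yields a full 40-row minibatch:

```python
import numpy as np

def cyclic_slice(batch: np.ndarray, start: int, size: int) -> np.ndarray:
    # Take `size` rows starting at `start`, wrapping around if the batch is
    # shorter than the requested minibatch size.
    idx = (start + np.arange(size)) % len(batch)
    return batch[idx]

policy1 = np.arange(100)  # 100 samples
policy2 = np.arange(20)   # only 20 samples
assert len(cyclic_slice(policy1, 0, 40)) == 40  # rows 0-39
assert len(cyclic_slice(policy2, 0, 40)) == 40  # rows 0-19 repeated twice
```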

@gjoliver (Member):

I see. Maybe make a util function out of this so it's pluggable, and we can write and compare a few different schemes.

@gjoliver (Member):

By the way, not sure if you want to add these explanations as comments in the code; that would help a lot.

@kouroshHakha (Contributor, Author):

Will update, taking the suggestions into account 👍

@kouroshHakha (Contributor, Author):

OK, I am creating a MiniBatchIterator utility that can be re-used once we move this iteration to sharded batches inside RLTrainer as well. Thanks for the suggestion; I really like the design breakdown.
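For illustration, here is a rough sketch of what such an iterator could look like, assuming plain per-module NumPy arrays rather than RLlib's actual batch types; the class below is a hypothetical stand-in, not the utility added to RLlib:

```python
from typing import Dict, Iterator
import numpy as np

class CyclicMiniBatchIterator:
    """Yields fixed-size minibatches per module, cycling through short batches."""

    def __init__(
        self, batch: Dict[str, np.ndarray], minibatch_size: int, num_iters: int
    ):
        self._batch = batch
        self._size = minibatch_size
        self._num_iters = num_iters

    def __iter__(self) -> Iterator[Dict[str, np.ndarray]]:
        for i in range(self._num_iters):
            start = i * self._size
            # Every module contributes exactly `minibatch_size` rows, wrapping
            # around modules whose batches are shorter than needed.
            yield {
                module_id: module_batch[
                    (start + np.arange(self._size)) % len(module_batch)
                ]
                for module_id, module_batch in self._batch.items()
            }

# Usage: two modules with skewed sample counts still stay in lock-step.
batches = {"policy1": np.arange(100), "policy2": np.arange(20)}
for minibatch in CyclicMiniBatchIterator(batches, minibatch_size=40, num_iters=4):
    assert all(len(b) == 40 for b in minibatch.values())
```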

@gjoliver (Member):

👍

@gjoliver gjoliver merged commit 1f77e04 into ray-project:master Feb 8, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Labels: none yet · Projects: none yet · 5 participants