[RLlib] PPO torch RLTrainer #31801
Conversation
I have to rebase once the pre-reqs are merged.
I have some comments.
rllib/algorithms/algorithm.py
trainer_bundle = [
    {
        "CPU": cf.num_cpus_per_trainer_worker,
        "GPU": int(cf.num_gpus_per_trainer_worker > 0),
Overriding user configuration is confusing. Maybe consider validating and raising an error instead?
It just implements what's in the docs: if num_gpus_per_trainer_worker > 1 and num_trainer_workers = 0, we will just use one GPU. This is just enforcing that from Tune's perspective.
OK, I just want to offer my personal perspective here.
I feel like we always try to "auto-correct" for our users, for example by computing and overriding some parameters based on the values of other parameters, and then we put up a lot of explanations / docs around the "corrections" we may potentially make.
If it were 100% up to me, I would instead just raise an error telling users that I am seeing contradictory configs, because num_trainer_workers=0 doesn't work with num_gpus_per_trainer_worker > 0. Right now, we have no idea what the actual user intention is: did they mis-specify num_trainer_workers, or did they mis-specify num_gpus_per_trainer_worker? The only one who can fix this is the user themselves.
These are just some of my thoughts. I can't enforce this, so hopefully you understand what I mean.
Got it, makes sense.
done
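For illustration, here is a minimal sketch of the "validate and raise" approach settled on above. The function name and the exact hook where this check would live are assumptions, not the actual code in this PR; only the two config fields come from the thread.

```python
# Hedged sketch of the config validation discussed above. The attribute
# names mirror the fields mentioned in the thread; the real validation
# hook in AlgorithmConfig may look different.
def validate_trainer_resources(config) -> None:
    """Raise instead of silently auto-correcting contradictory settings."""
    if config.num_trainer_workers == 0 and config.num_gpus_per_trainer_worker > 1:
        raise ValueError(
            "num_trainer_workers=0 runs local training and supports at most one "
            f"GPU; got num_gpus_per_trainer_worker={config.num_gpus_per_trainer_worker}. "
            "Either set num_trainer_workers >= 1 or reduce num_gpus_per_trainer_worker."
        )
```

The point of raising (rather than clamping the GPU count) is that the user, not the framework, resolves which of the two fields was mis-specified.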
rllib/algorithms/ppo/ppo.py
@@ -201,12 +216,16 @@ def training(
            self.lr_schedule = lr_schedule
        if use_critic is not NotProvided:
            self.use_critic = use_critic
            # TODO (Kourosh) This is experimental. Set rl_trainer_hps parameters as
            # well. Don't forget to remote .use_critic from algorithm config.
What does "remote .use_critic" mean?
Typo: it should be "remove", not "remote" :)
# subtract that to get the total set of pids to update.
# TODO (Kourosh): We need to make a better design for the hierarchy of the
# train results, so that all the policy ids end up in the same level.
policies_to_update = set(train_results["loss"].keys()) - {"total_loss"}
I actually think you should be explicit and not rely on keys in the result dict to tell you which policies need to be updated. train_results should not be used as control messages, basically.
I'd like to hear more about what you mean by being more explicit. I was planning to revisit the train_results structure to remove these requirements in the next round of updates, but I'd love to hear your thoughts on how it should ideally look.
I'm not opinionated about how the result dict should look. But I do think we shouldn't use it as a control message, meaning that I can only get my policies updated if I add something to the result dict. These two things probably shouldn't go together.
I see your point, and I actually have a better idea. Right now I haven't even made trainer_runner only update the policies that are allowed (via policies_to_train). I think with that variable lingering around, I can infer the policies to update; then the returned results won't be used as the message-passing medium. I'll add a TODO with a better design guideline in the next PR.
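For reference, a minimal sketch of the explicit approach discussed in this thread. The `get_policies_to_train()` accessor and the `local_worker` argument are assumptions standing in for the real RLlib objects; only the fallback expression comes from the diff above.

```python
# Hedged sketch: derive the set of policies to update from the explicit
# training configuration (policies_to_train) rather than from the keys of
# the result dict. Names are illustrative; the actual API may differ.
def get_policies_to_update(local_worker, train_results):
    explicit = local_worker.get_policies_to_train()
    if explicit is not None:
        # Explicit control path: update exactly what the user configured.
        return set(explicit)
    # Fallback: infer from the per-policy loss keys (current behavior).
    return set(train_results["loss"].keys()) - {"total_loss"}
```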
samples_to_concat = []
# cycle through the batch until we have enough samples
while e >= len(module_batch):
You duplicate module_batches multiple times to make a minibatch large enough? This block of code is actually very confusing...
Yes. Generally speaking, if the number of samples within each policy batch is uneven and skewed, you need to make sure they are sharded almost equally when you pass them to the RLTrainer. Say policy1 has 100 samples and policy2 has 20 samples; when I want to pass a minibatch of size 40, from policy1 I will select samples 0-39, but from policy2 I select 0-19 + 0-19 to make up 40 samples. This is the problem of sharding across policies; I couldn't figure out an easy-to-understand sharding strategy. But I want to think about this more when I think about the double-batch communication overhead, since the two are very related.
I see. Maybe make a util function out of this so it's pluggable, and we can write and compare a few different schemes.
BTW, not sure if you want to add these as comments in the code; that would help a lot.
Will update, taking the suggestions into account 👍
OK, I am creating a MiniBatchIterator utility that can be reused once we move this iteration to sharded batches inside RLTrainer as well. Thanks for the suggestion; I really like the design breakdown.
👍
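To make the cycling behavior concrete, here is a minimal sketch of the idea described in this thread, matching the policy1 (100 samples) / policy2 (20 samples) example above. It operates on plain dicts of NumPy arrays and is illustrative only; the real MiniBatchIterator in RLlib may be structured differently.

```python
import numpy as np

# Hedged sketch of the cyclic sharding idea: each policy's batch is cycled
# until the requested minibatch size is reached, so skewed per-policy batch
# sizes still yield equally sized per-policy minibatches.
def cyclic_minibatch(per_policy_samples: dict, minibatch_size: int) -> dict:
    minibatch = {}
    for policy_id, samples in per_policy_samples.items():
        chunks, collected = [], 0
        # Cycle through this policy's samples until we have enough of them.
        while collected < minibatch_size:
            take = min(len(samples), minibatch_size - collected)
            chunks.append(samples[:take])
            collected += take
        minibatch[policy_id] = np.concatenate(chunks)
    return minibatch

# Example: policy1 contributes samples 0-39; policy2 is cycled twice (0-19, 0-19).
batch = {"policy1": np.arange(100), "policy2": np.arange(20)}
mb = cyclic_minibatch(batch, minibatch_size=40)
assert len(mb["policy1"]) == 40 and len(mb["policy2"]) == 40
```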
Why are these changes needed?
This PR creates a PPO torch RLTrainer.
PRs that have to merge first:
PRs that clean things up and make this less hacky:
Here are the learning curves for CartPole-v1. Blue is the old training stack with RLModules; red is the new training stack on one CPU. Throughput is also relatively good compared to before.
Let's check it out in the multi-GPU case:
Blue is zero GPUs, red is one GPU, and cyan is two GPUs.
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.