[RLlib] Torch trainer #31628
Conversation
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…nabled
2. don't do numpy conversion for the batch on the base class
# TODO (Kourosh): This method is built for multi-agent. While it is still
# possible to write single-agent losses, it may become confusing to users. We
# should find a way to allow them to specify single-agent losses as well,
# without having to think about one extra layer of hierarchy for module ids.
What if we have two functions: one called `compute_loss_single_agent` and one called `compute_loss_multi_agent`?
If you implement `compute_loss_single_agent`, `update` calls it for every agent. If you implement `compute_loss_multi_agent`, `update` calls that function instead.
If you implement both, we throw an error.
The one downside is that I think it involves using the `overrides` decorator from ray/rllib.
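The proposed dispatch could be sketched roughly like this (all names here, e.g. `RLTrainerSketch`, `compute_loss_single_agent`, are hypothetical illustrations, not the actual RLlib API):

```python
class RLTrainerSketch:
    """Hypothetical base class illustrating the proposed loss dispatch."""

    def compute_loss_single_agent(self, batch):
        raise NotImplementedError

    def compute_loss_multi_agent(self, multi_agent_batch):
        raise NotImplementedError

    def _is_overridden(self, name):
        # True if a subclass replaced the base-class implementation.
        return getattr(type(self), name) is not getattr(RLTrainerSketch, name)

    def update(self, multi_agent_batch):
        single = self._is_overridden("compute_loss_single_agent")
        multi = self._is_overridden("compute_loss_multi_agent")
        if single and multi:
            raise ValueError("Implement only one of the two loss methods.")
        if multi:
            return self.compute_loss_multi_agent(multi_agent_batch)
        # Fall back to applying the single-agent loss per module id.
        return {
            module_id: self.compute_loss_single_agent(batch)
            for module_id, batch in multi_agent_batch.items()
        }
```

A subclass then overrides exactly one of the two methods, and `update` picks the right code path without the user ever touching the module-id hierarchy.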
I think we should roll for a bit with multi-agent as a first-class citizen and see users' impressions. I have written the BC trainer and honestly it's not that confusing. Think about how people usually write this stuff the first time: they most likely put breakpoints in the loss-computation code to see what data they'll get, and go from there. My hypothesis is that for an average "advanced" user (who is writing their own loss / algorithm), having the input be a multi-agent sample batch is a good indicator of what they need to do, and they'll also see examples of how other algorithms' losses are written. The advantage is that there is less API for a user to cope with, hence a lower cognitive load.
I have a proposal for this in a follow up PR.
self._module[module_id].to(self._device)
if self.distributed:
    self._module.add_module(
        module_id, TorchDDPRLModule(self._module[module_id]), override=True
    )
this looks so nice
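The general pattern in the snippet above (move each sub-module to the device, then conditionally swap it for a DDP wrapper in distributed mode) can be sketched in plain torch; `wrap_modules` and the dict layout are hypothetical stand-ins for the trainer's internals, not the actual RLlib code:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def wrap_modules(modules: dict, distributed: bool, device: str = "cpu") -> dict:
    """Move each sub-module to the target device and, in distributed mode,
    replace it in place with a DDP wrapper. In real distributed use,
    torch.distributed.init_process_group() must have been called first."""
    for module_id, module in modules.items():
        module = module.to(device)
        if distributed:
            module = DistributedDataParallel(module)
        modules[module_id] = module
    return modules
```

Overriding the entry in the module dict (rather than keeping a separate wrapped copy) keeps a single source of truth, which is what makes the `add_module(..., override=True)` call above read so cleanly.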
…e the unittest is located, and `import torch` would import the relative torch module instead of the global torch module
* added quick cleanups to trainer_runner
* created test_trainer_runner
* added bc_rl_trainer
* moved the DDPRLModuleWrapper outside of RLTrainer + lint
* merged tf and torch trainer_runner tests
Why are these changes needed?
This PR creates the torch trainer along with its unit test. A few cleanup PRs will follow this one.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.