
Support custom trainer and backend #91

Merged
KKIEEK merged 9 commits into ray/v2.1.0 from v2.1.0/custom on Dec 15, 2022

Conversation

@KKIEEK (Contributor) commented on Dec 2, 2022

Motivation

Since MM-based repositories already call torch.distributed.init_process_group, using TorchTrainer for DDP in the Ray framework raises RuntimeError("trying to initialize the default process group twice!").
To solve this problem, I introduced a custom backend modified from here.
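
To illustrate the conflict, here is a minimal single-process sketch (not code from this PR; the gloo backend and a local TCP rendezvous are used only so it runs standalone):

import torch.distributed as dist

# MM-based training code initializes the default process group itself.
dist.init_process_group(
    backend='gloo', init_method='tcp://127.0.0.1:29500', rank=0, world_size=1)

# Ray's stock TorchTrainer backend performs the same initialization on each
# worker before the training function runs, so the duplicate call raises:
#   RuntimeError: trying to initialize the default process group twice!
dist.init_process_group(
    backend='gloo', init_method='tcp://127.0.0.1:29500', rank=0, world_size=1)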

@KKIEEK KKIEEK requested review from yhna940 and nijkah December 2, 2022 16:38
@KKIEEK KKIEEK marked this pull request as ready for review December 2, 2022 22:22
@KKIEEK KKIEEK marked this pull request as draft December 2, 2022 22:23
@KKIEEK KKIEEK marked this pull request as ready for review December 7, 2022 17:01
class _CustomTorchBackend(_TorchBackend):
    share_cuda_visible_devices: bool = True

    def on_start(self, worker_group: WorkerGroup,
                 backend_config: TorchConfig):  # remainder of the diff elided
Contributor

If the process group is not initialized, how about initializing it here without throwing an error?

Contributor Author (@KKIEEK)

I think we have no way to know whether the process group is initialized in the Task.run method.
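
For reference, the guard suggested above would look roughly like this (a sketch only, not the merged code; ensure_default_process_group is a hypothetical helper):

import torch.distributed as dist

def ensure_default_process_group(**init_kwargs) -> None:
    # Initialize the default process group only if the training code
    # (e.g. an MM-based repository) has not already done so.
    if not dist.is_initialized():
        dist.init_process_group(**init_kwargs)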

siatune/ray/config.py: two review threads, outdated and resolved
KKIEEK and others added 3 commits December 15, 2022 11:45
Co-authored-by: Hakjin Lee <nijkah@gmail.com>
Signed-off-by: Junhwa Song <ethan9867@gmail.com>
siatune/ray/config.py: review thread, outdated and resolved
KKIEEK and others added 2 commits December 15, 2022 11:56
Signed-off-by: Junhwa Song <ethan9867@gmail.com>
@KKIEEK KKIEEK merged commit 9eda02d into ray/v2.1.0 Dec 15, 2022
@KKIEEK KKIEEK deleted the v2.1.0/custom branch December 15, 2022 04:09
KKIEEK added a commit that referenced this pull request Dec 19, 2022
* Bump ray from 1.9.1 to 2.1.0

* Fix deprecated warning

* Refactor

* Fix modules

* Fix requirements

* Fix test code

* Support custom trainer and backend (#91)

* Upgrade MMTask (#97)

* Fix minor (#100)

* Fix blocking issue at test_tasks.py

* Support single GPU tuning

* Bump FLAML to v1.0.14 to avoid deprecated warning

* Supplement documentations (#102)

* Support resume (#104)

Co-authored-by: Younghwan Na <100389977+yhna940@users.noreply.github.com>
Co-authored-by: Hakjin Lee <nijkah@gmail.com>