[torchx/local_scheduler] Set CUDA_VISIBLE_DEVICES correctly when running distributed job on local GPU #377
Closed
1 of 10 tasks
aivanou added a commit to aivanou/torchx-1 that referenced this issue on Feb 8, 2022
Summary: The diff adds automatic set of `CUDA_VISIBLE_DEVICES` based on `num_replicas`. Each replica gets the same number of devices. The alg. applies only when `CUDA_VISIBLE_DEVICES` is not set.
pytorch#297 pytorch#377
Differential Revision: D34064433
fbshipit-source-id: bce7f25cde2336de10b20ac8a37cc0d154e1b8c4
aivanou added a commit to aivanou/torchx-1 that referenced this issue on Feb 8, 2022
…rch#383)
Summary: Pull Request resolved: pytorch#383
The diff adds automatic set of `CUDA_VISIBLE_DEVICES` based on `num_replicas`. Each replica gets the same number of devices. The alg. applies only when `CUDA_VISIBLE_DEVICES` is not set. The diff uses `nvidia-smi` to determine the number of GPUs.
pytorch#297 pytorch#377
Differential Revision: D34064433
fbshipit-source-id: 788ce92b0ad79e24f4be22bb2d5e9f784f25004b
aivanou added a commit to aivanou/torchx-1 that referenced this issue on Feb 22, 2022
…rch#383)
Summary: Pull Request resolved: pytorch#383
The diff adds automatic set of `CUDA_VISIBLE_DEVICES` based on `num_replicas`. Each replica gets the same number of devices. The alg. applies only when `CUDA_VISIBLE_DEVICES` is not set. The diff uses `nvidia-smi` to determine the number of GPUs.
pytorch#297 pytorch#377
Reviewed By: kiukchung
Differential Revision: D34064433
fbshipit-source-id: 03719702eeaff1b8f5dfcc0c9cd36d54ff660499
facebook-github-bot pushed a commit that referenced this issue on Feb 22, 2022
Summary: Pull Request resolved: #383
The diff adds automatic set of `CUDA_VISIBLE_DEVICES` based on `num_replicas`. Each replica gets the same number of devices. The alg. applies only when `CUDA_VISIBLE_DEVICES` is not set. The diff uses `nvidia-smi` to determine the number of GPUs.
#297 #377
Reviewed By: kiukchung
Differential Revision: D34064433
fbshipit-source-id: 7ec24c9707e2133fafd6747b4357960b9dd0e253
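
The commit summary names two preconditions: the assignment runs only when `CUDA_VISIBLE_DEVICES` is not already set, and the GPU count comes from `nvidia-smi`. A minimal sketch of how those two checks might look is below. It is not the code from the referenced pull request. The function names are hypothetical, and the use of `nvidia-smi -L` (which prints one `GPU <index>: ...` line per device) is an assumption about how the count could be obtained.

```python
import shutil
import subprocess
from typing import Mapping


def parse_gpu_count(nvidia_smi_list_output: str) -> int:
    """Count GPUs in the text printed by `nvidia-smi -L` (hypothetical helper).

    Each physical GPU is reported on a line starting with "GPU ";
    other lines (for example indented MIG entries) are ignored.
    """
    return sum(
        1
        for line in nvidia_smi_list_output.splitlines()
        if line.startswith("GPU ")
    )


def count_gpus() -> int:
    """Return the number of GPUs reported by nvidia-smi, or 0 if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0  # no NVIDIA tooling on this host
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    return parse_gpu_count(result.stdout)


def should_auto_assign(env: Mapping[str, str]) -> bool:
    """Only assign devices when the user has not set CUDA_VISIBLE_DEVICES.

    Assumption: any value, even an empty string, counts as "set".
    """
    return "CUDA_VISIBLE_DEVICES" not in env
```

Keeping the parsing separate from the subprocess call lets the counting logic be checked on machines without a GPU.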
🐛 Bug
Running DDP on a devgpu with 4 GPUs with `--nprocs_per_node=2` and `--nnodes=2` does not work when the script uses `LOCAL_RANK` to set the CUDA device.

Module (check all that applies):
torchx.spec
torchx.component
torchx.apps
torchx.runtime
torchx.cli
torchx.schedulers
torchx.pipelines
torchx.aws
torchx.examples
other
To Reproduce
See description above, easily repros with a training script:
try running the above with
Expected behavior
The TorchX local scheduler should set `CUDA_VISIBLE_DEVICES=0,1` on the first two workers and `CUDA_VISIBLE_DEVICES=2,3` on the next two workers.
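
To make the expected split concrete, here is a minimal sketch of the arithmetic, assuming GPUs are divided evenly and contiguously among replicas. The function name `partition_devices` is hypothetical and is not taken from TorchX.

```python
from typing import List


def partition_devices(num_gpus: int, num_replicas: int) -> List[str]:
    """Return one CUDA_VISIBLE_DEVICES value per replica.

    Each replica gets the same number of contiguous device ids.
    Any GPUs left over after the even split are not assigned.
    """
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    per_replica = num_gpus // num_replicas
    if per_replica == 0:
        raise ValueError("fewer GPUs than replicas")
    return [
        ",".join(
            str(device)
            for device in range(r * per_replica, (r + 1) * per_replica)
        )
        for r in range(num_replicas)
    ]


# 4 GPUs shared by 2 replicas (the --nnodes=2 case from this issue)
print(partition_devices(4, 2))  # ['0,1', '2,3']
```

Because CUDA renumbers the visible devices starting from 0, a script that selects its device from `LOCAL_RANK` (0 or 1 in each replica) would then land on physical GPUs 0 and 1 in the first replica and 2 and 3 in the second, instead of both replicas contending for GPUs 0 and 1.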
Environment
Additional context