[torchx/local_scheduler] Set CUDA_VISIBLE_DEVICES correctly when running distributed job on local GPU #377

Closed
1 of 10 tasks
kiukchung opened this issue Jan 27, 2022 · 0 comments

@kiukchung
Contributor

🐛 Bug

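Running DDP on a devgpu with 4 GPUs with --nprocs_per_node=2 and --nnodes=2 does not work when the script uses LOCAL_RANK to set the CUDA device. Both nodes run on the same host, so each node's workers get LOCAL_RANK 0 and 1, and all four workers select GPUs 0 and 1 while GPUs 2 and 3 stay idle.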

torchx run dist.ddp -j 2x2
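
To make the collision concrete, here is a minimal sketch in plain Python (no GPU required). The helper `physical_gpu` is hypothetical and only models how CUDA maps a process-local device index onto a physical GPU through the `CUDA_VISIBLE_DEVICES` mask; it is not part of TorchX or PyTorch.

```python
from typing import Optional


def physical_gpu(local_rank: int, cuda_visible_devices: Optional[str] = None) -> int:
    """Physical GPU a worker lands on when it calls set_device(local_rank).

    Hypothetical helper: with no mask the local index is the physical index;
    with a mask, the local index selects an entry of the mask.
    """
    if cuda_visible_devices is None:
        return local_rank
    visible = [int(d) for d in cuda_visible_devices.split(",")]
    return visible[local_rank]


NNODES, NPROC_PER_NODE = 2, 2  # the "-j 2x2" layout from the report

# Without a mask, both nodes reuse LOCAL_RANK 0 and 1 on the same host.
unmasked = [physical_gpu(lr) for _ in range(NNODES) for lr in range(NPROC_PER_NODE)]

# With one mask per node, the same LOCAL_RANK values map to distinct GPUs.
masks = ["0,1", "2,3"]
masked = [
    physical_gpu(lr, masks[node])
    for node in range(NNODES)
    for lr in range(NPROC_PER_NODE)
]

print(unmasked)  # [0, 1, 0, 1]
print(masked)    # [0, 1, 2, 3]
```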

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

See the description above. This reproduces easily with a training script such as:

import os
import torch

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

Try running the above with:

torchx run dist.ddp -j 2x2 main.py

Expected behavior

The TorchX local scheduler should set CUDA_VISIBLE_DEVICES=0,1 on the first two workers, and CUDA_VISIBLE_DEVICES=2,3 on the next two workers.
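
The expected assignment can be sketched as an even, contiguous split of device ids across replicas. This is only an illustration of the behavior requested here: the function name `partition_devices` and its error handling are assumptions, not the TorchX implementation.

```python
from typing import List


def partition_devices(num_devices: int, num_replicas: int) -> List[str]:
    """Return one CUDA_VISIBLE_DEVICES value per replica.

    Hypothetical helper: device ids 0..num_devices-1 are split into equal
    contiguous groups; leftover devices (if any) are not assigned.
    """
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    per_replica = num_devices // num_replicas
    if per_replica == 0:
        raise ValueError("more replicas than available devices")
    return [
        ",".join(str(d) for d in range(r * per_replica, (r + 1) * per_replica))
        for r in range(num_replicas)
    ]


# 4 GPUs shared by 2 replicas, as in the report.
print(partition_devices(4, 2))  # ['0,1', '2,3']
```

Each replica then sees only its own devices, so a script that calls `torch.cuda.set_device` with LOCAL_RANK keeps working unchanged.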

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

kiukchung added this to the 0.1.2 release milestone Jan 27, 2022
aivanou added a commit to aivanou/torchx-1 that referenced this issue Feb 8, 2022
Summary:
The diff automatically sets `CUDA_VISIBLE_DEVICES` based on `num_replicas`.

Each replica gets the same number of devices.

The algorithm applies only when `CUDA_VISIBLE_DEVICES` is not already set.

pytorch#297

pytorch#377

Differential Revision: D34064433

fbshipit-source-id: bce7f25cde2336de10b20ac8a37cc0d154e1b8c4
aivanou added a commit to aivanou/torchx-1 that referenced this issue Feb 8, 2022
…rch#383)

Summary:
Pull Request resolved: pytorch#383

The diff automatically sets `CUDA_VISIBLE_DEVICES` based on `num_replicas`.

Each replica gets the same number of devices.

The algorithm applies only when `CUDA_VISIBLE_DEVICES` is not already set.
The diff uses `nvidia-smi` to determine the number of GPUs.

pytorch#297

pytorch#377

Differential Revision: D34064433

fbshipit-source-id: 788ce92b0ad79e24f4be22bb2d5e9f784f25004b
aivanou added a commit to aivanou/torchx-1 that referenced this issue Feb 22, 2022
…rch#383)

Summary:
Pull Request resolved: pytorch#383

The diff automatically sets `CUDA_VISIBLE_DEVICES` based on `num_replicas`.

Each replica gets the same number of devices.

The algorithm applies only when `CUDA_VISIBLE_DEVICES` is not already set.
The diff uses `nvidia-smi` to determine the number of GPUs.

pytorch#297

pytorch#377

Reviewed By: kiukchung

Differential Revision: D34064433

fbshipit-source-id: 03719702eeaff1b8f5dfcc0c9cd36d54ff660499
facebook-github-bot pushed a commit that referenced this issue Feb 22, 2022
Summary:
Pull Request resolved: #383

The diff automatically sets `CUDA_VISIBLE_DEVICES` based on `num_replicas`.

Each replica gets the same number of devices.

The algorithm applies only when `CUDA_VISIBLE_DEVICES` is not already set.
The diff uses `nvidia-smi` to determine the number of GPUs.

#297

#377

Reviewed By: kiukchung

Differential Revision: D34064433

fbshipit-source-id: 7ec24c9707e2133fafd6747b4357960b9dd0e253
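
The commit summary names three rules: count GPUs with `nvidia-smi`, give every replica the same number of devices, and leave an existing `CUDA_VISIBLE_DEVICES` alone. A rough sketch of those rules follows. The function names, the `gpu_count` override, and the choice of `nvidia-smi -L` (which prints one `GPU n: ...` line per device) are assumptions made for illustration; the actual diff may differ.

```python
import shutil
import subprocess
from typing import Dict, Mapping, Optional


def count_gpus() -> int:
    """Count GPUs from `nvidia-smi -L`; return 0 if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


def cuda_env_for_replica(
    replica_id: int,
    num_replicas: int,
    base_env: Mapping[str, str],
    gpu_count: Optional[int] = None,  # hypothetical override, handy for tests
) -> Dict[str, str]:
    """Return a copy of base_env with CUDA_VISIBLE_DEVICES filled in if absent."""
    env = dict(base_env)
    if "CUDA_VISIBLE_DEVICES" in env:
        return env  # respect a value the user already set
    total = count_gpus() if gpu_count is None else gpu_count
    per_replica = total // num_replicas
    if per_replica == 0:
        return env  # no GPUs, or fewer GPUs than replicas: leave env untouched
    start = replica_id * per_replica
    env["CUDA_VISIBLE_DEVICES"] = ",".join(
        str(d) for d in range(start, start + per_replica)
    )
    return env


# Second of two replicas on a 4-GPU host.
print(cuda_env_for_replica(1, 2, {}, gpu_count=4))  # {'CUDA_VISIBLE_DEVICES': '2,3'}
```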