
[torchx/local_scheduler] Set CUDA_VISIBLE_DEVICES correctly when running distributed job on local GPU #377

Closed
@kiukchung

Description


🐛 Bug

Running DDP on a devgpu with 4 GPUs using --nprocs_per_node=2 and --nnodes=2 does not work when the script uses LOCAL_RANK to set the CUDA device.

torchx run dist.ddp -j 2x2
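
For context, here is an illustrative sketch (plain Python, not TorchX internals) of the collision: with -j 2x2 on a single host, both simulated nodes spawn workers with LOCAL_RANK 0 and 1, so every worker maps onto GPUs 0 and 1 while GPUs 2 and 3 sit idle:

# Illustrative only: LOCAL_RANK collides across the two simulated nodes
# when CUDA_VISIBLE_DEVICES is not partitioned per node.
for node in range(2):            # --nnodes=2, both on the same devgpu
    for local_rank in range(2):  # --nprocs_per_node=2
        # each worker calls torch.cuda.set_device(LOCAL_RANK), so both
        # nodes target cuda:0 and cuda:1; GPUs 2 and 3 are never used
        print(f"node={node} local_rank={local_rank} -> cuda:{local_rank}")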

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

See the description above; this easily reproduces with a minimal training script:

import os
import torch

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

Try running the above with:

torchx run dist.ddp -j 2x2 --script main.py

Expected behavior

The TorchX local scheduler should set CUDA_VISIBLE_DEVICES=0,1 on the first two workers, and CUDA_VISIBLE_DEVICES=2,3 on the next two workers.
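
A minimal sketch of the expected assignment, giving each replica a contiguous block of the host's GPUs (the helper name and signature below are illustrative, not the actual local_scheduler API):

# Hypothetical helper, not the actual local_scheduler code: assign each
# replica (simulated node) a contiguous block of the host's GPUs.
def cuda_visible_devices(replica_id: int, nproc_per_node: int) -> str:
    start = replica_id * nproc_per_node
    return ",".join(str(gpu) for gpu in range(start, start + nproc_per_node))

assert cuda_visible_devices(0, 2) == "0,1"  # first node's workers
assert cuda_visible_devices(1, 2) == "2,3"  # second node's workers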

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context
