tpu initial release #1354

Merged: 5 commits merged into dstackai:master on Jun 26, 2024

Conversation

@Bihan (Collaborator) commented Jun 24, 2024

Fixes

  1. TPU is detected from actual workloads
    The --privileged flag is added to docker run.
  2. The env variable PJRT_DEVICE is set to TPU
    This is necessary; otherwise a warning is issued while running the training script. (An illustrative sketch of fixes 1 and 2 follows this list.)
  3. Ensure all single-VM TPUs can be used
    v2, v3, v4, v5p, and v5litepod are the different TPU versions provided by GCP. Except for v4, every TPU configuration of the form v{version}-{number} with {number} <= 8 (e.g. v2-8) is a single-VM TPU; all v4 configurations are TPU Pods.
    Note: To use TPU versions v4, v5p, and v5litepod, quotas must be requested.
  4. dstack-runner automatically sets the env variable LD_LIBRARY_PATH in the container
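
For context, here is a minimal, illustrative sketch (not the PR's actual shim code; see the diff excerpts further down) of what fixes 1 and 2 amount to in terms of the Docker Go SDK container types:

```go
// Illustrative sketch only, using the Docker Go SDK types; the PR's real
// createContainer code differs (see the diff excerpts in the review below).
package main

import (
	"fmt"

	"github.com/docker/docker/api/types/container"
)

func main() {
	cfg := &container.Config{
		// Fix 2: expose the PJRT runtime device to the workload.
		Env: []string{"PJRT_DEVICE=TPU"},
	}
	hostCfg := &container.HostConfig{
		// Fix 1: equivalent of `docker run --privileged`, needed for TPU device access.
		Privileged: true,
	}
	fmt.Println(cfg.Env, hostCfg.Privileged)
}
```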

Before Starting the Test

  1. Since dstack-shim and dstack-runner were modified, ensure that the latest dstack-shim-linux-amd64 and dstack-runner binaries are used.

How to test TPU

  1. % dstack run . -b gcp --gpu tpu-v2-8
  2. After provisioning is complete, set the env variable LD_LIBRARY_PATH in the container as below (a runner-side sketch of the same computation follows these steps):
    (workflow) root@t1v-n-345435e8-w-0:~ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(python3-config --prefix)/lib"
  3. (workflow) root@t1v-n-345435e8-w-0:~ pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  4. (workflow) root@t1v-n-345435e8-w-0:~ git clone --recursive https://github.com/pytorch/xla.git
  5. (workflow) root@t1v-n-345435e8-w-0:~ python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
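
A hypothetical sketch of how fix 4 (the runner setting LD_LIBRARY_PATH automatically) can derive the same value that step 2 exports by hand; the function name and error handling here are assumptions for illustration, not the runner's actual code:

```go
// Hypothetical sketch: derive LD_LIBRARY_PATH by appending
// "$(python3-config --prefix)/lib" to the current value, as in step 2 above.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func tpuLibraryPath() (string, error) {
	out, err := exec.Command("python3-config", "--prefix").Output()
	if err != nil {
		return "", err
	}
	prefix := strings.TrimSpace(string(out))
	current := os.Getenv("LD_LIBRARY_PATH")
	return fmt.Sprintf("%s:%s/lib", current, prefix), nil
}

func main() {
	path, err := tpuLibraryPath()
	if err != nil {
		fmt.Fprintln(os.Stderr, "python3-config not found:", err)
		return
	}
	fmt.Println("LD_LIBRARY_PATH=" + path)
}
```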

Test Using Task

python: "3.11"

commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

# (Optional) Configure `gpu`, `memory`, `disk`, etc
resources:
  gpu: tpu-v2-8



@@ -317,8 +317,12 @@ func createContainer(ctx context.Context, client docker.APIClient, runnerDir str
    Cmd:          []string{strings.Join(dockerParams.DockerShellCommands(taskConfig.PublicKeys), " && ")},
    Entrypoint:   []string{"/bin/sh", "-c"},
    ExposedPorts: exposePorts(dockerParams.DockerPorts()...),
    Env: []string{
        "PJRT_DEVICE=TPU",

Collaborator

Does it mean we always set PJRT_DEVICE=TPU? So even when not running TPU but CUDA? I think we should set PJRT_DEVICE=TPU when running on TPUs only.

Collaborator Author

@r4victor Should I set it as an optional flag in docker run, similar to the privileged_flag below?
nohup dstack-shim {dev_flag} docker --keep-container {privileged_flag} {pjrt_device} > {DSTACK_WORKING_DIR}/shim.log 2>&1 &
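
For concreteness, a minimal sketch of what such an optional shim argument could look like; the pjrt-device flag name is an assumption taken from this discussion, not the shim's actual CLI parsing:

```go
// Hypothetical sketch of an optional shim argument: an empty default means
// "no PJRT device", so nothing would be set for non-TPU runs.
package main

import (
	"flag"
	"fmt"
)

func main() {
	pjrtDevice := flag.String("pjrt-device", "", "value for PJRT_DEVICE inside the container (empty = unset)")
	flag.Parse()
	fmt.Println("pjrt-device:", *pjrtDevice)
}
```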

Collaborator

I'd set PJRT_DEVICE in the runner where we execute the job and set other envs:

jobEnvs := map[string]string{
    "RUN_NAME":              ex.run.RunName, // deprecated, remove in 0.19
    "REPO_ID":               ex.run.RepoId,  // deprecated, remove in 0.19
    "DSTACK_RUN_NAME":       ex.run.RunName,
    "DSTACK_REPO_ID":        ex.run.RepoId,
    "DSTACK_MASTER_NODE_IP": ex.clusterInfo.MasterJobIP,
    "DSTACK_NODE_RANK":      strconv.Itoa(node_rank),
    "DSTACK_NODES_NUM":      strconv.Itoa(nodes_num),
    "DSTACK_GPUS_PER_NODE":  strconv.Itoa(gpus_per_node_num),
    "DSTACK_GPUS_NUM":       strconv.Itoa(gpus_num),
}

This is good because we avoid introducing another place where we set envs. It would require passing whether a TPU is used or not to the runner API, but that should not be hard to do.

Setting PJRT_DEVICE via shim arg would probably work as well. Feel free to go this route if it works.
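
A minimal sketch of the first suggestion, with hypothetical names (the real executor struct and runner API differ): PJRT_DEVICE is added to jobEnvs only when the job actually runs on a TPU.

```go
// Hypothetical sketch: set PJRT_DEVICE in the runner's job envs only for TPU jobs,
// so CUDA runs never see the variable. Field names are assumptions.
package main

import "fmt"

type jobSpec struct {
	RunName string
	UsesTPU bool // assumption: the server would pass this via the runner API
}

func buildJobEnvs(job jobSpec) map[string]string {
	envs := map[string]string{
		"DSTACK_RUN_NAME": job.RunName,
	}
	if job.UsesTPU {
		envs["PJRT_DEVICE"] = "TPU"
	}
	return envs
}

func main() {
	fmt.Println(buildJobEnvs(jobSpec{RunName: "tpu-test", UsesTPU: true}))
	fmt.Println(buildJobEnvs(jobSpec{RunName: "cuda-test", UsesTPU: false}))
}
```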

Collaborator Author

@r4victor "This is good because we avoid introducing another place where we set envs." I think is a very valid point. I will set in the runner's jobEnvs

@@ -317,8 +317,12 @@ func createContainer(ctx context.Context, client docker.APIClient, runnerDir str
    Cmd:          []string{strings.Join(dockerParams.DockerShellCommands(taskConfig.PublicKeys), " && ")},
    Entrypoint:   []string{"/bin/sh", "-c"},
    ExposedPorts: exposePorts(dockerParams.DockerPorts()...),
    Env: []string{
        fmt.Sprintf("PJRT_DEVICE=%s", dockerParams.DockerPJRTDevice()),

Collaborator

If the pjrt-device arg is not set, we still set PJRT_DEVICE="", which may not be the same as not setting PJRT_DEVICE. Let's add PJRT_DEVICE to Env only if DockerPJRTDevice() returns a non-empty string?
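
A minimal sketch of the suggested guard, assuming an env slice like the one in the diff above; only a non-empty device value results in PJRT_DEVICE being added:

```go
// Sketch of the guard suggested above: append PJRT_DEVICE only when a non-empty
// value is configured, so an unset pjrt-device arg does not produce PJRT_DEVICE="".
package main

import "fmt"

func containerEnv(pjrtDevice string) []string {
	env := []string{} // other variables would already be appended here
	if pjrtDevice != "" {
		env = append(env, fmt.Sprintf("PJRT_DEVICE=%s", pjrtDevice))
	}
	return env
}

func main() {
	fmt.Println(containerEnv("TPU")) // [PJRT_DEVICE=TPU]
	fmt.Println(containerEnv(""))    // [] (PJRT_DEVICE left unset)
}
```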

r4victor merged commit e0d8906 into dstackai:master on Jun 26, 2024
15 checks passed