[k8s] On-demand single-host TPU support on GKE #3947

Merged

Conversation

@landscapepainter (Collaborator) commented Sep 16, 2024

One of our users requested a feature to use spot TPUs on GKE. This is an initial step toward that request, adding support for on-demand single-host TPUs.

This PR does not include support for:

  • multi-host TPUs
  • the autoscaler
  • spot TPUs

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Using a GKE cluster with 2 single-host TPU podslices of 1x1 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice --num-nodes 2 -y
      • sky launch --cloud kubernetes --cpus=2 -y
      • sky show-gpus --cloud kubernetes
    • Using a GKE cluster with 1 single-host TPU podslice of 2x2 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice:4 -y
  • All smoke tests: pytest tests/test_smoke.py, except the ones also failing on the master branch:
    • test_managed_jobs_storage
    • test_multiple_accelerators_unordered_with_default
    • test_skyserve_user_bug_restart
  • Relevant individual smoke tests:
    • pytest tests/test_smoke.py::test_tpu_pod_slice_gke --kubernetes
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

tpu_gke.yaml:

file_mounts:
  /result:
    store: gcs
    name: tpu-mount-test-dy
    mode: MOUNT

setup: |
  git clone https://github.com/google/flax.git --branch v0.8.2

  conda activate flax
  if [ $? -eq 0 ]; then
    echo 'conda env exists'
  else
    conda create -n flax python=3.10 -y
    conda activate flax
    # Make sure to install TPU related packages in a conda env to avoid package conflicts.
    pip install \
      -f https://storage.googleapis.com/jax-releases/libtpu_releases.html "jax[tpu]==0.4.25" \
      clu \
      tensorflow tensorflow-datasets
    pip install -e flax
  fi

run: |
  conda activate flax
  pip install clu
  cd flax/examples/mnist
  python3 main.py --workdir=/tmp/mnist \
    --config=configs/default.py \
    --config.learning_rate=0.05 \
    --config.num_epochs=10 >> /result/output.log 2>&1

@landscapepainter changed the title from "[k8s] TPU support on GKE" to "[k8s] on-demand TPU support on GKE" on Sep 16, 2024
@landscapepainter marked this pull request as draft on September 16, 2024 09:25
@landscapepainter (Collaborator, Author) commented:

Also, when I try to launch with 2 TPUs, it seems the error does not illustrate the real reason (e.g. only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result like this?

@cblmemo This seems to be consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason; it just does not show the fuzzy candidates. Is the fuzzy result supposed to be displayed when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

@landscapepainter (Collaborator, Author) commented:

I tried to launch a task on an existing cluster but failed with the following error. Manually commenting out accelerators: tpu-v2-8 in the example YAML resolved the issue for me, but it seems we have a bug in inferring the cloud for resources when the CLI and the task YAML are inconsistent. Could you take a look at what is happening here?

@cblmemo sky/task.py::Task.set_resources_override is setting new_resources to GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'}), which does not exist, and this results in the issue you are encountering. It seems that overriding resources should update the cloud from GCP to Kubernetes as well, but such logic doesn't seem to exist. Do we currently allow this in SkyPilot?
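
To illustrate the idea, here is a minimal, hypothetical sketch (the Resources dataclass and the infer_clouds_for_accelerator helper are made up for illustration and are not SkyPilot's actual internals): when an accelerator override makes the previously chosen cloud infeasible, the override should drop or re-infer the cloud instead of keeping it.

# Hypothetical sketch of cloud-aware resource overriding; not SkyPilot code.
from dataclasses import dataclass, replace
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Resources:
    cloud: Optional[str] = None
    accelerators: Optional[Dict[str, int]] = None

def infer_clouds_for_accelerator(acc_name: str) -> List[str]:
    # Placeholder for a catalog lookup of which clouds offer the accelerator.
    catalog = {'tpu-v5-lite-podslice': ['kubernetes'], 'A100': ['gcp', 'aws', 'azure']}
    return catalog.get(acc_name, [])

def set_resources_override(res: Resources, **overrides) -> Resources:
    new_res = replace(res, **overrides)
    if 'accelerators' in overrides and new_res.accelerators:
        acc = next(iter(new_res.accelerators))
        feasible = infer_clouds_for_accelerator(acc)
        # If the previously chosen cloud cannot serve the new accelerator,
        # re-infer it (or leave it unset for the optimizer to decide).
        if new_res.cloud is not None and new_res.cloud not in feasible:
            new_res = replace(new_res, cloud=feasible[0] if len(feasible) == 1 else None)
    return new_res

# Example: overriding a GCP TPU resource with a GKE-only podslice moves the
# cloud to Kubernetes instead of producing a nonexistent GCP resource.
print(set_resources_override(Resources(cloud='gcp'), accelerators={'tpu-v5-lite-podslice': 4}))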

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not illustrate the real reason (e.g. only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result like this?

@cblmemo This seems to be consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason; it just does not show the fuzzy candidates. Is the fuzzy result supposed to be displayed when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not illustrate the real reason (e.g. only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result like this?

@cblmemo This seems to be consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason; it just does not show the fuzzy candidates. Is the fuzzy result supposed to be displayed when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ sky launch @temp/a.yaml --instance-type n2-standard-8
Task from YAML spec: @temp/a.yaml
ValueError: Invalid instance type 'n2-standard-8' for cloud AWS.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ cat @temp/a.yaml
resources:
  cloud: aws

At least we should show such error information, right? The current error is a bit confusing to me.

Also, the current conflict is between two auto-filled clouds. If the user explicitly set the cloud in the YAML and caused a conflict, that would sound reasonable to me. But I would be surprised if I didn't set the cloud at all and two of SkyPilot's auto-inferred clouds still conflicted.

@landscapepainter (Collaborator, Author) commented Nov 2, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g. the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the TPU count per host is)?

Also, the above example failed for me. Could you check this as well?

@cblmemo Not sure why that error was printing info regarding a TPU with 4 chips, but it's fixed at 688c0b4.
sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml does not work due to the issue mentioned here, but specifying the cloud as Kubernetes now works: sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml:

$ sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml

...

(tpuvm_mnist, pid=2036) I1102 22:44:01.655345 136981829304960 train.py:148] epoch:  7, train_loss: 0.0167, train_accuracy: 99.47, test_loss: 0.0266, test_accuracy: 99.18
(tpuvm_mnist, pid=2036) I1102 22:44:03.135087 136981829304960 train.py:148] epoch:  8, train_loss: 0.0134, train_accuracy: 99.58, test_loss: 0.0260, test_accuracy: 99.16
(tpuvm_mnist, pid=2036) I1102 22:44:04.615064 136981829304960 train.py:148] epoch:  9, train_loss: 0.0117, train_accuracy: 99.65, test_loss: 0.0248, test_accuracy: 99.21
(tpuvm_mnist, pid=2036) I1102 22:44:06.100036 136981829304960 train.py:148] epoch: 10, train_loss: 0.0086, train_accuracy: 99.75, test_loss: 0.0268, test_accuracy: 99.14
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:		sky cancel gke-tpu-4 1
├── To stream job logs:		sky logs gke-tpu-4 1
└── To view job queue:		sky queue gke-tpu-4

Cluster name: gke-tpu-4
├── To log into the head VM:	ssh gke-tpu-4
├── To submit a job:		sky exec gke-tpu-4 yaml_file
├── To stop the cluster:	sky stop gke-tpu-4
└── To teardown the cluster:	sky down gke-tpu-4
Tip: `sky down` will delete launched TPU(s) too.

Also, the job does not detect 4 TPU chips when a pod with 1 TPU chip is provisioned:

Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> print(jax.devices());
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0)]
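
As a quick sanity check, the requested chip count can be asserted from inside the job; this is a small hedged snippet that just uses the same jax.devices() call as the REPL output above, with the expected count filled in by the user.

# Sanity check: fail fast if the job sees more (or fewer) TPU chips than requested.
import jax

expected_chips = 1  # set to the count requested via --gpus tpu-v5-lite-podslice:<n>
devices = jax.devices()
print(f'Visible TPU devices: {devices}')
assert len(devices) == expected_chips, (
    f'Expected {expected_chips} TPU chip(s), but JAX sees {len(devices)}.')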

@landscapepainter (Collaborator, Author) commented Nov 3, 2024

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

I just tested this out again by specifying --cloud aws, and it's exactly the same as the Kubernetes error displayed on your end. We get the error you see below,

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

because SkyPilot knows that tpu-v5-lite-podslice is only available on Kubernetes, unlike A100.

So the fuzzy error message is supposed to appear only when the cloud is not specified and the accelerator is available in multiple clouds. I guess there isn't anything to file an issue for?
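
To summarize the behavior described above as a hypothetical sketch (the clouds_offering helper is made up; this is not SkyPilot's actual failover or error-reporting code):

# Fuzzy candidates are only worth showing when the user did not pin a cloud
# and the accelerator is offered by more than one cloud; otherwise the plain
# "no resource satisfying ..." error already names the real reason.
from typing import List, Optional

def clouds_offering(acc_name: str) -> List[str]:
    # Placeholder catalog lookup.
    catalog = {'A100': ['gcp', 'aws', 'azure'], 'tpu-v5-lite-podslice': ['kubernetes']}
    return catalog.get(acc_name, [])

def should_show_fuzzy_candidates(requested_cloud: Optional[str], acc_name: str) -> bool:
    return requested_cloud is None and len(clouds_offering(acc_name)) > 1

print(should_show_fuzzy_candidates(None, 'A100'))                  # True
print(should_show_fuzzy_candidates('kubernetes', 'A100'))          # False
print(should_show_fuzzy_candidates(None, 'tpu-v5-lite-podslice'))  # False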

@landscapepainter (Collaborator, Author) commented:

@cblmemo @romilbhardwaj This is ready for another round. Thanks!!

@cblmemo (Collaborator) commented Nov 9, 2024

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

I just tested this out again by specifying --cloud aws, and it's exactly the same as the Kubernetes error displayed on your end. We get the error you see below,

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

because SkyPilot knows that tpu-v5-lite-podslice is only available on Kubernetes, unlike A100.

So the fuzzy error message is supposed to appear only when the cloud is not specified and the accelerator is available in multiple clouds. I guess there isn't anything to file an issue for?

I see. LGTM

@cblmemo (Collaborator) commented Nov 9, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g. the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the TPU count per host is)?

Also, the above example failed for me. Could you check this as well?

@cblmemo Not sure why that error was printing info regarding a TPU with 4 chips, but it's fixed at 688c0b4. sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml does not work due to the issue mentioned here, but specifying the cloud as Kubernetes now works: sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml:

$ sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml

...

(tpuvm_mnist, pid=2036) I1102 22:44:01.655345 136981829304960 train.py:148] epoch:  7, train_loss: 0.0167, train_accuracy: 99.47, test_loss: 0.0266, test_accuracy: 99.18
(tpuvm_mnist, pid=2036) I1102 22:44:03.135087 136981829304960 train.py:148] epoch:  8, train_loss: 0.0134, train_accuracy: 99.58, test_loss: 0.0260, test_accuracy: 99.16
(tpuvm_mnist, pid=2036) I1102 22:44:04.615064 136981829304960 train.py:148] epoch:  9, train_loss: 0.0117, train_accuracy: 99.65, test_loss: 0.0248, test_accuracy: 99.21
(tpuvm_mnist, pid=2036) I1102 22:44:06.100036 136981829304960 train.py:148] epoch: 10, train_loss: 0.0086, train_accuracy: 99.75, test_loss: 0.0268, test_accuracy: 99.14
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:		sky cancel gke-tpu-4 1
├── To stream job logs:		sky logs gke-tpu-4 1
└── To view job queue:		sky queue gke-tpu-4

Cluster name: gke-tpu-4
├── To log into the head VM:	ssh gke-tpu-4
├── To submit a job:		sky exec gke-tpu-4 yaml_file
├── To stop the cluster:	sky stop gke-tpu-4
└── To teardown the cluster:	sky down gke-tpu-4
Tip: `sky down` will delete launched TPU(s) too.

Also, the job does not detect 4 TPU chips when a pod with 1 TPU chip is provisioned:

Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> print(jax.devices());
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0)]

Oh sorry, I meant launching with --gpus tpu:1 on a cluster provisioned with --gpus tpu:4. IIRC, for GPU clusters we limit the number of GPUs visible to such a job to only 1, so PyTorch will only detect and use one GPU. I'm not sure if TPUs provide this type of isolation, but if not, maybe we should error out or at least print a warning when launching with fewer TPUs than the cluster has.

cc @romilbhardwaj for a look here
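
For reference, the GPU-side behavior mentioned above is commonly implemented by restricting device visibility per job through CUDA_VISIBLE_DEVICES. The sketch below only illustrates that idea (it is not SkyPilot's scheduler code, and whether libtpu honors an equivalent per-job chip mask is exactly the open question here):

# Sketch: expose only a subset of GPUs to a job by masking device visibility.
import os
import subprocess
from typing import List

def run_job_with_gpu_limit(cmd: str, gpu_ids: List[int]) -> int:
    env = os.environ.copy()
    # PyTorch/TensorFlow in the child process will only see these GPU ids.
    env['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)
    return subprocess.run(cmd, shell=True, env=env).returncode

# Example: the job sees only GPU 0 even if the node has 4 GPUs.
# run_job_with_gpu_limit('python train.py', [0])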

@cblmemo (Collaborator) commented Nov 9, 2024

To reproduce:

$ sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v5-lite-podslice:4
Task from YAML spec: examples/tpu/tpuvm_mnist.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                           vCPUs   Mem(GB)   ACCELERATORS             REGION/ZONE                                         COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--4tpu-v5-lite-podslice   2       8         tpu-v5-lite-podslice:4   gke_skypilot-375900_us-south1-a_mix-tpu-test-txia   0.00          ✔     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-3ccc-txia'. Proceed? [Y/n]: 
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia.  View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/provision.log
⚙︎ Running setup on 1 pod.
✓ Setup completed.  View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
✓ Job finished (status: SUCCEEDED).

Job ID: 1
📋 Useful Commands
├── To cancel the job:          sky cancel sky-3ccc-txia 1
├── To stream job logs:         sky logs sky-3ccc-txia 1
└── To view job queue:          sky queue sky-3ccc-txia

Cluster name: sky-3ccc-txia
├── To log into the head VM:    ssh sky-3ccc-txia
├── To submit a job:            sky exec sky-3ccc-txia yaml_file
├── To stop the cluster:        sky stop sky-3ccc-txia
└── To teardown the cluster:    sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.

$ sky launch --gpus tpu-v5-lite-podslice:2 -c sky-3ccc-txia 'conda activate flax; python -c "import jax; print(jax.devices())"'
Task from command: conda activate flax; python -c "import jax; print(jax.devices())"
Running task on cluster sky-3ccc-txia...
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia.  View logs at: ~/sky_logs/sky-2024-11-09-14-05-11-848087/provision.log
⚙︎ Job submitted, ID: 3
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=34717) [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
✓ Job finished (status: SUCCEEDED).

Job ID: 3
📋 Useful Commands
├── To cancel the job:          sky cancel sky-3ccc-txia 3
├── To stream job logs:         sky logs sky-3ccc-txia 3
└── To view job queue:          sky queue sky-3ccc-txia

Cluster name: sky-3ccc-txia
├── To log into the head VM:    ssh sky-3ccc-txia
├── To submit a job:            sky exec sky-3ccc-txia yaml_file
├── To stop the cluster:        sky stop sky-3ccc-txia
└── To teardown the cluster:    sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.

@cblmemo (Collaborator) left a comment

Thanks @landscapepainter! Tested it again and it works smoothly. Once the above-mentioned issue is resolved, it should be ready to go!

@romilbhardwaj (Collaborator) commented:

Oh sorry, I meant launching with --gpus tpu:1 on a cluster provisioned with --gpus tpu:4. IIRC, for GPU clusters we limit the number of GPUs visible to such a job to only 1, so PyTorch will only detect and use one GPU. I'm not sure if TPUs provide this type of isolation, but if not, maybe we should error out or at least print a warning when launching with fewer TPUs than the cluster has.

I think our current cloud TPUs also behave the same way, so we allow --gpus tpu:1 on a cluster provisioned with --gpus tpu:4. We can maybe leave a TODO somewhere in the code to keep track.
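
Such a TODO could eventually become a small check like the following (a hedged sketch only; the function and where it would be called from are hypothetical, not existing SkyPilot code):

# Hypothetical warning when a job requests fewer TPU chips than the
# single-host podslice exposes, since no per-job chip isolation is assumed.
import warnings

def maybe_warn_partial_tpu_request(requested_chips: int, chips_per_host: int) -> None:
    if requested_chips < chips_per_host:
        warnings.warn(
            f'Requested {requested_chips} TPU chip(s), but the podslice host '
            f'exposes {chips_per_host}; the job may still see all '
            f'{chips_per_host} chips (e.g. via jax.devices()).')

maybe_warn_partial_tpu_request(1, 4)  # emits the warning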

@landscapepainter added this pull request to the merge queue on Nov 13, 2024
Merged via the queue into skypilot-org:master with commit eea13cc Nov 13, 2024
20 checks passed
@landscapepainter deleted the k8s-tpu-support-on-gke branch on November 13, 2024 04:29