[k8s] On-demand single-host TPU support on GKE #3947

Merged

Conversation

@landscapepainter (Collaborator) commented Sep 16, 2024

One of our users requested a feature to use spot TPUs on GKE. This is an initial step toward that request, adding support for on-demand single-host TPUs.

This PR does not include support for:

  • multi-host TPUs
  • the autoscaler
  • spot TPUs

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Using a GKE cluster with 2 single-host TPU podslices of 1x1 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice --num-nodes 2 -y
      • sky launch --cloud kubernetes --cpus=2 -y
      • sky show-gpus --cloud kubernetes
    • Using a GKE cluster with 1 single-host TPU podslice of 2x2 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice:4 -y
  • All smoke tests: pytest tests/test_smoke.py, except the ones also failing on the master branch:
    • test_managed_jobs_storage
    • test_multiple_accelerators_unordered_with_default
    • test_skyserve_user_bug_restart
  • Relevant individual smoke tests:
    • pytest tests/test_smoke.py::test_tpu_pod_slice_gke --kubernetes
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

tpu_gke.yaml:

file_mounts:
  /result:
    store: gcs
    name: tpu-mount-test-dy
    mode: MOUNT

setup: |
  git clone https://github.com/google/flax.git --branch v0.8.2

  conda activate flax
  if [ $? -eq 0 ]; then
    echo 'conda env exists'
  else
    conda create -n flax python=3.10 -y
    conda activate flax
    # Make sure to install TPU related packages in a conda env to avoid package conflicts.
    pip install \
      -f https://storage.googleapis.com/jax-releases/libtpu_releases.html "jax[tpu]==0.4.25" \
      clu \
      tensorflow tensorflow-datasets
    pip install -e flax
  fi

run: |
  conda activate flax
  pip install clu
  cd flax/examples/mnist
  python3 main.py --workdir=/tmp/mnist \
    --config=configs/default.py \
    --config.learning_rate=0.05 \
    --config.num_epochs=10 >> /result/output.log 2>&1

@landscapepainter changed the title from "[k8s] TPU support on GKE" to "[k8s] on-demand TPU support on GKE" on Sep 16, 2024
@landscapepainter marked this pull request as draft on September 16, 2024 09:25
@landscapepainter (Collaborator, Author) commented:

Also, when I try to launch with 2 TPUs, it seems the error does not illustrate the real reason (e.g. only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result like this?

@cblmemo This seems to be consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason; it just does not show the fuzzy candidates. Is the fuzzy result supposed to be displayed when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

@landscapepainter (Collaborator, Author) commented:

I tried to launch a task on an existing cluster but failed with the following error. Manually commenting out accelerators: tpu-v2-8 in the example YAML resolved the issue for me, but it seems we have a bug in inferring the cloud for resources when the CLI and the task YAML are inconsistent. Could you take a look at what is happening here?

@cblmemo sky/task.py::Task.set_resources_override is setting new_resources to GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'}), which does not exist, and this results in the issue you are encountering. It seems that overriding resources should update the cloud from GCP to Kubernetes as well, but such logic doesn't seem to exist. Do we currently allow this in SkyPilot?
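
To illustrate the idea, here is a minimal, hypothetical sketch (the Resources dataclass and the infer_clouds_for_accelerator helper are made up for illustration and are not SkyPilot's actual internals): when an accelerator override makes the previously chosen cloud infeasible, the override should drop or re-infer the cloud instead of keeping it.

# Hypothetical sketch of cloud-aware resource overriding; not SkyPilot code.
from dataclasses import dataclass, replace
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Resources:
    cloud: Optional[str] = None
    accelerators: Optional[Dict[str, int]] = None

def infer_clouds_for_accelerator(acc_name: str) -> List[str]:
    # Placeholder for a catalog lookup of which clouds offer the accelerator.
    catalog = {'tpu-v5-lite-podslice': ['kubernetes'], 'A100': ['gcp', 'aws', 'azure']}
    return catalog.get(acc_name, [])

def set_resources_override(res: Resources, **overrides) -> Resources:
    new_res = replace(res, **overrides)
    if 'accelerators' in overrides and new_res.accelerators:
        acc = next(iter(new_res.accelerators))
        feasible = infer_clouds_for_accelerator(acc)
        # If the previously chosen cloud cannot serve the new accelerator,
        # re-infer it (or leave it unset for the optimizer to decide).
        if new_res.cloud is not None and new_res.cloud not in feasible:
            new_res = replace(new_res, cloud=feasible[0] if len(feasible) == 1 else None)
    return new_res

# Example: overriding a GCP TPU resource with a GKE-only podslice moves the
# cloud to Kubernetes instead of producing a nonexistent GCP resource.
print(set_resources_override(Resources(cloud='gcp'), accelerators={'tpu-v5-lite-podslice': 4}))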

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not illustrate the real reason (e.g. only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result like this?

@cblmemo This seems to be consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason; it just does not show the fuzzy candidates. Is the fuzzy result supposed to be displayed when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not illustrate the real reason (e.g. only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result like this?

@cblmemo This seems to be consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason; it just does not show the fuzzy candidates. Is the fuzzy result supposed to be displayed when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ sky launch @temp/a.yaml --instance-type n2-standard-8
Task from YAML spec: @temp/a.yaml
ValueError: Invalid instance type 'n2-standard-8' for cloud AWS.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ cat @temp/a.yaml
resources:
  cloud: aws

At least we should show such error information, right? The current error is a bit confusing to me.

Also, the current conflict is between two auto-filled clouds. If the user explicitly set the cloud in the YAML and caused a conflict, that would sound reasonable to me. But I would be surprised if I didn't set the cloud at all and two of SkyPilot's auto-inferred clouds still conflicted.

@landscapepainter (Collaborator, Author) commented Nov 2, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g. the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the TPU count per host is)?

Also, the above example failed for me. Could you check this as well?

@cblmemo Not sure why that error was printing info regarding a TPU with 4 chips, but it's fixed at 688c0b4.
sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml does not work due to the issue mentioned here, but specifying the cloud as Kubernetes now works: sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml:

$ sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml

...

(tpuvm_mnist, pid=2036) I1102 22:44:01.655345 136981829304960 train.py:148] epoch:  7, train_loss: 0.0167, train_accuracy: 99.47, test_loss: 0.0266, test_accuracy: 99.18
(tpuvm_mnist, pid=2036) I1102 22:44:03.135087 136981829304960 train.py:148] epoch:  8, train_loss: 0.0134, train_accuracy: 99.58, test_loss: 0.0260, test_accuracy: 99.16
(tpuvm_mnist, pid=2036) I1102 22:44:04.615064 136981829304960 train.py:148] epoch:  9, train_loss: 0.0117, train_accuracy: 99.65, test_loss: 0.0248, test_accuracy: 99.21
(tpuvm_mnist, pid=2036) I1102 22:44:06.100036 136981829304960 train.py:148] epoch: 10, train_loss: 0.0086, train_accuracy: 99.75, test_loss: 0.0268, test_accuracy: 99.14
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:		sky cancel gke-tpu-4 1
├── To stream job logs:		sky logs gke-tpu-4 1
└── To view job queue:		sky queue gke-tpu-4

Cluster name: gke-tpu-4
├── To log into the head VM:	ssh gke-tpu-4
├── To submit a job:		sky exec gke-tpu-4 yaml_file
├── To stop the cluster:	sky stop gke-tpu-4
└── To teardown the cluster:	sky down gke-tpu-4
Tip: `sky down` will delete launched TPU(s) too.

Also, the job does not detect 4 TPU chips when a pod with 1 TPU chip is provisioned:

Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> print(jax.devices());
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0)]
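
As a quick sanity check, the requested chip count can be asserted from inside the job; this is a small hedged snippet that just uses the same jax.devices() call as the REPL output above, with the expected count filled in by the user.

# Sanity check: fail fast if the job sees more (or fewer) TPU chips than requested.
import jax

expected_chips = 1  # set to the count requested via --gpus tpu-v5-lite-podslice:<n>
devices = jax.devices()
print(f'Visible TPU devices: {devices}')
assert len(devices) == expected_chips, (
    f'Expected {expected_chips} TPU chip(s), but JAX sees {len(devices)}.')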

@landscapepainter (Collaborator, Author) commented Nov 3, 2024

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

I just tested this out again by specifying --cloud aws, and it's exactly the same as the Kubernetes error displayed on your end. We get the error you see below,

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

because SkyPilot knows that tpu-v5-lite-podslice is only available on Kubernetes, unlike A100.

So the fuzzy error message is supposed to appear only when the cloud is not specified and the accelerator is available in multiple clouds. I guess there isn't anything to file an issue for?
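
To summarize the behavior described above as a hypothetical sketch (the clouds_offering helper is made up; this is not SkyPilot's actual failover or error-reporting code):

# Fuzzy candidates are only worth showing when the user did not pin a cloud
# and the accelerator is offered by more than one cloud; otherwise the plain
# "no resource satisfying ..." error already names the real reason.
from typing import List, Optional

def clouds_offering(acc_name: str) -> List[str]:
    # Placeholder catalog lookup.
    catalog = {'A100': ['gcp', 'aws', 'azure'], 'tpu-v5-lite-podslice': ['kubernetes']}
    return catalog.get(acc_name, [])

def should_show_fuzzy_candidates(requested_cloud: Optional[str], acc_name: str) -> bool:
    return requested_cloud is None and len(clouds_offering(acc_name)) > 1

print(should_show_fuzzy_candidates(None, 'A100'))                  # True
print(should_show_fuzzy_candidates('kubernetes', 'A100'))          # False
print(should_show_fuzzy_candidates(None, 'tpu-v5-lite-podslice'))  # False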

@landscapepainter (Collaborator, Author) commented:

@cblmemo @romilbhardwaj This is ready for another round. Thanks!!

@cblmemo (Collaborator) commented Nov 9, 2024

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

I just tested this out again by specifying --cloud aws, and it's exactly the same as the Kubernetes error displayed on your end. We get the error you see below,

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

because SkyPilot knows that tpu-v5-lite-podslice is only available on Kubernetes, unlike A100.

So the fuzzy error message is supposed to appear only when the cloud is not specified and the accelerator is available in multiple clouds. I guess there isn't anything to file an issue for?

I see. LGTM

@cblmemo (Collaborator) commented Nov 9, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g. the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the TPU count per host is)?

Also, the above example failed for me. Could you check this as well?

@cblmemo Not sure why that error was printing info regarding a TPU with 4 chips, but it's fixed at 688c0b4. sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml does not work due to the issue mentioned here, but specifying the cloud as Kubernetes now works: sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml:

$ sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml

...

(tpuvm_mnist, pid=2036) I1102 22:44:01.655345 136981829304960 train.py:148] epoch:  7, train_loss: 0.0167, train_accuracy: 99.47, test_loss: 0.0266, test_accuracy: 99.18
(tpuvm_mnist, pid=2036) I1102 22:44:03.135087 136981829304960 train.py:148] epoch:  8, train_loss: 0.0134, train_accuracy: 99.58, test_loss: 0.0260, test_accuracy: 99.16
(tpuvm_mnist, pid=2036) I1102 22:44:04.615064 136981829304960 train.py:148] epoch:  9, train_loss: 0.0117, train_accuracy: 99.65, test_loss: 0.0248, test_accuracy: 99.21
(tpuvm_mnist, pid=2036) I1102 22:44:06.100036 136981829304960 train.py:148] epoch: 10, train_loss: 0.0086, train_accuracy: 99.75, test_loss: 0.0268, test_accuracy: 99.14
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:		sky cancel gke-tpu-4 1
├── To stream job logs:		sky logs gke-tpu-4 1
└── To view job queue:		sky queue gke-tpu-4

Cluster name: gke-tpu-4
├── To log into the head VM:	ssh gke-tpu-4
├── To submit a job:		sky exec gke-tpu-4 yaml_file
├── To stop the cluster:	sky stop gke-tpu-4
└── To teardown the cluster:	sky down gke-tpu-4
Tip: `sky down` will delete launched TPU(s) too.

Also, the job does not detect 4 TPU chips when a pod with 1 TPU chip is provisioned:

Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> print(jax.devices());
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0)]

Oh sorry, I meant launching with --gpus tpu:1 on a cluster provisioned with --gpus tpu:4. IIRC, for GPU clusters we limit the number of GPUs visible to such a job to only 1, so PyTorch will only detect and use one GPU. I'm not sure if TPUs provide this type of isolation, but if not, maybe we should error out or at least print a warning when launching with fewer TPUs than the cluster has.

cc @romilbhardwaj for a look here
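
For reference, the GPU-side behavior mentioned above is commonly implemented by restricting device visibility per job through CUDA_VISIBLE_DEVICES. The sketch below only illustrates that idea (it is not SkyPilot's scheduler code, and whether libtpu honors an equivalent per-job chip mask is exactly the open question here):

# Sketch: expose only a subset of GPUs to a job by masking device visibility.
import os
import subprocess
from typing import List

def run_job_with_gpu_limit(cmd: str, gpu_ids: List[int]) -> int:
    env = os.environ.copy()
    # PyTorch/TensorFlow in the child process will only see these GPU ids.
    env['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)
    return subprocess.run(cmd, shell=True, env=env).returncode

# Example: the job sees only GPU 0 even if the node has 4 GPUs.
# run_job_with_gpu_limit('python train.py', [0])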

@cblmemo (Collaborator) commented Nov 9, 2024

To reproduce:

$ sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v5-lite-podslice:4
Task from YAML spec: examples/tpu/tpuvm_mnist.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                           vCPUs   Mem(GB)   ACCELERATORS             REGION/ZONE                                         COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--4tpu-v5-lite-podslice   2       8         tpu-v5-lite-podslice:4   gke_skypilot-375900_us-south1-a_mix-tpu-test-txia   0.00          ✔     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-3ccc-txia'. Proceed? [Y/n]: 
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia.  View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/provision.log
⚙︎ Running setup on 1 pod.
✓ Setup completed.  View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
✓ Job finished (status: SUCCEEDED).

Job ID: 1
📋 Useful Commands
├── To cancel the job:          sky cancel sky-3ccc-txia 1
├── To stream job logs:         sky logs sky-3ccc-txia 1
└── To view job queue:          sky queue sky-3ccc-txia

Cluster name: sky-3ccc-txia
├── To log into the head VM:    ssh sky-3ccc-txia
├── To submit a job:            sky exec sky-3ccc-txia yaml_file
├── To stop the cluster:        sky stop sky-3ccc-txia
└── To teardown the cluster:    sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.

$ sky launch --gpus tpu-v5-lite-podslice:2 -c sky-3ccc-txia 'conda activate flax; python -c "import jax; print(jax.devices())"'
Task from command: conda activate flax; python -c "import jax; print(jax.devices())"
Running task on cluster sky-3ccc-txia...
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia.  View logs at: ~/sky_logs/sky-2024-11-09-14-05-11-848087/provision.log
⚙︎ Job submitted, ID: 3
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=34717) [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
✓ Job finished (status: SUCCEEDED).

Job ID: 3
📋 Useful Commands
├── To cancel the job:          sky cancel sky-3ccc-txia 3
├── To stream job logs:         sky logs sky-3ccc-txia 3
└── To view job queue:          sky queue sky-3ccc-txia

Cluster name: sky-3ccc-txia
├── To log into the head VM:    ssh sky-3ccc-txia
├── To submit a job:            sky exec sky-3ccc-txia yaml_file
├── To stop the cluster:        sky stop sky-3ccc-txia
└── To teardown the cluster:    sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.

@cblmemo (Collaborator) left a comment

Thanks @landscapepainter! Tested it again and it works smoothly. Once the above-mentioned issue is resolved, it should be ready to go!

@romilbhardwaj (Collaborator) commented:

Oh sorry, I meant launching with --gpus tpu:1 on a cluster provisioned with --gpus tpu:4. IIRC, for GPU clusters we limit the number of GPUs visible to such a job to only 1, so PyTorch will only detect and use one GPU. I'm not sure if TPUs provide this type of isolation, but if not, maybe we should error out or at least print a warning when launching with fewer TPUs than the cluster has.

I think our current cloud TPUs also behave the same way, so we allow --gpus tpu:1 on a cluster provisioned with --gpus tpu:4. We can maybe leave a TODO somewhere in the code to keep track.
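
Such a TODO could eventually become a small check like the following (a hedged sketch only; the function and where it would be called from are hypothetical, not existing SkyPilot code):

# Hypothetical warning when a job requests fewer TPU chips than the
# single-host podslice exposes, since no per-job chip isolation is assumed.
import warnings

def maybe_warn_partial_tpu_request(requested_chips: int, chips_per_host: int) -> None:
    if requested_chips < chips_per_host:
        warnings.warn(
            f'Requested {requested_chips} TPU chip(s), but the podslice host '
            f'exposes {chips_per_host}; the job may still see all '
            f'{chips_per_host} chips (e.g. via jax.devices()).')

maybe_warn_partial_tpu_request(1, 4)  # emits the warning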

@landscapepainter added this pull request to the merge queue on Nov 13, 2024
Merged via the queue into skypilot-org:master with commit eea13cc Nov 13, 2024
20 checks passed
@landscapepainter deleted the k8s-tpu-support-on-gke branch on November 13, 2024 04:29