[k8s] On-demand single-host TPU support on GKE #3947
Conversation
…com/gpu and google.com/tpu
@cblmemo Seems like this is a consistent behavior for
@cblmemo
Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)
(sky-serve) ➜ skypilot git:(new_provision_api) ✗ sky launch @temp/a.yaml --instance-type n2-standard-8
Task from YAML spec: @temp/a.yaml
ValueError: Invalid instance type 'n2-standard-8' for cloud AWS.
(sky-serve) ➜ skypilot git:(new_provision_api) ✗ cat @temp/a.yaml
resources:
  cloud: aws
Shouldn't we at least show this information in the error message? The current error is a bit confusing to me. Also, the current conflict is between two auto-filled clouds. If the user explicitly sets the cloud in the YAML and that causes a conflict, that sounds reasonable to me. But I would be surprised if I didn't set the cloud and two of SkyPilot's auto-inferred clouds conflicted.
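When the conflict comes from an explicitly set cloud, one workaround is to make the YAML and the CLI agree. A minimal sketch (hypothetical file path; `n2-standard-8` is a GCP machine type, so GCP is pinned here):

```shell
# Hypothetical workaround: pin the cloud explicitly in the YAML so the
# auto-inferred cloud cannot conflict with --instance-type.
cat > /tmp/a.yaml <<'EOF'
resources:
  cloud: gcp                    # explicit, so no inference conflict
  instance_type: n2-standard-8
EOF
cat /tmp/a.yaml
```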
@cblmemo Not sure why that error was printing out info regarding a TPU with 4 chips, but it's fixed in 688c0b4.
Also, it does not detect TPU with 4 chips when a pod with 1 TPU chip is provisioned:
I just tested this out again by specifying , because SkyPilot knows that . So the fuzzy error message is supposed to appear only when the cloud is not specified but the instance type is available in multiple clouds. I guess there isn't anything to file an issue to fix?
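The disambiguation rule described here can be sketched as follows. The catalog mapping and function below are invented for illustration and are not SkyPilot's actual code:

```python
# Hypothetical sketch of the rule discussed above: error out only when no
# cloud is given AND the instance type matches more than one cloud.
CATALOG = {  # invented mapping, NOT SkyPilot's real catalog
    "n2-standard-8": {"gcp"},
    "shared-type": {"aws", "gcp"},  # hypothetical: offered by two clouds
}

def resolve_cloud(instance_type, cloud=None):
    """Resolve which cloud an instance type belongs to."""
    matches = CATALOG.get(instance_type, set())
    if cloud is not None:
        if cloud not in matches:
            raise ValueError(
                f"Invalid instance type {instance_type!r} for cloud {cloud}.")
        return cloud
    if len(matches) == 1:
        return next(iter(matches))
    raise ValueError(
        f"Instance type {instance_type!r} is unknown or available in "
        f"multiple clouds; please specify the cloud explicitly.")
```

Under this rule, `--instance-type n2-standard-8` with `cloud: aws` in the YAML would fail loudly, while the same flag with no cloud set resolves unambiguously to GCP.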
@cblmemo @romilbhardwaj This is ready for another round. Thanks!!
I see. LGTM
Oh sorry I mean launch with cc @romilbhardwaj for a look here |
To reproduce: $ sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v5-lite-podslice:4
Task from YAML spec: examples/tpu/tpuvm_mnist.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Kubernetes 2CPU--8GB--4tpu-v5-lite-podslice 2 8 tpu-v5-lite-podslice:4 gke_skypilot-375900_us-south1-a_mix-tpu-test-txia 0.00 ✔
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-3ccc-txia'. Proceed? [Y/n]:
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia. View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/provision.log
⚙︎ Running setup on 1 pod.
✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
✓ Job finished (status: SUCCEEDED).
Job ID: 1
📋 Useful Commands
├── To cancel the job: sky cancel sky-3ccc-txia 1
├── To stream job logs: sky logs sky-3ccc-txia 1
└── To view job queue: sky queue sky-3ccc-txia
Cluster name: sky-3ccc-txia
├── To log into the head VM: ssh sky-3ccc-txia
├── To submit a job: sky exec sky-3ccc-txia yaml_file
├── To stop the cluster: sky stop sky-3ccc-txia
└── To teardown the cluster: sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.
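The Kubernetes instance-type string in the optimizer table above (e.g. `2CPU--8GB--4tpu-v5-lite-podslice`) appears to encode CPUs, memory, and the accelerator. A rough parser, with the format inferred from this log output rather than taken from SkyPilot's source:

```python
import re

def parse_k8s_instance_type(name):
    """Parse a string like '2CPU--8GB--4tpu-v5-lite-podslice'.

    Format assumed from the log above; the accelerator suffix is optional
    (e.g. a plain '2CPU--8GB' pod has no accelerator).
    """
    m = re.fullmatch(r"(\d+)CPU--(\d+)GB(?:--(\d+)(.+))?", name)
    if m is None:
        raise ValueError(f"Unrecognized instance type: {name!r}")
    cpus, mem_gb, acc_count, acc_type = m.groups()
    return {
        "cpus": int(cpus),
        "mem_gb": int(mem_gb),
        "acc_type": acc_type,
        "acc_count": int(acc_count) if acc_count else 0,
    }
```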
$ sky launch --gpus tpu-v5-lite-podslice:2 -c sky-3ccc-txia 'conda activate flax; python -c "import jax; print(jax.devices())"'
Task from command: conda activate flax; python -c "import jax; print(jax.devices())"
Running task on cluster sky-3ccc-txia...
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia. View logs at: ~/sky_logs/sky-2024-11-09-14-05-11-848087/provision.log
⚙︎ Job submitted, ID: 3
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=34717) [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
✓ Job finished (status: SUCCEEDED).
Job ID: 3
📋 Useful Commands
├── To cancel the job: sky cancel sky-3ccc-txia 3
├── To stream job logs: sky logs sky-3ccc-txia 3
└── To view job queue: sky queue sky-3ccc-txia
Cluster name: sky-3ccc-txia
├── To log into the head VM: ssh sky-3ccc-txia
├── To submit a job: sky exec sky-3ccc-txia yaml_file
├── To stop the cluster: sky stop sky-3ccc-txia
└── To teardown the cluster: sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.
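As a sanity check, the device list printed above can be parsed to confirm that all four chips of the `tpu-v5-lite-podslice:4` slice are visible to JAX (pure string parsing of the log line, so no TPU is needed to run this):

```python
import re

# The device list exactly as printed by the job above.
log_line = (
    "[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), "
    "TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), "
    "TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), "
    "TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]")

# Extract the chip IDs; all 4 chips of the slice should appear.
chip_ids = [int(i) for i in re.findall(r"TpuDevice\(id=(\d+)", log_line)]
```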
Thanks @landscapepainter ! Tested it again and it works smoothly. Once the above-mentioned issue is resolved, it should be ready to go!
I think our current cloud TPUs also behave in the same way, so allowing
One of our users requested a feature to use spot TPUs on GKE. This is an initial step toward that request, adding support for on-demand single-host TPUs.
This PR does not include support for:
Tested (run the relevant ones):
bash format.sh
sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice --num-nodes 2 -y
sky launch --cloud kubernetes --cpus=2 -y
sky show-gpus --cloud kubernetes
sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice:4 -y
pytest tests/test_smoke.py
besides the ones also failing on the master branch: pytest tests/test_smoke.py::test_fill_in_the_name
pytest tests/test_smoke.py::test_tpu_pod_slice_gke --kubernetes
conda deactivate; bash -i tests/backward_compatibility_tests.sh
tpu_gke.yaml: