Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ResourceHandle semantics #1481

Merged
merged 11 commits into from
Dec 1, 2022
Merged

Fix ResourceHandle semantics #1481

merged 11 commits into from
Dec 1, 2022

Conversation

iojw
Copy link
Collaborator

@iojw iojw commented Dec 1, 2022

This fixes the following issues with the new ResourceHandle introduced in #1380.

  • For clusters from the older version that are in INIT status, sky status -r will not update them to the latest version, causing the IPs for these clusters to be queried on each call which is slow. This PR updates it such that these clusters will be updated to the latest version on refresh.
  • [TPU] Can't launch a TPU Pod #1480: fixes the handling of TPU pods
  • The external_ips and internal_ips methods should always return a list as that is what is expected by the rest of the code

Tested:

  • test_tpu_vm_pod

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @iojw! I am testing this with the TPU pod now.

Let's remove the following lines to make sure the TPU pod tests will be run in future smoke tests. : )

# Mark slow because it's expensive to run.
@pytest.mark.slow

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator

Tested:

  • ./tests/run_smoke_tests.sh test_tpu_vm_pod

@iojw iojw requested a review from Michaelvll December 1, 2022 08:13
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the quick fix @iojw!

@infwinston
Copy link
Member

For smoke tests, let's try to use spot TPU pod to save some costs?

sky launch -y -c {name} examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 --use-spot --zone europe-west4-a

as we have some free quota in europe-west4-a.
https://github.com/skypilot-org/skypilot/pull/1481/files#diff-da9ef240243bf38ab4e33d1f8a2a64caee7677dc780e58557f3c676687a4b334R531

@iojw
Copy link
Collaborator Author

iojw commented Dec 1, 2022

@infwinston Updated!

@iojw iojw merged commit 1955bee into master Dec 1, 2022
@iojw iojw deleted the worker-ips-fix branch December 1, 2022 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants