Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCP][Disk] Add supported instance type check for disk tier ultra #3935

Merged
merged 4 commits into from
Sep 13, 2024

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Sep 11, 2024

https://cloud.google.com/compute/docs/disks/extreme-persistent-disk#machine_shape_support

On GCP, disk tier ultra only supports handful of instance type. This PR adds the check and fixes #3933.

From the documentation, it seems like we could still attach the pd-extreme disk to some instance type but will get a degraded performance. We should benchmark and see how severe the degradation is, as the instance type that officially supports the ultra tier is too small (no GPU instance types is officially supported).

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud gcp --disk-tier best --gpus L4
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Comment on lines +940 to +943
# Ultra disk tier (pd-extreme) only support m2, m3 and part of n2
# instance types, so we failover to lower tiers for other instance
# types. Reference:
# https://cloud.google.com/compute/docs/disks/extreme-persistent-disk#machine_shape_support # pylint: disable=line-too-long
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we cannot use ultra disk tier with high end GPUs, such as A100, L4 or H100? In that case, this ultra disk sounds a bit useless here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.. This is a quick fix to unblock the blog post first. We should think of better solution here. cc @Conless here

sky/clouds/gcp.py Outdated Show resolved Hide resolved
return r.disk_tier
# Failover disk tier from high to low.
all_tiers = list(reversed(resources_utils.DiskTier))
start_index = all_tiers.index(GCP._translate_disk_tier(r.disk_tier))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this _translate_disk_tier do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert best and None to corresponding tier

@@ -916,6 +931,36 @@ def _check_instance_type_accelerators_combination(
resources.instance_type, resources.accelerators, resources.zone,
'gcp')

@classmethod
def check_disk_tier(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems not publicly exposed? Should we keep it private, or just merge into check_disk_tier_enabled?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Converted it into private. merge into check_disk_tier_enabled is a little bit tricky as we want a function to return a boolean value, instead of raising an error, which is the expected behaviour of check_disk_tier_enabled, across all clouds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I found a bug that requires to expose this function. Just changed back to public.

cblmemo and others added 3 commits September 12, 2024 15:28
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @cblmemo! I found the logging for sky launch --disk-tier ultra --cpus 2 --cloud gcp super confusing, as it does not actually show the reason why we cannot launch the instance on GCP.

sky launch --cpus 2 --cloud gcp --disk-tier ultra
I 09-13 01:05:04 optimizer.py:1301] No resource satisfying GCP(cpus=2, disk_tier=ultra) on GCP.
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: GCP(cpus=2, disk_tier=ultra).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Can we file an issue for that?

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 13, 2024

sky launch --cpus 2 --cloud gcp --disk-tier ultra
I 09-13 01:05:04 optimizer.py:1301] No resource satisfying GCP(cpus=2, disk_tier=ultra) on GCP.
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: GCP(cpus=2, disk_tier=ultra).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Done! #3943

@cblmemo cblmemo added this pull request to the merge queue Sep 13, 2024
Merged via the queue into master with commit 7940552 Sep 13, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-gcp-ultra branch September 13, 2024 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GCP] ultra disk fail to work with L4 GPUs
2 participants