Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AWS] Fix check/status failure when no permission is granted for the account #2415

Merged
merged 9 commits into from
Aug 21, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Aug 17, 2023

Fixes #2414

This PR fixes the following:

  1. The read-only operators (sky show-gpus/sky status) should not require the actual az mapping.
  2. The sky check should disable AWS if the permission is not enough for fetching the availability zone mappings

Future TODO: #2416

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    1. setup an aws account with no permission granted, and have an AWS cluster in the cluster table.
    2. sky status
    3. sky check
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review August 17, 2023 17:23
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, quick comments.

sky/resources.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/config.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/aws_catalog.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/config.py Outdated Show resolved Hide resolved
sky/resources.py Outdated
@@ -172,7 +177,11 @@ def __init__(
self._try_validate_disk_tier()
self._try_validate_ports()

@service_catalog.use_default_catalog_if_failed
# When querying the accelerators for the instance type, we will check the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not following. Which callsite inside repr causes this? I only see some {accelerators} formatting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.accelerators is a @property which will call into the self.cloud.get_accelerators_from_instance_type and invoke service_catalog.get_accelerators_from_instance_type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, tried to incorporate this in.

sky/resources.py Outdated
@@ -305,6 +314,7 @@ def memory(self) -> Optional[str]:
return self._memory

@property
@functools.lru_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to reduce the frequency we call into the service_catalog underneath to reduce the overhead, i.e., searching the df for the accelerators. Since the resources object is immutable, caching this function should be safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is the overhead noticeable? Worth documenting if so.
  • Shall we add max_size=1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overhead is not significant, but it seems no harm to cache the output. The following commands take 3.3 seconds in master branch, but 0.008 seconds in the current PR.

python -u <<EOF                            
import sky
import time
r = sky.Resources(instance_type='p3.2xlarge')
total_time = 0
for i in range(1000): 
  start = time.time()
  str(r)
  total_time += time.time() - start
print(total_time)

EOF

Michaelvll and others added 2 commits August 17, 2023 12:13
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll. LGTM.

sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated
@@ -172,7 +177,11 @@ def __init__(
self._try_validate_disk_tier()
self._try_validate_ports()

@service_catalog.use_default_catalog_if_failed
# When querying the accelerators for the instance type, we will check the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, tried to incorporate this in.

sky/resources.py Outdated
@@ -305,6 +314,7 @@ def memory(self) -> Optional[str]:
return self._memory

@property
@functools.lru_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is the overhead noticeable? Worth documenting if so.
  • Shall we add max_size=1?

Michaelvll and others added 3 commits August 21, 2023 13:09
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@Michaelvll Michaelvll merged commit 0249308 into master Aug 21, 2023
17 checks passed
@Michaelvll Michaelvll deleted the aws-check-fail-when-no-permission branch August 21, 2023 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skypilot still attempts to connect to all cloud even when explicitly given one particular cloud provider
2 participants