[UX] Catch any exception for the spot queue fetching failure #1757

Merged: 4 commits, Mar 10, 2023

Conversation

Collaborator

@Michaelvll Michaelvll commented Mar 10, 2023

This fixes the case where sky status hits an unexpected exception while querying spot jobs. It can happen when the currently active account differs from the one that owns the spot controller.

Previous:

Managed spot jobs
multiprocessing.pool.RemoteTraceback: 
"""
sky.exceptions.ClusterOwnerIdentityMismatchError: 'sky-spot-controller-9ce1ce58' (GCP) is owned by account 'zhwu@berkeley.edu\n [project_id=skypilot-375900]', but the activated account is 'zhwu@berkeley.edu\n [project_id=skypilot-managed-spot]'.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zhwu/miniconda3/envs/sky-dev/bin/sky", line 33, in <module>
    sys.exit(load_entry_point('skypilot', 'console_scripts', 'sky')())
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/cli.py", line 1041, in invoke
    return super().invoke(ctx)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/cli.py", line 1595, in status
    num_in_progress_jobs, msg = spot_jobs_future.get()
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
sky.exceptions.ClusterOwnerIdentityMismatchError: 'sky-spot-controller-9ce1ce58' (GCP) is owned by account 'zhwu@berkeley.edu\n [project_id=skypilot-375900]', but the activated account is 'zhwu@berkeley.edu\n [project_id=skypilot-managed-spot]'.

Now:


Managed spot jobs
Failed to query spot jobs: [sky.exceptions.ClusterOwnerIdentityMismatchError]: 'sky-spot-controller-9ce1ce58' (GCP) is owned by account 'zhwu@berkeley.edu\n [project_id=skypilot-375900]', but the activated account is 'zhwu@berkeley.edu\n [project_id=skypilot-managed-spot]'.

* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh
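
The change in behavior above can be illustrated with a minimal sketch. This is not SkyPilot's actual code: `OwnerMismatchError`, `query_spot_jobs`, and `get_spot_jobs_summary` are hypothetical names, and a `ThreadPool` stands in for the process pool seen in the traceback to keep the example self-contained. The idea is only that any exception raised by the asynchronous query is turned into a one-line message instead of propagating out of `future.get()`.

```python
import multiprocessing.pool
from typing import Tuple


class OwnerMismatchError(Exception):
    """Hypothetical stand-in for ClusterOwnerIdentityMismatchError."""


def query_spot_jobs() -> Tuple[int, str]:
    # Hypothetical worker: simulates the controller query failing.
    raise OwnerMismatchError('controller is owned by a different account')


def get_spot_jobs_summary(future) -> Tuple[int, str]:
    """Return (num_in_progress_jobs, msg) without re-raising worker errors."""
    try:
        return future.get()
    except Exception as e:  # Deliberately broad: any failure becomes a message.
        name = f'{type(e).__module__}.{type(e).__name__}'
        return -1, f'Failed to query spot jobs: [{name}]: {e}'


if __name__ == '__main__':
    with multiprocessing.pool.ThreadPool(processes=1) as pool:
        future = pool.apply_async(query_spot_jobs)
        num_jobs, msg = get_spot_jobs_summary(future)
    print('Managed spot jobs')
    print(msg)
```

Returning a sentinel count of -1 together with the message lets the caller keep printing the rest of the status output, as in the "Now" transcript.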

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review March 10, 2023 05:49
Member

@concretevitamin concretevitamin left a comment

Should we catch KeyboardInterrupt as well? To repro: run status, and immediately ctrl-c (may need to try a few times with different timing). On master it'd show a long stacktrace then KeyboardInterrupt.

@Michaelvll
Collaborator Author

> Should we catch KeyboardInterrupt as well? To repro: run status, and immediately ctrl-c (may need to try a few times with different timing). On master it'd show a long stacktrace then KeyboardInterrupt.

Good point! I fixed the KeyboardInterrupt handling for the multiprocessing part, but will leave KeyboardInterrupt handling in other parts of the code to a future PR, since we may want to use signal handling instead to make it global.
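
A rough sketch of the multiprocessing-side handling described above, under stated assumptions: `wait_for_spot_jobs` is a hypothetical helper, and calling `pool.terminate()` on interrupt is an assumption for illustration, not something confirmed by this PR. The fallback values (-1 and 'KeyboardInterrupt') follow the snippet discussed later in this review. Note that KeyboardInterrupt derives from BaseException, so a plain `except Exception` clause does not catch it.

```python
import multiprocessing.pool
from typing import Tuple


def wait_for_spot_jobs(pool, future) -> Tuple[int, str]:
    """Wait for the spot job query; fall back gracefully on Ctrl-C."""
    try:
        return future.get()
    except KeyboardInterrupt:
        # Not a subclass of Exception, so it needs its own clause.
        # Assumption: stop the workers so they do not print their own traces.
        pool.terminate()
        return -1, 'KeyboardInterrupt'
    except Exception as e:  # Any other failure becomes a short message.
        return -1, f'Failed to query spot jobs: {e}'


if __name__ == '__main__':
    with multiprocessing.pool.ThreadPool(processes=1) as pool:
        future = pool.apply_async(lambda: (0, 'No in-progress jobs.'))
        print(wait_for_spot_jobs(pool, future))
```

This only covers an interrupt that arrives while the main process is blocked in `future.get()`. Interrupts delivered elsewhere, or to the worker processes themselves, would still need the global handling mentioned in the comment.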

Another fix added: our previous use of threading.local was problematic. The attribute was assigned only once, when the module was imported in the main thread, so reading it from any other thread raised AttributeError.
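
The threading.local pitfall can be reproduced with a small standalone sketch. The names `_thread_local_config` and `use_default_catalog` come from the snippet in this PR; `get_unsafe`, `get_safe`, and `run_in_thread` are hypothetical helpers added for the demonstration.

```python
import threading

_thread_local_config = threading.local()
# Runs once, in the importing thread only. Other threads never see it.
_thread_local_config.use_default_catalog = False


def get_unsafe() -> bool:
    # Raises AttributeError in any thread that never assigned the attribute.
    return _thread_local_config.use_default_catalog


def get_safe() -> bool:
    # Lazily initialize the attribute in whichever thread calls this.
    if not hasattr(_thread_local_config, 'use_default_catalog'):
        _thread_local_config.use_default_catalog = False
    return _thread_local_config.use_default_catalog


def run_in_thread(fn) -> dict:
    """Run fn in a fresh thread and capture its result or AttributeError."""
    result = {}

    def target():
        try:
            result['value'] = fn()
        except AttributeError as e:
            result['error'] = e

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return result


if __name__ == '__main__':
    print(run_in_thread(get_unsafe))  # Contains an 'error' entry.
    print(run_in_thread(get_safe))    # Contains a 'value' entry.
```

Each thread gets its own empty namespace on a threading.local object, which is why the default has to be set inside the accessor rather than at module level.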

def get_use_default_catalog() -> bool:
    if not hasattr(_thread_local_config, 'use_default_catalog'):
        _thread_local_config.use_default_catalog = False
Member

Maybe copy the other file's comments here too.

Collaborator Author

Good point! Added. Thanks!

# down, and the hint for showing sky spot queue
# will still be shown.
num_in_progress_jobs = -1
msg = 'KeyboardInterrupt'
Member

I still see the stacktrace, and a duplicate line showing KeyboardInterrupt. It may be fine to leave this fix to a future PR if it is too difficult to handle.

[Image: Screen Shot 2023-03-10 at 11 18 48]

Collaborator Author

Weirdly, I cannot reproduce the problem shown in the figure, but the keyboard interruption handling here is indeed a bit tricky as it involves multiple processes. Let's leave it for a future PR. : )

@Michaelvll Michaelvll merged commit 35d017d into master Mar 10, 2023
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 14, 2023
…t-org#1757)

* Catch any exception for the spot queue fetching failure.

* Fix keyboard interrupt

* lint

* Add comment
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023