
[Spot] Show logging from the controller and grace period for cluster status checking #1951

Merged
merged 12 commits into master from fix-controller-logging on May 12, 2023

Conversation

@Michaelvll (Collaborator) commented on May 11, 2023

Previously, logs from the controller process were left out of the spot jobs' output, which made debugging harder. This PR fixes the logging from the spot controller.

Another change: add a grace period between the job status check and the cluster status check, making the labeling of a job failure more conservative. This avoids the case where the job fails due to preemption but the cluster's status on the cloud has not yet been set to a non-UP state.
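A minimal sketch of the grace-period idea, with hypothetical names (the constant, its value, and the injected status-query callable are illustrative, not this PR's actual code):

import time

# Hypothetical constant; the real name and value in sky/spot may differ.
_CLUSTER_STATUS_CHECK_GRACE_PERIOD_SECONDS = 30

def job_failed_due_to_user_code(job_failed_at: float,
                                query_cluster_status) -> bool:
    """Return True only if the failure looks like a real user-code failure.

    Wait out a grace period after the job status check so that a preemption
    has time to be reflected in the cluster status on the cloud, then check
    that status before labeling the job as FAILED.
    """
    remaining = _CLUSTER_STATUS_CHECK_GRACE_PERIOD_SECONDS - (
        time.time() - job_failed_at)
    if remaining > 0:
        time.sleep(remaining)
    # If the cluster is no longer UP, the failure is likely a preemption,
    # so the job should be recovered rather than marked FAILED.
    return query_cluster_status() == 'UP'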

Tested (run the relevant ones):

  • All smoke tests: pytest tests/test_smoke.py --managed-spot

@Michaelvll changed the title from "[Spot] Show logging from the controller" to "[Spot] Show logging from the controller and grace period for cluster status checking" on May 11, 2023
@concretevitamin (Member) left a comment:

LGTM. Main question: why did multiple processes invoking controller.py non-deterministically cause problems in logging? Note that the multiple SpotController processes also call into, e.g.,

def set_starting(job_id: int):

which uses the module-level logger as well. Before this PR, did that logging non-deterministically disappear as well?
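For context, the pattern under discussion is roughly a module-level logger shared by functions such as set_starting. A simplified sketch (the real code uses sky_logging.init_logger; plain logging is used here only to keep the sketch self-contained):

import logging

# Module-level logger, created once at import time; each process that
# imports this module ends up with its own copy of this object.
logger = logging.getLogger(__name__)

def set_starting(job_id: int) -> None:
    # Messages go through the module-level logger, so if its handlers are
    # misconfigured in a child process, they can silently disappear.
    logger.info('Job %d is STARTING.', job_id)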

sky/spot/controller.py (outdated review thread, resolved):
# console.
# Create a logger for this process.
self.logger = sky_logging.init_logger(
    f'{logger.name}.controller_process')
@concretevitamin (Member) commented:

Suggested change:
-    f'{logger.name}.controller_process')
+    f'{logger.name}.SpotController')

Nit: to avoid confusion with a potential nested module called controller_process. This naming convention suggests that SpotController is not a module.
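For reference, dotted logger names form a parent/child hierarchy in Python's logging module, so a lowercase suffix reads like a nested module; a quick standard-library illustration:

import logging

parent = logging.getLogger('sky.spot.controller')
child = logging.getLogger('sky.spot.controller.SpotController')

# The child propagates records to its parent's handlers by default; the
# CamelCase suffix signals a class-like name rather than a submodule.
assert child.parent is parent
assert child.propagate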

@Michaelvll (Collaborator, Author) replied:

Ahh, after trying a bit more, it seems the problem is not related to this multiprocessing. I reverted to the original one, but only kept the __name__ change for getting the global logger.
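In other words, the retained change is to derive the module-level logger name from __name__ rather than a hand-written string (illustrative; the exact before/after lines are not shown in this thread):

from sky import sky_logging

# __name__ resolves to 'sky.spot.controller' when the module is imported
# normally, so the logger name stays in sync with the module path instead
# of depending on a hand-written name that could drift.
logger = sky_logging.init_logger(__name__)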

sky/spot/controller.py (outdated review thread, resolved)
@concretevitamin (Member) commented:

Thanks for fixing this @Michaelvll! I tried this out and controller.py messages now appear. This leads to some UX observations:

(sky-bbb8-zongheng, pid=6947) I 05-12 01:19:01 controller.py:64] Submitted spot job; SKYPILOT_JOB_ID: sky-2023-05-12-01-19-01-461464_spot_id-9
(sky-bbb8-zongheng, pid=6947) I 05-12 01:19:01 controller.py:92] Started monitoring spot task sky-bbb8-zongheng (id: 9)
...
(sky-bbb8-zongheng, pid=6947) I 05-12 01:21:50 controller.py:322] Killing controller process 7046
(sky-bbb8-zongheng, pid=6947) I 05-12 01:21:50 controller.py:330] Controller process 7046 killed.
(sky-bbb8-zongheng, pid=6947) I 05-12 01:21:50 controller.py:332] Cleaning up spot clusters of job 9.
(sky-bbb8-zongheng, pid=6947) I 05-12 01:22:34 controller.py:341] Spot clusters of job 9 has been taken down.
  • Started monitoring spot task sky-bbb8-zongheng (id: 9) -> Started monitoring spot job 9, name: <name> ("job" is mentioned in previous line)
  • Killing controller process 7046 -> add period
  • Last 2 lines: "spot clusters" -> "spot cluster"

@Michaelvll merged commit f32be46 into master on May 12, 2023
@Michaelvll deleted the fix-controller-logging branch on May 12, 2023 at 03:54