[Spot] Show logging from the controller and grace period for cluster status checking #1951
Conversation
LGTM. Main question is: why did multiple processes invoking controller.py non-deterministically cause problems in logging? Note that the multiple SpotController processes also call into, e.g., skypilot/sky/spot/spot_state.py (line 213 in bc67043):
def set_starting(job_id: int):
which uses the module-level logger as well. Before this PR, did that logging non-deterministically disappear as well?
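For context, the pattern in question looks roughly like the following. This is a minimal stand-in sketch, assuming `sky_logging.init_logger` behaves essentially like a named `logging.getLogger` with a stream handler attached; the `set_starting` body is illustrative, not the actual implementation.

```python
import logging


def init_logger(name: str) -> logging.Logger:
    """Stand-in for sky_logging.init_logger (assumed behavior)."""
    log = logging.getLogger(name)
    if not log.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(name)s: %(message)s'))
        log.addHandler(handler)
        log.setLevel(logging.INFO)
    return log


# Module-level logger, created once at import time and shared by every
# process that imports this module (e.g. each SpotController process).
logger = init_logger(__name__)


def set_starting(job_id: int) -> None:
    # Uses the module-level logger, as in sky/spot/spot_state.py.
    logger.info('Job %d is starting.', job_id)
```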
sky/spot/controller.py (Outdated)
# console.
# Create a logger for this process
self.logger = sky_logging.init_logger(
    f'{logger.name}.controller_process')
Suggested change:
-    f'{logger.name}.controller_process')
+    f'{logger.name}.SpotController')
Nit: to avoid confusion with a potential nested module called controller_process. This naming convention suggests that SpotController is not a module.
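To illustrate the naming concern (the logger names below are hypothetical): lowercase, dotted segments usually mirror module paths, so a CamelCase leaf reads as a class rather than a nested module.

```python
import logging

# The module-level logger a file like sky/spot/controller.py would hold.
parent = logging.getLogger('sky.spot.controller')

# A lowercase leaf reads like a nested module path (the confusion the
# nit is about)...
module_style = logging.getLogger('sky.spot.controller.controller_process')

# ...while a CamelCase leaf signals a class-scoped logger, not a module.
class_style = logging.getLogger('sky.spot.controller.SpotController')

# Both are ordinary child loggers and propagate to the parent's handlers.
assert module_style.parent is parent
assert class_style.parent is parent
```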
Ahh, after trying a bit more, it seems the problem is not related to this multiprocessing. I reverted to the original one, but only kept the __name__ change for getting the global logger.
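A minimal sketch of the shape described above (assumed, not the exact diff): the controller module keeps a single module-level logger keyed on `__name__`, and the per-process child logger is dropped.

```python
# sky/spot/controller.py (sketch; surrounding code omitted)
from sky import sky_logging

# Kept change: derive the global logger name from __name__ so records from
# this module show up under the module's own name (e.g. 'sky.spot.controller').
logger = sky_logging.init_logger(__name__)

# Reverted change (no longer needed): a per-process child logger such as
#   self.logger = sky_logging.init_logger(f'{logger.name}.controller_process')
```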
Thanks for fixing this @Michaelvll! I tried this out and controller.py messages now appear. This leads to some UX observations:
Previously, the log from the controller process was left out of the spot jobs' output, which made debugging harder. This PR fixes the logging from the spot controller.
Another change: add a grace period between the job status check and the cluster status check, to make the annotation of job failures more conservative. This avoids the case where the job fails due to preemption but the cluster's status on the cloud has not yet been set to a non-UP status.
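To make the grace-period idea concrete, here is a simplified sketch; the constant, helper names, and status strings are illustrative and not the PR's actual implementation.

```python
import time

# Assumed value; the real grace period lives in the spot controller code.
_CLUSTER_STATUS_GRACE_PERIOD_SECONDS = 30


def classify_job_failure(job_failed: bool, get_cluster_status) -> str:
    """Decide whether a failed job was preempted or genuinely failed."""
    if not job_failed:
        return 'RUNNING'
    # Grace period: the cloud may not have marked a preempted cluster as
    # non-UP yet, so wait before trusting the cluster status check.
    time.sleep(_CLUSTER_STATUS_GRACE_PERIOD_SECONDS)
    cluster_status = get_cluster_status()  # e.g. 'UP', 'STOPPED', None
    if cluster_status != 'UP':
        # Cluster is gone or stopping: treat the failure as a preemption
        # and let the controller recover the job instead of failing it.
        return 'PREEMPTED'
    return 'USER_FAILURE'
```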
Tested (run the relevant ones):
pytest tests/test_smoke.py --managed-spot