[Spot] Show logging from the controller and grace period for cluster status checking #1951
Conversation
LGTM. Main question is: why did multiple processes invoking controller.py non-deterministically cause problems in logging? Note that the multiple SpotController processes also call into, e.g., skypilot/sky/spot/spot_state.py (line 213 in bc67043):
def set_starting(job_id: int):
which uses the module-level logger as well. Before this PR, did that logging non-deterministically disappear as well?
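For context, the pattern in question looks roughly like the following. This is a minimal stand-in sketch, assuming `sky_logging.init_logger` behaves essentially like a named `logging.getLogger` with a stream handler attached; the `set_starting` body is illustrative, not the actual implementation.

```python
import logging


def init_logger(name: str) -> logging.Logger:
    """Stand-in for sky_logging.init_logger (assumed behavior)."""
    log = logging.getLogger(name)
    if not log.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(name)s: %(message)s'))
        log.addHandler(handler)
        log.setLevel(logging.INFO)
    return log


# Module-level logger, created once at import time and shared by every
# process that imports this module (e.g. each SpotController process).
logger = init_logger(__name__)


def set_starting(job_id: int) -> None:
    # Uses the module-level logger, as in sky/spot/spot_state.py.
    logger.info('Job %d is starting.', job_id)
```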
sky/spot/controller.py (Outdated)
# console.
# Create a logger for this process
self.logger = sky_logging.init_logger(
    f'{logger.name}.controller_process')
Suggested change:
-    f'{logger.name}.controller_process')
+    f'{logger.name}.SpotController')
Nit: to avoid confusion with a potential nested module called controller_process. This naming convention suggests that SpotController is not a module.
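To illustrate the naming concern (the logger names below are hypothetical): lowercase, dotted segments usually mirror module paths, so a CamelCase leaf reads as a class rather than a nested module.

```python
import logging

# The module-level logger a file like sky/spot/controller.py would hold.
parent = logging.getLogger('sky.spot.controller')

# A lowercase leaf reads like a nested module path (the confusion the
# nit is about)...
module_style = logging.getLogger('sky.spot.controller.controller_process')

# ...while a CamelCase leaf signals a class-scoped logger, not a module.
class_style = logging.getLogger('sky.spot.controller.SpotController')

# Both are ordinary child loggers and propagate to the parent's handlers.
assert module_style.parent is parent
assert class_style.parent is parent
```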
Ahh, after trying a bit more, it seems the problem is not related to this multiprocessing. I reverted to the original one, but only kept the __name__ change for getting the global logger.
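A minimal sketch of the shape described above (assumed, not the exact diff): the controller module keeps a single module-level logger keyed on `__name__`, and the per-process child logger is dropped.

```python
# sky/spot/controller.py (sketch; surrounding code omitted)
from sky import sky_logging

# Kept change: derive the global logger name from __name__ so records from
# this module show up under the module's own name (e.g. 'sky.spot.controller').
logger = sky_logging.init_logger(__name__)

# Reverted change (no longer needed): a per-process child logger such as
#   self.logger = sky_logging.init_logger(f'{logger.name}.controller_process')
```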
Thanks for fixing this @Michaelvll! I tried this out and controller.py messages now appear. This leads to some UX observations:
Previously, the log from the controller process was left out of the spot jobs' output, which made debugging harder. This PR fixes the logging from the spot controller.
Another change: add a grace period between the job status check and the cluster status check, to make the annotation of job failures more conservative. This avoids the case where the job fails due to preemption but the cluster's status on the cloud has not yet been set to a non-UP status.
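To make the grace-period idea concrete, here is a simplified sketch; the constant, helper names, and status strings are illustrative and not the PR's actual implementation.

```python
import time

# Assumed value; the real grace period lives in the spot controller code.
_CLUSTER_STATUS_GRACE_PERIOD_SECONDS = 30


def classify_job_failure(job_failed: bool, get_cluster_status) -> str:
    """Decide whether a failed job was preempted or genuinely failed."""
    if not job_failed:
        return 'RUNNING'
    # Grace period: the cloud may not have marked a preempted cluster as
    # non-UP yet, so wait before trusting the cluster status check.
    time.sleep(_CLUSTER_STATUS_GRACE_PERIOD_SECONDS)
    cluster_status = get_cluster_status()  # e.g. 'UP', 'STOPPED', None
    if cluster_status != 'UP':
        # Cluster is gone or stopping: treat the failure as a preemption
        # and let the controller recover the job instead of failing it.
        return 'PREEMPTED'
    return 'USER_FAILURE'
```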
Tested (run the relevant ones):
pytest tests/test_smoke.py --managed-spot