Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Switch to download and streaming for the failed user program logs #2330

Merged
merged 7 commits into from
Aug 1, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jul 31, 2023

Fixes #2329

This should make the debugging of the user program much easier.

Tested (run the relevant ones):

@Michaelvll Michaelvll changed the title Switch to download and streaming for the failed user program logs [Spot] Switch to download and streaming for the failed user program logs Jul 31, 2023
@Michaelvll Michaelvll mentioned this pull request Jul 31, 2023
5 tasks
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll! Some questions.

sky/spot/controller.py Show resolved Hide resolved
@@ -74,6 +75,36 @@ def __init__(self, job_id: int, dag_yaml: str,
job_id_env_vars)
task.update_envs(task_envs)

def _download_log_and_stream(self, handle):
""" Download the logs from spot cluster, and stream them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
""" Download the logs from spot cluster, and stream them
"""Downloads the logs of the latest job of a spot cluster, and streams them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to reduce the number of words a bit to fit it in one line. Modified. Thanks!

sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
log_dir = list(log_dirs.values())[0]

# Print the logs to the console.
for log_file in os.listdir(log_dir):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: what is the layout of ~/sky_logs/spot_jobs/? Does it have a subdir or a file per spot job id (that has failed)? Wondering why we're printing every *.log file here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each spot job will have a subdir under the ~/sky_logs/spot_jobs, and the log_dir here is the folder for that particular job.

We can only print out the run.log under the folder, but I was trying to make sure we don't miss any possible logs in future. Just changed back to run.log for clarity.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll! Some nits.

sky/spot/controller.py Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
Michaelvll and others added 4 commits August 1, 2023 13:38
@Michaelvll Michaelvll merged commit f7ad351 into master Aug 1, 2023
15 checks passed
@Michaelvll Michaelvll deleted the more-reliable-spot-log branch August 1, 2023 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spot] Failed spot jobs do not have complete log in the sky spot logs --controller
2 participants