runner.aws_batch: Gracefully handle errors when fetching logs for completed jobs #406

Merged
tsibley merged 1 commit into master from trs/aws-batch/handle-failure-fetching-logs
Nov 1, 2024

Conversation

@tsibley tsibley commented Oct 31, 2024

For completed jobs, it's more useful to continue on with printing the job status (e.g. success or reason for failure) and downloading job results even if an error occurs when fetching logs. As a concrete example, we've observed cases where a failed job has a log stream associated with it in Batch but that log stream does not actually exist in CloudWatch Logs.¹ The log fetch error hid the reason for job failure, hampering troubleshooting.

¹ https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1730406138009409
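
A minimal sketch of the behavior described above, not the actual `runner.aws_batch` code: log fetching is treated as best effort, so an error there is reported and the job status is still printed. The names `report_completed_job`, `LogFetchError`, and `fetch_logs` are hypothetical and exist only for illustration; `LogFetchError` stands in for a client error such as the `ResourceNotFoundException` seen in practice.

```python
import io
import sys


class LogFetchError(Exception):
    """Hypothetical stand-in for a log service client error."""


def report_completed_job(status, reason, fetch_logs, out=None):
    """Print a completed job's logs (best effort), then its final status.

    `fetch_logs` is any callable returning an iterable of log lines.
    Returns True if the job succeeded, False otherwise.
    """
    if out is None:
        out = sys.stdout

    print(f"Job is {status}", file=out)

    try:
        for line in fetch_logs():
            print(line, file=out)
    except Exception as error:
        # The job is already complete, so a log fetch failure should not
        # hide the job status; report it and carry on.
        print(f"Unable to fetch job logs: {error}", file=out)

    if status == "SUCCEEDED":
        print("Job SUCCEEDED", file=out)
    else:
        print(f"Job {status} ({reason})", file=out)

    return status == "SUCCEEDED"


def missing_log_stream():
    # Simulates a log stream that is referenced by the job but does not exist.
    raise LogFetchError("The specified log stream does not exist.")


if __name__ == "__main__":
    buffer = io.StringIO()
    report_completed_job("FAILED", "Task failed to start", missing_log_stream, out=buffer)
    print(buffer.getvalue(), end="")
```

Catching a broad `Exception` here is a deliberate simplification for the sketch; real code would more likely catch only the client's own error types.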

Checklist

  • Checks pass

@jameshadfield jameshadfield left a comment

Much improved for the error I ran into.

Before:

Attaching to Nextstrain AWS Batch Job ID: XXX
Job is FAILED
Traceback (most recent call last):
...
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the FilterLogEvents operation: The specified log stream does not exist.

After:

Attaching to Nextstrain AWS Batch Job ID: XXX
Job is FAILED
Unable to fetch job logs: An error occurred (ResourceNotFoundException) when calling the FilterLogEvents operation: The specified log stream does not exist.
Job FAILED after 4.1 minutes (DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s, Task failed to start)

@tsibley tsibley merged commit c3dc26e into master Nov 1, 2024
45 checks passed
@tsibley tsibley deleted the trs/aws-batch/handle-failure-fetching-logs branch November 1, 2024 18:48
tsibley commented Nov 1, 2024

This is part of the 8.5.4 release that I've just kicked off.

4 participants