AWS spot failure - custom error message #5240

Open
ewels opened this issue Aug 19, 2024 · 3 comments
Comments

@ewels
Member

ewels commented Aug 19, 2024

We recently set the default for aws.batch.maxSpotAttempts to 0 in #5215 to avoid unexpected costs in cloud.

This is good, but it means we go back to the state we had before this feature was implemented: people's Nextflow pipeline runs crash with the extremely unhelpful message that AWS returns. From memory, this message has no mention of spot reclamation or anything similar, and is not at all intuitive for new users.

In order for aws.batch.maxSpotAttempts to work, I assume that Nextflow must be capturing these spot reclamation errors already. Even if we're not retrying, can we use that opportunity to print a more helpful error message to the Nextflow log explaining what has happened, and pointing to the maxSpotAttempts config option so that the user knows how to resolve it?
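
For reference, this is roughly what a helpful error message would point the user to. It is a minimal config sketch; the value 3 is only illustrative and not a recommendation:

```groovy
// nextflow.config
aws {
    batch {
        // Illustrative value (an assumption, pick what suits your budget).
        // The default is 0 since #5215, i.e. no automatic retry on spot reclaim.
        maxSpotAttempts = 3
    }
}
```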

@ewels
Member Author

ewels commented Aug 19, 2024

Found some of this logic here:

/*
 * retry on spot reclaim
 * https://aws.amazon.com/blogs/compute/introducing-retry-strategies-for-aws-batch/
 */
final attempts = maxSpotAttempts()
if( attempts>0 ) {
    // retry the job when an Ec2 instance is terminate
    final cond1 = new EvaluateOnExit().withAction('RETRY').withOnStatusReason('Host EC2*')
    // the exit condition prevent to retry for other reason and delegate
    // instead to nextflow error strategy the handling of the error
    final cond2 = new EvaluateOnExit().withAction('EXIT').withOnReason('*')
    final retry = new RetryStrategy()
            .withAttempts( attempts )
            .withEvaluateOnExit(cond1, cond2)
    result.setRetryStrategy(retry)
}

However, that is set up in advance of the error actually triggering, so it needs a deeper dive into the code to find the right place for the log message.

@pditommaso

This comment was marked as outdated.

@pditommaso
Member

Maybe it could be done by checking the error reason returned by Batch and using it to customise the Nextflow error message:

final reason = errReason(job)
// retry all CannotPullContainer errors apart when it does not exist or cannot be accessed
final unrecoverable = reason.contains('CannotPullContainer') && reason.contains('unauthorized')
task.error = unrecoverable ? new ProcessUnrecoverableException(reason) : new ProcessException(reason)
task.stderr = executor.getJobOutputStream(jobId) ?: errorFile
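
A minimal sketch of how that check might look, written as a standalone Groovy script rather than a patch. `describeBatchFailure` is a hypothetical helper that does not exist in Nextflow, and the sketch assumes the reason string for a reclaimed instance starts with `Host EC2`, which is the prefix the retry condition quoted earlier in this thread matches on:

```groovy
// Hypothetical helper, not part of Nextflow: appends a hint to the Batch
// reason when it looks like the host instance was terminated.
String describeBatchFailure(String reason) {
    // Assumption: spot reclaim reasons start with 'Host EC2', the same
    // prefix used by the 'Host EC2*' retry condition
    if( reason?.startsWith('Host EC2') ) {
        return reason + '\n' +
            'The EC2 instance running this job was terminated, possibly due to a spot reclamation. ' +
            'Set `aws.batch.maxSpotAttempts` to a value greater than 0 to retry such jobs automatically.'
    }
    // Any other reason (or null) is passed through unchanged
    return reason
}

// Example reason string, made up for illustration
println describeBatchFailure('Host EC2 (instance i-0123456789abcdef0) terminated.')
```

The returned string could then be used in place of `reason` when constructing the `ProcessException`, so that the hint shows up in the Nextflow log even when no retry is attempted.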
