AWS spot failure - custom error message #5240

ewels · 2024-08-19T19:42:51Z

We recently set the default for aws.batch.maxSpotAttempts to 0 in #5215 to avoid unexpected costs in cloud.

This is good, but it means we go back to the state we had before this feature was implemented, which is people's Nextflow pipeline runs crashing with the extremely unhelpful message that AWS returns. From memory, this has no mention of spot reclamation or anything, and is not at all intuitive for new users.

In order for aws.batch.maxSpotAttempts to work, I assume that Nextflow must be capturing these spot reclamation errors already. Even if we're not retrying, can we use that opportunity to print a more helpful error message to the Nextflow log explaining what has happened, and pointing to the maxSpotAttempts config option so that the user knows how to resolve it?
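For reference, the setting such a message would point to could look roughly like this in a `nextflow.config` file. This is only a sketch: the value 5 is an arbitrary illustrative choice, not a recommendation.

```groovy
// nextflow.config
// Opt back in to automatic retries when a spot instance is reclaimed.
// The value 5 is illustrative only; the default is 0 (no retries) since #5215.
aws.batch.maxSpotAttempts = 5
```

Retries of this kind can add cost, which is why the default was changed.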

ewels · 2024-08-19T19:45:37Z

Found some of this logic here:

nextflow/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy

Lines 728 to 743 in 12ea4d7

    /*
     * retry on spot reclaim
     * https://aws.amazon.com/blogs/compute/introducing-retry-strategies-for-aws-batch/
     */
    final attempts = maxSpotAttempts()
    if( attempts>0 ) {
        // retry the job when an Ec2 instance is terminate
        final cond1 = new EvaluateOnExit().withAction('RETRY').withOnStatusReason('Host EC2*')
        // the exit condition prevent to retry for other reason and delegate
        // instead to nextflow error strategy the handling of the error
        final cond2 = new EvaluateOnExit().withAction('EXIT').withOnReason('*')
        final retry = new RetryStrategy()
                .withAttempts( attempts )
                .withEvaluateOnExit(cond1, cond2)
        result.setRetryStrategy(retry)
    }

However, that's set up in advance of the error actually triggering, so it needs a deeper dive into the code to find the right place for the log message.

pditommaso · 2024-09-02T13:45:52Z

Maybe it could be done by checking the error reason returned by Batch to customise the Nextflow error message

nextflow/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy

Lines 271 to 275 in 5a37e61

    final reason = errReason(job)
    // retry all CannotPullContainer errors apart when it does not exist or cannot be accessed
    final unrecoverable = reason.contains('CannotPullContainer') && reason.contains('unauthorized')
    task.error = unrecoverable ? new ProcessUnrecoverableException(reason) : new ProcessException(reason)
    task.stderr = executor.getJobOutputStream(jobId) ?: errorFile
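A minimal sketch of what that check might look like, written as a standalone Groovy script rather than a patch to the handler. The class `SpotFailureHint` and its methods are hypothetical and not part of Nextflow. The `Host EC2` prefix is assumed from the `'Host EC2*'` pattern in the retry strategy quoted earlier, and the sample reason string is illustrative only.

```groovy
// Hypothetical helper, not part of Nextflow: rewrites a raw AWS Batch
// status reason into a friendlier message when it looks like a spot reclaim.
class SpotFailureHint {

    // Assumption: spot reclaim reasons start with the same prefix that the
    // retry strategy matches with 'Host EC2*'
    static final String SPOT_PREFIX = 'Host EC2'

    static boolean looksLikeSpotReclaim(String reason) {
        return reason != null && reason.startsWith(SPOT_PREFIX)
    }

    static String describe(String reason) {
        // Leave every other failure reason untouched
        if( !looksLikeSpotReclaim(reason) )
            return reason
        return "Task failed because its EC2 instance was terminated, most likely a spot reclamation " +
               "(AWS reason: ${reason}). " +
               "Consider setting `aws.batch.maxSpotAttempts` to a value greater than 0 to retry automatically."
    }
}

// Illustrative reason string, not copied from a real log
println SpotFailureHint.describe('Host EC2 (instance i-0123456789abcdef0) terminated.')
```

In the handler, the result of something like `describe(reason)` could be passed to `ProcessException` in place of the raw reason, so that the existing `CannotPullContainer` logic stays unchanged.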


bentsherman added the executor/aws-batch label Sep 3, 2024
