Add Google Batch NOT_FOUND error management #5690

Draft
wants to merge 1 commit into
base: master

jorgee
Contributor

@jorgee jorgee commented Jan 21, 2025

This PR includes a possible fix for the NotFoundException returned by the Google Batch API when getting some tasks.

When client.getTaskStatus throws a NotFoundException, it is caught and handled as follows:

  • If the list retrieved from client.listTasks contains several tasks, it tries to find the task in that list and checks its status.
  • Otherwise, or if the task is not found in the list, it retrieves the job status.
  • If it can get neither the task status nor the job status, it returns PENDING.
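
The fallback order above could be sketched roughly as follows. This is only an illustration in Python, not the code in this PR: `FakeBatchClient`, `NotFoundError`, and `resolve_task_state` are hypothetical stand-ins for the Google Batch client, its NotFoundException, and the handler logic, and "several tasks" is assumed to mean more than one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class NotFoundError(Exception):
    """Hypothetical stand-in for the NotFoundException thrown by the Batch client."""


@dataclass
class FakeBatchClient:
    """Hypothetical in-memory client used only for this sketch (not the real API)."""
    direct: Dict[str, str] = field(default_factory=dict)  # states visible to get_task_status
    listed: Dict[str, str] = field(default_factory=dict)  # states visible to list_tasks
    job_state: Optional[str] = None                       # None means no job status available

    def get_task_status(self, job_id: str, task_id: str) -> str:
        if task_id not in self.direct:
            raise NotFoundError(task_id)
        return self.direct[task_id]

    def list_tasks(self, job_id: str) -> Dict[str, str]:
        return dict(self.listed)

    def get_job_status(self, job_id: str) -> Optional[str]:
        return self.job_state


def resolve_task_state(client: FakeBatchClient, job_id: str, task_id: str) -> str:
    """Return a task state, falling back to the task list, then the job, then PENDING."""
    try:
        return client.get_task_status(job_id, task_id)
    except NotFoundError:
        pass  # fall through to the fallbacks below

    # Fallback 1: look the task up in the job's task list (task array jobs).
    tasks = client.list_tasks(job_id)
    if len(tasks) > 1 and task_id in tasks:
        return tasks[task_id]

    # Fallback 2: use the job status when the task cannot be found in the list.
    job_state = client.get_job_status(job_id)
    if job_state is not None:
        return job_state

    # Fallback 3: nothing is known yet, so report PENDING.
    return "PENDING"
```

For example, with a client whose direct lookup fails but whose task list contains two tasks, `resolve_task_state` returns the state found in the list; with a single listed task it falls back to the job status.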

A unit test that produces the NotFoundException is included to validate the logic.

Two corner cases might not be handled correctly.

  1. When neither a task status nor a job status is found, PENDING is returned. I think this can happen in the initial stage, while Google Batch is still creating the job and its tasks. I am assuming that a task or job status will be received at some point during the execution. This is the same handling we already applied when no tasks were found in the job.
  2. When a task is not found in Google Batch, it belongs to a task array job, and the job status is RUNNING or FAILED, we could report an incorrect task status. This is not very important for RUNNING, but for FAILED we could mark a task as failed when its actual state is unknown. According to the API documentation, the RUNNING and FAILED job states mean that at least one of the tasks is in that state. A job can also be FAILED because of a job-level failure (such as the invalid type one we saw in Google Batch run hangs when a job fail to start #5550). So I am assuming that when the task status is not found, the task is not in the list, and the job is FAILED, the cause is a job failure and the task is also FAILED. Other alternatives I have considered are setting the task to PENDING, which I think could cause a deadlock if it is a job failure, or throwing an exception to abort the execution. Other suggestions are welcome.
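
To make the trade-off in the second corner case concrete, here is a small Python sketch of the three options for a task that cannot be found. It is illustrative only: `MissingTaskPolicy` and `state_for_missing_task` are hypothetical names, and the state strings are simplified.

```python
from enum import Enum
from typing import Optional


class MissingTaskPolicy(Enum):
    """Hypothetical options for a task that is missing while its job is FAILED."""
    INHERIT_JOB_STATE = "inherit"  # assumption described above: the task is FAILED too
    STAY_PENDING = "pending"       # alternative: may never leave PENDING after a job failure
    ABORT = "abort"                # alternative: raise and abort the execution


def state_for_missing_task(
    job_state: Optional[str],
    policy: MissingTaskPolicy = MissingTaskPolicy.INHERIT_JOB_STATE,
) -> str:
    """Pick a task state when the task itself could not be found."""
    if job_state is None:
        # Corner case 1: no task and no job status yet, assume still being created.
        return "PENDING"
    if job_state != "FAILED":
        # For RUNNING (and other states) reusing the job state is mostly harmless.
        return job_state
    # Corner case 2: the job is FAILED but the task's real state is unknown.
    if policy is MissingTaskPolicy.INHERIT_JOB_STATE:
        return "FAILED"
    if policy is MissingTaskPolicy.STAY_PENDING:
        return "PENDING"
    raise RuntimeError("task not found and job is FAILED; aborting")
```

With the default policy, `state_for_missing_task("FAILED")` yields `"FAILED"`, which matches the assumption described above; the other two policies correspond to the alternatives that were considered.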

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee linked an issue Jan 21, 2025 that may be closed by this pull request

netlify bot commented Jan 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit e60d4b9
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/678ff88d36d07d0008188b73

@pditommaso
Member

Should @ejseqera run a stress test on this?

@ejseqera

I'm on it

@ewels
Member

ewels commented Jan 22, 2025

@jorgee I know we just discussed it on the call, but can we confirm that this PR also handles these states please?

  • UNAVAILABLE
  • DEADLINE_EXCEEDED
  • RESOURCE_EXHAUSTED
  • UNKNOWN
  • ..more

Thanks!

@jorgee
Contributor Author

jorgee commented Jan 22, 2025

> @jorgee I know we just discussed it on the call, but can we confirm that this PR also handles these states please?
>
>   • UNAVAILABLE
>   • DEADLINE_EXCEEDED
>   • RESOURCE_EXHAUSTED
>   • UNKNOWN
>   • ..more
>
> Thanks!

I can't see these states in the Google Batch API

@robnewman
Contributor

@jorgee #4537

@pditommaso
Member

#4537 is a different issue

@jorgee
Contributor Author

jorgee commented Jan 22, 2025

OK, you mean the status code in the exception. No, I did it only for NOT_FOUND because that error was not resolved even with long retries.

@robnewman
Contributor

@jorgee We're trying to determine if this PR solves this customer feature request

@jorgee
Contributor Author

jorgee commented Jan 22, 2025

> @jorgee We're trying to determine if this PR solves this customer feature request

No, this PR does not solve it. You can find the current retriable exceptions here. If we need to support others, we should include them, but let's discuss that in another issue.

Successfully merging this pull request may close these issues.

NOT_FOUND error on google-batch
5 participants