
WX-1595 GCP Batch backend refactor to include the PAPI request manager #7412

Merged

merged 80 commits into develop from gcp-batch-request-manager-refactor-v2 on May 29, 2024

Conversation

@AlexITC (Collaborator) commented Apr 24, 2024

WARNING: This PR is huge and needs to be reviewed carefully. We have already performed many manual tests and ported many other tests from PAPI.

Intro

The main goal is to refactor the Batch backend to include PipelinesApiRequestManager and PipelinesApiRequestWorker.

This also fixes a few missing details from the initial Batch integration (#7177), for example:

  1. Missing metrics are now published.
  2. The job status is queried before deleting it, to avoid deleting jobs that are already in a final state (PAPI can abort jobs, but Batch deletes them instead); a sketch of the idea follows below.
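
A minimal sketch of that "query before delete" check, using the google-cloud-batch Java client. This is an illustration, not the PR's actual code, and it assumes SUCCEEDED and FAILED are the terminal states of interest:

```scala
import com.google.cloud.batch.v1.{BatchServiceClient, JobName, JobStatus}

// Query the job's state first and skip the delete when it is already terminal.
def deleteUnlessFinal(client: BatchServiceClient, jobName: JobName): Unit = {
  val state = client.getJob(jobName).getStatus.getState
  val isFinal = state == JobStatus.State.SUCCEEDED || state == JobStatus.State.FAILED
  if (!isFinal)
    client.deleteJobAsync(jobName.toString) // async delete; fire-and-forget in this sketch
}
```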

I have been trying to split this into multiple smaller PRs; please let me know if you find any piece that can be submitted independently. Previous PRs:

Questions (already resolved)

  1. There is a Centaur test included in the Batch suite that still seems to invoke a PAPI test via testCentaurGcpBatch.sh (see papi_v2alpha1_gcsa.test; the test itself says the Batch backend is not used). Could this be related to the false alarms from the code-coverage bot?
  2. There are warnings raised by codecov that seem wrong; for example, the lines flagged in GcpBatchGroupedRequests are covered by GcpBatchGroupedRequestsSpec.
  3. Should we set GcpBatchAsyncBackendJobExecutionActor#requestsAbortAndDiesImmediately to false? This is the value PAPI uses, but it causes a Centaur test to fail.
  4. While this behavior is inherited from PAPI, I think we need to change it, but I'd like a second opinion: increasing request-workers also increases each worker's delay before pulling work. For example, setting this value to 100 or above would cause the delay to grow to ~18m, which seems unreasonable (see BatchApiRequestManager.scala). Putting an upper limit on the delay seems worth it (see the sketch after this list); any thoughts?
  5. Do we need to capture anything else for the job execution events? See below and BatchRequestExecutor#getEventList.
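
Regarding question 4, this is a hypothetical sketch of the cap I have in mind; workerPullDelay, basePeriod, and maxDelay are illustrative names, not existing Cromwell settings:

```scala
import scala.concurrent.duration._

// The per-worker pull delay grows linearly with request-workers,
// so clamp it to an upper bound instead of letting it grow unbounded.
def workerPullDelay(requestWorkers: Int, basePeriod: FiniteDuration, maxDelay: FiniteDuration): FiniteDuration =
  (basePeriod * requestWorkers.toLong) min maxDelay

// With ~11s per worker, 100 workers would mean ~18m between pulls;
// a cap keeps it at, say, 2 minutes:
// workerPullDelay(100, 11.seconds, 2.minutes) == 2.minutes
```
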
Execution events details

What GCP provides:

Event type=STATUS_CHANGED
time=seconds: 1712173852,nanos: 952604950
taskState=STATE_UNSPECIFIED,
description=Job state is set from QUEUED to SCHEDULED for job projects/392615380452/locations/us-south1/jobs/job-ba81bad8-82e9-4d95-8fc0-04dfbbd746da.
taskExecution.exitCode=0

Event type=STATUS_CHANGED,
time=seconds: 1712173947, nanos: 568998105
taskState=STATE_UNSPECIFIED
description=Job state is set from SCHEDULED to RUNNING for job projects/392615380452/locations/us-south1/jobs/job-ba81bad8-82e9-4d95-8fc0-04dfbbd746da.
taskExecution.exitCode=0

Event type=STATUS_CHANGED
time=seconds: 1712173989, nanos: 937816549
taskState=STATE_UNSPECIFIED
description=Job state is set from RUNNING to SUCCEEDED for job projects/392615380452/locations/us-south1/jobs/job-ba81bad8-82e9-4d95-8fc0-04dfbbd746da.
taskExecution.exitCode=0

What we define as execution events:

ExecutionEvent(Job state is set from QUEUED to SCHEDULED for job projects/392615380452/locations/us-south1/jobs/job-321db1bc-9a68-4171-aa2a-46885d781656.,2024-04-03T20:10:01.704137839Z,None)
ExecutionEvent(Job state is set from SCHEDULED to RUNNING for job projects/392615380452/locations/us-south1/jobs/job-321db1bc-9a68-4171-aa2a-46885d781656.,2024-04-03T20:11:30.631264449Z,None)
ExecutionEvent(Job state is set from RUNNING to SUCCEEDED for job projects/392615380452/locations/us-south1/jobs/job-321db1bc-9a68-4171-aa2a-46885d781656.,2024-04-03T20:12:16.898798407Z,None)
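
For reference, a minimal sketch of the kind of mapping done in BatchRequestExecutor#getEventList, assuming cromwell.core.ExecutionEvent(name, offsetDateTime, grouping); this is not the PR's exact code:

```scala
import java.time.{Instant, ZoneOffset}
import com.google.cloud.batch.v1.StatusEvent
import cromwell.core.ExecutionEvent

// The event description becomes the ExecutionEvent name and the protobuf
// timestamp becomes its time; no grouping is derived from Batch events.
def toExecutionEvent(event: StatusEvent): ExecutionEvent = {
  val ts = event.getEventTime
  val time = Instant.ofEpochSecond(ts.getSeconds, ts.getNanos.toLong).atOffset(ZoneOffset.UTC)
  ExecutionEvent(name = event.getDescription, offsetDateTime = time, grouping = None)
}
```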

Load test results

We have executed many load tests; this is the latest one, involving 14k jobs.

| Data / Backend | Batch with MySQL | PAPIv2 with MySQL |
| --- | --- | --- |
| Jobs | 14400 | 14400 |
| Execution time | 20936 seconds | 24451 seconds |

Overall, all our tests indicate that Batch finishes executing the jobs faster than PAPIv2 (about 14% faster in this run).

Load test settings

We ran Cromwell in server mode with the following settings:

  • request-timeout: 10m
  • idle-timeout: 10m
  • job-rate-control: jobs = 20, per = 10 seconds
  • max-workflow-launch-count: 50
  • new-workflow-poll-rate: 1
  • database: MySQL
  • virtual-private-cloud setup
  • maximum-polling-interval: 600s
  • localization-attempts: 3
  • google.auth: service account
  • request-workers: 3
  • concurrent-job-limit: 14400

JVM Options:

  • -Xms512m -Xmx64g

NOTE: Initially we found a bottleneck on Batch, but Google enabled an experimental setting to schedule many jobs concurrently, which reduced the total execution time.

Server capacity (from Google Cloud):

  • VM Machine Type: n2-standard-16
  • Virtual CPUs: 16
  • Memory: 64G
  • Architecture: x86/64
  • CPU Platform: Intel Cascade Lake

Commits:

  • BatchApiRequestWorkerSpec works! The goal is to port the PAPIv2 request manager behavior into GCP Batch.
  • We just need to rename the methods to remove the PAPI names.
  • Now, we just need to fix the runtime errors, wiring the correct messages, etc.
  • The queue must be cleared after executing the requests.
  • The abort request handler was returning the wrong message; also, keep the old behavior where an abort request is handled only once.
  • This grabs many details from PAPIv2; work is still pending on the tests. NOTE: This could break the current integration.
@aednichols (Collaborator)

I hope to circle back to this soon, but in the meantime it looks like we picked up some compilation issues from merging all the other PRs.

@AlexITC (Collaborator, Author) commented May 15, 2024

I should be able to solve these before you wake up, thanks!

@AlexITC (Collaborator, Author) commented May 16, 2024

I have resolved the problems; it is also worth executing the new tests from #7440.

@aednichols (Collaborator)

I need to put this down for the moment to finish #7439, which is currently affecting users; I hope to get back to it tomorrow morning.

Commit: This allows us to handle the abort result instead of blindly marking the job as aborted.
@AlexITC (Collaborator, Author) commented May 21, 2024

Good news: I was able to fix StandardAsyncExecutionActor#requestsAbortAndDiesImmediately=false.

It turns out that when this flag is true, Cromwell blindly marks the job as aborted; now, Cromwell waits until the abort request has executed and the job can no longer be retrieved from GCP.
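
The shape of that wait, as a rough sketch (a hypothetical helper, not the PR's code): poll until GCP no longer returns the job, rather than assuming the abort succeeded.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._

// fetchJob should return None once GCP no longer knows the job.
@tailrec
def awaitJobGone(fetchJob: () => Option[String], attemptsLeft: Int, pause: FiniteDuration): Boolean =
  fetchJob() match {
    case None => true // the job can no longer be retrieved: abort completed
    case Some(_) if attemptsLeft <= 0 => false // give up after the retry budget
    case Some(_) =>
      Thread.sleep(pause.toMillis)
      awaitJobGone(fetchJob, attemptsLeft - 1, pause)
  }
```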

@aednichols (Collaborator)

Awesome, will have time to look today!

Commits:

  • Now, we look for the submit requests before sending an abort request, canceling jobs that were never submitted to GCP.
  • Turns out that this is now unnecessary because we are already mapping query errors to a RunStatus.
  • When aborting an individual job, only that job is aborted instead of all the jobs from that workflow.
case GcpBatchBackendSingletonActor.Event.JobSubmitted(job) =>
  log.info(s"Job submitted to GCP: ${job.getName}")
case job: StandardAsyncJob =>
  log.info(s"A job was submitted successfully: ${job.jobId}")
Collaborator:

I have no problem merging as-is with these logs in place, seeing as we'll probably have some more rounds of debugging. That said, we'll probably want to reduce the info ones eventually.

Collaborator Author:

For the time being, I have marked the noisy logs as debug and kept the rest as info; it will help with debugging some of the currently open issues.

}
// If error code 10, add some extra messaging to the server logging
else if (runStatus.errorCode.getCode.value() == BatchMysteriouslyCrashedErrorCode) {
  jobLogger.info(s"Job Failed with Error Code 10 for a machine where Preemptible is set to $preemptible")
Collaborator:

Does Batch use the same range of error codes? It looks like the enum com.google.cloud.batch.v1.JobStatus.State only goes up to 6.

@@ -1179,7 +1383,7 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
}
}

- // No need for Cromwell-performed localization in the PAPI backend, ad hoc values are localized directly from GCS to the VM by PAPI.
+ // No need for Cromwell-performed localization in the Batch backend, ad hoc values are localized directly from GCS to the VM by Batch.
Collaborator:

I'm so excited about this 🚀

Collaborator Author:

Be aware that I did not do anything here; it's the code/comment ported from PAPI.

@@ -1138,6 +1363,9 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
batchOutputs collectFirst {
  case batchOutput if batchOutput.name == makeSafeReferenceName(path) =>
    val pathAsString = batchOutput.cloudPath.pathAsString

    // TODO: batchOutput.cloudPath.exists invokes GCP, which causes a test ported from papi-common to fail
    // because GCP is not configured in tests; shall we do anything?
    if (batchOutput.isFileParameter && !batchOutput.cloudPath.exists) {
Collaborator:

Let's leave it as a TODO for now. It seems awkward to use a throw to signal "it's OK to do nothing".


case apiQuery: BatchApiRequest =>
  log.debug("Forwarding API query to Batch request manager actor")
  jesApiQueryManager.forward(apiQuery)
Collaborator:

Legacy naming: not a super high priority, but "JES" stands for "Job Execution Service" and is an ancestor of Batch.

JES -> Pipelines API v1 -> Pipelines API v2alpha -> Pipelines API v2beta -> Genomics -> Life Sciences -> Batch

Collaborator Author:

I have resolved this.

import scala.util.Try

// Mirrors com.google.api.client.googleapis.batch.BatchRequest but this is immutable
class GcpBatchGroupedRequests(requests: List[(BatchApiRequest, Promise[Try[BatchApiResponse]])]) {
Collaborator:

Is there still value in grouping them in that case, or should we just fire them off as they come in?

Commits:

  • Turns out that this is not the correct fix.
  • Switch the noisy logs to debug level.
  • Remove status codes ported from PAPI because they are not useful in Batch.
  • Remove all the test cases involving the PAPI codes.
  • Clean RunStatus of the unused args.
  • Rename JES occurrences to Batch.
@AlexITC (Collaborator, Author) commented May 25, 2024

Due to the activity noise, the comments above are hidden, so I'll post here for better visibility.

Request grouping

Originally, this was created because we hoped Google offered an equivalent of batched requests for the Batch API; Google has since confirmed that there is no way to do that.

These are some notes from our internal discussions:

  1. The code becomes much simpler if this grouping is removed.
  2. We have not checked the potential implications of creating a Batch client for every request versus reusing the same client for the application's lifecycle.
  3. Grouping requests could eventually allow us to implement streaming (e.g. fs2/akka-streams), which would let us throttle the requests; still, if Cromwell already does this in another layer, it becomes unnecessary.

Given that the current code has been tested so many times, my suggestion is to keep the grouping and potentially remove it in a later iteration.
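
For context, the immutable-grouping shape looks roughly like this. It's a sketch with placeholder request/response types standing in for the PR's BatchApiRequest/BatchApiResponse, not the actual GcpBatchGroupedRequests code:

```scala
import scala.concurrent.{Future, Promise}
import scala.util.Try

trait BatchApiRequest  // placeholder for the PR's request type
trait BatchApiResponse // placeholder for the PR's response type

// Enqueueing returns a new immutable group plus a future the caller can
// wait on; the worker later completes each promise when the group executes.
final class GroupedRequests private (entries: List[(BatchApiRequest, Promise[Try[BatchApiResponse]])]) {
  def enqueue(request: BatchApiRequest): (GroupedRequests, Future[Try[BatchApiResponse]]) = {
    val promise = Promise[Try[BatchApiResponse]]()
    (new GroupedRequests((request, promise) :: entries), promise.future)
  }
  def size: Int = entries.size
}

object GroupedRequests {
  val empty: GroupedRequests = new GroupedRequests(Nil)
}
```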

Error codes

Google has confirmed that there are more error codes than the gRPC response provides; these can be found in the job events, so they need to be parsed from the strings (PAPI does something similar). That has not been done in this PR, which is why I removed a lot of code that is not necessary.

In a follow-up PR, we should implement part of this in order to handle preemption errors.

See https://cloud.google.com/batch/docs/troubleshooting#reserved-exit-codes
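
Purely as an illustration of the string-parsing approach (the exact wording of Batch event descriptions is an assumption here, not GCP's documented format):

```scala
object BatchExitCodes {
  // The pattern assumes the event description mentions "exit code NNNNN";
  // adjust once the real message format is confirmed. Reserved codes are
  // five-digit values (e.g. 50001 for Spot VM preemption, per the page above).
  private val ExitCodePattern = """exit code[:\s]+(\d{5})""".r.unanchored

  def reservedExitCode(description: String): Option[Int] =
    description match {
      case ExitCodePattern(code) => Some(code.toInt)
      case _ => None
    }
}
```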

Thanks.

@AlexITC merged commit 1515aa8 into develop on May 29, 2024
37 checks passed
@AlexITC deleted the gcp-batch-request-manager-refactor-v2 branch May 29, 2024 21:29