
WX-1595 GCP Batch backend refactor to include the PAPI request manager #7412

Merged

merged 80 commits into develop from gcp-batch-request-manager-refactor-v2 on May 29, 2024

Conversation

@AlexITC (Collaborator) commented Apr 24, 2024

WARNING: This PR is huge and needs to be reviewed carefully. We have already performed many manual tests and ported many other tests from PAPI.

Intro

The main goal is to refactor the Batch backend to include PipelinesApiRequestManager and PipelinesApiRequestWorker.

This also fixes a few missing details from the initial Batch integration (#7177), for example:

  1. Missing metrics are now published.
  2. The job status is queried before deleting it, to avoid deleting jobs that are already in a final state (PAPI can abort jobs, but Batch deletes them instead); a sketch of the idea follows below.
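
A minimal sketch of that "query before delete" check, using the google-cloud-batch Java client. This is an illustration, not the PR's actual code, and it assumes SUCCEEDED and FAILED are the terminal states of interest:

```scala
import com.google.cloud.batch.v1.{BatchServiceClient, JobName, JobStatus}

// Query the job's state first and skip the delete when it is already terminal.
def deleteUnlessFinal(client: BatchServiceClient, jobName: JobName): Unit = {
  val state = client.getJob(jobName).getStatus.getState
  val isFinal = state == JobStatus.State.SUCCEEDED || state == JobStatus.State.FAILED
  if (!isFinal)
    client.deleteJobAsync(jobName.toString) // async delete; fire-and-forget in this sketch
}
```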

I have been trying to split this into multiple smaller PRs; please let me know if you find any piece that can be submitted independently. Previous PRs:

Questions (already resolved)

  1. There is a Centaur test included in the Batch suite that still seems to invoke a PAPI test via testCentaurGcpBatch.sh (see papi_v2alpha1_gcsa.test; the test itself says the Batch backend is not used). Could this be related to the false alarms from the code-coverage bot?
  2. There are warnings raised by codecov that seem wrong; for example, the lines flagged in GcpBatchGroupedRequests are covered by GcpBatchGroupedRequestsSpec.
  3. Should we set GcpBatchAsyncBackendJobExecutionActor#requestsAbortAndDiesImmediately to false? This is the value PAPI uses, but it causes a Centaur test to fail.
  4. While this behavior is inherited from PAPI, I think we need to change it, but I'd like a second opinion: increasing request-workers also increases each worker's delay before pulling work. For example, setting this value to 100 or above would cause the delay to grow to ~18m, which seems unreasonable (see BatchApiRequestManager.scala). Putting an upper limit on the delay seems worth it (see the sketch after this list); any thoughts?
  5. Do we need to capture anything else for the job execution events? See below and BatchRequestExecutor#getEventList.
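
Regarding question 4, this is a hypothetical sketch of the cap I have in mind; workerPullDelay, basePeriod, and maxDelay are illustrative names, not existing Cromwell settings:

```scala
import scala.concurrent.duration._

// The per-worker pull delay grows linearly with request-workers,
// so clamp it to an upper bound instead of letting it grow unbounded.
def workerPullDelay(requestWorkers: Int, basePeriod: FiniteDuration, maxDelay: FiniteDuration): FiniteDuration =
  (basePeriod * requestWorkers.toLong) min maxDelay

// With ~11s per worker, 100 workers would mean ~18m between pulls;
// a cap keeps it at, say, 2 minutes:
// workerPullDelay(100, 11.seconds, 2.minutes) == 2.minutes
```
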
Execution events details

What GCP provides:

Event type=STATUS_CHANGED
time=seconds: 1712173852,nanos: 952604950
taskState=STATE_UNSPECIFIED,
description=Job state is set from QUEUED to SCHEDULED for job projects/392615380452/locations/us-south1/jobs/job-ba81bad8-82e9-4d95-8fc0-04dfbbd746da.
taskExecution.exitCode=0

Event type=STATUS_CHANGED,
time=seconds: 1712173947, nanos: 568998105
taskState=STATE_UNSPECIFIED
description=Job state is set from SCHEDULED to RUNNING for job projects/392615380452/locations/us-south1/jobs/job-ba81bad8-82e9-4d95-8fc0-04dfbbd746da.
taskExecution.exitCode=0

Event type=STATUS_CHANGED
time=seconds: 1712173989, nanos: 937816549
taskState=STATE_UNSPECIFIED
description=Job state is set from RUNNING to SUCCEEDED for job projects/392615380452/locations/us-south1/jobs/job-ba81bad8-82e9-4d95-8fc0-04dfbbd746da.
taskExecution.exitCode=0

What we define as execution events:

ExecutionEvent(Job state is set from QUEUED to SCHEDULED for job projects/392615380452/locations/us-south1/jobs/job-321db1bc-9a68-4171-aa2a-46885d781656.,2024-04-03T20:10:01.704137839Z,None)
ExecutionEvent(Job state is set from SCHEDULED to RUNNING for job projects/392615380452/locations/us-south1/jobs/job-321db1bc-9a68-4171-aa2a-46885d781656.,2024-04-03T20:11:30.631264449Z,None)
ExecutionEvent(Job state is set from RUNNING to SUCCEEDED for job projects/392615380452/locations/us-south1/jobs/job-321db1bc-9a68-4171-aa2a-46885d781656.,2024-04-03T20:12:16.898798407Z,None)
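
For reference, a minimal sketch of the kind of mapping done in BatchRequestExecutor#getEventList, assuming cromwell.core.ExecutionEvent(name, offsetDateTime, grouping); this is not the PR's exact code:

```scala
import java.time.{Instant, ZoneOffset}
import com.google.cloud.batch.v1.StatusEvent
import cromwell.core.ExecutionEvent

// The event description becomes the ExecutionEvent name and the protobuf
// timestamp becomes its time; no grouping is derived from Batch events.
def toExecutionEvent(event: StatusEvent): ExecutionEvent = {
  val ts = event.getEventTime
  val time = Instant.ofEpochSecond(ts.getSeconds, ts.getNanos.toLong).atOffset(ZoneOffset.UTC)
  ExecutionEvent(name = event.getDescription, offsetDateTime = time, grouping = None)
}
```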

Load test results

We have executed many load tests; this is the latest one, involving 14k jobs.

| Data / Backend | Batch with MySQL | PAPIv2 with MySQL |
| --- | --- | --- |
| Jobs | 14400 | 14400 |
| Execution time | 20936 seconds | 24451 seconds |

Overall, all our tests indicate that Batch finishes executing the jobs faster than PAPIv2 (about 14% faster in this run).

Load test settings

We ran Cromwell in server mode with the following settings:

  • request-timeout: 10m
  • idle-timeout: 10m
  • job-rate-control: jobs = 20, per = 10 seconds
  • max-workflow-launch-count: 50
  • new-workflow-poll-rate: 1
  • database: MySQL
  • virtual-private-cloud setup
  • maximum-polling-interval: 600s
  • localization-attempts: 3
  • google.auth: service account
  • request-workers: 3
  • concurrent-job-limit: 14400

JVM Options:

  • -Xms512m -Xmx64g

NOTE: Initially we found a bottleneck on Batch, but Google enabled an experimental setting to schedule many jobs concurrently, which reduced the total execution time.

Server capacity (from Google Cloud):

  • VM Machine Type: n2-standard-16
  • Virtual CPUs: 16
  • Memory: 64G
  • Architecture: x86/64
  • CPU Platform: Intel Cascade Lake

Commits:

  • BatchApiRequestWorkerSpec works! The goal is to port the PAPIv2 request manager behavior into GCP Batch.
  • We just need to rename the methods to remove the PAPI names.
  • Now, we just need to fix the runtime errors, wiring the correct messages, etc.
  • The queue must be cleared after executing the requests.
  • The abort request handler was returning the wrong message; also, keep the old behavior where an abort request is handled only once.
  • This grabs many details from PAPIv2; work is still pending on the tests. NOTE: This could break the current integration.
@aednichols (Collaborator)

I hope to circle back to this soon, but in the meantime it looks like we picked up some compilation issues from merging all the other PRs.

@AlexITC (Collaborator, Author) commented May 15, 2024

I should be able to solve these before you wake up, thanks!

@AlexITC (Collaborator, Author) commented May 16, 2024

I have resolved the problems; it is also worth executing the new tests from #7440.

@aednichols (Collaborator)

I need to put this down for the moment to finish #7439, which is currently affecting users; I hope to get back to it tomorrow morning.

Commit: This allows us to handle the abort result instead of blindly marking the job as aborted.
@AlexITC (Collaborator, Author) commented May 21, 2024

Good news: I was able to fix StandardAsyncExecutionActor#requestsAbortAndDiesImmediately=false.

It turns out that when this flag is true, Cromwell blindly marks the job as aborted; now, Cromwell waits until the abort request has executed and the job can no longer be retrieved from GCP.
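
The shape of that wait, as a rough sketch (a hypothetical helper, not the PR's code): poll until GCP no longer returns the job, rather than assuming the abort succeeded.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._

// fetchJob should return None once GCP no longer knows the job.
@tailrec
def awaitJobGone(fetchJob: () => Option[String], attemptsLeft: Int, pause: FiniteDuration): Boolean =
  fetchJob() match {
    case None => true // the job can no longer be retrieved: abort completed
    case Some(_) if attemptsLeft <= 0 => false // give up after the retry budget
    case Some(_) =>
      Thread.sleep(pause.toMillis)
      awaitJobGone(fetchJob, attemptsLeft - 1, pause)
  }
```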

@aednichols (Collaborator)

Awesome, will have time to look today!

Commits:

  • Now, we look for the submit requests before sending an abort request, canceling jobs that were never submitted to GCP.
  • Turns out that this is now unnecessary because we are already mapping query errors to a RunStatus.
  • When aborting an individual job, only that job is aborted instead of all the jobs from that workflow.
case GcpBatchBackendSingletonActor.Event.JobSubmitted(job) =>
  log.info(s"Job submitted to GCP: ${job.getName}")
case job: StandardAsyncJob =>
  log.info(s"A job was submitted successfully: ${job.jobId}")
Collaborator:

I have no problem merging as-is with these logs in place, seeing as we'll probably have some more rounds of debugging. That said, we'll probably want to reduce the info ones eventually.

Collaborator Author:

For the time being, I have marked the noisy logs as debug and kept the rest as info; it will help with debugging some of the currently open issues.

}
// If error code 10, add some extra messaging to the server logging
else if (runStatus.errorCode.getCode.value() == BatchMysteriouslyCrashedErrorCode) {
  jobLogger.info(s"Job Failed with Error Code 10 for a machine where Preemptible is set to $preemptible")
Collaborator:

Does Batch use the same range of error codes? It looks like the enum com.google.cloud.batch.v1.JobStatus.State only goes up to 6.

@@ -1179,7 +1383,7 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
}
}

- // No need for Cromwell-performed localization in the PAPI backend, ad hoc values are localized directly from GCS to the VM by PAPI.
+ // No need for Cromwell-performed localization in the Batch backend, ad hoc values are localized directly from GCS to the VM by Batch.
Collaborator:

I'm so excited about this 🚀

Collaborator Author:

Be aware that I did not do anything here; it's the code/comment ported from PAPI.

@@ -1138,6 +1363,9 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
batchOutputs collectFirst {
  case batchOutput if batchOutput.name == makeSafeReferenceName(path) =>
    val pathAsString = batchOutput.cloudPath.pathAsString

    // TODO: batchOutput.cloudPath.exists invokes GCP, which causes a test ported from papi-common to fail
    // because GCP is not configured in tests; shall we do anything?
    if (batchOutput.isFileParameter && !batchOutput.cloudPath.exists) {
Collaborator:

Let's leave it as a TODO for now. It seems awkward to use a throw to signal "it's OK to do nothing".


case apiQuery: BatchApiRequest =>
  log.debug("Forwarding API query to Batch request manager actor")
  jesApiQueryManager.forward(apiQuery)
Collaborator:

Legacy naming: not a super high priority, but "JES" stands for "Job Execution Service" and is an ancestor of Batch.

JES -> Pipelines API v1 -> Pipelines API v2alpha -> Pipelines API v2beta -> Genomics -> Life Sciences -> Batch

Collaborator Author:

I have resolved this.

import scala.util.Try

// Mirrors com.google.api.client.googleapis.batch.BatchRequest but this is immutable
class GcpBatchGroupedRequests(requests: List[(BatchApiRequest, Promise[Try[BatchApiResponse]])]) {
Collaborator:

Is there still value in grouping them in that case, or should we just fire them off as they come in?

Commits:

  • Turns out that this is not the correct fix.
  • Switch the noisy logs to debug level.
  • Remove status codes ported from PAPI because they are not useful in Batch.
  • Remove all the test cases involving the PAPI codes.
  • Clean RunStatus of the unused args.
  • Rename JES occurrences to Batch.
@AlexITC (Collaborator, Author) commented May 25, 2024

Due to the activity noise, the comments above are hidden, so I'll post here for better visibility.

Request grouping

Originally, this was created because we hoped Google offered an equivalent of batched requests for the Batch API; Google has since confirmed that there is no way to do that.

These are some notes from our internal discussions:

  1. The code becomes much simpler if this grouping is removed.
  2. We have not checked the potential implications of creating a Batch client for every request versus reusing the same client for the application's lifecycle.
  3. Grouping requests could eventually allow us to implement streaming (e.g. fs2/akka-streams), which would let us throttle the requests; still, if Cromwell already does this in another layer, it becomes unnecessary.

Given that the current code has been tested so many times, my suggestion is to keep the grouping and potentially remove it in a later iteration.
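
For context, the immutable-grouping shape looks roughly like this. It's a sketch with placeholder request/response types standing in for the PR's BatchApiRequest/BatchApiResponse, not the actual GcpBatchGroupedRequests code:

```scala
import scala.concurrent.{Future, Promise}
import scala.util.Try

trait BatchApiRequest  // placeholder for the PR's request type
trait BatchApiResponse // placeholder for the PR's response type

// Enqueueing returns a new immutable group plus a future the caller can
// wait on; the worker later completes each promise when the group executes.
final class GroupedRequests private (entries: List[(BatchApiRequest, Promise[Try[BatchApiResponse]])]) {
  def enqueue(request: BatchApiRequest): (GroupedRequests, Future[Try[BatchApiResponse]]) = {
    val promise = Promise[Try[BatchApiResponse]]()
    (new GroupedRequests((request, promise) :: entries), promise.future)
  }
  def size: Int = entries.size
}

object GroupedRequests {
  val empty: GroupedRequests = new GroupedRequests(Nil)
}
```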

Error codes

Google has confirmed that there are more error codes than the gRPC response provides; these can be found in the job events, so they need to be parsed from the strings (PAPI does something similar). That has not been done in this PR, which is why I removed a lot of code that is not necessary.

In a follow-up PR, we should implement part of this in order to handle preemption errors.

See https://cloud.google.com/batch/docs/troubleshooting#reserved-exit-codes
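
Purely as an illustration of the string-parsing approach (the exact wording of Batch event descriptions is an assumption here, not GCP's documented format):

```scala
object BatchExitCodes {
  // The pattern assumes the event description mentions "exit code NNNNN";
  // adjust once the real message format is confirmed. Reserved codes are
  // five-digit values (e.g. 50001 for Spot VM preemption, per the page above).
  private val ExitCodePattern = """exit code[:\s]+(\d{5})""".r.unanchored

  def reservedExitCode(description: String): Option[Int] =
    description match {
      case ExitCodePattern(code) => Some(code.toInt)
      case _ => None
    }
}
```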

Thanks.

@AlexITC merged commit 1515aa8 into develop on May 29, 2024
37 checks passed
@AlexITC deleted the gcp-batch-request-manager-refactor-v2 branch May 29, 2024 21:29