Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methods to Improve handling of GCP backend timeouts #7551

Open
osha52101 opened this issue Sep 19, 2024 · 0 comments
Open

Methods to Improve handling of GCP backend timeouts #7551

osha52101 opened this issue Sep 19, 2024 · 0 comments

Comments

@osha52101
Copy link

We randomly receive PKIX and RPC deadline errors on a small subset of GCP LS API and Batch runs, which always crash the main Cromwell java process. We have added local certs to our java install launching cromwell - this has reduced, but not eliminated PKIX errors. We have no remedy for these issues at the moment besides submitting again. Is there anything we can do as users to make cromwell more fault-tolerant of cloud-related issues like these?

PKIX error example:

Failed to query job status (projects/XXXXX/jobs/job-7638b0fb-XXXX-XXXX-81ca-9f609da4c664) from GCP
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1277)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:427)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.$anonfun$query$1(GcpBatchApiRequestHandler.scala:15)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.withClient(GcpBatchApiRequestHandler.scala:29)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.query(GcpBatchApiRequestHandler.scala:14)
at cromwell.backend.google.batch.actors.GcpBatchBackendSingletonActor$$anonfun$normalReceive$1.$anonfun$applyOrElse$3(GcpBatchBackendSingletonActor.scala:80)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at io.grpc.Status.asRuntimeException(Status.java:539)
... 14 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.handshakeException(ReferenceCountedOpenSslEngine.java:1907)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.wrap(ReferenceCountedOpenSslEngine.java:834)
at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:564)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1041)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrapNonAppData(SslHandler.java:927)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1409)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrapNonAppData(SslHandler.java:1327)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.access$1800(SslHandler.java:169)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.resumeOnEventExecutor(SslHandler.java:1718)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.access$2000(SslHandler.java:1609)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner$2.run(SslHandler.java:1770)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:388)
at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:271)
at java.base/sun.security.validator.Validator.validate(Validator.java:256)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:284)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:144)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:234)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:779)
at io.grpc.netty.shaded.io.netty.internal.tcnative.CertificateVerifierTask.runTask(CertificateVerifierTask.java:36)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:48)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:42)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.runAndResetNeedTask(ReferenceCountedOpenSslEngine.java:1496)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.access$700(ReferenceCountedOpenSslEngine.java:94)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine$TaskDecorator.run(ReferenceCountedOpenSslEngine.java:1471)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.run(SslHandler.java:1787)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
... 1 common frames omitted
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:383)
... 16 common frames omitted

Deadline RPC error example:

Failed to submit job (job-2fd08385-36fe-XXXX-XXXX-cf9e03ae29fe) to GCP, workflowId = 6fd00a65-XXXX-XXXX-84ac-2c0030ba224c
com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.

WorkflowManagerActor: Workflow 415b5011-XXXX-XXXX-953b-68923c164eb0 failed (during ExecutingWorkflowState): cromwell.core.CromwellFatalException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant