You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We randomly receive PKIX and RPC deadline errors on a small subset of GCP LS API and Batch runs, which always crash the main Cromwell java process. We have added local certs to our java install launching cromwell - this has reduced, but not eliminated PKIX errors. We have no remedy for these issues at the moment besides submitting again. Is there anything we can do as users to make cromwell more fault-tolerant of cloud-related issues like these?
PKIX error example:
Failed to query job status (projects/XXXXX/jobs/job-7638b0fb-XXXX-XXXX-81ca-9f609da4c664) from GCP
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1277)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:427)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.$anonfun$query$1(GcpBatchApiRequestHandler.scala:15)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.withClient(GcpBatchApiRequestHandler.scala:29)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.query(GcpBatchApiRequestHandler.scala:14)
at cromwell.backend.google.batch.actors.GcpBatchBackendSingletonActor$$anonfun$normalReceive$1.$anonfun$applyOrElse$3(GcpBatchBackendSingletonActor.scala:80)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at io.grpc.Status.asRuntimeException(Status.java:539)
... 14 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.handshakeException(ReferenceCountedOpenSslEngine.java:1907)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.wrap(ReferenceCountedOpenSslEngine.java:834)
at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:564)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1041)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrapNonAppData(SslHandler.java:927)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1409)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrapNonAppData(SslHandler.java:1327)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.access$1800(SslHandler.java:169)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.resumeOnEventExecutor(SslHandler.java:1718)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.access$2000(SslHandler.java:1609)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner$2.run(SslHandler.java:1770)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:388)
at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:271)
at java.base/sun.security.validator.Validator.validate(Validator.java:256)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:284)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:144)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:234)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:779)
at io.grpc.netty.shaded.io.netty.internal.tcnative.CertificateVerifierTask.runTask(CertificateVerifierTask.java:36)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:48)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:42)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.runAndResetNeedTask(ReferenceCountedOpenSslEngine.java:1496)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.access$700(ReferenceCountedOpenSslEngine.java:94)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine$TaskDecorator.run(ReferenceCountedOpenSslEngine.java:1471)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.run(SslHandler.java:1787)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
... 1 common frames omitted
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:383)
... 16 common frames omitted
Deadline RPC error example:
Failed to submit job (job-2fd08385-36fe-XXXX-XXXX-cf9e03ae29fe) to GCP, workflowId = 6fd00a65-XXXX-XXXX-84ac-2c0030ba224c
com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.
WorkflowManagerActor: Workflow 415b5011-XXXX-XXXX-953b-68923c164eb0 failed (during ExecutingWorkflowState): cromwell.core.CromwellFatalException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.
The text was updated successfully, but these errors were encountered:
We randomly receive PKIX and RPC deadline errors on a small subset of GCP LS API and Batch runs, which always crash the main Cromwell java process. We have added local certs to our java install launching cromwell - this has reduced, but not eliminated PKIX errors. We have no remedy for these issues at the moment besides submitting again. Is there anything we can do as users to make cromwell more fault-tolerant of cloud-related issues like these?
PKIX error example:
Failed to query job status (projects/XXXXX/jobs/job-7638b0fb-XXXX-XXXX-81ca-9f609da4c664) from GCP
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1277)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:427)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.$anonfun$query$1(GcpBatchApiRequestHandler.scala:15)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.withClient(GcpBatchApiRequestHandler.scala:29)
at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.query(GcpBatchApiRequestHandler.scala:14)
at cromwell.backend.google.batch.actors.GcpBatchBackendSingletonActor$$anonfun$normalReceive$1.$anonfun$applyOrElse$3(GcpBatchBackendSingletonActor.scala:80)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at io.grpc.Status.asRuntimeException(Status.java:539)
... 14 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.handshakeException(ReferenceCountedOpenSslEngine.java:1907)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.wrap(ReferenceCountedOpenSslEngine.java:834)
at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:564)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1041)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.wrapNonAppData(SslHandler.java:927)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1409)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrapNonAppData(SslHandler.java:1327)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.access$1800(SslHandler.java:169)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.resumeOnEventExecutor(SslHandler.java:1718)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.access$2000(SslHandler.java:1609)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner$2.run(SslHandler.java:1770)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:388)
at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:271)
at java.base/sun.security.validator.Validator.validate(Validator.java:256)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:284)
at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:144)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:234)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:779)
at io.grpc.netty.shaded.io.netty.internal.tcnative.CertificateVerifierTask.runTask(CertificateVerifierTask.java:36)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:48)
at io.grpc.netty.shaded.io.netty.internal.tcnative.SSLTask.run(SSLTask.java:42)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.runAndResetNeedTask(ReferenceCountedOpenSslEngine.java:1496)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.access$700(ReferenceCountedOpenSslEngine.java:94)
at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine$TaskDecorator.run(ReferenceCountedOpenSslEngine.java:1471)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslTasksRunner.run(SslHandler.java:1787)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
... 1 common frames omitted
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:383)
... 16 common frames omitted
Deadline RPC error example:
Failed to submit job (job-2fd08385-36fe-XXXX-XXXX-cf9e03ae29fe) to GCP, workflowId = 6fd00a65-XXXX-XXXX-84ac-2c0030ba224c
com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.
WorkflowManagerActor: Workflow 415b5011-XXXX-XXXX-953b-68923c164eb0 failed (during ExecutingWorkflowState): cromwell.core.CromwellFatalException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.
The text was updated successfully, but these errors were encountered: