Fix REST client connection handling #19883
Currently, the gateway cannot handle long-lasting requests or connections well enough for all use cases, apparently. This will be tackled with #19883. Until then, those ITs are disabled in the QA module.
Observations
Copied from #19333 (comment)
Observations round 2
After another round of manual tests using cURL locally, I strongly assume the issue resides on the client instead of the server.
Setups
Setup A
Setup B
Setup C
Results
Results A
Results B
Results C
ZPA mob-programming session: We did some additional investigations on the two failing tests using Wireshark. We managed to capture the communication of both successful and failing runs for both tests. The successful runs were made by modifying the tests; the modifications are explained below. The captures can be found on this (internal) link. It seems that the root cause for the two failing tests is different:
Test
Test
Regarding
@tmetzke since TCP is a transport layer protocol, we can't control it from the application layer, i.e. an HTTP client. gRPC, which is built on HTTP/2, supports multiplexing and streaming, and manages to push the TCP window size to larger values more quickly. I found that we can influence the TCP window size by increasing socket buffer sizes through the following Apache HttpClient 5 configuration options:

```java
final IOReactorConfig ioReactorConfig =
    IOReactorConfig.custom() // the IOReactor manages non-blocking connections
        .setSoTimeout(Timeout.ofSeconds(30)) // overall socket timeout
        .setSndBufSize(64 * 1024) // larger send buffer size
        .setRcvBufSize(64 * 1024) // larger receive buffer size
        .setTcpNoDelay(true) // disable Nagle's algorithm for lower latency
        .build();

final HttpAsyncClientBuilder builder =
    HttpAsyncClients.custom()
        .setIOReactorConfig(ioReactorConfig);
```

However, the truncated test continues to fail. This also doesn't guarantee a stable connection, i.e. there might still be failures depending on the size of the data.
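For reference (not from the thread): the same socket options the `IOReactorConfig` exposes can be exercised on a plain `java.net.Socket` from the JDK. This is only a sketch to illustrate what each option maps to at the socket level, not the Apache client itself:

```java
import java.net.Socket;
import java.net.SocketException;

public class SocketTuningSketch {
    public static void main(String[] args) throws SocketException {
        // an unconnected socket: options can be set before connecting
        Socket socket = new Socket();
        socket.setTcpNoDelay(true);             // disables Nagle's algorithm, like setTcpNoDelay(true)
        socket.setSendBufferSize(64 * 1024);    // like setSndBufSize(64 * 1024)
        socket.setReceiveBufferSize(64 * 1024); // like setRcvBufSize(64 * 1024)
        socket.setSoTimeout(30_000);            // read timeout in ms, like setSoTimeout(Timeout.ofSeconds(30))
        System.out.println(socket.getTcpNoDelay()); // prints: true
    }
}
```

Note that the buffer sizes are only hints: the OS may round or clamp them, which is one reason tweaking them does not reliably change the observed TCP window.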
Thanks for giving this a closer look 👍

I'd need to look at the dumps, I guess; it's unclear to me. AFAIK this shouldn't be a limitation of HTTP/1.1: we should be able to send arbitrarily large responses. The only thing I could imagine is someone sending an RST due to a timeout somewhere if the read/send doesn't finish fast enough. Maybe I misunderstood the issue though 🤔

Thanks, @npepinpe, that would also tie in better with my observation that cURL is capable of handling the response from the server. It indicates a specific problem with the Apache HTTP client, not the HTTP/1.1 communication. This remains a tough nut to crack 🙈

@npepinpe the stack trace of the failure shows that a timeout is triggered. However, no configuration adjustment has impacted the timeout so far. I'll continue debugging.
So the RST likely comes from the timeout, right? The question is why the client stops acknowledging: are we overwhelming it? Maybe we are? OTOH, things I would check (and instrument with metrics):
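As a side note (my own minimal repro, not code from the project): a socket read that exceeds its SO_TIMEOUT surfaces as a `SocketTimeoutException` on the client, and tearing down a connection with unread data pending is what typically puts the RST on the wire. The timeout half can be reproduced with the JDK alone:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0); // ephemeral local port
             Socket client = new Socket("localhost", server.getLocalPort())) {
            Socket accepted = server.accept(); // server accepts but never writes
            client.setSoTimeout(200);          // read timeout: 200 ms
            try {
                client.getInputStream().read(); // blocks until the timeout fires
                System.out.println("unexpected data");
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out"); // prints: read timed out
            }
            accepted.close();
        }
    }
}
```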
@npepinpe so far I've tried the following configuration adjustments, but no changes in the failures: in `defaultClientBuilder()`:

```java
private HttpAsyncClientBuilder defaultClientBuilder() {
  ...
  final PoolingAsyncClientConnectionManager connectionManager =
      PoolingAsyncClientConnectionManagerBuilder.create()
          .setTlsStrategy(tlsStrategy)
          .setMaxConnPerRoute(10) // <-- tried increasing the max connections per route and total
          .setMaxConnTotal(20)
          .setPoolConcurrencyPolicy(PoolConcurrencyPolicy.LAX) // <-- and made the concurrency policy LAX to allow for more connections
          .setDefaultConnectionConfig( // <-- I also tried passing along a default connection config
              ConnectionConfig.custom()
                  .setSocketTimeout(Timeout.ofSeconds(40L))
                  .setConnectTimeout(Timeout.ofSeconds(40L))
                  .build())
          .build();

  // the IOReactorConfig is completely new; I added it to control the socket management of async connections
  final IOReactorConfig ioReactorConfig =
      IOReactorConfig.custom()
          .setSoTimeout(Timeout.ofSeconds(30)) // overall socket timeout
          .setSndBufSize(64 * 1024) // larger send buffer size (the idea was to influence the TCP window size)
          .setRcvBufSize(64 * 1024) // larger receive buffer size (same idea)
          .setTcpNoDelay(true) // disable Nagle's algorithm for lower latency (same idea)
          .build();

  final HttpAsyncClientBuilder builder =
      HttpAsyncClients.custom()
          .setConnectionManager(connectionManager)
          .setIOReactorConfig(ioReactorConfig) // <-- the IOReactorConfig is passed here
          .setDefaultHeaders(Collections.singletonList(acceptHeader))
          .setUserAgent("zeebe-client-java/" + VersionUtil.getVersion())
          .evictExpiredConnections()
          .setCharCodingConfig(
              CharCodingConfig.custom().setCharset(StandardCharsets.UTF_8).build())
          .evictIdleConnections(TimeValue.ofSeconds(30))
          .useSystemProperties(); // allow users to customize via system properties
  ...
}

private Builder defaultClientRequestConfigBuilder() {
  return RequestConfig.custom()
      .setResponseTimeout(Timeout.ofSeconds(30L))
      .setConnectionRequestTimeout(Timeout.ofSeconds(40L)) // <-- also tried increasing the request timeout
      ...
}
```

I think these config properties cover the first three points you listed above. I'm not sure about the last one; I'll do some research on how that can be done.
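For comparison only (this is the JDK's built-in `java.net.http` client, not the Apache client the Zeebe client uses, and no request is actually sent): the same timeout knobs exist there too, with a client-level connect timeout and a per-request timeout roughly corresponding to `setResponseTimeout`. The endpoint URI below is illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutKnobsSketch {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(40)) // like setConnectTimeout(Timeout.ofSeconds(40L))
            .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/v2/jobs/activation"))
            .timeout(Duration.ofSeconds(30))        // like setResponseTimeout(Timeout.ofSeconds(30L))
            .build();
        System.out.println(client.connectTimeout().orElseThrow()); // prints: PT40S
    }
}
```

The distinction matters when reading the stack traces: a connect timeout fires before any bytes flow, while a response/read timeout fires mid-stream and is the kind that ends in an RST.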
UPDATE:
java.util.concurrent.CompletionException: io.camunda.zeebe.client.api.command.ProblemException: Failed with code 503: 'Service Unavailable'. Details: 'class ProblemDetail {
type: about:blank
title: Service Unavailable
status: 503
detail: null
instance: /v2/jobs/activation
}'
at java.base/java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:413)
at java.base/java.util.concurrent.CompletableFuture.join(CompletableFuture.java:2118)
at io.camunda.zeebe.it.client.command.LongPollingActivateJobsTest.shouldActivateJobForOpenRequest(LongPollingActivateJobsTest.java:130)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:194)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:546)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:611)
at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:291)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:212)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:546)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:611)
at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:291)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:546)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:611)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
Caused by: io.camunda.zeebe.client.api.command.ProblemException: Failed with code 503: 'Service Unavailable'. Details: 'class ProblemDetail {
type: about:blank
title: Service Unavailable
status: 503
detail: null
instance: /v2/jobs/activation
}'
at io.camunda.zeebe.client.impl.http.ApiCallback.handleErrorResponse(ApiCallback.java:111)
at io.camunda.zeebe.client.impl.http.ApiCallback.completed(ApiCallback.java:59)
at io.camunda.zeebe.client.impl.http.ApiCallback.completed(ApiCallback.java:30)
at org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:123)
at org.apache.hc.core5.concurrent.ComplexFuture.completed(ComplexFuture.java:72)
at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2$1.completed(InternalAbstractHttpAsyncClient.java:304)
at org.apache.hc.core5.http.nio.support.AbstractAsyncResponseConsumer$1.completed(AbstractAsyncResponseConsumer.java:101)
at org.apache.hc.core5.http.nio.entity.AbstractBinAsyncEntityConsumer.completed(AbstractBinAsyncEntityConsumer.java:84)
at org.apache.hc.core5.http.nio.entity.AbstractBinDataConsumer.streamEnd(AbstractBinDataConsumer.java:81)
at org.apache.hc.core5.http.nio.support.AbstractAsyncResponseConsumer.streamEnd(AbstractAsyncResponseConsumer.java:142)
at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.streamEnd(HttpAsyncMainClientExec.java:251)
at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamHandler.dataEnd(ClientHttp1StreamHandler.java:270)
at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamDuplexer.dataEnd(ClientHttp1StreamDuplexer.java:366)
at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onInput(AbstractHttp1StreamDuplexer.java:334)
at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.inputReady(AbstractHttp1IOEventHandler.java:64)
at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.inputReady(ClientHttp1IOEventHandler.java:41)
at org.apache.hc.core5.reactor.InternalDataChannel.onIOEvent(InternalDataChannel.java:142)
at org.apache.hc.core5.reactor.InternalChannel.handleIOEvent(InternalChannel.java:51)
at org.apache.hc.core5.reactor.SingleCoreIOReactor.processEvents(SingleCoreIOReactor.java:178)
at org.apache.hc.core5.reactor.SingleCoreIOReactor.doExecute(SingleCoreIOReactor.java:127)
at org.apache.hc.core5.reactor.AbstractSingleCoreIOReactor.execute(AbstractSingleCoreIOReactor.java:86)
at org.apache.hc.core5.reactor.IOReactorWorker.run(IOReactorWorker.java:44)
at java.base/java.lang.Thread.run(Thread.java:1570)
I pushed my changes to #22302

Who is producing the 503? The server is returning this? Or is the client producing this error because the response never arrives? From the stack trace, I assume the server, right? Why are we returning 503?
The server is producing a
When debugging locally I see the following debug message from the server side:

I also see the following client-side error during the setup phase of the tests:

14:01:54.693 [] [] [httpclient-dispatch-1] ERROR org.apache.hc.client5.http.impl.async - Pool entry is not present in the set of leased entries
java.lang.IllegalStateException: Pool entry is not present in the set of leased entries
at org.apache.hc.core5.pool.StrictConnPool.release(StrictConnPool.java:268) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager.release(PoolingAsyncClientConnectionManager.java:410) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:147) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:170) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:346) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:248) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:197) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:295) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:131) ~[httpclient5-5.3.1.jar:5.3.1]
at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamHandler.failed(ClientHttp1StreamHandler.java:285) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamDuplexer.disconnected(ClientHttp1StreamDuplexer.java:220) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onDisconnect(AbstractHttp1StreamDuplexer.java:408) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.disconnected(AbstractHttp1IOEventHandler.java:95) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.disconnected(ClientHttp1IOEventHandler.java:41) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.reactor.InternalDataChannel.disconnected(InternalDataChannel.java:204) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.reactor.SingleCoreIOReactor.processClosedSessions(SingleCoreIOReactor.java:231) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.reactor.SingleCoreIOReactor.doTerminate(SingleCoreIOReactor.java:106) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.reactor.AbstractSingleCoreIOReactor.execute(AbstractSingleCoreIOReactor.java:93) ~[httpcore5-5.2.5.jar:5.2.5]
at org.apache.hc.core5.reactor.IOReactorWorker.run(IOReactorWorker.java:44) ~[httpcore5-5.2.5.jar:5.2.5]
at java.base/java.lang.Thread.run(Thread.java:1570) [?:?]
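To make the 503 shape from the earlier stack trace concrete (a hand-rolled stand-in, assuming only that the gateway returns an RFC 7807 problem document; the body and endpoint path here mirror the trace but the server is a throwaway JDK `HttpServer`, not the real gateway):

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ServiceUnavailableSketch {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/v2/jobs/activation", exchange -> {
            // minimal RFC 7807 problem document, mirroring the fields in the stack trace
            byte[] body = "{\"type\":\"about:blank\",\"title\":\"Service Unavailable\",\"status\":503,\"instance\":\"/v2/jobs/activation\"}"
                .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/problem+json");
            exchange.sendResponseHeaders(503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/v2/jobs/activation");
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).build(), HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode()); // prints: 503
        } finally {
            server.stop(0);
        }
    }
}
```

A repro like this is useful for isolating the client-side handling of the error path from whatever makes the real gateway emit the 503 in the first place.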
Until this bug is root-caused and fixed, the limitation is documented through camunda/camunda-docs#4375 |
Describe the bug
The following integration tests don't work as expected over REST:
shouldActivateJobsIfBatchIsTruncated in the LongPollingActivateJobsTest
shouldActivateJobForOpenRequest in the LongPollingActivateJobsTest

To Reproduce
Add true to the test cases in the booleans array in the ValueSource annotation, basically reverting this commit.

Expected behavior
The tests pass.
Log/Stacktrace
Environment:
Workaround
Switch to using the gRPC protocol for job activation using long polling.
Additional context
This limitation is documented in our docs with camunda/camunda-docs#4375