
Assertion failures cause yaml REST tests to timeout instead of displaying error #31834

Closed
polyfractal opened this issue Jul 5, 2018 · 11 comments
Labels: >bug, :Delivery/Build, Team:Delivery

@polyfractal (Contributor)

It appears that the yaml REST tests run with assertions enabled, but if an assertion is tripped it isn't reported to the user. This causes the REST test to stall for 30s before failing with a generic timeout, and errors about lingering tasks.

I'm afraid I don't know enough about the test infra to know how to fix this, or whether it's fixable. But it was very time-consuming to hunt down the cause: timeouts make you suspect faulty async logic or something similar, not a tripped assertion. If someone points me in the right direction I can take a crack at fixing this, if it is indeed fixable.

Longer version
The timeout error looks like this. You receive a null body back, the test framework complains about the request task still running and the listener times out after 30s:

HEARTBEAT J0 PID(10689@desktop): 2018-07-03T17:44:26, stalled for 31.7s at: XPackRestIT.test {p0=rollup/get_rollup_index_caps/Verify one job caps by rollup index}
  1> [2018-07-04T05:44:27,179][INFO ][o.e.x.t.r.XPackRestIT    ] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : null
  1>   }
  1> }]
HEARTBEAT J0 PID(10689@desktop): 2018-07-03T17:44:36, stalled for 41.7s at: XPackRestIT.test {p0=rollup/get_rollup_index_caps/Verify one job caps by rollup index}
  1> [2018-07-04T05:44:37,858][INFO ][o.e.x.t.r.XPackRestIT    ] There are still tasks running after this test that might break subsequent tests [indices:data/read/xpack/rollup/get/index/caps].
  1> [2018-07-04T05:44:37,893][INFO ][o.e.x.t.r.XPackRestIT    ] [test {p0=rollup/get_rollup_index_caps/Verify one job caps by rollup index}]: after test
  2> REPRODUCE WITH: ./gradlew :x-pack:plugin:integTestRunner -Dtests.seed=90C0204471BA5F00 -Dtests.class=org.elasticsearch.xpack.test.rest.XPackRestIT -Dtests.method="test {p0=rollup/get_rollup_index_caps/Verify one job caps by rollup index}" -Dtests.security.manager=true -Dtests.locale=de-CH -Dtests.timezone=Asia/Irkutsk -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*
  2> REPRODUCE WITH: ./gradlew :x-pack:plugin:integTestRunner -Dtests.seed=90C0204471BA5F00 -Dtests.class=org.elasticsearch.xpack.test.rest.XPackRestIT -Dtests.method="test {p0=rollup/get_rollup_index_caps/Verify one job caps by rollup index}" -Dtests.security.manager=true -Dtests.locale=de-CH -Dtests.timezone=Asia/Irkutsk -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*
ERROR   43.3s | XPackRestIT.test {p0=rollup/get_rollup_index_caps/Verify one job caps by rollup index} <<< FAILURES!
   > Throwable #1: java.lang.RuntimeException: Failure at [rollup/get_rollup_index_caps:65]: listener timeout after waiting for [30000] ms
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:396)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:371)
   >    at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.io.IOException: listener timeout after waiting for [30000] ms
   >    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:899)
   >    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227)
   >    at org.elasticsearch.test.rest.yaml.ClientYamlTestClient.callApi(ClientYamlTestClient.java:192)
   >    at org.elasticsearch.test.rest.yaml.ClientYamlTestExecutionContext.callApiInternal(ClientYamlTestExecutionContext.java:168)
   >    at org.elasticsearch.test.rest.yaml.ClientYamlTestExecutionContext.callApi(ClientYamlTestExecutionContext.java:100)
   >    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:243)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:387)
   >    ... 38 more
   > Throwable #2: java.lang.AssertionError: 1 active tasks found:
   > indices:data/read/xpack/rollup/get/index/caps GvLOpehATvWmUC_M7s1A0A:46 -                         transport 1530654237179 17:43:57 40.2s       127.0.0.1 node-0 
   >  expected:<0> but was:<1>
   >    at org.elasticsearch.xpack.test.rest.XPackRestTestHelper.lambda$waitForPendingTasks$2(XPackRestTestHelper.java:110)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:767)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:741)
   >    at org.elasticsearch.xpack.test.rest.XPackRestTestHelper.waitForPendingTasks(XPackRestTestHelper.java:90)
   >    at org.elasticsearch.xpack.test.rest.XPackRestIT.cleanup(XPackRestIT.java:246)
   >    at java.lang.Thread.run(Thread.java:748)
   >    Suppressed: java.lang.AssertionError: 1 active tasks found:
   > indices:data/read/xpack/rollup/get/index/caps GvLOpehATvWmUC_M7s1A0A:46 -                         transport 1530654237179 17:43:57 30s         127.0.0.1 node-0 
   >  expected:<0> but was:<1>
   >            at org.elasticsearch.xpack.test.rest.XPackRestTestHelper.lambda$waitForPendingTasks$2(XPackRestTestHelper.java:110)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
   >            ... 38 more

After some debugging, I realized I was tripping this assertion. If I comment out that assertion and rerun the test, I get the auth denial exception thrown by Security:

 1> [2018-07-04T06:25:22,987][INFO ][o.e.x.t.r.XPackRestIT    ] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "security_exception",
  1>             "reason" : "action [indices:data/read/xpack/rollup/get/index/caps] is unauthorized for user [x_pack_rest_user]",
  1>             "stack_trace" : "ElasticsearchSecurityException[action [indices:data/read/xpack/rollup/get/index/caps] is unauthorized for user [x_pack_rest_user]]
  1>    at org.elasticsearch.xpack.core.security.support.Exceptions.authorizationError(Exceptions.java:30)
  1>    at org.elasticsearch.xpack.security.authz.AuthorizationService.denialException(AuthorizationService.java:574)
  1>    at org.elasticsearch.xpack.security.authz.AuthorizationService.denial(AuthorizationService.java:552)
  1>    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:260)
  1>    at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:171)
  1>    at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.maybeRun(AuthorizationUtils.java:173)
  1>    at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.setRunAsRoles(AuthorizationUtils.java:167)
  1>    at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.authorize(AuthorizationUtils.java:155)
  1>    at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.authorizeRequest(SecurityActionFilter.java:183)
  1>    at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$3(SecurityActionFilter.java:161)
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
...
...

So it's definitely the assertion that's causing the timeout problem (aside from the actual issue that caused the assertion to be tripped).
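To make the failure mode concrete, here's a minimal stand-alone sketch (illustrative only, not Elasticsearch code): with assertions enabled via -ea, a tripped assert throws an AssertionError that nothing catches, so the response is never written back and the client only ever sees its own timeout. The ResponseChannel interface and message string are just stand-ins.

// Illustrative sketch only (not Elasticsearch code): run with -ea and the
// assert trips, the AssertionError is never caught, and sendResponse is
// never reached -- the client just waits until its own timeout fires.
public class AssertionTimeoutSketch {

    interface ResponseChannel {          // hypothetical stand-in for the transport channel
        void sendResponse(String body);
    }

    static void handleRequest(ResponseChannel channel, int targetValueType) {
        assert targetValueType >= 0 : "Wrong read constructor called for subclass that serializes its targetValueType";
        channel.sendResponse("ok");      // unreachable when the assert trips
    }

    public static void main(String[] args) {
        handleRequest(body -> System.out.println(body), -1);
    }
}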

@polyfractal added the >bug and :Delivery/Build labels on Jul 5, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@javanna (Member) commented Jul 6, 2018

heya @polyfractal thanks for raising this. That type of error from the low-level client, "listener timeout after waiting for [30000] ms", has to do with the maxRetryTimeout setting. The idea behind max retry timeout is to honour an overall timeout across multiple retries, while the socket and connect timeouts let you control the low-level timeouts for each individual retry.

I am not too happy about this mechanism: for instance, we may not want to apply it to the first attempt of a request, and most likely we should change the default values, given that setting the socket timeout and the max retry timeout to the same value will probably make the listener timeout trip first, which is a confusing message. See also #25951 for a related issue.

In this specific case, I would rather see a socket timeout due to the node dying while executing the request. Instead, the sync listener fails first as it reaches the max retry timeout. Things would not stall as long if we lowered all these timeouts, but they were set to higher values because tests were running on slow machines and timeouts were happening too often. I am not sure what we can do here besides working on the improvement mentioned above. Ideas?
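For reference, a rough sketch of how those three timeouts relate on the 6.x low-level RestClient builder (the values here are only illustrative, not the test framework's actual settings):

// Rough sketch of the three timeouts on the 6.x low-level RestClient
// (illustrative values, not the test framework's actual settings).
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class TimeoutConfigSketch {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
            .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)      // per attempt: establishing the connection
                .setSocketTimeout(30_000))     // per attempt: waiting for data on the socket
            .setMaxRetryTimeoutMillis(30_000)  // across all retries; equal to the socket timeout,
                                               // so the sync listener timeout tends to trip first
            .build();
    }
}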

@polyfractal (Contributor, Author) commented Jul 6, 2018

I agree, it seems like the maxRetryTimeout shouldn't trip before the original request's timeout, or should only start counting after the first request fails and the retry process actually starts. That would probably help here a bit, since at least you'd know the node died for some reason.

I guess when a node dies from an AssertionError, the response is never sent back to the client, so there's no way to report the assertion error?

I'm not really familiar with how the REST test infra works... but is it possible for whatever spawned the REST cluster to report on the node's death? The timeout isn't really the issue, it's losing the assertion message.

Edit: Nevermind, I ran into this on a different PR (unlucky me :) ), and sometimes the first request timeout manages to fire first, which returns the assertion error as the node shuts down.

@javanna (Member) commented Jul 9, 2018

Edit: Nevermind, I ran into this on a different PR (unlucky me :) ), and sometimes the first request timeout manages to fire first, which returns the assertion error as the node shuts down.

Can you point me to an example of this scenario?

@alpar-t (Contributor) commented Jul 9, 2018

I'm working on improvements to how we start clusters during the build as part of #30904.
Currently we do it with ant and a wrapper script, which makes it difficult to check on the started process.
The new implementation I have (no PR yet) starts a process and has a direct handle, so checking if it's still alive is practical, and I already do it when waiting for the cluster to start and when stopping it. In this case it would notice that the cluster was no longer running when we tried to stop it and relay the failure.
That might be possible with the old code as well; I'm just not sure it's still worth looking into.

On the other hand, it would be preferable not to hang waiting for some timeout if the process dies.
That might prove to be a bit trickier, as the test runner knows nothing about the clusters.
We could start a thread to monitor the process of each node in the cluster so we would detect if any one of them dies, but I'm not sure there's any way to tell Gradle that it needs to fail the build from that thread. There are no hooks for in-between task execution, so the soonest we could pick up that the process died is after the task completes, that is, after the timeout runs out.
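Roughly what I have in mind for the watcher, as a sketch only (the class name, onDeath hook, and wiring into the build are made up):

// Sketch of the node watcher idea: a daemon thread that waits on the spawned
// node process and records the failure so the build can surface it once the
// test task finishes (names and wiring here are hypothetical).
public class NodeProcessWatcher {
    public static Thread watch(Process nodeProcess, String nodeName, Runnable onDeath) {
        Thread watcher = new Thread(() -> {
            try {
                int exitCode = nodeProcess.waitFor();    // blocks until the node dies
                if (exitCode != 0) {
                    System.err.println("Node [" + nodeName + "] died with exit code " + exitCode);
                    onDeath.run();                        // e.g. flag the cluster as failed
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();       // test task finished normally
            }
        }, "node-watcher-" + nodeName);
        watcher.setDaemon(true);
        watcher.start();
        return watcher;
    }
}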

@alpar-t self-assigned this on Jul 9, 2018
@polyfractal (Contributor, Author)

@atorok that sounds great :)

Can you point me to an example of this scenario?

@javanna I went back to collect logs and a reproduction, and I'm still not entirely sure what's going on. The assertion failure was on this PR: #31037. Specifically, this commit (b501da4) would break with the REST tests:

./gradlew :qa:smoke-test-multinode:integTestRunner -Dtests.seed=1C97F75110933B51 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT

Looking at the logs, what happens is that the bad test fails with a node disconnection:

1> [2018-07-09T21:11:03,563][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "node_disconnected_exception",
  1>             "reason" : "[node-1][127.0.0.1:47113][indices:data/read/search[phase/query]] disconnected",
  1>             "stack_trace" : "NodeDisconnectedException[[node-1][127.0.0.1:47113][indices:data/read/search[phase/query]] disconnected]
  1> "
  1>           }
  1>         ],
  1>         "type" : "search_phase_execution_exception",
  1>         "reason" : "all shards failed",
  1>         "phase" : "query",
  1>         "grouped" : true,
  1>         "failed_shards" : [
  1>           {
  1>             "shard" : 0,
  1>             "index" : "test_1",
  1>             "node" : "RngMzGKJQWyctbFfnTaVCQ",
  1>             "reason" : {
  1>               "type" : "node_disconnected_exception",
  1>               "reason" : "[node-1][127.0.0.1:47113][indices:data/read/search[phase/query]] disconnected",
  1>               "stack_trace" : "NodeDisconnectedException[[node-1][127.0.0.1:47113][indices:data/read/search[phase/query]] disconnected]
  1> "
  1>             }
  1>           }
  1>         ],
  1>         "stack_trace" : "Failed to execute phase [query], all shards failed; shardFailures {[RngMzGKJQWyctbFfnTaVCQ][test_1][0]: NodeDisconnectedException[[node-1][127.0.0.1:47113][indices:data/read/search[phase/query]] disconnected]}
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:293)
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:133)
  1>    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:254)
  1>    at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:106)
  1>    at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50)
  1>    at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:277)
  1>    at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73)
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51)
  1>    at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:497)
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1069)
  1>    at org.elasticsearch.transport.TransportService.lambda$onConnectionClosed$8(TransportService.java:929)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624)
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  1>    at java.base/java.lang.Thread.run(Thread.java:844)
  1> Caused by: NodeDisconnectedException[[node-1][127.0.0.1:47113][indices:data/read/search[phase/query]] disconnected]
  1> "
  1>       },
  1>       "status" : 500
  1>     }
  1>   }
  1> }]

No assertion message, but the disconnect is at least somewhat helpful for figuring out what's going on. But it gets drowned out and is hard to spot because the rest of the tests stall and then time out:

ERROR   30.0s | SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=cluster.state/20_filtering/Filtering the cluster state by blocks should return the blocks field even if the response is empty} <<< FAILURES!
   > Throwable #1: java.lang.RuntimeException: Failure at [cluster.state/20_filtering:2]: listener timeout after waiting for [30000] ms
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:396)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:362)
   >    at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.io.IOException: listener timeout after waiting for [30000] ms
   >    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:899)
   >    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227)
   >    at org.elasticsearch.test.rest.yaml.ClientYamlTestClient.callApi(ClientYamlTestClient.java:192)
   >    at org.elasticsearch.test.rest.yaml.ClientYamlTestExecutionContext.callApiInternal(ClientYamlTestExecutionContext.java:168)
   >    at org.elasticsearch.test.rest.yaml.ClientYamlTestExecutionContext.callApi(ClientYamlTestExecutionContext.java:100)
   >    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:243)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:387)
   >    ... 37 more
   > Throwable #2: java.net.ConnectException: Connection refused
   >    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:943)
   >    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227)
   >    at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:263)
   >    at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:179)
   >    at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.net.ConnectException: Connection refused
   >    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   >    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
   >    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:171)
   >    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:145)
   >    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
   >    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
   >    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)

Running just the troublesome tests yields the assertion error, but I think that's only because the node's logs are printed immediately after the test and aren't swamped by the rest of the timeouts?

./gradlew :qa:smoke-test-multinode:integTestRunner -Dtests.seed=1C97F75110933B51 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT -Dtests.method="test {yaml=search.aggregation/260_weighted_avg/*}"
[2018-07-09T09:35:17,565][WARN ][o.e.t.n.Netty4Transport  ] [node-1] exception caught on transport layer [Netty4TcpChannel{localAddress=/127.0.0.1:36933, remoteAddress=/127.0.0.1:38138}], closing connection
java.lang.Exception: java.lang.AssertionError: Wrong read constructor called for subclass that serializes its targetValueType
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:76) [transport-netty4-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
        at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.AssertionError: Wrong read constructor called for subclass that serializes its targetValueType

So maybe the information was there all along and I just didn't notice it because of all the timeouts from the rest of the test suite? Not sure. This is a different assertion failure from the one that prompted the issue, but it's probably the same situation.

@javanna (Member) commented Jul 9, 2018

That's what I thought, @polyfractal, thanks for clarifying. The assertion can't be returned to the client, as it makes the server die. It would be funny if the server replied to the client saying "goodbye, I am dying, with dignity though". I think we need to improve the maxRetryTimeout mechanism to not trip when trying the first node, though I am not sure it will improve this specific problem, you would probably just get a different timeout (socket timeout). The timeout is set pretty high though, and that is what makes things stall, not sure what to do about that.
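For context, the node dies because an uncaught Error (including an AssertionError) is treated as fatal and the JVM is halted, so there is no point at which a response could still be sent. A simplified sketch of that pattern (not the actual ElasticsearchUncaughtExceptionHandler):

// Simplified sketch of the die-on-fatal-error pattern (not the real handler):
// an uncaught Error such as an AssertionError halts the JVM, so open client
// connections simply drop and the caller only ever sees a timeout.
public class DieOnErrorSketch {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            if (throwable instanceof Error) {
                System.err.println("fatal error on [" + thread.getName() + "], halting JVM");
                throwable.printStackTrace();
                Runtime.getRuntime().halt(1);   // node is gone; no response is ever written
            }
        });
    }
}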

@polyfractal (Contributor, Author)

"goodbye, I am dying, with dignity though"

Haha :D

I think we need to improve the maxRetryTimeout mechanism to not trip when trying the first node, though I am not sure it will improve this specific problem, you would probably just get a different timeout (socket timeout).

I agree, I'm not sure that would help here... but it does seem like a useful change in general. Maybe my problem is unfixable, although it sounds like the work @atorok is doing may help from a different direction.

javanna added a commit to javanna/elasticsearch that referenced this issue Jan 31, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. The setting was initially added
in an attempt to not have requests go through retries if the request
had already taken longer than the provided timeout.

The implementation was problematic though, as the timeout would also
expire during the first request attempt (see elastic#31834), would leave the
request executing after expiration, causing memory leaks (see elastic#33342),
and would not take into account the http client internal queuing (see
elastic#25951).

Given all these issues, my conclusion is that this custom timeout
mechanism gives little benefit while causing a lot of harm. We should
rather rely on connect and socket timeout exposed by the underlying
http client and accept that a request can overall take longer than the
configured timeout, which is the case even with a single retry anyways.

This commit removes the maxRetryTimeout setting and all of its usages.
javanna added a commit that referenced this issue Feb 6, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. The setting was initially added
in an attempt to not have requests go through retries if the request
had already taken longer than the provided timeout.

The implementation was problematic though, as the timeout would also
expire during the first request attempt (see #31834), would leave the
request executing after expiration, causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).

Given all these issues, it seems that this custom timeout mechanism
gives little benefit while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client 
and accept that a request can overall take longer than the configured 
timeout, which is the case even with a single retry anyways.

This commit removes the `maxRetryTimeout` setting and all of its usages.
@javanna (Member) commented Feb 6, 2019

I think we need to improve the maxRetryTimeout mechanism to not trip when trying the first node, though I am not sure it will improve this specific problem, you would probably just get a different timeout (socket timeout). The timeout is set pretty high though, and that is what makes things stall, not sure what to do about that.

We have deprecated max retry timeout and increased its default value to 90 seconds in 6.x (#38425), and removed it in 7.0 (#38085).
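With max retry timeout gone, configuring the client comes down to the per-attempt timeouts only; roughly like this (illustrative values, not the defaults):

// Sketch of the 7.x configuration after maxRetryTimeout was removed (#38085):
// only the per-attempt connect and socket timeouts remain, and a request is
// simply allowed to take longer overall when retries happen.
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class SevenXTimeoutSketch {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
            .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)     // illustrative values, not defaults
                .setSocketTimeout(60_000))
            .build();
    }
}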

@alpar-t (Contributor) commented Sep 5, 2019

This came up in a different context when looking at a test failure with @not-napoleon.
Tripping an assertion makes the node die, and we do a poor job of surfacing this info.
We can't really be proactive about it, since Gradle doesn't allow interrupting an executing test task (you will just see a bunch of tests failing to connect), but we should surface the fact that an assertion tripped once the test task fails.
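Something along these lines once the task has failed (just a sketch; the log location and how this hooks into the build are hypothetical):

// Sketch of the "surface it after the fact" idea: once the test task has
// failed, scan the node logs for assertion failures and print them
// prominently. The log path and wiring into the build are hypothetical.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class AssertionLogScanner {
    public static void reportTrippedAssertions(Path nodeLog) throws IOException {
        try (Stream<String> lines = Files.lines(nodeLog)) {
            lines.filter(line -> line.contains("AssertionError"))
                 .forEach(line -> System.err.println("Possible node-killing assertion: " + line));
        }
    }
}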

@alpar-t (Contributor) commented Sep 5, 2019

Closing in favor of #46379

@alpar-t closed this as completed on Oct 4, 2019
@mark-vieira added the Team:Delivery label on Nov 11, 2020