Assertion failures cause yaml REST tests to timeout instead of displaying error #31834
Pinging @elastic/es-core-infra
heya @polyfractal thanks for raising this. That type of error from the low-level client, "listener timeout after waiting for [30000] ms", has to do with the maxRetryTimeout setting. The idea behind max retry timeout is to honour some timeout throughout multiple retries, while socket and connect timeout control the low-level timeouts for each specific retry. I am not too happy about this mechanism: for instance, we may not want to apply it to the first attempt of a request, and most likely we should change the default values, given that setting socket timeout and max retry timeout to the same value will probably make the listener timeout trip first, which is a confusing message. See also #25951 for a related issue. In this specific case, I would rather see a socket timeout due to the node dying while executing the request. Instead, the sync listener fails first as it reaches the max retry timeout. Things would not stall as much if we lowered all these timeouts, but they were set to higher values because tests were running on slow machines and timeouts were happening too often. I am not sure what we can do here besides working on the improvement that I mentioned above. Ideas?
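For reference, here is a minimal sketch (values are illustrative, not the defaults) of how the three timeouts were configured on the 6.x low-level client; with socket timeout and max retry timeout set to the same value, the listener ("retry") timeout tends to trip first, as described above:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class TimeoutConfigSketch {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
                // per-attempt timeouts, enforced by the underlying Apache HTTP client
                .setRequestConfigCallback(b -> b
                        .setConnectTimeout(5_000)
                        .setSocketTimeout(30_000))
                // overall budget across all retries (6.x only, removed in 7.0);
                // equal to the socket timeout here, so the listener timeout
                // tends to trip first
                .setMaxRetryTimeoutMillis(30_000)
                .build();
    }
}
```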
I agree, it seems like the maxRetryTimeout shouldn't trip before the original request's timeout, or it should only start counting after the first request fails and the retry process actually starts. That would probably help here a bit, since at least you'd know the node died for some reason. I guess when a node dies from an AssertionError, the response is never sent back to the client, so there's no way to report the assertion error? I'm not really familiar with how the REST test infra works... but is it possible for whatever spawned the REST cluster to report on the node's death? The timeout isn't really the issue, it's losing the assertion message. Edit: Never mind, I ran into this on a different PR (unlucky me :) ), and sometimes the first request timeout manages to fire first, which returns the assertion error as the node shuts down.
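A hypothetical sketch of the semantics suggested here (not the actual RestClient internals): the retry budget starts only once the first attempt has failed, so a slow-but-healthy first request is governed solely by the per-attempt socket timeout.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, named for illustration only: the max retry
// timeout clock starts at the first failure rather than at submission.
class RetryBudget {
    private final long maxRetryTimeoutMillis;
    private long deadlineNanos = -1; // unset until the first failure

    RetryBudget(long maxRetryTimeoutMillis) {
        this.maxRetryTimeoutMillis = maxRetryTimeoutMillis;
    }

    void onAttemptFailed() {
        if (deadlineNanos < 0) { // the first failure starts the clock
            deadlineNanos = System.nanoTime()
                    + TimeUnit.MILLISECONDS.toNanos(maxRetryTimeoutMillis);
        }
    }

    boolean mayRetry() {
        return deadlineNanos < 0 || System.nanoTime() < deadlineNanos;
    }
}
```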
Can you point me to an example of this scenario?
I'm working on improvements to how we start clusters during the build as part of #30904. On the other hand, it would be preferable not to hang waiting for some timeout if the process dies.
@atorok that sounds great :)
@javanna I went back to collect logs/reproduction, and I'm still not entirely sure what's going on. The assertion failure was on this PR: #31037. Specifically, this commit (b501da4) would break with the REST tests:

Looking at the logs, what happens is that the bad test breaks with a node disconnection:

No assertion message, but the disconnect is at least somewhat helpful for figuring out what's going on. But it gets drowned out and is hard to spot because the rest of the tests stall and then time out:

Running just the troublesome tests yields the assertion error, but I think that's just because the node's logs are printed immediately after the test and aren't swamped by the rest of the timeouts?

So maybe the information was there all along, and I just didn't notice it because of all the timeouts from the rest of the test suite? Not sure. This is a different assertion failure from the one that prompted the issue, but it's probably the same situation.
That's what I thought, @polyfractal, thanks for clarifying. The assertion can't be returned to the client, as it makes the server die. It would be funny if the server replied to the client saying "goodbye, I am dying, with dignity though". I think we need to improve the maxRetryTimeout mechanism to not trip when trying the first node, though I am not sure it will improve this specific problem; you would probably just get a different timeout (socket timeout). The timeout is set pretty high though, and that is what makes things stall. Not sure what to do about that.
Haha :D
I agree, I'm not sure that would help here... but it does seem like a useful change in general. Maybe my problem is unfixable, although it sounds like the work @atorok is doing may help from a different direction.
We have had various reports of problems caused by the maxRetryTimeout setting in the low-level REST client. The setting was initially added in an attempt to not have requests go through retries if the request had already taken longer than the provided timeout. The implementation was problematic though, as the timeout would also expire on the first request attempt (see #31834), would leave the request executing after expiration, causing memory leaks (see #33342), and would not take into account the http client's internal queuing (see #25951). Given all these issues, it seems that this custom timeout mechanism gives little benefit while causing a lot of harm. We should rather rely on the connect and socket timeouts exposed by the underlying http client and accept that a request can overall take longer than the configured timeout, which is the case even with a single retry anyway. This commit removes the `maxRetryTimeout` setting and all of its usages.
We deprecated the max retry timeout in 6.x, increased its default value to 90 seconds (#38425), and removed it in 7.0 (#38085).
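For completeness, a sketch of what the configuration looks like after the removal (values illustrative): only the per-attempt connect and socket timeouts from the underlying Apache HTTP client remain, and a request may legitimately take longer overall once retries occur.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class PostRemovalConfigSketch {
    public static RestClient build() {
        // 7.0+: no retry-wide budget; each attempt is bounded
        // only by the connect and socket timeouts below
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
                .setRequestConfigCallback(b -> b
                        .setConnectTimeout(5_000)
                        .setSocketTimeout(60_000))
                .build();
    }
}
```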
This came up in a different context when looking at a test failure with @not-napoleon.
Closing in favor of #46379
It appears that the yaml REST tests run with assertions enabled, but if an assertion is tripped it isn't reported to the user. This causes the REST test to stall for 30s before failing with a generic timeout, along with errors about lingering tasks.
I'm afraid I don't know enough about the test infra to know how to fix this, or whether it's fixable. But it was very time-consuming to hunt down the cause: timeouts make you suspect faulty async logic or something similar, not a tripped assertion. If someone points me in the right direction I can take a crack at fixing this, if it is indeed fixable.
Longer version
The timeout error looks like this. You receive a null body back, the test framework complains about the request task still running, and the listener times out after 30s:
After some debugging, I realized I was tripping this assertion. If I comment out that assertion and rerun the test, I get the auth denial exception thrown by Security:
So it's definitely the assertion that's causing the timeout problem (aside from the actual issue that caused the assertion to be tripped).
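To make the failure mode concrete, here is a minimal, hypothetical example (not the actual Elasticsearch assertion) of why the client only ever sees a timeout: a tripped `assert` unwinds the handling thread before any response is written.

```java
// Run with assertions enabled, as the test clusters do: java -ea AssertionDemo
public class AssertionDemo {

    // Hypothetical handler; the invariant stands in for whatever
    // server-side assertion the PR actually tripped.
    static void handleRequest(int shardCount) {
        assert shardCount >= 0 : "negative shard count: " + shardCount;
        System.out.println("response sent"); // never reached when the assert trips
    }

    public static void main(String[] args) {
        handleRequest(-1); // AssertionError unwinds the thread here, so no
                           // HTTP response is ever written back to the client
    }
}
```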