Memory leak in RestClient when elasticsearch takes more than 30 seconds to respond #33342
Comments
Pinging @elastic/es-core-infra
Do individual objects never get released (i.e. if you add logic to your tests that eventually stops the firing of requests in a hot loop), or is the overall number of these futures simply never going down? The former would be a problem, while the latter is somewhat expected with the way the async HTTP client works in the background (you can enqueue requests quicker than they are potentially handled by the server -> your queue grows).
The objects eventually get cleaned up if I stop sending the requests, but this behavior is not seen when there are continuous requests to Elasticsearch under load. The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis) is increased to 100 seconds, even with continuous load. I am not sure what you are referring to as the former and the latter.
Interesting, will take a look and see if I can reproduce this :)
@appadesai I can't find any different behaviour here depending on the client-side timeout (30s vs 100s). What's the Java code you use for testing here? (Maybe you're leaking resources somehow in your tests and trigger the different behaviour that way?)
@original-brownbear I have uploaded test code that should help you reproduce the issue. You will have to install Node.js from https://nodejs.org/en/download/ and start the dummy server. I have added two timeouts in the test client code (ESClientTest.java). I have also added a screenshot of the profiler to show the growth of BasicFuture objects. Notice that these objects are strongly reachable.
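The uploaded test code itself is not reproduced in this thread. Purely as an illustration, a hot-loop client along the following lines puts the same kind of pressure on the async client; the class name and bulk payload are placeholders rather than the actual ESClientTest.java, while the host, port, and /_bulk endpoint match the logs further down.

```java
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class HotLoopClientSketch {
    public static void main(String[] args) {
        // Default maxRetryTimeout (30s) is left in place so listener timeouts are triggered.
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        // Placeholder bulk payload; the slow dummy server never inspects it anyway.
        NStringEntity entity = new NStringEntity(
            "{\"index\":{}}\n{\"field\":\"value\"}\n", ContentType.APPLICATION_JSON);
        while (true) {
            // Fire bulk requests faster than the (slow) server answers them,
            // so pending BasicFuture instances pile up in the connection pool.
            client.performRequestAsync("POST", "/_bulk", Collections.emptyMap(), entity,
                new ResponseListener() {
                    @Override
                    public void onSuccess(Response response) { /* ignore */ }

                    @Override
                    public void onFailure(Exception exception) { /* listener timeouts end up here */ }
                });
        }
    }
}
```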
Ok, I can confirm that the behaviour here is somewhat sub-optimal. |
Is there any workaround for this? I am experiencing this exact same problem. After some time, PoolingNHttpClientConnectionManager was 3GB in size! This was accumulated in the CPool leasingRequests LinkedList, containing more than 200,000 instances of BasicFuture, each at over 11KB. I need to find a solution to prevent this behavior, so that memory is cleaned up appropriately.
We changed the maxRetryTimeout on the RestClient from the default 30 seconds to 200 seconds as a workaround for this issue: RestClient.setMaxRetryTimeoutMillis(200000)
I'm using ES client version 6.3.2, and I don't see a setMaxRetryTimeoutMillis() method in RestClient. However, I do see a similarly-named method in RestClientBuilder, so hopefully setting that to 200000 will have the same effect.
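For anyone else hunting for the knob: in the 6.x low-level client the setter lives on RestClientBuilder. A minimal sketch of the workaround described above, with the host and port as placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class RetryTimeoutWorkaround {
    public static RestClient build() {
        RestClientBuilder builder = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"));
        // Raise the max retry timeout from the default 30s to the 200s used as a workaround above.
        builder.setMaxRetryTimeoutMillis(200000);
        return builder.build();
    }
}
```

Note that, per the discussion below, this setting was deprecated in 6.x and removed in 7.0, so this only applies to the 6.x client.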
Increasing the maxRetryTimeout seems to have fixed our problem. When using the longer timeout, the memory leak does not occur. I would say that this is not a case of "sub-optimal performance", but rather should be considered a true bug and should be fixed.
Agreed @nathankopp, see #38425 and #38085.
We have had various reports of problems caused by the maxRetryTimeout setting in the low-level REST client. Such setting was initially added in the attempts to not have requests go through retries if the request already took longer than the provided timeout. The implementation was problematic though as such timeout would also expire in the first request attempt (see #31834), would leave the request executing after expiration causing memory leaks (see #33342), and would not take into account the http client internal queuing (see #25951). Given all these issues, it seems that this custom timeout mechanism gives little benefits while causing a lot of harm. We should rather rely on connect and socket timeout exposed by the underlying http client and accept that a request can overall take longer than the configured timeout, which is the case even with a single retry anyways. This commit removes the `maxRetryTimeout` setting and all of its usages.
Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425) and removed it in 7.0 (#38085).
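In practice, relying on the timeouts exposed by the underlying http client means configuring them through the builder's request config callback. A sketch, where the 5-second connect timeout and 60-second socket timeout are arbitrary example values rather than recommendations:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class HttpClientTimeoutsSketch {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
            .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                .setConnectTimeout(5000)    // time allowed to establish the TCP connection
                .setSocketTimeout(60000))   // time allowed between data packets on the socket
            .build();
    }
}
```

With this approach a slow request fails with a SocketTimeoutException from the http client itself, instead of the listener timeout that left requests queued after expiration.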
Elasticsearch version (bin/elasticsearch --version): Version: 6.1.2, Build: 5b1fea5/2018-01-10T02:35:59.208Z, JVM: 1.8.0_181
Plugins installed: [] None
JVM version (java -version): 1.8.0_181
OS version (uname -a if on a Unix-like system): CentOS Linux release 7.5.1804 (Core)
Description of the problem including expected versus actual behavior:
org.apache.http.concurrent.BasicFuture objects keep building up in the JVM heap, consuming most of the heap space and eventually leading to an OutOfMemoryError, when Elasticsearch is under load and either does not respond to the performRequest call or takes more than 30 seconds to respond.
The error reported by the library in this case is:
java.io.IOException: listener timeout after waiting for [30000] ms
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query, etc. The easier you make it for us to reproduce, the more likely it is that somebody will take the time to look at it.
Ex: a Node.js server running a script that delays its responses beyond the client timeout (a rough stand-in sketch is shown below)
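The original Node.js script is not included above. As an illustration only, a minimal Java stand-in that accepts requests on port 9200 and holds each response past the 30-second listener timeout could look like the following; the port, the 60-second delay, and the empty JSON body are assumptions:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class SlowDummyServer {
    public static void main(String[] args) throws Exception {
        // Listen on the port the test client points at (assumption: 9200).
        HttpServer server = HttpServer.create(new InetSocketAddress(9200), 0);
        server.createContext("/", exchange -> {
            try {
                // Hold the request longer than the client's 30s listener timeout.
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "{}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        // A small handler pool so most incoming requests queue up, mimicking a server under load.
        server.setExecutor(java.util.concurrent.Executors.newFixedThreadPool(4));
        server.start();
    }
}
```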
Provide logs (if relevant):
Workaround:
This issue is not seen when RestClient.setMaxRetryTimeoutMillis(100000) is used, which results in a SocketTimeoutException instead of the listener timeout, as shown below.
Socket timeout Exception:
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [ACTIVE] Timeout
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 127.0.0.1:52928<->127.0.0.1:9200[ACTIVE] [rw:w]: Shutdown
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection aborted
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Releasing connection: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 1 of 10; total allocated: 1 of 30]
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection released: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
11:11:17.222 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - request [POST http://127.0.0.1:9200/_bulk] failed
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalRequestExecutor.timeout(InternalRequestExecutor.java:116)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
11:11:17.223 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - added host [http://127.0.0.1:9200] to blacklist
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 0.0.0.0:52928<->127.0.0.1:9200[CLOSED][]: Shutdown
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [CLOSED]: Disconnected