
Memory leak in RestClient when elasticsearch takes more than 30 seconds to respond #33342

Closed
appadesai opened this issue Sep 3, 2018 · 13 comments
Labels
>bug :Clients/Java Low Level REST Client Minimal dependencies Java Client for Elasticsearch

Comments

appadesai commented Sep 3, 2018


Elasticsearch version (bin/elasticsearch --version):
Version: 6.1.2, Build: 5b1fea5/2018-01-10T02:35:59.208Z, JVM: 1.8.0_181

Plugins installed: [] None

JVM version (java -version):
1.8.0_181

OS version (uname -a if on a Unix-like system):
CentOS Linux release 7.5.1804 (Core)

Description of the problem including expected versus actual behavior:
org.apache.http.concurrent.BasicFuture objects keep building up in the JVM heap, consuming most of the heap space and eventually leading to an OutOfMemoryError, when Elasticsearch is under load and either does not respond to the performRequest call or takes more than 30 seconds to respond.

The error reported by the library in this case is:
java.io.IOException: listener timeout after waiting for [30000] ms

Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it
for us to reproduce, the more likely somebody will take the time to look at it.

  1. To simulate a loaded Elasticsearch server, start a web server that responds to any query with a delay of 100 seconds,
     e.g. a Node.js server with the following script:
          var http = require('http');
          var timeout = 100000; // sleep 100 seconds
          http.createServer(function (req, res) {
              setTimeout(function () {
                  res.writeHead(200, {'Content-Type': 'text/plain'});
                  res.end("Hello I am awake");
              }, timeout);
          }).listen(9200);
  2. Create a Java thread pool of 32 threads and continuously submit tasks to be picked up by the threads in the pool.
  3. Each thread, upon getting a task, should make a performRequest call with a StringEntity of about 1 MB on a single RestClient instance shared across all the threads. In our scenario, we make a _bulk request to Elasticsearch with 500 update operations.
  4. Monitor the heap objects every 15 minutes and you will notice a buildup of BasicFuture objects with strong references from the GC root which do not get released.
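The steps above can be sketched in Java roughly as follows (a sketch assuming the 6.x low-level client API used by the reported versions; the host, body size, and LeakRepro class name are placeholders — the ESClientTest.zip attached later in this thread is the authoritative reproduction):

```java
import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class LeakRepro {
    public static void main(String[] args) {
        // 30 s listener timeout reproduces the leak; raising it to ~200 s avoids it.
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setMaxRetryTimeoutMillis(30000)
                .build();
        // ~1 MB body standing in for a _bulk request with 500 update operations.
        String body = new String(new char[1024 * 1024]).replace('\0', 'x');
        ExecutorService pool = Executors.newFixedThreadPool(32);
        while (true) { // continuously submit tasks to the 32-thread pool
            pool.submit(() -> {
                try {
                    client.performRequest("POST", "/_bulk", Collections.emptyMap(),
                            new NStringEntity(body, ContentType.APPLICATION_JSON));
                } catch (Exception e) {
                    // expected: "listener timeout after waiting for [30000] ms"
                }
            });
        }
    }
}
```

With the slow server above, heap profiling while this runs should show the BasicFuture buildup described in step 4.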


Workaround:
This issue is not seen when RestClient.setMaxRetryTimeoutMillis(100000) is used, which results in a SocketTimeoutException instead of the listener timeout, as shown below.
Socket timeout exception:
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [ACTIVE] Timeout
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 127.0.0.1:52928<->127.0.0.1:9200[ACTIVE] [rw:w]: Shutdown
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection aborted
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Releasing connection: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 1 of 10; total allocated: 1 of 30]
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection released: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
11:11:17.222 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - request [POST http://127.0.0.1:9200/_bulk] failed
java.net.SocketTimeoutException: null
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
	at org.apache.http.impl.nio.client.InternalRequestExecutor.timeout(InternalRequestExecutor.java:116)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	at java.lang.Thread.run(Thread.java:748)
11:11:17.223 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - added host [http://127.0.0.1:9200] to blacklist
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 0.0.0.0:52928<->127.0.0.1:9200[CLOSED][]: Shutdown
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [CLOSED]: Disconnected

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@original-brownbear
Member

@appadesai

Monitor the heap objects after every 15 minutes and you will notice buildup of BasicFuture objects with strong references from the GC root which do not get released.

Do individual objects never get released (i.e. if you add logic to your tests that eventually stops firing requests in a hot loop), or does the overall number of these futures simply never go down?

It seems the former would be a problem, while the latter is somewhat expected given the way the async HTTP client works in the background (you can enqueue requests quicker than the server can handle them -> your queue grows).

@appadesai
Author

appadesai commented Sep 3, 2018

The objects eventually get cleaned up if I stop sending requests, but this behavior is not seen when there are continuous requests to Elasticsearch under load. The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis) is increased to 100 seconds, even with continuous load.

I am not sure what you are referring to as the former and the latter.

@original-brownbear
Member

original-brownbear commented Sep 3, 2018

@appadesai

The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis ) is increased to 100 seconds even with continuous load.

Interesting, will take a look and see if I can reproduce this :)

@original-brownbear original-brownbear self-assigned this Sep 3, 2018
@original-brownbear
Member

original-brownbear commented Sep 3, 2018

@appadesai I can't find any different behaviour here depending on the client-side timeout (30s vs 100s). What's the Java code you use for testing here? (Maybe you're leaking resources somehow in your tests and trigger the different behaviour that way?)

@appadesai
Author

ESClientTest.zip

@original-brownbear I have uploaded test code that should help you reproduce the issue.

You will have to install Node.js from https://nodejs.org/en/download/ and start the dummy server as follows:
bin/node node_app.js

I have added two timeouts in the test client code (ESClientTest.java)
private final int maxRetryTimeoutMillis = 30000; //Timeout to reproduce leak
//private final int maxRetryTimeoutMillis = 200000; //Timeout to fix leak
Uncomment each one to reproduce or avoid the issue.

Also added a screenshot of the profiler showing the growth of BasicFuture objects. Notice that these objects are strongly reachable.
[screenshot: basicfutureleak]

@original-brownbear
Member

Ok, I can confirm that the behaviour here is somewhat sub-optimal.
If we set a timeout smaller than the time taken by the request, then we will not interrupt the request at all, but only throw an exception from org.elasticsearch.client.RestClient.SyncResponseListener#get.
=> we're leaking memory here if user code retries requests quicker than they time out at the HTTP client level, because we never interrupt any requests.
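This diagnosis can be modelled with a self-contained sketch (plain java.util.concurrent; every name below is hypothetical, not the client's real internals): a caller that throws on timeout without cancelling leaves each abandoned future strongly referenced in the client's queue.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ListenerTimeoutLeak {

    // Stands in for the async client's internal queue of pending requests.
    static final Queue<Future<String>> inFlight = new ConcurrentLinkedQueue<>();

    // Single "I/O reactor" thread; daemon so the JVM can still exit.
    static final ExecutorService ioReactor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "io-reactor");
        t.setDaemon(true);
        return t;
    });

    // Enqueue a "request" that a slow server answers after serverDelayMs.
    static Future<String> send(long serverDelayMs) {
        Future<String> f = ioReactor.submit(() -> {
            Thread.sleep(serverDelayMs);
            return "200 OK";
        });
        inFlight.add(f); // strong reference held by the client
        return f;
    }

    // Like the sync listener: give up after the listener timeout,
    // but do NOT cancel the request or drop the queued reference.
    static void performRequest(long listenerTimeoutMs) {
        Future<String> f = send(500);
        try {
            f.get(listenerTimeoutMs, TimeUnit.MILLISECONDS);
            inFlight.remove(f); // only removed on success
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // caller sees a "listener timeout"; the future stays retained
        }
    }

    // Fire n requests with the given listener timeout; report retained futures.
    static int pendingAfter(int n, long listenerTimeoutMs) {
        inFlight.clear();
        for (int i = 0; i < n; i++) {
            performRequest(listenerTimeoutMs);
        }
        return inFlight.size();
    }

    public static void main(String[] args) {
        // Listener timeout (10 ms) far below the server delay (500 ms):
        // every future is abandoned and the queue only grows.
        System.out.println("retained futures: " + pendingAfter(5, 10));
    }
}
```

With the listener timeout below the server delay, every request is abandoned while still referenced, which mirrors the heap growth reported above; raising the timeout above the delay lets requests complete and be removed.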

@original-brownbear original-brownbear removed their assignment Sep 4, 2018
@nathankopp

Is there any workaround for this? I am experiencing this exact same problem. After some time, PoolingNHttpClientConnectionManager was 3 GB in size! The memory had accumulated in the CPool leasingRequests LinkedList, which contained more than 200,000 instances of BasicFuture, each over 11 KB.

I need to find a solution to prevent this behavior, so that memory is cleaned up appropriately.

@appadesai
Author

appadesai commented Jan 30, 2019 via email

@nathankopp

I'm using ES client version 6.3.2, and I don't see a setMaxRetryTimeoutMillis() method in RestClient. However, I do see a similarly-named method in RestClientBuilder, so hopefully setting that to 200000 will have the same effect.
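For reference, in 6.x the setting lives on RestClientBuilder; a configuration fragment along these lines (host is a placeholder) applies the same workaround:

```java
RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http"))
        .setMaxRetryTimeoutMillis(200000) // raise well above the worst-case response time
        .build();
```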

javanna added a commit to javanna/elasticsearch that referenced this issue Jan 31, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.

The implementation was problematic though as such timeout would also
expire in the first request attempt (see elastic#31834), would leave the
request executing after expiration causing memory leaks (see elastic#33342),
and would not take into account the http client internal queuing (see
elastic#25951).

Given all these issues, my conclusion is that this custom timeout
mechanism gives little benefits while causing a lot of harm. We should
rather rely on connect and socket timeout exposed by the underlying
http client and accept that a request can overall take longer than the
configured timeout, which is the case even with a single retry anyways.

This commit removes the maxRetryTimeout setting and all of its usages.
@nathankopp

Increasing the maxRetryTimeout seems to have fixed our problem. When using the longer timeout, the memory leak does not occur. I would say that this is not a case of "sub-optimal" behaviour, but rather a true bug that should be fixed.

@javanna
Member

javanna commented Feb 5, 2019

agreed @nathankopp see #38425 and #38085 .

javanna added a commit that referenced this issue Feb 6, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.

The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).

Given all these issues, it seems that this custom timeout mechanism 
gives little benefits while causing a lot of harm. We should rather rely 
on connect and socket timeout exposed by the underlying http client 
and accept that a request can overall take longer than the configured 
timeout, which is the case even with a single retry anyways.

This commit removes the `maxRetryTimeout` setting and all of its usages.
@javanna
Member

javanna commented Feb 6, 2019

Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425) and removed it in 7.0 (#38085).
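The replacement configuration described here, relying on the http client's own timeouts, can be set through the builder's request-config callback; a fragment with illustrative values:

```java
RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http"))
        .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                .setConnectTimeout(5000)    // fail fast on unreachable nodes
                .setSocketTimeout(120000))  // tolerate slow responses under load
        .build();
```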

@javanna javanna closed this as completed Feb 6, 2019