
Memory leak in RestClient when elasticsearch takes more than 30 seconds to respond #33342

Closed
appadesai opened this issue Sep 3, 2018 · 13 comments
Labels
>bug :Clients/Java Low Level REST Client Minimal dependencies Java Client for Elasticsearch

Comments

appadesai commented Sep 3, 2018


Elasticsearch version (bin/elasticsearch --version):
Version: 6.1.2, Build: 5b1fea5/2018-01-10T02:35:59.208Z, JVM: 1.8.0_181

Plugins installed: [] None

JVM version (java -version):
1.8.0_181

OS version (uname -a if on a Unix-like system):
CentOS Linux release 7.5.1804 (Core)

Description of the problem including expected versus actual behavior:
org.apache.http.concurrent.BasicFuture objects keep building up in the JVM heap, consuming most of the heap space and eventually leading to an OutOfMemoryError, when Elasticsearch is under load and either does not respond to the performRequest call or takes more than 30 seconds to respond.

The error reported by the library in this case is:
java.io.IOException: listener timeout after waiting for [30000] ms

Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it
for us to reproduce, the more likely somebody will take the time to look at it.

  1. To simulate a loaded Elasticsearch server, start a web server that responds to any query with a delay of 100 seconds,
     e.g. a Node.js server with the following script:
          var http = require('http');
          var timeout = 100000; // sleep 100 seconds
          http.createServer(function (req, res) {
              setTimeout(function () {
                  res.writeHead(200, {'Content-Type': 'text/plain'});
                  res.end("Hello I am awake");
              }, timeout);
          }).listen(9200);
  2. Create a Java thread pool of 32 threads and continuously submit tasks to be picked up by the threads in the pool.
  3. Each thread, upon getting a task, should make a performRequest call with a StringEntity of about 1 MB on a single RestClient instance shared across all the threads. In our scenario, we make a _bulk request to Elasticsearch with 500 update operations.
  4. Monitor the heap objects every 15 minutes and you will notice a buildup of BasicFuture objects with strong references from the GC root which do not get released.
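The steps above can be sketched in Java roughly as follows (a sketch assuming the 6.x low-level client API used by the reported versions; the host, body size, and LeakRepro class name are placeholders — the ESClientTest.zip attached later in this thread is the authoritative reproduction):

```java
import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class LeakRepro {
    public static void main(String[] args) {
        // 30 s listener timeout reproduces the leak; raising it to ~200 s avoids it.
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setMaxRetryTimeoutMillis(30000)
                .build();
        // ~1 MB body standing in for a _bulk request with 500 update operations.
        String body = new String(new char[1024 * 1024]).replace('\0', 'x');
        ExecutorService pool = Executors.newFixedThreadPool(32);
        while (true) { // continuously submit tasks to the 32-thread pool
            pool.submit(() -> {
                try {
                    client.performRequest("POST", "/_bulk", Collections.emptyMap(),
                            new NStringEntity(body, ContentType.APPLICATION_JSON));
                } catch (Exception e) {
                    // expected: "listener timeout after waiting for [30000] ms"
                }
            });
        }
    }
}
```

With the slow server above, heap profiling while this runs should show the BasicFuture buildup described in step 4.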


Workaround:
This issue is not seen when RestClient.setMaxRetryTimeoutMillis(100000) is used, which results in a SocketTimeoutException instead of the listener timeout, as shown below.
Socket timeout exception:
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [ACTIVE] Timeout
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 127.0.0.1:52928<->127.0.0.1:9200[ACTIVE] [rw:w]: Shutdown
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection aborted
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Releasing connection: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 1 of 10; total allocated: 1 of 30]
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection released: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
11:11:17.222 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - request [POST http://127.0.0.1:9200/_bulk] failed
java.net.SocketTimeoutException: null
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
	at org.apache.http.impl.nio.client.InternalRequestExecutor.timeout(InternalRequestExecutor.java:116)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	at java.lang.Thread.run(Thread.java:748)
11:11:17.223 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - added host [http://127.0.0.1:9200] to blacklist
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 0.0.0.0:52928<->127.0.0.1:9200[CLOSED][]: Shutdown
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [CLOSED]: Disconnected

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@original-brownbear
Member

@appadesai

Monitor the heap objects after every 15 minutes and you will notice buildup of BasicFuture objects with strong references from the GC root which do not get released.

Do individual objects never get released (i.e. if you add logic to your tests that eventually stops firing requests in a hot loop), or does the overall number of these futures simply never go down?

It seems the former would be a problem, while the latter is somewhat expected given the way the async HTTP client works in the background (you can enqueue requests quicker than the server can handle them -> your queue grows).

@appadesai
Author

appadesai commented Sep 3, 2018

The objects eventually get cleaned up if I stop sending requests, but this behavior is not seen when there are continuous requests to Elasticsearch under load. The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis) is increased to 100 seconds, even with continuous load.

I am not sure what you are referring to as the former and the latter.

@original-brownbear
Member

original-brownbear commented Sep 3, 2018

@appadesai

The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis ) is increased to 100 seconds even with continuous load.

Interesting, will take a look and see if I can reproduce this :)

@original-brownbear original-brownbear self-assigned this Sep 3, 2018
@original-brownbear
Member

original-brownbear commented Sep 3, 2018

@appadesai I can't find any different behaviour here depending on the client-side timeout (30s vs 100s). What's the Java code you use for testing here? (Maybe you're leaking resources somehow in your tests and trigger the different behaviour that way?)

@appadesai
Author

ESClientTest.zip

@original-brownbear I have uploaded test code that should help you reproduce the issue.

You will have to install Node.js from https://nodejs.org/en/download/ and start the dummy server as follows:
bin/node node_app.js

I have added two timeouts in the test client code (ESClientTest.java)
private final int maxRetryTimeoutMillis = 30000; //Timeout to reproduce leak
//private final int maxRetryTimeoutMillis = 200000; //Timeout to fix leak
Uncomment each one to reproduce or avoid the issue.

Also added a screenshot of the profiler showing the growth of BasicFuture objects. Notice that these objects are strongly reachable.
[screenshot: basicfutureleak]

@original-brownbear
Member

Ok, I can confirm that the behaviour here is somewhat sub-optimal.
If we set a timeout smaller than the time taken by the request, then we will not interrupt the request at all, but only throw an exception from org.elasticsearch.client.RestClient.SyncResponseListener#get.
=> we're leaking memory here if user code retries requests quicker than they time out at the HTTP client level, because we never interrupt any requests.
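This diagnosis can be modelled with a self-contained sketch (plain java.util.concurrent; every name below is hypothetical, not the client's real internals): a caller that throws on timeout without cancelling leaves each abandoned future strongly referenced in the client's queue.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ListenerTimeoutLeak {

    // Stands in for the async client's internal queue of pending requests.
    static final Queue<Future<String>> inFlight = new ConcurrentLinkedQueue<>();

    // Single "I/O reactor" thread; daemon so the JVM can still exit.
    static final ExecutorService ioReactor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "io-reactor");
        t.setDaemon(true);
        return t;
    });

    // Enqueue a "request" that a slow server answers after serverDelayMs.
    static Future<String> send(long serverDelayMs) {
        Future<String> f = ioReactor.submit(() -> {
            Thread.sleep(serverDelayMs);
            return "200 OK";
        });
        inFlight.add(f); // strong reference held by the client
        return f;
    }

    // Like the sync listener: give up after the listener timeout,
    // but do NOT cancel the request or drop the queued reference.
    static void performRequest(long listenerTimeoutMs) {
        Future<String> f = send(500);
        try {
            f.get(listenerTimeoutMs, TimeUnit.MILLISECONDS);
            inFlight.remove(f); // only removed on success
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // caller sees a "listener timeout"; the future stays retained
        }
    }

    // Fire n requests with the given listener timeout; report retained futures.
    static int pendingAfter(int n, long listenerTimeoutMs) {
        inFlight.clear();
        for (int i = 0; i < n; i++) {
            performRequest(listenerTimeoutMs);
        }
        return inFlight.size();
    }

    public static void main(String[] args) {
        // Listener timeout (10 ms) far below the server delay (500 ms):
        // every future is abandoned and the queue only grows.
        System.out.println("retained futures: " + pendingAfter(5, 10));
    }
}
```

With the listener timeout below the server delay, every request is abandoned while still referenced, which mirrors the heap growth reported above; raising the timeout above the delay lets requests complete and be removed.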

@original-brownbear original-brownbear removed their assignment Sep 4, 2018
@nathankopp

Is there any workaround for this? I am experiencing this exact same problem. After some time, PoolingNHttpClientConnectionManager was 3 GB in size! The memory had accumulated in the CPool leasingRequests LinkedList, which contained more than 200,000 instances of BasicFuture, each over 11 KB.

I need to find a solution to prevent this behavior, so that memory is cleaned up appropriately.

@appadesai
Author

appadesai commented Jan 30, 2019 via email

@nathankopp

I'm using ES client version 6.3.2, and I don't see a setMaxRetryTimeoutMillis() method in RestClient. However, I do see a similarly-named method in RestClientBuilder, so hopefully setting that to 200000 will have the same effect.
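For reference, in 6.x the setting lives on RestClientBuilder; a configuration fragment along these lines (host is a placeholder) applies the same workaround:

```java
RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http"))
        .setMaxRetryTimeoutMillis(200000) // raise well above the worst-case response time
        .build();
```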

javanna added a commit to javanna/elasticsearch that referenced this issue Jan 31, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.

The implementation was problematic though as such timeout would also
expire in the first request attempt (see elastic#31834), would leave the
request executing after expiration causing memory leaks (see elastic#33342),
and would not take into account the http client internal queuing (see
elastic#25951).

Given all these issues, my conclusion is that this custom timeout
mechanism gives little benefits while causing a lot of harm. We should
rather rely on connect and socket timeout exposed by the underlying
http client and accept that a request can overall take longer than the
configured timeout, which is the case even with a single retry anyways.

This commit removes the maxRetryTimeout setting and all of its usages.
@nathankopp

Increasing the maxRetryTimeout seems to have fixed our problem. When using the longer timeout, the memory leak does not occur. I would say that this is not a case of "sub-optimal" behaviour, but rather a true bug that should be fixed.

@javanna
Member

javanna commented Feb 5, 2019

agreed @nathankopp see #38425 and #38085 .

javanna added a commit that referenced this issue Feb 6, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.

The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).

Given all these issues, it seems that this custom timeout mechanism 
gives little benefits while causing a lot of harm. We should rather rely 
on connect and socket timeout exposed by the underlying http client 
and accept that a request can overall take longer than the configured 
timeout, which is the case even with a single retry anyways.

This commit removes the `maxRetryTimeout` setting and all of its usages.
@javanna
Member

javanna commented Feb 6, 2019

Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425) and removed it in 7.0 (#38085).
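The replacement configuration described here, relying on the http client's own timeouts, can be set through the builder's request-config callback; a fragment with illustrative values:

```java
RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http"))
        .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                .setConnectTimeout(5000)    // fail fast on unreachable nodes
                .setSocketTimeout(120000))  // tolerate slow responses under load
        .build();
```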

@javanna javanna closed this as completed Feb 6, 2019