
Possible memory leak when using pooled connections #3367

Closed
vitjouda opened this issue Jul 24, 2024 · 20 comments
Assignees
Labels
status/cannot-reproduce We cannot reproduce this issue

Comments

@vitjouda

I am facing periodic OOM issues with my service, which uses a reactor-netty-based WebClient for HTTP communication with other REST services. I am not sure this is 100% reactor-netty related, but the only thing that managed to mitigate the issue was switching to another ClientHttpConnector, so I assume this is the correct place to start asking for help :).

When I inspect the memory dump of my service, I can see that most of the growing memory is retained by instances of io.netty.channel.epoll.EpollSocketChannel. When I inspect the path to GC root of some random EpollSocketChannel instance, there is a long chain of retained Reactor operators. I can see "blocks" of the processing pipeline that correspond to different and unrelated server requests chained together, presumably by the HttpClientOperations instance holding a reference to a PooledConnection / EpollSocketChannel that is now assigned to another request. I tried to visualize it in the attached MAT screenshot, but I know it's hard to explain.

[image: MAT screenshot]

In the image, each color block represents a reference chain of Channel to HttpClientOperations. Each block belongs to a distinct server-request processing pipeline, as shown in this diagram. There are hundreds of concurrent requests at any given time.

[image: diagram]

During the lifetime of the service, these chains get longer and retain more and more memory until OOM.
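The growth described above can be modeled with plain objects. Below is a minimal sketch (the FakeChannel class and its fields are hypothetical stand-ins, not reactor-netty types) of how one live root that references a chain of closed objects keeps every link strongly reachable:

```java
class LeakSketch {
    // Hypothetical stand-in for EpollSocketChannel; "owner" mimics the
    // connectionOwner attribute that links one channel's pipeline to the next.
    static class FakeChannel {
        boolean closed;
        FakeChannel owner;
    }

    // Walk the owner chain from a live root and count how many closed
    // channels are still strongly reachable (i.e. not collectable).
    static int reachableClosed(FakeChannel root) {
        int n = 0;
        for (FakeChannel c = root; c != null; c = c.owner) {
            if (c.closed) n++;
        }
        return n;
    }

    // Simulate `reuses` pooled reuses where close() never clears the owner:
    // each closed channel still points at the channel of the next request.
    static FakeChannel buildChain(int reuses) {
        FakeChannel ws = new FakeChannel(); // long-lived WS channel, stays open
        FakeChannel prev = ws;
        for (int i = 0; i < reuses; i++) {
            FakeChannel pooled = new FakeChannel();
            prev.owner = pooled;  // previous pipeline captures the reused channel
            pooled.closed = true; // pooled channel closed, owner never cleared
            prev = pooled;
        }
        return ws;
    }
}
```

With this model, every reuse adds one more closed-but-reachable node behind the single live root, mirroring the ever-growing retained set described in the dump.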

Steps to Reproduce

Unfortunately I can't provide a reproducer because I never managed to simulate this behavior locally (and I tried for days), and I cannot share the actual memory dump. If I can provide any other information that would help, please tell me.

Your Environment

  • Reactor version(s) used: reactor-core:3.6.7, reactor-netty:1.1.20, reactor-pool:1.0.6
  • JVM version: 21
  • OS and version: Ubuntu 22.04.3
@vitjouda vitjouda added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Jul 24, 2024
@violetagg
Member

@vitjouda with so little information we can't start any investigation.

  • Is it HTTP/1.1 or HTTP/2?
  • Can you share your HttpClient/WebClient configuration and the usage?

@violetagg violetagg self-assigned this Jul 25, 2024
@violetagg violetagg added for/user-attention This issue needs user attention (feedback, rework, etc...) and removed status/need-triage A new issue that still need to be evaluated as a whole labels Jul 25, 2024
@vitjouda
Author

I understand, I will provide everything I can.

  • HTTP 1.1

Usage (adjusted so that properties are hardcoded with real world values)

    @Bean
    @Scope("prototype")
    public WebClient.Builder webClientBuilder(ObjectProvider<WebClientCustomizer> customizerProvider) {
        WebClient.Builder builder = WebClient.builder();
        customizerProvider.orderedStream().forEach((customizer) -> customizer.customize(builder));
        return builder;
    }

    @Bean("serviceClientBuilder")
    public WebClient.Builder serviceWebClientBuilder(@Qualifier("webClientBuilder") WebClient.Builder builder) {
        return builder
                .baseUrl("serviceUrl")
                .clientConnector(new ReactorClientHttpConnector(customHttpClient()));
    }

    protected HttpClient customHttpClient() {
        return HttpClient.create(ConnectionProvider.builder("serviceClientPool")
                        .maxConnections(500)
                        .maxIdleTime(Duration.ofMinutes(5))
                        .maxLifeTime(Duration.ofHours(1))
                        .metrics(true)
                        .build())
                .responseTimeout(Duration.ofSeconds(120));
    }
    public Mono<Response> get(String url, String... urlParams) {
        // I have some abstraction around this, but this is the resulting usage.
        // The webClient instance is created from the builder above in the class constructor.
        return webClient
                .get()
                .uri(url, (Object[]) urlParams)
                .accept(MediaType.APPLICATION_JSON)
                .retrieve()
                .bodyToMono(Response.class);
    }

@violetagg
Member

@vitjouda I see that you have metrics for the ConnectionProvider. Can you check how many connection pools are created?

@vitjouda
Author

If you mean reactor_netty_connection_provider_total_connections{name="serviceClientPool"}, then it's approximately 2 to 3 (it fluctuates) per instance.

@violetagg
Member

@vitjouda these might have different id tags, and a different id means a different connection pool. So the question is: how many pools has the ConnectionProvider with name=serviceClientPool created?

@violetagg
Member

For example, check this issue, where too many connection pools were created because of a wrong configuration: #3315

@vitjouda
Author

Aha, then I believe there is only one pool per remote_address, as all of these have had the same id since app startup (during startup there is heavy load, so many more connections were created, all with the same id).

@violetagg
Member

@vitjouda then we will really need something to play with :(

@violetagg
Member

@vitjouda Do you often change the remote_address?

@vitjouda
Author

Yeah, I understand. I have been trying to simulate it by creating a reproducer that mimics my setup and running a perf test against a locally deployed app, but I never managed to reproduce it outside of our production setup, and I cannot provide its memory dump.

I really believe the problem is that there are a lot of EpollSocketChannel instances that are inactive and closed, yet always linked to one that is active, as if something does not dereference them after the original request/response finishes. For some time I suspected it might be an Observation issue, and I found something similar (Remove Observation.Context from WebClient request), but I had these issues both before and after the versions mentioned there.

@vitjouda
Author

@vitjouda Do you often change the remote_address?

Not really, it's another Spring service in our cluster, which is generally pretty stable.

@vitjouda
Author

vitjouda commented Jul 30, 2024

I have been trying to investigate further, and I believe I now know how the problem manifests. I cannot yet reproduce the leak itself locally, but I am still trying.

The root problem is that there are inactive EpollSocketChannels whose connectionOwner attribute is set to DisposableAcquire instead of ConnectionObserver.emptyListener(). The reset should happen in PooledConnection.onStateChange(), but in my case it does not. I don't yet fully understand the logic there, but I can see there are some conditions that bypass it, with the comment // Will be released by closeFuture. In that case, connectionOwner is not cleared. Furthermore, I don't see that those "unreleased" connections have any closeFuture listener at all (though it might have been cleared before I made the memory dump). I attached some screenshots that could help illustrate the issue.

[image: inactive channel]

[image: reference]

This by itself would probably not cause any problems, but if such an inactive EpollSocketChannel is still referenced by some long-lived operation, for example another EpollSocketChannel used for a websocket, it is never garbage collected. What is worse, before the channel is closed it is reused for another HTTP call to the same server and linked to yet another EpollSocketChannel, and so on, which causes the memory leak. This is exactly what I observe in my memory dump. The order of operations is something like this:

  • The application calls server S1 and, based on the response, opens a stable WS connection to the same S1
    • now there is one EpollChannel for the WS, linked to the pooled EpollChannel CH1 used to make the first call to S1
  • The application calls S1 (reusing pooled CH1), then calls S2 (opening another pooled EpollChannel, CH2)
    • now the websocket EpollChannel is linked to CH1 (because the reactor chain still captures the original HttpClientOperations), and CH1 is linked to CH2 (via the connectionOwner attribute)
  • CH1 is closed without clearing its owner
  • The application calls S2 using the still-active CH2, and so on...

In my reproducer, I managed to simulate this to the point of websocket -> CH1 -> CH2, but when CH1 closes (I set maxLifeTime), connectionOwner is set to emptyListener(), breaking the chain, so the leak does not happen.
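The chain-breaking effect described here (the owner reference being reset on close) can be illustrated with a plain-Java sketch; FakeChannel and its fields are hypothetical stand-ins, not reactor-netty types:

```java
class ChainBreakSketch {
    // Hypothetical stand-in for a pooled channel; "owner" mimics connectionOwner.
    static class FakeChannel {
        FakeChannel owner;

        // Mimics the healthy path: on close, the owner reference is reset
        // (analogous to setting ConnectionObserver.emptyListener()).
        void closeClearingOwner() {
            owner = null;
        }
    }

    // Number of channels reachable by following owner references from root.
    static int reachable(FakeChannel root) {
        int n = 0;
        for (FakeChannel c = root; c != null; c = c.owner) n++;
        return n;
    }

    // Build ws -> ch1 -> ch2, then close ch1 the "healthy" way.
    static int demo() {
        FakeChannel ws = new FakeChannel();
        FakeChannel ch1 = new FakeChannel();
        FakeChannel ch2 = new FakeChannel();
        ws.owner = ch1;
        ch1.owner = ch2;
        ch1.closeClearingOwner(); // owner cleared: ch2 is cut off from ws
        return reachable(ws);     // only ws and ch1 remain reachable
    }
}
```

In this model, clearing the owner on close leaves only two reachable nodes; without the clear, all three would stay reachable from the long-lived root.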

@violetagg
Member

@vitjouda Thank you for your further investigation. Let's continue with it. When a connection is upgraded to websocket, we mark it as non-persistent; it happens in the code referenced below, can you check it? This is important because non-persistent channels are never returned to the connection pool:


So it is correct that when we don't need the connection and it is non-persistent, we wait for the close event here:

Application calls S1 (reusing pooled CH1), and then calls S2 (opening another pooled EpollChannel CH2)
now the websocket EpollChannel is linked to CH1 (because the reactor chain still captures original HttpClientOperations), CH1 is linked to CH2 (using connectionOwner attribute)

I don't understand this

@vitjouda
Author

The WS connection is made from another HttpClient, which does not use a connection pool (it is configured with ConnectionProvider.newConnection()). I have a different HttpClient that uses pooling for plain HTTP requests to the same server, but never for WS. The only reason the WS matters in this case is that it is always active during the application lifetime and therefore provides a path from the GC root to the next EpollSocketChannel; it's visible in the image I posted. I believe that if I killed that WS, it would free the rest of the memory held by it. Or at least that's what MAT tells me.

Another thing I managed to simulate just now: when the connection is closed BEFORE the response, it does not get cleared and still has connectionOwner set. I believe this might be part of the problem.

I'll try to explain better. The whole reason the leak happens is that there are extremely long (effectively unbounded) reference chains of EpollSocketChannel -> EpollSocketChannel -> another... At the moment I make the memory dump, they are all inactive, yet linked together by the connectionOwner attribute. I believe these chains are created by reusing a pooled EpollSocketChannel before it is closed.

I tried to make a quick diagram that might help.

[image: netty diagram]

In this diagram, session 1 first uses Pooled Channel A to call Server A, receives the response, and opens a WS channel to another server (it doesn't really matter which one). After this, the WS channel is always active on purpose and is linked via a Reactor reference chain to Pooled Channel A (as seen in the dump). Pooled Channel A is still active at this time, available for reuse.

Now session 2 starts by calling Server A. Because Pooled Channel A is free, it is picked and used to make the call. After it receives the response, the session calls Server B using a new Pooled Channel B and does some additional work. At this point, Pooled Channel A is linked via the Reactor sink (connectionOwner) to Pooled Channel B and the rest of the reactive pipeline.

At this point there exists a reference chain from the WS channel to Pooled Channel A, and from Pooled Channel A to Pooled Channel B via connectionOwner. If at this point Pooled Channel A is closed without clearing the connectionOwner attribute (I managed to replicate this by forcibly closing the socket on the target server), it forever captures Pooled Channel B and the rest of the pipeline, despite them belonging to two independent sessions. This repeats when another session uses the still-active Pooled Channel B and links it to yet another channel, and so on.
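The scenario above reduces to a small plain-Java sketch (hypothetical types, not reactor-netty code): closing channel A without clearing its owner reference leaves channel B, from an unrelated session, reachable from the long-lived WS channel.

```java
class RetainedChainSketch {
    // Hypothetical stand-ins; "owner" mimics the connectionOwner attribute.
    static class FakeChannel {
        boolean closed;
        FakeChannel owner;
    }

    // Count channels reachable from the root by following owner references.
    static int reachable(FakeChannel root) {
        int n = 0;
        for (FakeChannel c = root; c != null; c = c.owner) n++;
        return n;
    }

    // WS -> Pooled Channel A -> Pooled Channel B, with A closed but its
    // owner never cleared: B stays strongly reachable from the WS root.
    static int leakyDemo() {
        FakeChannel ws = new FakeChannel(); // long-lived WS channel
        FakeChannel a  = new FakeChannel(); // Pooled Channel A
        FakeChannel b  = new FakeChannel(); // Pooled Channel B
        ws.owner = a;    // WS pipeline still captures A's HttpClientOperations
        a.owner = b;     // A was reused, then linked to B's pipeline
        a.closed = true; // A closed (e.g. by the remote), owner NOT cleared
        return reachable(ws); // all three nodes are still reachable
    }
}
```

Here the closed node A acts as a bridge: as long as its owner reference survives the close, everything downstream of it is pinned by the live root.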

@violetagg
Member

violetagg commented Jul 31, 2024

@vitjouda Please show the code for how you are doing all this linking. You should never keep a reference to a pooled channel. The idea of a pooled channel is to serve a given request/response and then be released. How exactly do you open a WS connection from an HTTP connection that is scheduled for recycling?

@vitjouda
Author

vitjouda commented Jul 31, 2024

I am testing it in Kotlin. This is enough to keep the pooled connection referenced from the EpollSocketChannel used for WS.

// created at the class level and reused for all WS calls
val httpClient = HttpClient.create(ConnectionProvider.newConnection())
    .metrics(true, Function.identity())

Mono.just("uri")
    .flatMap {
        // 'it' is ignored, but in the real app I call the client inside a flatMap operator,
        // so I kept it here to simulate that.
        // 'client' is the classic WebClient setup with pooled connections, as posted previously.
        client
            .get()
            .uri("/data")
            .retrieve()
            .bodyToMono(Data::class.java)
    }
    .flatMap { result ->
        // 'result' is ignored in this example, but in the real app it is needed
        // to open the WS (the upgrade request needs a session header).
        ReactorNettyWebSocketClient(httpClient).execute(URI("ws://localhost:8083/ev")) {
            it.receive()
                .`as`(it::send)
                .then()
        }
    }
    .subscribe({}, {}, { println("WS Closed") })

@violetagg
Member

violetagg commented Jul 31, 2024

@vitjouda I cannot comment on the Kotlin; in Java, this

.flatMap {
        // 'it' is ignored, but in the real app I call the client inside a flatMap operator
        // 'client' is the classic WebClient setup with pooled connections, as posted previously
        client
            .get()
            .uri("/data")
            .retrieve()
            .bodyToMono(Data::class.java)
    }

will consume the response and release the connection to the pool; you should be able to see:

Channel cleaned, now: 0 active connections, 1 inactive connections and 0 pending acquire requests.

If I were to edit your picture above, it would be something like this (two independent connections, where the WS consumes the result of the request made with pooled channel A):

pooled channel A
|
response result
|
ws connection

@vitjouda
Author

vitjouda commented Jul 31, 2024

Yes, it will release the connection to the pool, but there is still a reference chain keeping it from being garbage collected. I will post my partial reproducer together with a memory-dump snippet where you can see the path from the WS to an unrelated channel. I probably cannot manage it today, but you can see the reference chain in this image. The selected channel is the WS; the last one is the channel used to make the HTTP call. So even though the response is consumed, there is still a reference preventing garbage collection.

[image: reference chain from WS channel]

Actually, I think reusing the channel without clearing the connectionOwner attribute afterwards is the problem. I believe that if this attribute were cleared in every case where the channel is closed (currently it is not in some cases, for example when the remote closes the connection), the memory leak would stop.


github-actions bot commented Aug 8, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.


Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 15, 2024
@violetagg violetagg removed type/bug A general bug for/user-attention This issue needs user attention (feedback, rework, etc...) status/need-feedback labels Aug 15, 2024
@violetagg violetagg added the status/cannot-reproduce We cannot reproduce this issue label Aug 15, 2024
violetagg added a commit that referenced this issue Oct 17, 2024
…3459)

- When terminating detach the connection from request/response objects
- Move registration to terminate event outside of doFinally when we have the real connection
- Obtain the event loop outside of doFinally when we have the real connection
- Dispose the connection only if it is not disposed

Related to #3416, #3367