
Possible memory leak when using pooled connections #3367

Closed
vitjouda opened this issue Jul 24, 2024 · 20 comments
Assignees
Labels
status/cannot-reproduce We cannot reproduce this issue

Comments

@vitjouda

I am facing periodic OOM issues with my service, which uses a reactor-netty-based WebClient for HTTP communication with other REST services. I am not sure this is 100% reactor-netty related, but the only thing that managed to mitigate the issue was switching to another ClientHttpConnector, so I assume this is the correct place to start asking for help :).

When I inspect the memory dump of my service, I can see that most of the growing memory is retained by instances of io.netty.channel.epoll.EpollSocketChannel. When I inspect the path to GC root of some random EpollSocketChannel instance, there is a long chain of retained Reactor operators. I can see "blocks" of the processing pipeline that correspond to different and unrelated server requests chained together, presumably by the HttpClientOperations instance holding a reference to a PooledConnection / EpollSocketChannel that is now assigned to another request. I tried to visualize it in the attached MAT screenshot, but I know it's hard to explain.

[image: MAT screenshot]

In the image, each color block represents a reference chain of Channel to HttpClientOperations. Each block belongs to a distinct server-request processing pipeline, as shown in this diagram. There are hundreds of concurrent requests at any given time.

[image: diagram]

During the lifetime of the service, these chains get longer and retain more and more memory until OOM.
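The growth described above can be modeled with plain objects. Below is a minimal sketch (the FakeChannel class and its fields are hypothetical stand-ins, not reactor-netty types) of how one live root that references a chain of closed objects keeps every link strongly reachable:

```java
class LeakSketch {
    // Hypothetical stand-in for EpollSocketChannel; "owner" mimics the
    // connectionOwner attribute that links one channel's pipeline to the next.
    static class FakeChannel {
        boolean closed;
        FakeChannel owner;
    }

    // Walk the owner chain from a live root and count how many closed
    // channels are still strongly reachable (i.e. not collectable).
    static int reachableClosed(FakeChannel root) {
        int n = 0;
        for (FakeChannel c = root; c != null; c = c.owner) {
            if (c.closed) n++;
        }
        return n;
    }

    // Simulate `reuses` pooled reuses where close() never clears the owner:
    // each closed channel still points at the channel of the next request.
    static FakeChannel buildChain(int reuses) {
        FakeChannel ws = new FakeChannel(); // long-lived WS channel, stays open
        FakeChannel prev = ws;
        for (int i = 0; i < reuses; i++) {
            FakeChannel pooled = new FakeChannel();
            prev.owner = pooled;  // previous pipeline captures the reused channel
            pooled.closed = true; // pooled channel closed, owner never cleared
            prev = pooled;
        }
        return ws;
    }
}
```

With this model, every reuse adds one more closed-but-reachable node behind the single live root, mirroring the ever-growing retained set described in the dump.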

Steps to Reproduce

Unfortunately I can't provide a reproducer because I never managed to simulate this behavior locally (and I tried for days), and I cannot share the actual memory dump. If I can provide any other information that would help, please tell me.

Your Environment

  • Reactor version(s) used: reactor-core:3.6.7, reactor-netty:1.1.20, reactor-pool:1.0.6
  • JVM version: 21
  • OS and version: Ubuntu 22.04.3
@vitjouda vitjouda added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Jul 24, 2024
@violetagg
Member

@vitjouda with so little information we can't start any investigation.

  • Is it HTTP/1.1 or HTTP/2?
  • Can you share your HttpClient/WebClient configuration and the usage?

@violetagg violetagg self-assigned this Jul 25, 2024
@violetagg violetagg added for/user-attention This issue needs user attention (feedback, rework, etc...) and removed status/need-triage A new issue that still need to be evaluated as a whole labels Jul 25, 2024
@vitjouda
Author

I understand, I will provide everything I can.

  • HTTP 1.1

Usage (adjusted so that properties are hardcoded with real world values)

    @Bean
    @Scope("prototype")
    public WebClient.Builder webClientBuilder(ObjectProvider<WebClientCustomizer> customizerProvider) {
        WebClient.Builder builder = WebClient.builder();
        customizerProvider.orderedStream().forEach((customizer) -> customizer.customize(builder));
        return builder;
    }

    @Bean("serviceClientBuilder")
    public WebClient.Builder serviceWebClientBuilder(@Qualifier("webClientBuilder") WebClient.Builder builder) {
        return builder
                .baseUrl("serviceUrl")
                .clientConnector(new ReactorClientHttpConnector(customHttpClient()));
    }

    protected HttpClient customHttpClient() {
        return HttpClient.create(ConnectionProvider.builder("serviceClientPool")
                        .maxConnections(500)
                        .maxIdleTime(Duration.ofMinutes(5))
                        .maxLifeTime(Duration.ofHours(1))
                        .metrics(true)
                        .build())
                .responseTimeout(Duration.ofSeconds(120));
    }
    public Mono<Response> get(String url, String... urlParams) {
        // I have some abstraction around this, but this is the resulting usage.
        // The webClient instance is created from the builder above in the class constructor.
        return webClient
                .get()
                .uri(url, (Object[]) urlParams)
                .accept(MediaType.APPLICATION_JSON)
                .retrieve()
                .bodyToMono(Response.class);
    }

@violetagg
Member

@vitjouda I see that you have metrics for the ConnectionProvider. Can you check how many connection pools are created?

@vitjouda
Author

If you mean reactor_netty_connection_provider_total_connections{name="serviceClientPool"}, then it's approximately 2 to 3 (it fluctuates) per instance.

@violetagg
Member

@vitjouda these might have different id tags, and a different id means a different connection pool. So the question is: how many pools has the ConnectionProvider with name=serviceClientPool created?

@violetagg
Member

For example, check this issue, where too many connection pools were created because of a wrong configuration: #3315

@vitjouda
Author

Aha, then I believe there is only one pool per remote_address, as all of these have had the same id since app startup (during startup there is heavy load, so many more connections were created, all with the same id).

@violetagg
Member

@vitjouda then we will really need something to play with :(

@violetagg
Member

@vitjouda Do you often change the remote_address?

@vitjouda
Author

Yeah, I understand. I have been trying to simulate it by creating a reproducer that mimics my setup and running a perf test against a locally deployed app, but I never managed to reproduce it outside of our production setup, and I cannot provide its memory dump.

I really believe the problem is that there are a lot of EpollSocketChannel instances that are inactive and closed, yet always linked to one that is active, as if something does not dereference them after the original request/response finishes. For some time I suspected it might be an Observation issue, and I found something similar (Remove Observation.Context from WebClient request), but I had these issues both before and after the versions mentioned there.

@vitjouda
Author

@vitjouda Do you often change the remote_address?

Not really, it's another Spring service in our cluster, which is generally pretty stable.

@vitjouda
Author

vitjouda commented Jul 30, 2024

I have been trying to investigate further, and I believe I now know how the problem manifests. I cannot yet reproduce the leak itself locally, but I am still trying.

The root problem is that there are inactive EpollSocketChannels whose connectionOwner attribute is set to DisposableAcquire instead of ConnectionObserver.emptyListener(). The reset should happen in PooledConnection.onStateChange(), but in my case it does not. I don't yet fully understand the logic there, but I can see there are some conditions that bypass it, with the comment // Will be released by closeFuture. In that case, connectionOwner is not cleared. Furthermore, I don't see that those "unreleased" connections have any closeFuture listener at all (though it might have been cleared before I made the memory dump). I attached some screenshots that could help illustrate the issue.

[image: inactive channel]

[image: reference]

This by itself would probably not cause any problems, but if such an inactive EpollSocketChannel is still referenced by some long-lived operation, for example another EpollSocketChannel used for a websocket, it is never garbage collected. What is worse, before the channel is closed it is reused for another HTTP call to the same server and linked to yet another EpollSocketChannel, and so on, which causes the memory leak. This is exactly what I observe in my memory dump. The order of operations is something like this:

  • The application calls server S1 and, based on the response, opens a stable WS connection to the same S1
    • now there is one EpollChannel for the WS, linked to the pooled EpollChannel CH1 used to make the first call to S1
  • The application calls S1 (reusing pooled CH1), then calls S2 (opening another pooled EpollChannel, CH2)
    • now the websocket EpollChannel is linked to CH1 (because the reactor chain still captures the original HttpClientOperations), and CH1 is linked to CH2 (via the connectionOwner attribute)
  • CH1 is closed without clearing its owner
  • The application calls S2 using the still-active CH2, and so on...

In my reproducer, I managed to simulate this to the point of websocket -> CH1 -> CH2, but when CH1 closes (I set maxLifeTime), connectionOwner is set to emptyListener(), breaking the chain, so the leak does not happen.
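The chain-breaking effect described here (the owner reference being reset on close) can be illustrated with a plain-Java sketch; FakeChannel and its fields are hypothetical stand-ins, not reactor-netty types:

```java
class ChainBreakSketch {
    // Hypothetical stand-in for a pooled channel; "owner" mimics connectionOwner.
    static class FakeChannel {
        FakeChannel owner;

        // Mimics the healthy path: on close, the owner reference is reset
        // (analogous to setting ConnectionObserver.emptyListener()).
        void closeClearingOwner() {
            owner = null;
        }
    }

    // Number of channels reachable by following owner references from root.
    static int reachable(FakeChannel root) {
        int n = 0;
        for (FakeChannel c = root; c != null; c = c.owner) n++;
        return n;
    }

    // Build ws -> ch1 -> ch2, then close ch1 the "healthy" way.
    static int demo() {
        FakeChannel ws = new FakeChannel();
        FakeChannel ch1 = new FakeChannel();
        FakeChannel ch2 = new FakeChannel();
        ws.owner = ch1;
        ch1.owner = ch2;
        ch1.closeClearingOwner(); // owner cleared: ch2 is cut off from ws
        return reachable(ws);     // only ws and ch1 remain reachable
    }
}
```

In this model, clearing the owner on close leaves only two reachable nodes; without the clear, all three would stay reachable from the long-lived root.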

@violetagg
Member

@vitjouda Thank you for your further investigation. Let's continue with it. When a connection is upgraded to websocket, we mark it as non-persistent; it happens in the code referenced below, can you check it? This is important because non-persistent channels are never returned to the connection pool:


So it is correct that when we don't need the connection and it is non-persistent, we wait for the close event here:

Application calls S1 (reusing pooled CH1), and then calls S2 (opening another pooled EpollChannel CH2)
now the websocket EpollChannel is linked to CH1 (because the reactor chain still captures original HttpClientOperations), CH1 is linked to CH2 (using connectionOwner attribute)

I don't understand this

@vitjouda
Author

The WS connection is made from another HttpClient, which does not use a connection pool (it is configured with ConnectionProvider.newConnection()). I have a different HttpClient that uses pooling for plain HTTP requests to the same server, but never for WS. The only reason the WS matters in this case is that it is always active during the application lifetime and therefore provides a path from the GC root to the next EpollSocketChannel; it's visible in the image I posted. I believe that if I killed that WS, it would free the rest of the memory held by it. Or at least that's what MAT tells me.

Another thing I managed to simulate just now: when the connection is closed BEFORE the response, it does not get cleared and still has connectionOwner set. I believe this might be part of the problem.

I'll try to explain better. The whole reason the leak happens is that there are extremely long (effectively unbounded) reference chains of EpollSocketChannel -> EpollSocketChannel -> another... At the moment I make the memory dump, they are all inactive, yet linked together by the connectionOwner attribute. I believe these chains are created by reusing a pooled EpollSocketChannel before it is closed.

I tried to make a quick diagram that might help.

[image: netty diagram]

In this diagram, session 1 first uses Pooled Channel A to call Server A, receives the response, and opens a WS channel to another server (it doesn't really matter which one). After this, the WS channel is always active on purpose and is linked via a Reactor reference chain to Pooled Channel A (as seen in the dump). Pooled Channel A is still active at this time, available for reuse.

Now session 2 starts by calling Server A. Because Pooled Channel A is free, it is picked and used to make the call. After it receives the response, the session calls Server B using a new Pooled Channel B and does some additional work. At this point, Pooled Channel A is linked via the Reactor sink (connectionOwner) to Pooled Channel B and the rest of the reactive pipeline.

At this point there exists a reference chain from the WS channel to Pooled Channel A, and from Pooled Channel A to Pooled Channel B via connectionOwner. If at this point Pooled Channel A is closed without clearing the connectionOwner attribute (I managed to replicate this by forcibly closing the socket on the target server), it forever captures Pooled Channel B and the rest of the pipeline, despite them belonging to two independent sessions. This repeats when another session uses the still-active Pooled Channel B and links it to yet another channel, and so on.
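The scenario above reduces to a small plain-Java sketch (hypothetical types, not reactor-netty code): closing channel A without clearing its owner reference leaves channel B, from an unrelated session, reachable from the long-lived WS channel.

```java
class RetainedChainSketch {
    // Hypothetical stand-ins; "owner" mimics the connectionOwner attribute.
    static class FakeChannel {
        boolean closed;
        FakeChannel owner;
    }

    // Count channels reachable from the root by following owner references.
    static int reachable(FakeChannel root) {
        int n = 0;
        for (FakeChannel c = root; c != null; c = c.owner) n++;
        return n;
    }

    // WS -> Pooled Channel A -> Pooled Channel B, with A closed but its
    // owner never cleared: B stays strongly reachable from the WS root.
    static int leakyDemo() {
        FakeChannel ws = new FakeChannel(); // long-lived WS channel
        FakeChannel a  = new FakeChannel(); // Pooled Channel A
        FakeChannel b  = new FakeChannel(); // Pooled Channel B
        ws.owner = a;    // WS pipeline still captures A's HttpClientOperations
        a.owner = b;     // A was reused, then linked to B's pipeline
        a.closed = true; // A closed (e.g. by the remote), owner NOT cleared
        return reachable(ws); // all three nodes are still reachable
    }
}
```

Here the closed node A acts as a bridge: as long as its owner reference survives the close, everything downstream of it is pinned by the live root.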

@violetagg
Member

violetagg commented Jul 31, 2024

@vitjouda Please show the code for how you are doing all this linking. You should never keep a reference to a pooled channel. The idea of a pooled channel is to serve a given request/response and then be released. How exactly do you open a WS connection from an HTTP connection that is scheduled for recycling?

@vitjouda
Author

vitjouda commented Jul 31, 2024

I am testing it in Kotlin. This is enough to keep the pooled connection referenced from the EpollSocketChannel used for WS.

// created at the class level and reused for all WS calls
val httpClient = HttpClient.create(ConnectionProvider.newConnection())
    .metrics(true, Function.identity())

Mono.just("uri")
    .flatMap {
        // 'it' is ignored, but in the real app I call the client inside a flatMap operator,
        // so I kept it here to simulate that.
        // 'client' is the classic WebClient setup with pooled connections, as posted previously.
        client
            .get()
            .uri("/data")
            .retrieve()
            .bodyToMono(Data::class.java)
    }
    .flatMap { result ->
        // 'result' is ignored in this example, but in the real app it is needed
        // to open the WS (the upgrade request needs a session header).
        ReactorNettyWebSocketClient(httpClient).execute(URI("ws://localhost:8083/ev")) {
            it.receive()
                .`as`(it::send)
                .then()
        }
    }
    .subscribe({}, {}, { println("WS Closed") })

@violetagg
Member

violetagg commented Jul 31, 2024

@vitjouda I cannot comment on the Kotlin; in Java, this

.flatMap {
        // 'it' is ignored, but in the real app I call the client inside a flatMap operator
        // 'client' is the classic WebClient setup with pooled connections, as posted previously
        client
            .get()
            .uri("/data")
            .retrieve()
            .bodyToMono(Data::class.java)
    }

will consume the response and release the connection to the pool; you should be able to see:

Channel cleaned, now: 0 active connections, 1 inactive connections and 0 pending acquire requests.

If I were to edit your picture above, it would be something like this (two independent connections, where the WS consumes the result of the request made with pooled channel A):

pooled channel A
|
response result
|
ws connection

@vitjouda
Author

vitjouda commented Jul 31, 2024

Yes, it will release the connection to the pool, but there is still a reference chain keeping it from being garbage collected. I will post my partial reproducer together with a memory-dump snippet where you can see the path from the WS to an unrelated channel. I probably cannot manage it today, but you can see the reference chain in this image. The selected channel is the WS; the last one is the channel used to make the HTTP call. So even though the response is consumed, there is still a reference preventing garbage collection.

[image: reference chain from WS channel]

Actually, I think reusing the channel without clearing the connectionOwner attribute afterwards is the problem. I believe that if this attribute were cleared in every case where the channel is closed (currently it is not in some cases, for example when the remote closes the connection), the memory leak would stop.


github-actions bot commented Aug 8, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.


Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 15, 2024
@violetagg violetagg removed type/bug A general bug for/user-attention This issue needs user attention (feedback, rework, etc...) status/need-feedback labels Aug 15, 2024
@violetagg violetagg added the status/cannot-reproduce We cannot reproduce this issue label Aug 15, 2024
violetagg added a commit that referenced this issue Oct 17, 2024
…3459)

- When terminating detach the connection from request/response objects
- Move registration to terminate event outside of doFinally when we have the real connection
- Obtain the event loop outside of doFinally when we have the real connection
- Dispose the connection only if it is not disposed

Related to #3416, #3367