Envoy OOM problem with TLS #3592
Comments
It's hard to say without knowing where the memory is being used. I would take a look at some stats to see what is growing, and you can also enable heap profiles via tcmalloc: https://gperftools.github.io/gperftools/heapprofile.html
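For reference, the gperftools page linked above amounts to setting an environment variable on a tcmalloc-linked binary; a minimal sketch, assuming the Envoy build links against gperftools' tcmalloc (paths are illustrative):

```sh
# Dump heap profiles to /tmp/envoy_heap.0001.heap, .0002.heap, ... as the
# process allocates memory. The prefix and config path are assumptions.
HEAPPROFILE=/tmp/envoy_heap envoy -c /etc/envoy/envoy.yaml
```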
Thanks @mattklein123.
@thileesf the stat you point out is a counter, so it loosely tracks with the rate of incoming connections. It's not a gauge. For stats I would be looking at increasing active connections, increasing buffered data, etc. Those might tell you if there is some type of unbounded growth that could be fixed by config. Otherwise, it's a memory leak (though I don't know of any open leak issues). For that, I would just startup the tcmalloc heap profiler and take differential heap dumps and then compare them. It should be fairly obvious what the issue is.
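A sketch of the differential comparison described above, using gperftools' pprof; the binary path and dump names are illustrative:

```sh
# Show only what grew between the two dumps; the earlier dump is the baseline.
pprof --text --base=/tmp/envoy_heap.0005.heap /usr/local/bin/envoy /tmp/envoy_heap.0050.heap
```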
@mattklein123 thank you for the pointer. I ran with the heap profiler, and it shows:
The full file is attached here: Note that envoy hasn't crashed yet, but it is getting close to exhausting system memory. I'll update with a newer differential when it gets to OOM.
@thileesf this looks to me like there is a large amount of data that is being buffered that is not being drained. I think it's unlikely there is a leak here though it's possible. Did you check buffered data stats to see if that accounts for the memory? Do you have a lot of stalled streams that aren't getting written out somehow due to no timeouts or otherwise? cc @alyssawilk @PiotrSikora
Yeah, that's really frequently OOMing - as Matt says, I'd be surprised if we have a leak that bad that no one else has picked up on. I'd definitely check your connection stats - presumably you added the new listener to handle new traffic flows, so maybe there are just too many connections? If it is just one direction draining slowly you could make the per-connection buffer limits smaller. #373 will also help, and we finally have someone with time to actually pick that up, so there should be progress soon!
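For reference, the per-connection buffer limit mentioned above can be lowered on both listeners and clusters; a minimal sketch against the v2 config API, with illustrative names and values (the default is 1 MiB per connection):

```yaml
static_resources:
  listeners:
  - name: tls_ingress                          # illustrative name
    per_connection_buffer_limit_bytes: 32768   # down from the 1 MiB default
    # ... address, filter chains, TLS context, etc.
  clusters:
  - name: local_collector                      # illustrative name
    per_connection_buffer_limit_bytes: 32768
    # ... endpoints, timeouts, etc.
```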
I doubt that's it, but seeing as this is TLS & libevent related, could you try reverting #2680? Maybe there is some issue with
This is the revision we are working with:
@eziskind for visibility as well.
@mattklein123 @alyssawilk it does not look like there are a lot of connections. The buffer stats also do not show anything that can explain the multi-GB growth. See the screenshots below (the green line is the envoy being heap profiled). The processes consistently hit OOM in 11-12 hours.
I am also adding a couple of other stats that seem interesting, but these may not be relevant.
@mattklein123 how do I check "if there are a lot of stalled streams that aren't getting written out somehow due to no timeouts" as in your comment? Let me also see if I can adjust the per-connection buffer limits to be smaller, as @alyssawilk mentioned.
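One rough way to eyeball this is the standard connection, request, and flow-control stats via the admin interface, assuming it is enabled (port 9901 here is an assumption):

```sh
# Active downstream connections/requests and flow-control pause counters;
# a steadily growing active count or pause counter points at backed-up streams.
curl -s http://localhost:9901/stats | \
  egrep 'downstream_cx_active|downstream_rq_active|flow_control_paused_reading'
```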
I have two questions:
@thileesf it's highly likely that the upstream (collector) is not keeping up and is backing up, causing a lot of buffering in Envoy. With the full stat dump it should be more clear what needs to be tuned. I really doubt this is an actual leak.
@mattklein123 we are investigating the collector also. I'll update here if/when we find any issues. In the meantime, here are the envoy stats:
@thileesf how big is each of the requests you are sending to the collector? By my math they are very large. Also, please start running master and provide fresh differential profiles and stat dumps. It's not clear to me what's going on here, but I don't want to do further debugging unless you are running master. Thank you!
Closing, please reopen if this still shows up on master and you have the answers to the above questions.
@mattklein123
Hi, we are running into an `envoy.server.memory_heap_size` unbounded growth problem, causing the kernel to kill the envoy process. Below are the details:

We had a lightstep-collector deployment in a 4 node cluster (r4.large EC2 instances, with 15G RAM), each of which also had envoyproxy running on it (without processing any traffic). The collectors directly received system traces (from about 40 servers) on a secure port over grpc. The hosts were running with about 9GB system memory utilization.

Then we added a TLS listener to the envoyproxy on the same 4 hosts to intercept the traces and route them to the localhost lightstep collector on a plaintext port. The traffic was correctly going via envoy to the collector. But after a point envoy crashed due to out of memory.

Looking at `envoy.server.memory_heap_size`, we see it linearly increasing at 100MB/hr for about 6 hours, and then going up at a faster rate to reach ~9GB in under 12 hours, at which point the kernel killed envoy due to out of memory (system & the other processes accounting for the remaining memory).

Is there a memory leak in envoy, or is there a config I can set to throttle envoy or control memory buffers? I am not reporting a bug because I am not sure if this is a config issue.
Over the whole period, the CPU utilization on the 4 hosts was fairly low, hovering around 18%.
The relevant envoy config is:
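(The actual config is not reproduced in this thread. Purely as an illustrative sketch of the setup described above, a TLS listener terminating gRPC traffic and proxying it to a plaintext collector on localhost, written against the v2 API of that era; every name, port, and path below is an assumption, not the reporter's real config.)

```yaml
static_resources:
  listeners:
  - name: collector_tls                    # illustrative
    address:
      socket_address: { address: 0.0.0.0, port_value: 9443 }
    filter_chains:
    - tls_context:
        common_tls_context:
          alpn_protocols: ["h2"]           # gRPC clients negotiate HTTP/2
          tls_certificates:
          - certificate_chain: { filename: "/etc/envoy/certs/chain.pem" }
            private_key: { filename: "/etc/envoy/certs/key.pem" }
      filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: collector_ingress
          codec_type: AUTO
          route_config:
            virtual_hosts:
            - name: collector
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_collector }
          http_filters:
          - name: envoy.router
  clusters:
  - name: local_collector                  # illustrative
    connect_timeout: 1s
    type: STATIC
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}             # collector speaks gRPC over plaintext HTTP/2
    hosts:
    - socket_address: { address: 127.0.0.1, port_value: 8360 }
```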