receive: v0.18.0 memory leak (v0.16.0 regression) #3726
Comments
Rechecked on:
Result still the same (v0.16.0 before 17:43):
Sorry for the late response @sepich, is this still an issue with the latest Thanos version?
Result for latest v0.18.0 is above.
Possible duplicate of #3471
I am also facing this issue with the same behavior, and v0.16.0 fixes the possible leak. As we can see below: until 9h40 all nodes were running v0.18.0, at which point I had all receive nodes restarted. At 9h45 all nodes were back up, still on v0.18.0, and we can see a rapid memory increase. At 10h17 I had 2 nodes downgraded (ip-10-184-125-10 to v0.17.2 and ip-10-184-125-13 to v0.16.0). At 10h35 the v0.17.2 node starts to increase its memory usage, following the v0.18.0 nodes' behavior, but the v0.16.0 node keeps memory usage stable. Here we can see the memory heap usage:
Ack, so this means something changed between 0.17.2 and 0.18? Let's check the v0.19.0-rc.0 I am cutting this week to see if anything helps there; then we can try to look closer at the commit log and bisect at the commit level, especially if you can amazingly reproduce the problem 🤗
@bwplotka I believe something changed between 0.16.0 and 0.17.2 and then apparently got worse in 0.18.0. On 0.17.2 we can see it takes longer to start building up memory usage, but it does start to increase, following the same behavior as 0.18.0. I'll update one of the instances to v0.19.0-rc.0 and get back to you with some more info. Right now I have all nodes on v0.16.0 and memory is as stable as it can get: at 11h18 I had all instances downgraded, and at 11h34 metrics ingestion was back on.
I would start bisecting the commits between 0.17.2 and 0.18.0 🤗 That would be helpful.
I just ran a couple more tests and did not have to go very far on the image tags to notice a pattern change in memory consumption. Here is a screenshot of the memory graphs: the green line is the test instance and the yellow line is the instance running 0.16.0. The first 3 big slopes are versions 0.17.0, 0.17.2 and 0.18.0 consecutively. Starting at 15h49 I'm using a different image tag; the pattern changes, and it looks like after every GC the consumption increases a little. I'll do some more testing tomorrow.
There are cool findings made by @svenwltr: |
We are experiencing the same issue (with both v0.17.2 and v0.18.0), but the downgrade to v0.16.0 seems to help as suggested. The fall in memory usage between 13:20 and 14:40 is caused by the instances being OOM-killed.
We are facing the same issue. Glad this issue exists. Looking forward to a fix!
Thanks a lot! Anyone can help, but I would love to find out what's wrong this week, ideally before 0.19.0; let's see. We got some profiles, so let's see if they are helpful:
Ideally we pinpoint the commit which introduced the regression 🤗 The problem will be if it's the TSDB update (it probably is).
Profiles captured via conprof:
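For context (this example is not from the thread): conprof scrapes the standard Go `net/http/pprof` endpoints, which Thanos components also serve on their HTTP port. A minimal, generic sketch of how any Go service exposes those endpoints; the address is arbitrary and this is not the actual Thanos wiring:

```go
package main

import (
	"log"
	"net/http"
	// Blank import registers the /debug/pprof/* handlers on http.DefaultServeMux.
	_ "net/http/pprof"
)

func main() {
	// Heap profiles are then available at http://localhost:6060/debug/pprof/heap
	// and can be scraped periodically by conprof or pulled ad hoc with `go tool pprof`.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```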
BTW, which metrics exactly are those memory graphs based on? (It matters, before Go 1.16.)
Memory used is
from cAdvisor and Memory Used based on thanos metrics is:
That metric is inflated, see https://www.bwplotka.dev/2019/golang-memory-monitoring/. Some amazing ideas from the Cortex experience are to use the following options (adding those env vars to the Thanos process):
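As background for the point above (this sketch is not from the thread): the Go runtime tracks several distinct memory numbers, and container-level RSS is not the same as the live heap, which is why a single "memory used" graph can be misleading. A minimal sketch of inspecting the runtime's own accounting:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// HeapAlloc: bytes of live heap objects; closest to what the program actually uses.
	// HeapInuse: heap spans currently in use, so it includes fragmentation overhead.
	// HeapReleased: memory returned to the OS; depending on the Go version and how it was
	// returned (MADV_FREE vs MADV_DONTNEED), it may still be counted in RSS for a while.
	// Sys: total virtual memory obtained from the OS.
	fmt.Printf("HeapAlloc=%dMiB HeapInuse=%dMiB HeapReleased=%dMiB Sys=%dMiB\n",
		m.HeapAlloc>>20, m.HeapInuse>>20, m.HeapReleased>>20, m.Sys>>20)
}
```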
Is this 10 million samples per scrape or total? 0.19.0-rc.1 also leaks for me with roughly 1 million samples per scrape. (Prometheus 2.22.2)
I am not sure that this bugfix, which was merged into the 0.16.0 release branch, made it back into the main branch.
@jmichalek132 I think you're right. It was never merged back into v0.17+ but instead entirely replaced with the ZLabel. I'm wondering if @bwplotka based that work on the fixed label or ignored it and did an entire rewrite?
What's the latest? 🤗
We discussed some of this at today's Contributor Hours: https://docs.google.com/document/d/137XnxfOT2p1NcNUq6NWZjwmtlSdA6Wyti86Pd6cyQhs/edit#heading=h.dmnpchivqkn9 I will try to look into this a bit.
Investigation: I found that #3327 is on v0.16.0 but not on master; it looks like the merge back to master never happened. The long-term fix was merged, but apparently not properly, as this issue is exactly the same as #3265. Long-term fix attempt no. 1: #3279
Thank you all for your patience and help. It's kind of silly, but the fix was already in a PR that was never merged: #3334. The up-to-date fix is available here: #3943 and will be part of v0.19.0 🚀 Some learnings for when we investigate such issues:
So it was easy to tell that this "quick fix" never made it to master 🤗 The fix was then as simple as porting #3334 (stripped of unrelated changes) and ensuring we have profiles that back up our thinking, e.g. these. Before the fix: https://share.polarsignals.com/68255aa/ Notice the big 200MB chunk that no longer exists after the fix 🤗
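To make the class of bug concrete (illustration only, with made-up names; this is not the actual Thanos or #3334 code): retaining a small slice that aliases a large decoded request buffer keeps the entire buffer reachable, while copying the few bytes that are actually needed lets the buffer be garbage-collected.

```go
package main

// retained simulates long-lived storage, e.g. labels kept around after a request is handled.
var retained [][]byte

// keepAliased returns a tiny sub-slice that still shares the backing array of the
// large request buffer, so the garbage collector must keep the whole buffer alive
// for as long as the sub-slice is reachable.
func keepAliased(request []byte) []byte {
	return request[:8]
}

// keepCopied copies the bytes it needs, so the large request buffer becomes
// unreachable and collectable once the caller drops it.
func keepCopied(request []byte) []byte {
	out := make([]byte, 8)
	copy(out, request[:8])
	return out
}

func main() {
	big := make([]byte, 64<<20)                   // stand-in for a decoded remote-write request
	retained = append(retained, keepAliased(big)) // pins the full 64 MiB buffer
	retained = append(retained, keepCopied(big))  // pins only an 8-byte copy
}
```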
Awesome news @bwplotka ! Looking forward to v0.19 |
I've tested 0.19-rc2 and the issue is fixed.
Done then. I also see capacity to reduce resource usage a lot further after v0.19.0. Let's release it and iterate. Thanks all for the help.
Thanos, Prometheus and Golang version used:
thanosio/thanos:v0.17.2
Object Storage Provider:
GCS
What happened:
I'm trying to upgrade from v0.16.0 to v0.17.2 and see that thanos-receive memory is "leaking":
Here are 3 thanos-receive pods in a hashring with equal load; the red and blue lines are v0.16.0. At 18:30 I restart thanos-receive-0 (orange line) as v0.17.2, then at 19:33 I restart it back as v0.16.0.
GC load profile also differs between versions:
Related:
#3265
What you expected to happen:
Stable memory usage.
How to reproduce it (as minimally and precisely as possible):
Only reproducible in production with 80k samples/s per thanos-receive pod.
Full logs to relevant components:
Attaching pprof heap.zip, right before second restart.
heap.zip
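As an aside (not from the issue itself): a heap profile like the attached one is a standard Go runtime/pprof heap dump. For a running Thanos process it is typically fetched from the /debug/pprof/heap HTTP endpoint; in a plain Go program the equivalent dump can be written like this (file name and the explicit GC call are illustrative choices):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // update heap statistics before dumping the profile
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
	// Inspect with: go tool pprof heap.pprof
}
```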