runtime: excessive memory use between 1.21.0 -> 1.21.1 due to hugepages and the linux/amd64 max_ptes_none default of 512 #64332
Perhaps try looking at profiling.
@seankhliao: we did. The OOM killer incident was a combination of high memory usage and a bug on our side, which we fixed. After that fix, memory profiles would report only small amounts for 'inuse_space', in the 300-400 MB range, while our process's RSS would be several GB. More importantly, the RSS of our process would grow over time while the profiles reported no extra used memory. We could definitely optimize allocations at large, but the fact is: compiling the exact same code with 1.19 produces a binary which doesn't leak. [edit: we will watch for the "leak" part over the next few days, but the system memory footprint is definitely lower, and shrinks on occasion, which didn't happen with go 1.21]
@seankhliao: I see the WaitingForInfo label; do you have a more specific request in mind?
@seankhliao (sidenote: I work with @LeGEC): we did extensive memory profiling of our processes over several weeks (and can provide full data privately if necessary), but from the point of view of the memory profiler nothing was leaking and nothing would explain the constant growth in memory usage, most of it never released to the OS.
And for completeness: we also tried various combinations of debug.FreeOSMemory() / GOMEMLIMIT=1500MiB. Unless someone from the Go compiler/runtime team has a better suggestion, we are planning to make a test run with go 1.20.0 tomorrow and update this issue with the corresponding numbers.
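For reference, a minimal sketch of the kind of mitigation described above, with the limit set programmatically instead of via the GOMEMLIMIT environment variable (the 1500 MiB value comes from the comment; the 5-minute interval is an arbitrary choice, not something the thread specifies):

```go
package main

import (
	"runtime/debug"
	"time"
)

func main() {
	// Equivalent of GOMEMLIMIT=1500MiB, set from code.
	debug.SetMemoryLimit(1500 << 20)

	// Periodically force a GC and return as much memory as possible to
	// the OS. The 5-minute interval is an arbitrary choice for the sketch.
	go func() {
		for range time.Tick(5 * time.Minute) {
			debug.FreeOSMemory()
		}
	}()

	// ... rest of the service ...
	select {}
}
```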
I remember some changes regarding huge pages on Linux that were backported to 1.21; that may be related. It's worth reading the GC guide section on THP and GOGC: https://go.dev/doc/gc-guide#Linux_transparent_huge_pages
cc @mknyszek
@mauri870 on our production system:
If huge pages are indeed involved: is there a way to disable transparent_hugepage for a single process / without root access? (Alternatively, is there a way to patch the Go compiler to prevent transparent_hugepage usage?)
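For what it's worth, one way to ask the kernel to avoid transparent huge pages for a single process, without root and without touching the Go toolchain, is the PR_SET_THP_DISABLE prctl. This is a sketch of that approach (not something the Go runtime does itself), using golang.org/x/sys/unix; it needs Linux >= 3.15 and should run as early as possible:

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func init() {
	// Ask the kernel not to back this process's memory with transparent
	// huge pages. No root privileges are required.
	if err := unix.Prctl(unix.PR_SET_THP_DISABLE, 1, 0, 0, 0); err != nil {
		log.Printf("PR_SET_THP_DISABLE failed: %v", err)
	}
}

func main() {
	// ... service as usual ...
}
```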
Memstats of our go1.19 program:
RSS of process: 1158688 kB (so ~1.16 GB)
The huge-page-related changes can't be the culprit, because huge pages were only forced by the runtime between Go 1.21.0 and Go 1.21.3 (inclusive). Go 1.21.4 no longer adjusts huge page settings at all. I don't see any of those versions in the conversation above.
How are you measuring RSS? Also, as a side note, "Sys" is the virtual memory footprint. It's not going to match up with RSS; "Sys - HeapReleased" will be much closer.

Hmm... is it possible this is related to a regression in a specific standard library package, for instance? IIRC we didn't see anything like this back when Go 1.20 was released, so it's a bit surprising. It's still possible it's related to the compiler/runtime, so I don't mean to rule it out. Is it possible to reproduce this on a smaller scale? The difference seems fairly big and "obvious", so if you can bisect down to the commit between Go 1.19 and Go 1.20 that causes it, that would immediately tell us what happened.
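A minimal sketch of that comparison, printing the runtime's own footprint estimate (Sys - HeapReleased) next to the kernel's VmRSS; the MiB formatting is illustrative:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Sys=%d MiB HeapReleased=%d MiB Sys-HeapReleased=%d MiB\n",
		m.Sys>>20, m.HeapReleased>>20, (m.Sys-m.HeapReleased)>>20)

	// The kernel's view of the same process, comparable to `ps aux`.
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			fmt.Println(sc.Text())
		}
	}
}
```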
@mknyszek RSS was measured by the simple output of ps aux. The value mentioned in the initial post may have been slightly wrong indeed, but not by much: the RSS of our process right before restarting the service with our version compiled with go 1.19 was 4397688 (with a VSZ of 9107484). The same production process with go 1.19.13 currently gives the following values, much more consistent with the actual allocations reported by profiling:
@mknyszek yes, we also thought it possible that this could be related to changes in some standard library packages. I can share - privately - a heap dump obtained using the /pprof/heap endpoint when the relevant service was on go 1.21.4, if that can help. Bisecting between 1.19 and 1.20 is a possibility, and we can do that, but we can only try one version per day in order to reproduce, since the issue looks dependent on user / service activity... Can you suggest a list of commits to try?
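For context, a typical way such a /pprof/heap endpoint is exposed is via the standard net/http/pprof package; this is a generic sketch (the port and the exact route in the service above are assumptions):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve the profiling endpoints on a separate, internal-only port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... service as usual ...
	select {}
}
```

A heap dump can then be captured with `go tool pprof http://localhost:6060/debug/pprof/heap`, or fetched with a plain HTTP GET and shared as a file.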
@vanackere There were a lot of commits that went into Go 1.20; it would take quite a while to enumerate them and try to guess what went wrong. 😅 The commit range is e99f53f..9088c69, which contains 2018 commits.
Although... it occurs to me that Go 1.20 had some significant crypto regressions that could make TLS (and/or other similar operations like validating JWT tokens) slower. I could imagine that perhaps that could cause requests to pile up in your service? I would've expected them to have been mitigated somewhat in Go 1.21, but perhaps that's worth looking into? See #63516 and #59442 maybe? (EDIT: I accidentally put the wrong issue number in a moment ago. It should be correct now.) On that note, what does a CPU profile say?
One update: we compiled and deployed our server using go 1.20.0, and the leaky behavior is not observed. We're debating between "testing" 1.20.11 (the latest 1.20) and 1.21.0. Would you have some insight about these two versions? For example: were there any fixes or updates to memory handling in one of the 1.20 minor versions?
Update: we tried go 1.20.11 today, and still have a reasonable (for us...) RSS value. The behavior change therefore happens between 1.20.11 and 1.21.4; updating the title accordingly.
@Nasfame Unfortunately neither of us can bisect because we can't run the reproducer. @LeGEC Thanks for the updates. I think the only thing I'll say is it seems like the crypto-related regressions are unrelated. At this point bisection and/or a reproducer seems like it would be the most fruitful path forward. As before, it may be worthwhile to look at a CPU profile before and after, which might reveal other unexpected changes in application behavior that lead to the culprit.
@mknyszek: yes, we're trying to run parts of our services in isolation; unfortunately, none of the would-be culprits seems to be guilty alone (or, more probably, we haven't identified the right culprit). Today we rebuilt our server with gotip (golang master version 0c7e5d3) and deployed it, and the same buggy memory behavior clearly shows up.
@mknyszek: One extra piece of information, in case it helps:
Just to be clear, I didn't mean that in a bad way -- I definitely understand the difficulty of creating reproducers (especially when system details might be involved). 😅 And unless the reproducing code is open source, it's difficult and/or impossible to share too many details. I was just clarifying for @Nasfame. Thanks for your continued communication here!
This is something I would expect with a kernel forcing address space randomization. The race detector (TSAN) requires all heap memory to exist in a very specific range of addresses. This is unfortunately not easy to change. IIUC it's fairly fundamental to the technique.
Acknowledged. That does seem to be one big thing that's unique about your system, and any other details you can share about what's activated would be helpful. I agree that it seems plausible that this issue is due to some unfortunate interaction between Go 1.21 and your specific environment.
To give a clearer view of how visible the memory misbehavior is on our server: here is a graph of the RAM usage (RSS) of our 3 main services over the last 2 weeks (the green line is the sum of the 3 others). Grayed zones are weekends (our service is much less used on weekends).
This comment was marked as off-topic.
@Nasfame I appreciate that you're trying to help create a reproducer, but I don't think replicating the high-level details of the application is going to result in a reproducible case. There are plenty of high-load services (microservices, monoliths, etc.) on common platforms that are running with Go 1.21, and to our knowledge they haven't experienced the same issue. Therefore, the issue has to lie in something very specific to @LeGEC's application, either in how their application uses Go or in how Go is interacting with their environment (even if the issue is ultimately in the Go runtime). The relevant details appear to be that (1) it reproduces in Go 1.21 and not Go 1.20, and (2) the system is running with a bunch of security features enabled that to my knowledge aren't super common (like ASLR). As a side-note, the number of lines of code in an application's source isn't really going to be indicative of much in general.
It was a mistake. I marked the comment off-topic.
Blind shot: did something land in go1.21 that changed the way a Go process returns memory to the OS?
Regarding the allocated pages: what is the expected gain from backing these pages with huge pages (or not)?
Go 1.21.1 and Go 1.22 have ceased working around an issue with Linux kernel defaults for transparent huge pages that can result in excessive memory overheads. (https://bugzilla.kernel.org/show_bug.cgi?id=93111) Many Linux distributions disable huge pages altogether these days, so this problem isn't quite as far-reaching as it used to be. Also, the problem only affects Go programs with very particular memory usage patterns.

That being said, because the runtime used to actively deal with this problem (but with some unpredictable behavior), it's preventing users that don't have a lot of control over their execution environment from upgrading to Go beyond Go 1.20.

This change adds a GODEBUG to smooth over the transition. The GODEBUG setting disables transparent huge pages for all heap memory on Linux, which is much more predictable than restoring the old behavior.

For golang#64332.
Fixes golang#64561.

Change-Id: I73b1894337f0f0b1a5a17b90da1221e118e0b145
Reviewed-on: https://go-review.googlesource.com/c/go/+/547475
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit c915215)
We can confirm that the GODEBUG=disablethp=1 setting from this patch fixes the memory behavior for us. [edit] For completeness: we tested this patch on top of version 1.21.5, which is the version referenced by this branch in our fork: test-fix-64332
It's less about backing with huge pages and more about disabling huge pages for small heaps. These mappings tend to be quite large in comparison to small heaps, which might lead to the kernel backing them with a huge page, leading to large proportional overheads.

The main reason to keep the mitigation in this case is that the runtime doesn't ever return any of this memory to the OS, so even if the Linux configuration doesn't have a high max_ptes_none [...]. If it were up to me we wouldn't call [...].

Overall, I think the cost/benefit just works out in favor of keeping this particular mitigation. It helps small programs stay small without really hurting anyone else in practice.
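To check whether the kernel default discussed here applies on a given machine, the khugepaged setting can be read straight from sysfs; a minimal sketch (the path is the standard Linux location):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// 512 is the upstream default discussed in this issue; with it,
	// khugepaged may collapse mostly-empty page ranges back into huge
	// pages, which can undo memory the runtime has returned to the OS.
	b, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none")
	if err != nil {
		fmt.Println("could not read max_ptes_none:", err)
		return
	}
	fmt.Println("max_ptes_none =", strings.TrimSpace(string(b)))
}
```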
Our production applications also experienced similar behavior, and we used the GODEBUG=disablethp=1 environment variable as suggested in this issue. We still couldn't figure out how to validate this fix in the production system, as the issue is reproduced only on long-lived nodes [>15 hours]. We don't want to wait a long time to validate this fix and only then explore alternative fixes. Is there any information we can read from system files to validate that THP is disabled for a particular Go process?
One thing you can do is dump /proc/<pid>/smaps for your process and check that the heap mappings have the nh ("no huge page") flag set in their VmFlags.
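A small sketch of that check done from inside the process itself: scan /proc/self/smaps and report the mappings whose VmFlags include nh (huge pages disabled). Filtering down to just the heap mappings is left out here:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/smaps")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var region string
	count := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "VmFlags:") {
			if strings.Contains(line, " nh") {
				count++
				fmt.Println("nh set for", region)
			}
		} else if len(line) > 0 && strings.ContainsRune("0123456789abcdef", rune(line[0])) {
			// Mapping header: "start-end perms offset dev inode [path]".
			region = line
		}
	}
	fmt.Println("mappings with THP disabled:", count)
}
```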
Yes, I can see the nh attribute in a few memory regions. We explicitly disabled the flag in the Go code as well. To rule out any memory leaks, we deployed one node with a binary built with go1.18 and 2 nodes with the go1.21 toolchain. The issue persists only in the binaries built with go1.21; the other node's memory stays flat. See the attached screenshot for reference.
On a side note: is this fixed in Go 1.22 by any chance? We are experimenting with the new version anyway. I will update after observing the metrics for a day.
@ihtkas If you're disabling huge pages and still seeing a memory increase, then that is something else and is independent of this issue. Please file a new issue. I have updated the issue title to be more precise.
I'm not sure what you mean. As of Go 1.21, the runtime is no longer going to try and work around the max_ptes_none default of 512.
Increase memory limits for steps running openshift-tests to account for increased memory consumption. This change is necessary due to the removal of a workaround for a Linux kernel bug in golang versions 1.21+. See golang/go#64332.
This was increased recently for golang/go#64332 but metal-ipi bm jobs persist in being OOM'd. Bump it further.
Go 1.21.1 and Go 1.22 have ceased working around an issue with Linux kernel defaults for transparent huge pages that can result in excessive memory overheads. (https://bugzilla.kernel.org/show_bug.cgi?id=93111) Many Linux distributions disable huge pages altogether these days, so this problem isn't quite as far-reaching as it used to be. Also, the problem only affects Go programs with very particular memory usage patterns.

That being said, because the runtime used to actively deal with this problem (but with some unpredictable behavior), it's preventing users that don't have a lot of control over their execution environment from upgrading to Go beyond Go 1.20.

This adds documentation about this change in behavior in both the GC guide and the Go 1.21 release notes.

For golang/go#64332.

Change-Id: I29baaffcc678d08255364a3cd6f11211ce4164ba
Reviewed-on: https://go-review.googlesource.com/c/website/+/547675
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@LeGEC, we use go1.22.5 and ran into the same issue. Have you resolved it? Could you give us some suggestions?
@lojies: in our setting, yes, setting GODEBUG=disablethp=1 fixed the issue. Note however that we first ran several memory profiles on our app to rule out any "regular" memory leaks in our code -- and we found and fixed several candidates for excessive memory consumption by doing so.
Thank you for your suggestion. We tried, but it had no effect.
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

What did you do?
Our production service was recently shut down by the OOM killer, which led us to inspect its memory usage in detail.
We discovered that, when running, our process had a memory consumption (RSS) that grew over time and never shrank.
Sample runtime.MemStats report:
At that time, the reported RSS for our process was 3.19 GB.
We looked at our history in more detail and observed a big change in production behavior when we upgraded our Go version from 1.19.5 to 1.20.0. We unfortunately didn't notice the issue at that time, because we upgrade (and restart) our service on a regular basis.
To confirm this theory, we have downgraded our go version back to 1.19.13, and our memory consumption is now small and stable again.
Here is a graph of the RSS of our service over the last 48h; the drop corresponds to our new deployment with go 1.19.13:
It should be noted that our production kernel is a hardened kernel based on grsecurity 5.15.28, which may be related to this issue (randomized heap addresses?).
What did you expect to see?
A constant and stable memory usage.
What did you see instead?
The Go runtime seems to not release memory back to the system.
Unfortunately, we have only been able to observe this issue on our production system, in production conditions.
We were not yet able to reproduce the issue on other systems, or by running isolated features in test programs deployed on our production infrastructure.