runtime: huge page fragmentation on Linux leads to out of memory #12233
Comments
@bmhatfield can you please run the process with GODEBUG=gctrace=1 set?
The gctrace log will show both. By the way, this information is always being collected, just normally suppressed, so there is little impact in running it in production.
Can do. In the meantime, I am not seeing any odd memory usage (stack, heap, or just general system memory). I have graphs of everything available in https://github.com/bmhatfield/go-runtime-metrics/blob/master/runstats.go if you have any specific requests. Here's a high-level view of system memory (plotted against a metric that demonstrates the crash occurrences).
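For context, a minimal sketch of what that kind of runtime metrics sampling can look like (an illustration using runtime.ReadMemStats, not the actual runstats.go code):

package main

import (
	"log"
	"runtime"
	"time"
)

// sampleMemStats logs a few heap-related fields from runtime.MemStats at a
// fixed interval. Note that runtime.ReadMemStats briefly stops the world,
// so the interval should not be too aggressive.
func sampleMemStats(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		log.Printf("heap_alloc=%d heap_sys=%d heap_released=%d num_gc=%d goroutines=%d",
			m.HeapAlloc, m.HeapSys, m.HeapReleased, m.NumGC, runtime.NumGoroutine())
	}
}

func main() {
	go sampleMemStats(10 * time.Second)
	select {} // stand-in for the real server's work
}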
CC @RLH @aclements
@bmhatfield, do you happen to know the approximate heap size (or have the equivalent graphs) for your service when it was running on Go 1.4? It looks like your heap size graph is going up even between restarts (each restart cycle quickly climbs to a larger heap size than the previous cycle had when it crashed). That's a bit perplexing.
Hi @aclements - here's an annotated graph of the normal heap size for this service. My general takeaway is that the heap size appears to be within expected bounds for this service, and the normal operating size does not appear to have changed as of 1.5. This service is essentially an HTTP cache using the go-cache library with some additional features focused on analyzing the traffic it receives (those features being the primary reason this service exists). When this service is deployed (or re-spawned after a crash), it exhibits some oscillation for a short period of time, related to the initially-synchronized 10 minute cache entry TTLs, which explains the heap growth and then reduction that you're seeing. This behavior normalizes relatively quickly, and was "normalized" when these crashes occurred.
Thanks for the plot and explanation. Is the traceback on the crash always/usually the same? In the one you posted, it's failing in a surprisingly specific place.
Great question, sorry I did not think of that before. Doing some digging now, I see one that looks like:
and another like this:
Please note that at the time of this claimed "out of memory", the system had been using less than 50% (~7GB) of the available memory (~15GB instances), and the process had been able to claim more memory from the system both before (on 1.5) and after (still 1.5) the occurrences of those crashes.
Here's another example of the 'runtime out of memory' traceback, from a different host at a different time:
And another example in
And finally, another example of the pthread_create crash:
@aclements @davecheney Caught one with the GC logging on! And with a new backtrace "reason". I've included an arbitrary number of GC log messages "back" from the crash:
Are you able to track the number of threads active in that process? Can you try setting GODEBUG=netdns=go to force the net package to only use the pure Go DNS resolver?
Yep, can do. This generally does use a lot of threads, yes, as it's very network-IO heavy.
That's interesting. Network IO should not consume excessive threads in Go programs because of the integrated poller.
@davecheney This program doesn't do any file IO at all, only network (http serving, http client, and a TCP client), and it's using about 4100 threads:
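For reference, a minimal sketch of one way to count the OS threads of a Go process on Linux, by parsing the Threads: field of /proc/self/status (an illustration, not the tooling used above):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// osThreadCount returns the number of OS threads in the current process by
// parsing the "Threads:" line of /proc/self/status (Linux-specific).
func osThreadCount() (int, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		if strings.HasPrefix(line, "Threads:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Threads:")))
		}
	}
	return 0, fmt.Errorf("Threads: line not found: %v", s.Err())
}

func main() {
	n, err := osThreadCount()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("threads:", n)
}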
@bmhatfield did you have a chance to try that GODEBUG change I suggested?
I hadn't yet, will roll it out shortly.
Hi, We've been running with
and
Sorry for the delay in gathering this data. |
Is that the entire crash report? Can you post the full thing somewhere?
It is not, I cut off everything below what seemed to be the most relevant part of the panic. The rest of the crash report has info on roughly 3,000 goroutines, so I thought it prudent to minimize. I'll post up the whole thing in a gist.
Thanks, the full report should make it possible to figure out where all the native threads are going.
Unfortunately these traces are too large for Chrome to paste into a gist, so here's a couple of share links to them in gzip form from Google Drive: https://drive.google.com/file/d/0BxGgMbrngfJbOXE1a2R2MjRpeW8/view?usp=sharing https://drive.google.com/file/d/0BxGgMbrngfJbVjh4VXNTdmo5eG8/view?usp=sharing
@bmhatfield Do you mean GODEBUG=netdns=go? Because GODEBUG=netdns won't do anything in particular.
Ah, yes, you are correct. I just checked /proc/PID/environ, and confirmed we are indeed running with GODEBUG=netdns=go.
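As an aside, GODEBUG=netdns=go selects the pure Go DNS resolver process-wide; in Go 1.9 and later (newer than the 1.5 builds discussed here) the same preference can also be expressed per resolver. A hedged sketch assuming such a newer release:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// PreferGo asks the net package to use its built-in DNS resolver rather
	// than the cgo-based system resolver (which ties up an OS thread per
	// in-flight lookup). Available since Go 1.9.
	r := &net.Resolver{PreferGo: true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "golang.org")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}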
I've been debugging a similar problem some teams are seeing inside Google. I assume this is on Linux? If you can, please try
and see if that helps. I bet it will. I am not sure why, but something we are doing differently compared to Go 1.4 seems to be confusing the Linux kernel memory overcommit logic, such that it decides there isn't enough memory on the machine to satisfy a completely reasonable request and starts returning ENOMEM. I have not gotten a case I can reproduce on my own Linux workstation yet. I suspect once we do get one it will be easy to understand what we're doing to make the kernel unhappy and work around it. Any reproduction cases are most welcome.
Thanks Brian. Looking at report-2 I'm happy that there are no signs of the cgo DNS resolver. I did find several goroutines, 454 of them, blocking on a mutex from a 3rd party library (goroutine 743188177 [semacquire]). Is it possible to rule out that 3rd party service?
@rsc something about that clicks for me, though it is a little surprising. Happy to make that change. @davecheney That is our metrics library, and unfortunately, it's not really possible to rule it out as it's critical to the operation of the service :-( If there's some evidence it's a problem, however, I could invest some time changing its behavior to use fewer goroutines. EDIT: @rsc As for reproduction cases, this is not happening on anything I'd call a consistent pattern; it's just observed behavior, with varying intervals between occurrences.
New test with some debug output:
For reference, the struct and macro:
It took a few minutes after the initial data load, during which time VSZ crept up to 3500504, and then boom:
Tracking the pc just to be precise:
That line would be:
which is no surprise, given the null pointer returned by malloc. I'm only surprised it didn't crash on the line before, but perhaps the assignments were performed out of order. I am also running Ubuntu 14.04 LTS.
@rowland, could you try my suggestion from #12233 (comment)? It may be that Go is running up the process' VMA count and C malloc is getting unlucky and trying to make that one last mmap that breaks the camel's back. If my suggestion doesn't help, then I think your problem is on the C side.
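For anyone wanting to watch the VMA count from inside a process: on Linux each line of /proc/self/maps is one mapping, and the limit is /proc/sys/vm/max_map_count (65530 by default). A minimal sketch (an illustration, not part of the suggested fix):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"strconv"
)

// countLines returns the number of lines in a /proc file.
func countLines(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	n := 0
	s := bufio.NewScanner(f)
	for s.Scan() {
		n++
	}
	return n, s.Err()
}

func main() {
	vmas, err := countLines("/proc/self/maps") // one line per mapping (VMA)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	raw, err := os.ReadFile("/proc/sys/vm/max_map_count")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	limit, _ := strconv.Atoi(string(bytes.TrimSpace(raw)))

	fmt.Printf("VMAs in use: %d of %d allowed\n", vmas, limit)
}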
@aclements It's certainly helping. The process has lasted 40 times as long with more than twice as many mallocs. It would be preferable not to need the sysctl workaround, but this gives me time until I can replace the CGO bits with native Go equivalents. Thanks.
I'm considering simply disabling huge pages for the Go heap for the 1.5.2 release. This wouldn't be a huge loss, considering Linux only gained transparent huge page support quite recently. For 1.6 we may be able to find a better way to take advantage of huge pages without fragmenting our VMAs so badly.
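For background on the mechanism being discussed: transparent huge pages are disabled for a region with madvise(MADV_NOHUGEPAGE), and every region whose flags differ from its neighbors needs its own VMA. A minimal sketch of the syscall using golang.org/x/sys/unix (an illustration, not the runtime's code):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Map 4 MiB of anonymous memory directly from the kernel.
	mem, err := unix.Mmap(-1, 0, 4<<20,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		fmt.Println("mmap:", err)
		return
	}
	defer unix.Munmap(mem)

	// Ask the kernel not to back the first 2 MiB with transparent huge
	// pages. Because the advice now differs from the rest of the mapping,
	// the kernel splits it into two VMAs.
	if err := unix.Madvise(mem[:2<<20], unix.MADV_NOHUGEPAGE); err != nil {
		fmt.Println("madvise:", err)
		return
	}
	fmt.Println("MADV_NOHUGEPAGE set on the first 2 MiB")
}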
@aclements @davecheney @rsc Great news! After running over the weekend with the system's max_map_count limit raised, we have not seen this crash recur.
@bmhatfield, great! I'll put together a CL to disable huge pages for Go 1.5.2, which should fix this, and we'll think about how to intelligently re-enable them for 1.6.
@bmhatfield, excellent detective work, I'm delighted you got to the bottom of this.
I've posted a CL for this: https://go-review.googlesource.com/#/c/15191 @bmhatfield or anyone else having this problem, could you give this CL a try? It cherry-picks cleanly to 1.5.1, or you can try it on master. I've confirmed that it reduces the number of VMAs, but I don't have any test cases myself that drive the VMA count anywhere near the limit.
CL https://golang.org/cl/15191 mentions this issue.
I'm happy to give that CL a try, but building the right version of Go from source seems a bit arduous. Any recommendations for the simplest way to get a binary?
@bmhatfield, I'm not sure how one would get a binary of this, since we don't even ship binaries of releases for Linux. But here's the exact sequence of commands that should build the right source:
This assumes you have some Go version installed as /usr/bin/go, such as from your package manager. Tweak the GOROOT_BOOTSTRAP in the last line if it's somewhere else. This should give you a go binary in go/bin that you can build with.
Not intended to be a factual statement. Still, it's probably easiest to just run those commands. If that doesn't work, let me know and I'll see what I can do about getting you a binary.
Thanks for the outline of how to rebuild Go, that was very helpful. There were some minor details I had to tweak. I was as rigorous as possible about using the patched toolchain to recompile my program, but there's a small chance I messed up rebuilding my program with the correct Go.

That said:

Prior to updating the binary:

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
131057

After updating the binary and letting it get up to full-steam (about 27 minutes):

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
384

EDIT: A couple minutes later:

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
276

After restarting a not-updated binary and letting it get up to full-steam (about 18 minutes):

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
45242

EDIT: A couple minutes later:

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
60518

So it looks like this change has had the intended effect.
Yay! I'm very glad to see this resolved. It's certainly been interesting trying to provide enough information as we tried to track this down. Thanks for the support and help!
Compiled Go 1.5.2 from source, as prescribed by @aclements.
In case more anecdata is useful in gauging how frequently this bug occurs in the wild, we have two different server processes that occasionally crash with this issue. They are both some kind of database and have large and active heaps. We're using the raise-the-max_map_count workaround for now.
CL https://golang.org/cl/16980 mentions this issue.
runtime: adjust huge page flags only on huge page granularity

This fixes an issue where the runtime panics with "out of memory" or "cannot allocate memory" even though there's ample memory by reducing the number of memory mappings created by the memory allocator.

Commit 7e1b61c worked around issue #8832 where Linux's transparent huge page support could dramatically increase the RSS of a Go process by setting the MADV_NOHUGEPAGE flag on any regions of pages released to the OS with MADV_DONTNEED. This had the side effect of also increasing the number of VMAs (memory mappings) in a Go address space because a separate VMA is needed for every region of the virtual address space with different flags. Unfortunately, by default, Linux limits the number of VMAs in an address space to 65530, and a large heap can quickly reach this limit when the runtime starts scavenging memory.

This commit dramatically reduces the number of VMAs. It does this primarily by only adjusting the huge page flag at huge page granularity. With this change, on amd64, even a pessimal heap that alternates between MADV_NOHUGEPAGE and MADV_HUGEPAGE must reach 128GB to reach the VMA limit. Because of this rounding to huge page granularity, this change is also careful to leave large used and unused regions huge page-enabled.

This change reduces the maximum number of VMAs during the runtime benchmarks with GODEBUG=scavenge=1 from 692 to 49.

Fixes #12233.

Change-Id: Ic397776d042f20d53783a1cacf122e2e2db00584
Reviewed-on: https://go-review.googlesource.com/15191
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-on: https://go-review.googlesource.com/16980
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
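A rough sketch of the alignment arithmetic the commit message describes: shrink the range whose flags would change inward to huge page boundaries, so that partial huge pages on either side keep their existing setting (assuming 2 MiB huge pages on linux/amd64; this is an illustration, not the runtime's actual code):

package main

import "fmt"

const hugePageSize = 2 << 20 // 2 MiB transparent huge pages on linux/amd64

// roundInwardToHugePages shrinks [start, end) to the largest contained range
// that is aligned to huge page boundaries. If the range does not span a full
// huge page, it returns ok=false and no flag change would be made.
func roundInwardToHugePages(start, end uintptr) (alignedStart, alignedEnd uintptr, ok bool) {
	alignedStart = (start + hugePageSize - 1) &^ (hugePageSize - 1) // round up
	alignedEnd = end &^ (hugePageSize - 1)                          // round down
	if alignedStart >= alignedEnd {
		return 0, 0, false
	}
	return alignedStart, alignedEnd, true
}

func main() {
	// Example: a 5 MiB region starting 1 MiB past a huge page boundary.
	start := uintptr(1 << 20)
	end := start + 5<<20

	s, e, ok := roundInwardToHugePages(start, end)
	fmt.Printf("input  [%#x, %#x)\n", start, end)
	if ok {
		fmt.Printf("adjust [%#x, %#x): only these full huge pages change flags\n", s, e)
	}
}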
I recently upgraded to Go 1.5 from 1.4.2 for a moderately large production service, serving thousands of requests per second. This particular program has been run on every Go version from 1.1 through (now) 1.5, and has been stable in production for more than two years. Switching to Go 1.5 brought dramatic improvements in GC pause time, as expected, but the processes are intermittently exiting (somewhere between 1-6h of uptime) with the below panic.
On each version, this has been running on Ubuntu 12.04.