runtime: mechanism for monitoring heap size #16843

Closed
bradfitz opened this issue Aug 22, 2016 · 59 comments
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.

Comments

@bradfitz
Contributor

Tracking bug for some way for applications to monitor memory usage, apply backpressure, stay within limits, etc.

Related previous issues: #5049, #14162

@rgooch

rgooch commented Aug 23, 2016

Can you please expand on what you have in mind?

@bradfitz
Contributor Author

I have nothing specific in mind. This bug was filed as part of a triage meeting with a bunch of us. One bug (#5049) was ancient with no activity and one bug (#14162) proposed a solution instead of discussing the problem.

This bug is recognition that there is a problem, and we've heard problem statements and potential solutions (and sometimes both) from a number of people.

The reality is that there are always memory limits, and it'd be nice for the Go runtime to help applications stay within them, through perhaps some combination of limiting itself, and/or helping the application apply backpressure when resources are getting tight. That might involve new runtime API surface to help applications know when things are getting tight.

/cc @nictuku also.

@bradfitz
Contributor Author

Btw, there was lots of good conversation at #14162 and it wasn't our intention to kill it or devalue it. It just didn't fit the proposal process, and we also didn't want to decline it, nor close it as a dup of #5049.

Changing the language is out of scope, so all discussions of things like catching memory allocation failures, language additions like "trymake" or "tryappend", etc, are all not going to happen.

But we can add runtime APIs to help out. That's what this bug is tracking.

/cc @matloob @aclements

@rgooch

rgooch commented Aug 23, 2016

Agreed. "try*" isn't practical. It would require changing too many call-sites and even then would not catch all allocations. Adding runtime.SetSoftMemoryLimit() still seems like the best approach.

@nictuku
Contributor

nictuku commented Aug 23, 2016

It would be nice to have the ability to set a limit to the memory usage.

After a limit is set, perhaps the runtime could provide a clear indication that we're under memory pressure and that the application should avoid creating new allocations. Example new runtime APIs that would help:

  • func InMemoryPushback() bool; or
  • func RegisterPushbackFunc(func(inPushback bool))

That would provide a clear signal to the application. How exactly that's decided should be an internal implementation decision and not part of the API. An example implementation, to illustrate: if we limit ourselves to the heap size specified by the user, we could trigger GC whenever the used heap is close to the limit. Then we could enter pushback whenever the GC performance (latency or CPU overhead) is outside certain bounds. Apply smoothing as needed.

The approach suggested by this API has limitations.

For example, it's still possible for an application that is behaving well to do one monstrous allocation after it has checked for the pushback state. This would be common for HTTP and RPC servers that do admittance control at the beginning of the request processing. If the monstrous allocation would bring the memory heap above the limit, Go should probably panic. Since we don't want to change the language to add memory allocation error checks, I think this is fine. And we have no other option :).

Another problem is that deciding when to push back can be hard. Whatever the runtime implements, some folks may find it too aggressive (pushing back too much, leading to poor resource utilization) or too conservative (pushing back too late, leading to high latency due to excessive GC). I guess the Go team could provide a knob similar to GOGC to control the pushbackiness of the runtime, if folks are really paranoid about it.
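
A minimal sketch of how a server might consume such a signal for admission control. Neither InMemoryPushback nor RegisterPushbackFunc exists in the runtime, so inMemoryPushback below is a hypothetical application-level stand-in that compares runtime.MemStats.HeapAlloc against a limit chosen by the application. The heapLimit constant and the 90% threshold are assumptions made only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

// heapLimit is a hypothetical, application-chosen heap limit in bytes.
const heapLimit = 1 << 30

// overThreshold reports whether heapAlloc is at or above 90% of limit.
// The 90% figure is an arbitrary assumption for this sketch.
func overThreshold(heapAlloc, limit uint64) bool {
	return heapAlloc >= limit/10*9
}

// inMemoryPushback is a stand-in for the proposed runtime API; it is
// not a real runtime function.
func inMemoryPushback() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // briefly stops the world, so call sparingly
	return overThreshold(ms.HeapAlloc, heapLimit)
}

// admit rejects requests at the start of processing while the
// application considers itself under memory pressure.
func admit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if inMemoryPushback() {
			http.Error(w, "server under memory pressure", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(overThreshold(950, 1000), overThreshold(100, 1000))
	_ = admit(http.NotFoundHandler())
}
```

As the comment notes, a check at admission time cannot stop a single large allocation made later in the request. Calling runtime.ReadMemStats on every request is also costly, so a real server would probably cache the result.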

@RLH
Contributor

RLH commented Aug 23, 2016

The runtime could set up a channel and send a message whenever it completes
a GC. The application could have a heap monitor goroutine (HMG) watching
that channel. Whenever the HMG gets a message it inspects the state of the
heap. To determine the size of the heap the HMG would look at the live heap
size and GOGC. If need be it could adjust GOGC so that the total heap does
not exceed whatever limit the application finds appropriate. If things are
going badly for the application the HMG can start applying back pressure to
whatever part of the application is causing the increase in heap size. The
HMG would be part of the application so a wide variety of application
specific strategies could be implemented.

Trying to pick up the pieces after a failure does not seem doable. Likewise
deciding what is "close to a failure" is very application specific and a
global metric that potentially involves external OS issues such as
co-tenancy as well as other issues well beyond the scope of the Go runtime.
Decisions and actions need to be made well ahead if one expects them to
reliably prevent an OOM.

I believe this is where we were headed in #14162, and this is a recap of
some of that discussion.

I would be interested in what useful policy could not be implemented using
the HMG mechanism and current runtime mechanisms.
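
A rough sketch of such a heap monitor goroutine using only existing APIs. The runtime has no per-GC notification channel, so this version polls on a timer instead, which is an assumption of the sketch. The names heapMonitor and gcPercentFor are hypothetical, the live heap is only estimated from MemStats.NextGC and the current GOGC setting, and the clamp bounds are arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// gcPercentFor picks a GOGC value so that the heap goal, roughly
// live*(1+GOGC/100), stays at or below limit. The result is clamped
// to [5, 100]; both bounds are arbitrary choices for this sketch.
func gcPercentFor(live, limit uint64) int {
	const minPct, maxPct = 5, 100
	if live == 0 {
		return maxPct
	}
	if live >= limit {
		return minPct
	}
	pct := (limit - live) * 100 / live
	if pct > maxPct {
		return maxPct
	}
	if pct < minPct {
		return minPct
	}
	return int(pct)
}

// heapMonitor periodically re-estimates the live heap and adjusts GOGC
// until stop is closed. Polling stands in for the per-GC message
// described above, which the runtime does not provide.
func heapMonitor(limit uint64, interval time.Duration, stop <-chan struct{}) {
	cur := 100
	debug.SetGCPercent(cur)
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			var ms runtime.MemStats
			runtime.ReadMemStats(&ms)
			// Approximation: NextGC is about live*(100+GOGC)/100.
			live := ms.NextGC / uint64(100+cur) * 100
			if next := gcPercentFor(live, limit); next != cur {
				debug.SetGCPercent(next)
				cur = next
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		heapMonitor(1<<30, 10*time.Millisecond, stop)
	}()
	time.Sleep(50 * time.Millisecond)
	close(stop)
	<-done
	fmt.Println(gcPercentFor(800<<20, 1<<30))
}
```

Application-specific back pressure would hang off the same loop, for example by refusing new work once gcPercentFor returns the lower clamp.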


@rgooch

rgooch commented Aug 23, 2016

I previously gave the reasoning why using a channel or a callback to receive memory exceeded events won't work: #14162
That same reasoning applies to a channel whenever a GC run is completed.

To robustly handle exceeding a memory limit, the check for the limit has to be part of the allocator, not done after a GC run. This is because you can't afford to wait. If you wait for the next GC run, it may be too late. Consider a single large slice allocation that would put you over the soft limit and would exceed the hard memory limit. You'll get an OOM panic. The same applies to a callback function.

You need to immediately stop the code which is doing the heavy allocating. To do that you need a check in the allocator and you need to send a panic(). It's up to the application to set the soft memory limit at which these optional, catchable panics are sent.
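
To illustrate only the control flow being asked for, here is an application-level sketch of a soft limit that raises a catchable panic. It is not the allocator change proposed above: the Budget type, ErrSoftLimit, and guarded are all hypothetical names, and the check covers only allocations that are explicitly routed through Budget.Alloc, so library code that calls make directly is not covered.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrSoftLimit is the hypothetical panic value raised when the soft
// limit would be exceeded.
var ErrSoftLimit = errors.New("soft memory limit exceeded")

// Budget tracks bytes handed out through Alloc against a soft limit.
type Budget struct {
	used  int64 // accessed atomically; kept first for 64-bit alignment
	limit int64
}

// Alloc returns n bytes, or panics with ErrSoftLimit if that would
// push the tracked total above the limit.
func (b *Budget) Alloc(n int) []byte {
	if atomic.AddInt64(&b.used, int64(n)) > b.limit {
		atomic.AddInt64(&b.used, -int64(n)) // roll back the reservation
		panic(ErrSoftLimit)
	}
	return make([]byte, n)
}

// Used reports the number of bytes currently accounted for.
func (b *Budget) Used() int64 { return atomic.LoadInt64(&b.used) }

// guarded runs f and converts an ErrSoftLimit panic into an error.
// Any other panic is re-raised unchanged.
func guarded(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if e, ok := r.(error); ok && errors.Is(e, ErrSoftLimit) {
				err = e
				return
			}
			panic(r)
		}
	}()
	f()
	return nil
}

func main() {
	b := &Budget{limit: 1024}
	err := guarded(func() {
		_ = b.Alloc(512)
		_ = b.Alloc(1024) // exceeds the soft limit and panics
	})
	fmt.Println(err, b.Used())
}
```

The sketch never decrements the total when memory is freed; a real version would need some way to return bytes to the budget.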

Please, before rehashing old suggestions or coming up with new variants, read through #14162 where I gave the reasoning why a panic and a check in the allocator is needed. Otherwise we keep covering the same old ground.

@quentinmit
Contributor

@rgooch If you are allocating giant arrays, you probably know exactly where in your code that is happening, and you can add code there to first check if there is enough memory available. You can even do that using the GC information we're discussing passing down a channel.

I do think there is a race here, but in the opposite case - if code is sitting in a tight loop making many small allocations, your channel read/callback might not run in time to actually trigger a new GC soon enough without OOMing.

@rgooch

rgooch commented Aug 23, 2016

I discussed all this in #14162: you can be reading GOB-encoded data from a network connection. No way to know ahead of time how big it's going to be. Or it can be some other library you don't control where a lot of data are allocated, whether a single huge slice or a lot of small allocations. The point is, you don't know how much will be allocated before you enter the library code and you've got no way to reach in there and stop things if you hit some pre-defined limit. And, as you say, if you're in a loop watching allocations, even if you could stop things, you may not get there in time. Spinning in a loop watching the memory level is grossly expensive. This needs to be tied to the allocator.

@RLH
Contributor

RLH commented Aug 23, 2016

This does not propose a callback or channel for delivering a memory
exceeded message or a memory almost exceeded message. At that point it is
already too late. This proposes a mechanism for providing the application
timely information that it can use to avoid the OOM. The application knows
how best to predict memory usage and, if need be, throttle its memory usage.

One suggestion was
func runtime.ReserveOOMBuffer(size uint64)

The application's heap monitor goroutine, HMG, could initially allocate a
large object of the required size and retain a single reference to it. If
the HMG using information provided by the runtime determines that the
current GOGC and live heap size will not support the application's
predicted allocations then it can release that single reference confident
that the next GC will recover those spans and make them available. If the
HMG wants the GC to happen sooner than currently scheduled then it can
lower GOGC using SetGCPercent.

If ReserveOOMBuffer is the API that some Go application needs then this
provides it. The intent of this proposal is to provide the application with
the information it needs to create the abstractions that best fits its need
while minimizing Go's runtime API surface.
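
A minimal sketch of building that abstraction in the application. ReserveOOMBuffer is not a runtime function; the version below is a hypothetical helper that holds a single reference to a large byte slice and drops it on request.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// OOMBuffer holds the only reference to a reserved block of heap.
type OOMBuffer struct {
	mu  sync.Mutex
	buf []byte
}

// ReserveOOMBuffer allocates size bytes and retains a single reference.
// This is an application-level helper, not a runtime API.
func ReserveOOMBuffer(size uint64) *OOMBuffer {
	return &OOMBuffer{buf: make([]byte, size)}
}

// Release drops the reference so the next GC can recover the spans.
// It reports whether the buffer was still held.
func (b *OOMBuffer) Release() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.buf == nil {
		return false
	}
	b.buf = nil
	return true
}

func main() {
	reserve := ReserveOOMBuffer(64 << 20)
	// Later, when the heap monitor predicts trouble:
	if reserve.Release() {
		runtime.GC() // optional: reclaim now rather than at the next cycle
	}
	fmt.Println("released again:", reserve.Release())
}
```

One caveat: on some operating systems, pages of a freshly allocated slice that are never written may not be backed by physical memory, so the reservation counts toward the Go heap without necessarily reserving RAM.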


@dr2chase
Contributor

As I read this, #14162 describes a workload where (analogy follows) sometimes the python attempts to swallow a rhino, and if the attempt is not halted ASAP it is guaranteed to end badly. Is it in fact the case that the rhino will never be successfully swallowed? (I can imagine DoS attacks on servers where this might be the case.)

I think that the periodic notification scheme is intended to deal with a python diet of a large number of smaller prey; if an application has the two constraints of m=memory < M and l=latency < L, and if m is affine in workload W (reasonable assumption) and l is also affine in workload W (semi-reasonable), then simply comparing observed m with limit M and observed l with limit L tells you how much more work can be admitted (W' = W * min(M/m, L/l)), with the usual handwaving around unlucky variations in the input and lag in the measurement. It's possible to adjust GOGC up or down if M/m and L/l are substantially different, so as to maximize the workload within constraints -- this however also requires knowledge of the actual GC overhead imposed on the actual application (supposed to be 25% during GC, but high allocation rates change this). One characteristic of this approach is that a newly started application might not snap online immediately at full load, but would increase its intake as it figured out what load it could handle.
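
The admission formula above, W' = W * min(M/m, L/l), can be written out directly. This is a sketch under the same affine assumptions; the function name admissible and the sample numbers are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// admissible returns W' = W * min(M/m, L/l): the workload that should
// fit within both the memory limit maxM and the latency limit maxL,
// given the observed memory m and latency l at the current workload w.
// It assumes m and l are nonzero and roughly affine in the workload.
func admissible(w, m, maxM, l, maxL float64) float64 {
	return w * math.Min(maxM/m, maxL/l)
}

func main() {
	// Hypothetical observation: 100 requests in flight use 4 GB of a
	// 6 GB budget and take 50 ms against a 100 ms latency budget.
	fmt.Println(admissible(100, 4, 6, 50, 100))
}
```

In this example memory is the tighter constraint (6/4 versus 100/50), so it bounds the admitted workload.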

But this is no help for intermittent rhino-swallowing.

@quentinmit quentinmit added this to the Go1.8Maybe milestone Sep 6, 2016
@jessfraz
Contributor

jessfraz commented Oct 5, 2016

@bradfitz would you be open to me taking some of the ideas from #14162 and applying the Go proposal process so it is considered? As long as, of course, the proposed solution doesn't break the API or change the language.

@bradfitz
Contributor Author

bradfitz commented Oct 5, 2016

As long as the proposal isn't to "make it possible to catch failed memory allocations", which I'm pretty sure everybody agrees isn't going to happen.

But any proposal should address or at least consider the whole range of related issues in this space. (back pressure, runtime & applications being aware of limits & usage levels)

@jessfraz
Contributor

jessfraz commented Oct 5, 2016

I was thinking a couple additions to the runtime package to expose information that might be useful for applications like you said in #16843 (comment)

@quentinmit quentinmit added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 11, 2016
@rsc rsc modified the milestones: Go1.9, Go1.8Maybe Oct 21, 2016
@juliandroid

Is there any decision about how this would be properly implemented?

Perl documents a notorious $^M global variable that user code could initialize to some lengthy string, which could then serve as an emergency memory pool after die()ing on an out-of-memory error. However, I couldn't find a working example, and it seems that feature was never implemented.

However, it seems a logical approach. You are most probably in a multi-tenant environment, sharing memory with other Go and non-Go programs, so the only buffer you can rely on is the emergency one you allocated yourself. Having the Go runtime use that memory when memory runs low, and immediately notifying the subscribed process that it is running out of memory, seems like a good measure to prevent pure Go programs from panicking.

@nictuku
Contributor

nictuku commented Feb 20, 2017

My proposal is here: https://docs.google.com/document/d/1zn4f3-XWmoHNj702mCCNvHqaS7p9rzqQGa74uOwOBKM/edit

I hope to have an implementation open sourced soon. I don't know if it could be included in the standard libraries.

I would like to make it as robust as possible, so if you'd like to test it, please drop me an email (see my github profile) and I'll contact you later. Thanks!

@rgooch

rgooch commented Mar 8, 2017

This proposal looks interesting. I made a couple of comments in the document:

  1. Support the pattern of pre-allocating at startup (up to a percentage of the VM/container memory) and never give that memory back to the OS

  2. Have a hard memory limit and push back+GC harder as you get closer to the limit.

@CAFxX
Contributor

CAFxX commented Mar 8, 2017

Added feedback to optionally trigger orderly application shutdown when GC pacing fails to keep memory below the set maximum.

@tve

tve commented May 11, 2017

I'm dealing with an app that runs out of memory (on a 16GB box) and that eventually led me here. Some of the notes I took along the way are below, apologies if these fall into a "yeah, we know" category.

  • On 64-bit linux, I hit the out-of-memory panic in sysMap in mem_linux.go:216, but when I look up the call stack I see it passing through grow in mheap.go:774 and the code leads me to believe that if sysMap had returned an error instead of just panicking then grow could have tried a smaller allocation.
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x8a2de5, 0x16)
        /usr/local/go/src/runtime/panic.go:596 +0x95
runtime.sysMap(0xc437a10000, 0x5800000, 0xc420394800, 0xaebef8)
        /usr/local/go/src/runtime/mem_linux.go:216 +0x1d0
runtime.(*mheap).sysAlloc(0xad31a0, 0x5800000, 0x421b81)
        /usr/local/go/src/runtime/malloc.go:428 +0x374
runtime.(*mheap).grow(0xad31a0, 0x2c00, 0x0)
        /usr/local/go/src/runtime/mheap.go:774 +0x62
runtime.(*mheap).allocSpanLocked(0xad31a0, 0x2c00, 0xaceb30)
        /usr/local/go/src/runtime/mheap.go:678 +0x44f
  • I'm running in a container env where the container has a max memory set and I'm trying to understand what fraction of that can realistically be "in_use". It appears that I have to count for anywhere from 25% to 50% overhead. E.g., if the cgroup has memory=16GB then the actual in-use heap data structures may be in the 8GB..12GB range before I hit the out-of-memory panic. On the one hand, with GC that's perhaps in the reasonable ballpark, on the other hand this does represent $$.
  • The amount of "unused heap overhead" seems to be tunable using the GOGC env variable, but I didn't see a way to modify this at run-time. For example, while the process is far from its limit using 100% reduces GC overhead, but when it reaches perhaps 60% of its limit I may want to change it to 20% to trade memory vs cpu. In my app I see it going from 1% to 6% of cpu overhead.
  • I'm very interested in being able to capture control when the process runs out of memory or is about to. I understand that in the absolute this is a difficult problem, but I'm looking at it from a troubleshooting perspective. I would first use it to output a memory profile or similar information so I can understand how much memory is allocated where, plus some info about GC (e.g. allocated but unused space). It would be OK for this to trigger before absolutely-out-of-memory occurs, e.g. the first time the runtime gets back-pressure from the OS (see first bullet point).
  • I do believe that many services can adjust their memory consumption by, broadly speaking, adjusting the concurrency. For example, an HTTP server can adjust the number of requests that are concurrently processed. I believe the runtime.MemStats info is sufficient for this purpose, but it could be enhanced by having some callback mechanism when a threshold is exceeded. E.g., a web server could block processing of new requests when 80% of available memory is used and only resume when it drops below 75%.

Overall I concur with the sentiment that most apps that run out of memory will run out of memory regardless of how fancy a mechanism is added to the current situation. For this reason if I had a vote I would vote for adding some additional simple hooks so one can do some tuning and foremost troubleshoot when an app does run out of memory.
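
The 80%/75% example in the last bullet is a hysteresis gate, which can be sketched without any new runtime support. The Gate type and its field names are hypothetical, and the caller is assumed to supply the usage figure, for instance from runtime.MemStats.

```go
package main

import "fmt"

// Gate blocks new work once usage reaches the high-water fraction and
// reopens only after usage drops below the low-water fraction.
// It is not safe for concurrent use without external locking.
type Gate struct {
	High, Low float64 // fractions of the limit, e.g. 0.80 and 0.75
	blocked   bool
}

// Update records the latest usage and reports whether new work should
// currently be blocked.
func (g *Gate) Update(used, limit uint64) bool {
	frac := float64(used) / float64(limit)
	switch {
	case frac >= g.High:
		g.blocked = true
	case frac < g.Low:
		g.blocked = false
	}
	// Between Low and High the previous state is kept.
	return g.blocked
}

func main() {
	g := &Gate{High: 0.80, Low: 0.75}
	for _, used := range []uint64{70, 85, 78, 74} {
		fmt.Println(used, g.Update(used, 100))
	}
}
```

The gap between the two thresholds keeps the server from flapping between accepting and rejecting requests when usage hovers near a single cutoff.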

@aclements
Member

On 64-bit linux, I hit the out-of-memory panic in sysMap in mem_linux.go:216, but when I look up the call stack I see it passing through grow in mheap.go:774 and the code leads me to believe that if sysMap had returned an error instead of just panicking then grow could have tried a smaller allocation.

I'm not sure what you're suggesting, exactly. grow can reduce its request by at most 64 KB, which probably isn't going to help when a multi-gigabyte heap is running out of room.

I'm running in a container env where the container has a max memory set and I'm trying to understand what fraction of that can realistically be "in_use". It appears that I have to count for anywhere from 25% to 50% overhead.

Assuming you mean runtime.MemStats.HeapInuse (and friends), note that this can vary depending on where you are in a GC cycle. Perhaps more interesting is MemStats.NextGC, which tells you what heap size this GC cycle is trying to keep you below. This changes only once per GC cycle.

The amount of "unused heap overhead" seems to be tunable using the GOGC env variable, I didn't see a way to modify this at run-time.

runtime/debug.SetGCPercent lets you change this. Right now this triggers a full STW GC, but in Go 1.9 this operation will let you change GOGC on the fly without triggering a GC (unless you set it low enough that you have to immediately start a GC, of course :)
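
For reference, a small sketch of reading MemStats.NextGC and changing GOGC at run time with debug.SetGCPercent, which returns the previous setting. The wrapper names heapGoal and swapGCPercent are only for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapGoal returns MemStats.NextGC, the heap size the current GC cycle
// is trying to stay below.
func heapGoal() uint64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.NextGC
}

// swapGCPercent sets GOGC at run time and returns the previous value.
func swapGCPercent(pct int) int {
	return debug.SetGCPercent(pct)
}

func main() {
	fmt.Println("goal before:", heapGoal())
	prev := swapGCPercent(20) // trade more GC CPU for a smaller heap
	fmt.Println("goal at GOGC=20:", heapGoal())
	swapGCPercent(prev) // restore the earlier setting
}
```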

@tve

tve commented May 12, 2017

I'm not sure what you're suggesting, exactly. grow can reduce its request by at most 64 KB, which probably isn't going to help when a multi-gigabyte heap is running out of room.

Ah, I couldn't tell that, you're right then.

@tve

tve commented May 12, 2017

My proposal is here: https://docs.google.com/document/d/1zn4f3-XWmoHNj702mCCNvHqaS7p9rzqQGa74uOwOBKM/edit

Nice long proposal write-up :-). I'm trying to understand the tl;dr; ...

The proposal seems to come down to "periodically measure live data size and set GCPercent such that GC is triggered before the desired total heap size is reached". As mentioned in the proposal, this can be done/approximated today in the app itself using runtime.MemStats and debug.SetGCPercent.

As far as I can tell the following changes to the runtime would be desirable to improve this:

  • ensure that the calls necessary are efficient (some (all?) optimizations are in Go1.9 already)
  • provide a hook so GCPercent can be adjusted after each GC instead of relying on a periodic timer?

As a user I'm still left wondering a bit what a reasonable goal in all of this is. I'm imagining something like "for the vast majority of Go apps the tuning of GCPercent allows 80% of memory to be used for live data with moderate GC overhead and 90% with high to very high GC overhead". Maybe someone in the Go community has informed intuition about specific numbers.

The answer to requests to have some callback or rescue option when memory allocation fails would be that instead GC overhead exceeding N% or GCPercent falling below M% should be used to trigger said rescue action.

@tve

tve commented May 12, 2017

I did an experiment to use GCPercent to constrain heap size and while the principle works as expected, it does not look sufficient to me. I'm working on an app that digests some giant CSVs where memory consumption is an issue. I'm running with GCPercent=25 to try and contain the memory overhead. I'm running with gctrace=1 and the highest heap size number I see is 797MB:

gc 389 @209.888s 6%: 0.013+888+0.10 ms clock, 0.055+164/183/1068+0.40 ms cpu, 796->797->613 MB, 797 MB goal, 4 P

A little later after some memory has been freed I grab MemStats and get the following HeapXxx stats which show 1.2GB of heap (all gctrace outputs since the above were lower):

Heap stats: sys=1205MB inuse=488MB alloc=438, idle=717, released=0

Data grabbed from top at about that time seems to agree with the heap stats (code/stack size are not significant):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17746 tve       20   0 1272812 983920   6952 S 226.2 24.9   7:44.84 csv-digest

I was trying to keep the memory used by my process to 613MB*1.25=767MB using GCPercent but clearly that's not really working.
The point here is that tuning GCPercent is not sufficient if there is some hard limit one wants to stay under.
(I understand that my 25% goal may very well be unrealistic but I don't think this invalidates the point.)
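
The arithmetic behind these numbers can be sketched as follows. GCPercent bounds the heap goal (about live*(1+GCPercent/100)), while MemStats.HeapSys also counts idle spans the runtime has not yet returned to the OS, which the MemStats documentation estimates as HeapIdle minus HeapReleased. The helper names are hypothetical, and reading the 1205MB figure this way is an interpretation rather than a confirmed diagnosis.

```go
package main

import (
	"fmt"
	"runtime"
)

const mb = 1 << 20

// heapGoal estimates the heap size the GC aims for given the live heap
// and GCPercent: live * (1 + gcPercent/100), using integer arithmetic.
func heapGoal(live uint64, gcPercent int) uint64 {
	return live + live*uint64(gcPercent)/100
}

// retainedIdle estimates idle heap memory still held by the runtime
// rather than returned to the OS (HeapIdle - HeapReleased).
func retainedIdle(idle, released uint64) uint64 {
	return idle - released
}

func main() {
	// Numbers from the comment above, in MB.
	fmt.Println("goal:", heapGoal(613, 25), "retained idle:", retainedIdle(717, 0))

	// The same breakdown for the running process.
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("Heap stats: sys=%dMB inuse=%dMB alloc=%dMB idle=%dMB released=%dMB retained=%dMB\n",
		ms.HeapSys/mb, ms.HeapInuse/mb, ms.HeapAlloc/mb, ms.HeapIdle/mb,
		ms.HeapReleased/mb, retainedIdle(ms.HeapIdle, ms.HeapReleased)/mb)
}
```

With the figures quoted above, inuse (488MB) plus idle (717MB) accounts for the 1205MB reported as sys, and none of the idle memory had been released at that point.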

@vitalyisaev2

vitalyisaev2 commented Sep 10, 2018

I have an urgent need for a tool that helps to understand what's actually going on in a Go process address space, why RSS keeps growing despite memory limitations and so on. I also need to use cgo, which makes the problem even more complicated. Currently I have to use a set of tools like pprof, valgrind --tool=massif, viewcore (introduced a few weeks ago), and some self-developed tools. But it looks like I can see only different aspects of the problem, not the entire problem.

For example, I see that the process has 5GB RSS. The Go runtime says that it takes 2GB (though only 5% of that is used, and the other 95% is idle). The cgo library says that it uses < 500MB in its caches and other internal data structures. And no one knows what consumed the remaining 2.5GB. I can only speculate whether it's Go runtime cost (for instance, because I have the profiler enabled as well), or a memory leak due to cgo (malloc without free).

@aclements
Member

@vitalyisaev2, please open a new issue or send an email to golang-nuts@googlegroups.com. In it, please elaborate on what you mean by "Go runtime says that it takes 2GB". The runtime exports many different statistics, and it's important to know which one you're talking about. I would start by looking closely at all of the runtime.MemStats statistics, since that should tell you if it's on the Go side or the C side. If it's on the Go side, viewcore is probably the right tool to find the problem. Please also describe the time-scale of the problem, since some things (like returning memory to the OS) happen on the scale of minutes.

@rsc
Contributor

rsc commented Nov 14, 2018

@aclements, what do you think the decision is here that NeedsDecision refers to?
Is there a different bug for your memory pressure work?
Should this issue be closed?

@fortuna

fortuna commented Feb 25, 2019

For the record, there's a change to improve how memory is returned to the OS that is scheduled to be released in Go 1.12 for Linux and Go 1.13 for iOS, preventing OOM: #29844.

gopherbot pushed a commit that referenced this issue Mar 5, 2019
This slightly rearranges gcSetTriggerRatio to compute the goal before
computing the other controls. This will simplify implementing the heap
limit, which needs to control the absolute goal and flow the rest of
the control parameters from this.

For #16843.

Change-Id: I46b7c1f8b6e4edbee78930fb093b60bd1a03d75e
Reviewed-on: https://go-review.googlesource.com/c/go/+/46750
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@rsc
Contributor

rsc commented Feb 26, 2020

Closing as a duplicate of #29696.

@rsc rsc closed this as completed Feb 26, 2020
@gopherbot
Contributor

Change https://golang.org/cl/227767 mentions this issue: [release-branch.go1.14] runtime/debug: add SetMaxHeap API

@golang golang locked and limited conversation to collaborators Apr 9, 2021
@rsc rsc unassigned RLH and aclements Jun 23, 2022