runtime: reclaim memory used by huge array that is no longer referenced #14045
This is sort of a dup of #5049.
This seems like a different bug to me. Say the heap ends in 10 free pages and we try to allocate something 20 pages in size. Currently the code allocates 20 pages from the system to satisfy this request, because the request won't fit in the 10 free pages. We really should ask for only 10 pages, as they will (probably) be coalesced with the 10 free pages at the end of the heap, resulting in 20 contiguous free pages. The "probably" depends on what the OS decides to give us. If it isn't coalesceable, then asking for only 10 pages doesn't help; in fact, it hurts. Maybe we should only try this when we would otherwise throw an out-of-memory error.
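As a toy illustration of the policy described above (a sketch with a hypothetical helper, not the runtime's actual code), the grow request would be clamped to the shortfall:

```go
package main

import "fmt"

// askPages computes how many pages to request from the OS for an
// allocation of npages when freeAtEnd pages are already free at the
// end of the heap: request only the shortfall and hope the new memory
// coalesces with the existing free tail.
func askPages(npages, freeAtEnd int) int {
	if freeAtEnd >= npages {
		return 0 // the free tail already satisfies the request
	}
	return npages - freeAtEnd
}

func main() {
	// 10 free pages at the end of the heap, a 20-page request:
	// ask the OS for only 10 more pages instead of all 20.
	fmt.Println(askPages(20, 10)) // 10
}
```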
I thought this was caused by imprecise GC or something. In any case, why can't Go release memory back to the system if it knows there is a big chunk of unused pages and the new allocation cannot be satisfied? Is it because, by design, the heap never shrinks?
No, it is just that we wait some time before releasing pages back to the system, to amortize the cost of allocating them in the first place. What if we add another 85MB allocation to the end of your example?
@randall77 My comment was about releasing memory back to the OS when a new allocation fails, not in general. Also, how long does Go wait before releasing memory back to the system? If I modify the program to wait a minute before allocating again, it still fails:
I see: if an allocation fails, we should run a GC, trigger giving memory back to the OS, then try again. That seems reasonable. I think we currently wait 5 minutes (after memory is freed by GC) before giving unused memory back to the OS.
I modified the program to wait 20 minutes, not one, after GC before trying to allocate again. It still reported out of memory.
That's because you're ulimiting the virtual address space, not the resident set size. When we give memory back to the OS, that just reduces the RSS, not the virtual memory use. The runtime will need some sort of fix to reuse the virtual address space for the second allocation. Virtual address size limits are in any case probably not what you want; virtual address space is essentially free on a 64-bit system.

Also, you've allocated an 85MB chunk and then freed it again. There's no guarantee that the runtime (during runtime.GC or time.Sleep) doesn't allocate something small just after that 85MB chunk. In that case, there's no way to reuse the 85MB of virtual address space for the 150MB allocation.
@randall77 Right: restricting just the total amount of physical memory the process can use, not VM space, makes the example work with timeouts slightly above 5 minutes. So this bug is really about releasing memory to the OS on out-of-memory without waiting 5 minutes.
Does debug.FreeOSMemory help?
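For reference, a minimal sketch of using it in the reproducer: debug.FreeOSMemory forces a garbage collection and then immediately returns as much memory to the OS as possible, rather than waiting out the scavenger's delay.

```go
package main

import "runtime/debug"

func main() {
	a := make([]byte, 85*1024*1024)
	for i := 0; i < len(a); i += 4096 {
		a[i] = 'x' // fault the pages in so they count against RSS
	}
	a = nil
	// Force a GC and immediately return freed memory to the OS,
	// instead of waiting for the background scavenger.
	debug.FreeOSMemory()
}
```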
I found that reproducing this is slightly more complicated now. To grow the RSS, it's necessary to touch the first allocation (otherwise the pages don't get faulted in):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	a := make([]byte, 85*1024*1024)
	for i := 0; i < len(a); i += 4096 {
		a[i] = 'x'
	}
	a = nil
	//debug.FreeOSMemory()
	time.Sleep(time.Second)
	a = make([]byte, 150*1024*1024)
	for i := 0; i < len(a); i += 4096 {
		a[i] = 'x'
	}
	fmt.Printf("%c\n", a[0])
	time.Sleep(time.Minute)
}
```

With this code, the process's RSS grows to ~244 MB. If you comment out the sleep (which allocates) between the two big allocations, we will in fact reuse the whole 85 MB region for the 150 MB region, so the RSS grows to just over 150 MB. I suspect that didn't happen in 1.5.2, when this was originally reported, because we didn't trigger the GC aggressively enough before satisfying the second big allocation. Uncommenting the FreeOSMemory call (and importing runtime/debug) does keep the RSS down, since it GCs the 85MB allocation and immediately scavenges those pages.

However, fixing this is not as simple as just scavenging if we get ENOMEM. For one, there are actually about a dozen places in the runtime where we check for out of memory (most are where we allocate off-heap runtime structures). We could (and possibly should) consolidate these, but then the problem becomes that the scavenger has locking requirements, and out-of-memory is detected and handled at a very low level, while we're holding various locks and while the heap structures may be in an inconsistent state. This means we'd have to thread the failure up the stack further than we currently do in many places until we reached a point where we could reliably drop locks, call the scavenger, and retry the operation.
An alternative approach would be to get out in front of the problem. I haven't convinced myself whether this is a good idea yet, but when sweeping frees a large object (for some value of large, probably at least 1 MB and maybe much more), it would be quite easy to release that memory back to the OS immediately. That would actively decouple physical and virtual memory around large objects, so that the virtual address space fragmentation caused by large objects didn't cause over-retention of physical memory.

The cost would be releasing and later re-faulting that memory, which would be amortized over the cost of allocating and populating a large object. Faulting on my laptop takes about 70–200µs/MB, and releasing memory takes about 4–30µs/MB (the lower bound is for large pages, the upper bound for regular pages). In contrast, allocating a large object from already-faulted memory and fully populating it takes about 450–500µs/MB. Hence, in the worst case, this would increase the cost of allocating large objects by ~46%, but if large pages are available the worst case is more like 16%. Costs in practice would probably be lower.

This solution would also be likely to reduce the running RSS of applications with large heaps, since it effectively makes the scavenger much more aggressive for large objects. That alone would be worth some cost.
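Back-of-the-envelope, using those numbers: the worst case is (200 + 30)µs ÷ 500µs ≈ 46% added cost per MB, and with large pages (70 + 4)µs ÷ 450µs ≈ 16%, which is where the percentages above come from.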
Yet another way to get out in front of the problem would be to eagerly scavenge when we grow the heap or reuse freed memory. This would be less aggressive than my previous sweeping idea, so it would be less likely to free memory that's about to be reused. I think we're in a good position to do this now thanks to @RLH's free span treap, since that should let us quickly find a right-sized chunk of unused memory to free.
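A toy version of the lookup that a size-ordered structure enables (assumed structures, not the runtime's treap; the idea is just "quickly find a free span big enough to release"):

```go
package main

import (
	"fmt"
	"sort"
)

type span struct{ npages int }

// pickToScavenge returns the index of the smallest free span with at
// least need pages, or -1 if none is big enough. spans must be sorted
// by npages in ascending order.
func pickToScavenge(spans []span, need int) int {
	i := sort.Search(len(spans), func(i int) bool { return spans[i].npages >= need })
	if i == len(spans) {
		return -1
	}
	return i
}

func main() {
	free := []span{{2}, {5}, {9}, {40}}
	fmt.Println(pickToScavenge(free, 8)) // 2 (the 9-page span)
}
```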
Change https://golang.org/cl/138918 mentions this issue:
Change https://golang.org/cl/139837 mentions this issue:
Change https://golang.org/cl/140198 mentions this issue:
Change https://golang.org/cl/139297 mentions this issue:
Currently, span scavenging is done nearly identically in two different locations. This change deduplicates it into one shared routine.

For #14045.

Change-Id: I15006b2c9af0e70b7a9eae9abb4168d3adca3860
Reviewed-on: https://go-review.googlesource.com/c/139297
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Currently, mheap tracks spans in both mSpanLists and mTreaps, but mSpanLists, while they tend to be smaller, complicate the implementation. This change simplifies the implementation by removing free and busy from mheap and renaming freelarge -> free and busylarge -> busy.

This change also slightly changes the reclamation policy. Previously, for allocations under 1MB we would attempt to find a small span of the right size. Now, we just try to find any number of spans totaling the right size. This may increase heap fragmentation, but that will be dealt with using virtual memory tricks in follow-up CLs.

For #14045.

Garbage-heavy benchmarks show very little change, except what appears to be a decrease in STW times and peak RSS.

```
name                      old STW-ns/GC       new STW-ns/GC       delta
Garbage/benchmem-MB=64-8  263k ±64%           217k ±24%           -17.66%  (p=0.028 n=25+23)

name                      old STW-ns/op       new STW-ns/op       delta
Garbage/benchmem-MB=64-8  9.39k ±65%          7.80k ±24%          -16.88%  (p=0.037 n=25+23)

name                      old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=64-8  281M ± 0%           249M ± 4%           -11.40%  (p=0.000 n=19+18)
```

https://perf.golang.org/search?q=upload:20181005.1

Go1 benchmarks perform roughly the same, the most notable regression being the JSON encode/decode benchmark, which worsens by ~2%.

```
name                     old time/op    new time/op    delta
BinaryTree17-8           3.02s ± 2%     2.99s ± 2%     -1.18%  (p=0.000 n=25+24)
Fannkuch11-8             3.05s ± 1%     3.02s ± 2%     -1.20%  (p=0.000 n=25+25)
FmtFprintfEmpty-8        43.6ns ± 5%    43.4ns ± 3%    ~       (p=0.528 n=25+25)
FmtFprintfString-8       74.9ns ± 3%    73.4ns ± 1%    -2.03%  (p=0.001 n=25+24)
FmtFprintfInt-8          79.3ns ± 3%    77.9ns ± 1%    -1.73%  (p=0.003 n=25+25)
FmtFprintfIntInt-8       119ns ± 6%     116ns ± 0%     -2.68%  (p=0.000 n=25+18)
FmtFprintfPrefixedInt-8  134ns ± 4%     132ns ± 1%     -1.52%  (p=0.004 n=25+25)
FmtFprintfFloat-8        240ns ± 1%     241ns ± 1%     ~       (p=0.403 n=24+23)
FmtManyArgs-8            543ns ± 1%     537ns ± 1%     -1.00%  (p=0.000 n=25+25)
GobDecode-8              6.88ms ± 1%    6.92ms ± 4%    ~       (p=0.088 n=24+22)
GobEncode-8              5.92ms ± 1%    5.93ms ± 1%    ~       (p=0.898 n=25+24)
Gzip-8                   267ms ± 2%     266ms ± 2%     ~       (p=0.213 n=25+24)
Gunzip-8                 35.4ms ± 1%    35.6ms ± 1%    +0.70%  (p=0.000 n=25+25)
HTTPClientServer-8       104µs ± 2%     104µs ± 2%     ~       (p=0.686 n=25+25)
JSONEncode-8             9.67ms ± 1%    9.80ms ± 4%    +1.32%  (p=0.000 n=25+25)
JSONDecode-8             47.7ms ± 1%    48.8ms ± 5%    +2.33%  (p=0.000 n=25+25)
Mandelbrot200-8          4.87ms ± 1%    4.91ms ± 1%    +0.79%  (p=0.000 n=25+25)
GoParse-8                3.59ms ± 4%    3.55ms ± 1%    ~       (p=0.199 n=25+24)
RegexpMatchEasy0_32-8    90.3ns ± 1%    89.9ns ± 1%    -0.47%  (p=0.000 n=25+21)
RegexpMatchEasy0_1K-8    204ns ± 1%     204ns ± 1%     ~       (p=0.914 n=25+24)
RegexpMatchEasy1_32-8    84.9ns ± 0%    84.6ns ± 1%    -0.36%  (p=0.000 n=24+25)
RegexpMatchEasy1_1K-8    350ns ± 1%     348ns ± 3%     -0.59%  (p=0.007 n=25+25)
RegexpMatchMedium_32-8   122ns ± 1%     121ns ± 0%     -1.08%  (p=0.000 n=25+18)
RegexpMatchMedium_1K-8   36.1µs ± 1%    34.6µs ± 1%    -4.02%  (p=0.000 n=25+25)
RegexpMatchHard_32-8     1.69µs ± 2%    1.65µs ± 1%    -2.38%  (p=0.000 n=25+25)
RegexpMatchHard_1K-8     50.8µs ± 1%    49.4µs ± 1%    -2.69%  (p=0.000 n=25+24)
Revcomp-8                453ms ± 2%     449ms ± 3%     -0.74%  (p=0.022 n=25+24)
Template-8               63.2ms ± 2%    63.4ms ± 1%    ~       (p=0.127 n=25+24)
TimeParse-8              313ns ± 1%     315ns ± 3%     ~       (p=0.924 n=24+25)
TimeFormat-8             294ns ± 1%     292ns ± 2%     -0.65%  (p=0.004 n=23+24)
[Geo mean]               49.9µs         49.6µs         -0.65%

name                     old speed      new speed      delta
GobDecode-8              112MB/s ± 1%   110MB/s ± 4%   -1.00%  (p=0.036 n=24+24)
GobEncode-8              130MB/s ± 1%   129MB/s ± 1%   ~       (p=0.894 n=25+24)
Gzip-8                   72.7MB/s ± 2%  73.0MB/s ± 2%  ~       (p=0.208 n=25+24)
Gunzip-8                 549MB/s ± 1%   545MB/s ± 1%   -0.70%  (p=0.000 n=25+25)
JSONEncode-8             201MB/s ± 1%   198MB/s ± 3%   -1.29%  (p=0.000 n=25+25)
JSONDecode-8             40.7MB/s ± 1%  39.8MB/s ± 5%  -2.23%  (p=0.000 n=25+25)
GoParse-8                16.2MB/s ± 4%  16.3MB/s ± 1%  ~       (p=0.211 n=25+24)
RegexpMatchEasy0_32-8    354MB/s ± 1%   356MB/s ± 1%   +0.47%  (p=0.000 n=25+21)
RegexpMatchEasy0_1K-8    5.00GB/s ± 0%  4.99GB/s ± 1%  ~       (p=0.588 n=24+24)
RegexpMatchEasy1_32-8    377MB/s ± 1%   378MB/s ± 1%   +0.39%  (p=0.000 n=25+25)
RegexpMatchEasy1_1K-8    2.92GB/s ± 1%  2.94GB/s ± 3%  +0.65%  (p=0.008 n=25+25)
RegexpMatchMedium_32-8   8.14MB/s ± 1%  8.22MB/s ± 1%  +0.98%  (p=0.000 n=25+24)
RegexpMatchMedium_1K-8   28.4MB/s ± 1%  29.6MB/s ± 1%  +4.19%  (p=0.000 n=25+25)
RegexpMatchHard_32-8     18.9MB/s ± 2%  19.4MB/s ± 1%  +2.43%  (p=0.000 n=25+25)
RegexpMatchHard_1K-8     20.2MB/s ± 1%  20.7MB/s ± 1%  +2.76%  (p=0.000 n=25+24)
Revcomp-8                561MB/s ± 2%   566MB/s ± 3%   +0.75%  (p=0.021 n=25+24)
Template-8               30.7MB/s ± 2%  30.6MB/s ± 1%  ~       (p=0.131 n=25+24)
[Geo mean]               120MB/s        121MB/s        +0.48%
```

https://perf.golang.org/search?q=upload:20181004.6

Change-Id: I97f9fee34577961a116a8ddd445c6272253f0f95
Reviewed-on: https://go-review.googlesource.com/c/139837
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Change https://golang.org/cl/144717 mentions this issue:
Change https://golang.org/cl/144718 mentions this issue:
This change adds a method for computing a treap node's predecessor to the treap, which will simplify the implementation of algorithms used for heap growth scavenging.

For #14045.

Change-Id: Id203e4bd246db3504f2f0c5163ec36f4579167df
Reviewed-on: https://go-review.googlesource.com/c/144717
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
This change adds a method for computing a treap node's successor to the treap, which will simplify the implementation of algorithms used for heap growth scavenging.

For #14045.

Change-Id: If2af3f2707dbcbef5fb6e42cb2712061f9da5129
Reviewed-on: https://go-review.googlesource.com/c/144718
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
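For illustration, here is what a successor operation looks like on a plain binary search tree with parent pointers; the runtime's treap version follows the same shape, but this is a generic sketch, not the actual runtime code:

```go
package main

import "fmt"

type node struct {
	key         int
	left, right *node
	parent      *node
}

// succ returns the node with the smallest key greater than n.key,
// or nil if n holds the maximum key.
func (n *node) succ() *node {
	if n.right != nil {
		// Leftmost node of the right subtree.
		m := n.right
		for m.left != nil {
			m = m.left
		}
		return m
	}
	// Otherwise walk up until we arrive from a left child.
	m := n
	for m.parent != nil && m == m.parent.right {
		m = m.parent
	}
	return m.parent
}

func main() {
	//      2
	//     / \
	//    1   3
	root := &node{key: 2}
	root.left = &node{key: 1, parent: root}
	root.right = &node{key: 3, parent: root}
	fmt.Println(root.succ().key)      // 3
	fmt.Println(root.left.succ().key) // 2
}
```

Predecessor is the mirror image: the rightmost node of the left subtree, or the first ancestor reached from a right child.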
This change adds a new treap to mheap which contains scavenged spans (i.e., spans whose physical pages were returned to the OS). As of this change, spans may no longer be partially scavenged.

For #14045.

Change-Id: I0d428a255c6d3f710b9214b378f841b997df0993
Reviewed-on: https://go-review.googlesource.com/c/139298
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
This change removes npreleased from mspan, since spans may now either be scavenged or not scavenged; how many of their pages were actually scavenged doesn't matter. It saves some space in mspan overhead too, as the boolean fits into what would otherwise be struct padding.

For #14045.

Change-Id: I63f25a4d98658f5fe21c6a466fc38c59bfc5d0f5
Reviewed-on: https://go-review.googlesource.com/c/139737
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Currently, we mark a whole span as sysUsed before trimming, but this unnecessarily tells the OS that the trimmed section of the span is in use when it may in fact have been scavenged. Overall, this change just makes invocations of sysUsed a little more fine-grained. It does come with the caveat that heap_released now needs to be managed a little more carefully in allocSpanLocked. In this case, we choose (as before this change) to negate any effect the span has on heap_released before trimming, then add it back if the trimmed part is scavengable.

For #14045.

Change-Id: Ifa384d989611398bfad3ca39d3bb595a5962a3ea
Reviewed-on: https://go-review.googlesource.com/c/140198
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
This change introduces a test to malloc_test which checks for overuse of physical memory in the large object treap. Due to fragmentation, there may be many pages of physical memory that sit unused in large-object space.

For #14045.

Change-Id: I3722468f45063b11246dde6301c7ad02ae34be55
Reviewed-on: https://go-review.googlesource.com/c/138918
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Change https://golang.org/cl/145999 mentions this issue:
Change https://golang.org/cl/156417 mentions this issue:
Change https://golang.org/cl/157837 mentions this issue:
Change https://golang.org/cl/158078 mentions this issue:
As a result of changes earlier in Go 1.12, the scavenger became much more aggressive. In particular, when scavenged and unscavenged spans coalesced, they would always become scavenged. This resulted in most spans becoming scavenged over time. While this is good for keeping the RSS of the program low, it also causes many more undue page faults and many more calls to madvise. For most applications, the impact was negligible, but for applications that repeatedly grow and shrink the heap by large amounts, the overhead can be significant. The overhead was especially obvious on older versions of Linux where MADV_FREE isn't available and MADV_DONTNEED must be used.

This change makes it so that scavenged spans never coalesce with unscavenged spans. This results in fewer page faults overall. Aside from this, the expected impact of this change is more heap growths on average, as span allocations will be less likely to be fulfilled. To mitigate this slightly, this change also coalesces spans eagerly after scavenging, to at least ensure that all scavenged spans and all unscavenged spans are coalesced with each other.

Also, this change adds additional logic for the case where two adjacent spans cannot coalesce. In this case, on platforms where the physical page size is larger than the runtime's page size, we realign the boundary between the two adjacent spans to a physical page boundary. The advantage of this approach is that "unscavengable" spans (spans which cannot be scavenged because they don't cover even a single physical page) are grown to a size where they have a higher likelihood of being discovered by the runtime's scavenging mechanisms when they border a scavenged span. This helps prevent the runtime from accruing pockets of "unscavengable" memory in between scavenged spans, which would prevent those spans from coalescing. We specifically choose to apply this logic to all spans because it simplifies the code, even though it isn't strictly necessary.

The expectation is that this change will result in a slight loss in performance on platforms where the physical page size is larger than the runtime page size.

Update #14045.

Change-Id: I64fd43eac1d6de6f51d7a2ecb72670f10bb12589
Reviewed-on: https://go-review.googlesource.com/c/158078
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
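The boundary realignment amounts to rounding addresses to physical page boundaries. A self-contained sketch of the arithmetic, with an assumed physical page size (the runtime reads the real one at startup):

```go
package main

import "fmt"

const physPageSize uintptr = 16 << 10 // assumed 16KB physical pages

// alignUp rounds addr up to the next physical page boundary.
// physPageSize must be a power of two, as it is in the runtime.
func alignUp(addr uintptr) uintptr {
	return (addr + physPageSize - 1) &^ (physPageSize - 1)
}

// alignDown rounds addr down to a physical page boundary.
func alignDown(addr uintptr) uintptr {
	return addr &^ (physPageSize - 1)
}

func main() {
	fmt.Println(alignUp(40960))   // 49152 (3 * 16384)
	fmt.Println(alignDown(40960)) // 32768 (2 * 16384)
}
```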
Change https://golang.org/cl/159500 mentions this issue:
Because scavenged and unscavenged spans no longer coalesce, memory that is freed no longer has a high likelihood of being re-scavenged. As a result, if an application is allocating at a fast rate, it may work fast enough to undo all the scavenging work performed by the runtime's current scavenging mechanisms. This behavior is exacerbated by the global best-fit allocation policy the runtime uses, since scavenged spans are just as likely to be chosen as unscavenged spans on average.

To remedy that, we treat each allocation of scavenged space as a heap growth, and scavenge other memory to make up for the allocation.

This change makes performance of the runtime slightly worse, as now we're scavenging more often during allocation. The regression is particularly obvious with the garbage benchmark (3%), but most of the Go1 benchmarks are within the margin of noise. A follow-up change should help.

Garbage: https://perf.golang.org/search?q=upload:20190131.3
Go1: https://perf.golang.org/search?q=upload:20190131.2

Updates #14045.

Change-Id: I44a7e6586eca33b5f97b6d40418db53a8a7ae715
Reviewed-on: https://go-review.googlesource.com/c/159500
Reviewed-by: Austin Clements <austin@google.com>
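A toy model of that accounting (hypothetical names, simplified far beyond the runtime's actual bookkeeping): allocating from scavenged memory triggers an equal amount of scavenging elsewhere, so the total amount of released memory stays roughly constant.

```go
package main

import "fmt"

// heap is a toy model: released tracks bytes currently returned to the OS.
type heap struct {
	released uintptr
}

// allocate serves a request; if it consumes scavenged (released) memory,
// scavenge the same number of bytes elsewhere, as if the heap had grown.
func (h *heap) allocate(bytes uintptr, fromScavenged bool) {
	if fromScavenged {
		h.released -= bytes // these pages get faulted back in
		h.scavenge(bytes)   // make up for it elsewhere
	}
}

func (h *heap) scavenge(bytes uintptr) {
	h.released += bytes
}

func main() {
	h := &heap{released: 1 << 20}
	h.allocate(512<<10, true)
	fmt.Println(h.released) // 1048576: unchanged overall
}
```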
Consider the following program that I run with Go 1.5.2 on 64-bit Fedora Linux:
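(The original snippet wasn't preserved in this copy of the issue; the following is a minimal reconstruction that matches the description below.)

```go
package main

import "fmt"

func main() {
	a := make([]byte, 185*1024*1024)
	for i := 0; i < len(a); i += 4096 {
		a[i] = 'x' // touch every page so the OS commits the memory
	}
	fmt.Printf("%c\n", a[0])
}
```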
It allocates a 185MB byte array and then forces the OS to commit memory to it by touching all the pages. This program runs OK and prints the expected
x
even if I restrict the size of available virtual memory per process to 200MB using ulimit.

Now consider a modification of the program:
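(Again reconstructed from the description; the exact original code was not preserved.)

```go
package main

import "fmt"

func main() {
	a := make([]byte, 85*1024*1024)
	for i := 0; i < len(a); i += 4096 {
		a[i] = 'x'
	}
	a = nil
	// runtime.GC() // the variant mentioned below adds an explicit GC here
	a = make([]byte, 150*1024*1024)
	for i := 0; i < len(a); i += 4096 {
		a[i] = 'x'
	}
	fmt.Printf("%c\n", a[0])
}
```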
It first allocates 85MB, then clears the reference to the slice, and then allocates 150MB. This time, under the same 200MB ulimit, it fails:
The same failure happens even with an explicit GC call after a = nil:
Is it just a runtime bug? If not, how can I force the runtime to release a large allocation?