
Commit 44078a3

runtime: adjust huge page flags only on huge page granularity
This fixes an issue where the runtime panics with "out of memory" or "cannot allocate memory" even though there is ample memory, by reducing the number of memory mappings created by the memory allocator.

Commit 7e1b61c worked around issue #8832, where Linux's transparent huge page support could dramatically increase the RSS of a Go process, by setting the MADV_NOHUGEPAGE flag on any regions of pages released to the OS with MADV_DONTNEED. This had the side effect of also increasing the number of VMAs (memory mappings) in a Go address space, because a separate VMA is needed for every region of the virtual address space with different flags. Unfortunately, by default, Linux limits the number of VMAs in an address space to 65530, and a large heap can quickly reach this limit when the runtime starts scavenging memory.

This commit dramatically reduces the number of VMAs. It does this primarily by only adjusting the huge page flag at huge page granularity. With this change, on amd64, even a pessimal heap that alternates between MADV_NOHUGEPAGE and MADV_HUGEPAGE regions must grow to 128GB before it hits the VMA limit. Because of this rounding to huge page granularity, this change is also careful to leave large used and unused regions huge page-enabled.

This change reduces the maximum number of VMAs during the runtime benchmarks with GODEBUG=scavenge=1 from 692 to 49.

Fixes #12233.

Change-Id: Ic397776d042f20d53783a1cacf122e2e2db00584
Reviewed-on: https://go-review.googlesource.com/15191
Reviewed-by: Keith Randall <khr@golang.org>
1 parent 9a31d38 commit 44078a3
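
The 128GB figure is straightforward arithmetic on kernel defaults: once flags are adjusted only at huge page granularity, every VMA spans at least one whole huge page, so the worst case is one 2MB huge page per VMA. A back-of-the-envelope sketch of that bound (standalone and illustrative only, not part of the commit; the constants are amd64's 2MB huge page size and Linux's default vm.max_map_count):

package main

import "fmt"

func main() {
	const hugePageSize = 2 << 20 // amd64 transparent huge page size (2MB)
	const maxMapCount = 65530    // Linux default sysctl vm.max_map_count

	// A pessimal heap alternates MADV_NOHUGEPAGE and MADV_HUGEPAGE
	// regions, but each region still covers at least one whole huge
	// page, so the heap must reach roughly hugePageSize*maxMapCount
	// before the kernel refuses to create more VMAs.
	fmt.Printf("heap size at the VMA limit: ~%.0fGB\n",
		float64(hugePageSize)*maxMapCount/(1<<30))
}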

File tree

1 file changed, 77 insertions(+), 17 deletions(-)

src/runtime/mem_linux.go

@@ -69,29 +69,89 @@ func sysAlloc(n uintptr, sysStat *uint64) unsafe.Pointer {
 }
 
 func sysUnused(v unsafe.Pointer, n uintptr) {
-	var s uintptr = hugePageSize // division by constant 0 is a compile-time error :(
-	if s != 0 && (uintptr(v)%s != 0 || n%s != 0) {
-		// See issue 8832
-		// Linux kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=93111
-		// Mark the region as NOHUGEPAGE so the kernel's khugepaged
-		// doesn't undo our DONTNEED request. khugepaged likes to migrate
-		// regions which are only partially mapped to huge pages, including
-		// regions with some DONTNEED marks. That needlessly allocates physical
-		// memory for our DONTNEED regions.
-		madvise(v, n, _MADV_NOHUGEPAGE)
+	// By default, Linux's "transparent huge page" support will
+	// merge pages into a huge page if there's even a single
+	// present regular page, undoing the effects of the DONTNEED
+	// below. On amd64, that means khugepaged can turn a single
+	// 4KB page into a 2MB huge page, bloating the process's RSS
+	// by as much as 512X. (See issue #8832 and Linux kernel bug
+	// https://bugzilla.kernel.org/show_bug.cgi?id=93111)
+	//
+	// To work around this, we explicitly disable transparent huge
+	// pages when we release pages of the heap. However, we have
+	// to do this carefully because changing this flag tends to
+	// split the VMA (memory mapping) containing v into three
+	// VMAs in order to track the different values of the
+	// MADV_NOHUGEPAGE flag in the different regions. There's a
+	// default limit of 65530 VMAs per address space (sysctl
+	// vm.max_map_count), so we must be careful not to create too
+	// many VMAs (see issue #12233).
+	//
+	// Since huge pages are huge, there's little use in adjusting
+	// the MADV_NOHUGEPAGE flag on a fine granularity, so we avoid
+	// exploding the number of VMAs by only adjusting the
+	// MADV_NOHUGEPAGE flag on a large granularity. This still
+	// gets most of the benefit of huge pages while keeping the
+	// number of VMAs under control. With hugePageSize = 2MB, even
+	// a pessimal heap can reach 128GB before running out of VMAs.
+	if hugePageSize != 0 {
+		var s uintptr = hugePageSize // division by constant 0 is a compile-time error :(
+
+		// If it's a large allocation, we want to leave huge
+		// pages enabled. Hence, we only adjust the huge page
+		// flag on the huge pages containing v and v+n-1, and
+		// only if those aren't aligned.
+		var head, tail uintptr
+		if uintptr(v)%s != 0 {
+			// Compute huge page containing v.
+			head = uintptr(v) &^ (s - 1)
+		}
+		if (uintptr(v)+n)%s != 0 {
+			// Compute huge page containing v+n-1.
+			tail = (uintptr(v) + n - 1) &^ (s - 1)
+		}
+
+		// Note that madvise will return EINVAL if the flag is
+		// already set, which is quite likely. We ignore
+		// errors.
+		if head != 0 && head+hugePageSize == tail {
+			// head and tail are different but adjacent,
+			// so do this in one call.
+			madvise(unsafe.Pointer(head), 2*hugePageSize, _MADV_NOHUGEPAGE)
+		} else {
+			// Advise the huge pages containing v and v+n-1.
+			if head != 0 {
+				madvise(unsafe.Pointer(head), hugePageSize, _MADV_NOHUGEPAGE)
+			}
+			if tail != 0 && tail != head {
+				madvise(unsafe.Pointer(tail), hugePageSize, _MADV_NOHUGEPAGE)
+			}
+		}
 	}
+
 	madvise(v, n, _MADV_DONTNEED)
 }
 
 func sysUsed(v unsafe.Pointer, n uintptr) {
 	if hugePageSize != 0 {
-		// Undo the NOHUGEPAGE marks from sysUnused. There is no alignment check
-		// around this call as spans may have been merged in the interim.
-		// Note that this might enable huge pages for regions which were
-		// previously disabled. Unfortunately there is no easy way to detect
-		// what the previous state was, and in any case we probably want huge
-		// pages to back our heap if the kernel can arrange that.
-		madvise(v, n, _MADV_HUGEPAGE)
+		// Partially undo the NOHUGEPAGE marks from sysUnused
+		// for whole huge pages between v and v+n. This may
+		// leave huge pages off at the end points v and v+n
+		// even though allocations may cover these entire huge
+		// pages. We could detect this and undo NOHUGEPAGE on
+		// the end points as well, but it's probably not worth
+		// the cost, because when neighboring allocations are
+		// freed sysUnused will just set NOHUGEPAGE again.
+		var s uintptr = hugePageSize
+
+		// Round v up to a huge page boundary.
+		beg := (uintptr(v) + (s - 1)) &^ (s - 1)
+		// Round v+n down to a huge page boundary.
+		end := (uintptr(v) + n) &^ (s - 1)
+
+		if beg < end {
+			madvise(unsafe.Pointer(beg), end-beg, _MADV_HUGEPAGE)
+		}
 	}
 }
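
For concreteness, the rounding that both functions rely on can be exercised in isolation. The following standalone sketch mirrors the head/tail computation in sysUnused and the beg/end computation in sysUsed; the address and length are made-up illustrative values, and 2MB huge pages are assumed:

package main

import "fmt"

func main() {
	const s uintptr = 2 << 20 // assumed huge page size (2MB, as on amd64)

	// An arbitrary region [v, v+n) that is unaligned on both ends.
	var v uintptr = 0x7f0000123000
	var n uintptr = 6 << 20

	// sysUnused: the huge pages containing the unaligned end points
	// get MADV_NOHUGEPAGE.
	head := v &^ (s - 1)           // round v down to a huge page boundary
	tail := (v + n - 1) &^ (s - 1) // round v+n-1 down to a huge page boundary

	// sysUsed: only whole huge pages inside [v, v+n) get MADV_HUGEPAGE
	// again; the partial huge pages at the ends are left alone.
	beg := (v + s - 1) &^ (s - 1) // round v up to a huge page boundary
	end := (v + n) &^ (s - 1)     // round v+n down to a huge page boundary

	fmt.Printf("head=%#x tail=%#x\n", head, tail)
	fmt.Printf("beg=%#x end=%#x (%dMB re-enabled)\n", beg, end, (end-beg)>>20)
}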
