runtime: TestGcSys fails on linux/amd64 #28574
I am unable to reproduce this. Does this happen to you consistently?
@agnivade It's only happened once, and I don't seem to be able to reproduce it.
I just caught one in a local build.

I got this when running the tests; here are my /proc/cpuinfo and /proc/meminfo. It didn't happen again when I reran the tests.
Just tried to build the most recent version of Go (1.12.1) and got the same message, with exit code "Failed: exit status 1".

cpuinfo: [...]
/proc/meminfo: [...]
git log -1: [...]
Can we please just remove this test? It is failing on 32-bit platforms for me.
@kmcfate, if you're observing a different (or consistent) failure mode on 32-bit platforms, please open a separate issue; it may be easier to diagnose and fix that failure mode than the intermittent one reported here.
As of this afternoon, it seems to be failing for me with a high test count:

    $ cd `go env GOROOT`/src/runtime
    $ go test -count 100 -test.run=TestGcSys .
    --- FAIL: TestGcSys (0.05s)
        gc_test.go:33: expected "OK\n", but got "using too much memory: 71141376 bytes\n"
    --- FAIL: TestGcSys (0.05s)
        gc_test.go:33: expected "OK\n", but got "using too much memory: 71075840 bytes\n"
    --- FAIL: TestGcSys (0.05s)
        gc_test.go:33: expected "OK\n", but got "using too much memory: 71141376 bytes\n"
    FAIL
    FAIL    runtime 4.407s

My work machine is a linux/amd64 Xeon CPU E5-2690.
I should add that I am working off tip (19966e9).
I get (and have been getting) these failures sporadically on some of the 64-bit x86 Linux machines that I build Go on, and I can consistently reproduce the problem with @thanm's process. The pattern for me is that machines with large amounts of memory fail and machines with smaller amounts don't. The usual machines I build Go on have 32 GB or 16 GB, and they don't seem to fail even with large repeat counts (1,000 or even 10,000). On machines with 64 GB, 96 GB, or 256 GB, I typically see a couple of failures in a single run.

I'm using a couple of versions of the current git tip (I would say the same version, but it updated a couple of times while I was building on different machines).
CC @mknyszek
It's unfortunate I didn't notice this issue earlier. For those experiencing the failures from April and earlier, I expect it should be much better or even fixed with Go 1.12.5 and Go 1.13 (see #31616). As for the recent failures on tip, I'm unable to reproduce on a linux/amd64 machine with 30 GiB of memory, but I'm trying now on a beefier machine (based on @siebenmann's comments). If I can reproduce it, I'll try to bisect.
OK, so bisection didn't really work, but I also misunderstood what the test was checking. I thought it was checking whether GCSys was increasing too much, but it's not. It checks whether Sys is increasing, which... well, if you get unlucky with heap layout (or with a different number of GCs run at different times) you could end up mapping in another arena, which is why we see numbers like 70 MiB popping out in the failures so consistently. Issues affecting #31616 definitely made this fail much more often, but I think that's no longer the fundamental issue.

I think something weird is happening with GC pacing. Consider the GC trace for this successful run (the vast majority of them):
and this failing run:
At the very end of the latter, the GC assumes a heap growth is happening amid all the rapid allocation and scales up the heap goal, which results in an actual heap growth. I'm not really sure what's going on here, and it'll be difficult to bisect because I'm pretty sure any regression related to #31616 makes the history messy. I'll keep looking into it, though.
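For concreteness, here is a minimal sketch of the kind of Sys-growth check being discussed; it is not the actual gc_test.go/testprog code, and the allocation size, iteration count, and 16 MiB threshold are illustrative assumptions:

```go
package main

import (
	"fmt"
	"runtime"
)

//go:noinline
func workTheGC() []byte {
	// Allocate short-lived garbage that the GC should reclaim promptly.
	return make([]byte, 1<<16)
}

func main() {
	runtime.GOMAXPROCS(1)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	sysBefore := ms.Sys // bytes obtained from the OS so far

	// Allocate rapidly in a loop, dropping every allocation on the floor.
	for i := 0; i < 100000; i++ {
		_ = workTheGC()
	}

	runtime.ReadMemStats(&ms)
	if growth := ms.Sys - sysBefore; growth >= 16<<20 {
		// Mirrors the failure mode above: Sys jumped by an arena-sized chunk.
		fmt.Printf("using too much memory: %d bytes\n", growth)
		return
	}
	fmt.Println("OK")
}
```

If pacing goes wrong and the heap goal balloons during the loop, the runtime can map an additional arena and Sys jumps by tens of MiB at once, which lines up with the ~70 MiB figures in the failures above.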
Actually, in the failing run, the last two GCs are quite long overall, with most of that time spent in concurrent mark. It looks like something is occasionally preventing a transition into mark termination?

For posterity: I ran this on a linux/amd64 machine with 64 GiB of memory and at @thanm's commit. Also, at this point I don't think the amount of memory available on the system has much to do with what's going wrong.
Thinking about it more, it may also be relevant that the large-memory machines I have access to are also high-CPU-count machines, and in most cases also have NUMA memory. In fact, on the single high-memory, high-CPU, but non-NUMA machine I have access to, it seems harder (but not impossible) to reproduce this issue. On some but not all of these NUMA systems, using [...]

(The successful [...])
OK, so I think I've gotten close to the bottom of this. To make the analysis below clearer, here's an overview of the test: it sets GOMAXPROCS=1 and allocates rapidly in a loop, checking that Sys doesn't grow by too much.

The GCs that take a relatively long time invariably happen as a result of nothing actually doing any marking work. If you take a look at an execution trace of a bad run, you'll see that because [...]

The first really long GC then doesn't do any mark assists for a long time. Using [...]

Basically what's happening is that for most GCs mark assist is doing most of the work, and the pace is being kept. Sometime during the first "bad" GC the fractional background mark worker kicks in for a little bit and accumulates a bunch of credit. The next time mark assist kicks in on the one running G, it tries to over-assist (but to be fair it does that every time; there usually just isn't much credit to steal), and then steals a ton of credit (around 30-40 MiB worth; it's very consistently close to the new heap goal for the first "bad" GC). Thus the next mark assist is scheduled to happen much later than before. But the fractional background mark worker doesn't kick in at all for the rest of the GC, because the "fractional" part of it isn't very tightly controlled. So the one running G allocates until it finally does a mark assist, and suddenly by the end of the GC [...]

With all that being said, I'm not actually sure what the solution here is. On a side note, if I increase [...]
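A highly simplified model of the credit dynamic described above; the structure and the numbers are illustrative assumptions, not the runtime's actual accounting:

```go
package main

import "fmt"

func main() {
	const heapGoal = 4 << 20 // pretend heap goal, in bytes

	var (
		gCredit  int64 // per-goroutine assist credit: bytes it may allocate before assisting
		bgCredit int64 // credit banked by the background/fractional mark worker
	)

	alloc := func(n int64) {
		gCredit -= n
		if gCredit >= 0 {
			return // still in credit; no assist needed
		}
		// In debt: first try to steal banked background credit instead of scanning.
		steal := -gCredit
		if steal > bgCredit {
			steal = bgCredit
		}
		bgCredit -= steal
		gCredit += steal
		if gCredit < 0 {
			fmt.Printf("assist: must do %d bytes of scan work\n", -gCredit)
			gCredit = 0
		}
	}

	// The fractional worker runs briefly and banks a heap-goal-sized pile of credit...
	bgCredit = heapGoal

	// ...so the lone running goroutine can allocate roughly heapGoal bytes
	// without assisting at all, letting the heap overshoot its goal.
	for i := 0; i < 80; i++ {
		alloc(64 << 10)
	}
}
```

In this toy model the banked credit postpones the goroutine's next assist by roughly a heap goal's worth of allocation, which mirrors the 30-40 MiB of stolen credit and the late, oversized mark assist described above.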
@bcmills no progress since the last post, but I'm looking at this again now and getting a little further (I think). Will post back here when I have more info.
Following the trail of things that seemed strange to me, I arrived at https://go.googlesource.com/go/+/refs/tags/go1.13beta1/src/runtime/mgc.go#492. Basically, there would be a window during which [...]

@aclements gave me a bunch of background on why the code at mgc.go#492 works the way it does and offered a suggestion: if you already know you've done more scan work than you would expect, you should already pace yourself against the "hard" GC goal, since you've already violated the "steady-state" 50% heap-live assumption. I modified the code to do exactly this and am running the test in a loop. It's already been about 10 minutes without a failure (compared to usually getting a failure within 30 seconds). I'm going to keep running this in the background, and if it goes a few hours without a failure I'm going to call this problem solved.

This code change shouldn't affect most applications (most aren't just allocating in a loop and dropping pointers), but I'll run some benchmarks against it and see how they fare.
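A rough sketch of the pacing idea described above; the function shape and the 1.1x hard-goal factor are assumptions for illustration, not the runtime's actual pacer code (the real change is in the CL linked below):

```go
package main

import "fmt"

// heapGoalForAssistPacing picks which goal to pace mark assists against.
// If the mark phase has already done more scan work than the steady-state
// estimate, the optimistic soft goal is no longer trustworthy, so pace
// against a hard goal slightly above it instead.
func heapGoalForAssistPacing(scanWorkDone, scanWorkExpected, softGoal int64) int64 {
	const hardGoalFactor = 1.1 // illustrative overshoot allowance
	if scanWorkDone <= scanWorkExpected {
		return softGoal
	}
	return int64(float64(softGoal) * hardGoalFactor)
}

func main() {
	const softGoal = 8 << 20 // 8 MiB soft goal for the example
	fmt.Println(heapGoalForAssistPacing(2<<20, 4<<20, softGoal)) // on pace: keep the soft goal
	fmt.Println(heapGoalForAssistPacing(6<<20, 4<<20, softGoal)) // over budget: fall back to the hard goal
}
```

The sketch only captures the decision being added; the actual fix lives in the runtime's assist-ratio computation.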
Change https://golang.org/cl/184097 mentions this issue.
Oops, also @bcmills, I forgot to say that yes, those two issues are also reporting the same problem as the one described here, AFAICT.
Yeah, OK, it's been an hour of continuous execution without a failure, so I'm calling it good. Spinning up those benchmarks I mentioned now.
Just to circle back on this: yep, there's little-to-no change in performance. It looks good overall, I think.
To update: just waiting on a review at this point.
This change makes it so that if we're already finding ourselves in a situation where we've done more scan work than expected in the steady state (that is, 50% of heap_scan for GOGC=100), then we fall back on the hard heap goal instead of continuing to assume the expected case.

In some cases it's possible that we're already doing more scan work than expected, and if GC assists come in just at that window where we notice it, they might accumulate way too much assist credit, causing undue heap growths if GOMAXPROCS=1 (since the fractional background worker isn't guaranteed to fire). This case seems awfully specific, and that's because it's exactly the case for TestGcSys, which has been flaky for some time as a result.

Fixes golang#28574, golang#27636, and golang#27156.

Change-Id: I771f42bed34739dbb1b84ad82cfe247f70836031
Reviewed-on: https://go-review.googlesource.com/c/go/+/184097
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

It reproduces at master.

What operating system and processor architecture are you using (go env)?

go env Output:

What did you do?

Tried to build Go at master with ./all.bash.

What did you expect to see?

Test passing with:

What did you see instead?

Test failed with:

This is the same issue as #27636 and #27156, but for linux/amd64 under Fedora 29.