Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: possible memory corruption on FreeBSD #46272

Closed
mknyszek opened this issue May 19, 2021 · 86 comments
Closed

runtime: possible memory corruption on FreeBSD #46272

mknyszek opened this issue May 19, 2021 · 86 comments
Labels
ExpertNeeded FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-FreeBSD
Milestone

Comments

@mknyszek
Copy link
Contributor

Several failures in the last month on FreeBSD builders have failure modes that are very difficult to explain, such as SIGSEGVs in hot runtime paths accessing values that don't quite make sense (e.g. spanOf sees a bad arena entry, fixalloc accesses an out-of-bounds slot, a broken return PC in a runtime stack frame). I suspect they share an underlying cause. Three issues have already been opened for these: #45977, #46103, #46182.

As far as I know, these all seem to be specific to FreeBSD, and even more specifically, the "race" and "12_2" builders.

The relevant logs are available below.

2021-05-04T20:50:35-d19e549/freebsd-amd64-race
2021-05-10T23:42:56-5c48951/freebsd-amd64-12_2
2021-05-14T16:42:01-3d324f1/freebsd-amd64-race

@dmitshur
Copy link
Contributor

For information, the freebsd-amd64-12_2 builder was recently added as part of #44431, and in that same change, the freebsd-amd64-race builder was updated to also use FreeBSD 12.2 (up from FreeBSD 12.0 previously).

CC @paulzhol, @dmgk, @cagedmantis.

@mknyszek
Copy link
Contributor Author

If it's correlated with FreeBSD being updated, this may not be a release blocker. We should still probably figure out what's wrong, but I don't have any good ideas besides stress-testing all.bash.

It's also still possible that this is a Go issue, but just that it's only a problem on FreeBSD 12.2. Between when the builders got updated (looks like... April 23rd) and when the first failure happened, there's about 2 weeks. Also those two weeks happened to have the last week before the freeze.

@mknyszek
Copy link
Contributor Author

Running all.bash in a loop on 10 FreeBSD 12.2 instances. Let's see if I can get it to fail...

@dmitshur dmitshur added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 27, 2021
@mknyszek
Copy link
Contributor Author

109 all.bash runs in and nothing yet.

@mknyszek
Copy link
Contributor Author

I stand corrected! I do actually have a failure that looks promising. Again, in fixalloc.

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x84cf86050 pc=0x41f43b]

runtime stack:
runtime.throw({0x67b9e2, 0x2a})
        /tmp/workdir/go/src/runtime/panic.go:1198 +0x74 fp=0x7fffdddeccf8 sp=0x7fffdddeccc8 pc=0x43ed5
4
runtime.sigpanic()
        /tmp/workdir/go/src/runtime/signal_unix.go:719 +0x4a5 fp=0x7fffdddecd58 sp=0x7fffdddeccf8 pc=0
x457545
runtime.(*fixalloc).alloc(0x8404f0)
        /tmp/workdir/go/src/runtime/mfixalloc.go:72 +0x3b fp=0x7fffdddecd98 sp=0x7fffdddecd58 pc=0x41f
43b
runtime.allocmcache.func1()
        /tmp/workdir/go/src/runtime/mcache.go:88 +0x48 fp=0x7fffdddecdc0 sp=0x7fffdddecd98 pc=0x41af08
runtime.allocmcache()
        /tmp/workdir/go/src/runtime/mcache.go:86 +0x58 fp=0x7fffdddecdf8 sp=0x7fffdddecdc0 pc=0x41ae58
runtime.(*p).init(0xc00019c800, 0x10)
        /tmp/workdir/go/src/runtime/proc.go:4860 +0x114 fp=0x7fffdddece18 sp=0x7fffdddecdf8 pc=0x44c8b
4
runtime.procresize(0x64)
        /tmp/workdir/go/src/runtime/proc.go:5036 +0x3a9 fp=0x7fffdddecec8 sp=0x7fffdddece18 pc=0x44d1a
9
runtime.startTheWorldWithSema(0x0)
        /tmp/workdir/go/src/runtime/proc.go:1256 +0xa5 fp=0x7fffdddecf28 sp=0x7fffdddecec8 pc=0x444625
runtime.startTheWorld.func1()
        /tmp/workdir/go/src/runtime/proc.go:1093 +0x26 fp=0x7fffdddecf48 sp=0x7fffdddecf28 pc=0x474fe6
runtime.systemstack()
        /tmp/workdir/go/src/runtime/asm_amd64.s:383 +0x49 fp=0x7fffdddecf50 sp=0x7fffdddecf48 pc=0x479
ba9

@mknyszek
Copy link
Contributor Author

Due to a bug in my script, I have lost the gomote state (and any potential core dump). Re-trying. At least I've found it's reproducible (kind of).

@heschi heschi removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 10, 2021
@heschi
Copy link
Contributor

heschi commented Jun 10, 2021

Ping -- I understand this is a tricky one, but it does still seem important to resolve in some way. Worst case we might need a prominent release note.

Do we know if this is a regression in Go? That seems worth understanding.

@mknyszek
Copy link
Contributor Author

I think a prominent release note is overkill. It's unclear where the regression is. Given the frequency of failure, it is still possible it's a FreeBSD 12.2 x Go 1.17 thing.

I was trying to reproduce it in gomotes but failed to since that one time. I'm going to check the logs again and update this. I'll also spin up the gomotes again.

@mknyszek
Copy link
Contributor Author

I still haven't seen any such failure on the builders since those three I posted earlier.

@bcmills
Copy link
Contributor

bcmills commented Jun 15, 2021

#45887 looks like memory corruption too (on the freebsd-386-12_2 builder). Possibly the same root cause?

@bcmills
Copy link
Contributor

bcmills commented Jun 16, 2021

@mknyszek
Copy link
Contributor Author

Thanks Bryan. I think those both are related. I'm still trying to reproduce.

I'm now somewhat worried that by setting ulimit -c unlimited when running these gomotes is actually preventing the failure. I can't remember now whether that bug in my script earlier prevented that. I'll take a step back and try to reproduce without that.

@bcmills
Copy link
Contributor

bcmills commented Jun 17, 2021

This could also be related to #34988, in that they both involve memory corruption on BSD variants in programs that fork subprocesses.

@mknyszek
Copy link
Contributor Author

I reproduced it!

Except run.bash thwarted my efforts to generate a core file! UGH.

I think if run.bash sees GOTRACEBACK=crash, it should not turn off core dumps.

@mknyszek
Copy link
Contributor Author

I think the theory that it's related to process startup makes sense -- we don't have very many sample points, but including the two failures I was able to reproduce, it seems like they happen in cmd/go tests, cmd/vet tests, and runtime tests. These tend to spawn a non-trivial amount of subprocesses, so it makes sense that the failure would likely happen there if this was true.

I'm refocusing my efforts on those packages' tests instead of on all.bash. Hopefully reproducing will be easier.

@bcmills
Copy link
Contributor

bcmills commented Jun 21, 2021

Maybe the same root cause, maybe different, but here is a recent SIGBUS in runtime.vdsoClockGettime, and also an older failure with a similar stack trace:
2021-06-18T22:05:09-9401172/freebsd-amd64-race
2020-07-27T19:25:51-9267083/freebsd-386-11_2

@mknyszek
Copy link
Contributor Author

Refocusing on those packages didn't help. I actually got the gomotes to run continuously all weekend on runtime tests and I got absolutely nothing.

Back to all.bash...

@mknyszek
Copy link
Contributor Author

@bcmills That 11.2 failure is interesting, and suggests maybe it isn't just a 12.2 bug?

@paulzhol
Copy link
Member

Maybe the same root cause, maybe different, but here is a recent SIGBUS in runtime.vdsoClockGettime, and also an older failure with a similar stack trace:
2021-06-18T22:05:09-9401172/freebsd-amd64-race
2020-07-27T19:25:51-9267083/freebsd-386-11_2

@bcmills there have been several fixes since I ported the initial fast gettimeofday code (stronger memory barriers) and a code re-org: https://cgit.freebsd.org/src/log/lib/libc/x86/sys/__vdso_gettc.c?h=stable/12.
Unfortunately I don't have a timeframe when I'll be able to sync with those.

Meanwhile there's a sysctl which can disable these codepaths on the builders:(https://github.com/golang/build/blob/df58bbac082bc87c4a3cdfe336d1ffe60bbaa916/env/freebsd-amd64/sysctl.conf).
A kern.timecounter.fast_gettime=0 can be added there, and the builder images rebuilt.

@emaste
Copy link

emaste commented Jan 11, 2022

I would not expect this to be fixed in a stock FreeBSD 12.3 release image (but should be fixed by an errata update available later today via FreeBSD-update).

From the change referenced above (https://golang.org/cl/375695) it looks like Go has an entry host-freebsd-12_3 which uses image freebsd-amd64-123-stable-20211230 which is a snapshot built from the stable/12 branch in late Dec 2021, it will indeed have the fix.

The distinction is likely immaterial for Go CI, but in case anyone encounters this issue on a production FreeBSD 12.3 deployment they'll want to ensure that they've updated to at least 12.3-RELEASE-p1.

@gopherbot
Copy link
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@heschi heschi removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 13, 2022
@heschi heschi reopened this Jan 13, 2022
@heschi
Copy link
Contributor

heschi commented Jan 13, 2022

@emaste is correct, I built the images from snapshot, they're not exactly -p1.

@aclements
Copy link
Member

I got through 912 iterations of all.bash on 12_3 and 898 on 12_2. I got two failures that appear to be memory corruption on 12_2 and zero on 12_3. That alone isn't enough to base particularly strong statistical conclusions on, I'm afraid, so I'm going to take a Bayesian out and say that in combination with the fact that we're already pretty sure the bug was fixed, we can close this issue as fixed.

Thank you everyone!

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 13, 2022
@dmitshur dmitshur modified the milestones: Backlog, Unreleased Jan 13, 2022
@emaste
Copy link

emaste commented Jan 14, 2022

Thanks to the incredible analysis from @rsc in #46272 (comment) I'm quite confident this is indeed fixed.

gopherbot pushed a commit that referenced this issue Jan 14, 2022
Knowing whether test failures are correlated with specific CPU models on
has proven useful on several issues. Log it for prior to testing so it
is always available.

internal/sysinfo provides the CPU model, but it is not available in the
bootstrap toolchain, so we can't access this unconditionally in
cmd/dist. Instead use a build-tagged file, as the final version of
cmd/dist will use the final toolchain.

The addition of new data to the beginning of cmd/dist output will break
x/build/cmd/coordinator's banner parsing, leaving extra lines in the log
output, though information will not be lost.
https://golang.org/cl/372538 fixes up the coordinator and should be
submitted and deployed before this CL is submitted.

For #46272.
For #49209.
For #50146.

Change-Id: I515d2ec58e4c0034b76bf624ecaab38f16146074
Reviewed-on: https://go-review.googlesource.com/c/go/+/371474
Trust: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/378589 mentions this issue: cmd/dist: log CPU model when testing

@tarkhil
Copy link

tarkhil commented Jan 19, 2022

Sorry for asking for TL;DR: is patch available or not yet?

@ianlancetaylor
Copy link
Contributor

@tarkhil We believe this is fixed by a FreeBSD update. You want to be using at least version 12.3-RELEASE-p1.

@tarkhil
Copy link

tarkhil commented Jan 19, 2022

13.0 fails. Okay, tomorrow I'll add a disk, so it's a chance to upgrade to actual patchlevel for 13.0

@emaste
Copy link

emaste commented Jan 19, 2022

For 13 the fix is in 13.0-RELEASE-p6
The advisory for this issue: https://www.freebsd.org/security/advisories/FreeBSD-EN-22:02.xsave.asc

bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this issue Jan 25, 2022
PR:	golang/go#46272
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
gopherbot pushed a commit that referenced this issue Mar 8, 2022
Knowing whether test failures are correlated with specific CPU models on
has proven useful on several issues. Log it for prior to testing so it
is always available.

internal/sysinfo provides the CPU model, but it is not available in the
bootstrap toolchain, so we can't access this in cmd/dist. Instead use a
separate binary which cmd/dist will only build once testing begins.

The addition of new data to the beginning of cmd/dist output will break
x/build/cmd/coordinator's banner parsing, leaving extra lines in the log
output, though information will not be lost.
https://golang.org/cl/372538 fixes up the coordinator and should be
submitted and deployed before this CL is submitted.

This is a redo of CL 371474. It switches back to the original approach
of using a separate binary, as the bootstap toolchain won't allow
cmd/dist to import internal packages.

For #46272.
For #49209.
For #50146.

Change-Id: I906bbda987902a2120c5183290a4e89a2440de58
Reviewed-on: https://go-review.googlesource.com/c/go/+/378589
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
Knowing whether test failures are correlated with specific CPU models on
has proven useful on several issues. Log it for prior to testing so it
is always available.

internal/sysinfo provides the CPU model, but it is not available in the
bootstrap toolchain, so we can't access this unconditionally in
cmd/dist. Instead use a build-tagged file, as the final version of
cmd/dist will use the final toolchain.

The addition of new data to the beginning of cmd/dist output will break
x/build/cmd/coordinator's banner parsing, leaving extra lines in the log
output, though information will not be lost.
https://golang.org/cl/372538 fixes up the coordinator and should be
submitted and deployed before this CL is submitted.

For golang#46272.
For golang#49209.
For golang#50146.

Change-Id: I515d2ec58e4c0034b76bf624ecaab38f16146074
Reviewed-on: https://go-review.googlesource.com/c/go/+/371474
Trust: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@rsc rsc unassigned heschi Jun 23, 2022
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
ExpertNeeded FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-FreeBSD
Projects
None yet
Development

No branches or pull requests