runtime: non-empty mark queue after concurrent mark #69803


Closed
neild opened this issue Oct 7, 2024 · 8 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

neild (Contributor) commented Oct 7, 2024

Test failure on https://go.dev/cl/617376/2, which I don't think caused the failure. The only builder to fail was gotip-linux-amd64-aliastypeparams.

runtime: full=0xc0000d38000006 next=129 jobs=128 nDataRoots=1 nBSSRoots=1 nSpanRoots=16 nStackRoots=108
panic: non-empty mark queue after concurrent mark
fatal error: panic on system stack

runtime stack:
runtime.throw({0x6dcdac?, 0x73cf70?})
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1074 +0x48 fp=0x7fdc4b7fdd30 sp=0x7fdc4b7fdd00 pc=0x472a48
panic({0x67f660?, 0x73cf70?})
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:751 +0x33b fp=0x7fdc4b7fdde0 sp=0x7fdc4b7fdd30 pc=0x47295b
runtime.gcMark(0x47b2ed?)
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/mgc.go:1531 +0x3ec fp=0x7fdc4b7fde58 sp=0x7fdc4b7fdde0 pc=0x41b78c
runtime.gcMarkTermination.func1()
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/mgc.go:980 +0x17 fp=0x7fdc4b7fde70 sp=0x7fdc4b7fde58 pc=0x41ab77
runtime.systemstack(0x800000)
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/asm_amd64.s:514 +0x4a fp=0x7fdc4b7fde80 sp=0x7fdc4b7fde70 pc=0x47792a

Full log at: https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8734717093063296065/+/u/step/11/log/2?format=raw

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 7, 2024
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 7, 2024
@mknyszek mknyszek modified the milestones: Backlog, Go1.24 Oct 7, 2024
mknyszek (Contributor) commented Oct 7, 2024

This suggests there was an outstanding GC mark buf after we'd already passed through mark termination, meaning the mark termination algorithm failed to catch something. It could also mean that GC work was generated during mark termination in a way that we missed.

However, this should already be caught here: https://cs.opensource.google/go/go/+/master:src/runtime/mgc.go;l=900;drc=123594d3863b0a4b9094a569957d1bd94ebe7512

We already know that mark termination is buggy (we very rarely miss work; see #27993). But it should never reach this back-up check, and I don't see how that would be possible. The mark buf pointer looks almost legit, but the 0x6 in the bottom bits is incredibly fishy, which makes me think that some kind of memory corruption occurred.

EDIT: The 0x6 is a red herring. This is an lfstack tagged pointer, so the lower bits are used for some lock-free shenanigans.
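
For anyone puzzling over the dumped full= value: here is a minimal sketch, purely illustrative and not the runtime's actual encoding, of how an lfstack-style tagged pointer folds an address and a small counter into one 64-bit word, which is why the low bits are not plain address bits. The constants and values below are assumptions chosen for the example.

```go
package main

import "fmt"

const (
	addrBits = 48            // assumption: addresses fit in the low 48 bits
	cntBits  = 64 - addrBits // remaining bits hold a small counter
	cntMask  = 1<<cntBits - 1
)

// pack shifts the address into the high bits and keeps a counter in the low bits.
func pack(addr, cnt uint64) uint64 { return addr<<cntBits | cnt&cntMask }

// unpack recovers the address and counter from the tagged word.
func unpack(v uint64) (addr, cnt uint64) { return v >> cntBits, v & cntMask }

func main() {
	v := pack(0xc0000d380000, 6) // hypothetical address and counter values
	addr, cnt := unpack(v)
	fmt.Printf("tagged=%#x addr=%#x cnt=%d\n", v, addr, cnt)
}
```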

mknyszek (Contributor) commented:

I have a hunch as to what's going on here.

I think this may be a super subtle bug in https://go.dev/cl/610396, related to our broken mark termination condition (#27993). In particular, the re-check we do to paper over the broken termination condition only ensures that each P's work buf is empty after a write barrier buffer flush; it does not check the global list.

What if the call to shade, which queues pointers into work bufs directly, happens enough times before the stop-the-world (in a way that queues new pointers, meaning pointers the GC had not already observed this cycle) to force a flush of a mark work buffer? On the one hand this sounds a little absurd, since each P's work buffer is something like 2 KiB, so it can hold something like 256 pointers, and that's a lot of pointers to queue for this to happen. (I think technically you might need 512, if the second checked-out buffer also happens to be empty. Not sure how likely that is.) But also, anything can happen while trying to stop the world; there's no time bound. Someone using the unique package could get unlucky and queue >256 unmarked (ha) objects in that window.
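
As a rough illustration of the kind of workload that could hit this window, here is a hypothetical stress pattern. It is not a confirmed reproducer, and whether it ever trips the race depends entirely on timing; it just hammers unique.Make so that weak->strong conversions keep happening while GCs start and stop the world.

```go
package main

import (
	"fmt"
	"runtime"
	"unique"
)

func main() {
	// Repeatedly intern values that already exist in unique's canonical map,
	// so each Make performs a weak->strong conversion on the stored copy.
	for i := 0; i < 1_000_000; i++ {
		h := unique.Make(fmt.Sprintf("key-%d", i%1024))
		_ = h.Value()
		if i%10_000 == 0 {
			runtime.GC() // raise the odds of overlapping with mark termination
		}
	}
}
```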

This could, in theory, happen with write barriers too, which was the original reason the condition was broken. But it would take a lot of missed writes for it to happen: the write barrier buffers would have to be filled and then flushed, and then there would have to be enough room to fill them again before anything gets flushed to the global queue.

If my theory is right, then:

  • Adding an explicit check on work.full and restarting here should fully fix the problem.
  • Replacing shade(ptr) with getg().m.p.ptr().wbBuf.get1()[0] = uintptr(ptr) in https://go.dev/cl/610396 should paper over the problem, possibly enough to prevent it from happening.

If I'm right, this is kinda bad. It would be much better (but harder) to actually fix mark termination. And however we fix mark termination in the future, it's clear to me that we need to account for this weak-to-strong conversion explicitly. It may be that we simply have to block the conversion from happening at a certain point during mark termination. (This would also be a valid solution to the problem if my theory is right.)

mknyszek (Contributor) commented:

I discussed this a bit more with @aclements and I'd misunderstood exactly what was going wrong in #27993. I really should have just read Austin's very clear explanation in #27993 (comment).

But this is almost certainly an issue of the weak->strong conversion being able to generate new GC work at any time. I think the clearest fix would be to force weak->strong conversions to block during mark termination, preventing the creation of new GC work once we've entered mark termination. We can do this efficiently by setting a global flag before the ragged barrier. The weak->strong conversion will then check this flag (non-atomically; the ragged barrier ensures it is observed), park the current goroutine, and place it on a list. During the mark termination STW, the flag will be unset and all parked goroutines will be unparked.

I'm not certain what kind of adverse effects this will have, but assuming we want to backport this, it would be best to keep the solution as simple as possible. I would expect this to be efficient, even under heavy use of unique, since it guarantees mark termination progress. At that point in the GC cycle, that progress is important to latency, so I suspect this would work better than, for example, retrying mark termination if a weak->strong conversion is detected.
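
Here is a minimal sketch of the shape of that protocol, written with ordinary sync primitives instead of the runtime's internal park/ready machinery; every name below is invented for the sketch and is not the real runtime API.

```go
package gcsketch

import "sync"

// All names here are illustrative; the real runtime would use gopark/goready
// and a list of parked goroutines rather than a sync.Cond.
var (
	mu   sync.Mutex
	cond = sync.NewCond(&mu)

	// blockWeakToStrong is set before the ragged barrier and cleared during
	// the mark termination STW.
	blockWeakToStrong bool
)

// beginMarkTermination is called before the ragged barrier.
func beginMarkTermination() {
	mu.Lock()
	blockWeakToStrong = true
	mu.Unlock()
}

// endMarkTermination is called during the mark termination STW: clear the
// flag and wake every conversion that parked on it.
func endMarkTermination() {
	mu.Lock()
	blockWeakToStrong = false
	mu.Unlock()
	cond.Broadcast()
}

// weakToStrong models the conversion path: if mark termination is in
// progress, block until it finishes, rechecking the flag on wake in case
// another mark termination has already begun.
func weakToStrong(resolve func() uintptr) uintptr {
	mu.Lock()
	for blockWeakToStrong {
		cond.Wait()
	}
	mu.Unlock()
	return resolve()
}
```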

gopherbot (Contributor) commented:

Change https://go.dev/cl/623615 mentions this issue: runtime: prevent weak->strong conversions from creating more GC work

@mknyszek mknyszek moved this from Todo to In Progress in Go Compiler / Runtime Oct 30, 2024
mknyszek (Contributor) commented:

@gopherbot Please open a backport issue for Go 1.23. This bug causes random crashes with no workaround for anything using the unique package, which includes applications using net/http.

gopherbot (Contributor) commented:

Backport issue(s) opened: #70323 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot (Contributor) commented:

Change https://go.dev/cl/627615 mentions this issue: [release-branch.go1.23] runtime: prevent weak->strong conversions during mark termination

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. Critical A critical problem that affects the availability or correctness of production systems built using Go and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 19, 2024
gopherbot pushed a commit that referenced this issue Nov 20, 2024
[release-branch.go1.23] runtime: prevent weak->strong conversions during mark termination

Currently it's possible for weak->strong conversions to create more GC
work during mark termination. When a weak->strong conversion happens
during the mark phase, we need to mark the newly-strong pointer, since
it may now be the only pointer to that object. In other words, the
object could be white.

But queueing new white objects creates GC work, and if this happens
during mark termination, we could end up violating mark termination
invariants. In the parlance of the mark termination algorithm, the
weak->strong conversion is a non-monotonic source of GC work, unlike the
write barriers (which will eventually only see black objects).

This change fixes the problem by forcing weak->strong conversions to
block during mark termination. We can do this efficiently by setting a
global flag before the ragged barrier that is checked at each
weak->strong conversion. If the flag is set, then the conversions block.
The ragged barrier ensures that all Ps have observed the flag and that
any weak->strong conversions which completed before the ragged barrier
have their newly-minted strong pointers visible in GC work queues if
necessary. We later unset the flag and wake all the blocked goroutines
during the mark termination STW.

There are a few subtleties that we need to account for. For one, it's
possible that a goroutine which blocked in a weak->strong conversion
wakes up only to find it's mark termination time again, so we need to
recheck the global flag on wake. We should also stay non-preemptible
while performing the check, so that if the check *does* appear as true,
it cannot switch back to false while we're actively trying to block. If
it switches to false while we try to block, then we'll be stuck in the
queue until the following GC.

All-in-all, this CL is more complicated than I would have liked, but
it's the only idea so far that is clearly correct to me at a high level.

This change adds a test which is somewhat invasive as it manipulates
mark termination, but hopefully that infrastructure will be useful for
debugging, fixing, and regression testing mark termination whenever we
do fix it.

For #69803.
Fixes #70323.

Change-Id: Ie314e6fd357c9e2a07a9be21f217f75f7aba8c4a
Reviewed-on: https://go-review.googlesource.com/c/go/+/623615
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
(cherry picked from commit 80d306d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/627615
TryBot-Bypass: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
mpminardi pushed a commit to tailscale/go that referenced this issue Jan 28, 2025
jrguzman-ms pushed a commit to msft-mirror-aosp/platform.build.blueprint that referenced this issue Feb 12, 2025
Revert submission 3326317-remove_unique_list

Reason for revert: relanding after go 1.23.4 update that fixes golang/go#69803

Reverted changes: /q/submissionid:3326317-remove_unique_list

Change-Id: I65e797045af3b2489969c127f56ab04c53fc115b
jrguzman-ms pushed a commit to msft-mirror-aosp/platform.build.soong that referenced this issue Feb 12, 2025
Revert submission 3326317-remove_unique_list

Reason for revert: relanding after go 1.23.4 update that fixes golang/go#69803

Reverted changes: /q/submissionid:3326317-remove_unique_list

Change-Id: Ie96cb3aa775db360ec63e6643f980a9b9b749389