runtime: non-empty mark queue after concurrent mark #69803
This suggests there was an outstanding GC mark buf after we'd already passed through mark termination, meaning the mark termination algorithm failed to catch something. It could also mean that GC work was generated during mark termination in a way that we missed. However, this should already be caught here: https://cs.opensource.google/go/go/+/master:src/runtime/mgc.go;l=900;drc=123594d3863b0a4b9094a569957d1bd94ebe7512

We already know that mark termination is buggy (we very rarely miss work, see #27993), but it should never reach this back-up check, and I don't see how that would be possible. The mark buf pointer looks almost legit, but the 0x6 in the bottom bits is incredibly fishy, which makes me think that some kind of memory corruption occurred.

EDIT: The 0x6 is a red herring.
I have a hunch as to what's going on here. I think this may be a super subtle bug in https://go.dev/cl/610396 that may be related to our broken mark termination condition (#27993). In particular, the re-check we do to paper over the broken termination condition only checks that each P's work buf is empty after a write barrier buffer flush, but does not check the global list. What if the call to

This could, in theory, happen with write barriers too, which was the original reason the condition was broken. But it would take a lot of missed writes for it to happen: first, the write barrier buffers would have to be filled, then flushed, and then there would have to be enough space to fill the write barrier buffers again before anything gets flushed to the global queue.

If my theory is right, then:
If I'm right, this is kinda bad. It would be much better (but harder) to actually fix mark termination. And however we fix mark termination in the future, it's clear to me that we need to account for this weak-to-strong conversion explicitly. It may be that we simply have to block the conversion from happening at a certain point during mark termination. (This would also be a valid solution to the problem if my theory is right.)
I discussed this a bit more with @aclements and I'd misunderstood exactly what was going wrong in #27993. I really should have just read Austin's very clear explanation in #27993 (comment). But this is almost certainly an issue of the weak->strong conversion being able to generate new GC work at any time.

I think the clearest fix would be to force weak->strong conversions to block during mark termination, to prevent the creation of new GC work once we've entered mark termination. We can do this efficiently by setting a global flag before the ragged barrier. The weak->strong conversion will then check this flag (non-atomically; the ragged barrier ensures it is observed), park the current goroutine, and place it on a list. During the mark termination STW, the flag will be unset, and all parked goroutines will be unparked.

I'm not certain what kind of adverse effects this will have, but assuming we want to backport this, it would be best to keep the solution as simple as possible. I would expect this to be efficient, even under heavy use of
Change https://go.dev/cl/623615 mentions this issue:
@gopherbot Please open a backport issue for Go 1.23. This bug causes random crashes with no workaround for anything using the
Backport issue(s) opened: #70323 (for 1.23). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/627615 mentions this issue:
…ing mark termination

Currently it's possible for weak->strong conversions to create more GC work during mark termination. When a weak->strong conversion happens during the mark phase, we need to mark the newly-strong pointer, since it may now be the only pointer to that object. In other words, the object could be white. But queueing new white objects creates GC work, and if this happens during mark termination, we could end up violating mark termination invariants. In the parlance of the mark termination algorithm, the weak->strong conversion is a non-monotonic source of GC work, unlike the write barriers (which will eventually only see black objects).

This change fixes the problem by forcing weak->strong conversions to block during mark termination. We can do this efficiently by setting a global flag before the ragged barrier that is checked at each weak->strong conversion. If the flag is set, then the conversions block. The ragged barrier ensures that all Ps have observed the flag and that any weak->strong conversions which completed before the ragged barrier have their newly-minted strong pointers visible in GC work queues if necessary. We later unset the flag and wake all the blocked goroutines during the mark termination STW.

There are a few subtleties that we need to account for. For one, it's possible that a goroutine which blocked in a weak->strong conversion wakes up only to find it's mark termination time again, so we need to recheck the global flag on wake. We should also stay non-preemptible while performing the check, so that if the check *does* appear as true, it cannot switch back to false while we're actively trying to block. If it switches to false while we try to block, then we'll be stuck in the queue until the following GC.

All in all, this CL is more complicated than I would have liked, but it's the only idea so far that is clearly correct to me at a high level.

This change adds a test which is somewhat invasive as it manipulates mark termination, but hopefully that infrastructure will be useful for debugging, fixing, and regression testing mark termination whenever we do fix it.

For #69803.
Fixes #70323.

Change-Id: Ie314e6fd357c9e2a07a9be21f217f75f7aba8c4a
Reviewed-on: https://go-review.googlesource.com/c/go/+/623615
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
(cherry picked from commit 80d306d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/627615
TryBot-Bypass: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Revert submission 3326317-remove_unique_list

Reason for revert: relanding after the Go 1.23.4 update that fixes golang/go#69803
Reverted changes: /q/submissionid:3326317-remove_unique_list
Change-Id: I65e797045af3b2489969c127f56ab04c53fc115b

Revert submission 3326317-remove_unique_list

Reason for revert: relanding after the Go 1.23.4 update that fixes golang/go#69803
Reverted changes: /q/submissionid:3326317-remove_unique_list
Change-Id: Ie96cb3aa775db360ec63e6643f980a9b9b749389
Test failure on https://go.dev/cl/617376/2, though I don't think that CL was responsible. The only builder to fail was gotip-linux-amd64-aliastypeparams.
Full log at: https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8734717093063296065/+/u/step/11/log/2?format=raw