runtime: race between stack shrinking and channel send/recv leads to bad sudog values [1.15 backport] #40643

gopherbot · 2020-08-07T20:39:45Z

@mknyszek requested issue #40641 to be considered for backport to the next 1.15 minor release.

This should be fixed for Go 1.14 and Go 1.15. It's a bug that was introduced in Go 1.14, and may cause random and unavoidable crashes at any point in time. There may not be enough time to fix this for 1.15 (the failure is very rare, but we've seen it internally), and if not, it should definitely go in a point release.

@gopherbot please open a backport issue for 1.14.

gopherbot · 2020-09-21T15:52:26Z

Change https://golang.org/cl/256300 mentions this issue: [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window

mknyszek · 2020-09-30T20:19:49Z

@dmitshur

Could you please take a look and see if the cherry pick should be approved? I think it should, since it's actively causing relatively rare crashes with no workaround. Also, could you submit https://golang.org/cl/256300 too, if it's approved? I've got 2 +2s on it now.

Thank you!

dmitshur · 2020-10-01T17:39:25Z

@mknyszek Thanks for providing a rationale for this backport and for getting the CLs into a ready state (here, and in #40642). We will discuss it in a release meeting soon, and take care of submission.

dmitshur · 2020-10-07T20:33:54Z

Approving per discussion in a release meeting. This backport applies to both 1.15 (this issue) and 1.14 (#40642).

gopherbot · 2020-10-07T20:34:38Z

Closed by merging bf79f91 to release-branch.go1.15.

…ckChans race window

Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gopark. The trouble is that the unlockf is called *after* the G's status is changed, and the G's status is what is used by a concurrent mark worker (calling suspendG) to determine that a G has successfully been suspended. In this window between the status change and unlockf, the mark worker could try to shrink the G's stack, and in particular observe that activeStackChans is false. This observation will cause the mark worker to *not* synchronize with concurrent channel operations when it should, and so updating pointers in the sudog for the blocked goroutine (which may point to the goroutine's stack) races with channel operations which may also manipulate the pointer (read it, dereference it, update it, etc.).

Fix the problem by adding a new atomically-updated flag to the g struct called parkingOnChan, which is non-zero in the race window above. Then, in isShrinkStackSafe, check if parkingOnChan is zero. The race is resolved like so:

* Blocking G sets parkingOnChan, then changes status in gopark.
* Mark worker successfully suspends blocking G.
* If the mark worker observes parkingOnChan is non-zero when checking isShrinkStackSafe, then it's not safe to shrink (we're in the race window).
* If the mark worker observes parkingOnChan as zero, then because the mark worker observed the G status change, it can be sure that gopark's unlockf completed, and gp.activeStackChans will be correct.

The risk of this change is low, since although it reduces the number of places that stack shrinking is allowed, the window here is incredibly small. Essentially, every place that it might crash now is replaced with no shrink.

This change adds a test, but the race window is so small that it's hard to trigger without a well-placed sleep in park_m. Also, this change fixes stackGrowRecursive in proc_test.go to actually allocate a 128-byte stack frame. It turns out the compiler was destructuring the "pad" field and only allocating one uint64 on the stack.

For #40641.
Fixes #40643.

Change-Id: I7dfbe7d460f6972b8956116b137bc13bc24464e8
Reviewed-on: https://go-review.googlesource.com/c/go/+/247050
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit eb3c6a9)
Reviewed-on: https://go-review.googlesource.com/c/go/+/256300
Reviewed-by: Austin Clements <austin@google.com>


gopherbot added the CherryPickCandidate (Used during the release process for point releases) label Aug 7, 2020

gopherbot mentioned this issue Aug 7, 2020

runtime: race between stack shrinking and channel send/recv leads to bad sudog values #40641

Closed

gopherbot added this to the Go1.15.1 milestone Aug 10, 2020

dmitshur modified the milestones: Go1.15.1, Go1.15.2 Sep 1, 2020

dmitshur modified the milestones: Go1.15.2, Go1.15.3 Sep 9, 2020

mknyszek mentioned this issue Sep 30, 2020

runtime: race between stack shrinking and channel send/recv leads to bad sudog values [1.14 backport] #40642

Closed

dmitshur added the CherryPickApproved (Used during the release process for point releases) label and removed the CherryPickCandidate (Used during the release process for point releases) label Oct 7, 2020

gopherbot closed this as completed Oct 7, 2020

navytux mentioned this issue Oct 13, 2020

runtime: panic: non-empty mark queue after concurrent mark (Go1.14, Go1.15) #41303

Closed

johnsonjh mentioned this issue Oct 22, 2020

Considering upgrading minimum required Golang to 1.15.3. pkt-cash/pktd#85

Closed

golang locked and limited conversation to collaborators Oct 7, 2021

gopherbot added the FrozenDueToAge label Oct 7, 2021
