runtime: panic: non-empty mark queue after concurrent mark (Go1.14, Go1.15) #41303
( wcfs go unit tests are just regular go unit tests - they do not play with virtual memory at all, and, to my knowledge, there is no Cgo involved. At that level FUSE is also not used. ) |
Unfortunately our test system hit a similar crash with go1.14.9:
full output
Unfortunately I don't have a core dump this time. |
go1.14.9 crash was on: go env, go version, uname, lsb_release
|
(again with go1.14.9)
full output
@aclements, @mknyszek, do you think this is related to #40641 / #40642 / #40643 ? |
I'm sorry to say that the GC crash also happened with
full output
In other words #40641 / #40642 / #40643 seem to be unrelated. /cc @aclements, @mknyszek |
Sadly, I could also trigger the same crash on 1.15.3 (i.e. go1.15 with #40643 included):
full output
go version, go env, uname, ...
This time there is a core file. |
Just in case, have you run this test in the race detector? |
@networkimprov, here you are (these are normal and race runs; there is no GC failure): .../wcfs$ DBTail_SEED=1602769537289632682 go test -v
=== RUN TestZBlk
2020/10/15 20:39:04 zodb: FIXME: open testdata/zblk.fs: raw cache is not ready for invalidations -> NoCache forced
--- PASS: TestZBlk (0.03s)
=== RUN TestΔBTail
2020/10/15 20:39:05 zodb: FIXME: open /tmp/δBTail217888886/1.fs: raw cache is not ready for invalidations -> NoCache forced
--- PASS: TestΔBTail (2.72s)
=== RUN TestΔBTreeAllStructs
δbtail_test.go:1123: # maxdepth=2 maxsplit=1 nkeys=5 n=10 seed=1602769537289632682
2020/10/15 20:39:08 zodb: FIXME: open /tmp/δBTail128030557/1.fs: raw cache is not ready for invalidations -> NoCache forced
--- PASS: TestΔBTreeAllStructs (46.42s)
=== RUN TestIntSets
--- PASS: TestIntSets (0.00s)
=== RUN TestKVDiff
--- PASS: TestKVDiff (0.00s)
=== RUN TestKVTxt
--- PASS: TestKVTxt (0.00s)
PASS
ok  	lab.nexedi.com/nexedi/wendelin.core/wcfs	49.175s
.../wcfs$ DBTail_SEED=1602769537289632682 go test -v -race
=== RUN TestZBlk
2020/10/15 20:40:12 zodb: FIXME: open testdata/zblk.fs: raw cache is not ready for invalidations -> NoCache forced
--- PASS: TestZBlk (0.10s)
=== RUN TestΔBTail
2020/10/15 20:40:13 zodb: FIXME: open /tmp/δBTail980323143/1.fs: raw cache is not ready for invalidations -> NoCache forced
--- PASS: TestΔBTail (13.17s)
=== RUN TestΔBTreeAllStructs
δbtail_test.go:1123: # maxdepth=2 maxsplit=1 nkeys=5 n=10 seed=1602769537289632682
2020/10/15 20:40:27 zodb: FIXME: open /tmp/δBTail188139514/1.fs: raw cache is not ready for invalidations -> NoCache forced
--- PASS: TestΔBTreeAllStructs (220.55s)
=== RUN TestIntSets
--- PASS: TestIntSets (0.00s)
=== RUN TestKVDiff
--- PASS: TestKVDiff (0.00s)
=== RUN TestKVTxt
--- PASS: TestKVTxt (0.00s)
PASS
ok lab.nexedi.com/nexedi/wendelin.core/wcfs 233.857s |
I suspect the bug is related to concurrency, and in particular sharp spikes in surrounding OS load tend to increase the probability for it to happen. This is likely because internal Go scheduling is here interdependent with other processes: the test spawns another python process and communicates with it in a synchronous manner. With this in mind I've found a way to trigger the GC crash more quickly by running several instances of the test simultaneously. Reproduction script: gogccrash. |
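For illustration, a rough Go rendition of that reproduction idea (the actual gogccrash script is not shown in this thread; the process count below is arbitrary): run several go test instances concurrently and watch for a failing one.

package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	const n = 8 // number of parallel `go test` runs; arbitrary
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// -count=1 defeats test caching so every instance really runs.
			out, err := exec.Command("go", "test", "-count=1", ".").CombinedOutput()
			if err != nil {
				log.Printf("run %d failed (check for the GC panic):\n%s", id, out)
			}
		}(i)
	}
	wg.Wait()
}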
I've dug into this a bit:
I'm currently suspecting the only unsafe place in my code to be the source of this GC crash. My guess is that rematerializing the object queues more work for the GC to mark, but if that catches the GC at the time when it thinks it has finished the marking, that results in the crash. But I have not looked inside for details yet. I'm pausing my analysis here. |
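What "rematerialize" boils down to, as a minimal sketch (the real code is quoted later in this thread): the pointer words of an interface value are stashed as plain uintptrs, which the GC does not trace, and they are later written back into a live interface value.

package main

import "unsafe"

type iface struct {
	typ  uintptr // type word of an interface value
	data uintptr // data word, held as an integer the GC will not trace
}

// rematerialize rebuilds an interface{} from the saved words. If this runs
// after the GC already believes marking is finished, the resurrected
// pointer was never marked - the hypothesis behind the panic above.
func rematerialize(saved iface) (obj interface{}) {
	*(*iface)(unsafe.Pointer(&obj)) = saved // unsupported use of unsafe
	return obj
}

func main() {
	var s interface{} = "demo"
	saved := *(*iface)(unsafe.Pointer(&s))
	_ = rematerialize(saved)
}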
That's definitely an unsupported use of unsafe. |
@randall77, thanks for feedback. To clarify about "rematerialize": when this happens we know for sure that the object is still alive, because there is synchronization on reference state in between the finalizer and Get. As this issue shows, some synchronization also needs to be done with respect to the garbage collector as well. (*) yes, stacks can be moved, but on-heap objects are not moved, and weak.Ref is supposed to work for on-heap objects only. |
Also, maybe I'm missing something, but I don't see any other way to implement weak references. |
Right. This code will force heap allocation because, among other things, passing anything to runtime.SetFinalizer does.
I don't either. Go is not designed to support weak references. Just imagining one possible bug here. See the comment at line 119; it wonders whether we need a write barrier. This code looks like it doesn't, as the write at line 121 is to the stack, not the heap, and stack writes don't need write barriers. But if |
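For a concrete view of the forced heap allocation mentioned above, a small probe (hypothetical, not from this thread): building it with go build -gcflags=-m makes the compiler report that v escapes to the heap, because runtime.SetFinalizer takes its argument as interface{}.

package main

import "runtime"

type T struct{ x int }

func main() {
	v := &T{x: 1}
	// Passing v to SetFinalizer forces it onto the heap, so later raw
	// writes of such pointers are exactly the case where a write barrier
	// would matter.
	runtime.SetFinalizer(v, func(*T) {})
	runtime.KeepAlive(v) // keep v live until here
}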
Weak references and finalizers have the same power. To implement weak references with finalizers, use an intermediate object. The catch is that you need explicit strong pointers as well as explicit weak pointers.

// Package weak manages weak references to pointers.
// In order to use this with a particular pointer, you must use
// strong pointers, defined in this package.
// You may also use weak pointers.
// If all the strong pointers to some value are garbage collected,
// then the value may be collected, even if weak pointers still exist.
package weak
import "runtime"
// Strong is a strong pointer to some object.
// To fetch the pointer, use the Get method.
type Strong struct {
p *intermediate
}
// Get returns the value to which s points.
func (s *Strong) Get() interface{} {
return s.p.val
}
// Weak returns a weak reference to the value to which Strong points.
func (s *Strong) Weak() *Weak {
return &Weak{s.p}
}
// clear is a finalizer for s.
func (s *Strong) clear() {
s.p.val = nil
}
// Create a strong pointer to an object.
func MakeStrong(val interface{}) *Strong {
r := &Strong{&intermediate{val}}
runtime.SetFinalizer(r, func(s *Strong) { s.clear() })
return r
}
// Weak is a weak reference to some value.
type Weak struct {
p *intermediate
}
// Get returns the value to which w points.
// If there are no remaining Strong pointers to the value,
// Get may return nil.
func (w *Weak) Get() interface{} {
return w.p.val
}
// intermediate is used to implement weak references.
type intermediate struct {
val interface{}
} |
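A hypothetical usage sketch of the Strong/Weak scheme above (the import path is invented; note that finalizers run asynchronously, so the GC calls and sleep only make collection likely in a demo, they do not guarantee it):

package main

import (
	"fmt"
	"runtime"
	"time"

	"example.com/weak" // assumed import path for the package sketched above
)

func main() {
	s := weak.MakeStrong("hello")
	w := s.Weak()
	fmt.Println(w.Get()) // "hello": a Strong pointer still exists

	s = nil      // drop the only Strong pointer
	runtime.GC() // queue the Strong's finalizer
	time.Sleep(10 * time.Millisecond)
	runtime.GC()

	fmt.Println(w.Get()) // likely nil: the finalizer cleared the value
}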
@randall77, thanks for feedback and for commenting on the potential trickiness regarding the write barrier.
Explicitly using Strong is indeed the price for not having to use unsafe. That indeed works; thanks a lot, again, for the example. However the price might be too high: it requires all users of an API to use Strong instead of regular Go pointers. So, still, is there maybe a way to synchronize with the garbage collector, and notify it properly about a rematerialized pointer so that it does not see such rematerialization as a marking-invariant breakage? |
Well, technically all that is required is that there be at least one live Strong pointer to the value. As @randall77 said, your code is an unsupported use of unsafe. If you want to be able to safely rematerialize a pointer other than through the mechanisms documented at https://golang.org/pkg/unsafe, that would be a language change proposal that should go through the proposal process. But I think it is extremely unlikely that any such proposal would be accepted. |
@ianlancetaylor, thanks for feedback.
Unfortunately it is not the case: the objects - that the program works with - are reached via regular Go pointers, not via Strong wrappers. The bug I'm talking about here is of exactly the same nature that Connection.objtab is trying to avoid via using weak pointers in the first place: if for an object A the entry in Connection.objtab[A.oid] is cleared while the program still holds a plain pointer to A's in-RAM object, then A is simultaneously alive in RAM and absent from objtab.
Now if another object B2 references, in the database on disk, object A by A's OID, and that object B2 is accessed in the program, then when the program tries to access/create the live object for A in RAM via the B2->A database link, it will look into Connection.objtab[A.oid], see that the weak pointer there is nil, and create another in-RAM object for A. This breaks the isomorphism between in-database and in-RAM objects and so eventually leads to database corruption. The only solution I see is either 1) to use Strong everywhere, or 2) to somehow synchronize with the garbage collector.
With 1, by "everywhere" I really mean everywhere - for every pointer that potentially originates from a Strong pointer. But then Strongs would just replace all regular Go pointers...
I see... |
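For context, a much-simplified, hypothetical sketch of the objtab pattern being described (Oid, Object and loadFromDB are invented; Ref/NewRef stand for the weak reference implementation under discussion, quoted later in this thread):

import "sync"

type Oid uint64

type Object struct {
	oid Oid
	// ... in-RAM state loaded from the database
}

type Connection struct {
	mu     sync.Mutex
	objtab map[Oid]*Ref // weak refs keep in-RAM objects unique per OID
}

// get returns the in-RAM object for oid: the existing one if its weak
// reference is still alive, a freshly materialized one otherwise.
func (c *Connection) get(oid Oid) *Object {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ref := c.objtab[oid]; ref != nil {
		if obj, ok := ref.Get().(*Object); ok && obj != nil {
			return obj // B2->A resolves to the same in-RAM A
		}
	}
	obj := loadFromDB(oid) // hypothetical database access
	c.objtab[oid] = NewRef(obj)
	return obj
}

// loadFromDB is a stub standing in for real database access.
func loadFromDB(oid Oid) *Object { return &Object{oid: oid} }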
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
I created a working weak reference; this implementation is buggy IMHO. From the source code:

// RELEASE
w.mu.Lock()
if w.state == objGot {
w.state = objLive // STORE #2
runtime.SetFinalizer(obj, release)
} else {
w.state = objReleased // STORE #2
}
// Uses atomic.AddInt32, which has no guarantee of memory fence!
w.mu.Unlock() // STORE #3
// GET
func (w *Ref) Get() (obj interface{}) {
w.mu.Lock()
// Note: We can see the stores in two orders:
// STORE #2, STORE #3 <-- This is what you assume to happen.
// STORE #3, STORE #2 <-- This is what can happen, if I get the memory model right.
//
// In other words: At this point, we may encounter that STORE #3 (Unlock) is visible.
// However, Go does not guarantee that STORE #2 is visible yet.
// In the bad case, STORE #2 is not yet visible, which means that the following condition is true
if w.state != objReleased {
w.state = objGot
// A sleep of 100ns does not change anything, it is pointless.
// obj reference is already at the stack and we already entered the worst case.
time.Sleep(100*time.Nanosecond)
// recreate interface{} from saved words.
// XXX do writes as pointers so that compiler emits write barriers to notify GC?
// --> write barriers do not help here, what is needed is a load barrier behind the Lock()!
i := (*iface)(unsafe.Pointer(&obj))
*i = w.iface
// This has potentially restored an invalid pointer that can crash the app!
}
w.mu.Unlock() // no guarantee of memory fence
return obj
}

At least that is what I read from various specs, unless I missed something: |
Hello @xeus2001, thanks for feedback. Though that's a serious statement. What you say, in essence, is that if one goroutine does

x = ...     // STORE #2
mu.Unlock() // STORE #3 in your speak

and the second goroutine does

mu.Lock()
read x // LOAD

then the second goroutine might observe the old value of x.

If that would be true, then all usage of sync.Mutex would be broken. I've also checked the memory model, and it explicitly says that a mutex unlock happens before the following lock:

---- 8< ----

var l sync.Mutex
var a string

func f() {
	a = "hello, world"
	l.Unlock()
}

func main() {
	l.Lock()
	go f()
	l.Lock()
	print(a)
}

is guaranteed to print "hello, world". The first call to l.Unlock() (in f) happens before the second call to l.Lock() (in main) returns, which happens before the print.

(emphasis mine)

My implementation is indeed buggy, but the only bug, in my view, is the one that @MagicalTux pointed out in #43615 (comment).

Kirill |
You are right about the memory model. I created a test for this and then added your weak-ref code, but it seems to work perfectly fine? Are you sure that the problem is really the weak reference?

package main
import (
"fmt"
"math/rand"
"runtime"
"runtime/debug"
"sync"
"sync/atomic"
"time"
"unsafe"
)
const (
FALSE uint32 = 0
TRUE uint32 = 100
)
type Demo struct {
mu sync.Mutex
state uint32
id int
}
type iface struct {
typ uintptr // type
data uintptr // data
}
// weakRefState represents current state of an object Ref points to.
type weakRefState int32
const (
objGot weakRefState = +1 // Ref.Get returned !nil
objLive weakRefState = 0 // object is alive, Get did not run yet in this GC cycle
objReleased weakRefState = -1 // the finalizer marked object as released
)
// Ref is a weak reference.
//
// Create one with NewRef and retrieve referenced object with Get.
//
// There must be no more than 1 weak reference to any object.
// Weak references must not be attached to an object on which runtime.SetFinalizer is also used.
// Weak references must not be copied.
type Ref struct {
iface
// XXX try to do without mutex and only with atomics
mu sync.Mutex
state weakRefState
}
func main() {
// Test memory model.
demo := &Demo{}
demo.mu = sync.Mutex{}
fmt.Println("Start")
wg := sync.WaitGroup{}
const threads = 100
const loops = 100000
wg.Add(threads)
for i := 0; i < threads; i++ {
go func() {
for j := 0; j < loops; j++ {
demo.mu.Lock()
demo.state++
demo.mu.Unlock()
}
wg.Done()
}()
}
wg.Wait()
fmt.Println("End Memory Model Test -----------------")
result := atomic.LoadUint32(&demo.state)
if result == (threads * loops) {
fmt.Println("OK: ", result)
} else {
fmt.Println("Wrong: ", result)
}
// Test the weak reference.
// Force GCs
debug.SetGCPercent(0)
// Prepare concurrent access to weak refs
wg.Add(100)
refs := make([]*Ref, 100)
collected := [100]uint32{}
for gi := 0; gi < 100; gi++ {
s := &Demo{}
s.id = gi
refs[gi] = NewRef(s)
collected[gi] = FALSE
fmt.Print(" ", s.id)
}
fmt.Println(" created!")
var x = make([]int, 32768)
for g := 0; g < 100; g++ {
go func(id int) {
for i := 0; i < 10000; i++ {
for j := 0; j < 100; j++ {
num := refs[j].Get()
if num == nil && atomic.CompareAndSwapUint32(&collected[j], FALSE, TRUE) {
fmt.Println("Collected ", j)
}
// Increase from 10 to 50 to increase collection rate.
if rand.Intn(100) < 10 {
runtime.Gosched()
}
}
x2 := x
x2[i%len(x2)]++
}
wg.Done()
fmt.Println("End ", id)
}(g)
}
go func() {
for true {
for i := 0; i < 10000; i++ {
x := make([]int, 32768)
for y := 0; y < 1024; y++ {
x[y] = 32768 - y
}
runtime.Gosched()
}
PrintMemUsage()
time.Sleep(500 * time.Millisecond)
}
}()
wg.Wait()
fmt.Println("End 2")
time.Sleep(2 * time.Second)
for j := 0; j < 100; j++ {
ref := refs[j].Get()
fmt.Println("Collected: ", j, collected[j], ref)
}
}
// NewRef creates new weak reference pointing to obj.
//
// TODO + onrelease callback?
func NewRef(obj interface{}) *Ref {
// since starting from ~ Go1.4 the GC is precise, we can save interface
// pointers to uintptr and that won't prevent GC from garbage
// collecting the object.
w := &Ref{
iface: *(*iface)(unsafe.Pointer(&obj)),
state: objLive,
}
var release func(interface{})
release = func(obj interface{}) {
ifobj := *(*iface)(unsafe.Pointer(&obj))
if w.iface != ifobj {
panic(fmt.Sprintf("weak: release: w.iface != obj; w.iface=%x obj=%x", w.iface, ifobj))
}
// GC decided that the object is no longer reachable and
// scheduled us to run as finalizer. During the time till we
// actually run, Ref.Get might have been come to run and
// "rematerializing" the object for use. Check if we do not
// race with any Get in progress, and reschedule us to retry at
// next GC if we do.
w.mu.Lock()
if w.state == objGot {
w.state = objLive
runtime.SetFinalizer(obj, release)
// Uncomment this and you get spammed by this message!
//fmt.Println("Refresh a reference")
} else {
w.state = objReleased
fmt.Println("Collected a reference")
}
w.mu.Unlock()
}
runtime.SetFinalizer(obj, release)
return w
}
// Get returns object pointed to by this weak reference.
//
// If original object is still alive - it is returned.
// If not - nil is returned.
func (w *Ref) Get() (obj interface{}) {
w.mu.Lock()
if w.state != objReleased {
w.state = objGot
//time.Sleep(100 * time.Nanosecond)
// recreate interface{} from saved words.
// XXX do writes as pointers so that compiler emits write barriers to notify GC?
i := (*iface)(unsafe.Pointer(&obj))
*i = w.iface
}
w.mu.Unlock()
return obj
}
func PrintMemUsage() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
// For info on each, see: https://golang.org/pkg/runtime/#MemStats
fmt.Printf("Alloc = %v MiB", bToMb(m.Alloc))
fmt.Printf("\tTotalAlloc = %v MiB", bToMb(m.TotalAlloc))
fmt.Printf("\tSys = %v MiB", bToMb(m.Sys))
fmt.Printf("\tNumGC = %v\n", m.NumGC)
}
func bToMb(b uint64) uint64 {
return b / 1024 / 1024
} |
@xeus2001 - yes, I'm sure the problem is in my weak reference implementation, and in particular in the unsafe rematerialization done by Get. |
@navytux

package main
import (
"fmt"
"unsafe"
)
type IFace struct {
typ uintptr // type
data uintptr // data
}
type R struct {
IFace
}
func (w *R) Get() (obj interface{}) {
i := (*IFace)(unsafe.Pointer(&obj))
// MOVQ 8(AX), BX
// MOVQ (AX), CX
// MOVQ CX, main.obj(SP)
// MOVQ BX, main.obj+8(SP) */
*i = w.IFace
// MOVQ main.obj(SP), AX
// MOVQ 16(SP), BP
return obj
}
func Test() (obj interface{}) {
obj = "bar"
// LEAQ type:string(SB), AX
// LEAQ main..stmp_0(SB), BX
return obj
}
func main() {
var s interface{} = "foo"
// LEAQ type:string(SB), DX
// MOVQ DX, main.s+56(SP)
// LEAQ main..stmp_1(SB), R8
// MOVQ R8, main.s+64(SP)
r := R{}
// MOVUPS X15, main.r+40(SP)
r.IFace = *((*IFace)(unsafe.Pointer(&s)))
// MOVQ main.s+64(SP), R8
// MOVQ main.s+56(SP), R9
// MOVQ R9, main.r+40(SP)
// MOVQ R8, main.r+48(SP)
result := r.Get()
fmt.Println(result)
result = Test()
fmt.Println(result)
}

There is no difference: if your code breaks the GC, then the normal (safe) code would too. I think the reason for the problem lies somewhere else; maybe it just misleads us to the weak reference. I have seen so far no strong proof that the weak reference is the reason, and my assumption about a memory barrier issue has proven wrong. From my side, both our implementations of weak references seem to be pretty safe (IMHO). |
Hello up there.
Today, while working on wendelin.core I've got
panic: non-empty mark queue after concurrent mark
several times while running unit tests. It is hard for me to find time to try to reduce the problem to a minimal example, but I just cannot ignore such a garbage-collector kind of bug, so I decided to report at least what I can. Please find details about the GC crash below. I also attach the dumped core in case it could be useful to analyse what was going on (no confidential data in there). Thanks beforehand,
Kirill
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

No data

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
wendelin.core checked out at revision=cc216b8c:

cd wcfs
export GOTRACEBACK=crash
ulimit -c unlimited
for i in `seq 100`; do go test -v || break ; done
What did you expect to see?
No Go runtime crashes
What did you see instead?
Go runtime crashed with panic: non-empty mark queue after concurrent mark
full output
core.zip