-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
syscall: document or fix liveness during Syscall #13372
Comments
/cc @aclements for triage (likely a dup?) |
It's quite possible this is a dup of #13287, but I'm going to leave it open for now. @tkadauke, it would be great if you could share (possibly reduced) code to reproduce this. We have a few free list corruption reports, but yours seems by far the most reproducible and most likely to lead to a fix. Also, what OS/architecture is this? If it's easy, I'd be curious if you can reproduce this with the latest code from the master branch. I don't think we've fixed anything that would obviously have caused this, but if it is fixed on master, we can bisect it and probably narrow right in on the problem without you having to share your code. |
@aclements, this is a 40-core x86_64. I could reproduce it on a 32-core x84_64 (to exclude a hardware problem). I might be able to isolate it in a way so I can share the code with you (but not publicly) it that's ok with you. If that doesn't work, I can look into compiling against master. |
That would be okay. I have access to many-core machines if that's necessary to reproduce it. I assume the OS is Linux? |
Yes, a slightly patched Linux 3.10.72 |
@aclements: I just managed to isolate the problem somewhat and was able to trigger a segfault after a few seconds. GDB says:
So the same symptom. Where can I send the source archive? |
Austin is austin@google.com. Thanks. |
Sent an email with instructions to Austin. Hope you guys can figure out what's going on! :) |
Thanks for the code! It takes several minutes, but I can reproduce the mallocgc failure exactly like you saw it. I haven't been able to reproduce the sweep panic yet. |
Awesome! For me, it alternates between the two outcomes (SIGSEGV in malloc.go/panic in the GC) but the root cause in both cases is the corrupted free list. --Thomas Sent from my iPhone
|
I can also reproduce this at the head of master. |
So, from a first glance, do you think it’s a bug in the Go runtime or something that I’m doing wrong? I have ways to suppress this bug for a couple of hours, but I won’t get my program stable without a proper fix. —Thomas
|
I think this is a bug in the Go runtime. I skimmed over your unsafe code and none of it jumped out at me as likely to cause this, mostly because you usually don't write through unsafe pointers and the things you read through unsafe pointers don't themselves contain pointers*. It also seems unlikely you'd corrupt the free list in this particular way; you'd have to either back up one byte from an allocation and scribble a 1 over it or skip forward just the right number of bytes from the end of an allocation without mangling the intervening bytes. * That said, some of it is a bit scary. For example, your BinReader does nothing to keep the backing store alive, but all of the call sites happen to. |
It's hard to reproduce this but I have run into this several times today. |
@tkadauke Thanks very much for sending the simplified program. Austin and I were able to reproduce it on multiple machines. I'm sorry to say that it's not a runtime bug, at least not directly. Your program wraps sys_poll like this:
The problem is that nothing refers to fds after the syscall.Syscall arguments are prepared, so the compiler does not record fds as live during the system call. While the system call is blocked, the garbage collector can come along and reclaim the memory pointed at by fds. Then when the kernel is ready to return a result, it writes POLLIN (0x0001) into, say, fds[0].revents, which ends up overwriting the top 16 bits of a 64-bit word with 0x0001, in a block of memory that has already been freed and is sitting on a free list. That is how we end up in the crash with what looks like a valid pointer except for the 0x0001000000000000 high bit. If you change the code to use fds again after the syscall.Syscall, it will be marked live for the duration of the call and will therefore stay valid for the kernel to write to. One way to do this is to add something like this to your package:
and then after the syscall.Syscall, add a call
to keep fds live during the syscall.Syscall. This is basically what package syscall itself does, although there func use is implemented in assembly. With a change like this, Austin and I were both able to run the sample program for over an hour, when it used to crash after a few minutes. I've been thinking about what we should do in general here. At the least we need better documentation about how to use syscall.Syscall, but perhaps we can make this kind of thing less necessary. I'm going to leave this issue to track that work. Thanks very much for the report. I'm sad it wasn't (or doesn't appear to be) the runtime bug we're chasing in the other reports, but it's still certainly something we need to address. |
@rsc No, it's not the same program. And it's unlikely that I'll be able to share to source code. But sure I'll file a separate issue with the (probably a partial) stacktrace. |
It's tempting to think of either a "go vet" check, or something that the compiler might do automatically, since a pattern along the lines of "go pointer -> unsafe.Pointer -> uintptr -> cgo call arg & go pointer is now not live" is a red flag with a siren. I'd want a tool like that to make it easier to deal with bugs like this, might as well cut out the middle man and put it into go vet or the compiler itself. |
@dr2chase, FWIW, cgo gets this right, as do all the typed system call
wrappers. It's only syscall.Syscall, with its untyped arguments, that has
the problem. But I agree we might be able to just special case this in the
compiler and get it right automatically. That's what I want to figure out.
|
@rsc, thank you so much for your looking into this. Your findings are very fascinating and line up with the weird crashing behavior that less traced CPUs trigger crashes more often. This is because there are more calls to poll then, which means more opportunities for the GC to collect |
CL https://golang.org/cl/18584 mentions this issue. |
CL https://golang.org/cl/18640 mentions this issue. |
Add docs for valid uses of Pointer. Then document change made for #13372 in CL 18584. Fixes #8994. Change-Id: Ifba71e5aeafd11f684aed0b7ddacf3c8ec07c580 Reviewed-on: https://go-review.googlesource.com/18640 Reviewed-by: Alan Donovan <adonovan@google.com> Reviewed-by: Robert Griesemer <gri@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Rob Pike <r@golang.org> Reviewed-by: Austin Clements <austin@google.com> Reviewed-by: Rick Hudson <rlh@golang.org>
CL https://golang.org/cl/23123 mentions this issue. |
CL https://golang.org/cl/23124 mentions this issue. |
Updates golang/go#13372. Change-Id: Idfd5001f4ad7bc80a4283df1c310f97612eba85c Reviewed-on: https://go-review.googlesource.com/23124 Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Updates #13372. Change-Id: Id2402a781474e9d0bb0901c5844adbd899f76cbd Reviewed-on: https://go-review.googlesource.com/23123 Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
CL https://golang.org/cl/24750 mentions this issue. |
CL https://golang.org/cl/24751 mentions this issue. |
Updates golang/go#13372. Change-Id: I623de97eb19880356148cbcb7d17759df82684aa Reviewed-on: https://go-review.googlesource.com/24751 Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Updates #13372. Change-Id: If383c14af14839a303425ba7b80b97e35ca9b698 Reviewed-on: https://go-review.googlesource.com/24750 Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
I am working on a new Go tool that reads events via the Linux perf API from a bunch of mmap'ed files and then extracts useful information about those events. The fewer processors I inspect at the same time, the sooner I run into an issue with the Go garbage collector:
It's either this backtrace (GC has trouble sweeping because the free list is corrupt), or a SIGSEGV (
runtime/malloc.go:585
crashes because it picked a corrupted entry from the free list).This is 100% reproducible for me (Sorry, can't share the source code just yet). In the current configuration that I have running, it takes between 1 and 60 seconds to crash, with or without GDB.
Please note that the corrupt pointer looks like an intact pointer & 0x1000000000000.
I get the crash in Go 1.5 and Go 1.5.1.
The text was updated successfully, but these errors were encountered: