runtime: linux/arm64 crash in runtime.sigtrampgo #32912
Comments
Related to #32738 ? |
No. That is for ARM32. |
@jing-rui thanks for the report. I'm a little confused by the information. You mentioned that at the fault, R0=g=0x4000000480, and R1 is g.m. But R1 is 13, whereas when you print *g its m field is 0x4000034000. Are both printed at the same location? Also, sigtrampgo runs in the signal handler, so a signal must have been delivered first. Do you happen to know what signal it is, and at which PC it is delivered (if it is a synchronous signal)? Is it possible to grab a stack trace from sigtrampgo where it faults? |
@cherrymui We could see this kind of behavior--a discrepancy between what gdb sees in memory and what it sees in the registers--if the setting of the |
Hi @jing-rui , do you have some test cases to reproduce this issue ? And can you share the docker container information with us? Thanks. |
////// More tips on the containerd-shim core with go1.13:
The address of g.m should be g + 48 bytes, so &g.m == 0x40000004b0.
(gdb) x /2x 0x40000004b0
Dump of assembler code for function runtime.sigtrampgo:
////// Test with modified go1.11.2 for runc:
We cannot get the argument sig, because it is optimized out. Tried
#0 0x00000000004420d0 in runtime.(*sigctxt).preparePanic (c=0x4000058cb0, gp=, sig=) at /usr/lib/golang/src/runtime/signal_arm64.go:68
Decode the siginfo struct:
We modified sigtrampgo so that, at the check where it would otherwise panic, it calls exit(sig) directly, and we got status code 11. So the signal should be 11, but we cannot get more information about its cause.
////// How to reproduce?
Create many containers (more than 20) using the command below, and wait for runc to dump core (1-3 hours):
Tested in a VM: not reproduced. |
Using go1.11.2, we force an exit when g.m == nil in sigtrampgo. We still get a core dump, which is quite strange.
1645 mp := lockextra(false)
The mp should not be 0, and _g_ also looks bad. |
Update: we added logging to runc to detect where it crashes, and we found the location.
Modified the code to exclude the json package:
How can a pipe read crash the Go runtime? |
@jing-rui I don't think a pipe read is causing the crash. The crash is occurring while handling a signal. My guess would be that your program has paused at the point of the pipe read, and then the signal occurs. |
We found another crash point.
runc's cmd.Start() calls "runc init" to exec a command in the container namespace. When "runc init" runs, the C code nsexec() in nsexec.c runs before any Go code. We added logs to nsexec.c and to "runc init"; the logs show that nsexec() exits successfully, but the parent process crashes.
The parent runc crashes while "runc init" is running, so we added logs to "runc init" and got the expected log output.
From the core dump file, syscall.Syscall6() should be the wait4 call?
I'm wondering whether p.cmd.Process.Wait() can cause a runtime crash? |
It's quite unlikely that the backtrace will show you what is causing the crash. The backtrace is typically going to show you a point where the program is stalled. The crash is due to a signal being received, and since a signal can occur at any time, and since most programs spend most of their time stalled, the odds are very good that when the signal occurs the program will be stalled. In this case, stalled in the Is there a way that we can recreate this problem ourselves? |
I am not sure if its related but the linux/arm (32 bit) build of Consul crashes in this exact spot. Its handling a SIGPROF signal but the call to To reproduce, I use the binary here: https://releases.hashicorp.com/consul/1.5.2/consul_1.5.2_linux_arm.zip Run |
I am not sure if it's related either, but the linux/arm32 build of containerd-shim crashes at the same point. I built containerd (v1.2.7) for armv6l using go 1.11 and ran a Docker container on arm32 (Raspberry Pi 3). The container is configured to check its health. This SEGV happens in containerd-shim after it has been running for a long time (30 minutes to 24 hours).
|
After further debugging, I found that roughly 75% of the time I am seeing a segfault when the For the 75% case I setup some hardware watchpoints to see where the go/src/runtime/sys_linux_arm.s Line 296 in 89d300b
The backtrace when my watchpoint was hit is:
So it would seem that the call to |
@mkeeler thanks! This is very helpful. I think you're right that R10 may be clobbered during __vdso_clock_gettime. Normally this is fine as the VDSO code would restore R10 upon return. But if we get a signal during VDSO, we would still see the bad g. I think we should save g to TLS before making a VDSO call, and load g from TLS in the signal handler. |
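To make the proposal concrete, here is a toy model of the race in plain Go. It is only a sketch: the thread struct, its gReg and savedG fields, and every function name below are hypothetical, the "signal" is simulated by a callback that runs while the register is clobbered, and the actual runtime change may look quite different.

```go
package main

import "fmt"

// thread is a toy stand-in for an OS thread running Go code. gReg models the
// dedicated g register (R10 on 32-bit ARM) and savedG models a per-thread
// slot, such as TLS, that a VDSO call does not touch.
type thread struct {
	gReg   uint64
	savedG uint64
}

// vdsoCall models a VDSO function that uses the g register as scratch space
// and restores it before returning. interrupt runs at the point where the
// register holds garbage, like a signal arriving in the middle of the call.
func (t *thread) vdsoCall(interrupt func()) {
	orig := t.gReg
	t.gReg = 0x1 // clobbered by the callee
	if interrupt != nil {
		interrupt()
	}
	t.gReg = orig // restored on return
}

// callVDSOSavingG stashes g in the per-thread slot before the call and
// clears the slot afterwards.
func (t *thread) callVDSOSavingG(interrupt func()) {
	t.savedG = t.gReg
	t.vdsoCall(interrupt)
	t.savedG = 0
}

// handlerG returns the g a signal handler would use. With trustRegister set
// it reads the register directly; otherwise it prefers the saved copy.
func (t *thread) handlerG(trustRegister bool) uint64 {
	if !trustRegister && t.savedG != 0 {
		return t.savedG
	}
	return t.gReg
}

// observe reports which g the handler sees when a signal lands mid-call.
func observe(useFix bool) uint64 {
	t := &thread{gReg: 0x4000000480}
	var seen uint64
	interrupt := func() { seen = t.handlerG(!useFix) }
	if useFix {
		t.callVDSOSavingG(interrupt)
	} else {
		t.vdsoCall(interrupt)
	}
	return seen
}

func main() {
	fmt.Printf("handler sees g = %#x without saving\n", observe(false))
	fmt.Printf("handler sees g = %#x with saved g\n", observe(true))
}
```

Without the save, the simulated handler picks up the clobbered value; with it, the handler recovers the original g even though the register is still garbage at that moment.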
__vdso_clock_gettime definitely does use r10:
|
Both the arm 32 and arm 64 implementations of sigtramp conditionally load the Arm 64: go/src/runtime/sys_linux_arm64.s Lines 346 to 349 in 89d300b
Arm 32: go/src/runtime/sys_linux_arm.s Lines 442 to 444 in 89d300b
When debugging I also set some conditional breakpoints at that line in the arm 32 code to break when r0 was 0 and r10 was 0x1. It was hitting so the branch into the So the Note: to ensure my conditional breakpoint was working as expected I also set a breakpoint here: Line 62 in 89d300b
In the case where there was a segfault, sigtramp was always not taking the branch where |
If I understand correctly we only have TLS set up if cgo is used. When cgo is not used, I'm not sure where we can save the g register... |
Ah, okay. I was thinking that check was whether or not Go was in the middle of executing a cgo function, not whether it was enabled/disable. That would make sense because Consul gets compiled with cgo disabled. I will try with it enabled and see if the problem goes away. |
Enabling cgo probably doesn't solve the problem by itself, because we still didn't save the g register in runtime.walltime and nanotime. |
@cherrymui The following code can reproduce the segfault:
package main

import (
	_ "net/http/pprof"
	"fmt"
	"html"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))
	})
	http.HandleFunc("/hi", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hi")
	})
	go func() {
		for {
			time.Now()
			//time.Sleep(1 * time.Millisecond)
		}
	}()
	log.Fatal(http.ListenAndServe(":8500", nil))
}
Compile with:
Tested With: go 1.12.7
Reproduction:
#!/bin/bash
for ((i=1;i<=4;i++))
do
    echo "Round $i"
    curl -s http://localhost:8500/debug/pprof/heap >/dev/null
    curl -s http://localhost:8500/debug/pprof/profile?seconds=30 >/dev/null &
    curl -s http://localhost:8500/debug/pprof/trace?seconds=30 >/dev/null &
    curl -s http://localhost:8500/debug/pprof/goroutine >/dev/null
    while test "$(jobs | wc -l)" -gt 0
    do
        sleep 1
        jobs
        jobs | wc -l
    done
done
Using the pprof debug endpoints causes various profiling timers to be set up, which eventually deliver SIGPROF signals asynchronously. |
@mkeeler How long does it take your program to crash? I thought I would be able to reproduce the problem with this function, but so far no luck on the arm and arm64 builders.
func TimeProfNow() {
	f, err := ioutil.TempFile("", "timeprofnow")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	t0 := time.Now()
	t1 := t0
	// We should get a profiling signal 100 times a second,
	// so running for 10 seconds should be sufficient.
	for t1.Sub(t0) < 10*time.Second {
		t1 = time.Now()
	}
	pprof.StopCPUProfile()
	name := f.Name()
	if err := f.Close(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if err := os.Remove(name); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
} |
For me it crashes within 30 seconds, and often sooner. |
I'm still looking at this, but I'm removing the release-blocker label because this is not a regression. |
I see. The environment configuration used to create images for the linux-arm and linux-arm64 builders are in these directories: From a quick look, it seems they both use pre-made Docker images as base:
In order to turn on VDSO, I understand we'd either have to find alternative images to use, or create our own. |
What if, as a first step, we try updating from xenial to bionic and see if that changes anything? Is that hard or inadvisable? I really know nothing about this. |
That sounds like a good first step to try. It's neither hard nor inadvisable, but it takes some work. Updating a builder image is a step we need to do whenever updating a builder to a new OS version, or when making kernel-level adjustments such as this one (we've done similar adjustments in the past to enable SMT in order to improve performance, see CL 145022). There's a relatively fixed amount of work to do each time. Do you mind opening a separate issue for the task of updating the builder from xenial to bionic so we can track it there? |
Opened #33574. |
We created a repo, https://github.com/jing-rui/mypipe, which can reproduce the crash.
On our platform, built with go version go1.12.7 linux/arm64, we open about 5 terminals and run ./gencore.sh; it crashes in less than 30 minutes, normally about 10. An example trace is at https://github.com/jing-rui/mypipe/tree/master/crash-traceback
runtime stack:
We find that it is easy to reproduce when the loop is outside mypipe.go, that is, when we loop in gencore.sh. If we put the loop inside mypipe.go like below, it does not crash even after a long test. |
The crash occurs when the Go runtime calls a VDSO function (say __vdso_clock_gettime) and a signal arrives on that thread. Since VDSO functions temporarily clobber the G register (R10), Go functions executed asynchronously on that thread (i.e. Go's signal handler) can try to load data from the clobbered G, which causes a segmentation fault.
Change https://golang.org/cl/192937 mentions this issue: |
Enable CGO for all arm/arm64 builds to address Go bug. golang/go#32912 Also restrict arm(64) builds to Linux only as it is the only one anyone is using.
Build all arm binaries with CGO enabled to address golang bug.. golang/go#32912 Includes arm64 to standardize the platforms supported with those consul-template supports. Fixes #227
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes, we reproduce it with go1.11.2 and go1.13beta1.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
We are running Docker tests on an arm64/aarch64 physical machine with kernel 4.19.36. The Docker containers are configured with health-cmd, so dockerd calls the exec command periodically. The test framework also executes docker run/stop commands against the containers. The core dumps happen in containerd-shim and runc (1-5 core dumps per day).
What did you expect to see?
no crash in runtime
What did you see instead?
containerd-shim core with go1.13beta1.
It looks like X0 is a valid g struct, and g.m.gsignal.stack in memory is fine, but X1, which is loaded from g.m, is 0xd, a bad value.
The runc core dump is the same:
runc crashes at the same position, but with X1=0. The runc core is from go1.11.2.