runtime: fatal error: free list corrupted (3) #13287

msackman · 2015-11-17T08:25:59Z

Possible dup of #11411 and #12879

runtime: free list of span 0x7ea0832a2d40:
0xc87618ea00 -> 0x80c87618fa40 (BAD)
fatal error: free list corrupted

runtime stack:
runtime.throw(0x85ecc0, 0x13)
    /home/matthew/src/golang/go1.5.1/src/runtime/panic.go:527 +0x90
runtime.mSpan_Sweep(0x7ea0832a2d40, 0x18100000100, 0xc80002a801)
    /home/matthew/src/golang/go1.5.1/src/runtime/mgcsweep.go:186 +0x800
runtime.sweepone(0x439b12)
    /home/matthew/src/golang/go1.5.1/src/runtime/mgcsweep.go:97 +0x154
runtime.gosweepone.func1()
    /home/matthew/src/golang/go1.5.1/src/runtime/mgcsweep.go:109 +0x21
runtime.systemstack(0xc820023500)
    /home/matthew/src/golang/go1.5.1/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
    /home/matthew/src/golang/go1.5.1/src/runtime/proc1.go:674

> go version
go version go1.5.1 linux/amd64

The software is a distributed database server. At the time, there were 3 servers running, all connected to each other (and all running on the same machine). All the servers are running the exact same binary. Clients would connect, run some tests, disconnect. I was asleep.

From the rest of the stack traces, it looks as though the panic happened 22 minutes after the server was started and 14 minutes after the last client disconnect (test had passed). All 3 connected servers would have been idle at this point.

Of the 3 servers, one (server1) failed with the above, one survived (server2) until the morning when I found it, and the other (server3) appears to have failed at exactly the same time with:

fatal error: C malloc failed

goroutine 77 [running]:
runtime.throw(0x8241f0, 0xf)
    /home/matthew/src/golang/go1.5.1/src/runtime/panic.go:527 +0x90 fp=0xc8e6765408 sp=0xc8e67653f0
runtime.cmalloc(0xa, 0x409617)
    /home/matthew/src/golang/go1.5.1/src/runtime/cgocall.go:148 +0x68 fp=0xc8e6765438 sp=0xc8e6765408
net._Cfunc_CString(0xc8202cb800, 0x9, 0xc8e67654e8)
    ??:0 +0x28 fp=0xc8e67654a8 sp=0xc8e6765438
net.cgoLookupIPCNAME(0xc8202cb800, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/cgo_unix.go:108 +0x13c fp=0xc8e67655d0 sp=0xc8e67654a8
net.cgoLookupIP(0xc8202cb800, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/cgo_unix.go:163 +0x56 fp=0xc8e6765628 sp=0xc8e67655d0
net.lookupIP(0xc8202cb800, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/lookup_unix.go:67 +0x94 fp=0xc8e6765698 sp=0xc8e6765628
net.glob.func15(0x8d0300, 0xc8202cb800, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/hook.go:10 +0x4d fp=0xc8e67656d8 sp=0xc8e6765698
net.lookupIPMerge.func1(0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/lookup.go:68 +0x71 fp=0xc8e6765758 sp=0xc8e67656d8
internal/singleflight.(*Group).doCall(0xc2c570, 0xc8764ae230, 0xc8202cb800, 0x9, 0xc8e6765950)
    /home/matthew/src/golang/go1.5.1/src/internal/singleflight/singleflight.go:93 +0x2c fp=0xc8e6765808 sp=0xc8e6765758
internal/singleflight.(*Group).Do(0xc2c570, 0xc8202cb800, 0x9, 0xc8e6765950, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/internal/singleflight/singleflight.go:63 +0x284 fp=0xc8e6765878 sp=0xc8e6765808
net.lookupIPMerge(0xc8202cb800, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/lookup.go:69 +0x9b fp=0xc8e6765988 sp=0xc8e6765878
net.lookupIPDeadline(0xc8202cb800, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/lookup.go:91 +0xde fp=0xc8e6765bc0 sp=0xc8e6765988
net.internetAddrList(0x821bd8, 0x3, 0xc8202cb800, 0xf, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/matthew/src/golang/go1.5.1/src/net/ipsock.go:252 +0x6ee fp=0xc8e6765d28 sp=0xc8e6765bc0
net.ResolveTCPAddr(0x821bd8, 0x3, 0xc8202cb800, 0xf, 0x5d8abc, 0x0, 0x0)
    /home/matthew/src/golang/go1.5.1/src/net/tcpsock.go:56 +0x11b fp=0xc8e6765de8 sp=0xc8e6765d28
....my code.

There could have been memory pressure at the time, but I find it unlikely, given that all indications are that this was some 14 minutes after the last test finished (and passed). Syslog does not show any activity by the kernel OOM killer.

This server (server3) would only have been in this code because it was trying to reconnect to server1 after server1 had failed. So this would have been exactly 5 seconds after server1 had failed. With server1 having failed, I can't believe there really could have been any memory pressure in the system.

Thus I think this could be the same issue as #12879 in that the server is idle at the time. I do not know if the two crashes are or could be related at all. I shall attempt to see how reproducible this is.


msackman · 2015-11-17T09:27:43Z

Repeated the test. At the end of the test each server has about 2GB resident, so nothing unusual. The servers do fall completely idle, as expected. But now, some 30 minutes later, there is still no crash.

ianlancetaylor · 2015-11-17T13:28:54Z

CC @aclements @RLH

rsc · 2015-11-18T15:35:37Z

I'm going to go out on a limb and say this is a duplicate of #12879. I would very much like to see a simple way to reproduce this, though. If you find one, please comment on that issue. Thanks.

ianlancetaylor added this to the Go1.5.2 milestone Nov 17, 2015

rsc closed this as completed Nov 18, 2015

rsc removed this from the Go1.5.2 milestone Nov 18, 2015

aclements mentioned this issue Nov 23, 2015

syscall: document or fix liveness during Syscall #13372

Closed

This was referenced Jan 20, 2016

testing: Test minio server with restic project and do load testing. minio/minio#1024

Closed

rebase: Switch s3 library to allow for s3 compatible backends restic/restic#366

Merged

golang locked and limited conversation to collaborators Nov 17, 2016

gopherbot added the FrozenDueToAge label Nov 17, 2016
