runtime: huge page fragmentation on Linux leads to out of memory #12233

Closed
bmhatfield opened this issue Aug 20, 2015 · 71 comments

@bmhatfield

I recently upgraded a moderately large production service, serving thousands of requests per second, from Go 1.4.2 to Go 1.5. This particular program has been run on every Go version from 1.1 through (now) 1.5 and has been stable in production for more than two years. Switching to Go 1.5 brought dramatic improvements in GC pause time, as expected, but the processes are now intermittently exiting (somewhere between 1 and 6 hours of uptime) with the panic below.

On each version, this has been running on Ubuntu 12.04.

fatal error: runtime: cannot allocate memory

runtime stack:
runtime.throw(0x8c8f20, 0x1f)
        /usr/local/go/src/runtime/panic.go:527 +0x90
runtime.persistentalloc1(0x4000, 0x8, 0xaa8a78, 0xa8d3a8)
        /usr/local/go/src/runtime/malloc.go:878 +0x2e3
runtime.persistentalloc.func1()
        /usr/local/go/src/runtime/malloc.go:831 +0x3b
runtime.systemstack(0x7ff1377fdc78)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.persistentalloc(0x4000, 0x0, 0xaa8a78, 0x5)
        /usr/local/go/src/runtime/malloc.go:832 +0x58
runtime.fixAlloc_Alloc(0xa95eb8, 0xa8d3a8)
        /usr/local/go/src/runtime/mfixalloc.go:67 +0xee
runtime.mHeap_AllocSpanLocked(0xa89ba0, 0x5, 0x412c1b)
        /usr/local/go/src/runtime/mheap.go:561 +0x1a7
runtime.mHeap_Alloc_m(0xa89ba0, 0x5, 0x100000000, 0xc8200205bc)
        /usr/local/go/src/runtime/mheap.go:425 +0x1ac
runtime.mHeap_Alloc.func1()
        /usr/local/go/src/runtime/mheap.go:484 +0x41
runtime.systemstack(0x7ff1377fddb8)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.mHeap_Alloc(0xa89ba0, 0x5, 0x10100000000, 0x923300)
        /usr/local/go/src/runtime/mheap.go:485 +0x63
runtime.largeAlloc(0x9c40, 0x0, 0x4d4f07)
        /usr/local/go/src/runtime/malloc.go:745 +0xb3
runtime.mallocgc.func3()
        /usr/local/go/src/runtime/malloc.go:634 +0x33
runtime.systemstack(0xc820020000)
        /usr/local/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
        /usr/local/go/src/runtime/proc1.go:674

goroutine 1145040165 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc8226375f0 sp=0xc8226375e8
runtime.mallocgc(0x9c40, 0x731f60, 0x0, 0x60)
        /usr/local/go/src/runtime/malloc.go:635 +0x9c4 fp=0xc8226376c0 sp=0xc8226375f0
runtime.newarray(0x731f60, 0x1388, 0x3)
        /usr/local/go/src/runtime/malloc.go:777 +0xc9 fp=0xc822637700 sp=0xc8226376c0
runtime.makechan(0x7320e0, 0x1388, 0xc93f800000)
        /usr/local/go/src/runtime/chan.go:72 +0x135 fp=0xc822637750 sp=0xc822637700
github.com/crashlytics/gusset/caching-proxy.(*CachingProxy).Request(0xc820110550, 0x7ff13c6c01b0, 0xc964463810, 0xc9644b30a4, 0x0, 0x0)
        /srv/go/src/github.com/crashlytics/gusset/caching-proxy/caching-proxy.go:445 +0x6d0 fp=0xc822637908 sp=0xc822637750
main.cachedRequestHandler(0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:69 +0x437 fp=0xc822637b68 sp=0xc822637908
net/http.HandlerFunc.ServeHTTP(0x922e98, 0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /usr/local/go/src/net/http/server.go:1422 +0x3a fp=0xc822637b88 sp=0xc822637b68
net/http.(*ServeMux).ServeHTTP(0xc82010d530, 0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /usr/local/go/src/net/http/server.go:1699 +0x17d fp=0xc822637be0 sp=0xc822637b88
net/http.serverHandler.ServeHTTP(0xc82011c540, 0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /usr/local/go/src/net/http/server.go:1862 +0x19e fp=0xc822637c40 sp=0xc822637be0
net/http.(*conn).serve(0xc8aef18160)
        /usr/local/go/src/net/http/server.go:1361 +0xbee fp=0xc822637f98 sp=0xc822637c40
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc822637fa0 sp=0xc822637f98
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:1910 +0x3f6
@davecheney
Contributor

@bmhatfield can you please run the process with env GODEBUG=gctrace=1 and include the log, or at least the most relevant parts, in this report? I suspect memory usage will do one of two things.

  1. climb slowly and inevitably towards a crash
  2. operate stably for a period of time, then skyrocket.

The gctrace log will show both.

BTW, this information is always being collected, just normally suppressed, so there is little impact in enabling it in production.

@bmhatfield
Author

Can do. In the meantime, I am not seeing any odd memory usage (stack, heap, or just general system memory). I have graphs of everything available in https://github.com/bmhatfield/go-runtime-metrics/blob/master/runstats.go if you have any specific requests.
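
For reference, a minimal sketch of the kind of collection behind graphs like these, using only the standard library; this is an illustration, not the actual go-runtime-metrics code:

package main

import (
    "log"
    "runtime"
    "time"
)

// Sample the runtime's own view of the heap and the goroutine count on a
// ticker and log it, so it can be graphed alongside system-level memory.
func main() {
    var m runtime.MemStats
    for range time.Tick(10 * time.Second) {
        runtime.ReadMemStats(&m)
        log.Printf("heap_alloc=%dMB heap_sys=%dMB num_gc=%d goroutines=%d",
            m.HeapAlloc>>20, m.HeapSys>>20, m.NumGC, runtime.NumGoroutine())
    }
}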

Here's a high-level view of system memory (plotted against a metric that demonstrates the crash occurrences):

[image: system memory plotted against crash occurrences]

@ianlancetaylor
Member

CC @RLH @aclements

@ianlancetaylor ianlancetaylor added this to the Go1.5.1 milestone Aug 20, 2015
@ianlancetaylor ianlancetaylor changed the title Go 1.5 upgrade: Intermittent 'fatal error: runtime: cannot allocate memory' runtime: Go 1.5 upgrade: Intermittent 'fatal error: runtime: cannot allocate memory' Aug 20, 2015
@aclements
Member

@bmhatfield, do you happen to know the approximate heap size (or have the equivalent graphs) for your service when it was running on Go 1.4?

It looks like your heap size graph is going up even between restarts (each restart cycle quickly climbs to a larger heap size than the previous cycle had when it crashed). That's a bit perplexing.

@bmhatfield
Author

Hi @aclements - here's an annotated graph of the normal heap size for this service. My general takeaway is that the heap size appears to be within expected bounds for this service, and the normal operating size does not appear to be "changed" as of 1.5.

[image: annotated heap-usage graph]

This service is essentially an HTTP cache using the go-cache library with some additional features focused on analyzing the traffic it receives (those features being the primary reason this service exists). When this service is deployed (or re-spawned after a crash), it exhibits some oscillation for a short period of time, related to the initially-synchronized 10 minute cache entry TTLs, which explains the heap growth and then reduction that you're seeing. This behavior normalizes relatively quickly, and was "normalized" when these crashes occurred.
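
For anyone unfamiliar with the library, a minimal sketch of that usage pattern, assuming the commonly used patrickmn/go-cache package and hypothetical keys (the actual cache code isn't shown in this issue): entries written shortly after startup share a 10-minute TTL, so they expire and get refilled together, which produces the synchronized oscillation described above.

package main

import (
    "fmt"
    "time"

    "github.com/patrickmn/go-cache"
)

func main() {
    // Default TTL of 10 minutes; expired entries purged every minute.
    c := cache.New(10*time.Minute, time.Minute)

    // Hypothetical key/value: cache a response body under its request path.
    c.Set("/some/request/path", []byte("cached response body"), cache.DefaultExpiration)

    if body, found := c.Get("/some/request/path"); found {
        fmt.Printf("cache hit: %d bytes\n", len(body.([]byte)))
    }
}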

@aclements
Member

Thanks for the plot and explanation. Is the traceback on the crash always/usually the same? In the one you posted, it's failing in a surprisingly specific place.

@bmhatfield
Author

Great question, sorry I did not think of that before. Doing some digging now, I see one that looks like:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f82381d7425 m=2

goroutine 0 [idle]:

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7f82387a5660, 0x72, 0xc820010100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820133560, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820133560, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc820133500, 0x0, 0x7f82387a6628, 0xc86e6d93a0)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc82010e0e8, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc82010e0e8, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc82011e540, 0x7f82387a65f0, 0xc82010e0e8, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc82011e540, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 474 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

and another like this:

runtime/cgo: out of memory in thread_start
SIGABRT: abort
PC=0x7fb4da635425 m=2

goroutine 0 [idle]:

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7fb4d84fb240, 0x72, 0xc820010100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820131560, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820131560, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc820131500, 0x0, 0x7fb4d853f080, 0xc92a6cde80)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc820598028, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc820598028, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc820110540, 0x7fb4d853f048, 0xc820598028, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc820110540, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 464 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

Please note that at the time that this claimed "out of memory", the system had been using less than 50% (~7GB) of available memory (~15GB instances), and the process has been able to claim more memory from the system both before (on 1.5) and after (still 1.5) the occurrences of those crashes.

@bmhatfield
Author

Here's another example of the 'runtime out of memory' traceback, from a different host at a different time:

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x8ba570, 0x16)
        /usr/local/go/src/runtime/panic.go:527 +0x90
runtime.sysMap(0xc8159fc000, 0x8000, 0xc96bf50000, 0xaa8a98)
        /usr/local/go/src/runtime/mem_linux.go:143 +0x9b
runtime.mHeap_MapBits(0xa89ba0, 0xc96c050000)
        /usr/local/go/src/runtime/mbitmap.go:144 +0xcc
runtime.mHeap_SysAlloc(0xa89ba0, 0x100000, 0x7f331b306f10)
        /usr/local/go/src/runtime/malloc.go:424 +0x186
runtime.mHeap_Grow(0xa89ba0, 0x8, 0x0)
        /usr/local/go/src/runtime/mheap.go:628 +0x63
runtime.mHeap_AllocSpanLocked(0xa89ba0, 0x1, 0x412ab1)
        /usr/local/go/src/runtime/mheap.go:532 +0x5f1
runtime.mHeap_Alloc_m(0xa89ba0, 0x1, 0x2, 0xc96b54dff0)
        /usr/local/go/src/runtime/mheap.go:425 +0x1ac
runtime.mHeap_Alloc.func1()
        /usr/local/go/src/runtime/mheap.go:484 +0x41
runtime.systemstack(0x7f3332777d28)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.mHeap_Alloc(0xa89ba0, 0x1, 0x10000000002, 0x410524)
        /usr/local/go/src/runtime/mheap.go:485 +0x63
runtime.mCentral_Grow(0xa91188, 0x0)
        /usr/local/go/src/runtime/mcentral.go:190 +0x93
runtime.mCentral_CacheSpan(0xa91188, 0x0)
        /usr/local/go/src/runtime/mcentral.go:86 +0x4d4
runtime.mCache_Refill(0x7f33367b3e28, 0xc900000002, 0xc964f2d7d8)
        /usr/local/go/src/runtime/mcache.go:118 +0xcf
runtime.mallocgc.func2()
        /usr/local/go/src/runtime/malloc.go:611 +0x2b
runtime.systemstack(0xc82001aa00)
        /usr/local/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
        /usr/local/go/src/runtime/proc1.go:674

goroutine 1231230574 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc964f2d588 sp=0xc964f2d580
runtime.mallocgc(0x10, 0x7410a0, 0x430000c900000000, 0xc96b54dff0)
        /usr/local/go/src/runtime/malloc.go:612 +0x842 fp=0xc964f2d658 sp=0xc964f2d588
runtime.newarray(0x7410a0, 0x1, 0xc96bf4c0e8)
        /usr/local/go/src/runtime/malloc.go:777 +0xc9 fp=0xc964f2d698 sp=0xc964f2d658
runtime.makeslice(0x730d00, 0x1, 0x1, 0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/slice.go:32 +0x165 fp=0xc964f2d6e8 sp=0xc964f2d698
net/http.Header.clone(0xc96b019950, 0xa7b200)
        /usr/local/go/src/net/http/header.go:63 +0x13e fp=0xc964f2d818 sp=0xc964f2d6e8
net/http.(*response).WriteHeader(0xc96ba11970, 0xc8)
        /usr/local/go/src/net/http/server.go:694 +0x102 fp=0xc964f2d908 sp=0xc964f2d818
main.cachedRequestHandler(0x7f3331737128, 0xc96ba11970, 0xc968f25960)
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:93 +0x9b1 fp=0xc964f2db68 sp=0xc964f2d908
net/http.HandlerFunc.ServeHTTP(0x922e98, 0x7f3331737128, 0xc96ba11970, 0xc968f25960)
        /usr/local/go/src/net/http/server.go:1422 +0x3a fp=0xc964f2db88 sp=0xc964f2db68
net/http.(*ServeMux).ServeHTTP(0xc82010d530, 0x7f3331737128, 0xc96ba11970, 0xc968f25960)
        /usr/local/go/src/net/http/server.go:1699 +0x17d fp=0xc964f2dbe0 sp=0xc964f2db88
net/http.serverHandler.ServeHTTP(0xc82011c540, 0x7f3331737128, 0xc96ba11970, 0xc968f25960)
        /usr/local/go/src/net/http/server.go:1862 +0x19e fp=0xc964f2dc40 sp=0xc964f2dbe0
net/http.(*conn).serve(0xc8d3778dc0)
        /usr/local/go/src/net/http/server.go:1361 +0xbee fp=0xc964f2df98 sp=0xc964f2dc40
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc964f2dfa0 sp=0xc964f2df98
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:1910 +0x3f6

@bmhatfield
Author

And another example in thread_start:

runtime/cgo: out of memory in thread_start
SIGABRT: abort
PC=0x7f8a5a918425 m=11

goroutine 0 [idle]:

goroutine 47 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc907f09bd8 sp=0xc907f09bd0
runtime.newproc(0x38, 0x7f8a5805e4a8)
        /usr/local/go/src/runtime/proc1.go:2213 +0x62 fp=0xc907f09c20 sp=0xc907f09bd8
github.com/crashlytics/go-libraries/extralog.Timer(0xc8f3154510, 0x24, 0x900df)
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/go-libraries/extralog/extralog.go:63 +0x2b5 fp=0xc907f09d00 sp=0xc907f09c20
github.com/crashlytics/go-libraries/startablebeanstalk.(*StartableBeanstalkWriter).Write(0xc82010e9c0, 0xc931f381e0, 0x1be, 0x1d7, 0x1423839494, 0x0, 0x0)
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/go-libraries/startablebeanstalk/startablebeanstalk.go:68 +0xbb9 fp=0xc907f09ee8 sp=0xc907f09d00
github.com/crashlytics/go-libraries/poolwriter.WriterWorker(0x7f8a5805e4b0, 0xc82010e9c0, 0xc82011c480)
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/go-libraries/poolwriter/poolwriter.go:34 +0xd2 fp=0xc907f09f88 sp=0xc907f09ee8
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc907f09f90 sp=0xc907f09f88
created by github.com/crashlytics/go-libraries/poolwriter.NewWriter
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/go-libraries/poolwriter/poolwriter.go:66 +0x1f3

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7f8a5aee6ce8, 0x72, 0xc820010100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820131560, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820131560, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc820131500, 0x0, 0x7f8a5805e758, 0xc93ddb03c0)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc82002c0a8, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc82002c0a8, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc82011c540, 0x7f8a5805e720, 0xc82002c0a8, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc82011c540, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 321 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

@bmhatfield
Author

And finally, another example of the pthread_create crash:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7ff354fc3425 m=2

goroutine 0 [idle]:

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7ff355591840, 0x72, 0xc820010100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820139560, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820139560, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc820139500, 0x0, 0x7ff3526cd688, 0xc894f90780)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc8205d2000, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc8205d2000, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc820124540, 0x7ff3526cd650, 0xc8205d2000, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc820124540, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 617 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

@bmhatfield
Author

Here's a graph demonstrating the cadence of these crashes. The "causes" are a mixture of the above tracebacks. It's probably worth highlighting (with the discovery of additional tracebacks) that this did not happen in 1.4.2 - no crashes at all.

[image: graph of crash cadence]

@bmhatfield
Author

@aclements @davecheney Caught one with the GC logging on! And with a new backtrace "reason". I've included an arbitrary number of GC log messages "back" from the crash:

gc 575 @9912.118s 3%: 11+150+2.0+810+21 ms clock, 57+150+0+3437/1615/0.029+109 ms cpu, 4809->4873->2483 MB, 4932 MB goal, 8 P
gc 576 @9930.019s 3%: 8.4+65+30+885+22 ms clock, 67+65+0+2530/1754/739+177 ms cpu, 4718->4782->2522 MB, 4839 MB goal, 8 P
gc 577 @9947.601s 3%: 11+168+5.2+812+22 ms clock, 57+168+0+3409/1622/0.046+111 ms cpu, 4793->4858->2507 MB, 4916 MB goal, 8 P
gc 578 @9965.602s 3%: 8.4+182+1.5+771+21 ms clock, 67+182+0+3283/1538/163+175 ms cpu, 4762->4824->2448 MB, 4884 MB goal, 8 P
gc 579 @9983.589s 3%: 8.1+111+5.0+863+21 ms clock, 40+111+0+3446/1725/0.022+108 ms cpu, 4652->4715->2513 MB, 4771 MB goal, 8 P
gc 580 @10001.383s 3%: 8.2+185+1.8+783+21 ms clock, 41+185+0+3417/1565/128+108 ms cpu, 4776->4839->2462 MB, 4898 MB goal, 8 P
gc 581 @10019.072s 3%: 8.5+100+2.4+857+22 ms clock, 51+100+0+3431/1715/35+133 ms cpu, 4679->4739->2521 MB, 4799 MB goal, 8 P
gc 582 @10037.624s 3%: 8.3+160+5.6+808+23 ms clock, 41+160+0+3379/1616/0.019+115 ms cpu, 4797->4861->2482 MB, 4920 MB goal, 8 P
scvg66: 14 MB released
scvg66: inuse: 4804, idle: 4598, sys: 9403, released: 4358, consumed: 5044 (MB)
gc 583 @10055.601s 3%: 8.2+111+0.57+837+21 ms clock, 66+111+0+3229/1668/31+174 ms cpu, 4715->4777->2435 MB, 4836 MB goal, 8 P
gc 584 @10072.948s 3%: 8.2+75+23+868+22 ms clock, 41+75+0+3131/1736/187+111 ms cpu, 4628->4691->2497 MB, 4747 MB goal, 8 P
gc 585 @10091.237s 3%: 8.2+163+2.7+820+22 ms clock, 41+163+0+3509/1637/83+111 ms cpu, 4747->4809->2450 MB, 4869 MB goal, 8 P
gc 586 @10109.203s 3%: 11+119+4.5+829+22 ms clock, 67+119+0+2848/1658/524+132 ms cpu, 4656->4711->2505 MB, 4776 MB goal, 8 P
gc 587 @10127.338s 3%: 8.3+120+1.0+838+22 ms clock, 41+120+0+3317/1675/0+110 ms cpu, 4777->4841->2470 MB, 4900 MB goal, 8 P
gc 588 @10145.555s 3%: 8.4+163+3.5+784+21 ms clock, 42+163+0+3220/1563/47+109 ms cpu, 4693->4755->2429 MB, 4813 MB goal, 8 P
gc 589 @10163.364s 3%: 8.2+141+1.1+822+22 ms clock, 41+141+0+3332/1643/0.043+110 ms cpu, 4616->4678->2490 MB, 4734 MB goal, 8 P
gc 590 @10181.239s 3%: 11+138+1.6+805+21 ms clock, 80+138+0+3363/1609/39+153 ms cpu, 4733->4795->2448 MB, 4855 MB goal, 8 P
gc 591 @10198.829s 3%: 8.1+102+21+844+23 ms clock, 40+102+0+3442/1686/0.054+117 ms cpu, 4652->4715->2510 MB, 4771 MB goal, 8 P
scvg67: 37 MB released
scvg67: inuse: 4889, idle: 4513, sys: 9403, released: 4390, consumed: 5012 (MB)
gc 592 @10216.888s 3%: 8.2+133+7.9+815+21 ms clock, 57+133+0+3350/1631/50+152 ms cpu, 4773->4835->2468 MB, 4895 MB goal, 8 P
runtime:objectstart Span weird: p=0xc900000000 k=0x6480000 s.start=0xc8ffffa000 s.limit=0xc900003c40 s.state=2
fatal error: objectstart: bad pointer in unexpected span

goroutine 108 [running]:
runtime.throw(0x8ea820, 0x2b)
        /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc82058ee18 sp=0xc82058ee00
runtime.heapBitsForObject(0xc900000000, 0x0, 0x0, 0xc800000000, 0x7f8d16809d20)
        /usr/local/go/src/runtime/mbitmap.go:217 +0x287 fp=0xc82058ee50 sp=0xc82058ee18
runtime.scanobject(0xc95d6e8c30, 0xc82001f220)
        /usr/local/go/src/runtime/mgcmark.go:863 +0x239 fp=0xc82058ef20 sp=0xc82058ee50
runtime.gcDrain(0xc82001f220, 0x7d0)
        /usr/local/go/src/runtime/mgcmark.go:674 +0xf4 fp=0xc82058ef50 sp=0xc82058ef20
runtime.gcBgMarkWorker(0xc82001e000)
        /usr/local/go/src/runtime/mgc.go:1319 +0x281 fp=0xc82058efb8 sp=0xc82058ef50
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc82058efc0 sp=0xc82058efb8
created by runtime.gcBgMarkStartWorkers
        /usr/local/go/src/runtime/mgc.go:1238 +0x93

goroutine 1 [runnable]:
net.runtime_pollWait(0x7f8d30dc3e40, 0x72, 0xc82000e100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820131560, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820131560, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc820131500, 0x0, 0x7f8d30502038, 0xc92cf131a0)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc82060a010, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc82060a010, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc8201184e0, 0x7f8d30502000, 0xc82060a010, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc8201184e0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 170 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

@davecheney
Contributor

Are you able to track the number of threads active in that process? It looks like your program is consuming a lot of operating system threads, and the most common cause of that is DNS lookups.

Can you try setting GODEBUG=netdns=go to force the net package to use only the pure Go native resolver? If you already build with -tags netgo or some other dance to disable cgo, this advice may not be relevant, but it could be worth a try, as the incantation for this procedure does change frequently.

@bmhatfield
Author

Yep, can do. This generally does use a lot of threads, yes, as it's very network-IO heavy.

@davecheney
Contributor

That's interesting. Network IO should not consume excessive threads in go programs because of the integrated poller.

@bmhatfield
Author

@davecheney This program doesn't do any file IO at all, only network (http serving, http client, and a TCP client), and it's using about 4100 threads:

bhatfield:~$ ps -o nlwp 9763
NLWP
4168
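
For completeness, the same NLWP figure can be logged from inside the process on Linux by reading /proc/self/status; a minimal sketch, not necessarily anything the service does today:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// threadCount returns the kernel's thread count for this process by reading
// the "Threads:" line from /proc/self/status (Linux only).
func threadCount() (int, error) {
    f, err := os.Open("/proc/self/status")
    if err != nil {
        return 0, err
    }
    defer f.Close()

    s := bufio.NewScanner(f)
    for s.Scan() {
        fields := strings.Fields(s.Text())
        if len(fields) == 2 && fields[0] == "Threads:" {
            return strconv.Atoi(fields[1])
        }
    }
    return 0, s.Err()
}

func main() {
    n, err := threadCount()
    if err != nil {
        fmt.Println("error reading thread count:", err)
        return
    }
    fmt.Println("OS threads:", n)
}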

@davecheney
Contributor

@bmhatfield did you have a chance to try that GODEBUG change I suggested?

@bmhatfield
Author

I hadn't yet, will roll it out shortly.

@bmhatfield
Author

Hi,

We've been running with GODEBUG=netdns=go for a few days, and we're still seeing crashes of the following form:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fc6088f3425 m=9

goroutine 0 [idle]:

goroutine 18 [syscall]:
runtime.notetsleepg(0xa81b38, 0x5ed10, 0x0)
        /usr/local/go/src/runtime/lock_futex.go:202 +0x4e fp=0xc820027f38 sp=0xc820027f10
runtime.timerproc()
        /usr/local/go/src/runtime/time.go:209 +0xde fp=0xc820027fc0 sp=0xc820027f38
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc820027fc8 sp=0xc820027fc0
created by runtime.addtimerLocked
        /usr/local/go/src/runtime/time.go:116 +0x11f

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7fc5ef7bfcc0, 0x72, 0xc82000e100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820131560, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820131560, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc820131500, 0x0, 0x7fc606039670, 0xc89f316720)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc8200a4038, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc8200a4038, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc82011c540, 0x7fc606039638, 0xc8200a4038, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc82011c540, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 1242 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

and

runtime/cgo: out of memory in thread_start
SIGABRT: abort
PC=0x7f2bea6bc425 m=8

goroutine 0 [idle]:

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7f2be9e4a5d8, 0x72, 0xc82000e100)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc82000b720, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc82000b720, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc82000b6c0, 0x0, 0x7f2be9e4aee0, 0xc85750c2a0)
        /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc82002a178, 0x452790, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc82002a178, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc820014780, 0x7f2be9e4aea8, 0xc82002a178, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc820014780, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x136
main.main()
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:260 +0xb50

goroutine 17 [syscall, 357 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

...

rax    0x0
rbx    0x7f2be2ffcd70
rcx    0xffffffffffffffff
rdx    0x6
rdi    0x189a
rsi    0x18a1
rbp    0x431b00
rsp    0x7f2be2ffcb98
r8     0x7f2beaa41aa0
r9     0x7f2be2ffd700
r10    0x8
r11    0x206
r12    0x2c
r13    0x91fc1c
r14    0x0
r15    0x8
rip    0x7f2bea6bc425
rflags 0x206
cs     0x33
fs     0x0
gs     0x0

Sorry for the delay in gathering this data.

@davecheney
Contributor

Is that the entire crash report? Can you post the full thing somewhere?

@bmhatfield
Author

It is not; I cut off everything below what seemed to be the most relevant part of the panic. The rest of the crash report has info on roughly 3,000+ goroutines, so I thought it prudent to minimize. I'll post the whole thing up in a gist.

@davecheney
Contributor

Thanks, the full report should make it possible to figure out where all the native threads are going.

@bmhatfield
Author

Unfortunately these traces are too large for Chrome to paste into a gist, so here are a couple of share links to them in gzip form on Google Drive:

https://drive.google.com/file/d/0BxGgMbrngfJbOXE1a2R2MjRpeW8/view?usp=sharing

https://drive.google.com/file/d/0BxGgMbrngfJbVjh4VXNTdmo5eG8/view?usp=sharing

@ianlancetaylor
Member

@bmhatfield Do you mean GODEBUG=netdns=go ? Because GODEBUG=netdns won't do anything in particular.

@bmhatfield
Author

Ah, yes, you are correct. I just checked /proc/PID/environ, and confirmed we are indeed running with GODEBUG=netdns=go. Just a typo in the Github Issue, my apologies.

@rsc
Contributor

rsc commented Sep 5, 2015

I've been debugging a similar problem some teams are seeing inside Google. I assume this is on Linux? If you can, please try

sudo sysctl vm.overcommit_memory=1 

and see if that helps. I bet it will.

I am not sure why, but something we are doing differently compared to Go 1.4 seems to be confusing the Linux kernel memory overcommit logic, such that it decides there isn't enough memory on the machine to satisfy a completely reasonable request and starts returning ENOMEM.

I have not gotten a case I can reproduce on my own Linux workstation yet. I suspect once we do get one it will be easy to understand what we're doing to make the kernel unhappy and work around it. Any reproduction cases are most welcome.
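
For confirming that the setting took effect on a given host, the current policy can also be read from /proc/sys/vm/overcommit_memory; a minimal, Linux-only sketch that a service could log at startup:

package main

import (
    "fmt"
    "io/ioutil"
    "strings"
)

func main() {
    // 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = strict accounting.
    b, err := ioutil.ReadFile("/proc/sys/vm/overcommit_memory")
    if err != nil {
        fmt.Println("could not read overcommit policy:", err)
        return
    }
    fmt.Println("vm.overcommit_memory =", strings.TrimSpace(string(b)))
}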

@davecheney
Contributor

Thanks Brian.

Looking at report-2 I'm happy that there are no signs of the cgo DNS resolver being used, so we can rule that out.

I did find several goroutines, 454 of them, blocking on a mutex from a 3rd party dependency you have:

goroutine 743188177 [semacquire]:
net.runtime_Semacquire(0xc82000b57c)
        /usr/local/go/src/runtime/sema.go:48 +0x26
net.(*fdMutex).RWLock(0xc82000b570, 0x731b00, 0x37)
        /usr/local/go/src/net/fd_mutex.go:143 +0x16c
net.(*netFD).writeLock(0xc82000b570, 0x0, 0x0)
        /usr/local/go/src/net/fd_unix.go:176 +0x39
net.(*netFD).Write(0xc82000b570, 0xc89a8c08c0, 0x36, 0x40, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/fd_unix.go:301 +0x6c
net.(*conn).Write(0xc82002a130, 0xc89a8c08c0, 0x36, 0x40, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:184 +0xe4
github.com/crashlytics/g2s.(*statsd).publish(0xc82000ff10, 0xc8208f85c0, 0x1, 0x1)
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/g2s/g2s.go:84 +0x16a
github.com/crashlytics/g2s.(*statsd).Counter(0xc82000ff10, 0x3f800000, 0xc86a0ba4e0, 0x2b, 0xc8208f85a0, 0x1, 0x1)
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/g2s/g2s.go:134 +0x209
created by github.com/crashlytics/go-libraries/extralog.Counter
        /srv/go/src/github.com/crashlytics/gusset/Godeps/_workspace/src/github.com/crashlytics/go-libraries/extralog/extralog.go:69 +0x2b7

Is it possible to rule out that 3rd party service?

@bmhatfield
Author

@rsc something about that clicks for me, though it is a little surprising. Happy to make that change.

@davecheney That is our metrics library, and unfortunately it's not really possible to rule it out, as it's critical to the operation of the service :-( If there's some evidence it's a problem, however, I could invest some time changing its behavior to use fewer goroutines.
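
If it does come to that, the usual shape of the fix is a single writer goroutine draining a buffered channel, so counter calls enqueue a line instead of each spawning a goroutine that then blocks on the connection's write lock. A generic sketch, not the actual g2s/extralog API:

package main

import (
    "fmt"
    "log"
    "net"
    "time"
)

type statsdWriter struct {
    lines chan string
}

func newStatsdWriter(addr string) (*statsdWriter, error) {
    conn, err := net.DialTimeout("udp", addr, 5*time.Second)
    if err != nil {
        return nil, err
    }
    w := &statsdWriter{lines: make(chan string, 10000)}
    // Single writer goroutine: the only place conn.Write is called.
    go func() {
        for line := range w.lines {
            if _, err := conn.Write([]byte(line)); err != nil {
                log.Printf("statsd write failed: %v", err)
            }
        }
    }()
    return w, nil
}

// Counter enqueues a metric without blocking the caller; if the buffer is
// full the sample is dropped instead of piling up blocked goroutines.
func (w *statsdWriter) Counter(name string, n int) {
    select {
    case w.lines <- fmt.Sprintf("%s:%d|c\n", name, n):
    default:
    }
}

func main() {
    w, err := newStatsdWriter("127.0.0.1:8125")
    if err != nil {
        log.Fatal(err)
    }
    w.Counter("requests", 1)
    time.Sleep(100 * time.Millisecond) // give the writer a chance to flush
}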

EDIT: @rsc As for reproduction cases, this is not happening in anything that I'd call a consistent pattern; it's just observed behavior, with varying intervals between occurrences.

@rowland

rowland commented Sep 25, 2015

New test with some debug output:

XSQLDA* sqlda_alloc(long cols)
{
    XSQLDA *sqlda;

    sqlda = (XSQLDA*)malloc(XSQLDA_LENGTH(cols));
    printf("sqlda_alloc: malloc(%d) -> %p\n", XSQLDA_LENGTH(cols), sqlda);
    /* note: the malloc return value is not checked, so the (nil) result at the
       end of the log below gets dereferenced here, which would explain the
       SIGSEGV that follows */
    sqlda->version = SQLDA_VERSION1;
    sqlda->sqln = cols;
    sqlda->sqld = 0;
    return sqlda;
}

For reference, the struct and macro:

typedef struct
{
    ISC_SHORT   version;            /* version of this XSQLDA */
    ISC_SCHAR   sqldaid[8];         /* XSQLDA name field */
    ISC_LONG    sqldabc;            /* length in bytes of SQLDA */
    ISC_SHORT   sqln;               /* number of fields allocated */
    ISC_SHORT   sqld;               /* actual number of fields */
    XSQLVAR sqlvar[1];          /* first field address */
} XSQLDA;

#define XSQLDA_LENGTH(n)    (sizeof (XSQLDA) + (n - 1) * sizeof (XSQLVAR))

It took a few minutes after the initial data load, during which time VSZ crept up to 3500504, and then boom:

sqlda_alloc: malloc(8024) -> 0x7f295c004ac0
sqlda_alloc: malloc(8024) -> 0x7f295c006a20
sqlda_alloc: malloc(8024) -> 0x7f295c008980
sqlda_alloc: malloc(8024) -> 0x7f295c00a8e0
sqlda_alloc: malloc(8024) -> 0x11553f0
sqlda_alloc: malloc(8024) -> 0x1157350
sqlda_alloc: malloc(8024) -> 0x7f2934000b60
sqlda_alloc: malloc(8024) -> 0x7f2934002ac0
sqlda_alloc: malloc(36184) -> 0x7f29340050a0
sqlda_alloc: malloc(8024) -> 0x7f29100008c0
sqlda_alloc: malloc(8024) -> 0x7f2910002820
sqlda_alloc: malloc(8024) -> 0x7f2910004b40
sqlda_alloc: malloc(8024) -> 0x7f2910006aa0
sqlda_alloc: malloc(36344) -> 0x7f29100090b0
sqlda_alloc: malloc(8024) -> 0x7f2910011eb0
sqlda_alloc: malloc(8024) -> 0x7f2910013e10
sqlda_alloc: malloc(36344) -> 0x7f2910013e10
sqlda_alloc: malloc(8024) -> 0x7f291001cc10
sqlda_alloc: malloc(8024) -> 0x7f291001eb70
sqlda_alloc: malloc(36344) -> 0x7f291001eb70
sqlda_alloc: malloc(8024) -> 0x7f2910027970
sqlda_alloc: malloc(8024) -> 0x7f29100298d0
sqlda_alloc: malloc(36344) -> 0x7f29100298d0
sqlda_alloc: malloc(8024) -> 0x7f29100326d0
sqlda_alloc: malloc(8024) -> 0x7f2910034630
sqlda_alloc: malloc(36344) -> 0x7f2910034630
sqlda_alloc: malloc(8024) -> 0x7f291003db50
sqlda_alloc: malloc(8024) -> 0x7f291003fab0
sqlda_alloc: malloc(36344) -> 0x7f291003fab0
sqlda_alloc: malloc(8024) -> 0x7f2910048fd0
sqlda_alloc: malloc(8024) -> 0x7f291004af30
sqlda_alloc: malloc(36344) -> 0x7f291004af30
sqlda_alloc: malloc(8024) -> 0x7f2910054450
sqlda_alloc: malloc(8024) -> 0x7f29100563b0
sqlda_alloc: malloc(36344) -> 0x7f29100563b0
sqlda_alloc: malloc(8024) -> 0x7f291005f8d0
sqlda_alloc: malloc(8024) -> 0x7f2910061830
sqlda_alloc: malloc(36344) -> 0x7f2910061830
sqlda_alloc: malloc(8024) -> 0x7f291006ad50
sqlda_alloc: malloc(8024) -> 0x7f291006ccb0
sqlda_alloc: malloc(36344) -> 0x7f291006ccb0
sqlda_alloc: malloc(8024) -> 0x7f29100761d0
sqlda_alloc: malloc(8024) -> 0x7f2910078130
sqlda_alloc: malloc(36344) -> 0x7f2910078130
sqlda_alloc: malloc(8024) -> 0x7f2910081650
sqlda_alloc: malloc(8024) -> 0x7f29100835b0
sqlda_alloc: malloc(36344) -> 0x7f29100835b0
sqlda_alloc: malloc(8024) -> 0x7f291008cad0
sqlda_alloc: malloc(8024) -> 0x7f291008ea30
sqlda_alloc: malloc(8024) -> 0x7f2910090990
sqlda_alloc: malloc(8024) -> 0x7f29100928f0
sqlda_alloc: malloc(36344) -> 0x7f29100928f0
sqlda_alloc: malloc(8024) -> 0x7f291009be10
sqlda_alloc: malloc(8024) -> 0x7f291009dd70
sqlda_alloc: malloc(36344) -> 0x7f291009dd70
sqlda_alloc: malloc(8024) -> 0x7f29100a7290
sqlda_alloc: malloc(8024) -> 0x7f29100a91f0
sqlda_alloc: malloc(36344) -> 0x7f29100a91f0
sqlda_alloc: malloc(8024) -> 0x7f29100b2710
sqlda_alloc: malloc(8024) -> 0x7f29100b4670
sqlda_alloc: malloc(36344) -> 0x7f29100b4670
sqlda_alloc: malloc(8024) -> 0x7f29100bdb90
sqlda_alloc: malloc(8024) -> 0x7f29100bfaf0
sqlda_alloc: malloc(36344) -> 0x7f29100bfaf0
sqlda_alloc: malloc(8024) -> 0x7f29100c9010
sqlda_alloc: malloc(8024) -> 0x7f29100caf70
sqlda_alloc: malloc(36344) -> 0x7f29100caf70
sqlda_alloc: malloc(8024) -> 0x7f29100d4490
sqlda_alloc: malloc(8024) -> 0x7f29100d63f0
sqlda_alloc: malloc(36344) -> 0x7f29100d63f0
sqlda_alloc: malloc(8024) -> 0x7f29100df910
sqlda_alloc: malloc(8024) -> 0x7f29100e1870
sqlda_alloc: malloc(36344) -> 0x7f29100e1870
sqlda_alloc: malloc(8024) -> 0x7f29100ead90
sqlda_alloc: malloc(8024) -> 0x7f29100eccf0
sqlda_alloc: malloc(36344) -> 0x7f29100eccf0
sqlda_alloc: malloc(8024) -> 0x7f29100f6210
sqlda_alloc: malloc(8024) -> 0x7f29100f8170
sqlda_alloc: malloc(36344) -> 0x7f29100f8170
sqlda_alloc: malloc(8024) -> 0x7f2910101690
sqlda_alloc: malloc(8024) -> 0x7f29101035f0
sqlda_alloc: malloc(36344) -> 0x7f29101035f0
sqlda_alloc: malloc(8024) -> 0x7f291010cb10
sqlda_alloc: malloc(8024) -> 0x7f291010ea70
sqlda_alloc: malloc(36344) -> 0x7f291010ea70
sqlda_alloc: malloc(8024) -> 0x7f2910117f90
sqlda_alloc: malloc(8024) -> 0x7f2910119ef0
sqlda_alloc: malloc(36344) -> 0x7f2910119ef0
sqlda_alloc: malloc(8024) -> 0x7f29140008c0
sqlda_alloc: malloc(8024) -> 0x7f2914002820
sqlda_alloc: malloc(8024) -> 0x7f2914004b40
sqlda_alloc: malloc(8024) -> 0x7f2914006aa0
sqlda_alloc: malloc(36344) -> 0x7f29140090b0
sqlda_alloc: malloc(8024) -> 0x7f2914011eb0
sqlda_alloc: malloc(8024) -> 0x7f2914013e10
sqlda_alloc: malloc(36344) -> 0x7f2914013e10
sqlda_alloc: malloc(8024) -> 0x7f291401cc10
sqlda_alloc: malloc(8024) -> 0x7f291401eb70
sqlda_alloc: malloc(36344) -> 0x7f291401eb70
sqlda_alloc: malloc(8024) -> 0x7f2914027970
sqlda_alloc: malloc(8024) -> 0x7f29140298d0
sqlda_alloc: malloc(36344) -> 0x7f29140298d0
sqlda_alloc: malloc(8024) -> 0x7f29140326d0
sqlda_alloc: malloc(8024) -> 0x7f2914034630
sqlda_alloc: malloc(36344) -> 0x7f2914034630
sqlda_alloc: malloc(8024) -> 0x7f291403db50
sqlda_alloc: malloc(8024) -> 0x7f291403fab0
sqlda_alloc: malloc(36344) -> 0x7f291403fab0
sqlda_alloc: malloc(8024) -> 0x7f2914048fd0
sqlda_alloc: malloc(8024) -> 0x7f291404af30
sqlda_alloc: malloc(36344) -> 0x7f291404af30
sqlda_alloc: malloc(8024) -> 0x7f293400de00
sqlda_alloc: malloc(8024) -> 0x7f293400fd60
sqlda_alloc: malloc(8024) -> 0x7f2934011cc0
sqlda_alloc: malloc(8024) -> 0x7f2934013c20
sqlda_alloc: malloc(36344) -> 0x7f2934013c20
sqlda_alloc: malloc(8024) -> 0x7f2914054450
sqlda_alloc: malloc(8024) -> 0x7f29140563b0
sqlda_alloc: malloc(8024) -> 0x7f2914058310
sqlda_alloc: malloc(8024) -> 0x7f291405a270
sqlda_alloc: malloc(36344) -> 0x7f291405c880
sqlda_alloc: malloc(8024) -> 0x7f2910123410
sqlda_alloc: malloc(8024) -> 0x7f2910125370
sqlda_alloc: malloc(8024) -> 0x7f2914065680
sqlda_alloc: malloc(8024) -> 0x7f29140675e0
sqlda_alloc: malloc(8024) -> 0x7f293401ca20
sqlda_alloc: malloc(8024) -> 0x7f293401e980
sqlda_alloc: malloc(8024) -> 0x7f29340208e0
sqlda_alloc: malloc(8024) -> 0x7f2934022840
sqlda_alloc: malloc(36344) -> 0x7f2934022840
sqlda_alloc: malloc(8024) -> 0x7f29101272d0
sqlda_alloc: malloc(8024) -> 0x7f2910129230
sqlda_alloc: malloc(8024) -> 0x7f2914069540
sqlda_alloc: malloc(8024) -> 0x7f291406b4a0
sqlda_alloc: malloc(8024) -> 0x7f293402bd60
sqlda_alloc: malloc(8024) -> 0x7f293402dcc0
sqlda_alloc: malloc(8024) -> 0x7f291406d400
sqlda_alloc: malloc(8024) -> 0x7f291406f360
sqlda_alloc: malloc(8024) -> 0x7f293402fc20
sqlda_alloc: malloc(8024) -> 0x7f2934031b80
sqlda_alloc: malloc(8024) -> 0x7f291012b190
sqlda_alloc: malloc(8024) -> 0x7f291012d0f0
sqlda_alloc: malloc(8024) -> 0x7f29140712c0
sqlda_alloc: malloc(8024) -> 0x7f2914073220
sqlda_alloc: malloc(8024) -> 0x7f2934033ae0
sqlda_alloc: malloc(8024) -> 0x7f2934035a40
sqlda_alloc: malloc(8024) -> 0x7f2914075180
sqlda_alloc: malloc(8024) -> 0x7f29140770e0
sqlda_alloc: malloc(8024) -> 0x7f29340379a0
sqlda_alloc: malloc(8024) -> 0x7f2934039900
sqlda_alloc: malloc(8024) -> 0x7f2914079040
sqlda_alloc: malloc(8024) -> 0x7f291407afa0
sqlda_alloc: malloc(8024) -> 0x7f2958001060
sqlda_alloc: malloc(8024) -> 0x7f2958002fc0
sqlda_alloc: malloc(8024) -> 0x7f2938000a20
sqlda_alloc: malloc(8024) -> 0x7f2938002980
sqlda_alloc: malloc(8024) -> 0x7f29580052c0
sqlda_alloc: malloc(8024) -> 0x7f2958007220
sqlda_alloc: malloc(8024) -> 0x7f2938004c80
sqlda_alloc: malloc(8024) -> 0x7f2938006be0
sqlda_alloc: malloc(8024) -> 0x7f29580094e0
sqlda_alloc: malloc(8024) -> 0x7f295800b440
sqlda_alloc: malloc(8024) -> 0x7f2938008ea0
sqlda_alloc: malloc(8024) -> 0x7f293800ae00
sqlda_alloc: malloc(8024) -> 0x7f29140008c0
sqlda_alloc: malloc(8024) -> 0x7f2914002820
sqlda_alloc: malloc(8024) -> 0x7f295800d700
sqlda_alloc: malloc(8024) -> 0x7f295800f660
sqlda_alloc: malloc(8024) -> 0x7f2958011920
sqlda_alloc: malloc(8024) -> 0x7f2958013880
sqlda_alloc: malloc(8024) -> 0x7f2914004b40
sqlda_alloc: malloc(8024) -> 0x7f2914006aa0
sqlda_alloc: malloc(8024) -> 0x7f2958015800
sqlda_alloc: malloc(8024) -> 0x7f2958017760
sqlda_alloc: malloc(8024) -> 0x7f2914008d60
sqlda_alloc: malloc(8024) -> 0x7f291400acc0
sqlda_alloc: malloc(8024) -> 0x7f2958019a60
sqlda_alloc: malloc(8024) -> 0x7f295801b9c0
sqlda_alloc: malloc(8024) -> 0x7f293800d0c0
sqlda_alloc: malloc(8024) -> 0x7f293800f020
sqlda_alloc: malloc(8024) -> 0x7f29380112e0
sqlda_alloc: malloc(8024) -> 0x7f2938013240
sqlda_alloc: malloc(36344) -> 0x7f2938015850
sqlda_alloc: malloc(8024) -> 0x7f291400cf80
sqlda_alloc: malloc(8024) -> 0x7f291400eee0
sqlda_alloc: malloc(8024) -> 0x7f295801dc60
sqlda_alloc: malloc(8024) -> 0x7f295801fbc0
sqlda_alloc: malloc(8024) -> 0x7f29140111a0
sqlda_alloc: malloc(8024) -> 0x7f2914013100
sqlda_alloc: malloc(8024) -> 0x7f2958021e80
sqlda_alloc: malloc(8024) -> 0x7f2958023de0
sqlda_alloc: malloc(8024) -> 0x7f29580260a0
sqlda_alloc: malloc(8024) -> 0x7f2958028000
sqlda_alloc: malloc(36344) -> 0x7f295802a610
sqlda_alloc: malloc(8024) -> 0x7f2958033410
sqlda_alloc: malloc(8024) -> 0x7f2958035370
sqlda_alloc: malloc(8024) -> 0x7f293801e650
sqlda_alloc: malloc(8024) -> 0x7f29380205b0
sqlda_alloc: malloc(8024) -> 0x7f29580372d0
sqlda_alloc: malloc(8024) -> 0x7f2958039230
sqlda_alloc: malloc(8024) -> 0x7f295803b190
sqlda_alloc: malloc(8024) -> 0x7f295803d0f0
sqlda_alloc: malloc(36344) -> 0x7f295803d0f0
sqlda_alloc: malloc(8024) -> 0x7f29140153c0
sqlda_alloc: malloc(8024) -> 0x7f2914017320
sqlda_alloc: malloc(8024) -> 0x7f2958045ef0
sqlda_alloc: malloc(8024) -> 0x7f2958047e50
sqlda_alloc: malloc(8024) -> 0x7f29140195e0
sqlda_alloc: malloc(8024) -> 0x7f291401b540
sqlda_alloc: malloc(8024) -> 0x7f2958049db0
sqlda_alloc: malloc(8024) -> 0x7f295804bd10
sqlda_alloc: malloc(8024) -> 0x7f2938022510
sqlda_alloc: malloc(8024) -> 0x7f2938024470
sqlda_alloc: malloc(8024) -> 0x7f29380263d0
sqlda_alloc: malloc(8024) -> 0x7f2938028330
sqlda_alloc: malloc(36344) -> 0x7f2938028330
sqlda_alloc: malloc(8024) -> 0x7f2934000b60
sqlda_alloc: malloc(8024) -> 0x7f2934004fd0
sqlda_alloc: malloc(8024) -> 0x7f29080008c0
sqlda_alloc: malloc(8024) -> 0x7f2908002820
sqlda_alloc: malloc(8024) -> 0x7f2934006f30
sqlda_alloc: malloc(8024) -> 0x7f2934008e90
sqlda_alloc: malloc(8024) -> 0x7f2908004b40
sqlda_alloc: malloc(8024) -> 0x7f2908006aa0
sqlda_alloc: malloc(8024) -> 0x7f293400adf0
sqlda_alloc: malloc(8024) -> 0x7f293400cd50
sqlda_alloc: malloc(8024) -> 0x7f2958001060
sqlda_alloc: malloc(8024) -> 0x7f2958002fc0
sqlda_alloc: malloc(8024) -> 0x7f2908008d60
sqlda_alloc: malloc(8024) -> 0x7f290800acc0
sqlda_alloc: malloc(8024) -> 0x7f290800cf80
sqlda_alloc: malloc(8024) -> 0x7f290800eee0
sqlda_alloc: malloc(36344) -> 0x7f29080114f0
sqlda_alloc: malloc(8024) -> 0x7f293400ecb0
sqlda_alloc: malloc(8024) -> 0x7f2934010c10
sqlda_alloc: malloc(8024) -> 0x7f29580052c0
sqlda_alloc: malloc(8024) -> 0x7f2958007220
sqlda_alloc: malloc(8024) -> 0x7f2934012b70
sqlda_alloc: malloc(8024) -> 0x7f2934014ad0
sqlda_alloc: malloc(8024) -> 0x7f29580094e0
sqlda_alloc: malloc(8024) -> 0x7f295800b440
sqlda_alloc: malloc(8024) -> 0x7f290801a2f0
sqlda_alloc: malloc(8024) -> 0x7f290801c250
sqlda_alloc: malloc(8024) -> 0x7f290801e1b0
sqlda_alloc: malloc(8024) -> 0x7f2908020110
sqlda_alloc: malloc(36344) -> 0x7f2908020110
sqlda_alloc: malloc(8024) -> 0x7f2934016a30
sqlda_alloc: malloc(8024) -> 0x7f2934018990
sqlda_alloc: malloc(8024) -> 0x7f295800d700
sqlda_alloc: malloc(8024) -> 0x7f295800f660
sqlda_alloc: malloc(8024) -> 0x7f2958011920
sqlda_alloc: malloc(8024) -> 0x7f2958013880
sqlda_alloc: malloc(36344) -> 0x7f2958015e90
sqlda_alloc: malloc(8024) -> 0x7f293401a8f0
sqlda_alloc: malloc(8024) -> 0x7f293401c850
sqlda_alloc: malloc(8024) -> 0x7f293401e7b0
sqlda_alloc: malloc(8024) -> 0x7f2934020710
sqlda_alloc: malloc(8024) -> 0x7f2908028f10
sqlda_alloc: malloc(8024) -> 0x7f290802ae70
sqlda_alloc: malloc(8024) -> 0x7f290802cdd0
sqlda_alloc: malloc(8024) -> 0x7f290802ed30
sqlda_alloc: malloc(36344) -> 0x7f290802ed30
sqlda_alloc: malloc(8024) -> 0x7f2934022670
sqlda_alloc: malloc(8024) -> 0x7f29340245d0
sqlda_alloc: malloc(8024) -> 0x7f2908038250
sqlda_alloc: malloc(8024) -> 0x7f290803a1b0
sqlda_alloc: malloc(8024) -> 0x7f295801ec90
sqlda_alloc: malloc(8024) -> 0x7f2958020bf0
sqlda_alloc: malloc(8024) -> 0x7f2934026530
sqlda_alloc: malloc(8024) -> 0x7f2934028490
sqlda_alloc: malloc(8024) -> 0x7f2958022b50
sqlda_alloc: malloc(8024) -> 0x7f2958024ab0
sqlda_alloc: malloc(8024) -> 0x7f290803c110
sqlda_alloc: malloc(8024) -> 0x7f290803e070
sqlda_alloc: malloc(8024) -> 0x7f2958026a10
sqlda_alloc: malloc(8024) -> 0x7f2958028970
sqlda_alloc: malloc(8024) -> 0x7f295802a8d0
sqlda_alloc: malloc(8024) -> 0x7f295802c830
sqlda_alloc: malloc(36344) -> 0x7f295802c830
sqlda_alloc: malloc(8024) -> 0x7f293402a3f0
sqlda_alloc: malloc(8024) -> 0x7f293402c350
sqlda_alloc: malloc(8024) -> 0x7f293402e2b0
sqlda_alloc: malloc(8024) -> 0x7f2934030210
sqlda_alloc: malloc(36344) -> 0x7f2934032820
sqlda_alloc: malloc(8024) -> 0x7f290c0008c0
sqlda_alloc: malloc(8024) -> 0x7f290c002820
sqlda_alloc: malloc(8024) -> 0x7f2958001060
sqlda_alloc: malloc(8024) -> 0x7f2958002fc0
sqlda_alloc: malloc(8024) -> 0x7f290c004b40
sqlda_alloc: malloc(8024) -> 0x7f290c006aa0
sqlda_alloc: malloc(8024) -> 0x7f29580052c0
sqlda_alloc: malloc(8024) -> 0x7f2958007220
sqlda_alloc: malloc(8024) -> 0x7f290c008d60
sqlda_alloc: malloc(8024) -> 0x7f290c00acc0
sqlda_alloc: malloc(8024) -> 0x7f29580094e0
sqlda_alloc: malloc(8024) -> 0x7f295800b440
sqlda_alloc: malloc(8024) -> 0x7f295800d700
sqlda_alloc: malloc(8024) -> 0x7f295800f660
sqlda_alloc: malloc(36344) -> 0x7f2958011c70
sqlda_alloc: malloc(8024) -> 0x7f2934000b60
sqlda_alloc: malloc(8024) -> 0x7f2934004fd0
sqlda_alloc: malloc(8024) -> 0x7f290c00cf80
sqlda_alloc: malloc(8024) -> 0x7f290c00eee0
sqlda_alloc: malloc(8024) -> 0x7f2934006f30
sqlda_alloc: malloc(8024) -> 0x7f2934008e90
sqlda_alloc: malloc(8024) -> 0x7f295801aa70
sqlda_alloc: malloc(8024) -> 0x7f295801c9d0
sqlda_alloc: malloc(8024) -> 0x7f293400adf0
sqlda_alloc: malloc(8024) -> 0x7f293400cd50
sqlda_alloc: malloc(8024) -> 0x7f290c0111a0
sqlda_alloc: malloc(8024) -> 0x7f290c013100
sqlda_alloc: malloc(8024) -> 0x7f290c0153c0
sqlda_alloc: malloc(8024) -> 0x7f290c017320
sqlda_alloc: malloc(8024) -> 0x7f295801e930
sqlda_alloc: malloc(8024) -> 0x7f2958020890
sqlda_alloc: malloc(8024) -> 0x7f290c0192a0
sqlda_alloc: malloc(8024) -> 0x7f290c01b200
sqlda_alloc: malloc(8024) -> 0x7f29580227f0
sqlda_alloc: malloc(8024) -> 0x7f2958024750
sqlda_alloc: malloc(8024) -> 0x7f293400ecb0
sqlda_alloc: malloc(8024) -> 0x7f2934010c10
sqlda_alloc: malloc(8024) -> 0x7f29580266b0
sqlda_alloc: malloc(8024) -> 0x7f2958028610
sqlda_alloc: malloc(8024) -> 0x7f295802a570
sqlda_alloc: malloc(8024) -> 0x7f295802c4d0
sqlda_alloc: malloc(36344) -> 0x7f295802c4d0
sqlda_alloc: malloc(8024) -> 0x7f290c01d500
sqlda_alloc: malloc(8024) -> 0x7f290c01f460
sqlda_alloc: malloc(8024) -> 0x7f2934012b70
sqlda_alloc: malloc(8024) -> 0x7f2934014ad0
sqlda_alloc: malloc(8024) -> 0x7f29580352d0
sqlda_alloc: malloc(8024) -> 0x7f2958037230
sqlda_alloc: malloc(8024) -> 0x7f2934016a30
sqlda_alloc: malloc(8024) -> 0x7f2934018990
sqlda_alloc: malloc(8024) -> 0x7f2958039190
sqlda_alloc: malloc(8024) -> 0x7f295803b0f0
sqlda_alloc: malloc(8024) -> 0x7f295803d050
sqlda_alloc: malloc(8024) -> 0x7f295803efb0
sqlda_alloc: malloc(36344) -> 0x7f29580415c0
sqlda_alloc: malloc(8024) -> 0x7f293401a8f0
sqlda_alloc: malloc(8024) -> 0x7f293401c850
sqlda_alloc: malloc(8024) -> 0x7f290c021700
sqlda_alloc: malloc(8024) -> 0x7f290c023660
sqlda_alloc: malloc(8024) -> 0x7f29000008c0
sqlda_alloc: malloc(8024) -> 0x7f2900002820
sqlda_alloc: malloc(8024) -> 0x7f2934000b60
sqlda_alloc: malloc(8024) -> 0x7f2934004fd0
sqlda_alloc: malloc(8024) -> 0x7f2900004b40
sqlda_alloc: malloc(8024) -> 0x7f2900006aa0
sqlda_alloc: malloc(8024) -> 0x7f2934006f30
sqlda_alloc: malloc(8024) -> 0x7f2934008e90
sqlda_alloc: malloc(8024) -> 0x7f2900008d60
sqlda_alloc: malloc(8024) -> 0x7f290000acc0
sqlda_alloc: malloc(8024) -> (nil)
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x10 pc=0x4037e2]

runtime stack:
runtime.throw(0x818ac0, 0x2a)
    /usr/local/go/src/runtime/panic.go:527 +0x90
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a

goroutine 41 [syscall, locked to thread]:
runtime.cgocall(0x402ed0, 0xc820062ca8, 0x0)
    /usr/local/go/src/runtime/cgocall.go:120 +0x11b fp=0xc820062c70 sp=0xc820062c40
github.com/rowland/go-fb._Cfunc_sqlda_alloc(0x32, 0x0)
    ??:0 +0x36 fp=0xc820062ca8 sp=0xc820062c70
github.com/rowland/go-fb.newCursor(0xc85c715590, 0xc88e77c420, 0x0, 0x0)
    /home/brent/src/github.com/rowland/go-fb/cursor.go:50 +0x14e fp=0xc820062d18 sp=0xc820062ca8
github.com/rowland/go-fb.(*Connection).Execute(0xc85c715590, 0x828540, 0x57, 0xc820062ec0, 0x1, 0x1, 0xc88e77c370, 0x0, 0x0)
    /home/brent/src/github.com/rowland/go-fb/connection.go:150 +0x39 fp=0xc820062d68 sp=0xc820062d18
dev.go/projects/talent-search/talent.(*Search).JournalMonitor(0xc82000a160, 0xcc47b65)
    /home/brent/src/dev.go/projects/talent-search/talent/search.go:325 +0x441 fp=0xc820062fa0 sp=0xc820062d68
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc820062fa8 sp=0xc820062fa0
created by main.main
    /home/brent/src/dev.go/projects/talent-search/talent-search.go:179 +0x9e2

goroutine 1 [select]:
main.ProcessIncomingMessages(0xc8204b00c0)
    /home/brent/src/dev.go/projects/talent-search/talent-search.go:218 +0xab6
main.main()
    /home/brent/src/dev.go/projects/talent-search/talent-search.go:198 +0x956

goroutine 17 [syscall, 17 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

goroutine 5 [syscall, 17 minutes]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 42 [IO wait, 8 minutes]:
net.runtime_pollWait(0x7f2969ce6b18, 0x72, 0xc820010190)
    /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc88ecc6060, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc88ecc6060, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).accept(0xc88ecc6000, 0x0, 0x7f296845e0b0, 0xc88ecd2000)
    /usr/local/go/src/net/fd_unix.go:408 +0x27c
net.(*TCPListener).AcceptTCP(0xc88ecd0000, 0xc88ecb7de0, 0x0, 0x0)
    /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d
net/http.tcpKeepAliveListener.Accept(0xc88ecd0000, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:2135 +0x41
net/http.(*Server).Serve(0xc820018120, 0x7f292c5890b0, 0xc88ecd0000, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1887 +0xb3
net/http.(*Server).ListenAndServe(0xc820018120, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1877 +0x136
net/http.ListenAndServe(0x7a9e90, 0x5, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1967 +0x8f
created by main.main
    /home/brent/src/dev.go/projects/talent-search/talent-search.go:186 +0x6e4

goroutine 43 [IO wait]:
net.runtime_pollWait(0x7f2969ce6a58, 0x72, 0xc820010190)
    /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc8204a80d0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc8204a80d0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).readFrom(0xc8204a8070, 0xc88ecaeea8, 0x1000, 0x1000, 0x0, 0x0, 0x0, 0x7f2969ce1050, 0xc820010190)
    /usr/local/go/src/net/fd_unix.go:259 +0x2a5
net.(*UDPConn).ReadFromUDP(0xc8204ae000, 0xc88ecaeea8, 0x1000, 0x1000, 0xc88ea70580, 0x18, 0x0, 0x0)
    /usr/local/go/src/net/udpsock_posix.go:61 +0x117
dev.go/utilities/brokerclient.(*BrokerClient).listener(0xc8204ac050, 0xc8204b00c0, 0xc8204b0060)
    /home/brent/src/dev.go/utilities/brokerclient/brokerclient.go:88 +0x135
created by dev.go/utilities/brokerclient.(*BrokerClient).StartListener
    /home/brent/src/dev.go/utilities/brokerclient/brokerclient.go:74 +0x9a

goroutine 44 [select, 8 minutes, locked to thread]:
runtime.gopark(0x843f78, 0xc8204fc728, 0x7adab8, 0x6, 0x432118, 0x2)
    /usr/local/go/src/runtime/proc.go:185 +0x163
runtime.selectgoImpl(0xc8204fc728, 0x0, 0x18)
    /usr/local/go/src/runtime/select.go:392 +0xa64
runtime.selectgo(0xc8204fc728)
    /usr/local/go/src/runtime/select.go:212 +0x12
runtime.ensureSigM.func1()
    /usr/local/go/src/runtime/signal1_unix.go:227 +0x353
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

Tracking the pc just to be precise:

$ addr2line -fipe talent-search 0x4037e2
sqlda_alloc at /home/brent/src/github.com/rowland/go-fb/fb.c:339

That line would be:

sqlda->sqln = cols;

which is no surprise, given the null pointer returned by malloc. I'm only surprised it didn't crash on the line before, but perhaps the assignments were performed out of order.

I am also running Ubuntu 14.04 LTS.

@aclements
Copy link
Member

@rowland, could you try my suggestion from #12233 (comment)? It may be that Go is running up the process' VMA count and C malloc is getting unlucky and trying to make that one last mmap that breaks the camel's back. If my suggestion doesn't help, then I think your problem is on the C side.
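
If it helps to see how close the process is getting to the limit, here is a rough Go sketch (not part of the runtime; it just counts lines in /proc/<pid>/maps, one per VMA, and compares the count against vm.max_map_count):

// vmacount: report a process's VMA count against the kernel's limit.
// Rough sketch only; one line in /proc/<pid>/maps corresponds to one VMA.
package main

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func main() {
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}

	limit, err := ioutil.ReadFile("/proc/sys/vm/max_map_count")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	f, err := os.Open("/proc/" + pid + "/maps")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	vmas := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		vmas++
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	fmt.Printf("pid %s: %d VMAs (vm.max_map_count = %s)\n",
		pid, vmas, strings.TrimSpace(string(limit)))
}

If the count is climbing toward the limit (65530 by default), that's consistent with the VMA theory, and raising vm.max_map_count should buy you headroom in the meantime.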

@rowland
Copy link

rowland commented Sep 25, 2015

@aclements It's certainly helping. The process has lasted 40 times as long with more than twice as many mallocs. It would be preferable not to need the sysctl workaround, but this gives me time until I can replace the CGO bits with native Go equivalents. Thanks.

@aclements
Copy link
Member

I'm considering simply disabling huge pages for the Go heap for the 1.5.2 release. This wouldn't be a huge loss, considering Linux only gained transparent huge page support quite recently. For 1.6 we may be able to find a better way to take advantage of huge pages without fragmenting our VMAs so badly.

@bmhatfield
Copy link
Author

@aclements @davecheney @rsc Great news!

After running over the weekend with the system's vm.max_map_count doubled, I observed 0 crashes in the past 72 hours, vs a minimum of 3 crashes per day since upgrading to 1.5.

[graph "crashiness": crash frequency before and after raising vm.max_map_count]

@aclements
Copy link
Member

@bmhatfield, great! I'll put together a CL to disable huge pages for Go 1.5.2, which should fix this, and we'll think about how to intelligently re-enable them for 1.6.

@davecheney
Copy link
Contributor

@bmhatfield, excellent detective work, I'm delighted you got to the bottom of this.


@aclements
Copy link
Member

I've posted a CL for this: https://go-review.googlesource.com/#/c/15191

@bmhatfield or anyone else having this problem, could you give this CL a try? It cherry-picks cleanly to 1.5.1 or you can try it on master. I've confirmed that it reduces the number of VMAs, but I don't have any test cases myself that drive the VMA count anywhere near the limit.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/15191 mentions this issue.

@bmhatfield
Copy link
Author

I'm happy to give that CL a try, but building the right version of Go from source seems a bit arduous. Any recommendations for the simplest way to get a binary?

@aclements
Copy link
Member

@bmhatfield, I'm not sure how one would get a binary of this, since we don't even ship binaries of releases for Linux. But here's the exact sequence of commands that should build the right source:

git clone https://go.googlesource.com/go
cd go
git checkout go1.5.1
git fetch https://go.googlesource.com/go refs/changes/91/15191/3
git cherry-pick FETCH_HEAD
cd src
GOROOT_BOOTSTRAP=/usr ./make.bash

This assumes you have some Go version installed as /usr/bin/go, such as from your package manager. Tweak the GOROOT_BOOTSTRAP in the last line if it's somewhere else.

This should give you a go binary in go/bin that you can build with.

@aclements
Copy link
Member

since we don't even ship binaries of releases for Linux

Not intended to be a factual statement.

Still, it's probably easiest to just run those commands. If that doesn't work, let me know and I'll see what I can do about getting you a binary.

@bmhatfield
Copy link
Author

Thanks for the outline of how to rebuild Go; that was very helpful. There were some minor details I had to tweak (mainly, GOROOT_BOOTSTRAP needed to point at the root of the Go installation itself, not just the directory containing bin/go), but I was successful.

I was as rigorous as possible about recompiling my program with it, but there's a small chance I rebuilt against the wrong go compiler. It's probably fine, though. It'd be sweet if go version printed the git SHA it was built from when built inside a repository checkout.

That said:

Prior to updating the binary:

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
131057

After updating the binary and letting it get up to full steam (about 27 minutes):

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
384

EDIT: A couple minutes later:
$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
276

After restarting a not-updated binary and letting it get up to full steam (about 18 minutes):

$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
45242

EDIT: A couple minutes later:
$ sudo cat /proc/`pidof $PROCESS`/smaps | grep ^Size -c
60518

So it looks like this change has had the intended effect.

@bmhatfield
Copy link
Author

Yay! I'm very glad to see this resolved. It's certainly been interesting trying to provide enough information as we tracked this down. Thanks for the support and help!

@ivulfson
Copy link

Compiled Go 1.5.2 from source, as prescribed by @aclements.
Verified that my code works in 1.5.2 and uses about 10% less RAM.
Also verified that it's still crashing in 1.5 and 1.5.1. :)
Thank you for fixing this!

@cespare
Copy link
Contributor

cespare commented Oct 29, 2015

In case more anecdata is useful in gauging how frequently this bug occurs in the wild, we have two different server processes that occasionally crash with this issue. They are both some kind of database and have large and active heaps. We're using the raise-the-max_map_count workaround for now.

@rsc changed the title from "runtime: Go 1.5 upgrade: Intermittent 'fatal error: runtime: cannot allocate memory'" to "runtime: huge page fragmentation on Linux leads to out of memory" on Nov 13, 2015
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16980 mentions this issue.

aclements added a commit that referenced this issue Nov 17, 2015
…age granularity

This fixes an issue where the runtime panics with "out of memory" or
"cannot allocate memory" even though there's ample memory by reducing
the number of memory mappings created by the memory allocator.

Commit 7e1b61c worked around issue #8832 where Linux's transparent
huge page support could dramatically increase the RSS of a Go process
by setting the MADV_NOHUGEPAGE flag on any regions of pages released
to the OS with MADV_DONTNEED. This had the side effect of also
increasing the number of VMAs (memory mappings) in a Go address space
because a separate VMA is needed for every region of the virtual
address space with different flags. Unfortunately, by default, Linux
limits the number of VMAs in an address space to 65530, and a large
heap can quickly reach this limit when the runtime starts scavenging
memory.

This commit dramatically reduces the number of VMAs. It does this
primarily by only adjusting the huge page flag at huge page
granularity. With this change, on amd64, even a pessimal heap that
alternates between MADV_NOHUGEPAGE and MADV_HUGEPAGE must reach 128GB
to reach the VMA limit. Because of this rounding to huge page
granularity, this change is also careful to leave large used and
unused regions huge page-enabled.

This change reduces the maximum number of VMAs during the runtime
benchmarks with GODEBUG=scavenge=1 from 692 to 49.

Fixes #12233.

Change-Id: Ic397776d042f20d53783a1cacf122e2e2db00584
Reviewed-on: https://go-review.googlesource.com/15191
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-on: https://go-review.googlesource.com/16980
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
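
In rough terms, the fix comes down to alignment: the boundaries passed to madvise now always fall on huge page boundaries, so alternating flags can produce at most one VMA per 2MB huge page, and 65530 VMAs × 2MB ≈ 128GB, which is where the commit's figure comes from. A minimal sketch of that alignment idea (assuming amd64's 2MB huge page size; this is not the actual runtime code, which also handles the partially covered huge pages at the edges):

// Sketch of keeping a madvise range at huge page granularity.
// Not the actual runtime code; hugePageSize is assumed to be the
// 2MB transparent huge page size used on amd64 Linux.
package hugepages

const hugePageSize = 2 << 20 // assumed 2MB

// roundToHugePages shrinks [start, end) to the largest sub-range whose
// boundaries fall on huge page boundaries. Only that inner range would
// have its huge page flag changed, so a flag change can never carve a
// single huge page into several small VMAs.
func roundToHugePages(start, end uintptr) (uintptr, uintptr, bool) {
	alignedStart := (start + hugePageSize - 1) &^ (hugePageSize - 1) // round start up
	alignedEnd := end &^ (hugePageSize - 1)                          // round end down
	if alignedStart >= alignedEnd {
		return 0, 0, false // range doesn't cover a full huge page; skip the flag change
	}
	return alignedStart, alignedEnd, true
}
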
@golang locked and limited conversation to collaborators Nov 16, 2016