
OOM during startup after upgrade to 0.6.1 #1579

Closed
nbrownus opened this issue Jan 8, 2016 · 11 comments
Assignees
Labels
type/crash: The issue description contains a golang panic and stack trace
type/docs: Documentation needs to be created/updated/clarified

Comments

@nbrownus
Contributor

nbrownus commented Jan 8, 2016

Previously running 0.5.2 and tried upgrading to 0.6.1 in a test environment. 0.5.2 takes a while to start up, but it does eventually get there.

Log output:

Jan  8 23:34:03 ==> WARNING: BootstrapExpect Mode is specified as 1; this is the same as Bootstrap mode.
Jan  8 23:34:03 ==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
Jan  8 23:34:03 ==> Starting Consul agent...
Jan  8 23:36:46 fatal error: runtime: out of memory
Jan  8 23:36:46 
Jan  8 23:36:46 runtime stack:
Jan  8 23:36:46 runtime.throw(0xdd3190, 0x16)
Jan  8 23:36:46     /usr/local/go/src/runtime/panic.go:527 +0x90
Jan  8 23:36:46 runtime.sysMap(0xc9cf200000, 0x100000, 0x7f6ba4ca6000, 0x11be858)
Jan  8 23:36:46     /usr/local/go/src/runtime/mem_linux.go:203 +0x9b
Jan  8 23:36:46 runtime.mHeap_SysAlloc(0x119e860, 0x100000, 0x0)
Jan  8 23:36:46     /usr/local/go/src/runtime/malloc.go:426 +0x160
Jan  8 23:36:46 runtime.mHeap_Grow(0x119e860, 0x8, 0x0)
Jan  8 23:36:46     /usr/local/go/src/runtime/mheap.go:628 +0x63
Jan  8 23:36:46 runtime.mHeap_AllocSpanLocked(0x119e860, 0x1, 0x7f6ba4ca70a0)
Jan  8 23:36:46     /usr/local/go/src/runtime/mheap.go:532 +0x5f1
Jan  8 23:36:46 runtime.mHeap_Alloc_m(0x119e860, 0x1, 0x7, 0x7f6ba4ca70a0)
Jan  8 23:36:46     /usr/local/go/src/runtime/mheap.go:425 +0x1ac
Jan  8 23:36:46 runtime.mHeap_Alloc.func1()
Jan  8 23:36:46     /usr/local/go/src/runtime/mheap.go:484 +0x41
Jan  8 23:36:46 runtime.systemstack(0xc82018fe50)
Jan  8 23:36:46     /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
Jan  8 23:36:46 runtime.mHeap_Alloc(0x119e860, 0x1, 0x10000000007, 0x411014)
Jan  8 23:36:46     /usr/local/go/src/runtime/mheap.go:485 +0x63
Jan  8 23:36:46 runtime.mCentral_Grow(0x11a6438, 0x0)
Jan  8 23:36:46     /usr/local/go/src/runtime/mcentral.go:190 +0x93
Jan  8 23:36:46 runtime.mCentral_CacheSpan(0x11a6438, 0xc8852aa170)
Jan  8 23:36:46     /usr/local/go/src/runtime/mcentral.go:86 +0x4d4
Jan  8 23:36:46 runtime.mCache_Refill(0x7f6bad9fa000, 0x7, 0xc8852aa140)
Jan  8 23:36:46     /usr/local/go/src/runtime/mcache.go:118 +0xcf
Jan  8 23:36:46 runtime.mallocgc.func2()
Jan  8 23:36:46     /usr/local/go/src/runtime/malloc.go:614 +0x2b
Jan  8 23:36:46 runtime.systemstack(0xc82001a000)
Jan  8 23:36:46     /usr/local/go/src/runtime/asm_amd64.s:262 +0x79
Jan  8 23:36:46 runtime.mstart()
Jan  8 23:36:46     /usr/local/go/src/runtime/proc1.go:668
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 1 [running]:
Jan  8 23:36:46 runtime.systemstack_switch()
Jan  8 23:36:46     /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc8205ec818 sp=0xc8205ec810
Jan  8 23:36:46 runtime.mallocgc(0x60, 0xc8e180, 0x0, 0x0)
Jan  8 23:36:46     /usr/local/go/src/runtime/malloc.go:615 +0x842 fp=0xc8205ec8e8 sp=0xc8205ec818
Jan  8 23:36:46 runtime.newobject(0xc8e180, 0xc8201a47c0)
Jan  8 23:36:46     /usr/local/go/src/runtime/malloc.go:763 +0x42 fp=0xc8205ec910 sp=0xc8205ec8e8
Jan  8 23:36:46 github.com/hashicorp/consul/consul.(*consulFSM).Restore(0xc82019bf40, 0x7f6bab7ca000, 0xc8201a4020, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/fsm.go:343 +0x4c9 fp=0xc8205ecb48 sp=0xc8205ec910
Jan  8 23:36:46 github.com/hashicorp/raft.(*Raft).restoreSnapshot(0xc820132380, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/raft/raft.go:1800 +0x57d fp=0xc8205eccc8 sp=0xc8205ecb48
Jan  8 23:36:46 github.com/hashicorp/raft.NewRaft(0xc820108e80, 0x7f6bad80bd08, 0xc82019bf40, 0x7f6bad80bbc0, 0xc8201d81c0, 0x7f6bad80bb30, 0xc8201ae900, 0x7f6bad80bc10, 0xc8201aea80, 0x7f6bad80bcb0, ...)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/raft/raft.go:242 +0xbed fp=0xc8205ece88 sp=0xc8205eccc8
Jan  8 23:36:46 github.com/hashicorp/consul/consul.(*Server).setupRaft(0xc8200f2a80, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/server.go:416 +0x861 fp=0xc8205ed0f8 sp=0xc8205ece88
Jan  8 23:36:46 github.com/hashicorp/consul/consul.NewServer(0xc82010d520, 0xc82010d520, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/server.go:249 +0xca2 fp=0xc8205ed308 sp=0xc8205ed0f8
Jan  8 23:36:46 github.com/hashicorp/consul/command/agent.(*Agent).setupServer(0xc8201442c0, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/command/agent/agent.go:361 +0x180 fp=0xc8205ed3a8 sp=0xc8205ed308
Jan  8 23:36:46 github.com/hashicorp/consul/command/agent.Create(0xc820158100, 0x7f6bad80a838, 0xc820168960, 0x0, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/command/agent/agent.go:176 +0xc7b fp=0xc8205ed630 sp=0xc8205ed3a8
Jan  8 23:36:46 github.com/hashicorp/consul/command/agent.(*Command).setupAgent(0xc82013e3c0, 0xc820158100, 0x7f6bad80a838, 0xc820168960, 0xc820101a10, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/command/agent/command.go:327 +0xd7 fp=0xc8205ed8f8 sp=0xc8205ed630
Jan  8 23:36:46 github.com/hashicorp/consul/command/agent.(*Command).Run(0xc82013e3c0, 0xc82000a160, 0x2, 0x2, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/command/agent/command.go:640 +0xd27 fp=0xc8205edce0 sp=0xc8205ed8f8
Jan  8 23:36:46 github.com/mitchellh/cli.(*CLI).Run(0xc8201400b0, 0xc8201400b0, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/mitchellh/cli/cli.go:112 +0x698 fp=0xc8205eddc0 sp=0xc8205edce0
Jan  8 23:36:46 main.realMain(0xc820000180)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/main.go:40 +0x214 fp=0xc8205edf30 sp=0xc8205eddc0
Jan  8 23:36:46 main.main()
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/main.go:12 +0x18 fp=0xc8205edf40 sp=0xc8205edf30
Jan  8 23:36:46 runtime.main()
Jan  8 23:36:46     /usr/local/go/src/runtime/proc.go:111 +0x2b0 fp=0xc8205edf90 sp=0xc8205edf40
Jan  8 23:36:46 runtime.goexit()
Jan  8 23:36:46     /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1 fp=0xc8205edf98 sp=0xc8205edf90
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 5 [syscall, 2 minutes]:
Jan  8 23:36:46 os/signal.loop()
Jan  8 23:36:46     /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
Jan  8 23:36:46 created by os/signal.init.1
Jan  8 23:36:46     /usr/local/go/src/os/signal/signal_unix.go:28 +0x37
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 18 [select, 2 minutes, locked to thread]:
Jan  8 23:36:46 runtime.gopark(0xeb8d40, 0xc82001e728, 0xd35910, 0x6, 0x431118, 0x2)
Jan  8 23:36:46     /usr/local/go/src/runtime/proc.go:185 +0x163
Jan  8 23:36:46 runtime.selectgoImpl(0xc82001e728, 0x0, 0x18)
Jan  8 23:36:46     /usr/local/go/src/runtime/select.go:392 +0xa64
Jan  8 23:36:46 runtime.selectgo(0xc82001e728)
Jan  8 23:36:46     /usr/local/go/src/runtime/select.go:212 +0x12
Jan  8 23:36:46 runtime.ensureSigM.func1()
Jan  8 23:36:46     /usr/local/go/src/runtime/signal1_unix.go:227 +0x353
Jan  8 23:36:46 runtime.goexit()
Jan  8 23:36:46     /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 19 [select, 2 minutes]:
Jan  8 23:36:46 github.com/armon/go-metrics.(*InmemSignal).run(0xc820107b80)
Jan  8 23:36:46     /opt/gopath/src/github.com/armon/go-metrics/inmem_signal.go:63 +0xf2
Jan  8 23:36:46 created by github.com/armon/go-metrics.NewInmemSignal
Jan  8 23:36:46     /opt/gopath/src/github.com/armon/go-metrics/inmem_signal.go:37 +0x22b
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 20 [sleep]:
Jan  8 23:36:46 time.Sleep(0x3b9aca00)
Jan  8 23:36:46     /usr/local/go/src/runtime/time.go:59 +0xf9
Jan  8 23:36:46 github.com/armon/go-metrics.(*Metrics).collectStats(0xc82011ab40)
Jan  8 23:36:46     /opt/gopath/src/github.com/armon/go-metrics/metrics.go:67 +0x25
Jan  8 23:36:46 created by github.com/armon/go-metrics.New
Jan  8 23:36:46     /opt/gopath/src/github.com/armon/go-metrics/start.go:61 +0xdb
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 21 [select]:
Jan  8 23:36:46 github.com/hashicorp/consul/consul.(*ConnPool).reap(0xc82011adc0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/pool.go:412 +0x492
Jan  8 23:36:46 created by github.com/hashicorp/consul/consul.NewPool
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/pool.go:175 +0x1a1
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 22 [select]:
Jan  8 23:36:46 github.com/hashicorp/consul/consul.(*Coordinate).batchUpdate(0xc820168f00)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/coordinate_endpoint.go:41 +0x228
Jan  8 23:36:46 created by github.com/hashicorp/consul/consul.NewCoordinate
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/coordinate_endpoint.go:33 +0xc0
Jan  8 23:36:46 
Jan  8 23:36:46 goroutine 23 [select, 2 minutes]:
Jan  8 23:36:46 github.com/hashicorp/consul/consul.(*RaftLayer).Accept(0xc82019a2c0, 0x0, 0x0, 0x0, 0x0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/consul/consul/raft_rpc.go:57 +0x18f
Jan  8 23:36:46 github.com/hashicorp/raft.(*NetworkTransport).listen(0xc8201617a0)
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/raft/net_transport.go:362 +0x79
Jan  8 23:36:46 created by github.com/hashicorp/raft.NewNetworkTransportWithLogger
Jan  8 23:36:46     /opt/gopath/src/github.com/hashicorp/raft/net_transport.go:154 +0x279
@slackpad slackpad added the type/crash The issue description contains a golang panic and stack trace label Jan 9, 2016
@slackpad slackpad self-assigned this Jan 9, 2016
@slackpad
Contributor

slackpad commented Jan 9, 2016

Hi @nbrownus - do you know how close your 0.5.2 setup was to maxing out your machine's memory? If you were close before, I could see the fairly large GC behavior differences between Go releases before 1.5 and 1.5+ causing a marginal situation to tip over and fail. If you weren't even close and it's way different in 0.6.1, that would be surprising.

@nbrownus
Contributor Author

nbrownus commented Jan 9, 2016

Looks like 0.5.2 is hovering around 4.5 GB used while running with Vault 0.4, which is currently deleting millions of records out of Consul. My production machines running 0.5.2 are using ~850 MB of memory.

In the image below:

15:26 - 15:52 - consul 0.6.1 attempting to start and hitting OOM
15:53 - 16:02 - consul 0.5.2 started and listening for connections
16:05 - 16:12 - Checking the viability of the data in consul 0.5.2 before starting vault 0.4
16:13 - vault 0.4 started and unsealed

[image: memory usage]

@nbrownus
Contributor Author

nbrownus commented Jan 9, 2016

Another item to note: in Consul 0.5.2, during startup, the amount of data in the data dir would bloat to enormous sizes. I did not see any disk write activity while 0.6.1 was trying to start up, but you can clearly see it while 0.5.2 is starting up.

Same timeline as the above comment

[image: disk write]

@slackpad
Contributor

slackpad commented Jan 9, 2016

The millions of KV entries are definitely a problem for Consul (#1278), though if you can get 0.6.1 to start without OOM-ing, it actually has some changes to help it delete those keys more efficiently.

Is there any chance you can give this a try with https://www.consul.io/docs/agent/options.html#enable_debug enabled and then capture some heap stats by pulling http://localhost:8500/debug/pprof/heap?debug=1? That will give us an idea of what's eating so much memory.
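For reference, grabbing that profile is a one-liner; this is a sketch that assumes the HTTP API is on the default port 8500 (the `|| true` guards keep it from failing if the agent hasn't opened the port yet, which turns out to be the case later in this thread):

```shell
# With enable_debug set in the agent config, the standard Go pprof
# endpoints are served on the HTTP API port (8500 by default).
curl -s --max-time 5 'http://localhost:8500/debug/pprof/heap?debug=1' \
  -o heap.txt || true

# The first lines rank in-use allocations by call site:
head -n 20 heap.txt 2>/dev/null || true
```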

@nbrownus
Contributor Author

nbrownus commented Jan 9, 2016

The only port that ever opens is 8300. It really looks to me like 0.5.2 used to cache a lot of data to disk on startup, and now 0.6.1 tries to load everything into memory.

@slackpad
Contributor

slackpad commented Jan 9, 2016

Ah, you probably need to set https://www.consul.io/docs/agent/options.html#ports to enable the HTTP API as well. Consul prior to 0.6.0 used LMDB for the state store, so it did have access to the disk. Now Consul's store is purely in-memory. Do you have a rough ETA on when Vault will be done deleting the keys? Once Consul takes a fresh snapshot after that, you should be in good shape for 0.6.1.
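As a sketch, that option lives in the agent's JSON config; the fragment below is based on the 0.6-era ports option linked above, and the exact key names should be checked against that docs page:

```json
{
  "ports": {
    "http": 8500
  }
}
```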

@nbrownus
Contributor Author

nbrownus commented Jan 9, 2016

Manually defining the port had no impact. In 0.5.2, no other ports were opened until I saw Starting Consul agent in the logs.

The pruning is going pretty slowly, at ~6 deletes per second. I'll check on it throughout the weekend. Hopefully I will be able to continue testing on Monday.

Are you saying that Consul must now be able to fit all data in memory? That's a rather large departure from previous versions, isn't it? Probably worth noting in the release docs.

@slackpad slackpad added the type/docs Documentation needs to be created/updated/clarified label Jan 9, 2016
@slackpad
Contributor

slackpad commented Jan 9, 2016

OK - then the web server isn't getting started until startup gets further along, so that's not going to give us heap data.

The previous versions with LMDB were set up with flags to treat the state store like an in-memory database (fsync after commits was turned off, among a few other settings), so for the vast majority of Consul users the state store was all resident in memory; but because it was backed by a huge memory-mapped file, it did have the disk underneath. The new state store is completely in RAM and isn't backed by a memory-mapped file, so you'll need to provision enough memory to hold everything to run Consul 0.6.0 and later.

Hadn't run into this case in the wild yet but I'll definitely add a note about huge datasets to the upgrade section. Sorry you got stuck by this.

@slackpad
Contributor

slackpad commented Jan 9, 2016

One other thing that might be worth a try would be to set an environment variable like GOGC=50 to garbage collect more aggressively, which might get you through the startup phase. Or enable swap on that box temporarily.
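Concretely, that looks like the sketch below; the agent invocation is illustrative (adjust flags to match your own service configuration), and GOGC=50 means a collection is triggered once the heap grows 50% past the live set instead of the default 100%:

```shell
# Lower GOGC trades CPU for a smaller peak heap during snapshot
# restore. Illustrative invocation -- adapt to your init/unit file:
#
#   GOGC=50 consul agent -config-dir=/etc/consul.d
#
# Sanity check that the variable reaches a child process:
GOGC=50 sh -c 'echo "GOGC is $GOGC"'   # prints: GOGC is 50
```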

@slackpad
Contributor

slackpad commented Jan 9, 2016

Added this note to the Consul 0.6-specific upgrade guide.

Also, in this release Consul switched from LMDB to a fully in-memory database for the state store. Because LMDB is a disk-based backing store, it was able to store more data than could fit in RAM in some cases (though this is not a recommended configuration for Consul). If you have an extremely large data set that won't fit into RAM, you may encounter issues upgrading to Consul 0.6.0 and later. Consul should be provisioned with physical memory approximately 2X the data set size to allow for bursty allocations and subsequent garbage collection.
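A rough pre-upgrade capacity check following that 2X guidance might look like this; the data-dir path is an assumption (substitute your agent's -data-dir), and the idea is just to compare the on-disk raft snapshot size against available RAM:

```shell
# Assumed data dir -- replace with the value of -data-dir for your agent.
DATA_DIR=/var/lib/consul

# Size of the latest raft snapshots (approximates the state to restore):
du -sh "$DATA_DIR/raft/snapshots" 2>/dev/null || true

# Available memory; per the note above, aim for ~2X the data set size:
free -h 2>/dev/null || true
```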

@slackpad
Contributor

Hi @nbrownus, closing this out as we've updated the docs and I think we understand the root cause. Please re-open if you need anything.
