OOM during startup after upgrade to 0.6.1 #1579
Comments
Hi @nbrownus - do you know how close your 0.5.2 setup was to maxing out your machine's memory? If you were close before, I could see the fairly large GC behavior differences from Go <1.5 to Go >1.5 pushing a marginal situation over the edge and causing it to fail. If you weren't even close and it's way different in 0.6.1, that would be surprising.
Looks like 0.5.2 is hovering around 4.5 GB used while running with Vault 0.4, which is currently deleting millions of records out of Consul. My production machines running 0.5.2 are using ~850 MB of memory. In the image below: 15:26 - 15:52 - Consul 0.6.1 attempting to start and hitting OOM.
The millions of KV entries are definitely a problem for Consul (#1278), though if you can get 0.6.1 to start without OOM-ing, it actually has some changes to help it delete those keys more efficiently. Is there any chance you can give this a try with https://www.consul.io/docs/agent/options.html#enable_debug and then capture some heap stats by pulling http://localhost:8500/debug/pprof/heap?debug=1? That will give us an idea of what's eating so much memory.
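For anyone who wants to script the heap capture suggested above, here is a minimal sketch in Go. It assumes the agent was started with `enable_debug` and that its HTTP API is actually listening on `localhost:8500`; the helper names (`heapProfileURL`, `saveHeapProfile`) and the output file name `consul-heap.txt` are made up for illustration and are not part of Consul.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// heapProfileURL builds the heap profile URL quoted in this thread
// from the agent's HTTP address (hypothetical helper).
func heapProfileURL(agentAddr string) string {
	return strings.TrimRight(agentAddr, "/") + "/debug/pprof/heap?debug=1"
}

// saveHeapProfile downloads the text-format heap profile and copies it to w.
// It returns the number of bytes written.
func saveHeapProfile(agentAddr string, w io.Writer) (int64, error) {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(heapProfileURL(agentAddr))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.Copy(w, resp.Body)
}

func main() {
	// Output file name is an arbitrary choice for this sketch.
	out, err := os.Create("consul-heap.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create output file:", err)
		os.Exit(1)
	}
	n, err := saveHeapProfile("http://localhost:8500", out)
	out.Close()
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch heap profile:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to consul-heap.txt\n", n)
}
```

This only helps once the agent has come far enough through startup to serve HTTP; if the port never opens, the request simply fails with a connection error.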
Only port that ever opens up is 8300. It really looks to me like 0.5.2 used to cache a lot of data to disk on startup and now 0.6.1 tries to load everything into memory.
Ah, you probably need to set https://www.consul.io/docs/agent/options.html#ports to enable the HTTP API as well. Consul prior to 0.6.0 used LMDB for the state store, so that did have access to the disk. Now Consul's store is purely in-memory. Do you have a rough ETA on when Vault will be done deleting the keys? Once Consul does a fresh snapshot after that, you should be in good shape for 0.6.1.
Manually defining the port had no impact. In 0.5.2 no other ports were opened until I saw
The pruning is going pretty slowly, at ~6 deletes a second. I'll check on it throughout the weekend. Hopefully I will be able to continue testing on Monday. Are you saying that Consul must be able to fit all data in memory now? That's a rather large departure from previous versions, isn't it? Probably worth noting in the release docs.
Ok - then the web server isn't getting started until startup gets further along, so that's not going to give us heap data. The previous versions with LMDB were set up with flags to treat the state store like an in-memory database (fsync turned off after commits, plus a few other settings), so for the vast majority of Consul users the state store was all resident in memory, but because it was backed by a huge memory-mapped file it did have the disk underneath it. The new state store is completely in RAM and isn't backed by a memory-mapped file, so you'll need to provision enough memory to hold everything to run Consul 0.6.0. We hadn't run into this case in the wild yet, but I'll definitely add a note about huge datasets to the upgrade section. Sorry you got stuck by this.
One other thing that might be worth a try would be to set the environment variable something like
Added this note to the Consul 0.6-specific upgrade guide.
Hi @nbrownus, closing this out as we've updated the docs and I think we understand the root cause. Please re-open if you need anything.
I was previously running 0.5.2 and tried upgrading to 0.6.1 in a test environment. 0.5.2 takes a while to start up, but it does eventually get there.
Log output: