
Consul panics when committing changes to the state store #2724

Closed
cityofships opened this issue Feb 9, 2017 · 10 comments · Fixed by #2739
Labels
type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace
@cityofships

consul version

Client: v0.7.4
Server: v0.7.4

Operating system and Environment details

Consul running in Docker in HA mode on 3 nodes, with RHEL 7.3 and Docker 1.13.0

Kernel Version: 3.10.0-514.6.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.5 GiB
Storage Driver: devicemapper

consul config:

{
  "skip_leave_on_interrupt": true,
  "performance": {
    "raft_multiplier": 1
  }
}

Description of the Issue (and unexpected/desired result)

Consul panics during a "stress test" and fails to process all of the KV delete requests.

Reproduction steps

  1. Create 600,000 KV entries.
  2. Attempt to delete them by running 6 commands simultaneously across the three Consul nodes:

node1:

for i in `seq 100000`; do consul kv delete test1/test$i; done
for i in `seq 100000`; do consul kv delete test2/test$i; done

node2:

for i in `seq 100000`; do consul kv delete test3/test$i; done
for i in `seq 100000`; do consul kv delete test4/test$i; done

node3:

for i in `seq 100000`; do consul kv delete test5/test$i; done
for i in `seq 100000`; do consul kv delete test6/test$i; done
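Step 1 above doesn't show a command; a minimal sketch for populating the store (assuming a local agent with default settings and the same test1…test6 key layout that the delete loops target) could be:

```shell
# Create 600,000 KV entries: 100,000 keys under each of the six
# prefixes the delete loops use. N is adjustable for smaller runs.
N=${N:-100000}
for prefix in test1 test2 test3 test4 test5 test6; do
  for i in $(seq "$N"); do
    consul kv put "$prefix/test$i" "value-$i" >/dev/null
  done &
done
wait
```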

Log Fragments

https://gist.github.com/cityofships/9ffbaf9badac8b0198f352ce76cc7239

@slackpad slackpad added type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace labels Feb 9, 2017
@slackpad slackpad added this to the 0.8.0 milestone Feb 9, 2017
@slackpad (Contributor) commented Feb 9, 2017

Hi @cityofships, thanks for the report - we will figure this out (it was new code for 0.7.3).

/cc @dadgar

@mpuncel (Contributor) commented Feb 9, 2017

We saw a similar-looking panic on 0.7.3 using the consul_0.7.3_linux_amd64 binary:

https://gist.github.com/mpuncel/854425910b66cf697dfb7797b586f969

@slackpad (Contributor) commented Feb 9, 2017

Thanks @mpuncel, that helps show that this probably isn't related to slow_notify(), which I initially suspected. Linking hashicorp/go-immutable-radix#11, which may be related.

@slackpad (Contributor) commented Feb 9, 2017

@cityofships and @mpuncel are either of you able to reproduce this pretty readily? I've got some tests going now to try to trigger it but haven't had any luck.

@slackpad slackpad changed the title consul panics during kv delete Consul panics when committing changes to the state store Feb 10, 2017
@cityofships (Author)

Yes, I'm willing to re-test.

@slackpad (Contributor)

@cityofships thanks. Using your steps above (which I super appreciate) I was able to reproduce this and I've got a fix in the works. Should have something for you to try soon.

slackpad added a commit that referenced this issue Feb 14, 2017
This fixes #2724 by properly tracking leaf updates during very large
delete transactions.
@slackpad (Contributor)

@cityofships the fix has been merged to master - if you have a chance to fuzz this again please let us know how it goes. Thank you!

@jtchoi commented Feb 15, 2017

Are there any hints on how to recover an existing cluster from this once this happens? Is there a way to purge the queue of updates?

@slackpad (Contributor)

@jtchoi the cleanest recovery would be to shut down all the servers and upgrade the Consul binary to 0.7.5, which will be able to apply the update without a crash. You could also shut down the servers and remove the raft.db file from their data-dirs, though this will roll them back to the last Raft snapshot, losing all changes committed since that snapshot.
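The second option can be scripted roughly as below. This is a sketch, not an official procedure: the data-dir path and the use of systemctl are assumptions about the deployment, and the snapshot-rollback caveat applies.

```shell
# Roll one Consul server back to its last Raft snapshot by removing
# the Raft log. Repeat on each server while the cluster is down.
# /var/lib/consul is a placeholder; use the server's actual -data-dir.
systemctl stop consul              # or however the agent is supervised
rm /var/lib/consul/raft/raft.db    # discards changes since the snapshot
systemctl start consul
```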

@jtchoi commented Feb 15, 2017

@slackpad Thanks! We were considering both approaches and will try the former.
