
Consul panics when committing changes to the state store #2724

Closed
cityofships opened this issue Feb 9, 2017 · 10 comments · Fixed by #2739
Labels
type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace
@cityofships

consul version

Client: v0.7.4
Server: v0.7.4

Operating system and Environment details

Consul running in Docker in HA mode on 3 nodes, with RHEL 7.3 and Docker 1.13.0

Kernel Version: 3.10.0-514.6.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.5 GiB
Storage Driver: devicemapper

consul config:

{
  "skip_leave_on_interrupt": true,
  "performance": {
    "raft_multiplier": 1
  }
}

Description of the Issue (and unexpected/desired result)

Consul panics during a "stress test" and fails to process all of the KV delete requests.

Reproduction steps

  1. Create 600,000 KV entries.
  2. Attempt to delete them by running 6 commands simultaneously across the three Consul nodes:

node1:

for i in `seq 100000`; do consul kv delete test1/test$i; done
for i in `seq 100000`; do consul kv delete test2/test$i; done

node2:

for i in `seq 100000`; do consul kv delete test3/test$i; done
for i in `seq 100000`; do consul kv delete test4/test$i; done

node3:

for i in `seq 100000`; do consul kv delete test5/test$i; done
for i in `seq 100000`; do consul kv delete test6/test$i; done
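Step 1 above doesn't show a command; a minimal sketch for populating the store (assuming a local agent with default settings and the same test1…test6 key layout that the delete loops target) could be:

```shell
# Create 600,000 KV entries: 100,000 keys under each of the six
# prefixes the delete loops use. N is adjustable for smaller runs.
N=${N:-100000}
for prefix in test1 test2 test3 test4 test5 test6; do
  for i in $(seq "$N"); do
    consul kv put "$prefix/test$i" "value-$i" >/dev/null
  done &
done
wait
```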

Log Fragments

https://gist.github.com/cityofships/9ffbaf9badac8b0198f352ce76cc7239

@slackpad slackpad added type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace labels Feb 9, 2017
@slackpad slackpad added this to the 0.8.0 milestone Feb 9, 2017
@slackpad (Contributor) commented Feb 9, 2017

Hi @cityofships, thanks for the report - we will figure this out (it was new code for 0.7.3).

/cc @dadgar

@mpuncel (Contributor) commented Feb 9, 2017

We saw a similar-looking panic on 0.7.3 using the consul_0.7.3_linux_amd64 binary:

https://gist.github.com/mpuncel/854425910b66cf697dfb7797b586f969

@slackpad (Contributor) commented Feb 9, 2017

Thanks @mpuncel, that helps show that this probably isn't related to slow_notify(), which I initially suspected. Linking hashicorp/go-immutable-radix#11, which may be related.

@slackpad (Contributor) commented Feb 9, 2017

@cityofships and @mpuncel are either of you able to reproduce this pretty readily? I've got some tests going now to try to trigger it but haven't had any luck.

@slackpad slackpad changed the title consul panics during kv delete Consul panics when committing changes to the state store Feb 10, 2017
@cityofships (Author)

Yes, I'm willing to re-test.

@slackpad (Contributor)

@cityofships thanks. Using your steps above (which I super appreciate) I was able to reproduce this and I've got a fix in the works. Should have something for you to try soon.

slackpad added a commit that referenced this issue Feb 14, 2017
This fixes #2724 by properly tracking leaf updates during very large
delete transactions.
@slackpad (Contributor)

@cityofships the fix has been merged to master - if you have a chance to fuzz this again please let us know how it goes. Thank you!

@jtchoi commented Feb 15, 2017

Are there any hints on how to recover an existing cluster from this once this happens? Is there a way to purge the queue of updates?

@slackpad (Contributor)

@jtchoi the cleanest recovery would be to shut down all the servers and upgrade the Consul binary to 0.7.5, which will be able to apply the update without a crash. You could also shut down the servers and remove the raft.db file from their data-dirs, though this will roll them back to the last Raft snapshot, losing all changes committed since that snapshot.
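The second option can be scripted roughly as below. This is a sketch, not an official procedure: the data-dir path and the use of systemctl are assumptions about the deployment, and the snapshot-rollback caveat applies.

```shell
# Roll one Consul server back to its last Raft snapshot by removing
# the Raft log. Repeat on each server while the cluster is down.
# /var/lib/consul is a placeholder; use the server's actual -data-dir.
systemctl stop consul              # or however the agent is supervised
rm /var/lib/consul/raft/raft.db    # discards changes since the snapshot
systemctl start consul
```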

@jtchoi commented Feb 15, 2017

@slackpad Thanks! We were considering both approaches and will try the former.
