Really out of date followers can get stuck not getting a snapshot installed. #212

superfell · 2017-05-25T02:49:05Z

We ran into an issue today where a follower that had been down for a couple of days came back up, and it and the leader got stuck in a loop: the leader kept trying to send a snapshot to the follower but never made any progress.

The leader debug log shows entries like these repeatedly. [Ignore the localhost references; this is from a reproducible test case I got working locally.]

[ERR] raft: Failed to get log at index 31542: log not found
[ERR] raft: Failed to install snapshot 2080-51566-1495670504268: write tcp 127.0.0.1:50004->127.0.0.1:5003: write: broken pipe
[ERR] raft: Failed to send snapshot to localhost:5003: write tcp 127.0.0.1:50004->127.0.0.1:5003: write: broken pipe

The follower logs show [in between repeated elections]:

[ERR] raft-net: Failed to decode incoming command: codec.decoder: Only encoded map or array can be decoded into a struct. (valueType: 2)

I was able to reproduce it locally by running a 3 node cluster, creating enough log entries that snapshots start compacting the log, then forcing a snapshot on the leader, stopping a follower, and then creating enough log entries that the log index the follower is at gets compacted out of the leader's log. Bringing the follower back up at that point reproduces the loop described above.

Digging around, what I see is that the leader sends an InstallSnapshot request to the follower, but the follower rejects it because it has a newer term [from all the elections it tried to start while it wasn't getting any AppendEntries calls]. However, installSnapshot in raft.go doesn't always consume the snapshot data, so net_transport handleCommand ends up trying to decode a command out of the leftover snapshot bytes. This fails, causing the connection to be closed. With a big enough snapshot, the leader is still writing the snapshot on its end when that happens, so its writes fail. This causes the leader to eventually retry exactly the same InstallSnapshot request, and so the cycle continues.
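To make the desync concrete, here is a minimal Go sketch of the failure mode. It is not the raft code: the fixed-size `header` frame, the `rpc*` type tags, and the `handleBuggy` and `demoDesync` names are all made up for illustration (the real transport uses a msgpack codec). It only shows how rejecting a request before reading its payload leaves the payload on the stream, where the next decode misreads it as a command.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Hypothetical RPC type tags, for this sketch only.
const (
	rpcAppendEntries   byte = 1
	rpcInstallSnapshot byte = 2
)

// header is a made-up fixed-size frame. For rpcInstallSnapshot, Size bytes of
// raw snapshot data follow the header on the same stream.
type header struct {
	Type byte
	Term uint64
	Size uint64
}

var errBadCommand = errors.New("failed to decode incoming command")

// readHeader decodes the next frame header and rejects unknown type tags.
func readHeader(r io.Reader) (header, error) {
	var h header
	if err := binary.Read(r, binary.BigEndian, &h); err != nil {
		return h, err
	}
	if h.Type != rpcAppendEntries && h.Type != rpcInstallSnapshot {
		return h, errBadCommand
	}
	return h, nil
}

// handleBuggy mimics the failure mode: a stale-term request is rejected
// before its payload is read, so the payload stays on the stream.
func handleBuggy(r io.Reader, currentTerm uint64) (accepted bool, err error) {
	h, err := readHeader(r)
	if err != nil {
		return false, err
	}
	if h.Term < currentTerm {
		return false, nil // early return: snapshot bytes are NOT consumed
	}
	if h.Type == rpcInstallSnapshot {
		_, err = io.CopyN(io.Discard, r, int64(h.Size))
	}
	return err == nil, err
}

// demoDesync sends one stale InstallSnapshot followed by a valid command,
// then handles the stream twice as a follower at a newer term.
func demoDesync() (firstAccepted bool, secondErr error) {
	var conn bytes.Buffer
	binary.Write(&conn, binary.BigEndian, header{Type: rpcInstallSnapshot, Term: 3, Size: 64})
	conn.Write(bytes.Repeat([]byte{0xFF}, 64)) // stand-in snapshot payload
	binary.Write(&conn, binary.BigEndian, header{Type: rpcAppendEntries, Term: 3})

	firstAccepted, _ = handleBuggy(&conn, 7) // follower term 7 > leader term 3
	_, secondErr = handleBuggy(&conn, 7)     // parses payload bytes as a header
	return firstAccepted, secondErr
}

func main() {
	accepted, err := demoDesync()
	fmt.Println("first accepted:", accepted)
	fmt.Println("second decode error:", err)
}
```

In this sketch the second call fails even though a perfectly valid command is queued behind the snapshot bytes, which is the same shape as the follower's "Failed to decode incoming command" error.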

If I modify installSnapshot to consume the snapshot from the stream even when it doesn't use it, the leader's replication is able to move on to an AppendEntries call [which fails because the follower didn't use the snapshot]. That failure still gets the replication metadata updated, so the subsequent call to InstallSnapshot succeeds and everything settles back down.

Am working on a patch for this, mostly trying to come up with an automated test for it.

slackpad · 2017-06-01T23:07:34Z

Keeping this open until we take a look at the two library v2 branches, which probably need a similar fix.

superfell · 2017-06-07T00:57:23Z

@slackpad PRs for the 2 v2 branches submitted.

slackpad · 2017-06-09T23:10:53Z

Merged those two - thank you!

This picks up the fix for hashicorp/raft#212, which can cause out-of-date followers to get stuck in a loop trying to sync because they don't discard old snapshot data. There's some incidental reordering of the vendor.json since the last update to that file was merged by hand.

Update raft to get hashicorp/raft#212 fix

slackpad added the bug label May 25, 2017

superfell pushed a commit to superfell/raft that referenced this issue May 25, 2017

Add Integ test that reproduces installSnapshot problem detailed in is…

fbba609

…sue hashicorp#212

superfell mentioned this issue May 25, 2017

Ensure InstallSnapshot always consumes the snapshot from the stream. #213

Merged

superfell added a commit to superfell/raft that referenced this issue Jun 7, 2017

Ensure installSnapshot consume stream. fixes issue hashicorp#212

ec99ca3

superfell added a commit to superfell/raft that referenced this issue Jun 7, 2017

Ensure installSnapshot always consumes stream. Fixes issue hashicorp#212

f5aa3cb

slackpad added a commit that referenced this issue Jun 9, 2017

Merge pull request #215 from superfell/library-v2-stage-one

e5e581e

Ensure installSnapshot consume stream. fixes issue #212

slackpad added a commit that referenced this issue Jun 9, 2017

Merge pull request #216 from superfell/library-v2-stage-two

85a7a8b

Ensure installSnapshot always consumes stream. Fixes issue #212

slackpad closed this as completed Jun 9, 2017

slackpad mentioned this issue Jun 28, 2017

Bumps Raft library. hashicorp/consul#3201

Merged

schmichael added a commit to hashicorp/nomad that referenced this issue Jul 7, 2017

Update raft to get hashicorp/raft#212 fix

9dfc9db

schmichael added a commit to hashicorp/nomad that referenced this issue Jul 7, 2017

Merge pull request #2794 from hashicorp/f-update-raft

7536f7e

Update raft to get hashicorp/raft#212 fix
