Really out of date followers can get stuck not getting a snapshot installed. #212

Closed
superfell opened this issue May 25, 2017 · 3 comments
@superfell

We ran into an issue today where a follower that had been down for a couple of days came back up, and it and the leader got stuck in a loop: the leader kept trying to get a snapshot to the follower without getting anywhere.

The leader debug log repeatedly shows entries like the following. [Ignore the localhost references; this is from a reproducible test case I set up locally.]

[ERR] raft: Failed to get log at index 31542: log not found
[ERR] raft: Failed to install snapshot 2080-51566-1495670504268: write tcp 127.0.0.1:50004->127.0.0.1:5003: write: broken pipe
[ERR] raft: Failed to send snapshot to localhost:5003: write tcp 127.0.0.1:50004->127.0.0.1:5003: write: broken pipe

The follower logs show the following [in between repeated elections]:
[ERR] raft-net: Failed to decode incoming command: codec.decoder: Only encoded map or array can be decoded into a struct. (valueType: 2)

I was able to reproduce it locally by running a 3-node cluster, creating enough log entries that snapshots start compacting the log, forcing a snapshot on the leader, stopping a follower, and then creating enough further log entries that the follower's last log index is compacted out of the leader's log. Bringing the follower back up at that point reproduces the state above.

Digging around, what I see is that the leader sends an InstallSnapshot request to the follower, but the follower rejects it because it has a newer term [from all the elections it tried to start while it wasn't receiving any AppendEntries calls]. However, installSnapshot in raft.go doesn't always consume the snapshot data, so net_transport's handleCommand ends up trying to decode a command out of the leftover snapshot data. This fails, causing the connection to be closed. With a big enough snapshot, the leader is still writing the snapshot on its end, so its writes fail. The leader then eventually retries exactly the same InstallSnapshot request, and the cycle continues.
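To make the failure mode concrete, here is a minimal sketch of the desync. It is not the raft code: the wire format (one type byte, one size byte, then the payload) and the names frame and handleWithoutDraining are assumptions made up for illustration. The handler rejects the snapshot without reading its payload, so the next "command" it tries to decode is really snapshot data.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// Toy wire format (an assumption for illustration, not the real raft
// encoding): one type byte, one size byte, then size bytes of payload.
const (
	typeAppendEntries   byte = 0x01
	typeInstallSnapshot byte = 0x02
)

// frame builds one toy frame.
func frame(typ byte, payload []byte) []byte {
	return append([]byte{typ, byte(len(payload))}, payload...)
}

// handleWithoutDraining mimics a follower that rejects a snapshot and
// returns without reading the payload. It reports how many commands were
// handled before the stream stopped making sense.
func handleWithoutDraining(stream []byte) (int, error) {
	r := bufio.NewReader(bytes.NewReader(stream))
	handled := 0
	for {
		typ, err := r.ReadByte()
		if err != nil {
			return handled, nil // clean end of stream
		}
		size, err := r.ReadByte()
		if err != nil {
			return handled, err
		}
		switch typ {
		case typeInstallSnapshot:
			// Rejected: the payload is deliberately left unread.
			handled++
		case typeAppendEntries:
			// Accepted: the payload is consumed as part of handling.
			if _, err := r.Discard(int(size)); err != nil {
				return handled, err
			}
			handled++
		default:
			return handled, fmt.Errorf("failed to decode incoming command: unknown type 0x%02x", typ)
		}
	}
}

func main() {
	stream := append(
		frame(typeInstallSnapshot, []byte("snapshot-bytes")),
		frame(typeAppendEntries, []byte("entry"))...,
	)
	n, err := handleWithoutDraining(stream)
	fmt.Println(n, err)
}
```

The handler gets through the rejected snapshot header and then fails on the first payload byte, which is the same shape as the follower's "Failed to decode incoming command" error above.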

If I modify installSnapshot to consume the snapshot from the stream even when it doesn't use it, the leader's replication moves on to an AppendEntries call [which fails because the follower didn't use the snapshot]. That failure updates the replication metadata so that the subsequent InstallSnapshot call succeeds, and everything settles back down.
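The idea of always consuming the stream can be sketched as follows. This is not the actual patch: discardSnapshot and nextAfterReject are hypothetical names, and the connection is simulated with an in-memory reader. The only library calls relied on are io.CopyN, io.Discard (Go 1.16+), and io.ReadFull.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// discardSnapshot reads and throws away exactly size bytes of snapshot data
// so that the next read on r starts at the following command.
// The name and signature are hypothetical.
func discardSnapshot(r io.Reader, size int64) error {
	_, err := io.CopyN(io.Discard, r, size)
	return err
}

// nextAfterReject simulates rejecting a snapshot and returns whatever the
// connection yields next, with or without draining the snapshot first.
func nextAfterReject(snapshot, next string, drain bool) (string, error) {
	conn := strings.NewReader(snapshot + next)
	if drain {
		if err := discardSnapshot(conn, int64(len(snapshot))); err != nil {
			return "", err
		}
	}
	buf := make([]byte, len(next))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	drained, _ := nextAfterReject("SNAPSHOTDATA", "AppendEntries", true)
	fmt.Println("with drain:   ", drained)
	stale, _ := nextAfterReject("SNAPSHOTDATA", "AppendEntries", false)
	fmt.Println("without drain:", stale)
}
```

With draining, the next bytes read are the start of the following request; without it, the reader is still positioned inside the snapshot data.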

I'm working on a patch for this, mostly trying to come up with an automated test for it.

@slackpad

slackpad commented Jun 1, 2017

Keeping this open until we take a look at the two library v2 branches, which probably need a similar fix.

@superfell

@slackpad PRs for the 2 v2 branches submitted.

slackpad added a commit that referenced this issue Jun 9, 2017
Ensure installSnapshot consume stream. fixes issue #212
slackpad added a commit that referenced this issue Jun 9, 2017
Ensure installSnapshot always consumes stream. Fixes issue #212
@slackpad

slackpad commented Jun 9, 2017

Merged those two - thank you!

@slackpad slackpad closed this as completed Jun 9, 2017
slackpad added a commit to hashicorp/consul that referenced this issue Jun 28, 2017
This picks up the fix for hashicorp/raft#212,
which can cause out-of-date followers to get stuck in a loop trying to
sync because they don't discard old snapshot data.

There's some incidental reordering of the vendor.json since the last
update to that file was merged by hand.
schmichael added a commit to hashicorp/nomad that referenced this issue Jul 7, 2017