Node ID/datacenter snapshot fix #4872

kyhavlov · 2018-10-30T22:54:07Z

This adds the missing ID and datacenter fields to the snapshot persist logic.
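For readers without the diff at hand, the kind of omission described above can be sketched roughly as follows. This is a minimal illustration, not the actual Consul code: the `Node` and `SnapshotRecord` types and the `persistNodeBefore`, `persistNodeAfter`, and `restoreNode` helpers are hypothetical names invented for the example.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a catalog node record.
type Node struct {
	ID         string
	Node       string
	Address    string
	Datacenter string
}

// SnapshotRecord is a hypothetical stand-in for the per-node record
// that gets written into a snapshot.
type SnapshotRecord struct {
	ID         string
	Node       string
	Address    string
	Datacenter string
}

// persistNodeBefore mimics the omission: ID and Datacenter are never
// copied, so they are silently dropped from the snapshot.
func persistNodeBefore(n Node) SnapshotRecord {
	return SnapshotRecord{Node: n.Node, Address: n.Address}
}

// persistNodeAfter copies every field, including ID and Datacenter.
func persistNodeAfter(n Node) SnapshotRecord {
	return SnapshotRecord{
		ID:         n.ID,
		Node:       n.Node,
		Address:    n.Address,
		Datacenter: n.Datacenter,
	}
}

// restoreNode rebuilds a Node from a snapshot record.
func restoreNode(r SnapshotRecord) Node {
	return Node{ID: r.ID, Node: r.Node, Address: r.Address, Datacenter: r.Datacenter}
}

func main() {
	orig := Node{ID: "node-id-1", Node: "web-1", Address: "10.0.0.1", Datacenter: "dc1"}

	// Round-trip through each persist variant and compare with the original.
	fmt.Println("before fix round-trips:", restoreNode(persistNodeBefore(orig)) == orig)
	fmt.Println("after fix round-trips: ", restoreNode(persistNodeAfter(orig)) == orig)
}
```

A round-trip comparison like the one in `main` is also the sort of check a snapshot/restore test can make once the fixture nodes carry an ID and datacenter.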

pearkes

LGTM. May want to hold this from master for one more set of eyes given 1.4.0.

pearkes · 2018-10-30T23:56:34Z

But we should try to get this in for sure.

banks

Maybe I missed a conversation here but why is this the right thing to do? Was there a specific issue that this addresses?

It seems like those fields could have been elided from the snapshot on purpose so that you can restore a snapshot to a different DC without the state being really messed up (e.g. for DR). Storing Node IDs in state also means that if you terminate all your client VMs, restore a snapshot on servers and then restart clients, they will all fail to register because they will have the same name but a different ID. The same would apply to restoring a snapshot on a new datacenter which is being spun up as a replica for disaster recovery - currently that just works (presumably) because the restored server state doesn't have node IDs or datacenters, so the new clients with the same names in the new DC will just register and things will work.

Both of those seem like relatively dangerous/high impact changes to me to merge without any rationale etc.
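The registration concern raised in this comment can be sketched as below. This is a toy model under a stated assumption: that the catalog rejects a registration whose name matches an existing node carrying a different, non-empty ID, as the comment describes. The `Catalog` type, `Register` method, and `ErrNameConflict` are hypothetical names for illustration and are not Consul APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNameConflict is a hypothetical error for a name reused with a new ID.
var ErrNameConflict = errors.New("node name already registered with a different ID")

// Catalog is a toy catalog mapping node name to node ID.
// An empty ID models state restored from a snapshot that dropped IDs.
type Catalog struct {
	idByName map[string]string
}

// NewCatalog returns an empty toy catalog.
func NewCatalog() *Catalog {
	return &Catalog{idByName: make(map[string]string)}
}

// Register accepts the node unless the name is already held by a
// different, non-empty ID (the assumed conflict rule).
func (c *Catalog) Register(name, id string) error {
	if existing, ok := c.idByName[name]; ok && existing != "" && existing != id {
		return ErrNameConflict
	}
	c.idByName[name] = id
	return nil
}

func main() {
	// State restored from a snapshot without IDs: a rebuilt client
	// with the same name and a fresh ID registers without complaint.
	withoutIDs := NewCatalog()
	withoutIDs.idByName["web-1"] = ""
	fmt.Println("restored without IDs:", withoutIDs.Register("web-1", "new-id"))

	// State restored from a snapshot with IDs: the same client is rejected.
	withIDs := NewCatalog()
	withIDs.idByName["web-1"] = "old-id"
	fmt.Println("restored with IDs:   ", withIDs.Register("web-1", "new-id"))
}
```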

banks

I still see this potentially changing behaviour in the case where you restore a snapshot in a new cluster or DC, but in that case other Node info from the snapshot is probably wrong too (like IP address). It's also recoverable by just force leaving all the nodes and allowing them to rejoin.

Given that, I doubt it was intentional to not store this state, since we also use these snapshots to bootstrap new replicas in the same cluster, and this is corrupting their state, which is clearly a bug.

slackpad · 2018-10-31T17:41:01Z

I don't think these omissions were intentional :-) Both of these fields were added later and I think we just forgot to plumb them through. Given how anti-entropy works it would have just synced things back up, kind of masking this issue.

kyhavlov added 2 commits October 30, 2018 15:52

fsm: add missing ID/datacenter to persistNodes

6483356

fsm: update snapshot/restore test to include ID and datacenter

bd6d0e5

pearkes approved these changes Oct 30, 2018


pearkes added this to the 1.4.0 milestone Oct 30, 2018

banks requested changes Oct 31, 2018


banks approved these changes Oct 31, 2018


kyhavlov merged commit 8337e3d into master Oct 31, 2018

kyhavlov deleted the node-snapshot-fix branch October 31, 2018 22:51
