-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node ID/datacenter snapshot fix #4872
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. May want to hold this from master for one more set of eyes given 1.4.0.
But we should try to get this in for sure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed a conversation here but why is this the right thing to do? Was there a specific issue that this addresses?
It seems like those fields could have been elided from the snapshot on purpose so that you can restore a snapshot to a different DC without the state being really messed up (e.g. for DR). Storing Node IDs in state also means that if you terminate all your client VMs, restore a snapshot on servers and then restart clients they will all fail to register because they will have same name but a different ID. Same would apply to restoring a snapshot on a new datacenter which is being spun up as a replica for disaster recovery - currently that just works (presumably) because the restored server state doesn't have node IDs or datacenters so the new clients with same names in the new DC will just register and things will work.
Both of those seem like relatively dangerous/high impact changes to me to merge without and rationale etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still see this potentially changing behaviour in the case where you restore a snapshot in a new cluster or DC, but in that case other Node info from the snapshot is probably wrong too (like IP address). It's also recoverable by just force leaving all the nodes and allowing them to rejoin.
I doubt given that, that it was intentional to not store this state since we also use these to bootstrap new replicas in the same cluster and this is corrupting their state which is clearly a bug.
I don't think these omissions were intentional :-) Both of these fields were added later and I think we just forgot to plumb them through. Given how anti-entropy works it would have just synced things back up, kind of masking this issue. |
This adds the missing ID and datacenter fields to the snapshot persist logic.