Fix corrupt state after restart (missing key, bucket) #4807
@jozef-slezak These fixes are great! Unfortunately they're all covered in another PR: #4803 In fact your checklist is surprisingly similar to one I was working off of internally!
👍 https://github.com/hashicorp/nomad/pull/4803/files#diff-44115f774c7a50972c7d0c97054ea0a4R197 (and they're only overwritten if the loaded values are non-nil)
👍 Agreed! The approach in #4803 (with further fixes coming in the
I'm sorry you're blocked by state corruption issues! I was personally hoping to have the code ready for beta testing by HashiConf (next week!), but unfortunately I was unable to make that deadline. I assure you this is my highest priority after HashiConf, and we will do everything we can to ensure 0.9 will not be released with state corruption errors. Since the code is still under very active development (which will be paused next week during HashiConf), it's hard to coordinate testing and changes. Feel free to review or test #4803. I'm afraid until
Sorry we're not moving faster and thank you for your continued engagement! We'll get there!
If you are inclined, we could get you some test binaries to help us test the restore code path once development continues after HashiConf. I'll start trying to attach linux_amd64 binaries to PRs starting with the upcoming Restore implementation PR. Until Restore is implemented there isn't much to test.
Great, please let me know when I can test the linux_amd64 binaries attached to PRs (please also build them with the GUI).
Hi @schmichael, apologies for sounding like a broken record, but how easy (or even possible) would it be to merge these changes into the 0.8 branch? I personally hit this corruption issue often, though I have attributed it to rebooting the machine abruptly (I always do a sync and reboot, never a "reboot -f", yet it still occurs) or to the underlying VM host rebooting. The impacted machines for me are in the "on premise" QA cluster, so it has been manageable so far. Along with the actual fixes, can we also have a deterministic way to check whether the local Nomad state is corrupted?
Impossible I'm afraid. The fixes are part of a large refactor of the client code. Backporting them would effectively be the same as upgrading your clients to 0.9 but leaving your servers at 0.8.
We don't currently have plans for this as we're confident we can avoid corrupt state in all cases except OS/hardware failures. Even then our recovery should be more robust than it is today.
Hm, I'm curious what this does. Corrupt state should only break Nomad on reboot or restart of the agent.
Hello @schmichael, do you have test binaries that we could try? I would like to test them.
Are you still planning to release 0.9 in November (referring to the HashiConf info)?
Sadly no. We still have a lot of polishing and testing to do. Sorry for the delays, but the last thing we want to do is ship before it's ready. I don't think a test binary at this stage would be useful. Maybe in another week or two.
I understand. Let's give it a try in a week or two, please. I can retest our scenario. I saw the 0.8.7 branch. Does it make sense to open a PR with the fix against 0.8.7? Are you planning to release 0.8.7 with fixes because of the 0.9 delay?
@schmichael can we test and tweak v0.8.7-rc1 regarding this bug?
v0.8.7 does not contain any state corruption related handling. As I mentioned above all of our state corruption fixes are coming in 0.9. Sorry for the delay!
Hello, does it make sense to retest this scenario? Are test binaries useful at this stage?
Sorry I'm a bit late on the update, but yes! The beta releases should have fixed state corruption issues: https://releases.hashicorp.com/nomad/0.9.0-beta3/
@schmichael, thank you for your update (glad to hear that). We are going to retest the binary.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Hello, I have tested the recent changes in 0.9.0-dev and was again able to reproduce the corrupted state issue #2560 (just by restarting a cluster); therefore, I am providing this fix.
Let's explain the fix:
Please do not close this pull request. This is a serious problem for us and it blocks some of our Nomad deployments. If something in the pull request needs to be improved, I am eager to help with that.
Jozef