
Fix corrupt state after restart (missing key, bucket) #4807

Closed
wants to merge 15 commits

Conversation

jozef-slezak

Hello, I have tested the recent changes in 0.9.0-dev and I was again able to reproduce the corrupted state issue #2560 (just by restarting a cluster), therefore I am providing this fix.

Let me explain the fix:

  1. the original implementation fails too fast if a key or bucket does not exist
  2. this pull request contains a defensive implementation/fix: it basically falls back to the initial TaskState and LocalState if the problem occurs
  3. I can imagine that somebody would argue that the key/bucket should not be missing in the first place, but we need a reliable implementation and should avoid workarounds based on deleting the local state file (see [feature] gracefully handle corrupt startup state #1367)
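The fallback in point 2 can be sketched as follows. This is a minimal, hypothetical Go example that uses a plain map in place of Nomad's actual boltdb-backed state store; the names (`store`, `restore`) are illustrative and not Nomad's real API:

```go
package main

import "fmt"

// store maps bucket -> key -> value, standing in for the boltdb state file.
type store map[string]map[string][]byte

// restore returns the stored value, or nil (meaning "use the initial
// TaskState/LocalState") when the bucket or key is missing, instead of
// returning an error and aborting the restore.
func restore(s store, bucket, key string) []byte {
	b, ok := s[bucket]
	if !ok {
		return nil // bucket missing: fall back to initial state
	}
	return b[key] // nil when the key is missing
}

func main() {
	s := store{"tasks": {"web": []byte("running")}}
	fmt.Printf("%s\n", restore(s, "tasks", "web"))
	fmt.Println(restore(s, "tasks", "gone") == nil) // missing key: fall back
	fmt.Println(restore(s, "absent", "web") == nil) // missing bucket: fall back
}
```

The point is simply that a missing bucket or key yields a nil value (interpreted as "start from the initial state") rather than an error that aborts client startup.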

Please do not close this pull request. This is a serious problem for us and it blocks some of our Nomad deployments. If something in the pull request needs to be improved, I am eager to help with that.

Jozef

@schmichael
Member

@jozef-slezak These fixes are great! Unfortunately they're all covered in another PR: #4803

In fact your checklist is surprisingly similar to one I was working off of internally!

  1. the original implementation fails too fast if a key or bucket does not exist

👍 https://github.com/hashicorp/nomad/pull/4803/files#diff-64749d8d8c42d5b993eff64d7f522d00R164

  2. this pull request contains a defensive implementation/fix: it basically falls back to the initial TaskState and LocalState if the problem occurs

👍 https://github.com/hashicorp/nomad/pull/4803/files#diff-44115f774c7a50972c7d0c97054ea0a4R197 (and they're only overwritten if the loaded values are non-nil)

  3. I can imagine that somebody would argue that the key/bucket should not be missing in the first place, but we need a reliable implementation and should avoid workarounds based on deleting the local state file (see [feature] gracefully handle corrupt startup state #1367)

👍 Agreed! The approach in #4803 (with further fixes coming in the b-restore-rerun work-in-progress branch) does not fail on missing data but reverts to an initial state. Right now we still fail on corrupt data (unmarshal/decode errors) with the assumption that something has gone very wrong. We may take a more defensive approach there as well based on further testing.
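The missing-vs-corrupt policy described above can be sketched roughly like this, assuming JSON encoding purely for illustration (Nomad's real state store and codec differ); `TaskState` and `loadTaskState` are hypothetical stand-ins:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskState is a stand-in for Nomad's task state struct.
type TaskState struct {
	State string `json:"state"`
}

// loadTaskState reverts to the zero-value initial state when the stored
// bytes are missing, but still fails on bytes that do not decode,
// treating that as genuine corruption.
func loadTaskState(raw []byte) (TaskState, error) {
	if raw == nil {
		return TaskState{}, nil // missing: start from initial state
	}
	var ts TaskState
	if err := json.Unmarshal(raw, &ts); err != nil {
		return TaskState{}, fmt.Errorf("corrupt state: %w", err) // corrupt: fail
	}
	return ts, nil
}

func main() {
	ts, err := loadTaskState(nil)
	fmt.Println(ts.State == "", err == nil) // missing data: initial state, no error

	_, err = loadTaskState([]byte("{not json"))
	fmt.Println(err != nil) // corrupt data: surfaced as an error
}
```

The design choice is exactly the one described: absence is recoverable, but a decode failure means something went very wrong and is surfaced rather than silently discarded.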

Please do not close this pull request. This is a serious problem for us and it blocks some of our Nomad deployments. If something in the pull request needs to be improved, I am eager to help with that.

I'm sorry you're blocked by state corruption issues! I was personally hoping to have the code ready for beta testing by HashiConf (next week!), but unfortunately I was unable to make that deadline.

I assure you this is my highest priority after HashiConf, and we will do everything we can to ensure 0.9 will not be released with state corruption errors.

Since the code is still under very active development (which will be paused next week during HashiConf), it's hard to coordinate testing and changes. Feel free to review or test #4803. I'm afraid until TaskRunner.Restore/Driver.RecoverTask is fully implemented, it will be difficult to fully test state handling. That work has been started in b-restore-rerun but won't be finished until after HashiConf.

Sorry we're not moving faster and thank you for your continued engagement! We'll get there!

@schmichael schmichael closed this Oct 19, 2018
@schmichael
Member

If you are inclined, we could get you some test binaries to help us test the restore code path once development continues after HashiConf. I'll start trying to attach linux_amd64 binaries to PRs starting with the upcoming Restore implementation PR. Until Restore is implemented there isn't much to test.

@jozef-slezak
Author

Great, please let me know when I can test the linux_amd64 binaries attached to PRs (please also build them with the UI).
Thank you

@hashicorp hashicorp deleted a comment from shantanugadgil Oct 22, 2018
@shantanugadgil
Contributor

hi @schmichael, apologies for sounding like a broken record ... but how easy (or possible?) would it be to merge these changes into the 0.8 branch?

I personally face this corruption issue often, though I have attributed it to rebooting the machine abruptly (I always do a sync and reboot, never a "reboot -f", yet this still occurs), or to the underlying VM host rebooting.

The impacted machines for me are in the on-premise QA cluster, so it is manageable so far.

Along with the actual fixes, can we also have a deterministic way to check that the local Nomad state is corrupted?
The reason I mention this is that I have a workaround for now (a cron job) which checks the output of "systemctl status nomad" three times.
If there is an error in the output, it stops the Nomad, Consul, and Docker services, wipes Nomad's data dir, and reboots the machine.
(It works, but it feels very hacky.)
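For reference, that workaround could look roughly like the sketch below. The service names, the data-dir path, and the health check are all assumptions here, and wiping the data dir discards all local Nomad state:

```shell
#!/bin/sh
# Hypothetical sketch of the cron-job workaround described above.
# WARNING: wiping the data dir discards all local Nomad state.

# Run the given health check N times and print how many times it failed.
count_failures() {
  check="$1"; tries="$2"; fails=0; i=0
  while [ "$i" -lt "$tries" ]; do
    sh -c "$check" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails"
}

main() {
  fails=$(count_failures "systemctl is-active nomad" 3)
  if [ "$fails" -eq 3 ]; then
    systemctl stop nomad consul docker
    rm -rf /opt/nomad/data/*   # assumed data_dir
    reboot
  fi
}

# main   # uncomment to run for real; left commented in this sketch
```

Again, this only papers over the symptom; the real fix is the defensive restore handling discussed in this thread.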

@schmichael
Member

how easy (or possible?) would it be to merge these changes into the 0.8 branch?

Impossible I'm afraid. The fixes are part of a large refactor of the client code. Backporting them would effectively be the same as upgrading your clients to 0.9 but leaving your servers at 0.8.

Along with the actual fixes, can we also have a deterministic way to check that the local nomad state is corrupted?

We don't currently have plans for this as we're confident we can avoid corrupt state in all cases except OS/hardware failures. Even then our recovery should be more robust than it is today.

The reason I mention this is that I have a workaround for now (a cron job) which checks the output of "systemctl status nomad" three times.

Hm, I'm curious what this does. Corrupt state should only break Nomad on reboot or restart of the agent.

@jozef-slezak
Author

Hello @schmichael, do you have test binaries that we could try? I would like to test.

If you are inclined, we could get you some test binaries to help us test the restore code path once
development continues after HashiConf. I'll start trying to attach linux_amd64 binaries to PRs starting
with the upcoming Restore implementation PR. Until Restore is implemented there isn't much to test.

Are you still planning to release 0.9 in November? (referring to the HashiConf info)

@schmichael
Member

Are you still planning to release 0.9 in November? (referring to the HashiConf info)

Sadly no. We still have a lot of polishing and testing to do. Sorry for the delays, but the last thing we want to do is ship before it's ready.

I don't think a test binary at this stage would be useful. Maybe in another week or two.

@jozef-slezak
Author

I understand. Let's give it a try in a week or two. I can retest our scenario.

I saw the 0.8.7 branch. Does it make sense to PR the fix into 0.8.7? Are you planning to release 0.8.7 with fixes because of the 0.9 delay?

@jozef-slezak
Author

@schmichael can we test and tweak v0.8.7-rc1 regarding this bug?

@schmichael
Member

v0.8.7 does not contain any state corruption related handling. As I mentioned above all of our state corruption fixes are coming in 0.9. Sorry for the delay!

@jozef-slezak
Author

jozef-slezak commented Jan 19, 2019

Sadly no. We still have a lot of polishing and testing to do. Sorry for the delays, but the last thing
we want to do is ship before it's ready.
I don't think a test binary at this stage would be useful. Maybe in another week or two.

Hello, does it make sense to retest this scenario? Are test binaries useful at this stage now?

@schmichael
Member

Sorry I'm a bit late on the update, but yes! The beta releases should have fixed state corruption issues: https://releases.hashicorp.com/nomad/0.9.0-beta3/

@jozef-slezak
Author

@schmichael, thank you for your update (glad to hear that). We are going to retest the binary.

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2023