Nomad Client crashes on startup failing to restore allocations with empty state files #2560
This has happened to me numerous times with earlier versions, such as 0.4.1. The way I work around it is to nuke everything under |
@ashald Can you tar up and share the data directory?
Unfortunately we cleaned it up to revive the agent, but if something like that happens again I'll share it with you.
@ashald Okay, let's re-open this with more detail. Unfortunately there isn't too much that is actionable without more data 👍
@dadgar it just happened again after host was rebooted. :(
Data directory:
All |
@ashald Do you have logs from before the node restarted? Also, what were you doing before the restart? Did you drain the node?
Looking for them this very moment... Is there something specific I should look for, or just any logs?
Hmm... Interesting, the only thing I found in the logs is:
But that happened several weeks before the reboot. And now Nomad tries to restore allocations that were supposed to be cleaned up?..
BTW, those allocation directories are still in place on disk and have not been cleaned up.
@ashald Yeah, we are planning on improving the error handling during restores so that it does not bail the whole client, and so that corrupt state gets deleted.
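The tolerant-restore behavior described above can be sketched in Go. This is a minimal illustration of the pattern, not Nomad's actual implementation: `allocState`, `restoreAll`, and `restoreOne` are hypothetical names standing in for the client's real state store and alloc runners.

```go
package main

import (
	"errors"
	"fmt"
)

// allocState is a hypothetical stand-in for a persisted allocation record.
type allocState struct {
	ID   string
	Data []byte // nil simulates an empty/corrupt state file
}

// restoreAll attempts to restore every allocation. Instead of aborting the
// whole client on the first failure, it logs the failure, drops the corrupt
// entry, and keeps going -- the tolerant behavior proposed in this thread.
func restoreAll(states []allocState) (restored, dropped []string) {
	for _, s := range states {
		if err := restoreOne(s); err != nil {
			fmt.Printf("[WARN] client: failed to restore alloc %s: %v (deleting state)\n", s.ID, err)
			dropped = append(dropped, s.ID)
			continue
		}
		restored = append(restored, s.ID)
	}
	return restored, dropped
}

// restoreOne fails on empty state, mimicking "no data at key alloc".
func restoreOne(s allocState) error {
	if len(s.Data) == 0 {
		return errors.New("no data at key alloc")
	}
	return nil
}

func main() {
	states := []allocState{
		{ID: "alloc-1", Data: []byte("ok")},
		{ID: "alloc-2"}, // empty state file, as reported in this issue
		{ID: "alloc-3", Data: []byte("ok")},
	}
	restored, dropped := restoreAll(states)
	fmt.Println("restored:", restored)
	fmt.Println("dropped:", dropped)
}
```

The key design point is that a per-allocation failure is contained: the node stays up and only the unrecoverable allocation's state is discarded, on the assumption that the server will reschedule it.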
Is it something that will be part of |
Sure. Will update this issue when the work is complete and we can post a test binary that you can play with. Also why are you restarting the host machine so much? |
Hosts are restarted every now and then when we apply security patches and so on. But since we're not 100% sure what causes the issue, my assumption is that it might manifest during a regular service restart as well.
@ashald Steady-state client behavior is pretty solid. The recovery code is quite complex, and I believe that is where this bug is cropping up.
Yeah, but as far as I understand, recovery is triggered during Nomad startup and will therefore run after every Nomad process restart (which might happen more often if you need to adjust the config).
I will try to ensure that the node is drained each time Nomad is shut down; maybe that will help avoid the issue for now.
@ashald I'm just curious, but do you do a |
@dvusboy that's what I want to do |
Just wanted to add that we saw a similar issue on a Windows client (running 0.5.6) today. After a restart of the system (without draining), the Nomad client can't start up, with the error message below. The ideal option would be for corrupted state to be wiped out, as the assumption is that the scheduler would have already rescheduled these jobs on another node. I can provide the state directory as well if that is helpful.
The same still happens with Nomad 0.6.0-dev.
It does happen for us on 0.6.0: gracefully shutting down the agent leads to an 80+% probability that the next time the agent starts we hit something like "* failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc". Cleaning up the mounts and the state DB works, but any left-over running container will not be handled after the restart and will lead to errors if you use static ports.
I also get this on v0.7.1. |
Got the same on Nomad v0.8.0 |
I'm getting this too on v0.8.4 |
I too face this issue often if I abruptly reboot my client nodes. Since making this modification, I haven't hit the issue.
Thanks all for the reports! We're in the middle of some refactoring work right now that we are hoping will resolve this (and other!) issues.
Same problem, waiting for the solution.
I get this
Removing |
Just to echo @onlyjob and #4748: the corrupt state is different from before, where clearing the |
I am facing a similar problem. I have a fix: #4739. @dadgar I would suggest handling a missing key like this:
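The missing-key handling suggested here can be illustrated with a small Go sketch. Note this is not the code from #4739: `kvBucket` is a map-backed stand-in for a BoltDB bucket, and `getAllocState` is a hypothetical name; the point is only the pattern of treating an absent key as "no data, start fresh" rather than as a fatal error.

```go
package main

import "fmt"

// kvBucket is a tiny map-backed stand-in for a BoltDB bucket; the real fix
// targets Nomad's state store, this only illustrates the pattern.
type kvBucket map[string][]byte

// getAllocState returns (nil, false) when the key is absent or empty instead
// of returning an error, so the caller can fall back to fresh state rather
// than crashing the whole client on startup.
func getAllocState(b kvBucket, key string) ([]byte, bool) {
	v, ok := b[key]
	if !ok || len(v) == 0 {
		return nil, false
	}
	return v, true
}

func main() {
	b := kvBucket{"alloc": []byte("serialized state")}

	if v, ok := getAllocState(b, "alloc"); ok {
		fmt.Printf("restored %d bytes of alloc state\n", len(v))
	}

	if _, ok := getAllocState(b, "missing"); !ok {
		// Missing key: initialize fresh state instead of failing startup,
		// analogous to falling back to a no-op previous-alloc watcher.
		fmt.Println("no data at key; starting with fresh state")
	}
}
```

This mirrors BoltDB's own convention, where `Bucket.Get` returns `nil` for an absent key; the bug in the thread stems from treating that case as a hard failure during restore.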
@dadgar what about releasing 0.8.7 (it would help our deployment) with the fix ported to the master branch? I mean BoltDB (not Badger): no error if a key is missing, and handle the missing key (tr.initState NoopPrevAlloc). I was able to reproduce the problem.
This problem is still reproducible even today (with the latest commit) on 0.9.0-dev just by restarting the cluster. Please see my fix #4807.
2018-10-19T16:43:44.230+0200 [ERROR] client: error restoring alloc: error="failed to get task "IDispatcher" bucket: Task bucket doesn
I ran into this issue on nodes where the disk ended up filling up.
@robloxrob This is a workaround, but it keeps the machine chugging along rather than having to debug this issue :) HTH. P.S. |
Nomad 0.9.0-rc1 was recently released, which should address the state corruption issues. Please open a new issue if you run into problems, and thanks for your patience! I know this took a while to get fixed.
Hello! Port 4646 is not opened (checked with `ss -lntp | grep nom`). If I delete the directory "/var/lib/nomad/client/", then Nomad starts normally. This is a real problem for me :(
I deleted the *.fifo files in /var/lib/nomad/alloc/45be141a-b9ff-0312-0e64-866852ceae31/alloc/logs and Nomad started fine.
Bug fix :)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.5.6
Operating system and Environment details
Centos 7
Issue
Nomad agent crashes on startup after a reboot, saying it cannot restore some allocations.
Nomad Client logs (if appropriate)