Nomad Client crashes on startup failing to restore allocations with empty state files #2560

Closed
ashald opened this issue Apr 13, 2017 · 38 comments

Comments

@ashald

ashald commented Apr 13, 2017

Nomad version

0.5.6

Operating system and Environment details

Centos 7

Issue

The Nomad agent crashes on startup after a reboot, saying it cannot restore some allocations.

Nomad Client logs (if appropriate)

Apr 13 11:27:01 nomad-agent-02 nomad[1657]: client: failed to restore state for alloc 48a4b655-d25c-21e5-4dce-a2f03d610a43: failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad[1657]: client: failed to restore state for alloc 6a25c5d7-d4a7-3ef3-df2a-1ca60eeab51d: failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad[1657]: client: failed to restore state for alloc a95c1022-9a70-62a4-b440-76ea711d86ab: failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad[1657]: client: failed to restore state for alloc c6785f33-9f18-095b-3682-6962c2b1157d: failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad: ==> Error starting agent: client setup failed: failed to restore state: 4 error(s) occurred:
Apr 13 11:27:01 nomad-agent-02 nomad: * failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad: * failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad: * failed to decode state: unexpected end of JSON input
Apr 13 11:27:01 nomad-agent-02 nomad: * failed to decode state: unexpected end of JSON input
@ashald ashald changed the title from "Agent crashes when cannot restore allocations" to "Nomad Client crashes when cannot restore allocations" Apr 13, 2017
@dvusboy

dvusboy commented Apr 13, 2017

This has happened to me numerous times with earlier versions, such as 0.4.1. The way I work around it is to nuke everything under /var/lib/nomad/client/alloc/. After that, the nomad client starts fine. I've noticed 0.5.6 improved GC a lot. By chance, could this have happened during an upgrade cycle of Nomad? Did you do a node-drain prior to the reboot?

@dadgar
Contributor

dadgar commented Apr 13, 2017

@ashald Can you tar up and share the data directory?

@ashald
Author

ashald commented Apr 13, 2017

Unfortunately we cleaned it up to revive the agent, but if something like that happens again I'll share it with you.

@dadgar
Contributor

dadgar commented Apr 17, 2017

@ashald Okay, let's re-open this when there is more detail. Unfortunately there isn't too much that is actionable without more data 👍

@dadgar dadgar closed this as completed Apr 17, 2017
@ashald
Author

ashald commented Apr 20, 2017

@dadgar it just happened again after the host was rebooted. :(

Apr 20 14:43:58 nomad-agent01 nomad: Loaded configuration from /etc/nomad/config.hcl
Apr 20 14:43:58 nomad-agent01 nomad: ==> Starting Nomad agent...
Apr 20 14:44:01 nomad-agent01 nomad: ==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:
Apr 20 14:44:01 nomad-agent01 nomad: * failed to decode state: unexpected end of JSON input
Apr 20 14:44:01 nomad-agent01 nomad: * failed to decode state: unexpected end of JSON input
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:43:58.739422 [INFO] client: using state directory /var/lib/nomad/client
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:43:58.739673 [INFO] client: using alloc directory /srv
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:43:58.755586 [INFO] fingerprint.cgroups: cgroups are available
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:43:58.760798 [INFO] fingerprint.consul: consul agent is available
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:43:58.776327 [WARN] fingerprint.network: Unable to parse Speed in output of '/usr/sbin/ethtool eth0'
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:44:01.887324 [ERR] client: failed to restore state for alloc 21522dd5-b002-0f35-9a4b-f498454fbc3e: failed to decode state: unexpected end of JSON input
Apr 20 14:44:01 nomad-agent01 nomad: 2017/04/20 14:44:01.887421 [ERR] client: failed to restore state for alloc 379ebc00-1c12-fed3-3239-3c23b3b8e95b: failed to decode state: unexpected end of JSON input
Apr 20 14:44:01 nomad-agent01 systemd: nomad.service: main process exited, code=exited, status=1/FAILURE
Apr 20 14:44:01 nomad-agent01 systemd: Unit nomad.service entered failed state.
Apr 20 14:44:01 nomad-agent01 systemd: nomad.service failed.

Data directory:

/var/lib/nomad $ tree
.
└── client
    ├── alloc
    │   ├── 21522dd5-b002-0f35-9a4b-f498454fbc3e
    │   │   ├── state.json
    │   │   └── task-484cbd7aea50a707eca9796a0b7e34f8
    │   │       └── state.json
    │   └── 379ebc00-1c12-fed3-3239-3c23b3b8e95b
    │       ├── state.json
    │       └── task-aa3a0177193f5479812eadbbb3e6c303
    │           └── state.json
    ├── client-id
    └── secret-id

All .json files are empty, client-id and secret-id contain UUIDs.
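
A quick way to check a node for this condition before starting the agent is to look for zero-length state files. This is only a minimal sketch: `list_empty_state_files` is a hypothetical helper name (not a Nomad command), and the default path is the state directory shown in the log above.

```shell
#!/bin/sh
# List zero-length state.json files under a Nomad client state directory.
# list_empty_state_files is a hypothetical helper, not part of Nomad.
list_empty_state_files() {
    state_dir="${1:-/var/lib/nomad/client}"
    # -size 0 only matches files that contain no data at all.
    find "$state_dir" -type f -name state.json -size 0
}
```

Calling `list_empty_state_files /var/lib/nomad/client` on the tree above would be expected to print the four state.json paths, since all of them are empty.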

@dadgar
Contributor

dadgar commented Apr 20, 2017

@ashald Do you have logs before the node restarted? Also what were you doing before the restart? Did you drain the node?

@ashald
Author

ashald commented Apr 20, 2017

Looking for them this very moment... Is there something specific I should look for or just any logs?

@ashald
Author

ashald commented Apr 20, 2017

Hm.... Interesting, the only thing that I found in logs is:

Apr  4 15:38:24.839 nomad-agent01 nomad: client: marking allocation 379ebc00-1c12-fed3-3239-3c23b3b8e95b for GC
Apr  4 15:38:24.839 nomad-agent01 nomad: client: marking allocation 379ebc00-1c12-fed3-3239-3c23b3b8e95b for GC
Apr  4 15:38:24.835 nomad-agent01 nomad: client: marking allocation 21522dd5-b002-0f35-9a4b-f498454fbc3e for GC
Apr  4 15:38:24.834 nomad-agent01 nomad: client: marking allocation 21522dd5-b002-0f35-9a4b-f498454fbc3e for GC

But that happened several weeks before the reboot. And now nomad tries to restore allocations that were supposed to be cleaned up?..

@ashald
Author

ashald commented Apr 20, 2017

BTW, those allocation directories are there in place on the disk and are not cleaned.

@ashald ashald changed the title from "Nomad Client crashes when cannot restore allocations" to "Nomad Client crashes on startup failing to restore allocations marked for GC" Apr 20, 2017
@dadgar
Contributor

dadgar commented Apr 20, 2017

@ashald Yeah we are planning on improving the error handling during restores to not bail the whole client and to delete corrupt state.

@ashald
Author

ashald commented Apr 20, 2017

Is it something that will be part of 0.6? Can we please re-open the issue?
At this point, considering that we have run into the issue a couple of times by now, it's pretty critical for us, as we are not sure we can rely on Nomad unless the issue is fixed... Unless we run only stateless applications and wipe out both the data and allocation dirs before Nomad starts...

@dadgar dadgar reopened this Apr 20, 2017
@dadgar
Contributor

dadgar commented Apr 20, 2017

Sure. Will update this issue when the work is complete and we can post a test binary that you can play with.

Also why are you restarting the host machine so much?

@dadgar dadgar changed the title from "Nomad Client crashes on startup failing to restore allocations marked for GC" to "Nomad Client crashes on startup failing to restore allocations with empty state files" Apr 20, 2017
@ashald
Author

ashald commented Apr 20, 2017

Hosts are restarted every now and then when we apply security patches and so on. But since we're not 100% sure what causes the issue my assumption is that the issue might manifest during the regular service restart as well.

@dadgar
Contributor

dadgar commented Apr 20, 2017

@ashald Steady state client behavior is pretty solid. The recovery code is quite complex and I believe that is where this bug is cropping up.

@ashald
Author

ashald commented Apr 20, 2017

Yeah, but as far as I understand, recovery mode is triggered during Nomad startup and therefore will be activated after each Nomad process restart (that might happen more often if you need to adjust config)

@ashald
Author

ashald commented Apr 20, 2017

I will try to ensure that the node is drained each time Nomad is shut down - maybe it will help to avoid the issue for now.

@dvusboy

dvusboy commented Apr 20, 2017

@ashald I'm just curious, but do you do a nomad node-drain before restart?

@ashald
Author

ashald commented Apr 20, 2017

@dvusboy that's what I want to do

@andrewduch

andrewduch commented Apr 28, 2017

Just wanted to add that we saw a similar issue on a Windows client (running 0.5.6) today. After a restart of the system (without draining), the nomad client can't start up and logs the error messages below.

Ideally, corrupted state would simply be wiped, on the assumption that the scheduler has already rescheduled these jobs on another node. I can provide the state directory as well if that is helpful.

    Loaded configuration from C:\nomad\client.conf
==> Starting Nomad agent...
    2017/04/28 19:59:09.801512 [INFO] client: using state directory D:\nomad\client
    2017/04/28 19:59:09.823511 [INFO] client: using alloc directory D:\nomad\alloc
    2017/04/28 19:59:09.939514 [INFO] fingerprint.consul: consul agent is available
    2017/04/28 19:59:10.921551 [INFO] fingerprint.vault: Vault is available
    2017/04/28 19:59:10.925540 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 00099a77-f90a-c3e5-b957-d45198325820: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 016b6808-593e-9063-bdb1-a86b109e63cc: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 020e2d1a-81f7-0527-78d0-435a9c4b2cc2: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 0d41d010-dab0-3306-b089-1ab297a653d4: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 0ffdcc06-3482-39bc-082d-d1091e84b4c2: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 206ea917-29dc-a06a-b533-f9f6f68254b4: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 23c5aefa-7372-0808-0b27-cdbbf55dd342: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 23f987e0-b429-7564-2bc6-9a4a6156562e: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 2e49e199-61db-94f6-0e58-4bf4d88679da: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 30e35b63-9fe5-0e67-a158-4d9bc125c8d9: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 38038fe4-d9b2-ec56-90ea-768f085d282b: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 406e1259-3162-d2af-98a3-7b1aa827eb8b: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 415971e1-5c5d-23d6-a7da-c916ce1544a7: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.928522 [ERR] client: failed to restore state for alloc 4b86698b-25e8-9e87-2738-354e56314adb: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 54d6f3c1-42c5-bfc7-77c4-98a7ea67c5a5: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 56f69681-bbdb-3537-9849-991352593f42: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 59e20a4c-5212-742e-66cf-da1530ec7150: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 5bd83bdc-bb14-497c-b0b9-edc2ffe1ed47: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 67265d54-f1e3-b120-36c0-e64200b21488: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 6d898dbb-276e-f0c3-fadc-c9bd59017ecc: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 704dde36-7e01-abcc-f6fb-a76fb4787552: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 714cbc72-87a8-b02f-fb29-5007dcf8fd6e: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 747eae02-3c20-b390-9932-4efaae6ff3c2: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 7a8710a0-c03b-6ca3-121f-bcb8b178bfdc: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 7e8efde5-1249-6a62-caaa-34f7de2ab8a0: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 82b4867b-18e8-2ae1-89ac-e618c4a35c26: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 869fd437-c07c-1fd6-4e9b-1089dca590f0: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 8eb53ede-ed14-af36-3f8a-9abc7c7ba876: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 90203232-606a-e100-798e-55b9c2756f96: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 903f5ed2-3dc6-1afe-2310-4979bebd45ba: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.929541 [ERR] client: failed to restore state for alloc 9c723ebb-e6f1-be34-6e8c-5aca7208c9c0: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.930541 [ERR] client: failed to restore state for alloc a20424fe-82c1-3574-1fe3-36d75ca847dd: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.930541 [ERR] client: failed to restore state for alloc ab412a69-84b4-9ffa-cd94-0bb52a8dcd7f: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.930541 [ERR] client: failed to restore state for alloc b0c1d0b0-3399-6447-538c-32b688e74f54: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc b2e4b880-b5a6-0026-972c-612d98ae387c: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc b57502c3-6f27-a810-8173-1b6fae028749: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc b5fc1567-0245-e9c9-9571-4064cf881ce8: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc b682f968-58bd-8481-c390-3b81d5ada621: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc c76d1180-c879-4f02-a544-e5c0ddf12f5e: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc c95b703a-1741-6e0c-5602-3f05cc08fe86: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc cff32ed7-3123-0041-261d-1b2ae94110fc: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc d3071e8b-bdb0-bc49-6be8-07d62b7d18c4: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc d754293f-7521-f186-76b1-f2bd0304d5a5: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc dc56d315-6c5b-5992-1116-98970e1b2671: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc dc56dfba-da35-4f0d-b5e0-03594d3199e5: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc ddd694a4-a762-1994-a267-bba34b216210: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc eb88d77d-1412-bf3b-a34b-e87fed5533e2: failed to decode state: invalid character '\x00' looking for beginning of value
    2017/04/28 19:59:10.931520 [ERR] client: failed to restore state for alloc fae26824-87a5-71df-7a2e-ea3cfd4cda0a: failed to decode state: invalid character '\x00' looking for beginning of value

@sirkjohannsen

same happens still with nomad 0.6.0-dev

@opaugam-unity

It does happen for us on 0.6.0: after gracefully shutting down the agent, there is an 80+% probability that the next time the agent starts we hit something like "* failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc".

Cleaning up the mounts and the state db is okay, but any left-over running container will not be handled upon restart and will lead to errors if you use static ports.

@elsbrock

elsbrock commented May 2, 2018

I also get this on v0.7.1.

@volodymyr-f

Got the same on Nomad v0.8.0

@momania

momania commented Jun 20, 2018

I'm getting this too on v0.8.4
Allocation ids that it tries to recover don't match with the ids found in /var/lib/nomad/alloc

@shantanugadgil
Contributor

I too face this issue often, if I abruptly reboot my client nodes.
I have updated my scripts to stop the Docker daemon altogether (my use case is Docker containers), stop Consul and Nomad, sync a few times (for good luck), and then reboot.

Since this modification, I haven't hit the issue.
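
The sequence described above could be scripted roughly as follows. This is a sketch under assumptions, not the commenter's actual script: the systemd unit names (docker, consul, nomad), the helper names `run` and `safe_reboot`, and the DRY_RUN switch are all hypothetical.

```shell
#!/bin/sh
# Stop the services in the order described above, flush disk buffers, reboot.
# Unit names (docker, consul, nomad) are assumptions; adjust them to your hosts.
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

safe_reboot() {
    run systemctl stop docker
    run systemctl stop consul
    run systemctl stop nomad
    # Flush pending writes so state files are on disk before the reboot.
    run sync
    run sync
    run systemctl reboot
}
```

Setting `DRY_RUN=1` before calling `safe_reboot` prints the six commands without touching the machine, which is a cheap way to review the order first.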

@qkate
Contributor

qkate commented Jul 13, 2018

Thanks all for the reports! We're in the middle of some refactoring work right now that we are hoping will resolve this (and other!) issues.

@capone212
Contributor

Same problem, waiting for the solution.

@onlyjob
Contributor

onlyjob commented Aug 15, 2018

I get this very often on v0.8.4 when restarting nomad after a full drain: the daemon just fails to start (after a clean shutdown) with the following error:

[ERR] client: failed to restore state: 1 error occurred:
* failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc

Removing /var/lib/nomad/client/state.db helps but should not be necessary...

@kcwong-verseon
Contributor

Just to echo @onlyjob and #4748, the corrupt state is different from before, when clearing the alloc directory resolved the issue. The new behavior with 0.8.x is that the state DB is corrupted, and either removing it or running cat /dev/null >/var/lib/nomad/client/state.db is needed; clearing alloc does not address the corrupt state.
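
If the state database has to be cleared anyway, moving it aside instead of deleting or truncating it keeps a copy that can be attached to a bug report. This is a variation on the workaround, sketched under assumptions: `quarantine_state_db` is a hypothetical helper name (not a Nomad command), the default path is the one mentioned above, and the agent is assumed to be stopped before it runs.

```shell
#!/bin/sh
# Move a possibly corrupt Nomad client state database aside, keeping a
# timestamped copy for later inspection. Hypothetical helper, not part of Nomad.
quarantine_state_db() {
    state_db="${1:-/var/lib/nomad/client/state.db}"
    # Nothing to do if the file is absent.
    [ -f "$state_db" ] || return 0
    mv "$state_db" "${state_db}.corrupt.$(date +%Y%m%d%H%M%S)"
}
```

After `quarantine_state_db` returns, the agent starts with no state database, the same situation as after removing the file by hand.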

@jozef-slezak

jozef-slezak commented Oct 8, 2018

I am facing a similar problem. I have a fix: #4739
@qkate the fix is being implemented against the refactored code (at least I hope so because of the branch)

@dadgar I would suggest handling a missing key like this:

  1. It might be better if GetTaskRunnerState() and GetAllAllocations() did not return an error when a key (local_state, task_state, etc.) is missing (the key might not have been written).
  2. The callers of GetTaskRunnerState() and GetAllAllocations() (ffdbbce) would use tr.initState() and &NoopPrevAlloc{} when a key is missing.

@dadgar what about having a 0.8.7 release (it would help for our deployment) with the fix ported to the master branch? I mean BoltDB (not Badger), no error if a key is missing, and handling of the missing key (tr.initState, NoopPrevAlloc).

I was able to reproduce the problem in:

Nomad version
0.8.6

Operating system and Environment details
Centos 7

Issue
The Nomad agent crashes on startup after a reboot, saying it cannot restore some allocations.

Nomad Client logs (if appropriate)
no data at key local_state

@jozef-slezak

This problem is still reproducible even today (with the latest commit) on 0.9.0-dev just by restarting the cluster. Please see my fix #4807.

Nomad version
0.9.0-dev

Operating system and Environment details
Centos 7

Issue
The Nomad agent crashes on startup after a reboot, saying it cannot restore some allocations.
failed to read local task runner state: no data at key local_state
bucket: Task bucket doesn't exist

Nomad Client logs (if appropriate)
==> Starting Nomad agent...
2018/10/19 16:29:08 [INFO] agent: Deregistered service "_nomad-task-inhseqgfpo432j5iihs2d7lwcpve7gpb"
==> Error starting agent: client setup failed: failed to restore state
2018-10-19T16:29:08.636+0200 [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1
2018-10-19T16:29:08.636+0200 [INFO ] agent: detected plugin: name=mock_driver type=driver plugin_version=
2018-10-19T16:29:08.636+0200 [INFO ] client: using state directory: state_dir=/var/lib/innovatrics/nomad/
2018-10-19T16:29:08.636+0200 [INFO ] client: using alloc directory: alloc_dir=/var/lib/innovatrics/nomad/
2018-10-19T16:29:08.638+0200 [INFO ] client.fingerprint_mgr.cgroup: cgroups are available
2018-10-19T16:29:08.641+0200 [INFO ] client.fingerprint_mgr.consul: consul agent is available
2018-10-19T16:29:08.642+0200 [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbi
2018-10-19T16:29:08.652+0200 [ERROR] client: error restoring alloc: error="failed to read local task runn
2018-10-19T16:29:08.652+0200 [ERROR] client: failed to restore state: error="1 error(s) occurred:

  • failed to read local task runner state: no data at key local_state"
    2018-10-19T16:29:08.652+0200 [ERROR] client: Nomad is unable to start due to corrupt state. The safest wa
    2018-10-19T16:29:08.652+0200 [ERROR] client: Corrupt state is often caused by a bug. Please report as muc

2018-10-19T16:43:44.230+0200 [ERROR] client: error restoring alloc: error="failed to get task "IDispatcher" bucket: Task bucket doesn
2018-10-19T16:43:44.230+0200 [ERROR] client: error restoring alloc: error="failed to get task "IDispatcher" bucket: Task bucket doesn
2018-10-19T16:43:44.232+0200 [ERROR] client: error restoring alloc: error="failed to get task "activemq515" bucket: Task bucket doesn
2018-10-19T16:43:44.233+0200 [ERROR] client: error restoring alloc: error="failed to get task "activemq515" bucket: Task bucket doesn
2018-10-19T16:43:44.233+0200 [ERROR] client: failed to restore state: error="4 error(s) occurred:

  • failed to get task "IDispatcher" bucket: Task bucket doesn't exist and transaction is not writable
  • failed to get task "IDispatcher" bucket: Task bucket doesn't exist and transaction is not writable
  • failed to get task "activemq515" bucket: Task bucket doesn't exist and transaction is not writable
  • failed to get task "activemq515" bucket: Task bucket doesn't exist and transaction is not writable"
    2018-10-19T16:43:44.233+0200 [ERROR] client: Nomad is unable to start due to corrupt state. The safest way to proceed is to manually
    2018-10-19T16:43:44.233+0200 [ERROR] client: Corrupt state is often caused by a bug. Please report as much information as possible to

@robloxrob

I ran into this issue on nodes where my disk ended up filling up.

Nomad version
0.8.6

Operating system and Environment details
Ubuntu 16.04

@shantanugadgil
Contributor

@robloxrob
I use CentOS7 (systemd)
I run a cron job which checks the status of the service named "Nomad" to see if it is unhealthy/dead
If it is dead, I proceed to stop Nomad, Consul, Docker daemons, wipe out the "datadir" and reboot the machine.

This is a workaround, but keeps machine chugging along rather than having to debug this issue :)

HTH,
Shantanu

P.S.
I haven't implemented it yet, but I have an idea to parse the output of the command "systemctl status nomad" for specific error strings rather than the return value of systemctl.
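
The idea of matching specific error strings could be sketched like this. It is an illustration only: `has_restore_error` is a hypothetical function name, and the patterns are simply the restore errors quoted in this thread, so they may not cover every Nomad version.

```shell
#!/bin/sh
# Succeed (exit status 0) if the text on stdin contains one of the
# state-restore errors quoted in this thread. Hypothetical helper.
has_restore_error() {
    grep -q -E 'failed to restore state|failed to decode state|no data at key'
}
```

It could then be fed from the service status, for example `systemctl status nomad --no-pager | has_restore_error && echo "corrupt state suspected"`, so the wipe-and-reboot path only runs when the failure actually looks like corrupt state.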

@schmichael
Member

Nomad 0.9.0-rc1 was recently released which should address state corruption issues. Please open a new issue if you run into problems and thanks for your patience! I know this took a while to get fixed.

@vkiranananda

Hello!
I have the same problem with Nomad v0.9.0. I run a Docker service (memcached) and restart the Linux server.
logs:
Apr 13 16:01:17 beta-docker nomad[8199]: ==> Loaded configuration from /etc/nomad.d/server.hcl
Apr 13 16:01:17 beta-docker nomad[8199]: ==> Starting Nomad agent...
This is client...
ps uax |grep noma
root 8199 0.2 0.3 1933276 38040 ? Ssl 16:01 0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
root 8246 0.0 0.3 1416256 31988 ? Sl 16:01 0:00 /usr/local/bin/nomad docker_logger

Port 4646 is not open (checked with ss -lntp|grep nom).
service nomad stop
ps uax |grep noma
root 8246 0.0 0.3 1416256 31924 ? Sl 16:01 0:00 /usr/local/bin/nomad docker_logger

If I delete the directory "/var/lib/nomad/client/", then nomad starts normally.

This is a real problem for me :(

@vkiranananda

I deleted the *.fifo files in /var/lib/nomad/alloc/45be141a-b9ff-0312-0e64-866852ceae31/alloc/logs and nomad started fine.

@vkiranananda

Bug fix :)

cat > /etc/systemd/system/nomad-fix.service <<'EOF'
[Unit]
Description=Nomad fix start
Before=nomad.service

[Service]
# oneshot makes systemd wait for find to finish before starting nomad.service
Type=oneshot
ExecStart=/usr/bin/find /var/lib/nomad/alloc/ -type p -delete

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable nomad-fix

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022