Nomad Client crashes on startup failing to restore allocations with empty state files #2560
This has happened to me numerous times with earlier versions, such as 0.4.1. The way I work around it is to nuke everything under |
@ashald Can you tar up and share the data directory?
Unfortunately we cleaned it up to revive the agent, but if something like that happens again I'll share it with you.
@ashald Okay, let's re-open this with more detail. Unfortunately there isn't too much that is actionable without more data 👍
@dadgar it just happened again after host was rebooted. :(
Data directory:
All |
@ashald Do you have logs from before the node restarted? Also, what were you doing before the restart? Did you drain the node?
Looking for them this very moment... Is there something specific I should look for, or just any logs?
Hmm... Interesting, the only thing I found in the logs is:
But that happened several weeks before the reboot. And now Nomad tries to restore allocations that were supposed to be cleaned up?..
BTW, those allocation directories are still in place on disk and have not been cleaned up.
@ashald Yeah, we are planning on improving the error handling during restores so that it does not bail the whole client, and so that corrupt state gets deleted.
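The tolerant-restore behavior described above can be sketched in Go. This is a minimal illustration of the pattern, not Nomad's actual implementation: `allocState`, `restoreAll`, and `restoreOne` are hypothetical names standing in for the client's real state store and alloc runners.

```go
package main

import (
	"errors"
	"fmt"
)

// allocState is a hypothetical stand-in for a persisted allocation record.
type allocState struct {
	ID   string
	Data []byte // nil simulates an empty/corrupt state file
}

// restoreAll attempts to restore every allocation. Instead of aborting the
// whole client on the first failure, it logs the failure, drops the corrupt
// entry, and keeps going -- the tolerant behavior proposed in this thread.
func restoreAll(states []allocState) (restored, dropped []string) {
	for _, s := range states {
		if err := restoreOne(s); err != nil {
			fmt.Printf("[WARN] client: failed to restore alloc %s: %v (deleting state)\n", s.ID, err)
			dropped = append(dropped, s.ID)
			continue
		}
		restored = append(restored, s.ID)
	}
	return restored, dropped
}

// restoreOne fails on empty state, mimicking "no data at key alloc".
func restoreOne(s allocState) error {
	if len(s.Data) == 0 {
		return errors.New("no data at key alloc")
	}
	return nil
}

func main() {
	states := []allocState{
		{ID: "alloc-1", Data: []byte("ok")},
		{ID: "alloc-2"}, // empty state file, as reported in this issue
		{ID: "alloc-3", Data: []byte("ok")},
	}
	restored, dropped := restoreAll(states)
	fmt.Println("restored:", restored)
	fmt.Println("dropped:", dropped)
}
```

The key design point is that a per-allocation failure is contained: the node stays up and only the unrecoverable allocation's state is discarded, on the assumption that the server will reschedule it.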
Is it something that will be part of |
Sure. Will update this issue when the work is complete and we can post a test binary that you can play with. Also why are you restarting the host machine so much? |
Hosts are restarted every now and then when we apply security patches and so on. But since we're not 100% sure what causes the issue, my assumption is that it might manifest during a regular service restart as well.
@ashald Steady-state client behavior is pretty solid. The recovery code is quite complex, and I believe that is where this bug is cropping up.
Yeah, but as far as I understand, recovery is triggered during Nomad startup and will therefore run after every Nomad process restart (which might happen more often if you need to adjust the config).
I will try to ensure that the node is drained each time Nomad is shut down; maybe that will help avoid the issue for now.
@ashald I'm just curious, but do you do a |
@dvusboy that's what I want to do |
Just wanted to add that we saw a similar issue on a Windows client (running 0.5.6) today. After a restart of the system (without draining), the Nomad client can't start up, with the error message below. The ideal option would be for corrupted state to be wiped out, as the assumption is that the scheduler would have already rescheduled these jobs on another node. I can provide the state directory as well if that is helpful.
The same still happens with Nomad 0.6.0-dev.
It does happen for us on 0.6.0: gracefully shutting down the agent leads to an 80+% probability that the next time the agent starts we hit something like "* failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc". Cleaning up the mounts and the state DB works, but any left-over running container will not be handled after the restart and will lead to errors if you use static ports.
I also get this on v0.7.1. |
Got the same on Nomad v0.8.0 |
I'm getting this too on v0.8.4 |
I too face this issue often if I abruptly reboot my client nodes. Since making this modification, I haven't hit the issue.
Thanks all for the reports! We're in the middle of some refactoring work right now that we are hoping will resolve this (and other!) issues.
Same problem, waiting for the solution.
I get this
Removing |
Just to echo @onlyjob and #4748: the corrupt state is different from before, where clearing the |
I am facing a similar problem. I have a fix: #4739. @dadgar I would suggest handling a missing key like this:
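The missing-key handling suggested here can be illustrated with a small Go sketch. Note this is not the code from #4739: `kvBucket` is a map-backed stand-in for a BoltDB bucket, and `getAllocState` is a hypothetical name; the point is only the pattern of treating an absent key as "no data, start fresh" rather than as a fatal error.

```go
package main

import "fmt"

// kvBucket is a tiny map-backed stand-in for a BoltDB bucket; the real fix
// targets Nomad's state store, this only illustrates the pattern.
type kvBucket map[string][]byte

// getAllocState returns (nil, false) when the key is absent or empty instead
// of returning an error, so the caller can fall back to fresh state rather
// than crashing the whole client on startup.
func getAllocState(b kvBucket, key string) ([]byte, bool) {
	v, ok := b[key]
	if !ok || len(v) == 0 {
		return nil, false
	}
	return v, true
}

func main() {
	b := kvBucket{"alloc": []byte("serialized state")}

	if v, ok := getAllocState(b, "alloc"); ok {
		fmt.Printf("restored %d bytes of alloc state\n", len(v))
	}

	if _, ok := getAllocState(b, "missing"); !ok {
		// Missing key: initialize fresh state instead of failing startup,
		// analogous to falling back to a no-op previous-alloc watcher.
		fmt.Println("no data at key; starting with fresh state")
	}
}
```

This mirrors BoltDB's own convention, where `Bucket.Get` returns `nil` for an absent key; the bug in the thread stems from treating that case as a hard failure during restore.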
@dadgar what about releasing 0.8.7 (it would help our deployment) with the fix ported to the master branch? I mean BoltDB (not Badger): no error if a key is missing, and handle the missing key (tr.initState NoopPrevAlloc). I was able to reproduce the problem.
This problem is still reproducible even today (with the latest commit) on 0.9.0-dev just by restarting the cluster. Please see my fix #4807.
2018-10-19T16:43:44.230+0200 [ERROR] client: error restoring alloc: error="failed to get task "IDispatcher" bucket: Task bucket doesn
I ran into this issue on nodes where the disk ended up filling up.
@robloxrob This is a workaround, but it keeps the machine chugging along rather than having to debug this issue :) HTH. P.S. |
Nomad 0.9.0-rc1 was recently released, which should address the state corruption issues. Please open a new issue if you run into problems, and thanks for your patience! I know this took a while to get fixed.
Hello! Port 4646 is not opened (checked with `ss -lntp | grep nom`). If I delete the directory "/var/lib/nomad/client/", then Nomad starts normally. This is a real problem for me :(
I deleted the *.fifo files in /var/lib/nomad/alloc/45be141a-b9ff-0312-0e64-866852ceae31/alloc/logs and Nomad started fine.
Bug fix :)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.5.6
Operating system and Environment details
Centos 7
Issue
Nomad agent crashes on startup after a reboot, saying it cannot restore some allocations.
Nomad Client logs (if appropriate)