Constraint/count is not respected after Nomad cluster restart (previously failed allocs) #5921
Comments
I have verified this (even on a small cluster with a few clients) against Nomad 0.9.3.
@jozef-slezak, I was not able to reproduce this with the attached latest build of Nomad. This build includes the following PR, which may have been the fix for this. If you are willing to test with this, it would be helpful.
Sure, I can test that on Monday.
It definitely does NOT work as expected. I reproduced the same behavior using the binary from https://github.com/hashicorp/nomad/files/3364021/11afd990a7b76a9909c2b0328f17381ad3d27bef.zip. After doing that, I believe I better understand what is happening. My previously failed job allocations were started again after the cluster restart (this seems to be the root cause - please try fixing here). My job count is 22, but Nomad started 86 allocations, which caused an out-of-memory condition (the OOM killer was not configured, so one server is no longer responding). It did not even respect the memory quota (not only the count/constraint).
Steps to reproduce:
Step 1. I submitted a job (first version) that failed after start (I see an application error in the logs).
Step 5. I waited until Nomad stopped on all servers (one Nomad instance, the leader, was killed) and then started it again.
Step 6. Nomad started too many allocations (86, which caused OOM killing...).
Step 3 logs (HCL job count 10 -> 12):
Step 4 logs:
Later that day I checked cluster status:
Even later that day:
In my opinion, Nomad should not start allocs "blindly"; it should check count/constraint/memory first. I am thinking about a workaround (minimizing the number of allocs that need to be garbage collected), sketched below:
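A minimal sketch of the kind of client-side GC tuning this workaround points at, assuming the standard gc_* options in the Nomad client stanza (values are illustrative, not taken from this cluster's config):

```hcl
# Illustrative client agent config: tighten garbage collection so fewer
# terminal (failed/complete) allocations linger on a node and get restored
# after a restart.
client {
  enabled = true

  gc_interval   = "1m" # scan for terminal allocations more frequently
  gc_max_allocs = 20   # trigger GC once this many allocations are tracked on the node
}
```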
Today, I was able to simulate this buggy behavior on a single node (just by using
Similar to #5919, the latest build of Nomad
I am going to retest with
nomad version
Job Count=1
Thanks, @jozef-slezak, I'm looking at this right now.
Today I found a place in the code (alloc/task runner) that causes a problem in combination with sudo reboot / KillMode cgroup. Let me explain my hypothesis (if it makes sense to you, I can make a small pull request):
I was checking your e2e tests and I think I can reuse some parts to isolate this behavior (and maybe to test a fix/PR).
Okay, here's what I've been able to do. I'm running the same version of Nomad:
It's a single-node cluster running both client and server. I've registered three jobs.
I run all three, see one allocation per job. Once they're up and running, I
There are still only three allocations (same IDs), and they've all restarted. Status looks like the following (don't mind the
There are three actual instances of the task running (one per job, as expected):
This is all about what I would expect: the client can't reconnect to the tasks because they were killed, it waits to contact the server (per current expected behavior), and then it restarts the tasks once the previous allocations are verified with the server. Do you see anything I've done differently from your case?
Yes, I can see one difference: before the restart, alloc e65cef3b has desired=run, status=running and alloc dc64d219 has desired=run, status=failed. Then I restart.
After the reboot I can see two allocs running (the problem is that the second alloc dc64d219 is also restored, even though Count=1 in the job definition):
Thanks, I'll file this in our tracker.
@jozef-slezak, is it possible for you to share your job spec for this job? If you'd rather not post it publicly, you can send it to nomad-oss-debug@hashicorp.com, where it will only be viewable by Nomad developers.
Sure, I will do that (probably later today).
I'm concerned that the previously failed allocations still have desired=run.
My procedure was:
Any additional help that you can give for reproducing this is very much appreciated. If this bug is still occurring, I want to make sure that we fix it. Thank you for your help, @jozef-slezak.
Also, can you reproduce this from a clean cluster on 0eae387?
Hello, I am attaching the HCL that you asked for:
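For readers without the attachment, a minimal sketch of the kind of service job discussed in this thread (illustrative only, not the actual attached spec; the names, count, and resources are made up):

```hcl
# Illustrative job: a single long-running instance of a binary. The group
# "count" is the limit that was observed being exceeded after cluster restarts.
job "dummyapp" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 1

    restart {
      attempts = 2
      interval = "5m"
      delay    = "30s"
      mode     = "fail"
    }

    task "dummyapp" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/dummyapp"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```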
Version 0.9.4-dev definitely behaves better. I can confirm that state had been written using v0.9.1 and later v0.9.3 (before I upgraded to 0.9.4-dev). I will definitely try to reproduce this on Monday as you suggested (clean cluster on 0eae387). If there is still a problem, I am willing to share my screen if you like. By the way, what is the plan for the official 0.9.4 release?
@cgbaker, I was able to retest everything in an isolated environment (3-node cluster, AWS, dummyapp). I am attaching everything: job HCL, Nomad HCL, systemd service file, and the simple source code of a dummyapp (you could probably test with redis or any other executable binary), plus the steps to reproduce below. I can confirm that 0.9.4-dev behaves much better.
I was testing the reschedule/failover of jobs/processes as before. The behavior I see is: "for a short period of time there are more jobs/processes than the configured count of jobs in the cluster". I believe the Nomad client restores the jobs (even if the PID does not exist anymore), then it connects with the server, and then suddenly there are no processes anymore. Please check the steps on how to reproduce; every time I repeat these steps I am able to reproduce this behavior.
Reproduction steps
Job file (if appropriate)
I want to mention that I intentionally did not configure KillSignal=SIGINT; systemd effectively killed the dummyapps, which caused the failed allocations.
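For context, a sketch of the systemd side of this (not the actual unit file from this cluster; paths and values are illustrative):

```ini
# nomad.service (illustrative). With systemd's default KillMode=control-group
# and no KillSignal set, stopping the unit (for example during "sudo reboot")
# signals every process in Nomad's cgroup, so exec/raw_exec tasks such as the
# dummyapp are killed together with the agent. The commonly published Nomad
# unit files use KillMode=process and KillSignal=SIGINT instead, so tasks are
# not killed along with the agent.
[Unit]
Description=Nomad agent

[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
# KillMode=process
# KillSignal=SIGINT
Restart=on-failure

[Install]
WantedBy=multi-user.target
```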
Thanks, @jozef-slezak, will parse this after ☕️
Did the Nomad client restart those jobs? When I tested this, my Nomad client does not restart the jobs... the following is in my logs:
This was one of the recent changes, where Nomad will not restart a task until it has contacted the server. The 0.9.4 RC is scheduled for the end of this week.
@cgbaker Nomad client+server was killed by
@jozef-slezak, I'm not able to reproduce this. As in my previous comment, when I restart the client, it does not restart the processes associated with the previous allocations until it has contacted the servers. Also, this single GitHub issue has spanned multiple different problems and multiple versions of Nomad. If you are amenable, I would like to close this issue and have you test the current concern with the 0.9.4 RC (which was planned for tomorrow, but will happen very soon).
@jozef-slezak Thank you for your patience here. Looking at the notes, I suspect you are hitting #5984, which is addressed in #6207 and #6216. Would you be able to test those PRs in your cluster by any chance?
Sure, I can retest the PRs. Please send me a link to the compiled binary (ideally including the UI).
We are restarting a Nomad cluster (3 Nomad servers and tens of Nomad clients, all physical machines). Same test scenario as #5917 and #5908, but a different bug report.
Please check if this is related to #5669
Nomad version
0.9.3
Operating system and Environment details
Linux CentOS
Issue
After restarting the whole cluster (not every time, but it can be reproduced by repeating restarts) we see a higher number of instances than expected (92 instead of Count = 70). Nomad does not respect counts and constraints.
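A sketch of the stanzas meant by "counts and constraints" (a fragment of a job file; the group name, count, and constraint are illustrative, not the actual spec):

```hcl
# Illustrative fragment (goes inside a job block): the scheduler should place
# at most "count" instances of this group and should honor the constraint,
# but after the full-cluster restart 92 instances were running instead of 70.
group "workers" {
  count = 70

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }
}
```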
Reproduction steps
Restart all nomad servers and clients (sudo reboot).
Check the Nomad console - it seems that some allocations were "started twice".