Successfully completed batch job is re-run with new allocation. #4299
Comments
We have also been having this behaviour since the reschedule policies were introduced in 0.8. We currently have a partial workaround, but would really appreciate a fix. We run 3 servers tied into an ASG of clients, if that helps at all. |
Thanks for reporting this issue- we will investigate this further on our end. |
No effect. We will be rolling back to a previous version of Nomad. @Lamboona Could you let me know what your workaround was? |
@nugend We run a lot of ephemeral jobs with small units of work, so we just discounted the work that was done out of order. It only partially works, but the extra work it causes is negligible enough that we can live with it, for now. The benefit we get from having the reschedule policy in the first place far outweighs the cost of the extra work. |
@dadgar At what point? The original allocation is garbage collected when the logs say it's garbage collected. Do you mean I need to guess which allocation is going to experience this issue and capture it? Also, we're not running this version right now. I can request that we deploy this to our QA environment again if this is vital to your debugging efforts, but I'm disinclined to return to the affected version. |
@nugend - I've been looking at this and trying to reproduce. If you have debug logging turned on and the garbage collector collected this job, you should see log lines like the following if the server tried to force garbage collect the completed allocation
You haven't included server logs - would you be able to attach server logs from around the time you think it tried to restart? Another thing I noticed from the pasted job file is that the ID contains
And we'd still like the info @dadgar requested - you don't have to guess which allocs will get restarted. You should be able to capture the alloc status output when you first dispatch the job (before the 24 hour window that it uses to garbage collect), and the output of
Thanks for the help in narrowing this down, I realize this is a lot to ask but details on the above are greatly appreciated. |
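A minimal sketch of how that capture could be automated right after dispatch, before the GC window elapses; it assumes the HTTP API is reachable on the default port, and the child job ID shown is a hypothetical placeholder:

```bash
#!/usr/bin/env bash
# Snapshot the evaluations and allocations of a dispatched batch job right
# after it is created, so the data survives even if the job is later GCed.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
JOB_ID="example-batch/dispatch-1526498244-af40fe57"   # hypothetical child job ID

curl -s "${NOMAD_ADDR}/v1/job/${JOB_ID}/evaluations" > "evals-$(date +%s).json"
curl -s "${NOMAD_ADDR}/v1/job/${JOB_ID}/allocations" > "allocs-$(date +%s).json"
```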
@preetapan We were running three nodes in a combined server/client mode. I may have failed to configure the logging correctly to produce logs from both server and client, or otherwise simply did not include those specific log lines. I'll check to see if we still have those particular log statements in this case. The job itself was not parameterized, but its parent was. The parent was not re-executed at any point (it runs for a very long time, since those jobs act as dynamic schedulers of sub-tasks which they monitor for work conditions). I'll see if we can automate that collection. The problem is that these one-off batch jobs are created dynamically as the children of the aforementioned parameterized job (or rather, of the many parameterized jobs we've engineered to work in this way). That is why I'd have to speculatively collect them. At any rate, it may be a bit before I can justify working on this. |
Yes, that is expected - each child job has to be invoked explicitly via dispatch. Can you share the parent job's spec while we wait for more info about the job/allocations? |
Sharing the job spec of the job I used to try to repro:
|
No, you are misunderstanding: the job which is rescheduled is not a parameterized job. It is a one-off batch job that has been customized to run a particular batch load. The parent of that one-off job is a dispatched instance of a parameterized job. I manually specified the dispatch instance as parent in the creation of the one-off batch job. |
FWIW - we've seen some similar issues with one-shot batch jobs being rescheduled during our 0.8.3 testing. I hadn't made the connection to GC yet, but I can take a look for that in the logs tomorrow. |
@preetapan Sadly, we had to put our 0.8.3 testing on hold for a couple of weeks after running into this, but I'll see if I can get a repro case or do more digging early next week |
@Lamboona will you be able to provide any debug info here - I need the output of |
To add to what Preetha has asked for, what would be ideal is the following:
|
I don't know about the other users, but in our configurations Nomad is running in both client and server mode, so if there's a way to get the allocation information from the server and not the client, I would love to know how. |
@nugend The endpoint I gave would work in that mode |
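For reference, a hedged example of what that looks like against the local agent's HTTP API; the address and allocation ID are placeholders:

```bash
# Fetch a single allocation's full record (task states, events, reschedule
# tracking) from the agent's HTTP API. This works whether the agent runs as
# a client, a server, or both.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
ALLOC_ID="00000000-0000-0000-0000-000000000000"   # placeholder allocation ID

curl -s "${NOMAD_ADDR}/v1/allocation/${ALLOC_ID}"
```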
@preetapan We run the GC more frequently, so I will have to make some environment changes to actually be able to inspect both of the allocations at once; we only discovered them because of the effect they were having. If I get a chance tomorrow I will spin up an environment with debug logging and the normal GC settings, to try to get some of these stats. |
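A sketch of the server-side settings that control that window, in case it helps widen it long enough to capture both allocations; the file path and threshold values are illustrative only, not recommendations:

```bash
# Raise the GC thresholds so completed jobs and their evals stay queryable
# for longer. Drop this into the agent's config directory on the servers
# (path and values are illustrative).
cat > /etc/nomad.d/gc.hcl <<'EOF'
server {
  enabled           = true
  job_gc_threshold  = "24h"
  eval_gc_threshold = "24h"
}
EOF
```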
Here are some debug logs that I have managed to scrape together; this is the original allocation.
The following allocation, 0350af5e-4203-7976-f90f-cfddae3b573f, was running again by 12:12:11
I haven't been able to get any of the API responses for these out yet; having debugging turned on seems to cause the servers to bottleneck, and then they don't re-elect a leader. I will try forcing this again but with debugging turned off. One thing I have noticed looking at these logs is that the runner log messages seem to have occurred for both of the allocations before either of the tasks actually ran. For a little more information, the job itself would have been blocked for some time before it was able to be allocated. They are posted to the cluster using TBC |
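If it helps with forcing the window deliberately, the server-side garbage collector can be triggered on demand through the system API; a minimal sketch (the address is a placeholder):

```bash
# Trigger an immediate garbage-collection cycle on the servers - the point
# at which the unexpected re-runs described in this thread have been seen.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
curl -s -X PUT "${NOMAD_ADDR}/v1/system/gc"
```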
@preetapan I have managed to get some of the information out of the API. It would seem that our problem doesn't actually stem from the garbage collection but actually happens far earlier in the process, as the batch scheduler seems to be scheduling a single dispatched job multiple times. I can attach the alloc/eval output if you want, or should I open a separate issue? I feel anything else I add might detract from the issue that others are having. |
@Lamboona noted about the GC logs. I am pretty sure that's a red herring. There has to be another eval for a previously completed batch job to reschedule. So the details of the evals and allocs of the dispatched job would help here. |
@preetapan That's the problem: the repeated jobs we are seeing have separate job IDs; previously, with the GC being so frequent, we hadn't actually seen that this was the case, which is why I think this might be a separate issue. If you want me to post the output from the allocs/evals, they aren't exactly small - do you want them pasted directly? |
That's interesting that they have different job IDs - that usually means that something else external is calling the dispatch endpoint. You can add them as attachments here rather than posting them directly as comments if they are huge. Please attach a job spec as well. I can go ahead and create another issue if necessary, but right now this ticket is fine. |
I wasn't able to attach plain JSON files, so I have zipped them together; this shows two separate allocations that go through separate evaluations. The jobs themselves would have been created by a single POST to the HTTP API dispatch endpoint; the TOOL_RUN_ID is the way I can link them together, as these are generated at dispatch as a UUIDv4. |
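For context, a hedged sketch of the kind of dispatch call described here, tagging each invocation with a freshly generated TOOL_RUN_ID so duplicates can be correlated afterwards; the parent job name is a placeholder:

```bash
# Dispatch one instance of a parameterized batch job via the HTTP API,
# attaching a per-invocation UUID as job meta. Assumes the parent job
# declares TOOL_RUN_ID under its parameterized meta_required/meta_optional.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
PARENT_JOB="example-parent"        # placeholder parameterized job name
TOOL_RUN_ID="$(uuidgen)"

curl -s -X POST \
  --data "{\"Meta\": {\"TOOL_RUN_ID\": \"${TOOL_RUN_ID}\"}}" \
  "${NOMAD_ADDR}/v1/job/${PARENT_JOB}/dispatch"
```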
@Lamboona Having looked through all the allocs/evals, I can rule out that this has anything to do with rescheduling. All the evals have a TriggeredBy reason of "job-register", so the second set of dispatched jobs and the allocations for them were not triggered by the scheduler. One thing I noticed from the
That's telling me that those two job IDs were created by two different calls to the job dispatch server endpoint, which in turn are triggered by calls to the HTTP dispatch endpoint. Do you use any HTTP client libraries when wrapping calls to Nomad? Perhaps that layer is doing a retry. |
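One hedged way to check for that from the outside is to list the parent's dispatched children by prefix, since every successful dispatch call creates a distinct child job ID; two children sharing one TOOL_RUN_ID would point at a retried HTTP call. The parent job name is a placeholder:

```bash
# List all dispatched children of the parent job. Each successful dispatch
# call creates one child job ID of the form <parent>/dispatch-<ts>-<id>.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
PARENT_JOB="example-parent"   # placeholder
curl -s "${NOMAD_ADDR}/v1/jobs?prefix=${PARENT_JOB}/dispatch-"
```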
I've been noticing this more and more recently as well, and unfortunately usually only notice it well after the allocations have been garbage collected, which happens extremely frequently in our environment due to our job volume. When I've seen it, I've seen the exact same dispatched job ID in the new allocation ( Here's what I think are the relevant bits of the job configuration; I can add more detail later if necessary. We don't currently have a |
I'll see what I can do to collect allocation and evaluation logs when the job completes, but I figured I'd jump on this issue in the meantime. |
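Since several people here are disabling Nomad-driven retries for one-shot work, a minimal sketch of what that can look like in a job spec; the job, group, and task names, the driver, and the command are placeholders, and the stanza values are just one way to express "run exactly once":

```bash
# A one-shot batch job that Nomad should never restart or reschedule;
# failures are left to an external scheduler to handle.
cat > one-shot-batch.nomad <<'EOF'
job "one-shot-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    restart {
      attempts = 0
      mode     = "fail"
    }

    reschedule {
      attempts  = 0
      unlimited = false
    }

    task "run" {
      driver = "exec"

      config {
        command = "/bin/true"
      }
    }
  }
}
EOF
```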
A hunch: our servers (for historical reasons) have their time zone set to Can other folks maybe chime in with their system clock settings? |
@wyattanderson We run our entire infrastructure on UTC time (though things could be in any timezone). Would it be possible to switch your setups to UTC and see if the problem goes away (or occurs anyway)? |
In a test environment, I was able to hit this issue 1-5% of the time, depending on various configurations and potentially the position of the moon. I added my notes, reproduction steps, and logs in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug . It does seem that the problem is related to GC - it looks like job deregistering somehow results in new allocations for re-runs. This matches @camerondavison's observation of the eval being We are digging into this further and we'll be happy to assist in debugging this or testing any hypotheses/patches, as this is blocking our adoption of Nomad 0.8. |
Fixes hashicorp#4299 There is a race between job purging and job-deregister evaluation handling that seems to cause completed jobs to re-run unexpectedly when they are being GCed. Here, we make `BatchDeregister` behave just like `Deregister`, where we submit the `job-deregister` evals after `JobBatchDeregisterRequest` is committed to Raft.
Finally able to capture this. The redispatched allocation:
and the evaluation:
|
@wyattanderson I think that matches my observation too. One workaround is to run a separate GC process that watches for completed Nomad jobs and immediately calls |
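A hedged sketch of what such a watcher could look like, assuming the intent is to purge dead batch jobs right away so the periodic server-side GC never has to collect them; it relies on jq and the standard job list and delete endpoints, and the polling interval is arbitrary:

```bash
# Poll for completed (dead) batch jobs and purge them immediately, so the
# periodic server-side GC has nothing left to collect for them.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

while true; do
  curl -s "${NOMAD_ADDR}/v1/jobs" \
    | jq -r '.[] | select(.Type == "batch" and .Status == "dead") | .ID' \
    | while read -r job_id; do
        curl -s -X DELETE "${NOMAD_ADDR}/v1/job/${job_id}?purge=true"
      done
  sleep 60
done
```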
Has anyone seen any of this behavior cease after upgrading to 0.8.6 or anything? |
@nugend I was able to reproduce it with the latest Nomad master in my tests. In a simulated environment, two jobs re-ran, as mentioned in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug/blob/master/run-1538770428/output.txt#L47-L48 |
Holy moly, I didn't notice how much work you've put into debugging this already! Since you have a script to reproduce and detect the issue, have you tried running it through? I know for a fact you can put the start at 0.7.1 (we're using that in production and haven't seen the issue). |
We've noticed that some jobs which failed, and were explicitly configured to not restart or reschedule, just get scheduled again on 0.7.1. I can only conclude that this has always been an issue, and it's just certain garbage collection policy changes that cause it to be more or less prevalent. |
@nugend Interesting. You are right in that it's a GC change; I believe the relevant change is #3982. I have a WIP fix in #4744, but we are researching other ways to mitigate the condition before committing that PR. I'm not so sure about pre-0.8 - in my previous job, prior to upgrading Nomad to 0.8, we saw a job retry rate of ~0.05%, which we attributed to noise and infrastructure-related failures; but the rate jumped to 2-3% when we upgraded to 0.8. It's possible that there is another triggering condition that I missed in the noise. When possible, can you provide some logs and API results for the re-run jobs and associated evaluations/allocations as suggested above? That would help us tremendously. FYI - I just joined the Nomad team, and I'll be working on this issue again very soon. |
Fixes #4299 Upon investigating this case further, we determined the issue to be a race between applying the `JobBatchDeregisterRequest` FSM operation and processing job-deregister evals. Processing job-deregister evals should wait until the FSM log message finishes applying, by using the snapshot index. However, with `JobBatchDeregister`, applying any single individual job deregistration accidentally incremented the snapshot index and resulted in job-deregister evals being processed. When a Nomad server receives an eval for a job in the batch that is yet to be deleted, we accidentally re-run it depending on the state of the allocation. This change ensures that we deregister all of the jobs and insert all evals in a single transaction, thus blocking processing of related evals until deregistering completes.
@nugend Thanks for your patience! We have resolved this issue and intend to ship a point release with the fix soon. Would you be able to vet a test build in your environment soon? I'll update the ticket as soon as we cut a release. |
I've already built a release from the master branch. We'll be testing it in a few days. |
@nugend master has a lot of churn due to ongoing work with driver plugins, so would definitely not recommend running a build based on master in production. |
@notnoop Would this problem also manifest itself on batch allocations that fail? I am seeing a batch job run too many times (the restart policy is set to only 1), and once it failed it still restarted. |
I set the restart parameter to 0 (zero) for my AMI build jobs which I want to run only once. |
Sorry, I meant to say that it restarted at least 4 times, and it is still in the process of restarting. |
Just want to say this does seem to be fixed in 0.8.7 and I would suggest anyone experiencing it on an earlier build should try upgrading. |
Nomad version
Nomad v0.8.3 (c85483d)
Operating system and Environment details
3.10.0-327.36.3.el7.x86_64
Issue
A batch job executed and completed successfully; then, several hours later, when the allocation was garbage collected, it was re-run.
Reproduction steps
Not sure. Seems to be happening frequently on our cluster though.
Nomad logs
Job file (if appropriate)
What I can tell you for sure is that the allocation ran to completion and exited successfully.
We're going to try turning off the reschedule and restart policies to see if that has any effect since we're taking care of re-running these on any sort of job failure anyway.