Successfully completed batch job is re-run with new allocation. #4299
Comments
We have also been having this behaviour since the reschedule policies were introduced in 0.8. We currently have a partial workaround, but would really appreciate a fix. We run 3 servers tied into an ASG of clients, if that helps at all. |
Thanks for reporting this issue- we will investigate this further on our end. |
No effect. We will be rolling back to a previous version of Nomad. @Lamboona Could you let me know what your workaround was? |
@nugend We run a lot of ephemeral jobs with small units of work, so we just discounted the work that was done out of order. It only partially works, but the extra work it causes is negligible enough that we can live with it, for now. The benefit we get from having the reschedule policy in the first place far outweighs the cost of the extra work. |
@dadgar At what point? The original allocation is garbage collected when the logs say it's garbage collected. Do you mean I need to guess which allocation is going to experience this issue and capture it? Also, we're not running this version right now. I can request that we deploy this to our QA environment again if this is vital to your debugging efforts, but I'm disinclined to return to the affected version. |
@nugend - I've been looking at this and trying to reproduce. If you have debug logging turned on and the garbage collector collected this job, you should see log lines like the following if the server tried to force garbage collect the completed allocation
You haven't included server logs - would you be able to attach server logs from around the time you think it tried to restart? Another thing I noticed from the pasted job file is that the ID contains
And we'd still like the info @dadgar requested - you don't have to guess which allocs will get restarted. You should be able to capture the alloc status output when you first dispatch the job (before the 24 hour window that it uses to garbage collect), and the output of
Thanks for the help in narrowing this down, I realize this is a lot to ask but details on the above are greatly appreciated. |
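A minimal sketch of how that capture could be automated right after dispatch, before the GC window elapses; it assumes the HTTP API is reachable on the default port, and the child job ID shown is a hypothetical placeholder:

```bash
#!/usr/bin/env bash
# Snapshot the evaluations and allocations of a dispatched batch job right
# after it is created, so the data survives even if the job is later GCed.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
JOB_ID="example-batch/dispatch-1526498244-af40fe57"   # hypothetical child job ID

curl -s "${NOMAD_ADDR}/v1/job/${JOB_ID}/evaluations" > "evals-$(date +%s).json"
curl -s "${NOMAD_ADDR}/v1/job/${JOB_ID}/allocations" > "allocs-$(date +%s).json"
```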
@preetapan We were running three nodes in a combined server/client mode. I may have failed to configure the logging correctly to produce logs from both server and client, or otherwise simply did not include those specific log lines. I'll check to see if we still have those particular log statements in this case. The job itself was not parameterized, but its parent was. The parent was not re-executed at any point (it runs for a very long time, since those jobs act as dynamic schedulers of sub-tasks which they monitor for work conditions). I'll see if we can automate that collection. The problem is that these one-off batch jobs are created dynamically as the children of the aforementioned parameterized job (or rather, of the many parameterized jobs we've engineered to work in this way). That is why I'd have to speculatively collect them. At any rate, it may be a bit before I can justify working on this. |
Yes, that is expected - each child job has to be invoked explicitly via dispatch. Can you share the parent job's spec while we wait for more info about the job/allocations? |
Sharing the job spec of the job I used to try to repro:
|
No, you are misunderstanding: the job which is rescheduled is not a parameterized job. It is a one-off batch job that has been customized to run a particular batch load. The parent of that one-off job is a dispatched instance of a parameterized job. I manually specified the dispatch instance as parent in the creation of the one-off batch job. |
FWIW - we've seen some similar issues with one-shot batch jobs being rescheduled during our 0.8.3 testing. I hadn't made the connection to GC yet, but I can take a look for that in the logs tomorrow. |
@preetapan Sadly, we had to put our 0.8.3 testing on hold for a couple of weeks after running into this, but I'll see if I can get a repro case or do more digging early next week |
@Lamboona will you be able to provide any debug info here - I need the output of |
To add to what Preetha has asked for, what would be ideal is the following:
|
I don't know about the other users, but in our configurations Nomad is running in both client and server mode, so if there's a way to get the allocation information from the server and not the client, I would love to know how. |
@nugend The endpoint I gave would work in that mode |
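For reference, a hedged example of what that looks like against the local agent's HTTP API; the address and allocation ID are placeholders:

```bash
# Fetch a single allocation's full record (task states, events, reschedule
# tracking) from the agent's HTTP API. This works whether the agent runs as
# a client, a server, or both.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
ALLOC_ID="00000000-0000-0000-0000-000000000000"   # placeholder allocation ID

curl -s "${NOMAD_ADDR}/v1/allocation/${ALLOC_ID}"
```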
@preetapan We run the GC more frequently, so I will have to make some environment changes to actually be able to inspect both of the allocations at once; we only discovered them because of the effect they were having. If I get a chance tomorrow I will spin up an environment with debug logging and the normal GC settings, to try to get some of these stats. |
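A sketch of the server-side settings that control that window, in case it helps widen it long enough to capture both allocations; the file path and threshold values are illustrative only, not recommendations:

```bash
# Raise the GC thresholds so completed jobs and their evals stay queryable
# for longer. Drop this into the agent's config directory on the servers
# (path and values are illustrative).
cat > /etc/nomad.d/gc.hcl <<'EOF'
server {
  enabled           = true
  job_gc_threshold  = "24h"
  eval_gc_threshold = "24h"
}
EOF
```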
Here are some debug logs that I have managed to scrape together; this is the original allocation.
The following allocation, 0350af5e-4203-7976-f90f-cfddae3b573f, was running again by 12:12:11
I haven't been able to get any of the API responses for these out yet; having debugging turned on seems to cause the servers to bottleneck, and then they don't re-elect a leader. I will try forcing this again but with debugging turned off. One thing I have noticed looking at these logs is that the runner log messages seem to have occurred for both of the allocations before either of the tasks actually ran. For a little more information, the job itself would have been blocked for some time before it was able to be allocated. They are posted to the cluster using TBC |
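If it helps with forcing the window deliberately, the server-side garbage collector can be triggered on demand through the system API; a minimal sketch (the address is a placeholder):

```bash
# Trigger an immediate garbage-collection cycle on the servers - the point
# at which the unexpected re-runs described in this thread have been seen.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
curl -s -X PUT "${NOMAD_ADDR}/v1/system/gc"
```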
@preetapan I have managed to get some of the information out of the API. It would seem that our problem doesn't actually stem from the garbage collection but actually happens far earlier in the process, as the batch scheduler seems to be scheduling a single dispatched job multiple times. I can attach the alloc/eval output if you want, or should I open a separate issue? I feel anything else I add might detract from the issue that others are having. |
@Lamboona noted about the GC logs. I am pretty sure that's a red herring. There has to be another eval for a previously completed batch job to reschedule. So the details of the evals and allocs of the dispatched job would help here. |
@preetapan That's the problem: the repeated jobs we are seeing have separate job IDs; previously, with the GC being so frequent, we hadn't actually seen that this was the case, which is why I think this might be a separate issue. If you want me to post the output from the allocs/evals, they aren't exactly small - do you want them pasted directly? |
That's interesting that they have different job IDs - that usually means that something else external is calling the dispatch endpoint. You can add them as attachments here rather than posting them directly as comments if they are huge. Please attach a job spec as well. I can go ahead and create another issue if necessary, but right now this ticket is fine. |
I wasn't able to attach plain JSON files, so I have zipped them together; this shows two separate allocations that go through separate evaluations. The jobs themselves would have been created by a single POST to the HTTP API dispatch endpoint; the TOOL_RUN_ID is the way I can link them together, as these are generated at dispatch as a UUIDv4. |
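For context, a hedged sketch of the kind of dispatch call described here, tagging each invocation with a freshly generated TOOL_RUN_ID so duplicates can be correlated afterwards; the parent job name is a placeholder:

```bash
# Dispatch one instance of a parameterized batch job via the HTTP API,
# attaching a per-invocation UUID as job meta. Assumes the parent job
# declares TOOL_RUN_ID under its parameterized meta_required/meta_optional.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
PARENT_JOB="example-parent"        # placeholder parameterized job name
TOOL_RUN_ID="$(uuidgen)"

curl -s -X POST \
  --data "{\"Meta\": {\"TOOL_RUN_ID\": \"${TOOL_RUN_ID}\"}}" \
  "${NOMAD_ADDR}/v1/job/${PARENT_JOB}/dispatch"
```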
@Lamboona Having looked through all the allocs/evals, I can rule out that this has anything to do with rescheduling. All the evals have a TriggeredBy reason of "job-register", so the second set of dispatched jobs and the allocations for them were not triggered by the scheduler. One thing I noticed from the
That's telling me that those two job IDs were created by two different calls to the job dispatch server endpoint, which in turn are triggered by calls to the HTTP dispatch endpoint. Do you use any HTTP client libraries when wrapping calls to Nomad? Perhaps that layer is doing a retry. |
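One hedged way to check for that from the outside is to list the parent's dispatched children by prefix, since every successful dispatch call creates a distinct child job ID; two children sharing one TOOL_RUN_ID would point at a retried HTTP call. The parent job name is a placeholder:

```bash
# List all dispatched children of the parent job. Each successful dispatch
# call creates one child job ID of the form <parent>/dispatch-<ts>-<id>.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
PARENT_JOB="example-parent"   # placeholder
curl -s "${NOMAD_ADDR}/v1/jobs?prefix=${PARENT_JOB}/dispatch-"
```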
I've been noticing this more and more recently as well, and unfortunately usually only notice it well after the allocations have been garbage collected, which happens extremely frequently in our environment due to our job volume. When I've seen it, I've seen the exact same dispatched job ID in the new allocation ( Here's what I think are the relevant bits of the job configuration; I can add more detail later if necessary. We don't currently have a |
I'll see what I can do to collect allocation and evaluation logs when the job completes, but I figured I'd jump on this issue in the meantime. |
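Since several people here are disabling Nomad-driven retries for one-shot work, a minimal sketch of what that can look like in a job spec; the job, group, and task names, the driver, and the command are placeholders, and the stanza values are just one way to express "run exactly once":

```bash
# A one-shot batch job that Nomad should never restart or reschedule;
# failures are left to an external scheduler to handle.
cat > one-shot-batch.nomad <<'EOF'
job "one-shot-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    restart {
      attempts = 0
      mode     = "fail"
    }

    reschedule {
      attempts  = 0
      unlimited = false
    }

    task "run" {
      driver = "exec"

      config {
        command = "/bin/true"
      }
    }
  }
}
EOF
```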
A hunch: our servers (for historical reasons) have their time zone set to Can other folks maybe chime in with their system clock settings? |
@wyattanderson We run our entire infrastructure on UTC time (though things could be in any timezone). Would it be possible to switch your setups to UTC and see if the problem goes away (or occurs anyway)? |
In a test environment, I was able to hit this issue 1-5% of the time, depending on various configurations and potentially the position of the moon. I added my notes, reproduction steps, and logs in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug . It does seem that the problem is related to GC - it looks like job deregistering somehow results in new allocations for re-runs. This matches @camerondavison's observation of the eval being We are digging into this further and we'll be happy to assist in debugging this or testing any hypotheses/patches, as this is blocking our adoption of Nomad 0.8. |
Fixes hashicorp#4299 There is a race between job purging and job-deregister evaluation handling that seems to cause completed jobs to re-run unexpectedly when they are being GCed. Here, we make `BatchDeregister` behave just like `Deregister`, where we submit the `job-deregister` evals after `JobBatchDeregisterRequest` is committed to Raft.
Finally able to capture this. The redispatched allocation:
and the evaluation:
|
@wyattanderson I think that matches my observation too. One workaround is to run a separate GC process that watches for completed Nomad jobs and immediately calls |
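A hedged sketch of what such a watcher could look like, assuming the intent is to purge dead batch jobs right away so the periodic server-side GC never has to collect them; it relies on jq and the standard job list and delete endpoints, and the polling interval is arbitrary:

```bash
# Poll for completed (dead) batch jobs and purge them immediately, so the
# periodic server-side GC has nothing left to collect for them.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

while true; do
  curl -s "${NOMAD_ADDR}/v1/jobs" \
    | jq -r '.[] | select(.Type == "batch" and .Status == "dead") | .ID' \
    | while read -r job_id; do
        curl -s -X DELETE "${NOMAD_ADDR}/v1/job/${job_id}?purge=true"
      done
  sleep 60
done
```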
Has anyone seen any of this behavior cease after upgrading to 0.8.6 or anything? |
@nugend I was able to reproduce it with the latest Nomad master in my tests. In a simulated environment, two jobs re-ran, as mentioned in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug/blob/master/run-1538770428/output.txt#L47-L48 |
Holy moly, I didn't notice how much work you've put into debugging this already! Since you have a script to reproduce and detect the issue, have you tried running it through? I know for a fact you can put the start at 0.7.1 (we're using that in production and haven't seen the issue). |
We've noticed that some jobs which failed, and were explicitly configured to not restart or reschedule, just get scheduled again on 0.7.1. I can only conclude that this has always been an issue, and it's just certain garbage collection policy changes that cause it to be more or less prevalent. |
@nugend Interesting. You are right in that it's a GC change; I believe the relevant change is #3982. I have a WIP fix in #4744, but we are researching other ways to mitigate the condition before committing that PR. I'm not so sure about pre-0.8 - in my previous job, prior to upgrading Nomad to 0.8, we saw a job retry rate of ~0.05%, which we attributed to noise and infrastructure-related failures; but the rate jumped to 2-3% when we upgraded to 0.8. It's possible that there is another triggering condition that I missed in the noise. When possible, can you provide some logs and API results for the re-run jobs and associated evaluations/allocations as suggested above? That would help us tremendously. FYI - I just joined the Nomad team, and I'll be working on this issue again very soon. |
Fixes #4299 Upon investigating this case further, we determined the issue to be a race between applying the `JobBatchDeregisterRequest` FSM operation and processing job-deregister evals. Processing job-deregister evals should wait until the FSM log message finishes applying, by using the snapshot index. However, with `JobBatchDeregister`, applying any single individual job deregistration accidentally incremented the snapshot index and resulted in job-deregister evals being processed. When a Nomad server receives an eval for a job in the batch that is yet to be deleted, we accidentally re-run it depending on the state of the allocation. This change ensures that we deregister all of the jobs and insert all evals in a single transaction, thus blocking processing of related evals until deregistering completes.
@nugend Thanks for your patience! We have resolved this issue and intend to ship a point release with the fix soon. Would you be able to vet a test build in your environment soon? I'll update the ticket as soon as we cut a release. |
I've already built a release from the master branch. We'll be testing it in a few days. |
@nugend master has a lot of churn due to ongoing work with driver plugins, so would definitely not recommend running a build based on master in production. |
@notnoop Would this problem also manifest itself on batch allocations that fail? I am seeing a batch job run too many times (the restart policy is set to only 1), and once it failed it still restarted. |
I set the restart parameter to 0 (zero) for my AMI build jobs which I want to run only once. |
Sorry, I meant to say that it restarted at least 4 times, and it is still in the process of restarting. |
Just want to say this does seem to be fixed in 0.8.7 and I would suggest anyone experiencing it on an earlier build should try upgrading. |
Nomad version
Nomad v0.8.3 (c85483d)
Operating system and Environment details
3.10.0-327.36.3.el7.x86_64
Issue
A batch job executed and completed successfully; then, several hours later, when the allocation was garbage collected, it was re-run.
Reproduction steps
Not sure. Seems to be happening frequently on our cluster though.
Nomad logs
Job file (if appropriate)
What I can tell you for sure is that the allocation ran to completion and exited successfully.
We're going to try turning off the reschedule and restart policies to see if that has any effect since we're taking care of re-running these on any sort of job failure anyway.