
unhealthy deployment with poststart lifecycle #10058

Closed
tcdev0 opened this issue Feb 22, 2021 · 17 comments · Fixed by #11945
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/task lifecycle type/bug

Comments

@tcdev0

tcdev0 commented Feb 22, 2021

Nomad version

Nomad v1.0.3 (08741d9)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

A deployment with a poststart task is immediately marked as unhealthy and stays an active deployment until the deadline hits.
The service is registered as healthy in Consul and reachable via Traefik.

[screenshot: nomad_poststart_1]

[screenshot: nomad_poststart_2]

After removing the poststart task, the deployment runs fine.

I guess the poststart task exits too fast (under 10s) and therefore never satisfies the default min_healthy_time of 10s.

Reproduction steps

Job file

job "nginx" {

  datacenters = ["prod"]
  namespace   = "default"
  type        = "service"

  group "nginx" {
    count = 1
    service {
      name = "${BASE}"
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.nginx.entrypoints=websecure",
        "traefik.http.routers.nginx.rule=Host(`nginx.domain`)",
        "traefik.http.routers.nginx.tls=true"
      ]
      port = "http"
      check {
        name     = "alive"
        type     = "tcp"
        interval = "2s"
        timeout  = "2s"
      }
    }

    network {
      port "http" {
        to = 80
      }
    }

    update {
      min_healthy_time  = "10s"
      healthy_deadline  = "5m"
      progress_deadline = "10m"
    }

    task "nginx-web" {
      driver = "docker"
      config {
        image = "nginx:latest"
        ports = ["http"]
      }
    }

    task "nginx-poststart" {
      driver = "docker"
      config {
        image   = "alpine:latest"
        command = "sh"
        args    = ["-c", "sleep 5 && echo 'running....' && exit 0;"]

      }
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }
    }
  }
}

Forcing the poststart task to wait out the web service's min_healthy_time before exiting fixed the issue.
So I set the sleep timer slightly above min_healthy_time = "10s" and it works.

    task "nginx-poststart" {
      driver = "docker"
      config {
        image   = "alpine:latest"
        command = "sh"
        args    = ["-c", "sleep 12 && echo 'running....' && exit 0;"]

      }
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }
    }

[screenshot: nomad_poststart_3]

Seen in #9361.

Is this a bug or working as expected?

Thanks.

@tgross
Member

tgross commented Feb 22, 2021

Hi @tcdev0. Looks like the fix for #9361 shipped in Nomad 1.0.2. This problem is slightly different, and looks like a bug emerging from the interaction of three things:

  • A post start task
  • That's not a sidecar (and so should be expected to exit)
  • With a minimum healthy time in the update block

To make this work without your workaround (which looks sensible), we'd have to update the allocation health logic to verify that the exit code is successful for poststart tasks if we find they aren't "healthy" by the minimum healthy time.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/needs-investigation labels Feb 22, 2021
@tommyalatalo

We also observed this behavior today: a task with a poststart hook (not a sidecar) and min_healthy_time set in the update stanza. Will be following this; definitely in need of a fix.

@maxless

maxless commented Sep 3, 2021

+1
Moreover, it feels weird that settings in the update stanza control prestart/poststart/poststop tasks. One of their intended uses, as shown in the documentation for the lifecycle stanza, is to just send an HTTP request to a chat channel and exit, which is precisely what fails in the current implementation: the allocation is marked unhealthy. I guess either there needs to be some flag in lifecycle, like "ignore_update", or a separate task-level update block that overrides the group update block. It does work for the prestart/poststop cases, so two-thirds of the job is done.
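For illustration only, a hypothetical per-task override along these lines might look like the sketch below. The `ignore_update` flag does not exist in Nomad, and the notification task and webhook URL are invented for the example; only the lifecycle hook syntax is real.

```hcl
task "notify-chat" {
  driver = "docker"
  config {
    image   = "curlimages/curl:latest"
    command = "curl"
    # hypothetical chat webhook, for illustration only
    args = ["-X", "POST", "-d", "deploy finished", "https://chat.example.com/hook"]
  }
  lifecycle {
    hook    = "poststart"
    sidecar = false
    # hypothetical flag -- NOT valid Nomad syntax today: a clean exit
    # would count as healthy regardless of update.min_healthy_time
    # ignore_update = true
  }
}
```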

@Vadim-Che

I have Nomad v1.1.0 but hit the same issue. My jobs have poststart and poststop tasks. The poststop task is marked with "Task not running by deadline":

[screenshot]

But the strangest thing is that other services deployed with the same scheme are OK (not marked as failed):

[screenshot]

Will this be fixed in the near future?

@Oloremo
Contributor

Oloremo commented Oct 5, 2021

Hit that too on Nomad 1.1.5

@havenith

havenith commented Nov 4, 2021

+1
I have the same problem. The above sample fails on Nomad 1.1.6. In our case we have a long-running Java process and a short-running (sub-ten-second), non-sidecar poststart task.
The deployment always fails with 'Task not running for min_healthy_time of 10s by deadline' even though the long-running Java task is healthy in both Nomad and Consul.
Padding out the length of the poststart task doesn't seem to help, though, so we don't have a workaround currently.
Is there a rough estimate on when a fix might come out?

@beautifulentropy
Contributor

Experiencing this too. As a new Nomad user I burned a ton of cycles on this before realizing this was just a bug.

@axsuul
Contributor

axsuul commented Dec 28, 2021

Still getting this issue on Nomad 1.2.0

@SunSparc

SunSparc commented Jan 4, 2022

Same, on Nomad 1.2.0. I changed my poststart task to use the exec driver and handed it a bash script so I could apply the sleep kludge the OP used. This also got me past the problem, for now.
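A sketch of that workaround, assuming the group uses the 10s default min_healthy_time; the script path is illustrative:

```hcl
task "nginx-poststart" {
  driver = "exec"
  config {
    command = "/bin/bash"
    # pad the task's runtime past min_healthy_time (10s default) so it
    # is still running when the allocation health check evaluates it
    args = ["-c", "/local/do-poststart-work.sh && sleep 12"]
  }
  lifecycle {
    hook    = "poststart"
    sidecar = false
  }
}
```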

@ksklareski

FYI issue persists in 1.2.4. Sleeping workaround still works.

@beautifulentropy
Contributor

beautifulentropy commented Jan 24, 2022

A note for others who attempt to fix this with a sleep of >= min_healthy_time: this workaround results in a successful initial deployment. However, any subsequent deployment that only creates new allocations (a "forces create" update, as opposed to a "forces create/destroy update") will leave all previously healthy allocations marked unhealthy; the new allocations will be healthy, though. The only real workaround I could find that truly avoids the issue is making your existing ephemeral poststart hooks long-running sidecars instead.
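That sidecar workaround can be sketched as follows, assuming the one-shot work can be followed by an indefinite wait (`sleep infinity` requires GNU coreutils; with BusyBox sh, `tail -f /dev/null` is a common substitute):

```hcl
task "bar" {
  driver = "raw_exec"
  config {
    command = "sh"
    # do the one-shot work, then block forever instead of exiting so
    # the task stays alive for the lifetime of the allocation
    args = ["-c", "echo 'running....' && sleep infinity"]
  }
  lifecycle {
    hook    = "poststart"
    sidecar = true # long-running: stays up, so the allocation stays healthy
  }
}
```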

OS macOS 12.1
Nomad v1.2.4
Consul v1.11.2

Repro Steps

  1. Run the following job and wait for the first deployment to succeed.
  2. Ensure the post-start allocation bar shows as having exited successfully.
  3. Change count = 1 to count = 2
  4. Run the updated job.
  5. The original 'Healthy' allocations will now be 'Unhealthy' and the newly added allocations will be 'Healthy'.
Job file

job "foobar" {

  datacenters = ["dev"]
  type        = "service"

  group "foobar" {
    count = 1

    task "foo" {
      driver = "raw_exec"
      config {
        command = "sh"
        args    = ["-c", "sleep 5000000"]
      }
    }

    task "bar" {
      driver = "raw_exec"
      config {
        command = "sh"
        args    = ["-c", "sleep 10 && echo 'running....' && exit 0;"]
      }
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }
    }
  }
}

Screenshots

(deployment 1)
Screen Shot 2022-01-24 at 5 33 51 PM

(deployment 2)
Screen Shot 2022-01-24 at 5 34 43 PM

@edudobay

edudobay commented Feb 3, 2022

Should the same changes be applied to prestart tasks? The PR above only changed the behaviour for poststart tasks, but it seems like prestart should have been included as well.

@tgross
Member

tgross commented Feb 3, 2022

@edudobay the code touched by #11945 already had a special case for prestart, and there wasn't any other discussion of prestart in this issue. If you're encountering something similar there, can you open a new issue for it? Thanks!

beautifulentropy added a commit to letsencrypt/attache that referenced this issue Feb 4, 2022
Prior to this change, we avoided hashicorp/nomad#10058 by adding a long
sleep before attache-control would otherwise exit successfully. Unfortunately,
subsequent scaling deployments still fail once the sleep incurred by the
initial deployment has expired.

This change makes attache-control a proper sidecar (long running ephemeral
post-start task). Instead of exiting after a successful run, attache-control
will stop the ticker channel and continue running until it receives a kill
signal.

- Configure attache-control as a side-car in the example nomad job
- Remove attempt-limit flag and cliOpts.attemptLimit
- Split the main func of attache-control into helpers
- Add helper methods to scalingOpts 
- Replace calls to time.Tick with time.NewTicker
- Add renewChan (chan struct{}) and acquired (bool) to the lock struct
- lock.Acquire() sets lock.acquired to true, initializes lock.renewChan, and calls
  lock.periodicallyRenew() in a long-running go-routine
- lock.Cleanup() closes lock.renewChan and sets lock.acquired to false

Fixes #6
Part Of #11
@shantanugadgil
Contributor

Is a v1.2.7 anticipated, or should we just wait for v1.3.0 GA?

@tgross
Member

tgross commented Apr 27, 2022

We'll ship backports to 1.2.7 and 1.1.x when we ship 1.3.0 GA, which is happening soon!

@shantanugadgil
Contributor

Thanks! 🚀

@github-actions

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022