
unhealthy deployment with poststart lifecycle #10058

Closed
tcdev0 opened this issue Feb 22, 2021 · 17 comments · Fixed by #11945
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/task lifecycle type/bug

Comments

@tcdev0

tcdev0 commented Feb 22, 2021

Nomad version

Nomad v1.0.3 (08741d9)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

A deployment with a poststart task is immediately marked as unhealthy and stays an active deployment until the deadline hits.
The service is registered as healthy in Consul and reachable via Traefik.

[screenshot: nomad_poststart_1]

[screenshot: nomad_poststart_2]

After removing the poststart task, the deployment runs fine.

I guess the poststart task exits too fast (under 10s) and therefore never satisfies the default min_healthy_time of 10s.

Reproduction steps

Job file

job "nginx" {

  datacenters = ["prod"]
  namespace   = "default"
  type        = "service"

  group "nginx" {
    count = 1
    service {
      name = "${BASE}"
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.nginx.entrypoints=websecure",
        "traefik.http.routers.nginx.rule=Host(`nginx.domain`)",
        "traefik.http.routers.nginx.tls=true"
      ]
      port = "http"
      check {
        name     = "alive"
        type     = "tcp"
        interval = "2s"
        timeout  = "2s"
      }
    }

    network {
      port "http" {
        to = 80
      }
    }

    update {
      min_healthy_time  = "10s"
      healthy_deadline  = "5m"
      progress_deadline = "10m"
    }

    task "nginx-web" {
      driver = "docker"
      config {
        image = "nginx:latest"
        ports = ["http"]
      }
    }

    task "nginx-poststart" {
      driver = "docker"
      config {
        image   = "alpine:latest"
        command = "sh"
        args    = ["-c", "sleep 5 && echo 'running....' && exit 0;"]

      }
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }
    }
  }
}

Forcing the poststart task to wait out the web service's min_healthy_time before exiting fixed the issue.
So I set the sleep timer slightly above min_healthy_time = "10s" and it works.

    task "nginx-poststart" {
      driver = "docker"
      config {
        image   = "alpine:latest"
        command = "sh"
        args    = ["-c", "sleep 12 && echo 'running....' && exit 0;"]

      }
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }
    }

[screenshot: nomad_poststart_3]

Seen in #9361.

Is this a bug or working as expected?

Thanks.

@tgross
Member

tgross commented Feb 22, 2021

Hi @tcdev0. Looks like the fix for #9361 shipped in Nomad 1.0.2. This problem is slightly different, and looks like a bug emerging from the interaction of three things:

  • A post start task
  • That's not a sidecar (and so should be expected to exit)
  • With a minimum healthy time in the update block

To make this work without your workaround (which looks sensible), we'd have to update the allocation health logic to verify that the exit code is successful for poststart tasks if we find they aren't "healthy" by the minimum healthy time.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/needs-investigation labels Feb 22, 2021
@tommyalatalo

We also observed this behavior today: a task with a poststart hook (not a sidecar) and min_healthy_time set in the update stanza. Will be following this; definitely in need of a fix.

@maxless

maxless commented Sep 3, 2021

+1
Moreover, it feels weird that settings in the update stanza control prestart/poststart/poststop tasks. One of their intended uses, as shown in the documentation for the lifecycle stanza, is to just send an HTTP request to a chat channel and exit, which is precisely what fails in the current implementation: the allocation is marked unhealthy. I guess either there needs to be some flag in lifecycle, like "ignore_update", or a separate task-level update block that overrides the group update block. It does work for the prestart/poststop cases, so two-thirds of the job is done.
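For illustration only, a hypothetical per-task override along these lines might look like the sketch below. The `ignore_update` flag does not exist in Nomad, and the notification task and webhook URL are invented for the example; only the lifecycle hook syntax is real.

```hcl
task "notify-chat" {
  driver = "docker"
  config {
    image   = "curlimages/curl:latest"
    command = "curl"
    # hypothetical chat webhook, for illustration only
    args = ["-X", "POST", "-d", "deploy finished", "https://chat.example.com/hook"]
  }
  lifecycle {
    hook    = "poststart"
    sidecar = false
    # hypothetical flag -- NOT valid Nomad syntax today: a clean exit
    # would count as healthy regardless of update.min_healthy_time
    # ignore_update = true
  }
}
```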

@Vadim-Che

I have Nomad v1.1.0 but hit the same issue. My jobs have poststart and poststop tasks. The poststop task is marked with "Task not running by deadline":

[screenshot]

But the strangest thing is that other services deployed with the same scheme are OK (not marked as failed):

[screenshot]

Will this be fixed in the near future?

@Oloremo
Contributor

Oloremo commented Oct 5, 2021

Hit that too on Nomad 1.1.5

@havenith

havenith commented Nov 4, 2021

+1
I have the same problem. The above sample fails on Nomad 1.1.6. In our case we have a long-running Java process and a short-running (sub-ten-second), non-sidecar poststart task.
The deployment always fails with 'Task not running for min_healthy_time of 10s by deadline' even though the long-running Java task is healthy in both Nomad and Consul.
Padding out the length of the poststart task doesn't seem to help, though, so we don't have a workaround currently.
Is there a rough estimate on when a fix might come out?

@beautifulentropy
Contributor

Experiencing this too. As a new Nomad user I burned a ton of cycles on this before realizing this was just a bug.

@axsuul
Contributor

axsuul commented Dec 28, 2021

Still getting this issue on Nomad 1.2.0

@SunSparc

SunSparc commented Jan 4, 2022

Same, on Nomad 1.2.0. I changed my poststart task to use the exec driver and handed it a bash script so I could apply the sleep kludge the OP used. This also got me past the problem, for now.
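A sketch of that workaround, assuming the group uses the 10s default min_healthy_time; the script path is illustrative:

```hcl
task "nginx-poststart" {
  driver = "exec"
  config {
    command = "/bin/bash"
    # pad the task's runtime past min_healthy_time (10s default) so it
    # is still running when the allocation health check evaluates it
    args = ["-c", "/local/do-poststart-work.sh && sleep 12"]
  }
  lifecycle {
    hook    = "poststart"
    sidecar = false
  }
}
```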

@ksklareski

FYI issue persists in 1.2.4. Sleeping workaround still works.

@beautifulentropy
Contributor

beautifulentropy commented Jan 24, 2022

A note for others who attempt to fix this with a sleep of >= min_healthy_time: this workaround results in a successful initial deployment. However, any subsequent deployment that only creates new allocations (a "forces create" update, as opposed to a "forces create/destroy update") will leave all previously healthy allocations marked unhealthy; the new allocations will be healthy, though. The only real workaround I could find that truly avoids the issue is making your existing ephemeral poststart hooks long-running sidecars instead.
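That sidecar workaround can be sketched as follows, assuming the one-shot work can be followed by an indefinite wait (`sleep infinity` requires GNU coreutils; with BusyBox sh, `tail -f /dev/null` is a common substitute):

```hcl
task "bar" {
  driver = "raw_exec"
  config {
    command = "sh"
    # do the one-shot work, then block forever instead of exiting so
    # the task stays alive for the lifetime of the allocation
    args = ["-c", "echo 'running....' && sleep infinity"]
  }
  lifecycle {
    hook    = "poststart"
    sidecar = true # long-running: stays up, so the allocation stays healthy
  }
}
```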

OS macOS 12.1
Nomad v1.2.4
Consul v1.11.2

Repro Steps

  1. Run the following job and wait for the first deployment to succeed.
  2. Ensure the post-start allocation bar shows as having exited successfully.
  3. Change count = 1 to count = 2
  4. Run the updated job.
  5. The original 'Healthy' allocations will now be 'Unhealthy' and the newly added allocations will be 'Healthy'.
Job file

job "foobar" {

  datacenters = ["dev"]
  type        = "service"

  group "foobar" {
    count = 1

    task "foo" {
      driver = "raw_exec"
      config {
        command = "sh"
        args    = ["-c", "sleep 5000000"]
      }
    }

    task "bar" {
      driver = "raw_exec"
      config {
        command = "sh"
        args    = ["-c", "sleep 10 && echo 'running....' && exit 0;"]
      }
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }
    }
  }
}

Screenshots

(deployment 1)
Screen Shot 2022-01-24 at 5 33 51 PM

(deployment 2)
Screen Shot 2022-01-24 at 5 34 43 PM

@edudobay

edudobay commented Feb 3, 2022

Should the same changes be applied to prestart tasks? The PR above only changed the behaviour for poststart tasks, but it seems like prestart should have been included as well.

@tgross
Member

tgross commented Feb 3, 2022

@edudobay the code touched by #11945 already had a special case for prestart, and there wasn't any other discussion of prestart in this issue. If you're encountering something similar there, can you open a new issue for it? Thanks!

beautifulentropy added a commit to letsencrypt/attache that referenced this issue Feb 4, 2022
Prior to this change, we avoided hashicorp/nomad#10058 by adding a long
sleep before attache-control would otherwise exit successfully. Unfortunately,
subsequent scaling deployments still fail once the sleep incurred by the
initial deployment has expired.

This change makes attache-control a proper sidecar (long running ephemeral
post-start task). Instead of exiting after a successful run, attache-control
will stop the ticker channel and continue running until it receives a kill
signal.

- Configure attache-control as a side-car in the example nomad job
- Remove attempt-limit flag and cliOpts.attemptLimit
- Split the main func of attache-control into helpers
- Add helper methods to scalingOpts 
- Replace calls to time.Tick with time.NewTicker
- Add renewChan (chan struct{}) and acquired (bool) to the lock struct
- lock.Acquire() sets lock.acquired to true, initializes lock.renewChan, and calls
  lock.periodicallyRenew() in a long-running go-routine
- lock.Cleanup() closes lock.renewChan and sets lock.acquired to false

Fixes #6
Part Of #11
@shantanugadgil
Contributor

Is a v1.2.7 anticipated, or should we just wait for v1.3.0 GA?

@tgross
Member

tgross commented Apr 27, 2022

We'll ship backports to 1.2.7 and 1.1.x when we ship 1.3.0 GA, which is happening soon!

@shantanugadgil
Contributor

Thanks! 🚀

@github-actions

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022