unhealthy deployment with poststart lifecycle #10058
Comments
Hi @tcdev0. Looks like the fix for #9361 shipped in Nomad 1.0.2. This problem is slightly different, and looks like a bug emerging from the interaction of three things:
To make this work without your workaround (which looks sensible), we'd have to update the allocation health status to verify that the exit code is successful for poststart tasks if we find they aren't "healthy" by the minimum healthy time.
We also observed this behavior today, with a task that has a poststart hook (not sidecar) and a min_healthy_time set in the update stanza. Will be following this, definitely in need of a fix.
+1
Hit that too on Nomad 1.1.5.
+1
Experiencing this too. As a new Nomad user I burned a ton of cycles on this before realizing it was just a bug.
Still getting this issue on Nomad 1.2.0.
Same. Nomad 1.2.0. I changed my
FYI, the issue persists in 1.2.4. The sleep workaround still works.
A note for others who attempt to fix this with a sleep. OS: macOS 12.1. Repro steps:
job "foobar" {
datacenters = ["dev"]
type = "service"
group "foobar" {
count = 1
task "foo" {
driver = "raw_exec"
config {
command = "sh"
args = ["-c", "sleep 5000000"]
}
}
task "bar" {
driver = "raw_exec"
config {
command = "sh"
args = ["-c", "sleep 10 && echo 'running....' && exit 0;"]
}
lifecycle {
hook = "poststart"
sidecar = false
}
}
}
}
Should the same changes be applied to
Previous to this change, we've avoided hashicorp/nomad#10058 by adding a long sleep before attache-control would otherwise exit successfully. Unfortunately, this still means that subsequent scaling deployments will fail if the sleep incurred from the initial deployment has expired. This change makes attache-control a proper sidecar (a long-running ephemeral post-start task). Instead of exiting after a successful run, attache-control will stop the ticker channel and continue running until it receives a kill signal.
- Configure attache-control as a sidecar in the example Nomad job
- Remove the attempt-limit flag and cliOpts.attemptLimit
- Split the main func of attache-control into helpers
- Add helper methods to scalingOpts
- Replace calls to time.Tick with time.NewTicker
- Add renewChan (chan struct{}) and acquired (bool) to the lock struct
- lock.Acquire() sets lock.acquired to true, initializes lock.renewChan, and calls lock.periodicallyRenew() in a long-running goroutine
- lock.Cleanup() closes lock.renewChan and sets lock.acquired to false

Fixes #6
Part of #11
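The sidecar approach described above maps onto the job spec roughly like this (a minimal sketch; the task name, driver, and binary path are illustrative, not taken from the actual attache job file):

```hcl
task "attache-control" {
  driver = "raw_exec"

  config {
    command = "attache-control"
  }

  lifecycle {
    hook = "poststart"
    # With sidecar = true, Nomad expects the task to run for the life
    # of the allocation, so it is evaluated like any long-running task
    # rather than being judged against min_healthy_time after exiting.
    sidecar = true
  }
}
```

Marking the task as a sidecar avoids the health-check race entirely, at the cost of keeping the process alive until the allocation stops.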
Is there a v1.2.7 anticipated, or should we just wait for v1.3.0 GA?
We'll ship backports to 1.2.7 and 1.1.x when we ship 1.3.0 GA, which is happening soon!
Thanks! 🚀
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.0.3 (08741d9)
Operating system and Environment details
Ubuntu 18.04.5 LTS
Issue
A deployment with a poststart task is immediately marked as unhealthy and keeps running as an active deployment until the deadline hits.
The service is registered as healthy in Consul and reachable via Traefik.
After removing the poststart task, the deployment runs fine.
I guess the poststart task exits too fast (under 10s) and therefore trips the default min_healthy_time of 10s.
Reproduction steps
Job file
Forcing the poststart task to wait out the web service's min_healthy_time before exiting fixed the issue.
So I set the sleep timer slightly above min_healthy_time = "10s" and it works.
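The workaround can be sketched as follows (assuming the default min_healthy_time of 10s; the 15-second sleep is an illustrative value, anything comfortably above 10s works):

```hcl
task "bar" {
  driver = "raw_exec"

  config {
    command = "sh"
    # Sleep longer than min_healthy_time (10s by default) so the
    # poststart task is still running when the allocation's health
    # is evaluated, then exit successfully.
    args = ["-c", "sleep 15 && echo 'done' && exit 0"]
  }

  lifecycle {
    hook    = "poststart"
    sidecar = false
  }
}
```

The downside, as noted above, is that the sleep only covers the initial deployment window: a later scaling event can still trip the same bug once the sleep has expired.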
Seen in #9361.
Is it a bug, or is it working as expected?
Thanks.