Register Consul Service and Checks atomically #3935
Comments
We use Linkerd with Nomad too; it works fine.
@jippi Thanks for your help. In our case the shutdown part is not the problem; we confirmed it works fine (we use shutdown_delay, ...). The problem is that the new instance is considered healthy before the health check gets registered. We can almost always reproduce that under heavy load.
Thanks for reporting this @gahebe. We are aware of this and are tracking a fix for it in a future release.
Did you ever find a workaround for this issue @gahebe? We have the same problem, and can't afford the downtime on every deployment.
Has this issue been roadmapped @tgross? I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul. We've tried using Traefik health checks in addition to the ones in Consul, but unfortunately new services are marked as healthy before the first health check is done, leaving us with the same problem. Since we're trying to not lose a single request during deployments, we're unfortunately considering migrating away from Nomad.
It's been moved from the community issues "needs roadmapping" board to our roadmap, which is unfortunately not yet public. It has not been targeted for the upcoming Nomad 1.1, and Nomad 1.2 hasn't been scoped yet. We'd certainly welcome a pull request if this is a show-stopper for you, although it's admittedly a squirrelly area of the code base.
I'm not an expert in Traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in Traefik you need to tweak here.
How hard/breaking would it be to go with consul/api.Agent.ServiceRegisterOpts in the first call, meaning it would already contain the checks to be created, then perform the necessary cleanup? Cheers,
Call me stupid, but I tried this locally:
and it immediately registered the service with the checks. This seems to work (at least as verified by Wireshark and the Nomad source) because …
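For anyone wanting to try the same thing, here is a minimal sketch of what a single-call registration looks like against the Consul Go API (github.com/hashicorp/consul/api). The service name, ID, port, and check parameters are invented for illustration; this is not the actual patch, just the shape of the call:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register the service and its TCP check in one agent call, so the
	// service is never visible in Consul without its check attached.
	reg := &api.AgentServiceRegistration{
		ID:   "testservice-1", // illustrative ID
		Name: "testservice",
		Port: 8080,
		Checks: api.AgentServiceChecks{
			{
				Name:     "tcp-alive",
				TCP:      "127.0.0.1:8080",
				Interval: "5s",
				Timeout:  "2s",
			},
		},
	}

	// ServiceRegisterOpts with ReplaceExistingChecks should also cover the
	// update path: checks no longer present in the registration are removed.
	opts := api.ServiceRegisterOpts{ReplaceExistingChecks: true}
	if err := client.Agent().ServiceRegisterOpts(reg, opts); err != nil {
		log.Fatal(err)
	}
}
```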
I'm pretty sure Consul supports putting the check in the service registration back to at least 1.0, because I have a project from an old job that does that and just checked the API version there. So I don't think it's a matter of the API not supporting it. I had an internal discussion with some folks and it looks like a lot of the reasoning for how this was designed was about optimizing writes to Consul's raft DB, but it turns out that Consul already takes care of all of that performance worry under the hood. So now it's just a matter of getting it done. I think the approach @apollo13 has here is promising, but I'd want to make sure that it still works on updates, not just the initial registration.
I did change the service check and that updated properly. I did not check any other changes to the service, but looking through the code I am confident that this approach should work. It might (?) be worth looking into transactions, though, to make this an all-or-nothing sync. But then again I am not sure if this makes sense or not.
See also #10482
We seem to have this issue as well.
Unfortunately neither Traefik nor HAProxy (which we also tried) are able to work around this issue directly. They can both be configured with their own health checks, but newly discovered services are healthy by default, which makes it difficult. The workaround we're using now is a custom HTTP proxy between Traefik and Consul. Using the http provider in Traefik, and having the proxy only forward services that have a registered health check, works. But it is far from ideal.
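In case it helps others, a rough sketch of the filtering idea (not our actual proxy, and not a complete Traefik http provider; the service name and output format are placeholders): it only returns instances that have at least one real check registered, so a freshly registered, check-less instance never reaches the load balancer.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// backendsWithChecks returns the passing instances of a service that have
// at least one health check besides the node-level serfHealth check,
// filtering out instances that only look "passing" because no check has
// been registered for them yet.
func backendsWithChecks(client *api.Client, service string) ([]string, error) {
	entries, _, err := client.Health().Service(service, "", true, nil)
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, e := range entries {
		hasRealCheck := false
		for _, c := range e.Checks {
			if c.CheckID != "serfHealth" {
				hasRealCheck = true
				break
			}
		}
		if !hasRealCheck {
			continue
		}
		addr := e.Service.Address
		if addr == "" {
			addr = e.Node.Address // fall back to the node address
		}
		addrs = append(addrs, fmt.Sprintf("%s:%d", addr, e.Service.Port))
	}
	return addrs, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	addrs, err := backendsWithChecks(client, "testservice") // placeholder service name
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(addrs)
}
```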
@madsholden Could you test my patch? The more people testing it the sooner it will probably get merged.
We are running into a similar issue. Our setup: we are doing canary deployments, and no matter what we set shutdown_delay to, it seems that when the old task is being deregistered we see blips where we get 502s going to the application. Originally we thought shutdown_delay would resolve the issue, but it almost seems like there is a split second (or seconds) where there is no config routing the traffic to the proper app.
@rlandingham could you test my patch (see a few comments above)? If it fixes it for you a PR would certainly be the best step forward.
This PR updates Nomad's Consul service client to include checks in an initial service registration, so that the checks associated with the service are registered "atomically" with the service. Before, we would only register the checks after the service registration, which causes problems where the service is deemed healthy, even if one or more checks are unhealthy - especially problematic in the case where SuccessBeforePassing is configured. Fixes #3935
…14944)
* consul: register checks along with service on initial registration

  This PR updates Nomad's Consul service client to include checks in an initial service registration, so that the checks associated with the service are registered "atomically" with the service. Before, we would only register the checks after the service registration, which causes problems where the service is deemed healthy, even if one or more checks are unhealthy - especially problematic in the case where SuccessBeforePassing is configured. Fixes #3935

* cr: followup to fix cause of extra consul logging
* cr: fix another bug
* cr: fixup changelog
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Context
We use Nomad with Consul and Docker to deploy gRPC services to a cluster. We use Linkerd as the load balancer; it uses Consul to get the availability information of the services.
Scenario
We observe transient errors at the consumer when a new instance of the service is started during a service update with Nomad.
(A TCP health check is used to avoid routing requests to the new instance while the Docker container has started but the service inside is not yet ready to receive requests. Linkerd is configured to only consider services with a passing health check status.)
Expected behavior
Being able to update the services seamlessly with the above setup.
Versions
Nomad v0.6.3
Consul v1.0.3
Linkerd v1.3.5
Analysis
Linkerd uses blocking queries to get availability info from Consul.
Example:
/v1/health/service/testservice?index=1366&dc=dc1&passing=true
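For what it's worth, the same blocking, passing-only query can be expressed with the Consul Go client; the index, datacenter, and service name below are taken from the example URL and are otherwise arbitrary:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// WaitIndex makes Consul hold the request until the result changes past
	// index 1366; passingOnly=true mirrors passing=true in the URL above.
	opts := &api.QueryOptions{WaitIndex: 1366, Datacenter: "dc1"}
	entries, meta, err := client.Health().Service("testservice", "", true, opts)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(entries), "instances, index", meta.LastIndex)
}
```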
I captured the network traffic, and this type of query first returns the new service instance without the TCP health check defined, then shortly afterwards it returns it with the TCP health check.
So it seems Consul considers the service “passing” when it does not have health checks defined,
and Nomad first registers the new service instance and then it registers its health checks in a separate call.
The captured network packets (Nomad -> Consul) confirm that this happens in separate calls:
Nomad should register the service and its health checks in one call to Consul; otherwise the new service instance is considered healthy even before Nomad registers its health checks, I believe.
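To make the race concrete, here is a sketch of the two-call pattern described above, written against the Consul Go API; the names, port, and check parameters are invented for illustration, and this is not Nomad's actual code. Registering the service with its checks in a single call (as in the earlier example in this thread) closes the window between the two requests.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// Call 1: the service is registered with no checks. From now until
	// call 2 completes, a passing-only health query returns this instance,
	// because a service with zero checks counts as passing.
	if err := agent.ServiceRegister(&api.AgentServiceRegistration{
		ID:   "testservice-1",
		Name: "testservice",
		Port: 8080,
	}); err != nil {
		log.Fatal(err)
	}

	// Call 2: the TCP check is attached in a separate request.
	if err := agent.CheckRegister(&api.AgentCheckRegistration{
		Name:      "testservice-tcp",
		ServiceID: "testservice-1",
		AgentServiceCheck: api.AgentServiceCheck{
			TCP:      "127.0.0.1:8080",
			Interval: "5s",
			Timeout:  "2s",
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```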