Reloading Consul re-runs all watch commands every time. #571
A watch command seems to be run every time Consul is reloaded. I don't think those commands should be run unless the watch is actually "triggered" - but that's what appears to be the behavior. |
Just found issue #342 that describes the fire on create behavior. |
The fire-on-create one is a weird semantic thing. To me it seemed that most use cases would need the initial data plus any deltas (e.g. an initial HAProxy config, plus updates), so we always fire on the first run. I guess there are cases where you may not care; I want to make that a flag on the watchers. The re-fire is just caused by us doing the dumbest reload possible - drop everything and rebuild everything, instead of complex change-detection logic (was the watcher added/removed/modified?). |
Assuming that the executable called by the watch is capable of making an idempotent change seems reasonable (or at the very worst, an identically generated config file + 1x SIGHUP should not be problematic in the common case). FWIW, we're designing the tools called by consul-template around the assumption that watches will fire haphazardly for many different and unknown reasons. |
All of the commands I have been using are idempotent, the problem is that as you add watches it gets run and re-run and re-run over and over and over. If I have 50 containers with 2 watches per container (service and KV) then if ANY container is added or removed, all 50 stop and start each time there's a change to any Consul config file. I have looked at the shell environment and the JSON that's passed to the handler - and there's no obvious difference that I can see - so there's no real way to know it's different outside of Consul. I was attempting to start an AWS box that very comfortably runs 50+ small web containers - I was never able to finish and killed the box after 2300 container restarts. It couldn't really ever catch up after one container had been loaded and the next one finished. |
@darron oh are you saying that the watchers compound over time? e.g. a new set of 50 watchers is added on each reload? |
Here's how it is working with the current state of Consul:
So - approximately 200 watch events fire because Consul is HUPed, not because anything actually changed. I could cut the reloads by 50% by only reloading Consul once - but that didn't really help much; containers that didn't need to be restarted were still being restarted over and over. I ended up disabling the KV container-restart watch - the first one described here: http://blog.froese.org/2014/12/30/how-octohost-uses-consul-watches/ It's just not lightweight enough to justify including, the way Consul fires them each time. |
Hey guys, I also ran into this issue, and there are some potentially nasty behaviors if any of the handlers invoke
Here is the watch config in question:
This was with consul 0.5.0 |
I think a possible solution is to use the … field. Can anyone confirm that …? |
@jsok event watches are a bit of a snowflake, as they are powered by the gossip layer. |
Ok, so I find this all very weird... Triggering an event that is being watched gives me the following output from the watch (a single event):

```json
[{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]
```

But reloading gives me:

```json
[{"ID":"95a7f085-ac1d-d916-6ba4-749e8d102a5e","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":5},
 {"ID":"10681ffe-b7b6-a0a0-d3a3-fa802d997258","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":7},
 {"ID":"5b373801-54f2-2bfa-bcb0-a4bcd3aacacf","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":8},
 {"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]
```

Why is that? It makes very little sense to me. |
@ryanuber what is `LTime`? |
`LTime` is Serf's implementation of a Lamport clock. The docs go into more depth: https://www.serfdom.io/docs/internals/gossip.html I believe the reason you get all the events on reload is that the other nodes do not know which gossiped event your local agent received last. So, to synchronise your local agent, the other nodes re-send all previous events. |
Ok, but right now I have only one node for testing. |
Can I work around this by checking and remembering the `LTime`? |
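A handler can work around the replayed events by remembering the highest `LTime` it has processed. Below is a minimal Go sketch of such a wrapper - it is a hypothetical illustration, not part of Consul; the state-file path is an assumption. It reads the JSON event array the watch pipes to its handler on stdin, drops everything at or below the last recorded `LTime`, acts only on the remainder, and persists the new high-water mark.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// event mirrors the fields of the JSON objects a Consul event watch
// passes to its handler on stdin.
type event struct {
	ID    string `json:"ID"`
	Name  string `json:"Name"`
	LTime uint64 `json:"LTime"`
}

// filterNew keeps only events whose LTime is greater than last, and
// returns them together with the highest LTime seen.
func filterNew(events []event, last uint64) ([]event, uint64) {
	max := last
	var fresh []event
	for _, e := range events {
		if e.LTime > last {
			fresh = append(fresh, e)
			if e.LTime > max {
				max = e.LTime
			}
		}
	}
	return fresh, max
}

func main() {
	statePath := "/tmp/last_ltime" // assumption: a writable state file
	var last uint64
	if b, err := os.ReadFile(statePath); err == nil {
		last, _ = strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	}

	var events []event
	if err := json.NewDecoder(os.Stdin).Decode(&events); err != nil {
		fmt.Fprintln(os.Stderr, "decode:", err)
		os.Exit(1)
	}

	fresh, max := filterNew(events, last)
	for _, e := range fresh {
		// A real handler would invoke its action here instead of printing.
		fmt.Printf("new event %s (LTime=%d)\n", e.Name, e.LTime)
	}
	_ = os.WriteFile(statePath, []byte(strconv.FormatUint(max, 10)), 0o644)
}
```

With this in front of the real handler, a reload that replays `LTime` 5, 7, 8, 9 after 9 was already handled would result in no action at all.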
@armon Do you still plan to add the flag to the watches (to turn off fire-on-create)? If so, any plans when this is going to be available? |
@BjRo We are tracking it here, but no concrete plans. Lots of more pressing things. |
I think I ran into this issue as well. This is stopping us from adopting Consul since it would fire a deploy if we have to restart Consul. |
Any updates on this? |
This is critical for us as part of our one-to-many plans. I can't have my events all firing anytime there is an edit or reload. I'd consider this a pressing issue. |
I am also planning to use Consul watch event triggers, whether for deployments or for HTTP calls fired by a monitoring system to fix a check (e.g. restart ntpd if an NTP peer check failed), along with a bunch of other event handlers. This is still a blocker for me; I am not happy with the workaround, as I had to put all the commands in a script handler instead. As a workaround I am verifying the payload and restarting the service only if the payload matches: https://gist.github.com/vkhatri/1c3d9b287338ed0288c0 It would be great to have a configuration parameter to prevent event triggers on service start/reload. |
Because I was tired of launching processes with scripts that kept duplicating functionality, I built a small Go-based tool to help with this limitation: https://github.com/darron/sifter It handles
We've been running it in production to protect |
+1 |
We have the same issues with event watches. We want to use events to manually initialize containers or schedule some tasks (like deploys). But the watches get executed at startup, which is not our desired behaviour. This is also a serious blocker for us using Consul. We are thinking about using |
Any ETA on when a fix or enhancement might be scheduled? |
Any update on when this will be fixed? |
Same issue here. I'll try sifter, but it does not seem to be tested with key/keyprefix watches. Any updates? Cheers! |
A way to remove the event from consul might be useful. |
Got stuck on the same issue. There seems to be no update so far. |
Want to use "event watch" to start/stop services, but got stuck because of this defect (fire on create). When will this issue be fixed? Thanks. |
I am still confused how watches are even all that useful when this behavior is present. If you are requesting a very specific event but a watch fires also during other times (like a reload), what good is the watch? |
Any update on fixing this? It has been a huge pain manually keeping track of old watch events. |
I'm trying to replicate this issue and work on a fix. |
I've been working on this issue for a while and I think I have a solution. I want to ask some clarifying questions before I send the PR. |
@Sudhar287 I believe the
and then running
The solution of adding a new parameter onto the watch plan is sensible, but I think there are some subtle edge cases someone might run into with that behavior that we should think through. The obvious edge case I can see is the following scenario:
In this scenario the watch misses an update, which is undesirable. For event watches specifically, this might be fine because consul events are already a best-effort mechanism and not designed for guaranteed delivery, but for other types of watches this behavior could be a problem. Perhaps a solution can be designed such that a |
Yes @crhino, you are absolutely right in everything you wrote! Thank you for the detailed comment. Please let me know the final proposal and I'll work on it. :) |
@Sudhar287 I roughly reviewed the code you proposed, and the idea of the operator being able to explicitly state in the watch config whether it should be immediately fired on create would save a lot of complex state tracking. However, I have a question: will the watch with … @crhino Great comment on this one and the edge cases. I think that the current behaviour with fire-any-time … |
Thank you for your insights @vaLski. I have some findings/suggestions to present too:
Please let me know if I've made a mistake, I'm a beginner wrt |
The difference is essentially whether or not the watch is declared in the Consul agent's configuration. A
Hope that makes sense, let me know if there are any questions! |
Thank you for the response @crhino! |
@Sudhar287 force-push several times until it works. Note that rebasing on HEAD might also help, as many tests are a bit more stable in the latest commits. |
Thanks for the response @pierresouchay. Your suggestions were helpful and I've passed the CI. Can one of the official maintainers please have a look at PR #7616? |
What's the status of reviewing and merging this one? It's blocking some of our work that will implement new consul watches. |
Will this issue be fixed? I have also encountered this problem. I have about 70 microservices in the mesh whose state I should watch: if all instances of a service are down or unhealthy, I should prevent other services from sending requests to it. I implemented this with Consul watches, but now whenever any service in the mesh stops, all of my watches trigger, even if nothing happened to the state of the others. This is frustrating. |
I'm reloading Consul to load in new config files as the containers are built - but it appears as if it's re-running each defined watch command every time Consul is reloaded.
Note all the newly created docker containers that are being stopped and started (from the watch command):
http://shared.froese.org/2014/Screen_Shot_2015-01-02_at_4.34.32_PM.png
Example watch command is here:
https://gist.github.com/darron/481604459ccfde4d401a
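For readers without access to the gist, an agent-level watch definition of that era generally takes this shape - the service name and handler path below are hypothetical, not taken from the gist:

```json
{
  "watches": [
    {
      "type": "service",
      "service": "web",
      "handler": "/usr/local/bin/restart-containers.sh"
    }
  ]
}
```

The handler receives the watch's current view as JSON on stdin each time it fires - including, per this issue, on every reload of the agent.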
Is this expected behavior? I had thought that if the config had changed, it should obviously reload and probably re-run, but this is surprising to me.
I have a few watch commands:
https://gist.github.com/darron/38af49ad1352a913d360
Looking through the api, I don't see another way to register a watch command.
In the end, I was trying to launch approximately 50 containers; 4 hours later it was still going, launching and re-launching 2300+ containers and counting:
http://shared.froese.org/2014/v87sd-17-50.jpg
Am I "doing it wrong"?