Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reloading Consul re-runs all watch commands every time. #571

Open
darron opened this issue Jan 2, 2015 · 55 comments
Open

Reloading Consul re-runs all watch commands every time. #571

darron opened this issue Jan 2, 2015 · 55 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature

Comments

@darron
Copy link
Contributor

darron commented Jan 2, 2015

I'm reloading Consul to load in new config files as the containers are built - but it appears as if it's re-running each defined watch command every time Consul is reloaded.

Note all the newly created docker containers that are being stopped and started (from the watch command):

http://shared.froese.org/2014/Screen_Shot_2015-01-02_at_4.34.32_PM.png

Example watch command is here:

https://gist.github.com/darron/481604459ccfde4d401a

Is this expected behavior? I had thought that if the config had changed, it should obviously reload and probably re-run, but this is surprising to me.

I have a few watch commands:

https://gist.github.com/darron/38af49ad1352a913d360

Looking through the api, I don't see another way to register a watch command.

In the end, I was trying to launch approximately 50 containers, it was still going 4 hours later - launching and re-launching 2300+ containers and counting:

http://shared.froese.org/2014/v87sd-17-50.jpg

Am I "doing it wrong"?

@darron
Copy link
Contributor Author

darron commented Jan 3, 2015

A watch command seems to be run:

  1. When a config file is loaded.
  2. When Consul is HUP'ed and it rechecks them all.

I don't think those commands should be run unless the watch is actually "triggered" - but that's what appears to be the behavior.

@darron
Copy link
Contributor Author

darron commented Jan 4, 2015

Just found issue #342 that describes the fire on create behavior.

@armon
Copy link
Member

armon commented Jan 5, 2015

The fire-on-create one is a weird semantic thing. To me it seemed that most use cases would need the initial data plus any deltas (e.g. initial HAProxy config, plus updates), so we always fire on the first run. I guess there are cases you may not care. I want to make that a flag to the watchers.

The re-fire is just caused by us doing the dumbest reload possible. e.g. drop everything and rebuild everything, instead of complex change detection logic. (Was the watcher added/removed/modified)

@sean-
Copy link
Contributor

sean- commented Jan 5, 2015

Assuming that the executable called by the watch is capable of making an idempotent change seems reasonable (or at the very worst, an identically generated config file + 1x SIGHUP should not be problematic in the common case). FWIW, we're designing the tools called by consul-template around the assumption that watches will fire haphazardly for many different and unknown reasons.

@darron
Copy link
Contributor Author

darron commented Jan 5, 2015

All of the commands I have been using are idempotent, the problem is that as you add watches it gets run and re-run and re-run over and over and over.

If I have 50 containers with 2 watches per container (service and KV) then if ANY container is added or removed, all 50 stop and start each time there's a change to any Consul config file.

I have looked at the shell environment and the JSON that's passed to the handler - and there's no obvious difference that I can see - so there's no real way to know it's different outside of Consul.

I was attempting to start an AWS box that very comfortably runs 50+ small web containers - I was never able to finish and killed the box after 2300 container restarts. It couldn't really ever catch up after one container had been loaded and the next one finished.

darron added a commit to octohost/octohost that referenced this issue Jan 6, 2015
@armon
Copy link
Member

armon commented Jan 6, 2015

@darron oh are you saying that the watchers compound over time? e.g. a new set of 50 watchers is added on each reload?

@darron
Copy link
Contributor Author

darron commented Jan 6, 2015

Here's how it is working with the current state of Consul:

  1. There's already 49 containers with a kv and service watch each.
  2. Add a new site to Docker.
  3. Add a KV watch. (Hup Consul - as a result, all 98 watches fire again - which start to reload 49 containers and rebuild 49 nginx config files).
  4. Add a service watch. (Hup Consul - as a result, all 99 watches fire again - which start to reload 50 containers and rebuild 49 nginx config files).

So - approximately 200 watch events fire because Consul is HUPed - not because they change or anything.

I could cut the reloads by 50% only reloading Consul once - but that didn't really help much - still containers that didn't need to be restarted were being restarted over and over.

I ended up disabling the KV container restart watch - the first one described here:

http://blog.froese.org/2014/12/30/how-octohost-uses-consul-watches/

It's just not lightweight enough to justify including the way Consul fires them each time.

@dgshep
Copy link

dgshep commented May 27, 2015

Hey guys. I also ran into this issue as well, and there are some potentially nasty behaviors if any of the handlers invoke consul reloadeither directly on indirectly. Essentially the watch will continue to refire until all file descriptors are exhausted:

2015/05/27 18:55:16 [ERR] agent: Failed to invoke watch handler 'consul reload': fork/exec /bin/sh: too many open files

here is the given watch config:

{ "watches": [ { "handler": "consul reload", "type": "event", "name": "break-consul" } ] }

This was with consul 0.5.0

@jsok
Copy link

jsok commented Jul 2, 2015

I think a possible solution is to use the LTime field in the events and store that in a file each time your handler runs. That way, if you receive the same event with an LTime which is less-than-or-equal to the one you have stored in the file you can ignore it.

Can anyone confirm that LTime is reliable for this?

@ryanuber
Copy link
Member

@jsok event watches are a bit of a snowflake, as they are powered by the gossip layer. LTime should be safe to use in the way you describe for event watches since it is a monotonic value, but it will not apply to other watch types and will not be part of the payload returned for those watches.

@joelmoss
Copy link

Ok, so I find this all very weird...

Triggering an event that is being watched gives me the following output from the watch (a single event):

[{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

But reloading using consul reload also triggers this event (as this issue confirms), however, the output from the watch is a list of all events past:

[{"ID":"95a7f085-ac1d-d916-6ba4-749e8d102a5e","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":5},{"ID":"10681ffe-b7b6-a0a0-d3a3-fa802d997258","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":7},{"ID":"5b373801-54f2-2bfa-bcb0-a4bcd3aacacf","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":8},{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

Why is that, as it makes very little sense to me.

@joelmoss
Copy link

@ryanuber what is Ltime?

@jsok
Copy link

jsok commented Jul 29, 2015

LTime is serfs implementation of a lamport clock. The docs go into more depth: https://www.serfdom.io/docs/internals/gossip.html

I believe the reason you get all the events on reload is because the other nodes do not know what the last gossiped event that your local agent received. So to synchronise your local agent the other nodes will re-send all previous events.

@joelmoss
Copy link

ok, but right now I have only one node for testing.

@BjRo
Copy link

BjRo commented Aug 10, 2015

Can I work around this by checking and remembering the CONSUL_INDEX env var in my handler (if I have a watch of type key or keyprefix)?

@BjRo
Copy link

BjRo commented Aug 10, 2015

@armon Do you still plan to add the flag to the watches (to turn off fire-on-create)? If so, any plans when this is going to be available?

@armon
Copy link
Member

armon commented Aug 11, 2015

@BjRo We are tracking it here, but no concrete plans. Lots of more pressing things.

@ssenaria
Copy link

I think I ran into this issue as well. This is stopping us from adopting Consul since it would fire a deploy if we have to restart Consul.

@ssenaria
Copy link

Any updates on this?

@cya9nide
Copy link

This is critical for us as part of our one to many plans. I can't have my events all firing anytime there is an edit or reload. I'd consider this a pressing issue.

@slackpad slackpad added the type/enhancement Proposed improvement or new feature label Nov 16, 2015
@vkhatri
Copy link

vkhatri commented Jan 2, 2016

I am also planning to use Consul Watch - Event triggers, whether it is a deployment or HTTP call fired by monitoring system to fix a check (e.g. restart ntpd if ntp peer check failed) with bunch of other events handler. This is still a blocker for me, i am not happy with the workaround as i had to put all the commands in a script handler instead. As a work around i am verifying the payload and restart the service if only payload matches.

https://gist.github.com/vkhatri/1c3d9b287338ed0288c0

Would be great to have a configuration parameter to prevent event trigger on service start/reload.

@darron
Copy link
Contributor Author

darron commented Jan 2, 2016

Because I was tired of launching processes with scripts that kept duplicating functionality, I built a small Go based tool to help with this limitation:

https://github.com/darron/sifter

It handles event and key watches - and doesn't allow the watch to fire if:

  1. Event: It's already seen that LTime value.
  2. Key: the hash of the payload is the same as before.

We've been running it in production to protect event watches for a few weeks now - key watches are less tested but have worked in my local testing.

@sc0rp10
Copy link

sc0rp10 commented Mar 1, 2016

+1
We too need disabling "fire-on-create".

@eppdot
Copy link

eppdot commented Mar 4, 2016

We have the same issues with event watches. We want to use events to manually initialize container or schedule some tasks (like deploy). But the watches get executed in the beginning, which ist not our desired behaviour. This is also a serious blocker for us using consul. We are thinking about using consul exec instead, but that has serious security issues allowing arbitrary commands to be executed. +1

@ssenaria
Copy link

ssenaria commented Apr 7, 2016

Any ETA on when a fix or enhancement would be scheduled for?

@aashishmodak
Copy link

aashishmodak commented May 17, 2017

Any update on by when this will be fixed?

@slackpad slackpad added the theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner label May 25, 2017
@aroca
Copy link

aroca commented Feb 1, 2018

Same issue here. Will try sifter but does not seem to be tested with key/keyprefix watches..

Any updates?

Cheers!

@daledude
Copy link

A way to remove the event from consul might be useful.

@jgj1018
Copy link

jgj1018 commented Aug 23, 2018

got stuck at the same issue. seems no update until now

@yuqiangh
Copy link

yuqiangh commented Sep 4, 2018

Want to use "event watch" to start/stop services, but get stuck because of the defect (fire on create). When can this issue be fixed? Thanks.

@adawalli
Copy link
Contributor

adawalli commented Feb 9, 2019

I am still confused how watches are even all that useful when this behavior is present. If you are requesting a very specific event but a watch fires also during other times (like a reload), what good is the watch?

@DImuthuUpe
Copy link

Any update on fixing this? It has been a huge pain in manually keeping track of old watch event.

@Sudhar287
Copy link

I'm trying to replicate this issue and work on a fix.
I created a watch on an event and fired it multiple times. Everything worked as expected.
I tried to do a consul reload but this didn't trigger the event, contrary to the behavior described in a previous comment.
Has this problem been fixed? Is there a PR I can look into?

@Sudhar287
Copy link

I've been working on this issue for a while and I think I have a solution. I want to ask some clarifying questions before I send the PR.

  1. I identified that its the watch plan handler function that's being called whenever a new watch is created and thus the fire-on-create behavior is observed. Am I correct here? To fix this I've simply skipped handler function from being called the very first time.
  2. Would you like to give the users the power to either enable or disable this fire-on-create behavior? If this is required, then I thought of having a global parameter to set/unset it.

@crhino
Copy link
Contributor

crhino commented Apr 9, 2020

@Sudhar287 I believe the consul reload problem still exists. I was able to reproduce this via:

$ consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }'

and then running consul reload in another shell, causing test to be printed out a second time:

$ cat /tmp/consulwatch
test
test

The solution of adding a new parameter onto the watch plan is sensible, but I think there are some subtle edge cases someone might run into with that behavior that we should think through.

The obvious edge case I can see is the following scenario:

  1. I have a watch on the K/V path /foo
  2. I initiate a reload of consul
  3. The consul agent cancels the current watch of /foo
  4. An update to the /foo key is committed to the Raft log
  5. The consul agent starts a new watch of /foo, not firing on create

In this scenario the watch misses an update, which is undesirable.

For event watches specifically, this might be fine because consul events are already a best-effort mechanism and not designed for guaranteed delivery, but for other types of watches this behavior could be a problem. Perhaps a solution can be designed such that a fire-on-create option only applies to event watches.

@Sudhar287
Copy link

Sudhar287 commented Apr 9, 2020

Yes @crhino, you are absolutely right in everything you wrote! Thank you for the detailed comment.
Even I thought about the edge case but I figured it would be okay to skip the handler being invoked for events that occurred in the past. Now I know that its undesirable...

Please let me know the final proposal and I'll work on it. :)

@vaLski
Copy link
Contributor

vaLski commented Apr 10, 2020

@Sudhar287 I roughly reviewed the code you proposed, and the idea of operator being able to explicitly state in the watch config, should it be immediately fired-on-create or not, save a lot of complex tracking caused by states etc. However I have question. Will the watch with "fireOnCreate":"no" setting set, be executed on initial agent startup? If it will be executed on startup but won't be executed on reload I suggest re-labeling it to "fireOnReload": "no".

@crhino Great comment on this one and the edge cases. I think that the current behaviour with fire-any-time yes should remain and will be the default in future. However I consider this relative easy and clean implementation that @Sudhar287 proposed to provide operators with great fine granted control over watches. Experienced ones might want to fine tune their watches and decide which should stick to the default behaviour and run anytime, and which ones they can "sacrifise" and tweak with "fireOnReload":"no" while fully understanding the downsides of their decision. This is somehow similar to the consul kv get and consul kv get -stale options. I am curious to hear what do you think about that?

@Sudhar287
Copy link

Sudhar287 commented Apr 12, 2020

Thank you for your insights @vaLski. I have some findings/suggestions to present too:

  1. i. Starting the consul agent triggers the watch reload fucntion and
    ii. The watch reload function eventually fires them. Hence, I think fire-on-reload option is essential a fire-on-create option.

  2. The consul reload command reloads all the watches. This has to be skipped if we don't want the fire-on-reload/fire-on-create behavior.

  3. The fucntion call stack are different between watching using consul watch and the consul agent watch. Eg: consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }' and consul watch -type=key -key=/test echo test >> /tmp/consulwatch are different that I though it would be. If we want to modify the behaviour of the application, we have to make it consistent for both the commands.

  4. consul reload doesn't seem to have any effect when watching with the consul watch command. Which is the reason for my comment earlier.

Please let me know if I've made a mistake, I'm a beginner wrt consul. Please share your opinion and thoughts too. :)

@crhino
Copy link
Contributor

crhino commented Apr 13, 2020

  1. The fucntion call stack are different between watching using consul watch and the consul agent watch. Eg: consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }' and consul watch -type=key -key=/test echo test >> /tmp/consulwatch are different that I though it would be. If we want to modify the behaviour of the application, we have to make it consistent for both the commands.

The difference is essentially whether or not the watch is declared in the Consul agent's configuration or not. A consul reload will only reload the agent's configuration files, which means that any watches run via consul watch are not effected by a reload at all. The Watches docs page explains this as follows:

Watches can be configured as part of the agent's configuration, causing them to run once the agent is initialized. Reloading the agent configuration allows for adding or removing watches dynamically.

Alternatively, the watch command enables a watch to be started outside of the agent. This can be used by an operator to inspect data in Consul or to easily pipe data into processes without being tied to the agent lifecycle.

Hope that makes sense, let me know if there are any questions!

@Sudhar287
Copy link

Thank you for the response @crhino!
FYI, I've updated the PR. Some tests are failing in circle CI but the pass on my local machine. Any tips on how to proceed?

@pierresouchay
Copy link
Contributor

@Sudhar287 force push several times until it works. Note that rebasing on Head might also help as many tests are a bit more stable in latest commits

@Sudhar287
Copy link

Thanks for the response @pierresouchay. Your suggestions were helpful and I've passed the CI. Can one of the official maintainers please have a look at the PR #7616 ?

@pmb311
Copy link

pmb311 commented May 11, 2020

What's the status of reviewing and merging this one? It's blocking some of our work that will implement new consul watches.

duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
* adding terminating gw test without tls/acls

Co-authored-by: Luke Kysow <1034429+lkysow@users.noreply.github.com>
duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
…istered (hashicorp#571)

Fixes hashicorp#540

* Modify endpoints controller to delete ACL tokens for each service instance that it deregisters
* Remove TLS+ACLs table tests from endpoints controller tests. These tests were testing that endpoints controller works with a client configured to have TLS and ACLs. I thought this test was not necessary because there isn't any code in the controller that behaves differently if the consul client is configured with any of those and as a result there's no way these tests could fail. The tests testing to the new ACL logic are there but they are only testing the logic that was added and configure test agent to accommodate for that.
* Create test package under helper and move GenerateServerCerts function from subcommand/common there because it was used outside of subcommand.
* Create a helper test function to set up auth methods and refactor existing connect-init command tests to use that function.
* Minor editing fixes of comments etc.
@Sharofiddin
Copy link

Will this issue be fixed? I have also encountered this problem. I have about 70 microservices in the mesh that I should watch their state if all instances are down for a service or unhealthy I should prevent other services from sending requests to unhealthy services. I implemented this with consul-watch but now whenever a service from mesh stopped all of my watches triggered even if nothing happened to their state. This is frustrating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature
Projects
None yet
Development

No branches or pull requests