Reloading Consul re-runs all watch commands every time. #571

darron · 2015-01-02T23:42:14Z

I'm reloading Consul to load in new config files as the containers are built - but it appears as if it's re-running each defined watch command every time Consul is reloaded.

Note all the newly created docker containers that are being stopped and started (from the watch command):

http://shared.froese.org/2014/Screen_Shot_2015-01-02_at_4.34.32_PM.png

Example watch command is here:

https://gist.github.com/darron/481604459ccfde4d401a

Is this expected behavior? I had thought that if the config had changed, it should obviously reload and probably re-run, but this is surprising to me.

I have a few watch commands:

https://gist.github.com/darron/38af49ad1352a913d360

Looking through the api, I don't see another way to register a watch command.

In the end, I was trying to launch approximately 50 containers, and it was still going 4 hours later - launching and re-launching 2300+ containers and counting:

http://shared.froese.org/2014/v87sd-17-50.jpg

Am I "doing it wrong"?

darron · 2015-01-03T23:39:51Z

A watch command seems to be run:

When a config file is loaded.
When Consul is HUP'ed and it rechecks them all.

I don't think those commands should be run unless the watch is actually "triggered" - but that's what appears to be the behavior.

darron · 2015-01-04T00:10:03Z

Just found issue #342 that describes the fire on create behavior.

armon · 2015-01-05T19:25:36Z

The fire-on-create one is a weird semantic thing. To me it seemed that most use cases would need the initial data plus any deltas (e.g. initial HAProxy config, plus updates), so we always fire on the first run. I guess there are cases where you may not care. I want to make that a flag to the watchers.

The re-fire is just caused by us doing the dumbest reload possible, i.e. dropping everything and rebuilding everything instead of using complex change detection logic (was the watcher added, removed, or modified?).

sean- · 2015-01-05T23:07:59Z

Assuming that the executable called by the watch is capable of making an idempotent change seems reasonable (or at the very worst, an identically generated config file + 1x SIGHUP should not be problematic in the common case). FWIW, we're designing the tools called by consul-template around the assumption that watches will fire haphazardly for many different and unknown reasons.
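One way a handler can meet the "identically generated config file" expectation is to write the file only when its content actually differs, and to signal a reload only in that case. The following is a minimal sketch under that assumption; `write_if_changed` is a hypothetical helper name, not something from Consul or consul-template.

```python
import os
import tempfile


def write_if_changed(path: str, rendered: str) -> bool:
    """Write `rendered` to `path` only when the content differs.

    Returns True when the file was (re)written, so the caller can decide
    whether sending a reload signal is worthwhile.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == rendered:
                return False  # nothing changed, so skip the reload
    except FileNotFoundError:
        pass  # first run: fall through and create the file

    # Write to a temp file in the same directory, then rename it into
    # place, so a reader never sees a half-written config.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(rendered)
    os.replace(tmp, path)
    return True
```

A handler would call this with the freshly rendered config and send its SIGHUP only when the return value is True.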

darron · 2015-01-05T23:43:39Z

All of the commands I have been using are idempotent. The problem is that as you add watches, each one gets run and re-run over and over.

If I have 50 containers with 2 watches per container (service and KV) then if ANY container is added or removed, all 50 stop and start each time there's a change to any Consul config file.

I have looked at the shell environment and the JSON that's passed to the handler - and there's no obvious difference that I can see - so there's no real way to know it's different outside of Consul.

I was attempting to start an AWS box that very comfortably runs 50+ small web containers - I was never able to finish and killed the box after 2300 container restarts. It couldn't really ever catch up after one container had been loaded and the next one finished.

armon · 2015-01-06T18:43:39Z

@darron oh are you saying that the watchers compound over time? e.g. a new set of 50 watchers is added on each reload?

darron · 2015-01-06T18:56:58Z

Here's how it is working with the current state of Consul:

There's already 49 containers with a kv and service watch each.
Add a new site to Docker.
Add a KV watch. (Hup Consul - as a result, all 98 watches fire again - which start to reload 49 containers and rebuild 49 nginx config files).
Add a service watch. (Hup Consul - as a result, all 99 watches fire again - which start to reload 50 containers and rebuild 49 nginx config files).

So - approximately 200 watch events fire because Consul is HUPed - not because they change or anything.
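The arithmetic behind that figure can be sketched as follows. This assumes every HUP re-fires each watch that was registered before it, and counts the way the steps above do (the newly added watch's own first fire is not counted). The function names are made up for illustration.

```python
def refires_for_container(k: int) -> int:
    """Re-fired watches when adding the k-th container, per the steps above.

    Before the add there are 2 * (k - 1) watches. The first HUP (after the
    KV watch is added) re-fires those; the second HUP (after the service
    watch is added) re-fires those plus the new KV watch.
    """
    existing = 2 * (k - 1)
    return existing + (existing + 1)


def total_refires(n: int) -> int:
    """Total re-fired watches while adding containers 1..n one at a time."""
    return sum(refires_for_container(k) for k in range(1, n + 1))


print(refires_for_container(50))  # 98 + 99 = 197, the "approximately 200" above
print(total_refires(50))          # cumulative count for a 50-container rollout
```

Under these assumptions the per-container cost grows linearly with the number of containers already running, so the cumulative cost of a rollout grows quadratically.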

I could cut the reloads by 50% by only reloading Consul once - but that didn't really help much - containers that didn't need to be restarted were still being restarted over and over.

I ended up disabling the KV container restart watch - the first one described here:

http://blog.froese.org/2014/12/30/how-octohost-uses-consul-watches/

It's just not lightweight enough to justify including the way Consul fires them each time.

dgshep · 2015-05-27T18:57:09Z

Hey guys. I ran into this issue as well, and there are some potentially nasty behaviors if any of the handlers invoke consul reload, either directly or indirectly. Essentially the watch will continue to refire until all file descriptors are exhausted:

2015/05/27 18:55:16 [ERR] agent: Failed to invoke watch handler 'consul reload': fork/exec /bin/sh: too many open files

Here is the watch config:

{ "watches": [ { "handler": "consul reload", "type": "event", "name": "break-consul" } ] }

This was with consul 0.5.0

jsok · 2015-07-02T03:42:57Z

I think a possible solution is to use the LTime field in the events and store that in a file each time your handler runs. That way, if you receive the same event with an LTime which is less-than-or-equal to the one you have stored in the file you can ignore it.
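A minimal sketch of that idea, assuming the handler receives the event list as JSON and that distinct events never share an LTime (nothing in this thread establishes that). `filter_new_events` and the state-file layout are hypothetical.

```python
def filter_new_events(events, state_path):
    """Return the events whose LTime is greater than the stored value.

    The highest LTime seen so far is kept in a small text file, so that a
    replayed list of old events yields nothing new.
    """
    try:
        with open(state_path, "r", encoding="utf-8") as f:
            last_seen = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        last_seen = -1  # no usable state yet: treat every event as new

    # The watch output may be null when there are no events at all.
    fresh = [e for e in (events or []) if e["LTime"] > last_seen]
    if fresh:
        newest = max(e["LTime"] for e in fresh)
        with open(state_path, "w", encoding="utf-8") as f:
            f.write(str(newest))
    return fresh
```

A handler would pass in `json.load(sys.stdin)` and act only on the returned events.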

Can anyone confirm that LTime is reliable for this?

ryanuber · 2015-07-13T16:01:41Z

@jsok event watches are a bit of a snowflake, as they are powered by the gossip layer. LTime should be safe to use in the way you describe for event watches since it is a monotonic value, but it will not apply to other watch types and will not be part of the payload returned for those watches.

joelmoss · 2015-07-29T22:18:25Z

Ok, so I find this all very weird...

Triggering an event that is being watched gives me the following output from the watch (a single event):

[{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

But reloading using consul reload also triggers this event (as this issue confirms), however, the output from the watch is a list of all events past:

[{"ID":"95a7f085-ac1d-d916-6ba4-749e8d102a5e","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":5},{"ID":"10681ffe-b7b6-a0a0-d3a3-fa802d997258","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":7},{"ID":"5b373801-54f2-2bfa-bcb0-a4bcd3aacacf","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":8},{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

Why is that? It makes very little sense to me.

joelmoss · 2015-07-29T22:20:59Z

@ryanuber what is LTime?

jsok · 2015-07-29T22:32:41Z

LTime is Serf's implementation of a Lamport clock. The docs go into more depth: https://www.serfdom.io/docs/internals/gossip.html

I believe the reason you get all the events on reload is that the other nodes do not know which gossiped event your local agent last received. So to synchronise your local agent, the other nodes will re-send all previous events.

joelmoss · 2015-07-29T22:33:29Z

ok, but right now I have only one node for testing.

BjRo · 2015-08-10T13:53:34Z

Can I work around this by checking and remembering the CONSUL_INDEX env var in my handler (if I have a watch of type key or keyprefix)?
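What such a check might look like is sketched below. It assumes the handler sees the same CONSUL_INDEX value when a reload re-fires a watch whose data has not changed, which this thread does not confirm. `index_changed` and the state file are hypothetical.

```python
import os


def index_changed(state_path, environ=None):
    """Return True when CONSUL_INDEX differs from the last stored value."""
    environ = os.environ if environ is None else environ
    current = environ.get("CONSUL_INDEX")
    if current is None:
        return True  # not invoked as a watch handler: do not suppress

    try:
        with open(state_path, "r", encoding="utf-8") as f:
            previous = f.read().strip()
    except FileNotFoundError:
        previous = None

    if previous == current:
        return False  # same index as last time: likely a re-fire

    with open(state_path, "w", encoding="utf-8") as f:
        f.write(current)
    return True
```

Comparing for equality rather than "greater than" is a deliberate choice here: it keeps the handler working if the index is ever reset to a lower value.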

BjRo · 2015-08-10T14:46:53Z

@armon Do you still plan to add the flag to the watches (to turn off fire-on-create)? If so, any plans when this is going to be available?

armon · 2015-08-11T17:20:22Z

@BjRo We are tracking it here, but no concrete plans. Lots of more pressing things.

ssenaria · 2015-09-26T02:28:52Z

I think I ran into this issue as well. This is stopping us from adopting Consul since it would fire a deploy if we have to restart Consul.

ssenaria · 2015-09-26T03:38:01Z

Any updates on this?

cya9nide · 2015-09-29T19:20:05Z

This is critical for us as part of our one to many plans. I can't have my events all firing anytime there is an edit or reload. I'd consider this a pressing issue.

vkhatri · 2016-01-02T02:19:21Z

I am also planning to use Consul event watches as triggers, whether for a deployment or for an HTTP call fired by a monitoring system to fix a check (e.g. restart ntpd if the NTP peer check failed), along with a bunch of other event handlers. This is still a blocker for me. I am not happy with the workaround, as I had to put all the commands in a script handler instead. As a workaround, I am verifying the payload and restarting the service only if the payload matches.

https://gist.github.com/vkhatri/1c3d9b287338ed0288c0
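A payload check of this kind could look roughly like the sketch below. It is not the script from the gist. It assumes the event watch output carries the payload base64-encoded (or null when the event was fired without one), and `matching_events` is a hypothetical helper.

```python
import base64


def matching_events(events, name, expected_payload):
    """Return events with the given name whose decoded payload matches."""
    matches = []
    for event in events or []:
        if event.get("Name") != name:
            continue
        raw = event.get("Payload")
        # Payload is null when the event was fired without one.
        decoded = base64.b64decode(raw).decode("utf-8", errors="replace") if raw else ""
        if decoded == expected_payload:
            matches.append(event)
    return matches
```

On its own this does not distinguish a replayed event from a new one, so it only narrows what the handler reacts to.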

Would be great to have a configuration parameter to prevent event trigger on service start/reload.

darron · 2016-01-02T18:02:58Z

Because I was tired of launching processes with scripts that kept duplicating functionality, I built a small Go based tool to help with this limitation:

https://github.com/darron/sifter

It handles event and key watches - and doesn't allow the watch to fire if:

Event: It's already seen that LTime value.
Key: the hash of the payload is the same as before.

We've been running it in production to protect event watches for a few weeks now - key watches are less tested but have worked in my local testing.
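For the key case, the hash comparison can be sketched like this. It is not sifter's actual code: the choice of SHA-256, the function name `payload_changed`, and the state file are all assumptions made for illustration.

```python
import hashlib


def payload_changed(payload: bytes, state_path: str) -> bool:
    """Return True when the payload's hash differs from the stored one."""
    digest = hashlib.sha256(payload).hexdigest()
    try:
        with open(state_path, "r", encoding="utf-8") as f:
            if f.read().strip() == digest:
                return False  # same payload as before: do not fire
    except FileNotFoundError:
        pass  # nothing stored yet: treat the payload as new

    with open(state_path, "w", encoding="utf-8") as f:
        f.write(digest)
    return True
```

A wrapper would read the watch payload from stdin as bytes and run the real handler only when this returns True.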

sc0rp10 · 2016-03-01T23:57:43Z

+1
We too need a way to disable "fire-on-create".

eppdot · 2016-03-04T10:48:57Z

We have the same issues with event watches. We want to use events to manually initialize containers or schedule some tasks (like a deploy). But the watches get executed at startup, which is not our desired behaviour. This is also a serious blocker for us using consul. We are thinking about using consul exec instead, but that has serious security issues, as it allows arbitrary commands to be executed. +1

ssenaria · 2016-04-07T14:21:21Z

Any ETA on when a fix or enhancement would be scheduled for?

aashishmodak · 2017-05-17T09:53:24Z

Any update on when this will be fixed?

aroca · 2018-02-01T18:10:05Z

Same issue here. Will try sifter, but it does not seem to be tested with key/keyprefix watches.

Any updates?

Cheers!

daledude · 2018-05-24T03:47:02Z

A way to remove the event from consul might be useful.

jgj1018 · 2018-08-23T02:33:52Z

Got stuck on the same issue. There seems to be no update so far.

yuqiangh · 2018-09-04T20:40:16Z

Want to use "event watch" to start/stop services, but get stuck because of the defect (fire on create). When can this issue be fixed? Thanks.

adawalli · 2019-02-09T00:56:50Z

I am still confused how watches are even all that useful when this behavior is present. If you are requesting a very specific event but a watch fires also during other times (like a reload), what good is the watch?

DImuthuUpe · 2019-12-13T03:28:28Z

Any update on fixing this? It has been a huge pain to manually keep track of old watch events.

Sudhar287 · 2020-03-25T15:17:41Z

I'm trying to replicate this issue and work on a fix.
I created a watch on an event and fired it multiple times. Everything worked as expected.
I tried to do a consul reload but this didn't trigger the event, contrary to the behavior described in a previous comment.
Has this problem been fixed? Is there a PR I can look into?

Sudhar287 · 2020-04-03T20:19:54Z

I've been working on this issue for a while and I think I have a solution. I want to ask some clarifying questions before I send the PR.

I identified that it's the watch plan handler function that's being called whenever a new watch is created, and thus the fire-on-create behavior is observed. Am I correct here? To fix this I've simply skipped calling the handler function the very first time.
Would you like to give the users the power to either enable or disable this fire-on-create behavior? If this is required, then I thought of having a global parameter to set/unset it.

crhino · 2020-04-09T15:49:01Z

@Sudhar287 I believe the consul reload problem still exists. I was able to reproduce this via:

$ consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }'

and then running consul reload in another shell, causing test to be printed out a second time:

$ cat /tmp/consulwatch
test
test

The solution of adding a new parameter onto the watch plan is sensible, but I think there are some subtle edge cases someone might run into with that behavior that we should think through.

The obvious edge case I can see is the following scenario:

I have a watch on the K/V path /foo
I initiate a reload of consul
The consul agent cancels the current watch of /foo
An update to the /foo key is committed to the Raft log
The consul agent starts a new watch of /foo, not firing on create

In this scenario the watch misses an update, which is undesirable.
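The scenario can be reproduced with a toy model. The sketch below is not Consul's watch implementation: `FakeKV` and `Watch` are stand-ins invented for illustration, and the watch is polled by hand instead of blocking on a server.

```python
class FakeKV:
    """Stand-in for a single key with a modify index (not the real API)."""

    def __init__(self):
        self.value, self.index = None, 0

    def put(self, value):
        self.value = value
        self.index += 1


class Watch:
    """A toy watch that compares indexes when polled."""

    def __init__(self, kv, handler, fire_on_create=True):
        self.kv, self.handler = kv, handler
        # Without fire-on-create, the current state counts as already seen.
        self.last_index = None if fire_on_create else kv.index

    def poll(self):
        if self.last_index != self.kv.index:
            self.last_index = self.kv.index
            self.handler(self.kv.value)


def run_scenario(fire_on_create):
    kv, seen = FakeKV(), []
    kv.put("v1")
    watch = Watch(kv, seen.append)  # a watch on /foo exists
    watch.poll()                    # handler sees v1
    watch = None                    # reload cancels the current watch
    kv.put("v2")                    # update is committed while no watch exists
    watch = Watch(kv, seen.append, fire_on_create=fire_on_create)
    watch.poll()                    # the new watch starts
    return seen


print(run_scenario(fire_on_create=True))   # ['v1', 'v2']
print(run_scenario(fire_on_create=False))  # ['v1']
```

With fire-on-create disabled, the handler never sees `v2`, which is the missed update described above.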

For event watches specifically, this might be fine because consul events are already a best-effort mechanism and not designed for guaranteed delivery, but for other types of watches this behavior could be a problem. Perhaps a solution can be designed such that a fire-on-create option only applies to event watches.

Sudhar287 · 2020-04-09T19:47:47Z

Yes @crhino, you are absolutely right in everything you wrote! Thank you for the detailed comment.
I had thought about the edge case too, but I figured it would be okay to skip invoking the handler for events that occurred in the past. Now I know that it's undesirable...

Please let me know the final proposal and I'll work on it. :)

vaLski · 2020-04-10T10:34:12Z

@Sudhar287 I roughly reviewed the code you proposed, and the idea of the operator being able to state explicitly in the watch config whether it should be fired immediately on create saves a lot of complex state tracking. However, I have a question: will a watch with "fireOnCreate":"no" set be executed on initial agent startup? If it will be executed on startup but not on reload, I suggest relabeling it to "fireOnReload": "no".

@crhino Great comment on this one and the edge cases. I think that the current fire-any-time behaviour should remain, and remain the default in future. However, I consider the relatively easy and clean implementation that @Sudhar287 proposed a way to give operators fine-grained control over watches. Experienced ones might want to fine-tune their watches and decide which should stick to the default behaviour and run any time, and which ones they can "sacrifice" and tweak with "fireOnReload":"no" while fully understanding the downsides of their decision. This is somewhat similar to the consul kv get and consul kv get -stale options. I am curious to hear what you think about that.

Sudhar287 · 2020-04-12T04:01:03Z

Thank you for your insights @vaLski. I have some findings/suggestions to present too:

i. Starting the consul agent triggers the watch reload function, and
ii. The watch reload function eventually fires them. Hence, I think a fire-on-reload option is essentially a fire-on-create option.
The consul reload command reloads all the watches. This has to be skipped if we don't want the fire-on-reload/fire-on-create behavior.
The function call stacks differ between watching with consul watch and watching via the consul agent. E.g. consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }' and consul watch -type=key -key=/test echo test >> /tmp/consulwatch are more different than I thought they would be. If we want to modify the behaviour of the application, we have to make it consistent for both commands.
consul reload doesn't seem to have any effect when watching with the consul watch command, which is the reason for my earlier comment.

Please let me know if I've made a mistake, I'm a beginner wrt consul. Please share your opinion and thoughts too. :)

crhino · 2020-04-13T14:24:54Z

The function call stacks differ between watching with consul watch and watching via the consul agent. E.g. consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }' and consul watch -type=key -key=/test echo test >> /tmp/consulwatch are more different than I thought they would be. If we want to modify the behaviour of the application, we have to make it consistent for both commands.

The difference is essentially whether or not the watch is declared in the Consul agent's configuration. A consul reload will only reload the agent's configuration files, which means that any watches run via consul watch are not affected by a reload at all. The Watches docs page explains this as follows:

Watches can be configured as part of the agent's configuration, causing them to run once the agent is initialized. Reloading the agent configuration allows for adding or removing watches dynamically.

Alternatively, the watch command enables a watch to be started outside of the agent. This can be used by an operator to inspect data in Consul or to easily pipe data into processes without being tied to the agent lifecycle.

Hope that makes sense, let me know if there are any questions!

Sudhar287 · 2020-04-17T20:56:29Z

Thank you for the response @crhino!
FYI, I've updated the PR. Some tests are failing in CircleCI but they pass on my local machine. Any tips on how to proceed?

pierresouchay · 2020-04-22T22:51:01Z

@Sudhar287 force push several times until it works. Note that rebasing on HEAD might also help, as many tests are a bit more stable in the latest commits.

Sudhar287 · 2020-05-04T14:10:50Z

Thanks for the response @pierresouchay. Your suggestions were helpful and I've passed the CI. Can one of the official maintainers please have a look at the PR #7616 ?

pmb311 · 2020-05-11T20:07:50Z

What's the status of reviewing and merging this one? It's blocking some of our work that will implement new consul watches.

Sharofiddin · 2023-11-22T11:43:03Z

Will this issue be fixed? I have also encountered this problem. I have about 70 microservices in the mesh whose state I need to watch: if all instances of a service are down or unhealthy, I should prevent other services from sending requests to it. I implemented this with consul-watch, but now whenever a service in the mesh stops, all of my watches are triggered even if nothing happened to their state. This is frustrating.

darron added a commit to octohost/octohost that referenced this issue Jan 6, 2015

Disabling due to hashicorp/consul#571

fd11592

This was referenced Jan 7, 2015

Document: Deleting key does not reload a container. octohost/octohost#89

Closed

Document: When you reload consul - all watches fire. octohost/octohost#88

Open

slackpad mentioned this issue Nov 16, 2015

Watch event fires twice? #1256

Closed

slackpad added the type/enhancement Proposed improvement or new feature label Nov 16, 2015

slackpad mentioned this issue Feb 8, 2017

Deleting old payload of a watch #2618

Closed

slackpad added the theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner label May 25, 2017

pearkes mentioned this issue May 7, 2018

watches was loaded many times #4093

Closed

This was referenced Sep 12, 2018

consul watch handler runs on consul reload / restart #4609

Closed

When upgrading to consul 1.2.2, agents run consul watch handlers #4610

Closed

schristoff added the old-issue label Nov 12, 2019

hanshasselberg removed the close-old-issue-🤖 label Feb 18, 2020

Sudhar287 mentioned this issue Apr 8, 2020

Fixes fire on create issue with watches #7616

Closed

Sudhar287 mentioned this issue Apr 20, 2020

Watch getting triggered in consul reload #7446

Closed

duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021

Add terminating gateway tests without ACL/TLS (hashicorp#571)

acb1bf7

* adding terminating gw test without tls/acls Co-authored-by: Luke Kysow <1034429+lkysow@users.noreply.github.com>
