
[BUGFIX] When a node level check is removed, ensure all services of node are notified #3970

Merged

Conversation

pierresouchay
Contributor

Bugfix for #3899

When a node-level check is removed (example: maintenance mode being disabled),
some watchers on the services of that node may have to recompute their state.

If those watchers are performing blocking queries, they have to be notified.
While their state was updated when a node-level check changed or was added,
it was not when the check was removed. This fixes that.

Member

@banks left a comment


Good catch!

A couple of puzzlers in the code, but probably minor fixes/clarifications.

More generally, I'm a little unsettled that bugs like this exist. One of the risks we talked about in your other PR was maintenance overhead: it's easy for things like this to slip in, because any change to state now needs to reason about whether it might affect some other API result and therefore needs to bump the index value on it.

I'm wondering how we can be confident there aren't other state changes we've overlooked?

@@ -1086,6 +1086,43 @@ func (s *Store) EnsureCheck(idx uint64, hc *structs.HealthCheck) error {
return nil
}

// This ensure all connected services to a check are updated properly
func (s *Store) alterServicesFromCheck(tx *memdb.Txn, idx uint64, hc *structs.HealthCheck, checkServiceExists bool) error {
Member


Go style is to start the comment with the name of the method and write full sentence:

// alterServicesFromCheck ensures ... are updated properly.
func (s *Store) alterServicesFromCheck(...)

Also the name confuses me: it's named "alter services", yet it actually updates the service index, right, not the actual service record?

Contributor Author


DONE in next patch

// a registration for the service.
if hc.ServiceID != "" {
service, err := tx.First("services", "id", hc.Node, hc.ServiceID)
// When deleting a service, service might have been removed already
Member


I'm confused. I guess this is referring to the checkServiceExists flag passed in, which seems reasonable, but then in the case where we don't check existence and so skip the first error handler, the second handler will be triggered and return an error anyway?

The caller in deleteCheckTxn doesn't distinguish between those error types, so it will fail if this returns any error anyway. Can we just remove this case, since it seems to do nothing?

Contributor Author


OK, I did it another way in the next patch: I renamed alterServicesFromCheck, and it now only updates all service indexes for a given node.
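
For reference, the shape of that renamed helper is roughly the following. This is a sketch based on the snippets quoted in this review, meant to sit next to them in the state store; the exact function name, the serviceIndexName helper, and the "index" table layout are assumptions and may differ from the merged code.

// updateAllServiceIndexesOfNode bumps the per-service index of every service
// registered on the given node, so that blocking queries watching any of
// those services observe a new X-Consul-Index.
func (s *Store) updateAllServiceIndexesOfNode(tx *memdb.Txn, idx uint64, nodeID string) error {
    services, err := tx.Get("services", "node", nodeID)
    if err != nil {
        return fmt.Errorf("failed updating services for node %s: %s", nodeID, err)
    }
    for service := services.Next(); service != nil; service = services.Next() {
        svc := service.(*structs.ServiceNode).ToNodeService()
        if err := tx.Insert("index", &IndexEntry{serviceIndexName(svc.Service), idx}); err != nil {
            return fmt.Errorf("failed updating index: %s", err)
        }
    }
    return nil
}

The intent is that deleteCheckTxn calls such a helper when the removed check is node-level (hc.ServiceID == ""), which is exactly the maintenance-mode case discussed below.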

@pierresouchay
Contributor Author

pierresouchay commented Mar 19, 2018

@banks I see your concerns

While premature optimization is the root of all evil, we are currently facing very large issues at my company without any optimization on this side; and each time we add a feature, we add the corresponding test as well.

It really changed the way our clusters and even our apps behave, and I think we are not the only ones with large deployments that would benefit greatly from this kind of change.

Furthermore, if you look closely, this bug already existed in the original implementation: the "services" index was not updated in that case either, but it was less visible BECAUSE the index of services used to change a lot. So basically, we are fixing a bug that has existed for a while.

func (s *Store) deleteCheckTxn(tx *memdb.Txn, idx uint64, node string, checkID types.CheckID) error {

I found this bug this morning, but the good news is that among our many watch implementations (we have at least 5 implementations of watches across our systems), that's the only real bug we found. Everything else is:

  • a 3x reduction of the load on the Consul servers
  • very large improvements for all applications performing watches
  • a large reduction of bandwidth between applications, Consul agents, and Consul servers

We had issues with Consul performance for months, and this fixed almost all of them and put our cluster and the whole infrastructure in a sane state.

I can privately send you some metrics regarding our apps / Consul state if you are interested.

Kind Regards

UPDATE

I have been thinking about that a bit. I understand your concerns, so if you really have issues with this optimization, we might add a kind of safety net: a barrier index, such that even if a service has not been updated for 5 minutes or so (for instance, we could use max_wait_duration or something similar), we would still bump the index for watchers. Thus, all systems watching it would never be more than 5 minutes out of date, even in the case of a catastrophic index bug. I don't find it very elegant, but it might be a compromise.
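
As an illustration of that idea (which is not part of this PR), such a safety net could be sketched roughly as follows; the interval handling and the bumpAllServiceIndexes helper are hypothetical, not existing Consul code.

// startIndexBarrier is a hypothetical safety net for the "barrier index" idea:
// even if some state change forgot to bump a service index, watchers would
// never be more than maxStale behind, because every index is touched
// periodically.
func (s *Store) startIndexBarrier(maxStale time.Duration, stopCh <-chan struct{}) {
    ticker := time.NewTicker(maxStale)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // bumpAllServiceIndexes is a hypothetical helper that would open a
            // write transaction and bump the index entry of every service.
            s.bumpAllServiceIndexes()
        case <-stopCh:
            return
        }
    }
}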

@pierresouchay
Contributor Author

@banks Tested in our cluster successfully after applying the patch on all Consul servers

When we set maintenance, all services are impacted by the change (all are notified properly), and when we remove maintenance, all watches on services are notified as well.

Full Test:

  1. Launch a script that blocks until changes occur (a Go sketch of such a watcher follows the diffs below):
consul_check_services_changes.sh consul-agent-http.service.consul:8500/v1/health/service/mesos-ping-probe
X-Consul-Index: 291668094 at Tue 20 Mar 2018 15:34:48 CET
[BLOCKED]
  2. Set up maintenance on one of the nodes running the service
$ consul maint --enable --reason Test
Node maintenance is now enabled
  3. Watcher is immediately notified

(This was already working)

--- /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4	2018-03-20 15:34:48.000000000 +0100
+++ /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4.new	2018-03-20 15:35:21.000000000 +0100
@@ -35,6 +35,20 @@
         "Checks": [
             {
                 "Node": "mesos-slave002-par.central.criteo.preprod",
+                "CheckID": "_node_maintenance",
+                "Name": "Node Maintenance Mode",
+                "Status": "critical",
+                "Notes": "Test",
+                "Output": "",
+                "ServiceID": "",
+                "ServiceName": "",
+                "ServiceTags": [],
+                "Definition": {},
+                "CreateIndex": 291669232,
+                "ModifyIndex": 291669232
+            },
+            {
+                "Node": "mesos-slave002-par.central.criteo.preprod",
                 "CheckID": "serfHealth",
                 "Name": "Serf Health Status",
                 "Status": "passing",
X-Consul-Index: 291669232 at Tue 20 Mar 2018 15:35:21 CET
  4. Disable maintenance
consul maint --disable
Node maintenance is now disabled
  5. Here is the output of the watch

The watch is notified immediately; maintenance has effectively been removed (this was not working before).

--- /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4	2018-03-20 15:35:21.000000000 +0100
+++ /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4.new	2018-03-20 15:35:37.000000000 +0100
@@ -35,20 +35,6 @@
         "Checks": [
             {
                 "Node": "mesos-slave002-par.central.criteo.preprod",
-                "CheckID": "_node_maintenance",
-                "Name": "Node Maintenance Mode",
-                "Status": "critical",
-                "Notes": "Test",
-                "Output": "",
-                "ServiceID": "",
-                "ServiceName": "",
-                "ServiceTags": [],
-                "Definition": {},
-                "CreateIndex": 291669232,
-                "ModifyIndex": 291669232
-            },
-            {
-                "Node": "mesos-slave002-par.central.criteo.preprod",
                 "CheckID": "serfHealth",
                 "Name": "Serf Health Status",
                 "Status": "passing",

@pierresouchay
Contributor Author

@banks it seems Travis tests are unstable again

@pierresouchay
Contributor Author

@banks This fix is very important IMHO since #3899 has been merged. The bug is now very visible and really needs a fix

@guidoiaquinti
Contributor

Can we please mark this bug as blocking for the 1.0.7 release?

@banks
Member

banks commented Mar 28, 2018

@pierresouchay will update soon based on discussions.

One thing I'm not clear on though in the interim: why does #3899 make this so much worse? #3899 only optimises for the index returned in between blocking calls. Once actually blocking, my understanding is that all watchers only return on direct changes to the objects in MemDB they are actually watching - #3899 doesn't change that behaviour.

If node check changes never touched the service at all then surely this maint example never really worked with service watchers anyway?

If node check changes do currently trigger blocked service watchers to return with updated state but without updating the modified index, then it doesn't seem like this is really any worse than before either? I may be missing something here.

Are you sure that #3899 actually made this worse for you? If so, can you explain the mechanism a little more so we can understand the real impact?

I'm not trying to get out of the fix - if it's wrong it needs fixing for sure. I just want to understand if the urgency for 1.0.7 is valid.

@banks
Member

banks commented Mar 28, 2018

To clarify the discussion a little here.

  • The /catalog/service/:service endpoint watches:

    • the services table for changes:

      // List all the services.
      services, err := tx.Get("services", "service", serviceName)
      if err != nil {
          return 0, nil, fmt.Errorf("failed service lookup: %s", err)
      }
      ws.Add(services.WatchCh())
    • the nodes record for each node a service instance is running on:

      // Grab the corresponding node record.
      watchCh, n, err := tx.FirstWatch("nodes", "id", sn.Node)
      if err != nil {
          return nil, fmt.Errorf("failed node lookup: %s", err)
      }
      ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
    • NOT any of the health checks at either level
    • This implies that changes to health checks do not cause blocked watchers on /catalog/service to return, so we probably shouldn't update the modified index for that endpoint.
  • The /health/service/:service endpoint watches:

    • same as above but also all the checks associated with both nodes and services involved:

      // We don't want to track an unlimited number of nodes, so we pull a
      // top-level watch to use as a fallback.
      allNodes, err := tx.Get("nodes", "id")
      if err != nil {
          return 0, nil, fmt.Errorf("failed nodes lookup: %s", err)
      }
      allNodesCh := allNodes.WatchCh()

      // We need a similar fallback for checks. Since services need the
      // status of node + service-specific checks, we pull in a top-level
      // watch over all checks.
      allChecks, err := tx.Get("checks", "id")
      if err != nil {
          return 0, nil, fmt.Errorf("failed checks lookup: %s", err)
      }
      allChecksCh := allChecks.WatchCh()

      results := make(structs.CheckServiceNodes, 0, len(services))
      for _, sn := range services {
          // Retrieve the node.
          watchCh, n, err := tx.FirstWatch("nodes", "id", sn.Node)
          if err != nil {
              return 0, nil, fmt.Errorf("failed node lookup: %s", err)
          }
          ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
          if n == nil {
              return 0, nil, ErrMissingNode
          }
          node := n.(*structs.Node)

          // First add the node-level checks. These always apply to any
          // service on the node.
          var checks structs.HealthChecks
          iter, err := tx.Get("checks", "node_service_check", sn.Node, false)
          if err != nil {
              return 0, nil, err
          }
          ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)
          for check := iter.Next(); check != nil; check = iter.Next() {
              checks = append(checks, check.(*structs.HealthCheck))
          }

          // Now add the service-specific checks.
          iter, err = tx.Get("checks", "node_service", sn.Node, sn.ServiceID)
          if err != nil {
              return 0, nil, err
          }
          ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)
          for check := iter.Next(); check != nil; check = iter.Next() {
              checks = append(checks, check.(*structs.HealthCheck))
          }

          // Append to the results.
          results = append(results, structs.CheckServiceNode{
              Node:    node,
              Service: sn.ToNodeService(),
              Checks:  checks,
          })
      }
    • so blockers on this endpoint will return when a node's checks change

Given that, it would seem like:

  1. It has always been the case that adding a new check to a node will notify actively blocked health queries, but by design will not trigger blocked catalog queries. #3899 (Services Indexes modified per service instead of using a global Index) didn't change that.
  2. There is an existing subtle bug where the X-Consul-Index returned by health blocking endpoints doesn't actually get updated when a check changes status. #3899 actually improved this by touching the service-specific indexes when a service-specific check changed.
  3. This PR makes the same behaviour change that #3899 made for service-specific checks also apply to node checks.
  4. There is a very small potential downside that this will cause X-Consul-Index to increase on catalog responses even when the output doesn't change, but I think it's small enough not to be worth splitting the index records further.

I still think there is a valid issue here, since it's inconsistent that we trigger health watches on a check change but do not update the X-Consul-Index, making it possible for watchers to "miss" the event with unfortunate timing.

But I don't think this is any worse because of #3899; in fact, #3899 fixed the issue for service-specific watches, where it would have been broken before. It's also less bad than it sounds (and probably why it was never reported) in that blocked clients worked just fine before and still do; it's only the subtle possibility of a missed update in between blocking calls that is incorrect. That's bad for sure, but likely not noticeable except in extreme cases.

@pierresouchay does that match your understanding? If I'm right, it helps us understand whether this PR is the right fix and whether it's super urgent for 1.0.7 (it seems not, given it's no worse than before).

@pierresouchay
Contributor Author

@banks I think we have the same understanding.

Basically, updating/adding a check does notify the service/<SERVICE_NAME> check, but removing one does not.
Since we now have an optimised version of watches, clients watching services may have to wait a long time before being notified when a health check is removed.

This is an issue, for instance, with maintenance mode: when we put a node in maintenance, Consul adds a failing health check => everybody sees the change. However, when removing maintenance mode (which basically removes the failing node-level health check), nobody sees the change => the node is still considered unhealthy by watchers and is not taken into account.

That is why this patch addresses ONLY checks being removed.

@banks
Member

banks commented Mar 28, 2018

After discussion with @preetapan, what I said above is mostly correct, although I had missed that #3899 already handles node-level check updates in ensureCheckTxn, and this PR only adds the same logic when removing node-level checks.

That means the impact of not merging this seems even lower, given that adding or failing node checks will work correctly. I think my analysis that this is strictly better than before #3899 still holds, too.

That said, we now (finally) have a good understanding of the subtleties here, and it does seem worth a fix, so we will review it as such. It will likely make it in before 1.0.7, but I don't think it's a showstopper if it doesn't, since the overall behaviour will be no worse than before even without it.

Basically, updating/adding a check does notify the service/<SERVICE_NAME> check

Hmm no it doesn't - updating the index table and "notifying" are different things... see below.

nobody sees the change

I think this is still being overstated (or I'm misunderstanding). Updating the index table doesn't affect anything except which index is returned to blocked callers. Any watch requests that are already blocked and waiting on mem-db changes will see the changes already, and always have.

The fact that the index is not updated only matters for a watcher that happens to be between the last blocking call and the subsequent one at exactly the time the check is removed. That's why it's an edge case: since the vast majority of watchers do not wait in between calls, the window for "missing" the removal is typically very small.

But to repeat I still agree it's better to be correct now we understand the scope :)
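
To make that window concrete, the general shape of a blocking endpoint on the server side is roughly the following; this is a simplified sketch of the pattern, not Consul's actual blockingQuery code, and the run callback stands in for the state-store query.

package sketch

import (
    "time"

    memdb "github.com/hashicorp/go-memdb"
)

// blockingQuery keeps re-running the query until its index moves past the
// index the client already has (?index=...) or the wait time elapses.
func blockingQuery(minQueryIndex uint64, maxWait time.Duration,
    run func(ws memdb.WatchSet) (uint64, error)) error {

    timeout := time.After(maxWait)
    for {
        ws := memdb.NewWatchSet()
        index, err := run(ws)
        if err != nil {
            return err
        }
        // Return right away if the stored index is already newer than the
        // client's index. If a deletion forgot to bump this index, a client
        // that re-polls after the deletion falls through to the wait below
        // even though the data it last saw is already stale.
        if index > minQueryIndex {
            return nil
        }
        // Block until a watched memdb object changes or the wait expires.
        if timedOut := ws.Watch(timeout); timedOut {
            return nil
        }
    }
}

The key point is the index > minQueryIndex comparison: if a state change forgets to bump the index, a watcher that was between calls at that moment blocks instead of returning immediately, which is exactly the missed-update window described above.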

@pierresouchay
Contributor Author

How we discovered the issue:

When we perform important changes to our infrastructure, we put the node in maintenance mode so all watchers are notified: "Stop sending requests to this node". We perform the changes, then disable maintenance if everything is fine.

However, since the optimisation, the watchers are not notified that the node is back online, and they stop sending it traffic for longer than expected.

Contributor

@preetapan left a comment


Will need some local soak testing before merging

@pierresouchay
Contributor Author

@preetapan @banks

For info, we have been running this patch on all of our clusters (prod and preprod) for almost 3 weeks. No performance impact, and all our services (which rely heavily on very long watches) are now behaving well, including during node maintenance.

@pearkes added this to the 1.1.0 milestone Apr 13, 2018
@pierresouchay
Contributor Author

@banks Are you Ok with the changes I made?

For us, 1.0.7 cannot work without this patch, as it disables services on nodes that are in maintenance without enabling them back when we remove node-level maintenance.

It is not a big deal for us, but anyone relying on watches for provisioning might suffer from this as well.

@banks
Member

banks commented Apr 20, 2018

@pierresouchay we'll certainly merge this for the next release. We had a tight deadline for 1.0.7 and hadn't gotten all the testing we wanted done on this, although we're happy with the code as it is.

We went ahead with the release for all the reasons I gave in #3970 (comment), but we will certainly get this in for the next release. Thanks!

@banks
Member

banks commented May 8, 2018

I verified this locally both with consul watch watching for service changes during registration/maintenance and with a bash-script-based watcher that sleeps between blocking queries, to simulate the case that would exercise this index being out of date.

All seems to work. I'm going to land it in time to go through our pre-release soak test, but I don't expect further issues.

@pierresouchay
Contributor Author

@banks Great, I think it is a very good fix to include. Regards
