
[BUGFIX] When a node level check is removed, ensure all services of node are notified #3970

Merged

Conversation

pierresouchay
Contributor

Bugfix for #3899

When a node-level check is removed (example: maintenance mode being disabled),
some watchers on the services of that node may have to recompute their state.

If those watchers are performing blocking queries, they have to be notified.
While their state was updated when a node-level check changed or was added,
it was not when the check was removed. This fixes that.

Member

@banks left a comment


Good catch!

A couple of puzzlers in the code, but probably minor fixes/clarifications.

More generally, I'm a little unsettled that bugs like this exist. One of the risks we talked about in your other PR was maintenance overhead: it's easy for things like this to slip in, because any change to state now needs to reason about whether it might affect some other API result and therefore needs to bump the index value on it.

I'm wondering how we can be confident there aren't other state changes we've overlooked?

@@ -1086,6 +1086,43 @@ func (s *Store) EnsureCheck(idx uint64, hc *structs.HealthCheck) error {
return nil
}

// This ensure all connected services to a check are updated properly
func (s *Store) alterServicesFromCheck(tx *memdb.Txn, idx uint64, hc *structs.HealthCheck, checkServiceExists bool) error {
Member


Go style is to start the comment with the name of the method and write full sentence:

// alterServicesFromCheck ensures ... are updated properly.
func (s *Store) alterServicesFromCheck(...)

Also the name confuses me: it's named "alter services", yet it actually updates the service index, right, not the actual service record?

Contributor Author


DONE in next patch

// a registration for the service.
if hc.ServiceID != "" {
service, err := tx.First("services", "id", hc.Node, hc.ServiceID)
// When deleting a service, service might have been removed already
Member


I'm confused. I guess this is referring to the checkServiceExists flag passed in, which seems reasonable, but then in the case where we don't check existence and so skip the first error handler, the second handler will be triggered and return an error anyway?

The caller in deleteCheckTxn doesn't distinguish between those error types, so it will fail if this returns any error anyway. Can we just remove this case, since it seems to do nothing?

Contributor Author


OK, I did it another way in the next patch: I renamed alterServicesFromCheck, and it now only updates all service indexes for a given node.
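
For reference, the shape of that renamed helper is roughly the following. This is a sketch based on the snippets quoted in this review, meant to sit next to them in the state store; the exact function name, the serviceIndexName helper, and the "index" table layout are assumptions and may differ from the merged code.

// updateAllServiceIndexesOfNode bumps the per-service index of every service
// registered on the given node, so that blocking queries watching any of
// those services observe a new X-Consul-Index.
func (s *Store) updateAllServiceIndexesOfNode(tx *memdb.Txn, idx uint64, nodeID string) error {
    services, err := tx.Get("services", "node", nodeID)
    if err != nil {
        return fmt.Errorf("failed updating services for node %s: %s", nodeID, err)
    }
    for service := services.Next(); service != nil; service = services.Next() {
        svc := service.(*structs.ServiceNode).ToNodeService()
        if err := tx.Insert("index", &IndexEntry{serviceIndexName(svc.Service), idx}); err != nil {
            return fmt.Errorf("failed updating index: %s", err)
        }
    }
    return nil
}

The intent is that deleteCheckTxn calls such a helper when the removed check is node-level (hc.ServiceID == ""), which is exactly the maintenance-mode case discussed below.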

@pierresouchay
Contributor Author

pierresouchay commented Mar 19, 2018

@banks I see your concerns

While premature optimization is the root of all evil, we are currently facing very large issues at my company without any optimization on this side; and each time we add a feature, we add the corresponding test as well.

It really changed the way our clusters and even our apps behave, and I think we are not the only ones with large deployments that would benefit greatly from this kind of change.

Furthermore, if you look closely, this bug already existed in the original implementation: the "services" index was not updated in that case either, but it was less visible BECAUSE the index of services used to change a lot. So basically, we are fixing a bug that has existed for a while.

func (s *Store) deleteCheckTxn(tx *memdb.Txn, idx uint64, node string, checkID types.CheckID) error {

I found this bug this morning, but the good news is that among our many watch implementations (we have at least 5 implementations of watches across our systems), that's the only real bug we found. Everything else is:

  • a 3x reduction of the load on the Consul servers
  • very large improvements for all applications performing watches
  • a large reduction of bandwidth between applications, Consul agents, and Consul servers

We had issues with Consul performance for months, and this fixed almost all of them and put our cluster and the whole infrastructure in a sane state.

I can privately send you some metrics regarding our apps / Consul state if you are interested.

Kind Regards

UPDATE

I have been thinking about that a bit. I understand your concerns, so if you really have issues with this optimization, we might add a kind of safety net: a barrier index, such that even if a service has not been updated for 5 minutes or so (for instance, we could use max_wait_duration or something similar), we would still bump the index for watchers. Thus, all systems watching it would never be more than 5 minutes out of date, even in the case of a catastrophic index bug. I don't find it very elegant, but it might be a compromise.
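
As an illustration of that idea (which is not part of this PR), such a safety net could be sketched roughly as follows; the interval handling and the bumpAllServiceIndexes helper are hypothetical, not existing Consul code.

// startIndexBarrier is a hypothetical safety net for the "barrier index" idea:
// even if some state change forgot to bump a service index, watchers would
// never be more than maxStale behind, because every index is touched
// periodically.
func (s *Store) startIndexBarrier(maxStale time.Duration, stopCh <-chan struct{}) {
    ticker := time.NewTicker(maxStale)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // bumpAllServiceIndexes is a hypothetical helper that would open a
            // write transaction and bump the index entry of every service.
            s.bumpAllServiceIndexes()
        case <-stopCh:
            return
        }
    }
}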

@pierresouchay
Contributor Author

@banks Tested in our cluster successfully after applying the patch on all Consul servers

When we set maintenance, all services are impacted by the change (all are notified properly), and when we remove maintenance, all watches on services are notified as well.

Full Test:

  1. Launch a script that blocks until changes occur (a Go sketch of such a watcher follows the diffs below):
consul_check_services_changes.sh consul-agent-http.service.consul:8500/v1/health/service/mesos-ping-probe
X-Consul-Index: 291668094 at Tue 20 Mar 2018 15:34:48 CET
[BLOCKED]
  2. Set up maintenance on one of the nodes running the service
$ consul maint --enable --reason Test
Node maintenance is now enabled
  3. Watcher is immediately notified

(This was already working)

--- /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4	2018-03-20 15:34:48.000000000 +0100
+++ /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4.new	2018-03-20 15:35:21.000000000 +0100
@@ -35,6 +35,20 @@
         "Checks": [
             {
                 "Node": "mesos-slave002-par.central.criteo.preprod",
+                "CheckID": "_node_maintenance",
+                "Name": "Node Maintenance Mode",
+                "Status": "critical",
+                "Notes": "Test",
+                "Output": "",
+                "ServiceID": "",
+                "ServiceName": "",
+                "ServiceTags": [],
+                "Definition": {},
+                "CreateIndex": 291669232,
+                "ModifyIndex": 291669232
+            },
+            {
+                "Node": "mesos-slave002-par.central.criteo.preprod",
                 "CheckID": "serfHealth",
                 "Name": "Serf Health Status",
                 "Status": "passing",
X-Consul-Index: 291669232 at Tue 20 Mar 2018 15:35:21 CET
  4. Disable maintenance
consul maint --disable
Node maintenance is now disabled
  5. Here is the output of the watch

The watch is notified immediately; maintenance has effectively been removed (this was not working before).

--- /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4	2018-03-20 15:35:21.000000000 +0100
+++ /var/folders/yz/ln_hlbcd1_bczjfwwptyzms9vyfn_2/T/tmp.0Hv2HWf4.new	2018-03-20 15:35:37.000000000 +0100
@@ -35,20 +35,6 @@
         "Checks": [
             {
                 "Node": "mesos-slave002-par.central.criteo.preprod",
-                "CheckID": "_node_maintenance",
-                "Name": "Node Maintenance Mode",
-                "Status": "critical",
-                "Notes": "Test",
-                "Output": "",
-                "ServiceID": "",
-                "ServiceName": "",
-                "ServiceTags": [],
-                "Definition": {},
-                "CreateIndex": 291669232,
-                "ModifyIndex": 291669232
-            },
-            {
-                "Node": "mesos-slave002-par.central.criteo.preprod",
                 "CheckID": "serfHealth",
                 "Name": "Serf Health Status",
                 "Status": "passing",

@pierresouchay
Contributor Author

@banks it seems Travis tests are unstable again

@pierresouchay
Contributor Author

@banks This fix is very important IMHO since #3899 has been merged. The bug is now very visible and really needs a fix

@guidoiaquinti
Contributor

Can we please mark this bug as blocking for the 1.0.7 release?

@banks
Member

banks commented Mar 28, 2018

@pierresouchay will update soon based on discussions.

One thing I'm not clear on though in the interim: why does #3899 make this so much worse? #3899 only optimises for the index returned in between blocking calls. Once actually blocking, my understanding is that all watchers only return on direct changes to the objects in MemDB they are actually watching - #3899 doesn't change that behaviour.

If node check changes never touched the service at all then surely this maint example never really worked with service watchers anyway?

If node check changes do currently trigger blocked service watchers to return with updated state but without updating the modified index, then it doesn't seem like this is really any worse than before either? I may be missing something here.

Are you sure that #3899 actually made this worse for you? If so, can you explain the mechanism a little more so we can understand the real impact?

I'm not trying to get out of the fix - if it's wrong it needs fixing for sure. I just want to understand if the urgency for 1.0.7 is valid.

@banks
Member

banks commented Mar 28, 2018

To clarify the discussion a little here.

  • The /catalog/service/:service endpoint watches:

    • the services table for changes:

      // List all the services.
      services, err := tx.Get("services", "service", serviceName)
      if err != nil {
          return 0, nil, fmt.Errorf("failed service lookup: %s", err)
      }
      ws.Add(services.WatchCh())
    • the nodes record for each node a service instance is running on:

      // Grab the corresponding node record.
      watchCh, n, err := tx.FirstWatch("nodes", "id", sn.Node)
      if err != nil {
          return nil, fmt.Errorf("failed node lookup: %s", err)
      }
      ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
    • NOT any of the health checks at either level
    • This implies that changes to health checks do not cause blocked watchers on /catalog/service to return, so we probably shouldn't update the modified index for that endpoint.
  • The /health/service/:service endpoint watches:

    • same as above but also all the checks associated with both nodes and services involved:

      // We don't want to track an unlimited number of nodes, so we pull a
      // top-level watch to use as a fallback.
      allNodes, err := tx.Get("nodes", "id")
      if err != nil {
          return 0, nil, fmt.Errorf("failed nodes lookup: %s", err)
      }
      allNodesCh := allNodes.WatchCh()

      // We need a similar fallback for checks. Since services need the
      // status of node + service-specific checks, we pull in a top-level
      // watch over all checks.
      allChecks, err := tx.Get("checks", "id")
      if err != nil {
          return 0, nil, fmt.Errorf("failed checks lookup: %s", err)
      }
      allChecksCh := allChecks.WatchCh()

      results := make(structs.CheckServiceNodes, 0, len(services))
      for _, sn := range services {
          // Retrieve the node.
          watchCh, n, err := tx.FirstWatch("nodes", "id", sn.Node)
          if err != nil {
              return 0, nil, fmt.Errorf("failed node lookup: %s", err)
          }
          ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
          if n == nil {
              return 0, nil, ErrMissingNode
          }
          node := n.(*structs.Node)

          // First add the node-level checks. These always apply to any
          // service on the node.
          var checks structs.HealthChecks
          iter, err := tx.Get("checks", "node_service_check", sn.Node, false)
          if err != nil {
              return 0, nil, err
          }
          ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)
          for check := iter.Next(); check != nil; check = iter.Next() {
              checks = append(checks, check.(*structs.HealthCheck))
          }

          // Now add the service-specific checks.
          iter, err = tx.Get("checks", "node_service", sn.Node, sn.ServiceID)
          if err != nil {
              return 0, nil, err
          }
          ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)
          for check := iter.Next(); check != nil; check = iter.Next() {
              checks = append(checks, check.(*structs.HealthCheck))
          }

          // Append to the results.
          results = append(results, structs.CheckServiceNode{
              Node:    node,
              Service: sn.ToNodeService(),
              Checks:  checks,
          })
      }
    • so blockers on this endpoint will return when a node's checks change

Given that, it would seem like:

  1. It has always been the case that adding a new check to a node will notify actively blocked health queries, but by design will not trigger blocked catalog queries. #3899 (Services Indexes modified per service instead of using a global Index) didn't change that.
  2. There is an existing subtle bug where the X-Consul-Index returned by health blocking endpoints doesn't actually get updated when a check changes status. #3899 actually improved this by touching the service-specific indexes when a service-specific check changed.
  3. This PR makes the same behaviour change that #3899 made for service-specific checks also apply to node checks.
  4. There is a very small potential downside that this will cause X-Consul-Index to increase on catalog responses even when the output doesn't change, but I think it's small enough not to be worth splitting the index records further.

I still think there is a valid issue here, since it's inconsistent that we trigger health watches on a check change but do not update the X-Consul-Index, making it possible for watchers to "miss" the event with unfortunate timing.

But I don't think this is any worse because of #3899; in fact, #3899 fixed the issue for service-specific watches, where it would have been broken before. It's also less bad than it sounds (and probably why it was never reported) in that blocked clients worked just fine before and still do; it's only the subtle possibility of a missed update in between blocking calls that is incorrect. That's bad for sure, but likely not noticeable except in extreme cases.

@pierresouchay does that match your understanding? If I'm right, it helps us understand whether this PR is the right fix and whether it's super urgent for 1.0.7 (it seems not, given it's no worse than before).

@pierresouchay
Contributor Author

@banks I think we have the same understanding.

Basically, updating/adding a check does notify the service/<SERVICE_NAME> check, but removing one does not.
Since we now have an optimised version of watches, clients watching services may have to wait a long time before being notified when a health check is removed.

This is an issue, for instance, with maintenance mode: when we put a node in maintenance, Consul adds a failing health check => everybody sees the change. However, when removing maintenance mode (which basically removes the failing node-level health check), nobody sees the change => the node is still considered unhealthy by watchers and is not taken into account.

That is why this patch addresses ONLY checks being removed.

@banks
Member

banks commented Mar 28, 2018

After discussion with @preetapan, what I said above is mostly correct, although I had missed that #3899 already handles node-level check updates in ensureCheckTxn, and this PR only adds the same logic when removing node-level checks.

That means the impact of not merging this seems even lower, given that adding or failing node checks will work correctly. I think my analysis that this is strictly better than before #3899 still holds, too.

That said, we now (finally) have a good understanding of the subtleties here, and it does seem worth a fix, so we will review it as such. It will likely make it in before 1.0.7, but I don't think it's a showstopper if it doesn't, since the overall behaviour will be no worse than before even without it.

Basically, updating/adding a check does notify the service/<SERVICE_NAME> check

Hmm no it doesn't - updating the index table and "notifying" are different things... see below.

nobody sees the change

I think this is still being overstated (or I'm misunderstanding). Updating the index table doesn't affect anything except which index is returned to blocked callers. Any watch requests that are already blocked and waiting on mem-db changes will see the changes already, and always have.

The fact that the index is not updated only matters for a watcher that happens to be between the last blocking call and the subsequent one at exactly the time the check is removed. That's why it's an edge case: since the vast majority of watchers do not wait in between calls, the window for "missing" the removal is typically very small.

But to repeat I still agree it's better to be correct now we understand the scope :)
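
To make that window concrete, the general shape of a blocking endpoint on the server side is roughly the following; this is a simplified sketch of the pattern, not Consul's actual blockingQuery code, and the run callback stands in for the state-store query.

package sketch

import (
    "time"

    memdb "github.com/hashicorp/go-memdb"
)

// blockingQuery keeps re-running the query until its index moves past the
// index the client already has (?index=...) or the wait time elapses.
func blockingQuery(minQueryIndex uint64, maxWait time.Duration,
    run func(ws memdb.WatchSet) (uint64, error)) error {

    timeout := time.After(maxWait)
    for {
        ws := memdb.NewWatchSet()
        index, err := run(ws)
        if err != nil {
            return err
        }
        // Return right away if the stored index is already newer than the
        // client's index. If a deletion forgot to bump this index, a client
        // that re-polls after the deletion falls through to the wait below
        // even though the data it last saw is already stale.
        if index > minQueryIndex {
            return nil
        }
        // Block until a watched memdb object changes or the wait expires.
        if timedOut := ws.Watch(timeout); timedOut {
            return nil
        }
    }
}

The key point is the index > minQueryIndex comparison: if a state change forgets to bump the index, a watcher that was between calls at that moment blocks instead of returning immediately, which is exactly the missed-update window described above.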

@pierresouchay
Contributor Author

How we discovered the issue:

When we perform important changes to our infrastructure, we put the node in maintenance mode so all watchers are notified: "Stop sending requests to this node". We perform the changes, then disable maintenance if everything is fine.

However, since the optimisation, the watchers are not notified that the node is back online, and they stop sending it traffic for longer than expected.

Contributor

@preetapan left a comment


Will need some local soak testing before merging

@pierresouchay
Contributor Author

@preetapan @banks

For info, we have been running this patch on all of our clusters (prod and preprod) for almost 3 weeks. No performance impact, and all our services (which rely heavily on very long watches) are now behaving well, including during node maintenance.

@pearkes added this to the 1.1.0 milestone Apr 13, 2018
@pierresouchay
Contributor Author

@banks Are you Ok with the changes I made?

For us, 1.0.7 cannot work without this patch, as it disables services on nodes that are in maintenance without enabling them back when we remove node-level maintenance.

It is not a big deal for us, but anyone relying on watches for provisioning might suffer from this as well.

@banks
Member

banks commented Apr 20, 2018

@pierresouchay we'll certainly merge this for the next release. We had a tight deadline for 1.0.7 and hadn't gotten all the testing we wanted done on this, although we're happy with the code as it is.

We went ahead with the release for all the reasons I gave in #3970 (comment), but we will certainly get this in for the next release. Thanks!

@banks
Member

banks commented May 8, 2018

I verified this locally both with consul watch watching for service changes during registration/maintenance and with a bash-script-based watcher that sleeps between blocking queries, to simulate the case that would exercise this index being out of date.

All seems to work. I'm going to land it in time to go through our pre-release soak test, but I don't expect further issues.

@pierresouchay
Contributor Author

@banks Great, I think it is a very good fix to include. Regards
