Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connect: health API blocking doesn't notice changes to proxy health checks #5506

Closed
banks opened this issue Mar 18, 2019 · 1 comment · Fixed by #5508
Closed

Connect: health API blocking doesn't notice changes to proxy health checks #5506

banks opened this issue Mar 18, 2019 · 1 comment · Fixed by #5508
Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected
Milestone

Comments

@banks
Copy link
Member

banks commented Mar 18, 2019

While investigating #5499 and #5332 I discovered that /health/connect/:service doesn't unblock watchers if a health check on a proxy instance changes state. The X-Consul-Index also doesn't update.

At first I thought it was related to my optimization in #5449, but even before that the same bug existed for connect discovery with proxies.

The problem seems to be this:

  • in catalog checkServiceNodes, we switch the index for connect queries, but then we continue to use the target service name when we choose the consul index:
    idx, ch := maxIndexAndWatchChForService(tx, serviceName, len(results) > 0, true)
  • that index is not updated when changes are made to the proxy instance or it's checks so we "miss" unblocking on changes to those
  • before my optimization, the WatchSet would have unblocked watchers because it would have been watching the proxy checks loaded in parseCheckServiceNodes explicitly too, but since the index returned from the re-query was the same, the query would still not return. So after Optimize health watching to single chan/goroutine. #5449 we have identical (broken) behaviour but at least we no longer waste CPU cycles re-querying when we won't actually return the result!

We should:

  • Add a test for this case in connect watching. I think we have tests of alias checks and noticing target service updates but somehow not noticing this case.
  • Fix it with one of the two methods described below.

Option 1

Add a new index type called service-connect which is updated all the same times that the per-service index is updated, and also updated any time a proxy instance that would target that service is updated.

This is "simple" but it requires more storage and crucially plumbing through all the same places that our current service index update happens.

Even worse, it requires plumbing context about whether the service is a connect proxy (or looking it up again) into places like ensureCheckTxn where it's not needed now.

Option 2

A less invasive option that requires no addition updates or storage and only a single additional lookup/watch chan in the common case would be:

  • For connect queries, compile a list of all the proxy serivice names in the result set
  • lookup the service indexes for all of those (typical case only 1 extra)
  • return the max of that and the target service index
  • watch all of the service index watch chans

This is almost ideal, the one edge case it doesn't catch is if a proxy is registered with a new proxy service name that no other proxy is using (maybe it's the first proxy for that service or has a snowflake naming convention). In that case, watchers would miss the addition if it didn't also happen at the same time as a change to the actual target service registration.

We can solve for this one edge case by:

  • Also watch the services.connect index iterator which will be invalidated by any changes to the proxies that are registered.

This keeps the number of watch chans small - typically 3 - without any invasive updates or negative impact on other query types triggering more often than they need etc.

Unless I'm missing something still Option 2 seems the clear winner.

@banks banks added type/bug Feature does not function as expected theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies labels Mar 18, 2019
@banks banks added this to the 1.5 milestone Mar 18, 2019
@banks
Copy link
Member Author

banks commented Mar 18, 2019

This is the implementation of option 2 and seems to work from ad-hoc testing.

Adding it here just because I've not written tests yet and want to unblock myself on other things.

diff --git a/agent/consul/state/catalog.go b/agent/consul/state/catalog.go
index 4452298b7..ac28624c8 100644
--- a/agent/consul/state/catalog.go
+++ b/agent/consul/state/catalog.go
@@ -1896,8 +1896,18 @@ func (s *Store) checkServiceNodes(ws memdb.WatchSet, serviceName string, connect

 	// Return the results.
 	var results structs.ServiceNodes
+	// For connect queries we need a list of any proxy service names in the result
+	// set
+	var serviceNames map[string]struct{}
+	if connect {
+		serviceNames = make(map[string]struct{})
+	}
 	for service := iter.Next(); service != nil; service = iter.Next() {
-		results = append(results, service.(*structs.ServiceNode))
+		sn := service.(*structs.ServiceNode)
+		results = append(results, sn)
+		if serviceNames != nil && sn.ServiceName != serviceName {
+			serviceNames[sn.ServiceName] = struct{}{}
+		}
 	}

 	// Get the table index.
@@ -1923,6 +1933,34 @@ func (s *Store) checkServiceNodes(ws memdb.WatchSet, serviceName string, connect
 		// when watchLimit is hit (~682 service instances). See
 		// https://github.com/hashicorp/consul/issues/4984
 		ws.Add(ch)
+
+		if connect {
+			// If this is a connect query then there is a subtlety to watch out for.
+			// In addition to watching the proxy service indexes for changes below, we
+			// need to still keep an eye on the connect service index in case a new
+			// proxy with a new name registers - we are only watching proxy service
+			// names we know about below so we'd miss that otherwise. Thankfully this
+			// is only ever one extra chan to watch and will catch any changes to
+			// proxy registrations for this target service.
+			ws.Add(iter.WatchCh())
+		}
+	}
+
+	if len(serviceNames) > 0 {
+		// Also need to fetch the indexes and watch chans for the proxy service
+		// names in the result set.
+		for proxySvc := range serviceNames {
+			proxyIdx, proxyCh := maxIndexAndWatchChForService(tx, proxySvc, true, true)
+			if idx < proxyIdx {
+				idx = proxyIdx
+			}
+			// Only need to watch this index if we are using the index watch
+			// optimization above if not we'll be watching all the radix nodes that
+			// matter anyway.
+			if ch != nil {
+				ws.Add(proxyCh)
+			}
+		}
 	}

 	return s.parseCheckServiceNodes(tx, fallbackWS, idx, serviceName, results, err)

banks added a commit that referenced this issue Mar 19, 2019
banks added a commit that referenced this issue Mar 21, 2019
* Make Connect health queryies unblock correctly in all cases and use optimal number of watch chans. Fixes #5506.

* Node check test cases and clearer bug test doc

* Comment update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant