
Optimize health watching to single chan/goroutine. #5449

Merged 1 commit into master from fix-health-watch-limit on Mar 15, 2019

Conversation

@banks (Member) commented Mar 8, 2019

Refs #4984.

Watching chans for every node we touch in a health query is wasteful. #4984 shows that if there are more than 682 service instances we always fall back to watching all services, which kills performance.

We already have a record in MemDB that is reliably updated whenever the service health result should change, thanks to per-service watch indexes.

So in general, provided there is at least one service instance and we actually have a service index for it (we always do now), we only ever need to watch a single channel.

This saves us from ever falling back to the general index and hitting the performance cliff in #4984, and it also means fewer goroutines and less work for every blocking health query.

It also saves some allocations made during the query, because we no longer have to populate a WatchSet with 3 chans per service instance, which saves the internal map allocation.

This passes all state store tests except the one that explicitly checked for the fallback behaviour we've now optimized away, and in general seems safe.
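
To make the watch strategy concrete, here is a minimal, hypothetical sketch using hashicorp/go-memdb primitives. The "index" table name and the "service.<name>" key format are assumptions modeled loosely on Consul's state store, not code from this PR.

```go
package sketch

import (
	memdb "github.com/hashicorp/go-memdb"
)

// watchServiceHealth adds a single watch channel covering the whole health
// result set for a service, assuming an index record that is bumped
// whenever that service's health could change.
func watchServiceHealth(ws memdb.WatchSet, tx *memdb.Txn, service string) error {
	// FirstWatch returns a channel that is closed when this record changes.
	ch, _, err := tx.FirstWatch("index", "id", "service."+service)
	if err != nil {
		return err
	}
	// One channel per query instead of 3 chans per service instance, and no
	// fallback to a table-wide watch above 682 instances.
	ws.Add(ch)
	return nil
}
```

The caller can then block on the WatchSet as usual (e.g. ws.Watch or ws.WatchCtx) until the per-service index changes.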

@banks added the theme/performance label (Performance benchmarking or potential improvement) on Mar 8, 2019
@banks requested a review from a team on March 8, 2019
@mkeeler (Member) left a comment

If I understand the call graph properly, this should cover the /health/service/:service and /health/connect/:service endpoints and will cause blocking queries on those to use only one goroutine for the watch instead of many.

Couldn't we add similar logic to the ServiceChecks function, like what you added to checkServiceNodes, to gain the same benefits for the /health/checks/:service endpoint? Or is that endpoint different enough for this not to be applicable? (At first glance it looks like it should apply, but I am not certain.) Do we care? I don't think there have been any real-world issues around that endpoint, but in theory it can run into the same problem.

@ShimmerGlass (Contributor) commented:

Very nice 👍 !
This may also have some performance benefits since watching is much cheaper now.

@banks (Member, Author) commented Mar 8, 2019

Note that there is already a bug in Service indexes, as described in #5450. This PR does not fix it, but it will make it slightly worse. I think I'll probably just fix it in this PR though.

> Couldn't we add similar logic into the ServiceChecks

Yeah, we could probably use a similar technique in a lot of places. For now I stuck to the obvious one where the big win is, to try and get this landed; we can potentially use this technique elsewhere later.

@mkeeler (Member) left a comment

@banks Sounds good. The code as is looks great. Going to approve for now. If you do fix that other bit in this PR let me know and I will give it another look.

@banks (Member, Author) commented Mar 11, 2019

@Aestek fixed the other issue in #5458 (thanks!) so I think this is good to merge as-is.

@banks (Member, Author) commented Mar 11, 2019

Here is a very unscientific benchmark to show the difference this makes.

I ran a single Consul in dev mode on my laptop and a synthetic workload (roughly sketched below) that simulates:

  • A test service with 1000 instances
  • An instance is deregistered and then re-registered periodically, in this case with 10 seconds between each change
  • 200 blocking RPCs are made to /health/service/test (direct RPCs, to simulate real server load where the server is not also doing HTTP serving work)
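
A rough, hypothetical sketch of that client workload, assuming the standard github.com/hashicorp/consul/api HTTP client and a local dev agent; the original test drove the server with direct RPCs, so this only approximates the same load, and the instance IDs and ports are made up.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	const service = "test"

	// Register 1000 instances of the test service with the local agent.
	for i := 0; i < 1000; i++ {
		reg := &api.AgentServiceRegistration{
			ID:   fmt.Sprintf("%s-%d", service, i),
			Name: service,
			Port: 8000 + i,
		}
		if err := client.Agent().ServiceRegister(reg); err != nil {
			log.Fatal(err)
		}
	}

	// 200 clients issue blocking queries against /health/service/test.
	for c := 0; c < 200; c++ {
		go func() {
			var index uint64
			for {
				opts := &api.QueryOptions{WaitIndex: index, WaitTime: 5 * time.Minute}
				_, meta, err := client.Health().Service(service, "", false, opts)
				if err != nil {
					log.Println("blocking query error:", err)
					time.Sleep(time.Second)
					continue
				}
				index = meta.LastIndex // block again from the new index
			}
		}()
	}

	// Deregister and re-register one instance with 10 seconds between each
	// change, which wakes every blocking query.
	id := service + "-0"
	for {
		time.Sleep(10 * time.Second)
		if err := client.Agent().ServiceDeregister(id); err != nil {
			log.Println("deregister:", err)
		}
		time.Sleep(10 * time.Second)
		reg := &api.AgentServiceRegistration{ID: id, Name: service, Port: 8000}
		if err := client.Agent().ServiceRegister(reg); err != nil {
			log.Println("register:", err)
		}
	}
}
```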

I then set up Prometheus, Grafana, and the Prometheus node_exporter locally to visualise what is going on.

[Grafana screenshot: goroutine count, CPU usage, RPC rate, and network bandwidth across the two test runs]

The first burst is with master (before this PR); the second is the exact same workload with an agent compiled from this branch.

Notable results:

  • Number of goroutines used for 200 clients and 1000 service instances goes from ~7k to ~400
  • CPU usage is not very different, because updates in this simulation are slow, so CPU is (probably) dominated by encoding the results and sending them. The massive CPU issue in #4984 was caused by frequent updates to many other services (not simulated here); this change should fix that.
  • RPC rate and network bandwidth are the same, since the same work is being done.

There are other optimizations proposed in #4984 that can help more here, but I think this is solid evidence that this patch does what we expect.

@banks (Member, Author) commented Mar 15, 2019

I've not had time to set up a better benchmark that can fully reproduce the excessive CPU usage seen in #4984, but the testing above shows that this patch both works and massively reduces the number of goroutines running for blocking queries, so I'm going to land it as it is.

We have other work we want to do to improve further in #4984.

@banks merged commit 0b5a078 into master on Mar 15, 2019
@banks deleted the fix-health-watch-limit branch on March 15, 2019
@pierresouchay (Contributor) commented:

We are gonna test this asap
