
Optimize health watching to single chan/goroutine. #5449

Merged 1 commit into master from fix-health-watch-limit on Mar 15, 2019

Conversation

@banks (Member) commented Mar 8, 2019

Refs #4984.

Watching chans for every node we touch in a health query is wasteful. #4984 shows that if there are more than 682 service instances we always fall back to watching all services, which kills performance.

We already have a record in MemDB that is reliably updated whenever the service health result should change, thanks to per-service watch indexes.

So in general, provided there is at least one service instance and we actually have a service index for it (we always do now), we only ever need to watch a single channel.

This saves us from ever falling back to the general index and hitting the performance cliff in #4984, and it also means fewer goroutines and less work for every blocking health query.

It also saves some allocations made during the query, because we no longer have to populate a WatchSet with 3 chans per service instance, which saves the internal map allocation.

This passes all state store tests except the one that explicitly checked for the fallback behaviour we've now optimized away, and in general seems safe.
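
To make the watch strategy concrete, here is a minimal, hypothetical sketch using hashicorp/go-memdb primitives. The "index" table name and the "service.<name>" key format are assumptions modeled loosely on Consul's state store, not code from this PR.

```go
package sketch

import (
	memdb "github.com/hashicorp/go-memdb"
)

// watchServiceHealth adds a single watch channel covering the whole health
// result set for a service, assuming an index record that is bumped
// whenever that service's health could change.
func watchServiceHealth(ws memdb.WatchSet, tx *memdb.Txn, service string) error {
	// FirstWatch returns a channel that is closed when this record changes.
	ch, _, err := tx.FirstWatch("index", "id", "service."+service)
	if err != nil {
		return err
	}
	// One channel per query instead of 3 chans per service instance, and no
	// fallback to a table-wide watch above 682 instances.
	ws.Add(ch)
	return nil
}
```

The caller can then block on the WatchSet as usual (e.g. ws.Watch or ws.WatchCtx) until the per-service index changes.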

@banks added the theme/performance label (Performance benchmarking or potential improvement) on Mar 8, 2019
@banks requested a review from a team on March 8, 2019
@mkeeler (Member) left a comment

If I understand the call graph properly, this should cover the /health/service/:service and /health/connect/:service endpoints and will cause blocking queries on those to use only one goroutine for the watch instead of many.

Couldn't we add similar logic to the ServiceChecks function, like what you added to checkServiceNodes, to gain the same benefits for the /health/checks/:service endpoint? Or is that endpoint different enough for this not to be applicable? (At first glance it looks like it should apply, but I am not certain.) Do we care? I don't think there have been any real-world issues around that endpoint, but in theory it can run into the same problem.

@ShimmerGlass (Contributor) commented:

Very nice 👍 !
This may also have some performance benefits since watching is much cheaper now.

@banks (Member, Author) commented Mar 8, 2019

Note that there is already a bug in Service indexes, as described in #5450. This PR does not fix it, but it will make it slightly worse. I think I'll probably just fix it in this PR though.

> Couldn't we add similar logic into the ServiceChecks

Yeah, we could probably use a similar technique in a lot of places. For now I stuck to the obvious one where the big win is, to try and get this landed; we can potentially use this technique elsewhere later.

@mkeeler (Member) left a comment

@banks Sounds good. The code as is looks great. Going to approve for now. If you do fix that other bit in this PR let me know and I will give it another look.

@banks (Member, Author) commented Mar 11, 2019

@Aestek fixed the other issue in #5458 (thanks!) so I think this is good to merge as-is.

@banks (Member, Author) commented Mar 11, 2019

Here is a very unscientific benchmark to show the difference this makes.

I ran a single Consul in dev mode on my laptop and a synthetic workload (roughly sketched below) that simulates:

  • A test service with 1000 instances
  • An instance is deregistered and then re-registered periodically, in this case with 10 seconds between each change
  • 200 blocking RPCs are made to /health/service/test (direct RPCs, to simulate real server load where the server is not also doing HTTP serving work)
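
A rough, hypothetical sketch of that client workload, assuming the standard github.com/hashicorp/consul/api HTTP client and a local dev agent; the original test drove the server with direct RPCs, so this only approximates the same load, and the instance IDs and ports are made up.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	const service = "test"

	// Register 1000 instances of the test service with the local agent.
	for i := 0; i < 1000; i++ {
		reg := &api.AgentServiceRegistration{
			ID:   fmt.Sprintf("%s-%d", service, i),
			Name: service,
			Port: 8000 + i,
		}
		if err := client.Agent().ServiceRegister(reg); err != nil {
			log.Fatal(err)
		}
	}

	// 200 clients issue blocking queries against /health/service/test.
	for c := 0; c < 200; c++ {
		go func() {
			var index uint64
			for {
				opts := &api.QueryOptions{WaitIndex: index, WaitTime: 5 * time.Minute}
				_, meta, err := client.Health().Service(service, "", false, opts)
				if err != nil {
					log.Println("blocking query error:", err)
					time.Sleep(time.Second)
					continue
				}
				index = meta.LastIndex // block again from the new index
			}
		}()
	}

	// Deregister and re-register one instance with 10 seconds between each
	// change, which wakes every blocking query.
	id := service + "-0"
	for {
		time.Sleep(10 * time.Second)
		if err := client.Agent().ServiceDeregister(id); err != nil {
			log.Println("deregister:", err)
		}
		time.Sleep(10 * time.Second)
		reg := &api.AgentServiceRegistration{ID: id, Name: service, Port: 8000}
		if err := client.Agent().ServiceRegister(reg); err != nil {
			log.Println("register:", err)
		}
	}
}
```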

I then set up Prometheus, Grafana, and the Prometheus node_exporter locally to visualise what is going on.

[Grafana screenshot: goroutine count, CPU usage, RPC rate, and network bandwidth across the two test runs]

The first burst is with master (before this PR); the second is the exact same workload with an agent compiled from this branch.

Notable results:

  • Number of goroutines used for 200 clients and 1000 service instances goes from ~7k to ~400
  • CPU usage is not very different, because updates in this simulation are slow, so CPU is (probably) dominated by encoding the results and sending them. The massive CPU issue in #4984 was caused by frequent updates to many other services (not simulated here); this change should fix that.
  • RPC rate and network bandwidth are the same, since the same work is being done.

There are other optimizations proposed in #4984 that can help more here, but I think this is solid evidence that this patch does what we expect.

@banks (Member, Author) commented Mar 15, 2019

I've not had time to set up a better benchmark that can fully reproduce the excessive CPU usage seen in #4984, but the testing above shows that this patch both works and massively reduces the number of goroutines running for blocking queries, so I'm going to land it as it is.

We have other work we want to do to improve further in #4984.

@banks merged commit 0b5a078 into master on Mar 15, 2019
@banks deleted the fix-health-watch-limit branch on March 15, 2019
@pierresouchay (Contributor) commented:

We are gonna test this asap
