Optimize health watching to single chan/goroutine. #5449
Conversation
Refs #4984.

Watching chans for every node we touch in a health query is wasteful. #4984 shows that if there are more than 682 service instances we always fall back to watching all services, which kills performance.

We already have a record in MemDB that is reliably updated whenever the service health result should change, thanks to per-service watch indexes. So in general, provided there is at least one service instance and we actually have a service index for it (we always do now), we only ever need to watch a single channel.

This saves us from ever falling back to the general index and hitting the performance cliff in #4984, and it also means fewer goroutines and less work for every blocking health query. It also saves some allocations during the query because we no longer have to populate a WatchSet with 3 chans per service instance, which avoids the internal map allocation.

This passes all state store tests except the one that explicitly checked for the fallback behaviour we've now optimized away, and in general seems safe.
If I understand the call graph properly, this should cover the /health/service/:service and /health/connect/:service endpoints and will cause blocking queries on those to use only 1 goroutine for the watch instead of many.

Couldn't we add similar logic to the ServiceChecks function, like what you added to checkServiceNodes, to gain the same benefits for the /health/checks/:service endpoint? Or is that endpoint different enough for it not to be applicable? (At first glance it looks applicable, but I am not certain.) Do we care? I don't think there have been any real-world issues around that other endpoint, but in theory at least it can run into the same problems.
Very nice 👍!
Note that there is already a bug in service indexes, as described in #5450. This PR does not fix it, but it will make it slightly worse. I think I'll probably just fix it in this PR though.
Yeah, we could probably use a similar technique in a lot of places. For now I stuck to the obvious one with the big win, to try and get this landed; we can potentially apply the same technique elsewhere later.
@banks Sounds good. The code as is looks great. Going to approve for now. If you do fix that other bit in this PR let me know and I will give it another look.
@Aestek fixed the other issue in #5458 (thanks!) so I think this is good to merge as-is.
Here is a very unscientific benchmark to show the difference this makes. I ran a single Consul in dev mode on my laptop and a synthetic workload that simulates:
I then set up Prometheus, Grafana, and Prometheus node_exporter locally to visualise what is going on. The first burst is with master (before this PR); the second is the exact same workload against an agent compiled from this branch. Notable results:
There are other optimizations proposed in #4984 that can help more here, but I think this is solid evidence that this patch does what we expect.
I've not had time to set up a better benchmark that fully reproduces the excessive CPU usage in #4984, but the testing above seems to show that this patch both works and massively reduces the number of goroutines running for blocking queries, so I'm going to land it as-is. We have other work planned in #4984 to improve things further.
We are gonna test this asap.