streaming: split event buffer by key #12080
Conversation
This is a prerequisite for splitting the internal topic buffer by key. While we technically could support subscribing to all of a topic's events by passing an empty key, it would require publishing events on multiple buffers (a key-specific and a global buffer), and as there isn't a compelling real-world use-case for this, it doesn't seem worth the extra complexity.
Amazing work @boxofrad 👏
I left a few inline comments, mostly notes to myself or random thoughts that don't need changes. There are one or two small things I think it would be nice to change though, including fixing the racy map accesses in tests that presumably fail the race detector?
The only other thing that seems important to consider is whether we can simplify the code (and the reasoning about its correctness, although I think it is correct) and have the bookkeeping function happen atomically with the unsubscribe state being updated.
Great work, and excellent detail in the description!
I've only done a quick review so far. Mostly responding to the questions and commenting on stylistic things like TopicKey. I'll hopefully have a chance to read over more of the functional changes in the last two commits next week.
I think all my comments were addressed here so LGTM! I think Daniel said he'd not reviewed in depth yet so I'd wait for an approval from him too!
Looks great! A couple minor questions and suggestions, but nothing blocking a merge.
Summary
This is an improvement to Consul's streaming backend to address performance problems observed in cases of high parallelism - i.e. many subscribers, high service churn, and servers with a large number of CPU cores.
It reduces lock-contention in the kernel's futex subsystem (more on that below) by only waking subscribers when there is an update that is relevant to them.
Background
Events are published to subscribers by appending them to a data structure called the eventBuffer. It is a single-writer, multiple-reader linked list in which each element contains a channel that is used to notify subscribers when the subsequent element is available (similar to a monitor/condition variable). Subscribers block on a receive from this channel until it is closed by the publisher — this pattern will be familiar if you've used the context package (particularly the Done method).

Currently, there is a single eventBuffer for the entire topic. In other words, subscribers will be woken whenever any health event occurs, regardless of whether or not it pertains to a service they're watching, and it's up to them to ignore events that they're not interested in.

On a busy cluster with many services and open streams, when an event is published, the majority of scheduled goroutines have no real work to do, and will immediately attempt to receive/block on the next element's channel.
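A minimal sketch of the buffer's close-to-notify pattern, using hypothetical names (item, buffer, publish, subscribe) rather than Consul's actual types, and assuming a single publisher goroutine:

```go
package main

import (
	"fmt"
	"sync"
)

// item is one element of the single-writer, multiple-reader linked list.
// ready is closed by the publisher once next has been set, waking every
// subscriber blocked on it (the same close-to-broadcast idiom as context.Done).
type item struct {
	event string
	next  *item
	ready chan struct{}
}

// buffer tracks the tail element that the single publisher appends to.
type buffer struct {
	tail *item
}

func newBuffer() *buffer {
	// The buffer starts with an empty sentinel element for subscribers to wait on.
	return &buffer{tail: &item{ready: make(chan struct{})}}
}

// publish links a new element after the current tail, then closes the old
// tail's channel so that every blocked subscriber wakes at once.
func (b *buffer) publish(event string) {
	next := &item{event: event, ready: make(chan struct{})}
	old := b.tail
	old.next = next
	b.tail = next
	close(old.ready) // broadcast: wakes all subscribers parked on old.ready
}

// subscribe blocks on each element's channel in turn, walking the list as
// the publisher extends it.
func subscribe(head *item, n int, wg *sync.WaitGroup) {
	defer wg.Done()
	cur := head
	for i := 0; i < n; i++ {
		<-cur.ready // park until the publisher appends the next element
		cur = cur.next
		fmt.Println("got event:", cur.event)
	}
}

func main() {
	b := newBuffer()
	head := b.tail

	var wg sync.WaitGroup
	wg.Add(2)
	go subscribe(head, 2, &wg)
	go subscribe(head, 2, &wg)

	b.publish("svc-a updated")
	b.publish("svc-b updated")
	wg.Wait()
}
```

The important property is that a single close wakes every subscriber parked on the channel, whether or not the new event is relevant to it, which is exactly the behaviour that becomes expensive at scale.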
As well as the general overhead of scheduling these goroutines and spending a few CPU cycles on checking for relevant events, there's a much larger cost incurred by them all attempting to become receivers on the new channel at the same time.
Internally, channels maintain a lock that is held at various points, including when adding a blocked receiver; therefore, the many subscriber goroutines will compete to acquire the lock, resulting in contention.
There is some additional nuance here too, though. It turns out that reducing the contention of this particular lock - e.g. by sharding the same large number of subscribers across many channels so they're not all competing for the same lock - doesn't actually solve the problem. 🤔
On Linux, the Go runtime's internal locks are backed by futexes. The kernel stores futexes in a hashmap in which each bucket is guarded by a spinlock. The contention we're observing doesn't go away when we shard the channel because it isn't contention on the individual channel's lock, but rather the kernel's own internal (semi-global) lock! This explains why we have observed such a large amount of CPU time spent in native_queued_spin_lock_slowpath, and why the graph below shows such a high proportion of CPU time spent in kernel-space.

The downstream impact of this contention is that event publishing gets slower, and with it so does applying changes to the Raft state machine, which has a massive effect on the general performance of a server. As an example, sustained FSM Apply latency of just 20ms, when Raft's internal queues/buffers are full, can result in several minutes of RPC latency.
By splitting the topicBuffer by key (e.g. service name) we can limit the number of goroutines that are woken by an event to just those with "real" work to do — which likely also adds an element of jitter as the goroutine will need to wait on sending gRPC stream frames etc.

Naturally, this doesn't improve the pathological case where all subscribers are watching a single service. But in testing, I've found that this case doesn't exhibit the same issue, because the subscribers always have "real" work to do, and so won't immediately contend to re-acquire the channel lock.
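As a rough illustration of the shape of the change (a hedged sketch with invented subject/notifier names, not Consul's actual implementation), keying the close-to-broadcast channel by topic and key means a publish only touches the watchers of that key:

```go
package main

import (
	"fmt"
	"sync"
)

// subject identifies what a subscriber is watching. In Consul this would be
// a topic plus a key such as a service name; the types and names here are
// invented for illustration only.
type subject struct {
	Topic string
	Key   string
}

// notifier keeps one close-to-broadcast channel per subject instead of one
// per topic.
type notifier struct {
	mu    sync.Mutex
	chans map[subject]chan struct{}
}

func newNotifier() *notifier {
	return &notifier{chans: make(map[subject]chan struct{})}
}

// watch returns the channel a subscriber should block on for its subject.
func (n *notifier) watch(s subject) <-chan struct{} {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch, ok := n.chans[s]
	if !ok {
		ch = make(chan struct{})
		n.chans[s] = ch
	}
	return ch
}

// publish closes only the channel for the event's subject, so just the
// goroutines watching that key are woken; everyone else stays parked.
func (n *notifier) publish(s subject) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if ch, ok := n.chans[s]; ok {
		close(ch)
		n.chans[s] = make(chan struct{}) // fresh channel for the next event
	}
}

func main() {
	n := newNotifier()

	web := subject{Topic: "health", Key: "web"}
	api := subject{Topic: "health", Key: "api"}

	webCh := n.watch(web)
	apiCh := n.watch(api)

	n.publish(web)

	select {
	case <-webCh:
		fmt.Println("web watcher woken")
	default:
	}
	select {
	case <-apiCh:
		fmt.Println("api watcher woken")
	default:
		fmt.Println("api watcher still parked") // api key untouched by the publish
	}
}
```

Watchers of other keys never see their channel close, so they stay parked instead of waking, scanning an irrelevant event, and immediately re-blocking on the next channel.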
Huge kudos to @banks, @mkeeler, @dnephin et al. for all of the incredible research work on this issue 👏🏻 ⭐
Benchmark
To test the efficacy of this change, I reused the test harness from when we last compared the performance of the streaming backend to that of blocking queries.
I deployed a single server node on an m6i.32xlarge EC2 instance (128 CPU cores). I then ran the benchmark tool on 50 worker nodes, each creating 14 gRPC connections to the server and consuming 100 streams per connection (for a total of 70k concurrent subscriptions). The tool then began generating write load by updating random tags on the watched services at a rate of 50Hz.

Results
The results below show a significant reduction in "system" CPU time, lower FSM Apply time, and, crucially, less time spent waiting in Raft's internal queues/buffers (which is what leads to pathologically bad RPC latency).