Proposal: parallelise processing of messages received by memberlist #564
Comments
I've opened a draft PR @ dskit#110 to enhance message processing in the memberlist KV store by enabling parallelization. This solution makes use of a queue (a buffered channel) in which messages are enqueued for later processing by a pool of workers. Additionally, we now have to copy the message passed to `NotifyMsg()` before enqueuing it. PTAL
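A minimal sketch of what that enqueue path could look like (not the actual PR code; field names and the queue size are made up). The copy matters because, as far as I know, memberlist only guarantees the `msg` slice for the duration of the `NotifyMsg` call:

```go
package kv

// Illustrative only: a KV store type with a buffered channel feeding a worker pool.
type KV struct {
	receivedMessages chan []byte // hypothetical field, e.g. make(chan []byte, 1024)
}

// NotifyMsg is called by memberlist for every received user message. The msg slice
// may be reused after the call returns, so copy it before handing it to the workers.
func (m *KV) NotifyMsg(msg []byte) {
	msgCopy := make([]byte, len(msg))
	copy(msgCopy, msg)

	select {
	case m.receivedMessages <- msgCopy: // picked up later by a worker
	default:
		// Queue full: drop the message rather than block the memberlist receive loop.
	}
}
```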
There was discussion about this proposal on internal Slack (https://raintank-corp.slack.com/archives/C02L86H64TW/p1638435972297100), and the consensus was to go with one goroutine per KV key (and a separate buffered channel for each key). Updating a single key already requires a lock, so having multiple concurrent goroutines working on the same key would provide little benefit.
Thanks @pstibrany for pointing out this discussion to get all the necessary context. I agree that parallelizing operations for the same key shouldn't make any difference, given that updating a single key already requires a lock.
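For illustration, a rough sketch of the per-key design discussed above (names and the buffer size are placeholders, not the dskit implementation): one buffered channel and one goroutine per KV key, so updates to the same key stay serialized while different keys are processed in parallel.

```go
package kv

import "sync"

// perKeyWorkers is a hypothetical dispatcher with one queue and one goroutine per key.
type perKeyWorkers struct {
	mtx    sync.Mutex
	queues map[string]chan []byte
}

func newPerKeyWorkers() *perKeyWorkers {
	return &perKeyWorkers{queues: map[string]chan []byte{}}
}

func (w *perKeyWorkers) dispatch(key string, msg []byte, process func(key string, msg []byte)) {
	w.mtx.Lock()
	ch, ok := w.queues[key]
	if !ok {
		ch = make(chan []byte, 64) // per-key buffer size is a made-up value
		w.queues[key] = ch
		// One goroutine per key: same-key updates remain ordered with respect to
		// each other, while different keys no longer contend for a single worker.
		go func() {
			for m := range ch {
				process(key, m)
			}
		}()
	}
	w.mtx.Unlock()

	select {
	case ch <- msg:
	default:
		// Per-key queue full: drop the message, since delivery isn't guaranteed anyway.
	}
}
```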
Hi all, any progress on this issue? We are running a 9-node cluster, so far from the 1000+ nodes you mention above. Still, we get this when we run queries that ask for several months of metrics. The VMs on which the pods are running are quite large (…).
It's unexpected for such a small cluster. Could you share a snippet of such logs? What's the frequency at which they occur?
No it's not, but see below...
If CPU cores are exhausted (e.g. the machine is running CPU cores at 100%) then all threads are affected, including the ring changes propagation logic.
Problem
We're running a large Cortex cluster (1000+ nodes) with a memberlist-based ring, and we observe many logs like the following:
The processing of received (user) messages is serialized and handled by `KV.NotifyMsg()`. The next received message will only be picked up from the queue once the current `NotifyMsg()` call returns.

Because of this, the approach so far has been to optimize `NotifyMsg()` as much as possible. This is definitely a good approach, but even after all the optimizations we've done so far we still can't catch up with processing all received messages. We can (and should) continue to optimise it, but the serialized processing will always impose a scalability limit, because we can't use more than 1 CPU core to process messages.
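For context, this is roughly the shape of the hook involved (quoted from memory from github.com/hashicorp/memberlist): user messages are delivered one call at a time through `NotifyMsg`, and the method is expected not to block, so all per-message work done inside it is effectively serialized.

```go
// Delegate is the memberlist interface implemented by the KV client (shown from
// memory, for context only).
type Delegate interface {
	NodeMeta(limit int) []byte
	NotifyMsg([]byte)
	GetBroadcasts(overhead, limit int) [][]byte
	LocalState(join bool) []byte
	MergeRemoteState(buf []byte, join bool)
}
```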
Proposal

To my understanding, we don't need to guarantee ordering when processing received messages in `NotifyMsg()`, because messages can be received in any order due to how propagation in memberlist works (even right now, with the large volume of dropped messages, we're not effectively processing messages in order, because many of them are dropped).

I propose to introduce parallelization in `KV.NotifyMsg()`: the idea is to put received messages in a queue (yes, another limited queue) and have a pool of workers processing them. With careful use of locks (which should be held for the shortest time possible), we should be able to significantly increase the throughput of received-message processing.
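A rough sketch of what the queue plus worker pool could look like (sizes, names, and the process callback are placeholders, not a concrete implementation proposal):

```go
package kv

const (
	numWorkers = 4    // placeholder value
	queueSize  = 1024 // placeholder value
)

// messageProcessor drains a single shared queue of received messages with a pool of workers.
type messageProcessor struct {
	queue chan []byte
}

func newMessageProcessor(process func(msg []byte)) *messageProcessor {
	p := &messageProcessor{queue: make(chan []byte, queueSize)}
	for i := 0; i < numWorkers; i++ {
		go func() {
			for msg := range p.queue {
				// Decode the update and merge it into the local KV state, holding
				// any lock for as short a time as possible.
				process(msg)
			}
		}()
	}
	return p
}

// enqueue drops the message instead of blocking when the queue is full, so that
// NotifyMsg itself never blocks.
func (p *messageProcessor) enqueue(msg []byte) bool {
	select {
	case p.queue <- msg:
		return true
	default:
		return false
	}
}
```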