Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters #167
Conversation
@@ -29,6 +29,11 @@ func (b *memberlistBroadcast) Invalidates(other Broadcast) bool {
	return b.node == mb.node
}

// memberlist.NamedBroadcast optional interface
There's a very small similar change that should be made to serf to have it skip the O(N) traversal codepath:
diff --git a/serf/broadcast.go b/serf/broadcast.go
index d20728f..751cf18 100644
--- a/serf/broadcast.go
+++ b/serf/broadcast.go
@@ -16,6 +16,9 @@ func (b *broadcast) Invalidates(other memberlist.Broadcast) bool {
 	return false
 }

+// implements memberlist.UniqueBroadcast
+func (b *broadcast) UniqueBroadcast() {}
+
 func (b *broadcast) Message() []byte {
 	return b.msg
 }
Force-pushed from 2a5777b to c224c2e
queue.go (outdated)
	q.mu.Lock()
	defer q.mu.Unlock()

	if q.tq == nil {
Should we just make an initializer for the queue and move these checks out?
There's only one place currently where an insertion into the queue happens, and it's here. So I could move these two initialization elements out to something like a lazyInit method, but without introducing a constructor-y thing for the TransmitLimitedQueue (which doesn't currently exist) we'd still need the guard checks for the nil-ness of the btree in all of the other methods.
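A rough sketch of what such a lazyInit could look like, assuming the tq/tm field names seen in the snippets in this review and the github.com/google/btree package; the btree degree and the stub types are placeholders for illustration, not the PR's final code:

package sketch

import (
	"sync"

	"github.com/google/btree"
)

// limitedBroadcast is only stubbed out here; the real wrapper carries the
// sort keys and the wrapped Broadcast.
type limitedBroadcast struct{}

type TransmitLimitedQueue struct {
	mu sync.Mutex
	tq *btree.BTree                 // broadcasts sorted for the knapsack walk
	tm map[string]*limitedBroadcast // index of named broadcasts
}

// lazyInit allocates the internal structures on first use so the zero value
// of TransmitLimitedQueue keeps working without adding a constructor.
func (q *TransmitLimitedQueue) lazyInit() {
	if q.tq == nil {
		q.tq = btree.New(32) // degree chosen arbitrarily for this sketch
	}
	if q.tm == nil {
		q.tm = make(map[string]*limitedBroadcast)
	}
}

With this shape, every mutating entry point would call q.lazyInit() right after taking the lock, and the read paths would still guard against a nil btree as discussed above.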
	// Check if this message invalidates another.
	if lb.name != "" {
		if old, ok := q.tm[lb.name]; ok {
If we move this logic up to where we check for the NamedBroadcast interface, I think we can avoid even saving the name value on the struct, and just use the map key.
Prune, queueBroadcast, and GetBroadcasts all need to call deleteItem to remove items from the overall data structure, and none of them are already traversing the map to do so. If we don't store the name on the limitedBroadcast wrapper struct, then each of those would have to incur a type assertion to extract it. If you think that's worth considering, I'll have to re-characterize the performance hit it might cause.
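To make the tradeoff concrete, here is a small sketch of the two options being weighed; the type and field names are stand-ins inferred from this discussion, not the actual code:

package sketch

// Minimal stand-ins so the sketch compiles on its own; the real types live
// in memberlist.
type Broadcast interface{ Message() []byte }

type NamedBroadcast interface {
	Broadcast
	Name() string
}

// Option discussed above: cache the name on the wrapper at insert time so
// Prune, queueBroadcast, and GetBroadcasts can clean up the map directly.
type limitedBroadcast struct {
	b    Broadcast
	name string // "" unless b is a NamedBroadcast
}

func deleteFromNameIndex(tm map[string]*limitedBroadcast, cur *limitedBroadcast) {
	if cur.name != "" {
		delete(tm, cur.name) // no type assertion needed
	}
}

// Alternative: drop the cached field and re-derive the name on every delete,
// paying a type assertion at each call site instead.
func deleteFromNameIndexAsserted(tm map[string]*limitedBroadcast, cur *limitedBroadcast) {
	if nb, ok := cur.b.(NamedBroadcast); ok {
		delete(tm, nb.Name())
	}
}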
		}
	} else if !lb.unique {
		// Slow path, hopefully nothing hot hits this.
Same with this logic, I think we can just move it up and drop tracking of the unique field.
Yep.
	// Visit fresher items first, but only look at stuff that will fit.
	// We'll go tier by tier, grabbing the largest items first.
	minTr, maxTr := q.getTransmitRange()
	for transmits := minTr; transmits <= maxTr; /*do not advance automatically*/ {
Do we need the complexity of the tier-first visiting? Can we not just ascend from the Min to the Max in a single walk rather than a nested one? It seems like it would be effectively the same, by visiting the tiers in order.
This is the trickiest part, and for good reason. The data-structure features that this TransmitLimitedQueue requires for normal, efficient operation are varied enough that it was tricky to improve performance for large queue depths without affecting the other usages:
- Evict stale messages when superseding messages come in. The map plus the two optional interfaces help here.
- Prune the oldest messages, where "oldest" mostly means by number of transmits. The btree being bidirectionally traversable and primarily sorted by number of transmits helps.
- Fill a finite knapsack with the newest, correctly sized messages without traversing everything. This is the part that requires the tiered traversal to avoid many degenerate performance scenarios.
Any naive traversal of the btree bounded just by the primary index (number of transmits) is going to be adversely affected by long runs of incorrectly sized messages, especially as the knapsack gets closer to being full. This can quickly degrade to O(N), which is the thing we're desperately trying to avoid. An earlier iteration of this was simpler and partly fixed the perf issue, but a deeper trace of iteration depths showed that queue shapes like [big, big, big, big, big, big, ..., small] do happen frequently and cause the original bad O(N) traversal behavior when filling the knapsack.
Doing the tiered traversal lets you only ever visit elements that are definitely going into the knapsack (skipping huge swaths of each tier in the process) and thus avoids incidental visits in all situations.
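As an illustration of the tiered walk described above, here is a simplified, self-contained sketch built on github.com/google/btree. The item fields (transmits, msgLen, id), the ordering, and the fill helper are assumptions made for demonstration; the real queue also handles retransmit limits and re-insertion, which are elided here:

package main

import (
	"fmt"
	"math"

	"github.com/google/btree"
)

// limitedBroadcast is a stand-in for the queue's wrapper item. The btree is
// ordered by (transmits asc, msgLen desc, id desc) so that within a "tier"
// of equal transmit counts the largest, newest messages come first.
type limitedBroadcast struct {
	transmits int   // how many times this broadcast has been sent
	msgLen    int64 // cached length of msg
	id        int64 // arrival order; larger means newer
	msg       []byte
}

func (b *limitedBroadcast) Less(than btree.Item) bool {
	o := than.(*limitedBroadcast)
	if b.transmits != o.transmits {
		return b.transmits < o.transmits
	}
	if b.msgLen != o.msgLen {
		return b.msgLen > o.msgLen
	}
	return b.id > o.id
}

// fill walks the queue tier by tier. Within a tier it seeks directly to the
// largest item that still fits in the remaining space, so it never visits an
// item it cannot use.
func fill(tq *btree.BTree, limit int) [][]byte {
	if tq.Len() == 0 {
		return nil
	}
	minTr := tq.Min().(*limitedBroadcast).transmits
	maxTr := tq.Max().(*limitedBroadcast).transmits

	var out [][]byte
	used := 0

	for transmits := minTr; transmits <= maxTr; /* advance manually */ {
		free := int64(limit - used)
		if free <= 0 {
			break // knapsack is full
		}

		// Pivot just before the largest item in this tier that could fit.
		pivot := &limitedBroadcast{transmits: transmits, msgLen: free, id: math.MaxInt64}

		var pick *limitedBroadcast
		tq.AscendGreaterOrEqual(pivot, func(i btree.Item) bool {
			cur := i.(*limitedBroadcast)
			if cur.transmits == transmits {
				pick = cur // largest fitting item in this tier
			}
			return false // stop after the first visited item either way
		})
		if pick == nil {
			transmits++ // nothing left in this tier fits; try the next tier
			continue
		}

		out = append(out, pick.msg)
		used += int(pick.msgLen)
		tq.Delete(pick) // the real queue re-inserts with transmits+1 until a limit
	}
	return out
}

func main() {
	tq := btree.New(32)
	for i, size := range []int{400, 400, 400, 20} {
		msg := make([]byte, size)
		tq.ReplaceOrInsert(&limitedBroadcast{transmits: 0, msgLen: int64(size), id: int64(i), msg: msg})
	}
	// With a 450-byte budget this picks one 400-byte message and the 20-byte
	// one, without scanning the remaining 400-byte entries in between.
	for _, m := range fill(tq, 450) {
		fmt.Println(len(m))
	}
}

The key point is the pivot seek: within a tier the walk starts at the largest message that still fits in the remaining budget, so long runs of oversized entries earlier in the tier are never visited.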
Minor comments, but otherwise LGTM! Seems like a pretty serious improvement!
Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters.

Avoid the need to traverse the entirety of the queue as much by switching to an alternative data structure. Broadcasts now go into a btree sorted by:
1. transmits ascending
2. message length descending
3. arrival order descending

This lets GetBroadcasts more efficiently fill a gossip packet without O(N) traversal.

For the two most common Broadcast types coming from consul+serf (unique-by-node-name & globally-unique) avoid the need to traverse the entire joint queue on insert. For unique-by-node-name broadcasts use a map to make the existence check cheaper. For globally-unique broadcasts skip the existence check completely.

Also unembed the sync.Mutex to hide the Lock/Unlock methods.
Force-pushed from aeae9c0 to f416c6d
@armon maybe it's time to merge this? What do you think?
Force-pushed from f859afc to 2690c35
This activates large-cluster improvements in the gossip layer from hashicorp/memberlist#167
full diff: hashicorp/memberlist@3d8438d...v0.1.4
- hashicorp/memberlist#158 Limit concurrent push/pull connections
- hashicorp/memberlist#159 Prioritize alive message over other messages
- hashicorp/memberlist#168 Add go.mod
- hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters
- hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop
- hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn
- hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Avoid the need to traverse the entirety of the queue as much by switching to an alternative data structure. Broadcasts now go into a btree sorted by:
1. transmits ascending
2. message length descending
3. arrival order descending
This lets GetBroadcasts more efficiently fill a gossip packet without O(N) traversal.
For the two most common Broadcast types coming from consul+serf (unique-by-node-name & globally-unique) avoid the need to traverse the entire joint queue on insert. For unique-by-node-name broadcasts use a map to make the existence check cheaper. For globally-unique broadcasts skip the existence check completely.
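For readers who haven't seen the optional interfaces referenced here, the sketch below shows roughly how the insert path can branch on them. The Broadcast/NamedBroadcast/UniqueBroadcast names come from the discussion above, but the exact method sets and the switch body are illustrative assumptions rather than the PR's code:

package sketch

type Broadcast interface {
	Invalidates(b Broadcast) bool
	Message() []byte
	Finished()
}

// NamedBroadcast lets the queue index a broadcast by a unique name, so
// superseding an older message becomes a map lookup instead of an O(N) scan.
type NamedBroadcast interface {
	Broadcast
	Name() string
}

// UniqueBroadcast marks a broadcast that can never invalidate another, so
// the existence check can be skipped entirely on insert.
type UniqueBroadcast interface {
	Broadcast
	UniqueBroadcast()
}

// queueBroadcastSketch shows the shape of the fast paths on insert.
func queueBroadcastSketch(b Broadcast, byName map[string]Broadcast) {
	switch t := b.(type) {
	case NamedBroadcast:
		if old, ok := byName[t.Name()]; ok {
			old.Finished() // superseded by the newer message
		}
		byName[t.Name()] = b
	case UniqueBroadcast:
		// No invalidation possible: skip the existence check entirely.
	default:
		// Slow path: ask every queued broadcast whether the new one
		// invalidates it (the original O(N) behavior).
	}
	// ...insert b into the sorted queue here...
}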
After these changes I can get a test cluster running on 80 c4.8xlarge VMs at a density of 260 memberlist+serf instances per VM for a total cluster size of 20800 members before reaching RAM limitations instead of CPU limits.

For an even comparison before/after this change, the numerical comparisons were done on a cluster running on 60 c4.8xlarge VMs at a density of 150 memberlist+serf instances per VM for a total cluster size of 9000 members before there are CPU issues on the old code.

Before this PR:
- 3m16s to be fully joined.
- 7m19s to be quiet (having its gossip broadcast queues empty).
- 25s to be joined.
- 5m55s to become quiet again.

Sample CPU profile during cluster up: (profile image not included here)

AFTER this PR:
- 2m40s to be fully joined.
- 5m50s to be quiet (having its gossip broadcast queues empty).
- 10s to be joined.
- 5m55s to become quiet again.

Sample CPU profile during cluster up: (profile image not included here)