
Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters #167

Merged
merged 5 commits into master from faster-bcast-queue on Jan 3, 2019

Conversation

rboyer
Member

@rboyer rboyer commented Oct 25, 2018

Avoid the need to traverse the entirety of the queue as much by switching to an alternative data structure. Broadcasts now go into a btree sorted by:

  1. transmits ascending
  2. message length descending
  3. arrival order descending

This lets GetBroadcasts more efficiently fill a gossip packet without O(N) traversal.
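
A minimal sketch of that ordering, assuming the github.com/google/btree package and an illustrative wrapper struct (field names here are not necessarily the ones in the PR):

package main

import "github.com/google/btree"

// limitedBroadcast is an illustrative wrapper around a queued Broadcast.
type limitedBroadcast struct {
	transmits int   // how many times this message has been gossiped so far
	msgLen    int64 // length of the serialized message
	id        int64 // monotonically increasing arrival counter
}

// Less orders items by transmits ascending, then message length descending,
// then arrival order descending, matching the sort described above.
func (b *limitedBroadcast) Less(than btree.Item) bool {
	o := than.(*limitedBroadcast)
	if b.transmits != o.transmits {
		return b.transmits < o.transmits
	}
	if b.msgLen != o.msgLen {
		return b.msgLen > o.msgLen
	}
	return b.id > o.id
}

func main() {
	q := btree.New(32) // degree picked arbitrarily for the sketch
	q.ReplaceOrInsert(&limitedBroadcast{transmits: 0, msgLen: 120, id: 1})
	q.ReplaceOrInsert(&limitedBroadcast{transmits: 0, msgLen: 40, id: 2})
	// q.Min() is now the least-transmitted, largest item (newest on ties).
}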

For the two most common Broadcast types coming from consul+serf (unique-by-node-name & globally-unique) avoid the need to traverse the entire joint queue on insert. For unique-by-node-name broadcasts use a map to make the existence check cheaper. For globally-unique broadcasts skip the existence check completely.
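
For reference, the optional interfaces this relies on are roughly shaped like the following; a sketch based on the description above and the serf diff later in this thread, not a verbatim copy of the merged code:

// Broadcast is memberlist's existing queued-message interface.

// NamedBroadcast is the optional extension for unique-by-name messages.
type NamedBroadcast interface {
	Broadcast
	// Name is the unique identity (e.g. node name) of this broadcast.
	// Two broadcasts with the same name invalidate each other.
	Name() string
}

// UniqueBroadcast is the optional marker for messages that never invalidate
// each other, letting the queue skip the invalidation scan entirely.
type UniqueBroadcast interface {
	Broadcast
	UniqueBroadcast()
}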

After these changes I can get a test cluster running on 80 c4.8xlarge VMs at a density of 260 memberlist+serf instances per VM for a total cluster size of 20800 members before reaching RAM limitations instead of CPU limits.

For an even before/after comparison, the numbers below were gathered on a cluster running on 60 c4.8xlarge VMs at a density of 150 memberlist+serf instances per VM, for a total cluster size of 9000 members, which is roughly where the old code starts to hit CPU limits.

Before this PR

  • The cluster took 3m16s to be fully joined.
  • The cluster took 7m19s to be quiet (having its gossip broadcast queues empty).
  • During initial cluster up, each node was using 98.9% of CPU all the way until quiet.
  • Restarting all 150 instances on one VM:
    • Joining the cluster used 98% of CPU on that VM (not much on other VMs).
      • That node took 25s to be joined.
    • While that VM was not yet quiet it was using about 25% CPU.
    • That VM took 5m55s to become quiet again.

Sample CPU profile during cluster up:

$ go tool pprof ./tmp/hack cpu.pprof
File: hack
Type: cpu
Time: Oct 29, 2018 at 4:38pm (CDT)
Duration: 30.39s, Total samples = 6.77s (22.28%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5530ms, 81.68% of 6770ms total
Dropped 163 nodes (cum <= 33.85ms)
Showing top 10 nodes out of 120
      flat  flat%   sum%        cum   cum%
    1990ms 29.39% 29.39%     1990ms 29.39%  github.com/hashicorp/memberlist.limitedBroadcasts.Less
     930ms 13.74% 43.13%      930ms 13.74%  github.com/hashicorp/memberlist.(*memberlistBroadcast).Message
     920ms 13.59% 56.72%     5170ms 76.37%  github.com/hashicorp/memberlist.(*TransmitLimitedQueue).GetBroadcasts
     460ms  6.79% 63.52%      460ms  6.79%  github.com/hashicorp/serf/serf.(*broadcast).Message
     300ms  4.43% 67.95%     2290ms 33.83%  github.com/hashicorp/memberlist.(*limitedBroadcasts).Less
     220ms  3.25% 71.20%      450ms  6.65%  runtime.scanobject
     210ms  3.10% 74.30%     2500ms 36.93%  sort.reverse.Less
     180ms  2.66% 76.96%     2860ms 42.25%  sort.doPivot
     170ms  2.51% 79.47%     2670ms 39.44%  sort.(*reverse).Less
     150ms  2.22% 81.68%      170ms  2.51%  runtime.findObject

After this PR

  • The cluster took 2m40s to be fully joined.
  • The cluster took 5m50s to be quiet (having its gossip broadcast queues empty).
  • During initial cluster up, each node was using 17% of CPU all the way until quiet.
  • Restarting all 150 instances on one VM:
    • Joining the cluster used 98% of CPU on that VM (not much on other VMs); it seemed to be almost entirely push/pull activity.
      • That node took 10s to be joined.
    • While that VM was not yet quiet it was using about 8% CPU.
    • That VM took 5m55s to become quiet again.

Sample CPU profile during cluster up:

$ go tool pprof ./tmp/hack cpu.pprof 
File: hack
Type: cpu
Time: Oct 29, 2018 at 4:55pm (CDT)
Duration: 30.16s, Total samples = 680ms ( 2.25%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 370ms, 54.41% of 680ms total
Showing top 10 nodes out of 170
      flat  flat%   sum%        cum   cum%
     130ms 19.12% 19.12%      260ms 38.24%  runtime.scanobject
      40ms  5.88% 25.00%       50ms  7.35%  runtime.findObject
      40ms  5.88% 30.88%      160ms 23.53%  runtime.mallocgc
      30ms  4.41% 35.29%       30ms  4.41%  runtime.markBits.isMarked (inline)
      30ms  4.41% 39.71%       30ms  4.41%  runtime.nextFreeFast (inline)
      20ms  2.94% 42.65%       20ms  2.94%  bytes.(*Reader).Read
      20ms  2.94% 45.59%       20ms  2.94%  compress/lzw.(*decoder).decode
      20ms  2.94% 48.53%      180ms 26.47%  github.com/hashicorp/go-msgpack/codec.(*decFnInfo).kMap
      20ms  2.94% 51.47%       20ms  2.94%  github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext
      20ms  2.94% 54.41%       20ms  2.94%  runtime.heapBits.next (inline)

@@ -29,6 +29,11 @@ func (b *memberlistBroadcast) Invalidates(other Broadcast) bool {
return b.node == mb.node
}

// memberlist.NamedBroadcast optional interface
Member Author

@rboyer rboyer Oct 25, 2018


There's a very small similar change that should be made to serf to have it skip the O(N) traversal codepath:

diff --git a/serf/broadcast.go b/serf/broadcast.go
index d20728f..751cf18 100644
--- a/serf/broadcast.go
+++ b/serf/broadcast.go
@@ -16,6 +16,9 @@ func (b *broadcast) Invalidates(other memberlist.Broadcast) bool {
        return false
 }
 
+// implements memberlist.UniqueBroadcast
+func (b *broadcast) UniqueBroadcast() {}
+
 func (b *broadcast) Message() []byte {
        return b.msg
 }
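
The memberlist-side counterpart implied by the hunk above is presumably just the node name exposed through the optional interface; a sketch, with the method body assumed rather than copied from the PR:

// memberlist.NamedBroadcast optional interface
func (b *memberlistBroadcast) Name() string {
	return b.node
}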

@rboyer rboyer force-pushed the faster-bcast-queue branch 5 times, most recently from 2a5777b to c224c2e on October 29, 2018 20:21
@rboyer rboyer changed the title from "WIP: Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters" to "Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters" on Oct 30, 2018
@rboyer rboyer requested review from armon and a team October 30, 2018 17:01
queue.go Outdated
q.mu.Lock()
defer q.mu.Unlock()

if q.tq == nil {
Member

Should we just make an initializer for the queue and move these checks out?

Member Author

There's currently only one place where an insertion into the queue happens, and it's here. So I could move these two initialization elements out to something like a lazyInit method, but without introducing a constructor-like thing for the TransmitLimitedQueue (which doesn't currently exist) we'd still need the guard checks in all of the other methods for the nil-ness of the btree.
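
A sketch of what such a lazyInit split could look like, using the field names from the snippets in this review (the btree degree is an arbitrary choice here):

func (q *TransmitLimitedQueue) lazyInit() {
	if q.tq == nil {
		q.tq = btree.New(32)
	}
	if q.tm == nil {
		q.tm = make(map[string]*limitedBroadcast)
	}
}

// Every method that touches q.tq or q.tm would call q.lazyInit() after
// taking q.mu, since TransmitLimitedQueue has no constructor to hook into.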


// Check if this message invalidates another.
if lb.name != "" {
if old, ok := q.tm[lb.name]; ok {
Member

If we move this logic up to where we check for the NamedBroadcast interface, I think we can avoid even saving the name value on the struct, and just use the map key.

Member Author
@rboyer rboyer Nov 14, 2018

Prune, queueBroadcast, and GetBroadcasts all need to call deleteItem to remove items from the overall data structure, and none of them is already traversing the map to do so. If we don't store the name on the limitedBroadcast wrapper struct, then each of those will have to incur a type assertion to extract it. If you think it's worth considering, I'll have to re-characterize the performance hit that might cause.
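
For illustration, the alternative being weighed here would look roughly like this at each deletion site (lb.b standing in for the wrapped Broadcast field; the names are assumptions, not code from the PR):

// Recover the name with a type assertion instead of reading a stored lb.name.
if nb, ok := lb.b.(NamedBroadcast); ok {
	delete(q.tm, nb.Name())
}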

}
} else if !lb.unique {
// Slow path, hopefully nothing hot hits this.
Member

Same with this logic, I think we can just move it up and drop tracking of the unique field.

Member Author

Yep.

// Visit fresher items first, but only look at stuff that will fit.
// We'll go tier by tier, grabbing the largest items first.
minTr, maxTr := q.getTransmitRange()
for transmits := minTr; transmits <= maxTr; /*do not advance automatically*/ {
Member

Do we need the complexity of the tier first visiting? Can we not just ascend from the Min to the Max in a single walk rather than nested? It seems like it would be effectively the same, by visiting the tiers in order.

Member Author
@rboyer rboyer Nov 14, 2018

This is the trickiest part, for good reason. The set of data structure features that this TransmitLimitedQueue requires for normal, efficient operation is varied enough that it was tricky to improve the performance for large queue depths without affecting the other usages:

  1. Evict stale messages when superseding messages come in. The map + the two optional interfaces help here.
  2. Prune the oldest messages, where oldest mostly means by number of transmits. The btree being bidirectionally traversable and primarily sorted by number of transmits helps.
  3. Fill a finite knapsack with the newest, correctly sized messages without traversing everything. This is the part that requires the tiered traversal to avoid many degenerate performance scenarios.

Any naive traversal of the btree bounded just by the primary index (num transmits) is going to be adversely affected by long runs of incorrectly sized messages, especially as the knapsack gets closer to being full. This can quickly degrade to O(N), which is the thing we're desperately trying to avoid. An earlier iteration of this was simpler and partly fixed the perf issue, but upon a deeper trace dump of iteration depths I discovered that queue shapes like [big, big, big, big, big, big, ....., small] do happen frequently and cause the original bad O(N) traversal behavior to fill the knapsack.

Doing the tiered traversal lets you only ever visit elements that are definitely going into the knapsack (skipping huge swaths of each tier in the process), and thus avoids any incidental visits in all situations.
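
A sketch of that tiered walk, building on the loop shown in the snippet above and the limitedBroadcast ordering from the PR description (bounds, field names, and the surrounding bookkeeping are illustrative, not the exact merged code):

for transmits := minTr; transmits <= maxTr; /* do not advance automatically */ {
	free := int64(limit - bytesUsed - overhead)
	if free <= 0 {
		break
	}

	// Within this tier, seek straight to the largest message that still fits:
	// msgLen sorts descending, so a pivot at msgLen=free skips everything too
	// big to pack into the remaining space.
	greaterOrEqual := &limitedBroadcast{transmits: transmits, msgLen: free, id: math.MaxInt64}
	// The start of the next tier bounds the walk on the other side.
	lessThan := &limitedBroadcast{transmits: transmits + 1, msgLen: math.MaxInt64, id: math.MaxInt64}

	var pick *limitedBroadcast
	q.tq.AscendRange(greaterOrEqual, lessThan, func(item btree.Item) bool {
		pick = item.(*limitedBroadcast)
		return false // take the first (largest fitting) item and stop
	})
	if pick == nil {
		transmits++ // nothing in this tier fits; move to the next tier
		continue
	}

	// Append pick's message to the packet, update bytesUsed, and requeue or
	// drop pick based on its transmit count (elided); stay on this tier to
	// look for more items that still fit in the shrinking free space.
}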

@armon
Member

armon commented Oct 30, 2018

Minor comments, but otherwise LGTM! Seems like a pretty serious improvement!

…large clusters.

Avoid the need to traverse the entirety of the queue as much by
switching to an alternative data structure. Broadcasts now go into a
btree sorted by:

 1. transmits ascending
 2. message length descending
 3. arrival order descending

This lets GetBroadcasts more efficiently fill a gossip packet without
O(N) traversal.

For the two most common Broadcast types coming from consul+serf
(unique-by-node-name & globally-unique) avoid the need to traverse the
entire joint queue on insert. For unique-by-node-name broadcasts use a
map to make the existence check cheaper. For globally-unique broadcasts
skip the existence check completely.

Also unembed the sync.Mutex to hide the Lock/Unlock methods.
@rboyer rboyer force-pushed the faster-bcast-queue branch from aeae9c0 to f416c6d on December 19, 2018 15:37
@vtolstov

@armon maybe it's time to merge this? What do you think?

@rboyer rboyer force-pushed the faster-bcast-queue branch from f859afc to 2690c35 on January 3, 2019 22:20
@rboyer rboyer merged commit 1a62499 into master Jan 3, 2019
@rboyer rboyer deleted the faster-bcast-queue branch January 3, 2019 22:23
rboyer added a commit to hashicorp/consul that referenced this pull request Jan 4, 2019
This activates large-cluster improvements in the gossip layer from
hashicorp/memberlist#167
rboyer added a commit to hashicorp/consul that referenced this pull request Jan 7, 2019
This activates large-cluster improvements in the gossip layer from
hashicorp/memberlist#167
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request Aug 26, 2019
full diff: hashicorp/memberlist@3d8438d...v0.1.4

- hashicorp/memberlist#158 Limit concurrent push/pull connections
- hashicorp/memberlist#159 Prioritize alive message over other messages
- hashicorp/memberlist#168 Add go.mod
- hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters
- hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop
- hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn
- hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request Aug 26, 2019
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request Feb 26, 2020
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request May 11, 2021