Remove intent queue warnings and dynamically adjust max queue size #3705

slackpad · 2017-11-21T18:58:10Z

This will be fixed in Serf, but I'm opening this issue here for tracking and to make sure the updated Serf gets pulled into Consul's vendored dependencies.

Serf uses a hard-coded threshold of 128 for its intent queue depth warning. The check runs in a goroutine at 1 Hz, so the warning can be incredibly spammy. In large clusters, adding or removing a large number of nodes simultaneously can result in huge runs of these warnings, even though things are fine:

2015/06/26 13:28:39 [WARN] serf: Intent queue depth: 4256

Work was done under #1062 to make this better, but the raw broadcast queue is the same, so it's still possible to get lots of warnings, and because they are related to broadcasts they are likely to trigger across a whole cluster, overwhelming logs.

Here are the proposed fixes:

1. Remove the queue depth warning entirely and simply emit the gauge metric from the goroutine, which will allow operators to observe the queue behavior and warn if needed based on the size of the cluster.
2. Make the max queue depth (used for dropping messages) proportional to the size of the cluster. It should have a min of 4096 to support early operations and then expand past that as 2*cluster size. This allows large bursts of messages for any size cluster but still limits the RAM used by Serf.
3. We can also relax the check interval for the goroutine to something like 30 seconds to let the queue go over the max briefly if needed.
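
To make the sizing rule in the second item concrete, here is a minimal sketch in Go. The names `minQueueDepth` and `maxQueueDepth` are hypothetical and are not taken from Serf's code; the sketch only encodes the "min of 4096, then 2*cluster size" rule as described above.

```go
package main

import "fmt"

// minQueueDepth is the floor proposed in the issue (hypothetical name).
const minQueueDepth = 4096

// maxQueueDepth returns the proposed depth limit for a cluster with
// numNodes members: twice the cluster size, but never below the floor.
// This is an illustration of the proposal, not Serf's implementation.
func maxQueueDepth(numNodes int) int {
	depth := 2 * numNodes
	if depth < minQueueDepth {
		return minQueueDepth
	}
	return depth
}

func main() {
	// Small clusters stay at the floor; large clusters scale with size.
	for _, n := range []int{3, 2048, 5000} {
		fmt.Printf("nodes=%d max depth=%d\n", n, maxQueueDepth(n))
	}
}
```

With this rule the floor applies up to 2048 nodes, and beyond that the limit grows linearly with the cluster.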


slackpad · 2017-12-07T00:08:34Z

Keep a time stamp for each entry similar to what we did for the node-name based data structure and purge the oldest entries when we are dropping messages. These will be much more likely to be stale and we should do everything we can to avoid spurious rebroadcasts.

^ Removed from the third item above since the queued items are in order already and we prune the ones in the front.
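
The reasoning above, that an in-order queue can be trimmed from the front without per-entry time stamps, can be sketched as follows. The `pruneFront` helper and the use of a plain string slice are assumptions made for illustration; Serf's real queue type is different.

```go
package main

import "fmt"

// pruneFront drops the oldest entries (those at the front of the slice)
// until at most maxDepth remain. It assumes the queue is ordered oldest
// first, as the comment above describes. Hypothetical helper, not Serf's API.
func pruneFront(queue []string, maxDepth int) []string {
	if maxDepth < 0 {
		maxDepth = 0
	}
	if over := len(queue) - maxDepth; over > 0 {
		return queue[over:]
	}
	return queue
}

func main() {
	q := []string{"oldest", "older", "newer", "newest"}
	fmt.Println(pruneFront(q, 2))
}
```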

slackpad · 2017-12-07T00:12:09Z

Also realized that we scan the whole queue when we do TransmitLimitedQueue.QueueBroadcast which might be a little spendy if we let that get too big. I'm thinking we could limit that search to the last N (~128) which is probably a reasonable heuristic.
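
A rough sketch of the bounded scan idea, under stated assumptions: the `message` type, the `enqueue` function, and matching on node name are all hypothetical stand-ins, and this is not how `TransmitLimitedQueue.QueueBroadcast` is actually written. It only shows how the cost of the search can be capped at the newest N entries regardless of queue length.

```go
package main

import "fmt"

// message is a hypothetical stand-in for a queued broadcast.
type message struct {
	node string
	seq  int
}

// scanLimit mirrors the "last N (~128)" heuristic from the comment above.
const scanLimit = 128

// enqueue appends m to the queue. Before appending, it removes any entry
// for the same node found among the newest `limit` entries. Older entries
// are never inspected, so the work per call is bounded by limit rather
// than by the queue length. The input slice is modified in place.
func enqueue(queue []message, m message, limit int) []message {
	if limit < 0 {
		limit = 0
	}
	start := len(queue) - limit
	if start < 0 {
		start = 0
	}
	// Compact the scanned window in place, skipping superseded entries.
	w := start
	for r := start; r < len(queue); r++ {
		if queue[r].node != m.node {
			queue[w] = queue[r]
			w++
		}
	}
	return append(queue[:w], m)
}

func main() {
	q := []message{{"a", 1}, {"b", 2}, {"a", 3}}
	// With a window of 2, only the two newest entries are checked.
	q = enqueue(q, message{"a", 4}, 2)
	fmt.Println(q)
}
```

The trade-off is that a stale entry outside the window survives until it is transmitted or pruned, which is the cost of the heuristic.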

slackpad added the type/enhancement and theme/operator-usability labels Nov 21, 2017

slackpad added this to the 1.0.2 milestone Nov 21, 2017

slackpad changed the title ~~Clean up intent queue warnings and dynamically adjust size~~ Clean up intent queue warnings and dynamically adjust max queue size Nov 21, 2017

slackpad changed the title ~~Clean up intent queue warnings and dynamically adjust max queue size~~ Remove intent queue warnings and dynamically adjust max queue size Nov 21, 2017

This was referenced Dec 7, 2017

Adds additional controls for intent queue management. hashicorp/serf#492

Merged

Turns of intent queue warnings and enables dynamic queue sizing for Serf. #3731

Merged

slackpad closed this as completed in #3731 Dec 8, 2017
