Remove intent queue warnings and dynamically adjust max queue size #3705
Labels
theme/operator-usability
Replaces UX. Anything related to making things easier for the practitioner
type/enhancement
Proposed improvement or new feature
This will be fixed in Serf, but I'm opening it here for tracking and to ensure we pull the updated Serf into Consul's vendored dependencies.
Serf uses a hard-coded warning threshold of 128 for its intent queue warning, which is tied to a goroutine that runs at 1 Hz, meaning that the warning condition can be incredibly spammy. In large clusters, adding or removing a large number of nodes simultaneously can result in huge runs of these warnings, even though things are fine:
2015/06/26 13:28:39 [WARN] serf: Intent queue depth: 4256
Work was done under #1062 to make this better, but the raw broadcast queue is the same, so it's still possible to get lots of warnings, and because they are related to broadcasts they are likely to trigger across a whole cluster, overwhelming logs.
Here are the proposed fixes:
Remove the queue depth warning entirely and simply emit the gauge metric from the goroutine, which will allow operators to observe the queue behavior and warn if needed based on the size of the cluster.
Make the max queue depth (used for dropping messages) proportional to the size of the cluster. It should have a min of 4096 to support early operations and then expand past that as 2 * cluster size. This allows large bursts of messages for any size cluster but still limits the RAM used by Serf. We can also relax the check interval for the goroutine to something like 30 seconds to let the queue go over the max briefly if needed.
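The sizing rule in the second proposal can be sketched in Go as follows. This is only an illustration of the arithmetic described above, not Serf's actual code: the names minQueueDepth, maxQueueDepth, and excess are hypothetical, and the real change may be structured differently.

```go
package main

import "fmt"

// minQueueDepth is the floor from the proposal (hypothetical name).
const minQueueDepth = 4096

// maxQueueDepth returns the proposed depth limit for a cluster of the
// given size: 2 * cluster size, but never less than minQueueDepth.
func maxQueueDepth(clusterSize int) int {
	if d := 2 * clusterSize; d > minQueueDepth {
		return d
	}
	return minQueueDepth
}

// excess reports how many queued messages are over the limit and would
// be candidates for dropping at the next periodic check.
func excess(queueDepth, clusterSize int) int {
	if over := queueDepth - maxQueueDepth(clusterSize); over > 0 {
		return over
	}
	return 0
}

func main() {
	fmt.Println(maxQueueDepth(100))  // small cluster: the floor applies
	fmt.Println(maxQueueDepth(5000)) // large cluster: 2 * cluster size
	fmt.Println(excess(4256, 5000))  // depth from the log line above, under the limit
}
```

Under this sketch, the depth of 4256 from the log line above would be well within the limit for a 5000-node cluster (limit 10000), so nothing would be dropped and, with the warning removed, nothing would be logged.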