Clustering: re-discover and re-join peers on interval in background #4465

rfratto · 2023-07-14T19:39:56Z

Request

Add a new flag, --cluster.rejoin-interval, which specifies how often a node should rediscover peers and rejoin them to address split-brain issues.

--cluster.rejoin-interval should default to some reasonable value, such as 60s.

When set to 0s, rediscovery/rejoining is disabled.

This proposal should be paired with #4464 to avoid overwhelming the network on large clusters, as the state push/pull done on join is more expensive than other gossip traffic.

Use case

Today, for nodes to join a cluster successfully, one of the following must be true:

All nodes must join the same seed node, or
The initial cluster must not be bootstrapped in parallel.

This is because the set of peers to join is determined once at startup.

In effect, this means that clustering on Kubernetes must only be enabled via a StatefulSet, and podManagementPolicy must not be set to Parallel. If these conditions are not met, clustering will end up in a split-brain state.

To avoid these constraints, nodes should rediscover and rejoin the cluster in the background on a timer to address split-brain issues.
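The split-brain scenario in this use case can be illustrated with a toy model. This is a simplified sketch, not how the agent tracks membership: nodes are integers, a join is an undirected edge, and a cluster is a connected component. `countClusters` and the `unionFind` type are hypothetical helpers.

```go
package main

import "fmt"

// unionFind tracks which nodes ended up in the same cluster.
type unionFind struct{ parent []int }

func newUnionFind(n int) *unionFind {
	parent := make([]int, n)
	for i := range parent {
		parent[i] = i
	}
	return &unionFind{parent: parent}
}

// find returns the representative of x's cluster, compressing paths.
func (u *unionFind) find(x int) int {
	for u.parent[x] != x {
		u.parent[x] = u.parent[u.parent[x]]
		x = u.parent[x]
	}
	return x
}

// union merges the clusters containing a and b.
func (u *unionFind) union(a, b int) {
	u.parent[u.find(a)] = u.find(b)
}

// countClusters returns how many separate clusters exist among n nodes,
// given the peers each node joined.
func countClusters(n int, joins map[int][]int) int {
	uf := newUnionFind(n)
	for node, peers := range joins {
		for _, p := range peers {
			uf.union(node, p)
		}
	}
	roots := map[int]bool{}
	for i := 0; i < n; i++ {
		roots[uf.find(i)] = true
	}
	return len(roots)
}

func main() {
	// Parallel bootstrap with one-time discovery: nodes 0 and 1 only
	// saw each other at startup, and so did nodes 2 and 3.
	startup := map[int][]int{0: {1}, 1: {0}, 2: {3}, 3: {2}}
	fmt.Println("clusters after startup:", countClusters(4, startup)) // prints 2

	// A later rediscovery: node 1 now sees every peer and rejoins them.
	afterRejoin := map[int][]int{0: {1}, 1: {0, 2, 3}, 2: {3}, 3: {2}}
	fmt.Println("clusters after rejoin:", countClusters(4, afterRejoin)) // prints 1
}
```

In this model a single node rediscovering the full peer set is enough to merge the two halves, which is why a periodic rejoin removes the need for a shared seed node or a sequential bootstrap.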

braunsonm · 2023-07-26T11:17:14Z

Along with this issue, please add chart support for the additional flags, alongside the clustering flag that was recently added.

tpaschalis · 2023-07-26T12:12:22Z

@braunsonm are you suggesting exposing the rest of the --cluster.* flags alongside the clustering.enabled field in values.yaml? We've thought of expanding it for ease of use once we've settled on the options we'd like to expose.

In the meantime, there's an extraArgs field on values.yaml you can use to pass any flags to the run command.

braunsonm · 2023-07-26T12:14:16Z

Yeah, I'm suggesting that clustering.enabled in the chart should configure auto-discovery on an interval in the future.

This can definitely be done manually in the meantime, but ideally it should allow more modes than just StatefulSet to be deployed. We're specifically looking for DaemonSet to be supported, as we run node exporter with Grafana Agent.

Go 1.19 adds Windows support for the native Go network stack, but doesn't include support for resolving DNS short names. Other projects, including Prometheus, have updated their build process to exclude the netgo build tags when producing Windows binaries to work around this behavior. Fixes grafana#4465.

rfratto added the enhancement (New feature or request), type/core, and flow (Related to Grafana Agent Flow) labels Jul 14, 2023

grafanabot added this to Grafana Agent (Public) Jul 14, 2023

github-project-automation bot moved this to Todo in Grafana Agent (Public) Jul 14, 2023

rfratto mentioned this issue Jul 26, 2023

clustering: refresh peer list on runtime #4596

Closed

This was referenced Jul 26, 2023

helm: add native support for clustering with agent.clustering.enabled value #4372

Merged

clustering: enable refreshing the list of peers on an interval #4608

Merged

rfratto mentioned this issue Aug 1, 2023

build: do not produce Windows binaries with netgo build tags #4672

Merged

tpaschalis closed this as completed in #4608 Aug 16, 2023

github-project-automation bot moved this from Todo to Done in Grafana Agent (Public) Aug 16, 2023

github-actions bot added the frozen-due-to-age label (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.) Feb 21, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
