Clustering: re-discover and re-join peers on interval in background #4465

Closed
rfratto opened this issue Jul 14, 2023 · 3 comments · Fixed by #4608
Labels
enhancement New feature or request flow Related to Grafana Agent Flow frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.

Comments

@rfratto
Member

rfratto commented Jul 14, 2023

Request

Add a new flag, --cluster.rejoin-interval, which specifies how often a node should rediscover peers and rejoin them to address split-brain issues.

--cluster.rejoin-interval should default to some reasonable value, such as 60s.

When set to 0s, rediscovery/rejoining is disabled.
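
As a rough illustration of the semantics described above (a periodic rediscover-and-rejoin loop that is skipped entirely when the interval is 0s), here is a minimal Go sketch. It is not the Grafana Agent implementation: `runRejoinLoop`, `rejoinEnabled`, `defaultRejoinInterval`, and the `discover`/`join` callbacks are hypothetical names introduced only for this example, and the peer addresses are made up.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// defaultRejoinInterval mirrors the 60s default suggested in the request.
// The name is hypothetical.
const defaultRejoinInterval = 60 * time.Second

// rejoinEnabled reports whether background rejoining is on; 0s disables it.
func rejoinEnabled(interval time.Duration) bool {
	return interval > 0
}

// runRejoinLoop calls discover and then join once per interval until ctx is
// cancelled. Both callbacks are hypothetical hooks, not Grafana Agent APIs.
func runRejoinLoop(
	ctx context.Context,
	interval time.Duration,
	discover func(context.Context) ([]string, error),
	join func(context.Context, []string) error,
) {
	if !rejoinEnabled(interval) {
		return
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			peers, err := discover(ctx)
			if err != nil {
				fmt.Println("discovery failed, retrying on next tick:", err)
				continue
			}
			if err := join(ctx, peers); err != nil {
				fmt.Println("rejoin failed, retrying on next tick:", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()

	// Stand-in discovery: returns a fixed, made-up peer list.
	discover := func(context.Context) ([]string, error) {
		return []string{"agent-0:12345", "agent-1:12345"}, nil
	}
	// Stand-in join: only logs the peers it would rejoin.
	join := func(_ context.Context, peers []string) error {
		fmt.Println("rejoining", peers)
		return nil
	}

	// A short interval keeps the demo quick; a real deployment would use
	// something like defaultRejoinInterval.
	runRejoinLoop(ctx, 100*time.Millisecond, discover, join)
}
```

Errors are logged and retried on the next tick rather than treated as fatal, on the assumption that a transient discovery failure should not stop later rejoin attempts.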

This proposal should be paired with #4464 to avoid overwhelming the network on large clusters, as the state push/pull done on join is more expensive than other gossip traffic.

Use case

Today, for nodes to join a cluster successfully, at least one of the following must be true:

  • All nodes join the same seed node, or
  • The initial cluster is not bootstrapped in parallel.

This is because the set of peers to join is determined once at startup.

In effect, this means that clustering on Kubernetes can only be enabled via a StatefulSet, and podManagementPolicy must not be set to Parallel. If neither of these conditions is met, clustering will end up in a split-brain state.

To avoid these constraints, nodes should rediscover and rejoin the cluster in the background on some timer to address split brain issues.
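
To make the split-brain scenario concrete, here is a toy Go simulation. It assumes a simplified model in which joining any member of a cluster merges the two clusters, tracked with a small union-find. The `membership` type and the `agent-N` node names are hypothetical and unrelated to the agent's actual clustering code.

```go
package main

import "fmt"

// membership is a toy union-find that tracks which nodes share a cluster.
type membership struct {
	parent map[string]string
}

func newMembership(nodes ...string) *membership {
	m := &membership{parent: make(map[string]string)}
	for _, n := range nodes {
		m.parent[n] = n // every node starts in its own cluster
	}
	return m
}

// find returns the representative node of n's cluster.
func (m *membership) find(n string) string {
	for m.parent[n] != n {
		n = m.parent[n]
	}
	return n
}

// join merges the clusters that a and b belong to.
func (m *membership) join(a, b string) {
	m.parent[m.find(a)] = m.find(b)
}

// partitions counts the distinct clusters.
func (m *membership) partitions() int {
	roots := make(map[string]struct{})
	for n := range m.parent {
		roots[m.find(n)] = struct{}{}
	}
	return len(roots)
}

func main() {
	nodes := []string{"agent-0", "agent-1", "agent-2", "agent-3"}
	m := newMembership(nodes...)

	// Parallel bootstrap: each node's one-time discovery at startup saw only
	// part of the peer set, so two separate clusters form.
	m.join("agent-0", "agent-1")
	m.join("agent-2", "agent-3")
	fmt.Println("partitions after startup:", m.partitions()) // 2 (split brain)

	// Background rejoin: every node later rediscovers the full peer list and
	// joins each peer, which merges the two halves.
	for _, n := range nodes {
		for _, p := range nodes {
			m.join(n, p)
		}
	}
	fmt.Println("partitions after rejoin:", m.partitions()) // 1
}
```

In this model the one-time discovery at startup is what freezes the two halves apart. Repeating discovery later gives each node another chance to see peers it missed.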

@braunsonm

Along with this issue, please add chart support for the additional flags, alongside the clustering flag that was recently added.

@tpaschalis
Member

@braunsonm are you suggesting exposing the rest of the --cluster.* flags alongside the clustering.enabled field in values.yaml? We've thought of expanding it for ease of use once we've settled on the options we'd like to expose.

In the meantime, there's an extraArgs field on values.yaml you can use to pass any flags to the run command.

@braunsonm

Yeah, I'm suggesting that clustering.enabled in the chart should configure auto discovery on an interval in the future.

This can definitely be done manually in the meantime, but ideally the chart should allow deployment modes other than just StatefulSet. We're specifically looking for DaemonSet support, as we run node exporter with Grafana Agent.

rfratto added a commit to rfratto/agent that referenced this issue Aug 1, 2023
Go 1.19 adds Windows support for the native Go network stack, but
doesn't include support for resolving DNS short names.

Other projects, including Prometheus, have updated their build process
to exclude the netgo build tags when producing Windows binaries to work
around this behavior.

Fixes grafana#4465.
@github-project-automation github-project-automation bot moved this from Todo to Done in Grafana Agent (Public) Aug 16, 2023
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024

3 participants