pkg/alertmanager: Use lower value for --cluster.reconnect-timeout
In a highly dynamic environment like Kubernetes, Alertmanager pods can
come and go at frequent intervals. The default timeout value of 6h is
not suitable in that case, because Alertmanager keeps trying to
reconnect to a pod that no longer exists until it gives up and goes
through another DNS resolution process. As such, it's best to use a
lower value, which allows the Alertmanager cluster to recover after an
update, rollout, or similar process in the Kubernetes cluster.

Related: prometheus/alertmanager#2250
hwoarang committed Aug 24, 2020
1 parent 08a9647 commit 00dbe38
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions pkg/alertmanager/statefulset.go
@@ -384,6 +384,11 @@ func makeStatefulSetSpec(a *monitoringv1.Alertmanager, config Config) (*appsv1.S
 				// below Alertmanager v0.15.0 high availability flags are prefixed with 'mesh' instead of 'cluster'
 				amArgs[i] = strings.Replace(amArgs[i], "--cluster.", "--mesh.", 1)
 			}
+		} else {
+			// reconnect-timeout was added in 0.15 (https://github.com/prometheus/alertmanager/pull/1384)
+			// Override default 6h value to allow AlertManager cluster to recover quickly during pod restarts
+			// and rolling updates.
+			amArgs = append(amArgs, "--cluster.reconnect-timeout=5m")
 		}
 		if version.Minor < 13 {
 			for i := range amArgs {
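
The version gate in this hunk can be sketched in isolation as follows. This is a minimal illustration, not the operator's actual code: `clusterArgs` is a hypothetical helper, the version is passed as plain integers rather than a parsed semver value, and the sketch assumes every version from 0.15 onward accepts the flag.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterArgs is a hypothetical helper mirroring the branch in the diff:
// versions below 0.15 get their "--cluster." flags renamed to "--mesh.",
// all other versions get the shorter reconnect timeout appended.
func clusterArgs(major, minor int, args []string) []string {
	// Copy so the caller's slice is never modified.
	out := make([]string, len(args), len(args)+1)
	copy(out, args)

	if major == 0 && minor < 15 {
		for i := range out {
			// Replace only the first occurrence, as the original code does.
			out[i] = strings.Replace(out[i], "--cluster.", "--mesh.", 1)
		}
		return out
	}

	// Assumption: reconnect-timeout is available from 0.15 onward.
	return append(out, "--cluster.reconnect-timeout=5m")
}

func main() {
	base := []string{"--cluster.listen-address=:9094"}
	fmt.Println(clusterArgs(0, 14, base))
	fmt.Println(clusterArgs(0, 21, base))
}
```

Older versions never receive the new flag, which matters because an Alertmanager binary rejects command-line flags it does not know.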