pkg/alertmanager: Use lower value for --cluster.reconnect-timeout

In a highly dynamic environment like Kubernetes, Alertmanager pods can come and go frequently. The default timeout of 6h is not suitable in that case, because Alertmanager keeps trying to reconnect to a pod that no longer exists until it gives up and goes through another DNS resolution. It is therefore better to use a lower value, which allows the Alertmanager cluster to recover after an update, rollout, or similar process in the Kubernetes cluster. Related: prometheus/alertmanager#2250
hwoarang · Aug 24, 2020 · 00dbe38
1 parent 08a9647
commit 00dbe38
Showing 1 changed file with 5 additions and 0 deletions.
diff --git a/pkg/alertmanager/statefulset.go b/pkg/alertmanager/statefulset.go
@@ -384,6 +384,11 @@ func makeStatefulSetSpec(a *monitoringv1.Alertmanager, config Config) (*appsv1.S
 				// below Alertmanager v0.15.0 high availability flags are prefixed with 'mesh' instead of 'cluster'
 				amArgs[i] = strings.Replace(amArgs[i], "--cluster.", "--mesh.", 1)
 			}
+		} else {
+			// reconnect-timeout was added in 0.15 (https://github.com/prometheus/alertmanager/pull/1384)
+			// Override default 6h value to allow AlertManager cluster to recover quickly during pod restarts
+			// and rolling updates.
+			amArgs = append(amArgs, "--cluster.reconnect-timeout=5m")
 		}
 		if version.Minor < 13 {
 			for i := range amArgs {
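
The version gate in this hunk can be tried in isolation with a minimal sketch like the one below. It is not the operator's code: `clusterArgs` is a hypothetical helper, the `minor` parameter stands in for `version.Minor`, and the `--cluster.listen-address=:9094` argument is only an example input. It assumes, as the diff comments state, that versions below 0.15 use the `--mesh.` prefix and that `--cluster.reconnect-timeout` exists from 0.15 onward.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterArgs is a hypothetical helper that mirrors the branch in the diff.
// minor is the Alertmanager minor version (the N in 0.N.x).
func clusterArgs(minor int, args []string) []string {
	// Copy the input so the caller's slice is never modified.
	out := make([]string, len(args))
	copy(out, args)

	if minor < 15 {
		// Below v0.15.0 the high availability flags use the 'mesh' prefix,
		// and the reconnect-timeout flag is not available.
		for i := range out {
			out[i] = strings.Replace(out[i], "--cluster.", "--mesh.", 1)
		}
		return out
	}

	// From 0.15 on, override the 6h default so the cluster recovers
	// quickly after pod restarts and rolling updates.
	return append(out, "--cluster.reconnect-timeout=5m")
}

func main() {
	base := []string{"--cluster.listen-address=:9094"} // example input only

	fmt.Println(clusterArgs(14, base))
	fmt.Println(clusterArgs(15, base))
}
```

Keeping the append in the `else` branch means older Alertmanager versions never receive a flag they do not recognize, which appears to be why the change sits inside the existing version check rather than next to the other default arguments.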