[Kubernetes] Peers ip caching causes all clusters to degrade over time #2250

Closed
grzesuav opened this issue May 5, 2020 · 13 comments · Fixed by prometheus-community/helm-charts#4877

Comments

@grzesuav

grzesuav commented May 5, 2020

What did you do?

We have deployed several separate Alertmanager installations, each in a different namespace.

What did you expect to see?

Each Alertmanager installation forms its own separate cluster.

What did you see instead? Under which circumstances?

After some time, we noticed that all of those installations had formed one big cluster.
We think that:

  • the DNS address of a peer passed via --cluster.peer is resolved to an IP address, and from then on that IP address is used
  • when an Alertmanager instance is restarted, evicted to a different node, etc., its old IP address can be assigned to another pod, which accidentally happens to be an Alertmanager instance from another cluster
  • they join into one big cluster
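
The suspected mechanism in the bullets above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Alertmanager's actual code: the pod names, IPs, and the `reachable_cluster` helper are all made up for illustration, and each pod is assumed to remember peer IPs rather than DNS names.

```python
def reachable_cluster(start, cached_peers, ip_owner):
    """Return the set of pods reachable from `start` by following cached peer IPs."""
    seen = {start}
    stack = [start]
    while stack:
        pod = stack.pop()
        for ip in cached_peers.get(pod, ()):
            owner = ip_owner.get(ip)  # whichever pod holds that IP *now*
            if owner is not None and owner not in seen:
                seen.add(owner)
                stack.append(owner)
    return seen


# Two separate installations: pods a-0/a-1 and pods b-0/b-1 (hypothetical names).
ip_owner = {
    "10.0.0.1": "a-0", "10.0.0.2": "a-1",
    "10.0.0.3": "b-0", "10.0.0.4": "b-1",
}
cached_peers = {
    "a-0": ["10.0.0.2"], "a-1": ["10.0.0.1"],
    "b-0": ["10.0.0.4"], "b-1": ["10.0.0.3"],
}
before = reachable_cluster("a-0", cached_peers, ip_owner)

# a-1 and b-1 are rescheduled and happen to swap IPs.
ip_owner["10.0.0.2"], ip_owner["10.0.0.4"] = "b-1", "a-1"
# The restarted pods resolve DNS again, so their own caches are fresh;
# a-0 and b-0 were not restarted and keep their stale caches.
cached_peers["a-1"] = ["10.0.0.1"]
cached_peers["b-1"] = ["10.0.0.3"]
after = reachable_cluster("a-0", cached_peers, ip_owner)

print(sorted(before))
print(sorted(after))
```

In this model a single stale cache entry on a pod that was not restarted is enough to bridge the two installations.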

Environment

Kubernetes cluster

  • System information:

    will add if needed

  • Alertmanager version:

    insert output of alertmanager --version here

Different versions: 0.20, 0.17, 0.18

  • Prometheus version:

    not related

  • Alertmanager configuration file:

will add if needed

  • Prometheus configuration file:

will add if needed

  • Logs:

will add if needed

@grzesuav grzesuav changed the title [Kubernetes] Peers ip caching causes all cluster to degrade over time [Kubernetes] Peers ip caching causes all clusters to degrade over time May 5, 2020
@simonpasquier
Member

Can you elaborate about your setup? Which names do you use for --cluster.peer?
Alertmanager will try to reconnect to a previously known IP address for 6 hours by default. After this, it will forget about it.

@grzesuav
Author

grzesuav commented May 7, 2020

hi @simonpasquier ,
so my alertmanager pod arg line looks like

  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --storage.path=/alertmanager
    - --data.retention=120h
    - --web.listen-address=:9093
    - --web.external-url=https://alertmanager-address
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated.namespace.svc:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated.namespace.svc:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated.namespace.svc:9094

And yes, right after a hard reset of all instances at the same time, it starts with three peers. Before that, however, it had formed a cluster together with all the other Alertmanager installations in our Kubernetes cluster (over 50 instances, I guess).

So, from what you are saying, if the IP gets reassigned to a completely different pod during those 6 hours, AM will try to form a cluster with this new pod?

@simonpasquier
Member

> So, from what you are saying, if the IP gets reassigned to a completely different pod during those 6 hours, AM will try to form a cluster with this new pod?

yes

@simonpasquier
Member

BTW you can set the --cluster.reconnect-timeout flag to a lower value than the default 6 hours.
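
As a rough illustration of what lowering that timeout changes, here is a toy model of the "forget after the reconnect timeout" behavior described above. It is a sketch only: the `still_retried` helper and the sample IPs are hypothetical, and the only value taken from this thread is the 6 hour default.

```python
from datetime import datetime, timedelta

# Default stated in this thread; everything else here is illustrative.
DEFAULT_RECONNECT_TIMEOUT = timedelta(hours=6)


def still_retried(failed_since, now, reconnect_timeout=DEFAULT_RECONNECT_TIMEOUT):
    """Return the peer IPs that would still be retried at time `now`."""
    return sorted(
        ip for ip, since in failed_since.items()
        if now - since < reconnect_timeout
    )


now = datetime(2020, 5, 7, 12, 0)
failed_since = {
    "10.0.0.2": now - timedelta(hours=1),  # peer vanished an hour ago
    "10.0.0.9": now - timedelta(hours=7),  # peer vanished seven hours ago
}

with_default = still_retried(failed_since, now)
with_short = still_retried(failed_since, now, timedelta(minutes=5))
print(with_default, with_short)
```

With the default, the IP that disappeared an hour ago is still a reconnect target, so a foreign pod that picks up that IP in the meantime can be contacted; with a 5 minute timeout it has already been forgotten.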

@grzesuav
Author

Actually, I am trying to resolve it with NetworkPolicies; however, that is still a workaround. I would still consider this something that should be handled, or at least mentioned in the documentation (we are using prometheus operator, so we rely a bit on the defaults there), because in our case silencing an alert on one Alertmanager caused somebody else to miss a notification.

@bjakubski

I found this issue while trying to figure out why our Alertmanagers formed a mesh across cluster boundaries. I was investigating because alerts about Alertmanager being in an inconsistent state had fired; nothing more serious happened.
Same story: prometheus-operator, k8s.
Granted, the issue was caused mostly by a misconfiguration on our side (a different cluster has the same IP range and is routable).

I do find it surprising that alertmanager (often used in dynamic envs like k8s) will not try to resolve the names on every connection attempt
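
For comparison, resolving the configured names on every connection attempt could look roughly like the sketch below. It is not how Alertmanager works (the thread says it reuses the previously known IP); the `resolve` and `connect_targets` helpers are hypothetical, and only `socket.getaddrinfo` is a real standard-library call.

```python
import socket


def resolve(host, port):
    """Look up the current addresses for host:port via the system resolver."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def connect_targets(peer_names, port, resolver=resolve):
    """Build the target list from DNS names on each call, never from a cache."""
    targets = {}
    for name in peer_names:
        try:
            targets[name] = resolver(name, port)
        except OSError:  # socket.gaierror is a subclass of OSError
            targets[name] = []  # name currently does not resolve: skip it
    return targets


# Usage with a fake resolver standing in for cluster DNS (hypothetical names).
dns = {"am-0.am-headless": ["10.0.0.1"], "am-1.am-headless": ["10.0.0.2"]}
fake = lambda host, port: dns[host]

first = connect_targets(dns.keys(), 9094, resolver=fake)
dns["am-1.am-headless"] = ["10.0.0.7"]  # pod rescheduled, DNS record updated
second = connect_targets(dns.keys(), 9094, resolver=fake)
print(first, second)
```

Because the name is looked up again on each attempt, the second call follows the pod to its new IP instead of dialing whatever now owns the old one.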

@hwoarang

hwoarang commented Aug 3, 2020

BTW you can set the --cluster.reconnect-timeout flag to a lower value than the default 6 hours.

That's a reasonable suggestion but prometheus operator does not let you pass additional parameters to the alertmanager instance. But all in all, this is something that needs to be addressed there of course.

hwoarang added a commit to hwoarang/prometheus-operator that referenced this issue Aug 24, 2020
In a high-dynamic environment like kubernetes, it's possible that
alertmanager pods come and go on frequent intervals. The default timeout
value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and go through another DNS resolution process. As such, it's best
to use a lower value which will allow the alertmanager cluster to
recover in case of an update/rollout/etc process in the kubernetes
cluster.

Related: prometheus/alertmanager#2250
hwoarang added a commit to hwoarang/prometheus-operator that referenced this issue Aug 24, 2020
In a high-dynamic environment like kubernetes, it's possible that
alertmanager pods come and go on frequent intervals. The default timeout
value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and goes through another DNS resolution process. As such, it's best
to use a lower value which will allow the alertmanager cluster to
recover in case of an update/rollout/etc process in the kubernetes
cluster.

Related: prometheus/alertmanager#2250
hwoarang added a commit to hwoarang/prometheus-operator that referenced this issue Aug 25, 2020
Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address, which it uses at regular intervals to 'refresh'
the connection.

In high-dynamic environment like kubernetes, it's possible that
alertmanager pods come and go on frequent intervals. The default timeout
value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and remove that peer from the member list. During this period of
time, the cluster is reported to be in a degraded state due to the
missing member.

As such, it's best to use a lower value which will allow the
alertmanager to remove the pod from the list of peers soon
after it disappears.

Related: prometheus/alertmanager#2250
@b10s

b10s commented Sep 6, 2021

I have the same issue and am able to reproduce it.

I think this is the nature of the gossip protocol, which Alertmanager should restrain a bit, since it knows its peers from its configuration and can verify the table of available peers against them.

UPD: steps to reproduce:

  1. Start your kind cluster:
$ kind create cluster
...
  2. Deploy two Alertmanager clusters in it:
$ helm install my-release foo/bar
$ helm install my-bad-release foo/bar
  3. Find the container running your kind k8s cluster and enter it:
$ docker exec -it 942e41a1c6e6 bash
  4. Inside the container, change the CNI settings and restart kubelet:
# sed -i 's/"subnet": "10.244.0.0\/24"/"subnet": "10.244.0.0\/28"/g' /etc/cni/net.d/10-kindnet.conflist
# systemctl restart kubelet
  5. Create a few more Pods with nginx to make sure there are no more available IPs.

  6. Delete one Alertmanager Pod from one cluster and one from the other in the same command, so there is a chance they will reuse each other's IPs.

  7. Enjoy the merged Alertmanager cluster.
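
To see why shrinking the subnet in the CNI step makes IP reuse likely, the address counts can be checked with Python's standard `ipaddress` module. This is a small sketch; the `usable_hosts` helper is only for illustration, and the number of IPs actually handed out to Pods may be lower still (for example if the CNI plugin reserves a gateway address).

```python
import ipaddress


def usable_hosts(cidr):
    """Count host addresses in a subnet (network and broadcast excluded)."""
    return sum(1 for _ in ipaddress.ip_network(cidr).hosts())


original = usable_hosts("10.244.0.0/24")
shrunk = usable_hosts("10.244.0.0/28")
print(original, shrunk)
```

With only about a dozen addresses left, two Alertmanager releases plus a few nginx Pods exhaust the range, so a deleted Pod's IP is handed straight to the next Pod that starts.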


 Args:
      --storage.path=/alertmanager
      --config.file=/config_out/alertmanager.yml
      --cluster.advertise-address=$(POD_IP):9094
      --cluster.listen-address=0.0.0.0:9094
      --cluster.peer=my-release-alertmanager-0.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-1.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-2.my-release-alertmanager-headless:9094

You can see there are only three peers here.

Before making them switch IPs, the IP assignment was:

[screenshot: Selection_999(551)]

After restarting a few Pods a few times, I can make them reuse IPs:

[screenshot: Selection_999(552)]

Since the other Pods were not restarted, they still keep the old IPs in their gossip table of available peers. Therefore the two clusters merge into one:
[screenshot: Selection_999(549)]

@grzesuav
Author

I am not familiar with the gossip protocol, but I can imagine that using some identifier for the cluster (e.g. the statefulset name), and using it to verify whether another pod should join my network, would also be a viable solution.
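
The identifier idea could be sketched as follows. This is a hypothetical illustration of the proposal only, not an existing Alertmanager or gossip-library API: the `JoinRequest` type, the `admit` helper, and the identifier values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinRequest:
    pod: str
    cluster_id: str  # e.g. the statefulset name, as suggested above


def admit(local_cluster_id, requests):
    """Keep only the pods whose cluster identifier matches our own."""
    return [r.pod for r in requests if r.cluster_id == local_cluster_id]


requests = [
    JoinRequest("alertmanager-main-1", "alertmanager-main"),
    JoinRequest("alertmanager-other-0", "alertmanager-other"),  # reused a stale IP
]
admitted = admit("alertmanager-main", requests)
print(admitted)
```

A pod from another installation that inherits a cached IP would still be contacted, but it would be rejected because its identifier differs.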

@grzesuav
Author

Of course, at the Alertmanager level it would be a CLI argument, which people not using prom-op would need to set; otherwise some default would be used.

@b10s

b10s commented Oct 22, 2021

@grzesuav ,

It seems TLS support for gossip is coming to AM (if it is not released already), which is one way to avoid the issue:
https://github.com/prometheus/alertmanager/blob/main/docs/https.md#gossip-traffic

Also some notes:
https://github.com/prometheus/alertmanager/tree/main/examples/ha/tls

Thanks to @simonpasquier for sharing these docs over IRC : )

@b10s

b10s commented Nov 5, 2021

Another possible solution is to add a cluster ID:
https://groups.google.com/g/prometheus-developers/c/wJ60O2Mk3js/m/qixf31fRBQAJ

greed42 added a commit to greed42/alertmanager that referenced this issue Feb 15, 2023
This is an alternate mechanism for isolating Alertmanager clusters without having to set up the right components of TLS.

It should solve issues such as <prometheus#2250>, although enabling this feature will lead to loss of non-persisted state. (For example, if you rely on alertmanager cluster peering to maintain silences instead of using persistent volume storage in Kubernetes.) The Gossip label serves as the "cluster ID" idea mentioned in <prometheus#2250 (comment)>.

You can enable with the command-line flag, `--cluster.gossip-label`; any non-empty string will form an effective namespace for gossip communication.

If you use Prometheus Operator, you can set the `ALERTMANAGER_CLUSTER_GOSSIP_LABEL` environment variable (as Prometheus Operator does not have a way of adding additional command-line flags). You would need to modify your Alertmanager object something like:

```
kind: Alertmanager
...
spec:
  ...
  containers:
    - name: alertmanager
      env:
        - name: ALERTMANAGER_CLUSTER_GOSSIP_LABEL
          value: infrastructure-eu-west-2
  ...
```

This is low-security mechanism, suitable for use with Alertmanager configuration where anyone can add or remove a silence. It protects against surprising cluster expansion due to IP:port re-use.

Signed-off-by: Graham Reed <greed@7deadly.org>
@simonpasquier
Member

This should be fixed by #3354, which allows defining a label that identifies the cluster and prevents external instances from joining the cluster if they don't share the same label.