Description
TL;DR
If clustering is enabled, Alertmanager (AM) should be considered ready only once gossip has settled. This is currently not the case. I'm wondering whether this is a bug or intended behavior.
Description
I'm investigating failed inhibitions and some other Alertmanager issues that we're facing, and I think that waiting for gossip to settle could reduce these weird behaviors when an instance is restarted.
When Alertmanager starts, it immediately reports itself as ready and starts serving traffic without having a complete picture of what is going on in the cluster. This can lead to unexpected behavior, such as inhibitions being missed or duplicate alerts being sent out.
Code Snippets
Here we can see that there is a Settle method:
alertmanager/cluster/cluster.go
Lines 667 to 672 in e252c2e
// Settle waits until the mesh is ready (and sets the appropriate internal state when it is).
// The idea is that we don't want to start "working" before we get a chance to know most of the alerts and/or silences.
// Inspired from https://github.com/apache/cassandra/blob/7a40abb6a5108688fb1b10c375bb751cbb782ea4/src/java/org/apache/cassandra/gms/Gossiper.java
// This is clearly not perfect or strictly correct but should prevent the alertmanager to send notification before it is obviously not ready.
// This is especially important for those that do not have persistent storage.
func (p *Peer) Settle(ctx context.Context, interval time.Duration) {
that gets called in a goroutine during startup:
alertmanager/cmd/alertmanager/main.go
Lines 326 to 343 in e252c2e
// Peer state listeners have been registered, now we can join and get the initial state.
if peer != nil {
	err = peer.Join(
		*reconnectInterval,
		*peerReconnectTimeout,
	)
	if err != nil {
		logger.Warn("unable to join gossip mesh", "err", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), *settleTimeout)
	defer func() {
		cancel()
		if err := peer.Leave(10 * time.Second); err != nil {
			logger.Warn("unable to leave gossip mesh", "err", err)
		}
	}()
	go peer.Settle(ctx, *gossipInterval*10)
}
The cluster Peer struct also provides a method that reports whether gossip has settled:
alertmanager/cluster/cluster.go
Lines 582 to 590 in e252c2e
// Return true when router has settled.
func (p *Peer) Ready() bool {
	select {
	case <-p.readyc:
		return true
	default:
	}
	return false
}
However, this method is checked neither during startup nor when serving /-/ready.
Related Issues