Skip to content

Comments

[gossip] Don't merge expired gossip messages#1631

Merged
stuartnelson3 merged 3 commits intomasterfrom
stn/not-expiring-silences
Nov 21, 2018
Merged

[gossip] Don't merge expired gossip messages#1631
stuartnelson3 merged 3 commits intomasterfrom
stn/not-expiring-silences

Conversation

@stuartnelson3
Copy link
Contributor

If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.

fixes #1624

The issue was in the merge function: https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L752-L756

Expired silences would get deleted every 15 minutes (the hard-coded maintenance interval), but on each full-state sync an alertmanager would get a bunch of expired silences. Since they didn't exist in its state cache, they would be merged back in. Since the merge happens every minute, they were seemingly guaranteed to stick around. If all the alertmanagers were started at the same time they might GC all the silences before a full state sync, but if they're offset in their maintenance timing, they would keep sharing the same expired silences with each other. This explains why I saw the silences "GC'd", because I had looked within this magical window of "maintenance has run, but a full state sync hasn't yet", but then they came back.

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, that you caught the bug. Would you mind adding a unit test?

@stuartnelson3 stuartnelson3 force-pushed the stn/not-expiring-silences branch from d1f7ce2 to 066e9ac Compare November 20, 2018 18:35
If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/not-expiring-silences branch from 066e9ac to 5f811ad Compare November 20, 2018 18:45
@stuartnelson3
Copy link
Contributor Author

Oh dear, my force-pushings are being reported!

@stuartnelson3
Copy link
Contributor Author

Also are notification logs affected too?

I expect so, we've just never noticed because they don't appear in the UI.

The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3
Copy link
Contributor Author

The code between these two is so similar, the difference is just a generic away ..

@stuartnelson3 stuartnelson3 changed the title [silences] Don't merge expired silences [gossip] Don't merge expired gossip messages Nov 20, 2018
With the default 0 retention, the alerts would not
be merged.

Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@stuartnelson3 stuartnelson3 merged commit 2026e4a into master Nov 21, 2018
@stuartnelson3 stuartnelson3 deleted the stn/not-expiring-silences branch November 21, 2018 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alertmanager not GC'ing silences

4 participants