HA tracker KV store cleanup #3809
Conversation
This PR is now ready for review. Thanks!
Very good job as usual! I left a few nits, but LGTM.
pkg/distributor/ha_tracker.go
Outdated
@@ -58,7 +58,8 @@ type HATrackerConfig struct {
	// other than the replica written in the KVStore if the difference
	// between the stored timestamp and the time we received a sample is
	// more than this duration
	FailoverTimeout time.Duration `yaml:"ha_tracker_failover_timeout"`
	FailoverTimeout time.Duration `yaml:"ha_tracker_failover_timeout"`
	CleanupOldReplicas bool `yaml:"cleanup_old_replicas"`
Any reason why we don't want to always keep it enabled?
My thinking was to test it first in our production (where we have some of these old HA pairs), and then enable it and remove this option. WDYT?
I'm fine with it as long as we remove it once we're confident. It's marked experimental, so we can remove it later.
Thinking about it more, this adds a burden to our deployment. I'm inclined to remove it.
Discussed offline; we decided to remove it, since it's quick to roll back distributors in case there are problems with this feature.
LGTM, thanks! I left a comment about metric names. Up to your judgment.
What this PR does: This PR implements cleanup of old HA replicas from the KV store. Since we cannot reliably receive notifications about deleted entries from the KV store (e.g. Consul doesn't send them), we introduce a new `DeletedAt` field in the `ReplicaDesc` message stored in the KV store. This works as a "marker": after marking a replica for deletion, distributors remove it from memory. Some time after marking, the replica is deleted from the KV store completely. If a new sample arrives for the given cluster in the meantime, the deletion is aborted (we reset `DeletedAt` back to 0).

The cleanup cycle and deletion timeout are hardcoded: the cleanup cycle runs every 30 minutes with ±5 minutes of jitter, and the marking and deletion timeouts are 30 minutes. There are no config options to change this.
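Below is a minimal sketch of this marker flow, assuming the `ReplicaDesc`/`DeletedAt` fields and the 30-minute timeouts described above; the package and helper functions are hypothetical and simplified, not the actual Cortex implementation.

```go
// Package hatrackersketch illustrates the DeletedAt "marker" flow described in
// this PR. The field names and 30-minute timeouts follow the PR description;
// the helpers are hypothetical and omit the real KV-store interaction.
package hatrackersketch

import "time"

// ReplicaDesc is the value stored in the KV store for each (user, cluster).
type ReplicaDesc struct {
	Replica    string
	ReceivedAt int64 // unix millis of the last accepted sample
	DeletedAt  int64 // 0 = live; non-zero = marked for deletion at this time
}

const (
	cleanupInterval = 30 * time.Minute // cleanup cycle, run with ±5 min jitter
	deletionDelay   = 30 * time.Minute // marked entries are removed after this long
)

// markForDeletion flags an idle entry instead of deleting it outright, because
// the KV store may not notify watchers about deleted keys; distributors see
// the marker through their normal watch and drop the replica from memory.
func markForDeletion(d *ReplicaDesc, now time.Time) {
	if d.DeletedAt == 0 {
		d.DeletedAt = now.UnixMilli()
	}
}

// onSample is called when a new sample arrives for the cluster: any pending
// deletion is aborted by resetting the marker back to 0.
func onSample(d *ReplicaDesc, replica string, now time.Time) {
	d.Replica = replica
	d.ReceivedAt = now.UnixMilli()
	d.DeletedAt = 0
}

// shouldDelete reports whether an entry has been marked long enough to be
// removed from the KV store completely.
func shouldDelete(d *ReplicaDesc, now time.Time) bool {
	return d.DeletedAt > 0 && now.Sub(time.UnixMilli(d.DeletedAt)) >= deletionDelay
}
```

Because the deletion is driven entirely through values written to the KV store, every distributor observes the marker via its existing watch and can drop the replica from memory, even on backends like Consul that do not notify watchers about deleted keys.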
This PR builds on top of #3808, and will be rebased once #3808 is merged. Done.

Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`