Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HA tracker KV store cleanup #3809

Merged
merged 10 commits into from
Feb 15, 2021
Merged

HA tracker KV store cleanup #3809

merged 10 commits into from
Feb 15, 2021

Conversation

pstibrany
Copy link
Contributor

@pstibrany pstibrany commented Feb 10, 2021

What this PR does: This PR implements cleanup of old HA replicas from KV store. Since we cannot reliably receive notifications about deleted entries from KV store (eg. Consul doesn't send them), we introduce new DeletedAt field to ReplicaDesc message stored in KV store. This works as a "marker". After marking replica for deletion, distributors will remove it from memory. After some time after marking, replica is deleted from KV store completely. If new sample arrives for given cluster in the meantime, deletion is aborted (we reset DeletedAt back to 0).

Cleanup cycle and timeout for deletion are hardcoded (cleanup cycle = every 30 minutes with ±5 minutes jitter), and marking and deletion timeouts are 30 minutes. There are no config options to change this.

This PR builds on top of #3808, and will be rebased once #3808 is merged. Done.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany pstibrany requested a review from pracucci February 10, 2021 11:20
@pstibrany pstibrany mentioned this pull request Feb 10, 2021
1 task
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany
Copy link
Contributor Author

This PR is now ready for review. Thanks!

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job as usual! I left few nits, but LGTM.

pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@@ -58,7 +58,8 @@ type HATrackerConfig struct {
// other than the replica written in the KVStore if the difference
// between the stored timestamp and the time we received a sample is
// more than this duration
FailoverTimeout time.Duration `yaml:"ha_tracker_failover_timeout"`
FailoverTimeout time.Duration `yaml:"ha_tracker_failover_timeout"`
CleanupOldReplicas bool `yaml:"cleanup_old_replicas"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we don't want to always keep it enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was to test it first in our production (where we have some of these old HA pairs), and then enable it and remove this option. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine as far as we remove it once we're confident. It's marked experimental, so we can remove it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more, that adds burden to our deployment. I'm inclined to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, decided to remove, as it's quick to roll back distributors in case there are problems with this feature.

pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! I left a comment about metric names. Up to your judgment.

pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
pkg/distributor/ha_tracker.go Outdated Show resolved Hide resolved
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany pstibrany merged commit 24c7511 into cortexproject:master Feb 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants