Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement polling tenants concurrently #3647

Merged
merged 22 commits into from
Aug 5, 2024

Conversation

zalegrala
Copy link
Contributor

@zalegrala zalegrala commented May 3, 2024

What this PR does:

Here we implement basic polling for tenants concurrently. This is another step
towards making the polling more efficient. The next step will be to implement
a weighted polling strategy to poll tenants by some priority. For now, I think
this is a reasonable start and should stand on its own.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@github-actions github-actions bot added the stale Used for stale issues / PRs label Jul 3, 2024
@zalegrala zalegrala added keepalive Label to exempt Issues / PRs from stale workflow and removed stale Used for stale issues / PRs labels Jul 3, 2024
@grafana grafana deleted a comment from github-actions bot Jul 12, 2024
@zalegrala
Copy link
Contributor Author

I'm aware of a race in the tests when we check the error message if we exceed the threshold during polling. Since now we poll concurrently, we can't know which error will be the final error, and so an adjustment needs to be made to assert the error is one of n, rather than one specific. I'll follow up.

In the meantime, feedback welcome.

My thought to follow this up this PR is to modify the tenant loop, so we loop over tenants which are sorted by some metric, like block size, or last poll duration. This way we can start work on the ones that require the most effort first.

Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this lgtm. not approving b/c it's still marked draft

tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
@zalegrala
Copy link
Contributor Author

I've modified the error handling to be consistent, which required keeping of a []error for the duration of the Do() call. This allows us to know if the sequence of errors matches the sequence of the tenants. This might be less important than the order in which the errors arrive, but at least its consistent.

@zalegrala zalegrala force-pushed the tenantPollMulti branch 2 times, most recently from 9ad2b78 to 8a99b99 Compare July 17, 2024 15:16
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. one thought.

can we add the new config option to the docs?

tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go!

Here we make changes to the error handling to account for the additional
complexity brought with the tenant concurrency.

This changes the behavior of the
blocklist_poll_tolerate_consecutive_errors configuration by applying to
a single tenant, which instructs the poller to retry until the threshold
is met.

A new configuration parameter blocklist_poll_tolerate_tenant_failures
has been added to account for the number of failing tenants that will be
tolerated.

This allows parts of the old behavior scoped to a single tenant, but
also accounts for a more global picture.  This means that a single
failing tenant by default will not stop the entire polling process.

Tests have been updated to account for this additional logic.
@zalegrala zalegrala merged commit d3a9caf into grafana:main Aug 5, 2024
17 checks passed
@zalegrala zalegrala deleted the tenantPollMulti branch August 5, 2024 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keepalive Label to exempt Issues / PRs from stale workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants