Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ingester] Optionally delay circuit breaker activation until after traffic starts #9842

Merged
merged 9 commits into from
Nov 13, 2024

Conversation

pr00se
Copy link
Contributor

@pr00se pr00se commented Nov 5, 2024

What this PR does

Change the implementation of the --initial-delay option for ingester circuit breakers as follows:

  • Start the activation delay with the first call to tryAcquirePermit() rather than the call to activate(). This lets us ignore the first initial_delay of traffic, regardless of how long after activation it arrives.
  • If activate() is called during the activation delay, don't schedule another delayed activation
  • If deactivate() is called during the activation delay, cancel any pending activation

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@pr00se pr00se requested a review from a team as a code owner November 5, 2024 23:39
@pr00se pr00se changed the title [ingeter] Delay push circuit breaker activation until after traffic arrives [ingester] Delay push circuit breaker activation until after traffic arrives Nov 5, 2024
@pr00se pr00se changed the title [ingester] Delay push circuit breaker activation until after traffic arrives [ingester] Optionally delay circuit breaker activation until after traffic arrives Nov 5, 2024
@pr00se pr00se changed the title [ingester] Optionally delay circuit breaker activation until after traffic arrives [ingester] Optionally delay circuit breaker activation until after traffic starts Nov 5, 2024
@pr00se pr00se requested a review from tacole02 as a code owner November 5, 2024 23:57
@pr00se pr00se marked this pull request as draft November 6, 2024 00:34
Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating these docs! I massaged the wording a little bit.

Comment on lines 1298 to 1300
# (experimental) How long after an initial request an activated circuit
# breaker should wait before becoming effectively active. During this time
# neither failures nor successes will not be counted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# (experimental) How long after an initial request an activated circuit
# breaker should wait before becoming effectively active. During this time
# neither failures nor successes will not be counted.
# (experimental) Duration, in seconds, after an initial request an activated circuit
# breaker should wait before becoming effectively active. During this time,
# neither failures nor successes are counted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 24a3b0d

# failures and successes will not be counted.
# (experimental) How long after an initial request an activated circuit
# breaker should wait before becoming effectively active. During this time
# neither failures nor successes will not be counted.
# CLI flag: -ingester.read-circuit-breaker.initial-delay
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion as above here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 24a3b0d

@pr00se pr00se marked this pull request as ready for review November 7, 2024 04:56
Copy link
Contributor

@duricanikolic duricanikolic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of nits


// cancel any pending activation
if cb.activationTimer != nil {
cb.activationTimer.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess here we rely on the fact that within time.AfterFunc() we actually set cb.activationTimer to nil.
Are we sure it can never happen that we reach this line while time.AfterFunc() is executing?

I personally prefer usage of time.NewTimer() because then time.Stop()'s result (true or false) clearly explains what happened. Here false means that the timer expired, and the function passed to time.AfterFun() was started in a separate go routine, but doesn't say anything about whether the function was completed or not.

Anyway, if we opt for this approach, since it is crucial to execute cb.activationTimer = nil, I would put a comment on both lines that execute it (251 and 284) to highlight that those lines must not be deleted or moved from those places, because that would break the logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are here we are holding stateMtx, which is also taken at the start of the execution of the AfterFunc(), so if we are here while that function is executing, it is waiting on the mutex. once it gets the mutex it will see that the state is not circuitBreakerActivating and exit.

i believe these are the possible situations:

  1. the timer hasn't yet expired, and we successfully Stop() it here -- the return value would be true
  2. the timer expired and the launched goroutine completed some time before deactivate() was called -- in this case the timer will be nil (because the timer function cleared it) and we won't call Stop() at all
  3. the timer expired and the launched goroutine is currently executing when deactivate() is called -- in this case deactivate() will block because the timer goroutine is holding stateMtx, and then we'll be in the same state as (2)
  4. the timer expired and launched the goroutine while deactivate() was executing -- in this case we would call Stop() and it would return false; the launched goroutine would be waiting on the stateMtx, and after deactivate() completes, would get it, and exit because the state is no longer circuitBreakerActivating

i don't believe the lines that set the timer to nil are strictly necessary. if we removed then, then we would just additionally call Stop() on an expired timer in cases (2) and (3) above.


i opted not to use time.NewTimer()to avoid launching a separate goroutine to poll the timer channel (and have to cancel when we Stop() the timer, since otherwise i think it would block forever)

@@ -4,6 +4,7 @@ package ingester

import (
"context"
"errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is feasible to add a test that would simulate what happens if we:

  • at timestamp ts: schedule a cb activation with delay 3*d
  • at timestamp ts + d: deactivate the cb
  • at timestamp ts + 2d: schedule another cb activation with delay 3*d
  • ensure that the cb is not active before timestamp ts + 5d?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in c67e4e3

Copy link
Contributor

@duricanikolic duricanikolic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thank you @pr00se

@pr00se pr00se merged commit c56a748 into main Nov 13, 2024
29 checks passed
@pr00se pr00se deleted the cb-initial-delay branch November 13, 2024 19:19
@grafanabot
Copy link
Contributor

The backport to r316 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-9842-to-r316 origin/r316
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x c56a74829dd8135a4b3a93ebf5f5222c4f1ac117
# Push it to GitHub
git push --set-upstream origin backport-9842-to-r316
git switch main
# Remove the local backport branch
git branch -D backport-9842-to-r316

Then, create a pull request where the base branch is r316 and the compare/head branch is backport-9842-to-r316.

pr00se added a commit that referenced this pull request Nov 13, 2024
…affic starts (#9842)

* Delay push circuit breaker activation until after traffic arrives

* make reference-help doc

* Docs changes

* Fix nil pointer panics

* Update CHANGELOG

* Make activating trinary to fix bug with requests while deactivated queueing an activation

* Use a mutex to synchronize breaker state traansitions

* Remove redundant checks in tests

* Add test

(cherry picked from commit c56a748)
@pr00se pr00se mentioned this pull request Nov 13, 2024
pr00se added a commit that referenced this pull request Nov 13, 2024
…affic starts (#9842) (#9895)

* Delay push circuit breaker activation until after traffic arrives

* make reference-help doc

* Docs changes

* Fix nil pointer panics

* Update CHANGELOG

* Make activating trinary to fix bug with requests while deactivated queueing an activation

* Use a mutex to synchronize breaker state traansitions

* Remove redundant checks in tests

* Add test

(cherry picked from commit c56a748)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants