
Smooth out spikes in rate of chunk flush ops #3191

Merged
bboreham merged 4 commits into master from spread-flushes on Sep 30, 2020

Conversation

bboreham
Contributor

Ingester chunk flushes run periodically, by default every minute.
Add a rate-limiter so we spread out calls to the DB across the period, avoiding spikes of intense activity which can slow down other operations such as incoming Push() calls.

Fixes #3171
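
For illustration only (this is not the PR's code, and the constants and names are made up): a shared token-bucket limiter from golang.org/x/time/rate can spread a burst of concurrent flushes evenly across a period instead of letting them all hit the store at once.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const ops = 20
	period := 2 * time.Second

	// Allow roughly `ops` flushes per `period`, i.e. one every period/ops,
	// rather than all of them at the same instant.
	limiter := rate.NewLimiter(rate.Limit(float64(ops)/period.Seconds()), 1)

	var wg sync.WaitGroup
	start := time.Now()
	for n := 0; n < ops; n++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			_ = limiter.Wait(context.Background()) // each "flush" waits its turn
			fmt.Printf("flush %2d at %v\n", n, time.Since(start).Round(10*time.Millisecond))
		}(n)
	}
	wg.Wait()
}
```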

Checklist

  • NA Tests updated
  • NA Documentation added
  • CHANGELOG.md updated

@storyicon
Contributor

LGTM, this effectively evens out the pressure on the DB.

@bboreham
Contributor Author

TestIngesterSpreadFlush() is failing consistently in CI, though not on my local machine.
I will try to figure that out soon.

@pstibrany
Contributor

I think this interacts badly with flush on shutdown or via /flush handler. In these cases we want to flush as fast as possible.

Perhaps the call to i.flushRateLimiter.Wait should be done after dequeuing, and only if op.immediate is false? WDYT?

(The alternative, setting an infinite rate for a sweep with the immediate flag, wouldn't work, as the rate would be recomputed on the next call to sweepUsers.)

@bboreham
Contributor Author

bboreham commented Sep 28, 2020

As I first wrote it (wait before dequeuing), there were some surprising behaviours.
For instance, if things were quiet, 50 goroutines would pass the rate-limiter wait and all block on the queue. Then, when series are added to the queue, they are picked up instantly and sent to the store; each goroutine then goes back round its loop and quite likely doesn't block on the rate-limiter again, because as far as it is concerned nobody did anything for a while. So we still get the "spiky" behaviour I was trying to avoid.

Perhaps the call to i.flushRateLimiter.Wait should be done after dequeuing

This cures the above scenario. I initially found it odd to think we would pull a series from the queue and then wait with it, but I don't think this does any harm: we always re-examine the series before sending to the DB, so we won't double-flush it.

and only if op.immediate is false?

Excellent idea.
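
For illustration, the ordering agreed on above might look roughly like this; flushOp, flushQueue, flushSeries and the other names here are assumptions for the sketch, not the actual Cortex code.

```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

// Hypothetical shapes, just to show the ordering; the real types differ.
type flushOp struct {
	immediate bool
	userID    string
}

type ingester struct {
	flushRateLimiter *rate.Limiter
	flushQueue       chan *flushOp
}

func (i *ingester) flushLoop(ctx context.Context) {
	for op := range i.flushQueue {
		// Dequeue first, then wait: a goroutine only pays the rate-limiter
		// cost when it actually has work, avoiding the "spiky" pattern
		// described above.
		//
		// Skip the wait entirely for immediate flushes (shutdown or the
		// /flush handler), which should run as fast as possible.
		if !op.immediate {
			_ = i.flushRateLimiter.Wait(ctx)
		}
		i.flushSeries(op)
	}
}

func (i *ingester) flushSeries(op *flushOp) {
	// As noted above, the series is re-examined before anything is sent to
	// the store, so holding a dequeued op during Wait can't double-flush it.
}
```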

@pracucci requested review from pstibrany and pracucci and removed request for pstibrany on September 29, 2020 09:40
Contributor

@pstibrany left a comment


LGTM. Left some non-blocking comments.

pkg/ingester/ingester.go (review comment resolved, now outdated)
```diff
@@ -424,7 +424,7 @@ func TestIngesterSpreadFlush(t *testing.T) {
 	_, _ = pushTestSamples(t, ing, 4, 1, int(cfg.MaxChunkAge.Seconds()-1)*1000)
 
 	// wait beyond flush time so first set of samples should be sent to store
-	time.Sleep(cfg.FlushCheckPeriod * 2)
+	time.Sleep(cfg.FlushCheckPeriod * 4)
```
Contributor

I don't quite understand why this was needed.

Contributor Author

Because CI was failing: not all chunks had been flushed by the time it checked. I can see that this PR makes the flush take longer, though the queue should still clear within one flush period. On my laptop the failure was intermittent; after several hours of probing I found that Go timers just aren't that reliable - I would see the 20ms timer fire much later, sometimes by 100ms (and if that happens, the test still fails even with this change). Perhaps related to golang/go#38860, though I can't see what would make any goroutines CPU-bound at the point we do this Sleep().

I could take that change back out and see how we get on.
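
As an aside, the timer overshoot described above is easy to observe with a small standalone program (not part of this PR; the run count and sleep duration are arbitrary):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Ask for a 20ms sleep a few times and see how late the timer fires.
	// On a busy machine the overshoot can reach tens of milliseconds.
	for i := 0; i < 5; i++ {
		start := time.Now()
		time.Sleep(20 * time.Millisecond)
		overshoot := time.Since(start) - 20*time.Millisecond
		fmt.Printf("run %d: overshoot %v\n", i, overshoot)
	}
}
```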

Contributor

I could take that change back out and see how we get on.

No need for me. Thanks for the explanation.

Ingester chunk flushes run periodically, by default every minute.
Add a rate-limiter so we spread out calls to the DB across the period,
avoiding spikes of intense activity which can slow down other operations
such as incoming Push() calls.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This means we can check whether it's an immediate flush, and it also behaves better
when transitioning from a fast to a slow rate, or vice versa.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
So when you have a short flush period, it will go faster.

Also initialize the rate-limit to "Inf" or no limit, so we start out
fast and slow down once we know what the queue is like.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
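
A rough, hypothetical illustration of that commit's idea, assuming golang.org/x/time/rate; the helper name and signature are made up for the sketch:

```go
package sketch

import (
	"time"

	"golang.org/x/time/rate"
)

// Start with no limit at all: flush at full speed until the first sweep
// has told us how much work is actually queued.
var flushRateLimiter = rate.NewLimiter(rate.Inf, 1)

// Hypothetical hook called from each periodic sweep: spread the queued
// flushes across the flush period, so a shorter period simply means a
// higher rate.
func updateFlushRate(queuedFlushes int, flushCheckPeriod time.Duration) {
	if queuedFlushes == 0 || flushCheckPeriod <= 0 {
		return
	}
	flushRateLimiter.SetLimit(rate.Limit(float64(queuedFlushes) / flushCheckPeriod.Seconds()))
}
```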
@bboreham
Contributor Author

I have rebased to get rid of the merge conflict on CHANGELOG, fixed the 'DB' comment, and dropped the change to wait longer in the test. Let's see if it passes...

@bboreham merged commit 56aa40c into master on Sep 30, 2020
@bboreham deleted the spread-flushes branch on September 30, 2020 10:02

Successfully merging this pull request may close these issues.

Slow ingester Push operations
3 participants