
Smooth out spikes in rate of chunk flush ops #3191

Merged
bboreham merged 4 commits into master from spread-flushes on Sep 30, 2020

Conversation

bboreham
Contributor

Ingester chunk flushes run periodically, by default every minute.
Add a rate-limiter so we spread out calls to the DB across the period, avoiding spikes of intense activity which can slow down other operations such as incoming Push() calls.

Fixes #3171
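
For illustration only (this is not the PR's code, and the constants and names are made up): a shared token-bucket limiter from golang.org/x/time/rate can spread a burst of concurrent flushes evenly across a period instead of letting them all hit the store at once.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const ops = 20
	period := 2 * time.Second

	// Allow roughly `ops` flushes per `period`, i.e. one every period/ops,
	// rather than all of them at the same instant.
	limiter := rate.NewLimiter(rate.Limit(float64(ops)/period.Seconds()), 1)

	var wg sync.WaitGroup
	start := time.Now()
	for n := 0; n < ops; n++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			_ = limiter.Wait(context.Background()) // each "flush" waits its turn
			fmt.Printf("flush %2d at %v\n", n, time.Since(start).Round(10*time.Millisecond))
		}(n)
	}
	wg.Wait()
}
```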

Checklist

  • NA Tests updated
  • NA Documentation added
  • CHANGELOG.md updated

@storyicon
Contributor

LGTM, this effectively evens out the pressure on the DB.

@bboreham
Contributor Author

TestIngesterSpreadFlush() is failing consistently in CI, though not on my local machine.
I will try to figure that out soon.

@pstibrany
Contributor

I think this interacts badly with flush on shutdown or via /flush handler. In these cases we want to flush as fast as possible.

Perhaps the call to i.flushRateLimiter.Wait should be done after dequeuing, and only if op.immediate is false? WDYT?

(The alternative, setting an infinite rate for a sweep with the immediate flag, wouldn't work, as the rate would be recomputed on the next call to sweepUsers.)

@bboreham
Contributor Author

bboreham commented Sep 28, 2020

As I first wrote it (wait before dequeuing), there were some surprising behaviours.
For instance, if things were quiet, 50 goroutines would pass the rate-limiter wait and all block on the queue. Then, when series are added to the queue, they are picked up instantly and sent to the store; each goroutine then goes back round its loop and quite likely doesn't block on the rate-limiter again, because as far as it is concerned nobody did anything for a while. So we still get the "spiky" behaviour I was trying to avoid.

Perhaps the call to i.flushRateLimiter.Wait should be done after dequeuing

This cures the above scenario. I initially found it odd to think we would pull a series from the queue and then wait with it, but I don't think this does any harm: we always re-examine the series before sending to the DB, so we won't double-flush it.

and only if op.immediate is false?

Excellent idea.
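
For illustration, the ordering agreed on above might look roughly like this; flushOp, flushQueue, flushSeries and the other names here are assumptions for the sketch, not the actual Cortex code.

```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

// Hypothetical shapes, just to show the ordering; the real types differ.
type flushOp struct {
	immediate bool
	userID    string
}

type ingester struct {
	flushRateLimiter *rate.Limiter
	flushQueue       chan *flushOp
}

func (i *ingester) flushLoop(ctx context.Context) {
	for op := range i.flushQueue {
		// Dequeue first, then wait: a goroutine only pays the rate-limiter
		// cost when it actually has work, avoiding the "spiky" pattern
		// described above.
		//
		// Skip the wait entirely for immediate flushes (shutdown or the
		// /flush handler), which should run as fast as possible.
		if !op.immediate {
			_ = i.flushRateLimiter.Wait(ctx)
		}
		i.flushSeries(op)
	}
}

func (i *ingester) flushSeries(op *flushOp) {
	// As noted above, the series is re-examined before anything is sent to
	// the store, so holding a dequeued op during Wait can't double-flush it.
}
```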

@pracucci requested review from pstibrany and pracucci and removed request for pstibrany on September 29, 2020 09:40
Contributor

@pstibrany left a comment


LGTM. Left some non-blocking comments.

pkg/ingester/ingester.go (review comment resolved, now outdated)
```diff
@@ -424,7 +424,7 @@ func TestIngesterSpreadFlush(t *testing.T) {
 	_, _ = pushTestSamples(t, ing, 4, 1, int(cfg.MaxChunkAge.Seconds()-1)*1000)
 
 	// wait beyond flush time so first set of samples should be sent to store
-	time.Sleep(cfg.FlushCheckPeriod * 2)
+	time.Sleep(cfg.FlushCheckPeriod * 4)
```
Contributor

I don't quite understand why this was needed.

Contributor Author

Because CI was failing: not all chunks had been flushed by the time it checked. I can see that this PR makes the flush take longer, though the queue should still clear within one flush period. On my laptop the failure was intermittent; after several hours of probing I found that Go timers just aren't that reliable - I would see the 20ms timer fire much later, sometimes by 100ms (and if that happens, the test still fails even with this change). Perhaps related to golang/go#38860, though I can't see what would make any goroutines CPU-bound at the point we do this Sleep().

I could take that change back out and see how we get on.
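
As an aside, the timer overshoot described above is easy to observe with a small standalone program (not part of this PR; the run count and sleep duration are arbitrary):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Ask for a 20ms sleep a few times and see how late the timer fires.
	// On a busy machine the overshoot can reach tens of milliseconds.
	for i := 0; i < 5; i++ {
		start := time.Now()
		time.Sleep(20 * time.Millisecond)
		overshoot := time.Since(start) - 20*time.Millisecond
		fmt.Printf("run %d: overshoot %v\n", i, overshoot)
	}
}
```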

Contributor

I could take that change back out and see how we get on.

No need for me. Thanks for the explanation.

Ingester chunk flushes run periodically, by default every minute.
Add a rate-limiter so we spread out calls to the DB across the period,
avoiding spikes of intense activity which can slow down other operations
such as incoming Push() calls.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This means we can check whether it's an immediate flush, and it also behaves better
when transitioning from a fast to a slow rate, or vice versa.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
So when you have a short flush period, it will go faster.

Also initialize the rate-limit to "Inf" or no limit, so we start out
fast and slow down once we know what the queue is like.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
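
A rough, hypothetical illustration of that commit's idea, assuming golang.org/x/time/rate; the helper name and signature are made up for the sketch:

```go
package sketch

import (
	"time"

	"golang.org/x/time/rate"
)

// Start with no limit at all: flush at full speed until the first sweep
// has told us how much work is actually queued.
var flushRateLimiter = rate.NewLimiter(rate.Inf, 1)

// Hypothetical hook called from each periodic sweep: spread the queued
// flushes across the flush period, so a shorter period simply means a
// higher rate.
func updateFlushRate(queuedFlushes int, flushCheckPeriod time.Duration) {
	if queuedFlushes == 0 || flushCheckPeriod <= 0 {
		return
	}
	flushRateLimiter.SetLimit(rate.Limit(float64(queuedFlushes) / flushCheckPeriod.Seconds()))
}
```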
@bboreham
Contributor Author

I have rebased to get rid of the merge conflict on CHANGELOG, fixed the 'DB' comment, and dropped the change to wait longer in the test. Let's see if it passes...

@bboreham merged commit 56aa40c into master on Sep 30, 2020
@bboreham deleted the spread-flushes branch on September 30, 2020 10:02

Successfully merging this pull request may close these issues.

Slow ingester Push operations
3 participants