MergeIterator: allocate less memory at first #4341
Conversation
chunk_test: fix inaccurate end time on chunks
The `through` time is supposed to be the last time in the chunk, and having it one step higher was throwing off other tests and benchmarks. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

MergeIterator benchmark: add more realistic sizes
At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks is about three weeks, a highly unrepresentative test (see the arithmetic sketch below). Instant queries, such as those done by the ruler, will only fetch one chunk from each ingester. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

MergeIterator: allocate less memory at first
We were allocating 24x as many batches as there are streams, where each batch holds up to 12 samples. By allowing `c.batches` to reallocate when needed, we avoid having to pre-allocate enough memory for all possible scenarios. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
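The numbers in the benchmark commit above work out as follows. A quick arithmetic sketch; the 120-samples-per-chunk figure is implied by the quoted 30-minute span at 15-second scrapes, not stated in the PR itself:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	scrapeInterval := 15 * time.Second
	samplesPerChunk := 120 // implied by a 30-minute chunk at 15s scrapes

	chunkSpan := time.Duration(samplesPerChunk) * scrapeInterval
	fmt.Println(chunkSpan) // 30m0s

	total := 1000 * chunkSpan
	fmt.Println(total.Hours() / 24) // ≈ 20.8 days, about three weeks
}
```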
Force-pushed from 6119b68 to b32d886
@@ -112,8 +112,7 @@ func (c *mergeIterator) buildNextBatch(size int) bool {
 	for len(c.h) > 0 && (len(c.batches) == 0 || c.nextBatchEndTime() >= c.h[0].AtTime()) {
 		c.nextBatchBuf[0] = c.h[0].Batch()
 		c.batchesBuf = mergeStreams(c.batches, c.nextBatchBuf[:], c.batchesBuf, size)
-		copy(c.batches[:len(c.batchesBuf)], c.batchesBuf)
-		c.batches = c.batches[:len(c.batchesBuf)]
This is a no-op, right? Did it impact performance?
My guess about this change (but Bryan can confirm or negate) is that we had to do it because `c.batches` may need to grow after the change in `newMergeIterator()`. @bboreham is my understanding correct?
Yes, the `append` will grow the slice if required, whereas the `copy` will panic. `TestMergeIter/DoubleDelta` fails if you don't make this change.
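To make the semantics concrete, here is a minimal standalone sketch, not the Cortex code; presumably the replacement in the actual change is something like `c.batches = append(c.batches[:0], c.batchesBuf...)`, though the added line isn't shown in the hunk above:

```go
package main

import "fmt"

func main() {
	buf := []int{1, 2, 3, 4} // stands in for the merged c.batchesBuf
	dst := make([]int, 0, 2) // stands in for c.batches, pre-allocated too small

	// Old pattern: re-slicing dst beyond its capacity panics before copy
	// even runs, because len(buf) > cap(dst):
	//   copy(dst[:len(buf)], buf) // panic: slice bounds out of range

	// New pattern: append reallocates the backing array as needed.
	dst = append(dst[:0], buf...)
	fmt.Println(dst) // [1 2 3 4]
}
```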
-	batches:    make(batchStream, 0, len(its)*2*promchunk.BatchSize),
-	batchesBuf: make(batchStream, len(its)*2*promchunk.BatchSize),
+	batches:    make(batchStream, 0, len(its)),
+	batchesBuf: make(batchStream, len(its)),
I don't recall exactly why the pre-allocation was so big - wondering if you know why?
I can't think of a reason why this would affect correctness either, and the perf results speak for themselves...
From the correctness perspective, this change should be fine. `batchesBuf` looks to be written only by `mergeStreams()`, which extends the slice if required.
I don't see any reason not to measure the impact in prod 🎉 We can merge and deploy to measure the impact on both queries and rules. Worst case scenario, rolling back this change is trivial.
* MergeIterator: allocate less memory at first
We were allocating 24x as many batches as there are streams, where each batch holds up to 12 samples. By allowing `c.batches` to reallocate when needed, we avoid having to pre-allocate enough memory for all possible scenarios.
* chunk_test: fix inaccurate end time on chunks
The `through` time is supposed to be the last time in the chunk, and having it one step higher was throwing off other tests and benchmarks.
* MergeIterator benchmark: add more realistic sizes
At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks is about three weeks, a highly unrepresentative test. Instant queries, such as those done by the ruler, will only fetch one chunk from each ingester.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
What this PR does:
We were allocating 24x as many batches as there are streams, where each batch holds up to 12 samples. By allowing `c.batches` to reallocate when needed, we avoid the need to pre-allocate enough memory for all possible scenarios.

Also fix inaccurate end time on chunks test data, which was throwing off the benchmark, and add more realistic test sizes: at 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks is about three weeks, a highly unrepresentative test.
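For scale, a rough sketch of the before/after capacities. This is illustrative only, using stand-in types; the real ones live in Cortex's `promchunk` package, where `BatchSize` is 12:

```go
package main

import "fmt"

const batchSize = 12 // promchunk.BatchSize: up to 12 samples per batch

type batch struct{ timestamps [batchSize]int64 } // stand-in for promchunk.Batch
type batchStream []batch

func main() {
	numStreams := 100 // e.g. one iterator per chunk stream

	// Before: room for 2*batchSize = 24 batches per stream, up front.
	before := make(batchStream, 0, numStreams*2*batchSize)

	// After: one slot per stream; append reallocates only if needed.
	after := make(batchStream, 0, numStreams)

	fmt.Println(cap(before), cap(after)) // 2400 100
}
```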
Which issue(s) this PR fixes:
Fixes #1195
Benchmarks
Checklist
`CHANGELOG.md` updated