feat: New bloom planning using chunk size TSDB stats #14547
Conversation
Force-pushed f10d775 to 8c564fe, then 8c564fe to 20026e6, then 20026e6 to 8360fdc.
docs/sources/shared/configuration.md (Outdated)
@@ -3775,6 +3775,10 @@ shard_streams:
 # CLI flag: -bloom-build.split-keyspace-by
 [bloom_split_series_keyspace_by: <int> | default = 256]

+# Experimental. Target chunk size in bytes for bloom tasks. Default is 100GB.
+# CLI flag: -bloom-build.target-chunk-size
+[bloom_task_target_chunk_size: <int> | default = 100GB]
Not sure what a good default number would be here. Maybe 100GB is too much. What about 20GB? Wdyt?
What's the impact of setting this too low/too high?
too low --> too many tasks
too high --> a few huge tasks
But it wouldn't change the size of bloom blocks? If that's the case, overpartitioning seems better than underpartitioning; 20GB seems fine and we can continue to adjust over time.
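To make the tradeoff concrete: the planner ends up with roughly (total chunk bytes for the tenant) divided by (target size) tasks, so a lower target means more, smaller tasks. A quick sketch with made-up numbers (the 500GB tenant size is purely illustrative):

```go
package main

import "fmt"

func main() {
	const gb = uint64(1) << 30
	totalChunkBytes := 500 * gb // hypothetical tenant with 500GB of chunk data

	for _, target := range []uint64{20 * gb, 100 * gb} {
		// Ceiling division: the planner cannot emit fractional tasks.
		tasks := (totalChunkBytes + target - 1) / target
		fmt.Printf("target %3dGB -> ~%d tasks\n", target/gb, tasks)
	}
	// prints:
	// target  20GB -> ~25 tasks
	// target 100GB -> ~5 tasks
}
```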
docs/sources/shared/configuration.md (Outdated)
@@ -3765,7 +3765,7 @@ shard_streams:
 [bloom_creation_enabled: <boolean> | default = false]

 # Experimental. Bloom planning strategy to use in bloom creation. Can be one of:
-# 'split_keyspace_by_factor'
+# 'split_keyspace_by_factor', 'split_by_chunk_size'
I'm not entirely happy with this name. Some other options I thought of were:
- split_by_tsdb_chunks_size_stats (this is probably my favorite)
- split_by_chunks_size_stats
- split_by_series_chunks_size
- split_by_series_size
Preferences? Ideas?
I like split_by_tsdb_chunks_size to make it clear which chunks we're talking about; split_by_tsdb_chunks_size_stats is fine too, though I'm not sure the _stats bit adds any more information to make it clearer.
I would prefer split_by_series_size or split_by_stream_size
> I would prefer split_by_series_size or split_by_stream_size

I would like to specifically have "chunks" somewhere in the name, given that in the future we may look at other stats besides the chunk sizing. But I may be overthinking it 😅
I'm going with split_by_series_chunks_size
LGTM, but I don't feel confident enough in my knowledge of the builder to know whether I missed anything here. A review from @chaudum would also be helpful.
series := sizedIter.At()
if series.Len() == 0 {
	// This should never happen, but just in case.
	level.Error(s.logger).Log("msg", "got empty series batch")
s/error/warn? I'm not sure if this is actionable by the operator
(Also, should we include extra identifying information about the series here for debugging in case this ever does hit?)
Changed to a warn.

> should we include extra identifying information about the series

We have no series information at this level, only the TSDB name. Added it to the log line.
TSDB   tsdb.SingleTenantTSDBIdentifier
FP     model.Fingerprint
Chunks []index.ChunkMeta
fields can be package private
Done
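For context on the review comment above, a minimal sketch of what "package private" means in Go: lower-casing the field names restricts access to the declaring package. The struct name and placeholder types below are illustrative stand-ins, not the actual Loki types:

```go
package main

import "fmt"

// Placeholder types standing in for tsdb.SingleTenantTSDBIdentifier,
// model.Fingerprint, and index.ChunkMeta from the Loki codebase.
type tsdbIdentifier string
type fingerprint uint64
type chunkMeta struct{ KB uint32 }

// With lower-case (unexported) field names, these fields are only
// accessible from code in the same package as the struct declaration.
type seriesMeta struct {
	tsdb   tsdbIdentifier
	fp     fingerprint
	chunks []chunkMeta
}

func main() {
	m := seriesMeta{tsdb: "index-19700-1", fp: 42, chunks: []chunkMeta{{KB: 1024}}}
	fmt.Println(m.fp, len(m.chunks)) // same-package access works: prints "42 1"
}
```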
docs/sources/shared/configuration.md (Outdated)
@@ -3765,7 +3765,7 @@ shard_streams:
 [bloom_creation_enabled: <boolean> | default = false]

 # Experimental. Bloom planning strategy to use in bloom creation. Can be one of:
-# 'split_keyspace_by_factor'
+# 'split_keyspace_by_factor', 'split_by_chunk_size'
I would prefer split_by_series_size or split_by_stream_size
What this PR does / why we need it:

This PR adds a new planning strategy to the bloom builder: split_by_chunk_size. This strategy builds tasks by looking at the TSDB stats. We add a new configurable setting, bloom_task_target_chunk_size, where we configure the target chunk size of each task. We keep adding series to a task until the accumulated size of their chunks exceeds the target size.

Special notes for your reviewer:
Checklist
- Reviewed the CONTRIBUTING.md guide (required)
- feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
- If the change is deprecating or deleting a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR
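The accumulation loop outlined in the PR summary can be sketched as follows. This is a simplified illustration, not Loki's actual code: the type and function names are made up, and per-series chunk sizes are assumed to come from TSDB stats.

```go
package main

import "fmt"

// seriesSize pairs a series fingerprint with the total size in bytes of its
// chunks, as would be reported by TSDB stats. Illustrative, not Loki's types.
type seriesSize struct {
	Fingerprint uint64
	ChunkBytes  uint64
}

// planTasks groups consecutive series into tasks, closing a task once the
// accumulated chunk size reaches the target. A sketch of the strategy the
// PR describes, not the real implementation.
func planTasks(series []seriesSize, targetBytes uint64) [][]seriesSize {
	var tasks [][]seriesSize
	var current []seriesSize
	var currentBytes uint64

	for _, s := range series {
		current = append(current, s)
		currentBytes += s.ChunkBytes
		if currentBytes >= targetBytes {
			tasks = append(tasks, current)
			current = nil
			currentBytes = 0
		}
	}
	// Flush the trailing partial task, if any.
	if len(current) > 0 {
		tasks = append(tasks, current)
	}
	return tasks
}

func main() {
	series := []seriesSize{
		{Fingerprint: 1, ChunkBytes: 8 << 30},  // 8GB
		{Fingerprint: 2, ChunkBytes: 15 << 30}, // 15GB
		{Fingerprint: 3, ChunkBytes: 4 << 30},  // 4GB
		{Fingerprint: 4, ChunkBytes: 30 << 30}, // 30GB
	}
	tasks := planTasks(series, 20<<30) // 20GB target
	for i, t := range tasks {
		fmt.Printf("task %d: %d series\n", i, len(t))
	}
	// prints:
	// task 0: 2 series
	// task 1: 2 series
}
```

Note that with this scheme a lower target only increases the number of tasks; a single series larger than the target still becomes one task on its own.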