Store-gateway: inefficient chunks fetching and caching #3939
Chunk size frequency from a random 24h block from our monitoring cluster (30 segment files out of 249 in the block, around 95M chunks):
In an upcoming PR I'll be using the actual chunk length when fetching chunks from the bucket. For a series we can calculate the size of each chunk as the difference between the chunk refs of two consecutive chunks. This works on the assumption that chunks in the segment files are ordered by the series to which they belong. And this works for all chunks but the last chunk of a series: for the last chunk we still have to estimate its length. (Technically we could look at the chunk ref of the first chunk of the next series, but that's difficult right now.)

__A naive estimation__

For this estimation I initially tried 2x the size of the largest chunk whose size we know for certain. This didn't work for cases where we have very few chunks (e.g. with high churn we might have one or two chunks per series) or when the chunks were sparse and covered very different time ranges (30s vs 30m). Using an estimation of 2x caused an underestimation for 0.6% of series in some requests. This increased the number of requests to the object store by 25% (since we had to refetch those series individually and couldn't group them in the same request with the help of the partitioner). Latency was also negatively affected.

__Underestimation rate, object store request rate, and latency__

zone-a was using these estimations and my future changes; zones b and c were using the `main` implementation.

__Last chunk size analysis__

This prompted me to find out what would be a better estimation for chunk sizes. What I wanted to find out was: given the size of the max chunk of a series (excluding the last chunk), how big will the last chunk be? Using the block index I ran a similar analysis to Peter's on the size of chunks in a few blocks. I bucketed the size of the max chunk into powers of two and calculated percentiles for the size of the last chunk (in my analysis I almost always knew the size of the last chunk).

TL;DR: There is a lot more variability in last chunk size when the max known chunk is small (< 256 B). For series where the max chunk size was big (≥ 4096 B), variability is lower and the last chunk was always below 2500 B.

__Results__

I analysed the chunks of 63M series from 6 blocks from different tenants. I wrote this tool to read and parse the index from the bucket and then wrote a small Go program to parse the output, bucket the results, and calculate percentiles.
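Here's a minimal Go sketch of the chunk-length calculation described at the top of this comment, assuming all refs point into the same segment file and chunks are laid out in ref order; the names and the 16000 B fallback are illustrative, not the actual Mimir code.

```go
// Sketch: derive per-chunk byte lengths from the refs of consecutive chunks
// of a series, falling back to an estimate for the last chunk.
package main

import "fmt"

const lastChunkEstimate = 16000 // placeholder worst-case size used when the real length is unknown

// chunkLengths returns the byte length to fetch for each chunk of a series.
// offsets[i] is the start offset of chunk i within its segment file.
func chunkLengths(offsets []uint32) []uint32 {
	lengths := make([]uint32, len(offsets))
	for i := 0; i+1 < len(offsets); i++ {
		// The next chunk starts where this one ends, so the difference between
		// consecutive refs is an upper bound on the length of chunk i
		// (it also includes any encoding overhead).
		lengths[i] = offsets[i+1] - offsets[i]
	}
	if n := len(offsets); n > 0 {
		// The last chunk of the series has no successor here, so its length
		// has to be estimated (or read from the next series' first ref).
		lengths[n-1] = lastChunkEstimate
	}
	return lengths
}

func main() {
	fmt.Println(chunkLengths([]uint32{100, 350, 900, 1200})) // [250 550 300 16000]
}
```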
__Making an estimation__

How often this estimation can be below the actual size should be based on our cache hit rate, since we wouldn't use the estimation for already cached chunks. In production at Grafana Labs we usually have a high hit rate for the chunks cache (96.3% in the last 30 days over all Mimir clusters). If we want to have to refetch 1 batch in 10, then with a hit rate of 96.3%, we can afford to underfetch (and refetch) the last chunk of 0.054% of series. The following estimations would satisfy this for each max chunk size bucket:
These estimations would cover 99.95063% of series.

__Ratio vs chunk size__

I'm not sure whether to do an estimation with a fixed size for each bucket or use a multiplier of the actual max chunk size on a per-series basis. The worst case is worse when using a ratio, but I'm not sure about the average case. I'm tempted to go with the fixed size since it looks easier.
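To illustrate the fixed-size-per-bucket option, here's a minimal Go sketch. The bucket boundaries and estimate values below are placeholders rather than the numbers from the analysis (only the ~2500 B observation for large chunks comes from it).

```go
// Sketch of the "fixed size per bucket" option: bucket the largest known chunk
// of a series into size classes and map each class to a fixed estimate for the
// last chunk. Values below are placeholders, not the ones from the analysis.
package main

import "fmt"

// estimateLastChunkLen picks an estimate for the last chunk of a series given
// the size of its largest chunk whose length is known exactly.
func estimateLastChunkLen(maxKnownChunkLen int) int {
	switch {
	case maxKnownChunkLen < 256:
		return 8192 // small known chunks -> high variability, be generous
	case maxKnownChunkLen < 4096:
		return 4096
	default:
		return 2560 // large known chunks -> last chunk was always below ~2500 B
	}
}

func main() {
	for _, max := range []int{120, 1500, 6000} {
		fmt.Printf("max known %5d B -> estimate %5d B\n", max, estimateLastChunkLen(max))
	}
}
```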
Here's a long-overdue update: we ran the fine-grained chunks changes at Grafana Labs for some time in February and March, and more recently trialled them in a few clusters. Unfortunately, we didn't see significant improvements in most Mimir clusters. In most cases the feature leads to a single-digit percentage decrease in in-use heap at the cost of 20-50% more object store operations and 10-20% higher latency. There was one cluster where the decrease in heap was ~25%, but that wasn't enough to justify keeping the feature. So we decided to start removing it.
I analysed the bytes touched vs fetched in some of our production Mimir clusters over the last 24h. I used the metrics `cortex_bucket_store_series_data_size_touched_bytes_sum` and `cortex_bucket_store_series_data_size_fetched_bytes_sum`, and got the following (numbers are GBs):

Ideally, we would expect Fetched to be less than Touched, because some data will be fetched from the cache (series and postings), and similarly for chunks. That's true for "series" and "postings", but not for "chunks": we touched 58TB of chunks, but fetched 504TB (9x more).
Why the Touched vs Fetched discrepancy?
First of all, we need to understand how the metric is tracked for chunks:
Why do we fetch 9x more than the actual chunk sizes? This is an effect of two different implementation details:

1. We don't know the actual length of each chunk, so the ranges to fetch are computed assuming the worst case of 16000 bytes per chunk.
2. The partitioner merges nearby ranges into bigger ones, fetching the gaps between them as well.
Why does this affect the cache too?
Chunks caching is implemented as a wrapper around the object storage client. We don't cache individual chunks, but portions of the objects containing chunks (the "TSDB segment files"). Each cached portion is 16KB.

The current logic is as follows: the partitioner first computes the byte ranges to fetch, and each resulting `GetRange()` call then goes through the caching layer, which splits the range into 16KB portions, looks each of them up in memcached, and fetches only the missing ones from the object storage.

Since the cache lookup happens after the partitioner, we're reading from memcached even ranges we don't actually need, and that data is then discarded.
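As a rough sketch of the subrange logic described above (names and details are illustrative, not the actual caching-bucket code), assuming portions are aligned to multiples of 16KB within the segment file:

```go
// Sketch of how a GetRange request maps onto 16KB-aligned cached portions:
// every portion overlapping the requested range is looked up in the cache,
// which is why bytes outside the requested range can be read and discarded.
package main

import "fmt"

const subrangeSize = 16 * 1024 // 16KB cached portions of a segment file

// subranges returns the aligned [start, end) portions covering the requested
// byte range [off, off+length) of an object.
func subranges(off, length int64) [][2]int64 {
	start := (off / subrangeSize) * subrangeSize                              // align down
	end := ((off + length + subrangeSize - 1) / subrangeSize) * subrangeSize // align up
	var out [][2]int64
	for s := start; s < end; s += subrangeSize {
		out = append(out, [2]int64{s, s + subrangeSize})
	}
	return out
}

func main() {
	// A 200-byte chunk read still touches a full 16KB cached portion.
	fmt.Println(subranges(40000, 200)) // [[32768 49152]]
}
```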
How much does the partitioner over-read?
Querying the partitioner metrics, we can see how much over-reading it does:
We requested 323TB and the partitioner expanded it to 521TB. Looking at this data, the partitioner is over-reading by a factor of 1.6x which is not that bad.
Why is the partitioner effect 1.6x, but fetched vs touched bytes 9x?
My theory is that we compute the initial chunk ranges based on the worst case of 16000 bytes per chunk (since we don't know the actual chunk length).
We know that on average a sample is 1.4 bytes and a chunk is on average 120 samples, so `120 * 1.4 = 168 bytes`, which is far from the worst-case scenario we consider.

We also know that the average scrape interval across all our customers is 30s. Assuming chunks are consecutive in the segment file and we query 24h blocks (the largest block compacted by Mimir), all chunks of a series are in a range of `(86400 / 30) * 1.4 = ~4KB`, which is still 4x smaller than the minimum range of 16KB we fetch from the cache.

Summing together the two issues, and the fact that cached portions are also aligned to 16KB, the math gets quite close to a 9x over-reading inefficiency.
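As a rough sanity check on that 9x figure, here is a small back-of-the-envelope calculation in Go using only the numbers quoted in this comment; the breakdown into factors is an interpretation of those numbers, not a direct measurement.

```go
// Back-of-the-envelope: combine the figures quoted in this comment to see how
// close they get to the observed 9x fetched-vs-touched ratio.
package main

import "fmt"

func main() {
	const (
		touchedTB   = 58.0  // chunk bytes actually used
		requestedTB = 323.0 // ranges computed from the 16000 B worst-case estimate
		expandedTB  = 521.0 // ranges after partitioner expansion
		fetchedTB   = 504.0 // chunk bytes fetched from cache/object store
	)
	fmt.Printf("worst-case estimate factor: %.1fx\n", requestedTB/touchedTB)  // ~5.6x
	fmt.Printf("partitioner expansion:      %.1fx\n", expandedTB/requestedTB) // ~1.6x
	fmt.Printf("combined:                   %.1fx\n", expandedTB/touchedTB)   // ~9x
	fmt.Printf("observed fetched/touched:   %.1fx\n", fetchedTB/touchedTB)    // ~8.7x
}
```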
Reference queries
GBs touched:
GBs fetched:
Partitioner's requested bytes total:
Partitioner's expanded bytes total: