Add shuffle sharding grouper/planner #4357
I wanted to discuss a change in block compaction behavior that this PR would introduce.

The current implementation of the Thanos compactor always compacts the first set of overlapping blocks, if such a set exists. This means that if the most recently ingested blocks from multiple ingesters overlap, they will be compacted. When that happens, the compaction planned by the Thanos planner may be “missing” a block, because an ingester may not have fully uploaded its block when the compaction begins. So if there are 3 overlapping blocks when the compaction begins, and they are the latest blocks passed to the Thanos planner, the planner will plan a compaction of those 3 blocks even if a fourth ingester has yet to upload its block.

With this PR, overlapping blocks will not be compacted if they are the last set of blocks. In the example above, the 3 blocks won’t be compacted if they are the latest ones and they don’t cover a full range.

In a real-world situation, this would only affect customers who stop ingesting blocks. The last group of n blocks, where n is the number of ingesters, will remain uncompacted for as long as they are the latest blocks. Leaving the last n blocks uncompacted increases storage size as well as query time (if the customers continue to query after they stop ingesting blocks).

One thing to note about the Thanos approach: there can be duplicate work if the blocks are compacted and another overlapping block is uploaded after the compaction begins.

I considered a couple of different approaches. One is to group overlapping blocks before grouping by compactable ranges; this makes the compaction behavior with these changes the same as with Thanos.

Another approach: if no new blocks arrive after the time defined by the smallest block range has passed since the max time of all the blocks, the overlapping blocks can be compacted even if they are the latest blocks.

Something else I considered is making this a toggle so that users can define their own preference, but I don't think that is ideal, as it would lead either to supporting the toggle indefinitely or to eventually making users switch to a single behavior.

Small example illustrating what’s mentioned above: 4 total blocks with 1 block incoming (not yet uploaded)
Thanos compaction

The above blocks with the current (Thanos) compaction with time ranges [20, 120, 240] would result in blocks:
Afterwards, once block 5 is fully uploaded, the final resulting blocks from a single run of the compaction will be:
With these blocks, another compaction will need to be done to fully compact the overlapping blocks 2-5.

New compaction behavior

With this PR and the shuffle-sharding strategy, the blocks would remain uncompacted and would wait until a block more recent than blocks 2-5 is uploaded. Once that block is uploaded, blocks 2-5 would be compacted in 1 compaction.

The downside of this approach is that the uncompacted blocks 2-5 are stored for a longer time than with the current (Thanos) approach, because the compactor waits for a more recent block to be uploaded before compacting them.

I was wondering what your thoughts are on which approach would be preferable?
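To make the rule discussed above easier to reason about, here is a minimal sketch in Go. It is not the code in this PR: the `block` type, `planGroups`, `coversRange`, and the timestamps are all hypothetical, and it assumes that every block fits inside one aligned window of the smallest range and that timestamps are non-negative. A group of blocks is planned only if its window ends at or before the min time of the most recent block, or if the group fully covers its window.

```go
package main

import (
	"fmt"
	"sort"
)

// block is a hypothetical, simplified stand-in for block metadata.
type block struct {
	ID               string
	MinTime, MaxTime int64
}

// planGroups buckets blocks into aligned windows of rangeSize and returns the
// groups that would be compacted under the assumed rule: a group with at
// least two blocks is planned if its window ends at or before the min time of
// the most recent block, or if the group fully covers its window.
func planGroups(blocks []block, rangeSize int64) [][]block {
	groups := map[int64][]block{}
	var highestMinTime int64
	for _, b := range blocks {
		start := b.MinTime - b.MinTime%rangeSize // aligned window start
		groups[start] = append(groups[start], b)
		if b.MinTime > highestMinTime {
			highestMinTime = b.MinTime
		}
	}

	// Visit windows in time order so the output is deterministic.
	starts := make([]int64, 0, len(groups))
	for s := range groups {
		starts = append(starts, s)
	}
	sort.Slice(starts, func(i, j int) bool { return starts[i] < starts[j] })

	var plans [][]block
	for _, start := range starts {
		g := groups[start]
		end := start + rangeSize
		if len(g) < 2 {
			continue // nothing to merge
		}
		if end <= highestMinTime || coversRange(g, start, end) {
			plans = append(plans, g)
		}
		// Otherwise this is a latest, partially filled window: wait for
		// more blocks instead of compacting prematurely.
	}
	return plans
}

// coversRange reports whether the blocks in g span the whole window.
func coversRange(g []block, start, end int64) bool {
	minT, maxT := g[0].MinTime, g[0].MaxTime
	for _, b := range g[1:] {
		if b.MinTime < minT {
			minT = b.MinTime
		}
		if b.MaxTime > maxT {
			maxT = b.MaxTime
		}
	}
	return minT <= start && maxT >= end
}

func main() {
	// Hypothetical data: two older blocks and three latest, partial blocks.
	blocks := []block{
		{"a", 0, 20}, {"b", 0, 20},
		{"c", 20, 30}, {"d", 20, 30}, {"e", 20, 30},
	}
	fmt.Println(len(planGroups(blocks, 20))) // 1: only a and b are planned

	// Once a more recent block arrives, the waiting group becomes eligible.
	blocks = append(blocks, block{"f", 40, 60})
	fmt.Println(len(planGroups(blocks, 20))) // 2: c, d and e are planned too
}
```

In this sketch the trade-off described above is visible directly: blocks c, d and e stay uncompacted until block f shows up, but they are then merged in a single compaction rather than being compacted twice.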
This PR replaces and implements the changes recommended in #4318
Discussed in the community call and leaving the blocks uncompacted is okay.
The code is long and I didn't read through every line. Broadly it looks ok.
I did wonder why the word "thanos" shows up so often - if the code is copied from Thanos it should say so, and if not can you just explain your thinking to me?
garbageCollectedBlocks: garbageCollectedBlocks,
hashFunc: hashFunc,
compactions: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Name: "thanos_compact_group_compactions_total",
Do we want to add new metrics in Cortex starting "thanos_"?
Added a note where the metrics were copied from in Thanos. With these changes wouldn't the metrics in Cortex remain the same? They are only used when creating a new group using compact.NewGroup
which is what is being done now (https://github.com/cortexproject/cortex/blob/master/vendor/github.com/thanos-io/thanos/pkg/compact/compact.go#L262-L312)
Then it might be better to expose those metrics as another function in Thanos?
Any way I can help with this PR? We're running into limits in our compaction (we have about 25M active time series in a single-tenant Cortex). I'd be happy to run pre-release compactor builds if this needs some kind of validation.
error message from build is
Since this looks like a useful PR that we want to merge, but I don't own the original branch, I will create a new branch to work on resolving the error.
Signed-off-by: Albert <ac1214@users.noreply.github.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…ortexproject#4262) * add MaxRetries to WaitInstanceState Signed-off-by: Albert <ac1214@users.noreply.github.com> * update CHANGELOG.md Signed-off-by: Albert <ac1214@users.noreply.github.com> * Add timeout for waiting on compactor to become ACTIVE in the ring. Signed-off-by: Albert <ac1214@users.noreply.github.com> * add MaxRetries variable back to WaitInstanceState Signed-off-by: Albert <ac1214@users.noreply.github.com> * Fix linting issues Signed-off-by: Albert <ac1214@users.noreply.github.com> * Remove duplicate entry from changelog Signed-off-by: Albert <ac1214@users.noreply.github.com> * Address PR comments and set timeout to be configurable Signed-off-by: Albert <ac1214@users.noreply.github.com> * Address PR comments and fix tests Signed-off-by: Albert <ac1214@users.noreply.github.com> * Update unit tests Signed-off-by: Albert <ac1214@users.noreply.github.com> * Update changelog and fix linting Signed-off-by: Albert <ac1214@users.noreply.github.com> * Fixed CHANGELOG entry order Signed-off-by: Marco Pracucci <marco@pracucci.com> Co-authored-by: Albert <ac1214@users.noreply.github.com> Co-authored-by: Marco Pracucci <marco@pracucci.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* MergeIterator: allocate less memory at first We were allocating 24x the number of streams of batches, where each batch holds up to 12 samples. By allowing `c.batches` to reallocate when needed, we avoid the need to pre-allocate enough memory for all possible scenarios. * chunk_test: fix innacurate end time on chunks The `through` time is supposed to be the last time in the chunk, and having it one step higher was throwing off other tests and benchmarks. * MergeIterator benchmark: add more realistic sizes At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks is about three weeks, a highly un-representative test. Instant queries, such as those done by the ruler, will only fetch one chunk from each ingester. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Expose default configuration values for memberlist. Set the defaults for various memberlist configuration values based on the "Default LAN" configuration. The only result of this change is that the defaults are now visible and are in the documentation. This also means that if the default values change, then the changes are visible in the documentation, where as before they would have gone unnoticed. To prevent this being a breaking change, the existing behaviour is retained, in case anyone is explicitly setting the values to zero and expecting the default to be used. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Remove use of zero value as default value indicator. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
cortexproject#4342) * Allow setting ring heartbeat timeout to zero to disable timeout check. This change allows the various ring heartbeat timeouts to be configured with zero, as a means of disabling the timeout. This is expected to be used with a separate enhancement to allow disabling heartbeats. When the heartbeat timeout is disabled, instances will always appear as healthy in the ring. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…time. (cortexproject#4317) * Add a new config and metric for reporting ruler query execution wall time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Spacing and PR number fixup Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Wrap the defer in a function to make it defer after the return rather than after the if block. Add a unit test to validate we're tracking time correctly. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Use seconds for our duration rather than nanoseconds Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Review comment fixes Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Update config flag in the config docs Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Pass counter rather than counter vector for metrics query function Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Fix comment in MetricsQueryFunction Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Move query metric and log to separate function. Add log message for ruler query time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Update config file and change log to show this a per user metric Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * code review fixes Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * update log message for ruler query metrics Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Remove append and just use the array for key values in the log messag Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Add query-frontend component to front end log message Signed-off-by: Tyler Reid <tyler.reid@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
I thought it would be good to put a security page into the docs, so that it shows up in a search. Content is just pointing at other resources. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…xproject#4345) * Optimise memberlist kv store access by storing data unencoded. The following profile data was taken from running 50 idle ingesters with memberlist, with almost everything at default values (5s heartbeats): ``` 52.16% mergeBytesValueForKey +- 52.16% mergeValueForKey +- 47.84% computeNewValue +- 27.24% codec Proto Decode +- 26.25% mergeWithTime ``` It is apparent from the this that a lot of time is spent on the memberlist receive path, as might be expected, specifically, the merging of the update into the current state. The cost however is not in decoding the incoming states (occurs in `mergeBytesValueForKey` before `mergeValueForKey`), but in fact decoding _current state_ of the value in the store (as it is stored encoded). The ring state was measured at 123K (50 ingesters), so it makes sense that decoding could be costly. This can be avoided by storing the value in it's decoded `Mergeable` form. When doing this, care has to be taken to deep copy the value when accessed, as it is modified in place before being updated in the store, and accessed outside the store mutex. Note a side effect of this change is that is no longer straightforward to expose the `memberlist_kv_store_value_bytes` metric, as this reported the size of the encoded data, therefore it has been removed. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Typo. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…o. (cortexproject#4344) * Allow disabling of ring heartbeats by setting relevant options to zero. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…#4346) * Expose configuration of memberlist packet compression. Allows manually specifying whether memberlist should compress packets via a new configuration flag: `-memberlist.enable-compression`. This typically has little benefit for Cortex, as the ring state messages are already compressed with Snappy, the second layer of compression does not achieve any additional saving. It's not clear cut whether there might still be some benefit for internal memberlist messages; this needs to be evaluated in a environment of some reasonable scale. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> * Review comments. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…exproject#4348) It was only waiting one second for the second sync to complete, which is probably too harsh a deadline than necessary for overloaded systems. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…xproject#4349) The test is writing a single silence and checking a metric which indicates whether replicating the silence has been attempted yet. This is so we can check later on that no replication activity occurs. The assertions later on in the test are passing, but the first one is not, indicating that the replication doesn't trigger early enough. This makes sense because the replication is not synchronous with the writing of the silence. Signed-off-by: Steve Simpson <steve.simpson@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
) * Add proposal document Signed-off-by: Gofman <ilang@147dda11e800.ant.amazon.com> Signed-off-by: ilangofman <igofman99@gmail.com> * Minor text modifications Signed-off-by: ilangofman <igofman99@gmail.com> * Implement requested changes to the proposal Signed-off-by: ilangofman <igofman99@gmail.com> * Fix mention of Compactor instead of purger in proposal Signed-off-by: ilangofman <igofman99@gmail.com> * Fixed wording and spelling in proposal Signed-off-by: ilangofman <igofman99@gmail.com> * Update the cache invalidation method Signed-off-by: ilangofman <igofman99@gmail.com> * Fix wording on cache invalidation section Signed-off-by: ilangofman <igofman99@gmail.com> * Minor wording additions Signed-off-by: ilangofman <igofman99@gmail.com> * Remove white-noise from text Signed-off-by: ilangofman <igofman99@gmail.com> * Remove the deleting state and change cache invalidation Signed-off-by: ilangofman <igofman99@gmail.com> * Add deleted state and update cache invalidation Signed-off-by: ilangofman <igofman99@gmail.com> * Add one word to clear things up Signed-off-by: ilangofman <igofman99@gmail.com> * update api limits section Signed-off-by: ilangofman <igofman99@gmail.com> * ran clean white noise Signed-off-by: ilangofman <igofman99@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Conventionally the minimum time would be before the maximum. Apparently none of the tests were depending on this. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
We need to add the merged value back to the map. Extract merging as a separate function so it can be tested. Adapt the existing test to cover multiple series. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Rearrange `CHANGELOG.md` to conform to instructions in `pull_request_template.md`. Also add a `-` to a CLI flag to conform to instructions in `design-patterns-and-conventions.md`. Signed-off-by: Andrew Seigner <andrew@sig.gy> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Introduce `http` config settings in Azure storage Cortex v1.11.0 included thanos-io/thanos#3970, which added configuration options to Azure's http client and transport, replacing usage of `http.DefaultClient`. Unfortunately since Cortex was not setting this config, Cortex implicitly switched from `http.DefaultClient` to all empty values (e.g. `MaxIdleConns: 0` rather than 100). Introduce `http` config settings to Azure storage. This motivated moving `s3.HTTPConfig` into a new `pkg/storage/bucket/config` package, to allow `azure` and `s3` to share it. Also update the instructions for running the website to include installing `embedmd`. Signed-off-by: Andrew Seigner <andrew@sig.gy> * feedback: `config.HTTP` -> `http.Config` also back out changelog cleanup Signed-off-by: Andrew Seigner <andrew@sig.gy> * Back out accidental changelog addition Signed-off-by: Andrew Seigner <andrew@sig.gy> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Update Thanos to latest main Update Thanos dependency to include thanos-io/thanos#4928, to conserve memory. Signed-off-by: Andrew Seigner <andrew@sig.gy> * Update changelog to summarize user-facing changes Signed-off-by: Andrew Seigner <andrew@sig.gy> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Adding test case for dropping metrics by name to understand better flow of distributor Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Adding test case and new metric for dropped samples Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Updating CHANGELOG with new changes Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Fixing linting problem on distributor file Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Reusing discarded samples metric from validate package Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Compare labelset with len() instead of comparing to nil Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Undoing unnecessary changes on tests and distributor Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Small rename on comment Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Fixing linting offenses Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Reseting validation dropped samples metric to avoid getting metrics from other test runs Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Resolving problems after rebase conflicts Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Registering counter for dropped metrics in test Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Checking if user label drop configuration did not drop __name__ label Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> * Do not check for name label, adding new test Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Disable block deletion marks migration by default Flag is named `-compactor.block-deletion-marks-migration-enabled`. This feature was added in v1.7, so we expect most users to have upgraded by now. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…ct#4602) * Upgrade Go to 1.17.5 for integration tests Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> * Upgrade to Go 1.17 in Dockerfiles Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Update build image. Signed-off-by: Peter Štibraný <pstibrany@gmail.com> * CHANGELOG.md Signed-off-by: Peter Štibraný <pstibrany@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
This reverts commit f2656f8. Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…#4440)" (cortexproject#4613) This reverts commit a635a1e. Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
* Federated ruler proposal Signed-off-by: Rees Dooley <rees.dooley@shopify.com> Co-authored-by: Rees Dooley <rdooley@Reess-MacBook-Pro.local> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
) This reverts commit 19f3802. Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…exproject#4614) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…er (cortexproject#4615) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…t#4617) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
)" (cortexproject#4611) This reverts commit 32b1b40. Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…project#4619) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Move the change log line to unreleased section Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
db0595b to 8e78a51
Please see #4624 instead.
Signed-off-by: Albert <ac1214@users.noreply.github.com>
What this PR does:
Implements generation of parallelizable plans for the proposal outlined in #4272, using a shuffle sharding grouper and planner. Currently the parallelizable plans are generated, but every compactor runs every planned compaction; the actual sharding will happen in a subsequent PR.
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]