Proposal: Modification of Receive Split proposal. #4036
Merged
+61
−18
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
--- | ||
title: Thanos Remote Write | ||
title: Thanos Routing Receive and Ingesting Receive | ||
type: proposal | ||
menu: proposals | ||
status: accepted | ||
|
@@ -8,30 +8,74 @@ status: accepted | |
### Related Tickets | ||
|
||
* Hashring improvements: https://github.com/thanos-io/thanos/issues/3141 | ||
* Previous proposal implementations: https://github.com/thanos-io/thanos/pull/3845, https://github.com/thanos-io/thanos/pull/3580 | ||
* Distribution idea: https://github.com/thanos-io/thanos/pull/2675 | ||
|
||
### Summary | ||
|
||
This document describes the motivation and design of splitting the receive component into receive-route and receive. | ||
Additionally, we touch possibility for adding a consistent hashing mechanism and buffering the incoming requests. | ||
This document describes the motivation and design of running Receiver in a stateless mode that cannot store samples and only routes remote write requests | |
to further Receivers based on the hashring. This enables an optional deployment model where only Routing Receivers use hashring files and perform the routing and replication. Ingesting Receivers then do not handle any routing or hashring and only receive multi-tenant writes. | |
|
||
### Motivation | ||
|
||
1. Splitting functionality can be used to make resharding events easier. | ||
2. Right now the receiver handles ingestion, routing and writing, thus leading to too many responsibilities in a single process. | ||
This also makes the receiver more difficult to operate and understand for Thanos users. | ||
Splitting the receiver component into two different components could potentially have the following benefits: | ||
1. Resharding events become faster and cause no downtime in ingestion. | ||
2. Deployment becomes easier to understand for Thanos users. | ||
3. Each component consists of less code. | ||
4. The new architecture enables further performance improvements. | ||
[@squat](https://github.com/squat): | ||
|
||
> Currently, any change to the hashring configuration file will trigger all Thanos Receive nodes to flush their multi-TSDBs, causing them to enter an unready state until the flush is complete. This unavailability during a flush allows for a clear state transition, however it can result in downtimes on the order of five minutes for every configuration change. Moreover, during configuration changes, the hashring goes through an even longer period of partial unreadiness, where some nodes begin and finish flushing before and after others. During this partial unreadiness, the hashring can expect high internal request failure rates, which cause clients to retry their requests, resulting in even higher load. Therefore, when the hashring configuration is changed due to automatic horizontal scaling of a set of Thanos Receivers, the system can expect higher than normal resource utilization, which can create a positive feedback loop that continuously scales the hashring. | ||
|
||
|
||
### Goals | ||
|
||
1. Split the receive component into receive-route and receive (and ensure ease of resharding events). | ||
2. Evaluate any effects on performance by simulating scenarios and collecting and analyzing metrics. | ||
3. Use consistent hashing to avoid reshuffling time series after resharding events. The exact consistent hashing mechanism to be used needs some further research. | ||
* Reduce downtime of the ingestion logic in Thanos Receiver | ||
|
||
### Proposal | ||
|
||
We propose allowing Thanos Receiver to run in a mode that only forwards/replicates remote write (distributor mode). You can enable that mode by simply not specifying: | |
|
||
```yaml | ||
--receive.local-endpoint=RECEIVE.LOCAL-ENDPOINT | ||
||
Endpoint of local receive node. Used to | ||
identify the local node in the hashring | ||
configuration. | ||
``` | ||
|
||
We can call this mode a "Routing Receiver". Similarly, we can skip specifying any hashring for Thanos Receiver (`--receive.hashrings-file=<path>`), explicitly dedicating it to ingesting only. We can call this mode "Ingesting Receiver". | |
|
||
|
||
Users can also mix these two modes for various federated hashrings etc. So instead of what we had before: | |
|
||
![Before](https://docs.google.com/drawings/d/e/2PACX-1vTfko27YB_3ab7ZL8ODNG5uCcrpqKxhmqaz3lW-yhGN3_oNxkTrqXmwwlcZjaWf3cGgAJIM4CMwwkEV/pub?w=960&h=720) | ||
|
||
We have: | ||
|
||
![After](https://docs.google.com/drawings/d/e/2PACX-1vTVrtCGjR4iMbrU7Kj6QAn1a1m4fr-kvoQVDAK4lzQ_wWfXfpLLEE9HB948-WHI5ZG6s1iGWt51R593/pub?w=960&h=720) | ||
|
||
This allows us to (optionally) model the deployment in a way that avoids expensive re-configuration of the stateful ingesting Receivers after the hashring configuration file has changed. | |
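To illustrate why only the Routing Receivers need to know the hashring, here is a minimal sketch of hash-based endpoint selection. It is not the Thanos implementation: the function name `pickEndpoint`, the FNV-1a hash, the plain modulo placement and the example endpoint addresses are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickEndpoint is a hypothetical helper: it maps a tenant and series key onto
// one of the configured ingesting endpoints using hash-modulo placement.
// The hash function (FNV-1a) is an assumption, chosen because it is in the
// standard library. It returns false when no endpoints are configured.
func pickEndpoint(endpoints []string, tenant, series string) (string, bool) {
	if len(endpoints) == 0 {
		return "", false
	}
	h := fnv.New64a()
	h.Write([]byte(tenant))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(series))
	return endpoints[h.Sum64()%uint64(len(endpoints))], true
}

func main() {
	// Only the router holds this list; the ingesting Receivers never see it,
	// so changing it does not require reconfiguring them.
	endpoints := []string{"ingest-0:10901", "ingest-1:10901", "ingest-2:10901"}
	if ep, ok := pickEndpoint(endpoints, "tenant-a", `up{job="node"}`); ok {
		fmt.Println("forward to", ep)
	}
}
```

In this sketch the endpoint list lives only in the routing process, which is the property the deployment model above relies on.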
|
||
In comparison to the previous proposal (as mentioned in [alternatives](#previous-proposal-separate-receive-route-command)), we have big advantages: | |
|
||
1. We can reduce the number of components in the Thanos system and reuse similar component flags and documentation. Users have to learn one less command, and as a result the Thanos design is much more approachable. Fewer components mean less maintenance, code and other implicit duties: separate changelogs, issue confusion, boilerplate, etc. | |
2. It allows a pattern consistent with Query. We don't have a separate StoreAPI component for proxying; we have that baked into Querier. This has proven to be flexible and understandable, so I would like to propose a similar pattern in Receiver. | |
3. This is more future-proof for potential advanced cases like chains of routers -> receivers -> routers -> receivers for federated writes, i.e. trees of depth n. | |
|
||
### Plan | ||
|
||
* Receiver without `--receive.hashrings` does not forward or replicate requests; it routes straight to multi-TSDB. | |
|
||
* Receiver without `--receive.local-endpoint` will assume that no storage is needed, so it will skip creating any resources for multi-TSDB. | |
|
||
* Add changes to the documentation (it's simplistic now). Mention two modes. | ||
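The plan above can be sketched as a small decision function. This is only an illustration under stated assumptions: the type `ReceiveMode`, its constants and the function `receiveMode` are hypothetical names, and the sketch assumes that a Receiver with no hashring is treated as ingesting-only regardless of the local endpoint value, which the plan does not spell out.

```go
package main

import "fmt"

// ReceiveMode is a hypothetical name for the mode a Receiver ends up in.
type ReceiveMode string

const (
	RoutingOnly         ReceiveMode = "routing"
	IngestingOnly       ReceiveMode = "ingesting"
	RoutingAndIngesting ReceiveMode = "routing+ingesting"
)

// receiveMode derives the mode from the two settings named in the plan:
// whether any hashring was configured, and the value given for
// --receive.local-endpoint (empty when the flag is not specified).
func receiveMode(hashringConfigured bool, localEndpoint string) ReceiveMode {
	switch {
	case !hashringConfigured:
		// No hashring: nothing to forward to, write straight to local storage.
		// Assumption: this holds even if no local endpoint was given.
		return IngestingOnly
	case localEndpoint == "":
		// Hashring but no local endpoint: no storage, only forward/replicate.
		return RoutingOnly
	default:
		// Both set: the pre-existing combined behaviour.
		return RoutingAndIngesting
	}
}

func main() {
	fmt.Println(receiveMode(true, ""))
	fmt.Println(receiveMode(false, "127.0.0.1:10901"))
	fmt.Println(receiveMode(true, "127.0.0.1:10901"))
}
```

Deriving the mode from existing settings, rather than from a dedicated switch, mirrors the choice argued for in this proposal over an explicit flag.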
|
||
### Alternative Solutions | ||
|
||
#### Previous Proposal: Separate receive-route command | ||
|
||
1. Split the Receiver component into receive-route and Receiver (and ensure ease of resharding events). | ||
1. Evaluate any effects on performance by simulating scenarios and collecting and analyzing metrics. | ||
1. Use consistent hashing to avoid reshuffling time series after resharding events. The exact consistent hashing mechanism to be used needs some further research. | ||
1. Migration: We document how the new architecture can be set up to have the same general deployment of the old architecture. (We run router and Receiver on the same node). | ||
|
||
|
||
This potentially makes the receiver more difficult to operate and understand for Thanos users. I would argue, however, that this makes the overall Thanos deployment much harder. Otherwise this option is exactly the same. | |
|
||
|
||
#### Flag for current Receiver --receive-route | ||
|
||
The idea would be the same as in [Proposal](#Proposal), but there would be an explicit flag to turn off local storage capabilities. | |
|
||
I think we can have much more understandable logic if we simply do not configure a hashring for ingesting Receivers, and do not configure the local hashring endpoint to signal that such a Receiver instance will never store anything. | |
|
||
### Drawbacks of the project | ||
There is no possible way to have a single-process receiver. The user must have a router + a receiver running. | ||
##### Solution | ||
We document how the new architecture can be set up to have the same general deployment of the old architecture. (We run router and receiver on the same node). |