sidecar: time limit requests to Prometheus remote read api #1267

Closed

Conversation

@jjneely (Contributor) commented Jun 20, 2019

  • [ENHANCEMENT] Add a --prometheus.retention=0d flag to the sidecar to limit the TSDB data made available from Prometheus to the last given duration. The default is 0, meaning no limit.

Changes

Implement #1191

The Sidecar keeps track of the mint/maxt of the TSDB data that it makes available via the Prometheus Remote Read API. This change limits mint to the given duration, such as 1d or 3d, making only the last 1d or 3d worth of data from the local Prometheus available.

This can vastly improve query response times when Prometheus has more data than this on disk and long-term storage is also uploaded into object storage: querying the Store component is more efficient than issuing large queries via the Remote Read API. It can also fix the 400 errors returned by the Remote Read API when storage.remote.read-sample-limit is hit.
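
As a rough illustration of the idea (a sketch assuming mint/maxt are tracked as millisecond timestamps, not the PR's exact code), the change amounts to raising the advertised minimum timestamp; with this PR applied, running the sidecar with --prometheus.retention=3d would expose only the last three days:

```go
package main

import (
	"fmt"
	"time"
)

// limitMinTime raises mint (milliseconds since epoch) so that the sidecar
// never advertises data older than `retention` before maxt. A zero retention
// means "no limit", matching the --prometheus.retention=0d default.
func limitMinTime(mint, maxt int64, retention time.Duration) int64 {
	if retention <= 0 {
		return mint
	}
	if limit := maxt - int64(retention/time.Millisecond); mint < limit {
		return limit
	}
	return mint
}

func main() {
	now := time.Now().UnixNano() / int64(time.Millisecond)
	monthAgo := now - int64((30*24*time.Hour)/time.Millisecond)

	// With ~30d of local TSDB data but a 3d limit, only the last 3d is advertised.
	fmt.Println(limitMinTime(monthAgo, now, 3*24*time.Hour))
}
```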

Verification

  • Monitored remote read queries hitting Prometheus to verify they covered the correct time period.
  • Tested the Query component for errors and speed of query execution.

Usually, long-term retention is held in an object storage bucket, and it is most efficient to query that data from the matching Store component. When Prometheus has more than a few days of configured retention, limiting the Sidecar's queries to the local Prometheus to just a few days can vastly speed them up.

If we are advertising a specific window for which we have time series, don't query the local Prometheus for anything outside that time range. While the local instance might have more data, doing so would make the remote read API call more expensive, potentially exceeding the limits set by storage.remote.read-sample-limit.
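
A minimal sketch of that query-side clamp (the function name and signature are illustrative, not the PR's code): the requested range is intersected with the advertised window before the remote read call is made.

```go
// clampToAdvertised restricts a requested [reqMint, reqMaxt] range to the
// window the sidecar advertises via the Store API, so the remote read call
// never asks the local Prometheus for data outside it. ok is false when the
// request lies entirely outside the advertised window and can be skipped.
func clampToAdvertised(reqMint, reqMaxt, advMint, advMaxt int64) (mint, maxt int64, ok bool) {
	if reqMaxt < advMint || reqMint > advMaxt {
		return 0, 0, false
	}
	if reqMint < advMint {
		reqMint = advMint
	}
	if reqMaxt > advMaxt {
		reqMaxt = advMaxt
	}
	return reqMint, reqMaxt, true
}
```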
@jjneely marked this pull request as ready for review on June 21, 2019, 15:29
@bwplotka (Member) commented Jun 24, 2019

Hi, thanks for this, I see the value, especially with the non-streamed remote read. But do you think this is still needed if we make remote read more efficient, as is in progress now? #1268

I think we can focus on that instead, which should make this feature not really needed. Do you agree? (:

@jjneely (Contributor, Author) commented Jun 24, 2019

Perhaps? I probably don't fully understand the benefits the streaming remote read API will bring. But it seems that if you are using Stores and object buckets for metric storage, then you never want to ask the Sidecar/Prometheus for anything more than the last few hours of data. Or do these changes make Sidecar/Prometheus more efficient to query than the Store/bucket?

I'm migrating from a world where we have 30 days or more of retention on our Prometheus VMs into Thanos, which moves the long-term data into object storage. I can't limit the retention on the Prometheus VMs until the Thanos Query component is production ready. However, the Query component was requesting large date ranges from Prometheus, which blew the remote read sample limit, as well as the same date ranges from object storage. It was the 400 HTTP status code responses from Prometheus that sent me down this quest.

It also looks like a Prometheus 2.10-ish upgrade would be required for this to work.

@povilasv (Member) commented Aug 6, 2019

So regarding this PR, I agree with @bwplotka: I don't think we need it if we have streaming remote reads.

Regarding:

This can vastly improve query response times if Prometheus has more data than this on disk and long term storage is also uploaded into object storage. Querying the Store component is more efficient than large queries via the Remote Read API.

I don't really agree: if we have streamed remote reads, querying Prometheus should be faster and more efficient than Store, as Prometheus would be hitting local disk in the worst case, while Thanos Store hits the object store.

In my testing of #1077 (comment) I actually saw that query times are shorter when you skip Thanos Store for short periods of data and hit only Prometheus.

@@ -337,8 +343,21 @@ func (s *promMetadata) UpdateLabels(ctx context.Context, logger log.Logger) erro
func (s *promMetadata) UpdateTimestamps(mint int64, maxt int64) {
s.mtx.Lock()
defer s.mtx.Unlock()
var limitt int64
Review comment (Member):
Suggested change:
- var limitt int64
+ var limit int64
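
For context, a hedged sketch of how the clamp inside UpdateTimestamps could look with the suggested rename applied; the retention field and the exact arithmetic are assumptions, not the PR's code:

```go
// Sketch only: assumes promMetadata carries the configured
// --prometheus.retention value as a time.Duration (s.retention) alongside
// its existing mtx/mint/maxt fields.
func (s *promMetadata) UpdateTimestamps(mint int64, maxt int64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	var limit int64
	// A non-zero retention (e.g. 1d) raises the advertised minimum time so
	// that only the most recent window of local TSDB data is exposed.
	if s.retention > 0 {
		limit = maxt - int64(s.retention/time.Millisecond)
	}
	if mint < limit {
		mint = limit
	}

	s.mint = mint
	s.maxt = maxt
}
```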

@@ -37,6 +37,8 @@ func registerSidecar(m map[string]setupFunc, app *kingpin.Application, name stri
promURL := cmd.Flag("prometheus.url", "URL at which to reach Prometheus's API. For better performance use local network.").
Default("http://localhost:9090").URL()

promRetention := modelDuration(cmd.Flag("prometheus.retention", "A limit on how much retention to query from Prometheus. 0d means query all TSDB data found on disk").Default("0d"))
Review comment (Member):
Curious about this. Should we do retention? I would rather see storeapi.min-time with a similar model to specify both relative and absolute time, as here: #1077

This is to keep it consistent with potential store gateway time partitioning.

@jjneely (Contributor, Author) replied:
This was what I struggled with most -- how to name and handle this argument. I'll take a look at implementing the suggestions. I'm all about consistency.

@povilasv (Member) commented Aug 8, 2019:

Yeah, if we are doing this, then --storeapi.min-time and --storeapi.max-time sound like good options.

You can copy-paste the https://github.com/thanos-io/thanos/pull/1077/files#diff-dd29e6298d43e46bb651035051819cfcR14 class from my PR, to make it take the same exact format.

And I will merge once I find time to finish my PR.
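
For reference, a minimal sketch of the kind of flag value #1077 adds, accepting either an absolute RFC 3339 timestamp or a relative duration; the type and method names below are illustrative, not the exact class from that PR:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/common/model"
)

// timeOrDuration is an illustrative kingpin-style flag value: Set accepts
// either an absolute RFC 3339 timestamp or a relative duration such as "2d",
// which is interpreted here as "that long before now" when resolved.
type timeOrDuration struct {
	t   *time.Time
	dur *model.Duration
}

func (v *timeOrDuration) Set(s string) error {
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		v.t = &t
		return nil
	}
	d, err := model.ParseDuration(s)
	if err != nil {
		return err
	}
	v.dur = &d
	return nil
}

func (v *timeOrDuration) String() string {
	if v.t != nil {
		return v.t.Format(time.RFC3339)
	}
	if v.dur != nil {
		return v.dur.String()
	}
	return ""
}

// PrometheusTimestamp resolves the value to milliseconds since the epoch.
func (v *timeOrDuration) PrometheusTimestamp() int64 {
	switch {
	case v.t != nil:
		return v.t.UnixNano() / int64(time.Millisecond)
	case v.dur != nil:
		return time.Now().Add(-time.Duration(*v.dur)).UnixNano() / int64(time.Millisecond)
	default:
		return 0
	}
}

func main() {
	var v timeOrDuration
	_ = v.Set("2d") // relative form: two days before now
	fmt.Println(v.PrometheusTimestamp())
}
```

Since it implements Set and String, such a value could be wired to hypothetical --storeapi.min-time/--storeapi.max-time flags via kingpin's SetValue, keeping the sidecar consistent with the time-partitioning flags proposed in #1077.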

@bwplotka (Member) commented Oct 3, 2019

I think we have some decision mentioned here: #1191

Happy to approve this once it is rebased AND @povilasv's comment is addressed (:

Sorry for the massive lag on this @jjneely ! Are you still around to continue on this?

@bwplotka (Member) commented Oct 9, 2019

Addressed and rebased here: #1619 (:

@bwplotka closed this Oct 9, 2019