sidecar: time limit requests to Prometheus remote read api #1267

Closed

Conversation

@jjneely (Contributor) commented Jun 20, 2019

  • [ENHANCEMENT] Add a --prometheus.retention=0d flag to the sidecar to limit the TSDB data made available from Prometheus to the last given duration. The default is 0, meaning no limit.

Changes

Implement #1191

The Sidecar keeps track of the mint/maxt of the TSDB data that it makes available via the Prometheus Remote Read API. This change limits mint to the given duration, such as 1d or 3d, making only the last 1d or 3d worth of data from the local Prometheus available.

This can vastly improve query response times when Prometheus has more data than this on disk and long-term storage is also uploaded into object storage: querying the Store component is more efficient than issuing large queries via the Remote Read API. It can also fix the 400 errors returned by the Remote Read API when storage.remote.read-sample-limit is hit.
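
As a rough illustration of the idea (a sketch assuming mint/maxt are tracked as millisecond timestamps, not the PR's exact code), the change amounts to raising the advertised minimum timestamp; with this PR applied, running the sidecar with --prometheus.retention=3d would expose only the last three days:

```go
package main

import (
	"fmt"
	"time"
)

// limitMinTime raises mint (milliseconds since epoch) so that the sidecar
// never advertises data older than `retention` before maxt. A zero retention
// means "no limit", matching the --prometheus.retention=0d default.
func limitMinTime(mint, maxt int64, retention time.Duration) int64 {
	if retention <= 0 {
		return mint
	}
	if limit := maxt - int64(retention/time.Millisecond); mint < limit {
		return limit
	}
	return mint
}

func main() {
	now := time.Now().UnixNano() / int64(time.Millisecond)
	monthAgo := now - int64((30*24*time.Hour)/time.Millisecond)

	// With ~30d of local TSDB data but a 3d limit, only the last 3d is advertised.
	fmt.Println(limitMinTime(monthAgo, now, 3*24*time.Hour))
}
```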

Verification

  • Monitored remote read queries hitting Prometheus to verify they covered the correct time period.
  • Tested the Query component for errors and speed of query execution.

Usually, long-term retention is held in an object storage bucket, and it is most efficient to query that data from the matching Store component. When Prometheus has more than a few days of configured retention, limiting the Sidecar's queries to the local Prometheus to just a few days can vastly speed them up.

If we are advertising a specific window for which we have time series, don't query the local Prometheus for anything outside that time range. While the local instance might have more data, doing so would make the remote read API call more expensive, potentially exceeding the limits set by storage.remote.read-sample-limit.
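
A minimal sketch of that query-side clamp (the function name and signature are illustrative, not the PR's code): the requested range is intersected with the advertised window before the remote read call is made.

```go
// clampToAdvertised restricts a requested [reqMint, reqMaxt] range to the
// window the sidecar advertises via the Store API, so the remote read call
// never asks the local Prometheus for data outside it. ok is false when the
// request lies entirely outside the advertised window and can be skipped.
func clampToAdvertised(reqMint, reqMaxt, advMint, advMaxt int64) (mint, maxt int64, ok bool) {
	if reqMaxt < advMint || reqMint > advMaxt {
		return 0, 0, false
	}
	if reqMint < advMint {
		reqMint = advMint
	}
	if reqMaxt > advMaxt {
		reqMaxt = advMaxt
	}
	return reqMint, reqMaxt, true
}
```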
@jjneely marked this pull request as ready for review on June 21, 2019, 15:29
@bwplotka (Member) commented Jun 24, 2019

Hi, thanks for this, I see the value, especially with the non-streamed remote read. But do you think this is still needed if we make remote read more efficient, as is in progress now? #1268

I think we can focus on that instead, which should make this feature not really needed. Do you agree? (:

@jjneely (Contributor, Author) commented Jun 24, 2019

Perhaps? I probably don't fully understand the benefits the streaming remote read API will bring. But it seems that if you are using Stores and object buckets for metric storage, then you never want to ask the Sidecar/Prometheus for anything more than the last few hours of data. Or do these changes make Sidecar/Prometheus more efficient to query than the Store/bucket?

I'm migrating from a world where we have 30 days or more of retention on our Prometheus VMs into Thanos, which moves the long-term data into object storage. I can't limit the retention on the Prometheus VMs until the Thanos Query component is production ready. However, the Query component was requesting large date ranges from Prometheus, which blew the remote read sample limit, as well as the same date ranges from object storage. It was the 400 HTTP status code responses from Prometheus that sent me down this quest.

It also looks like a Prometheus 2.10-ish upgrade would be required for this to work.

@povilasv (Member) commented Aug 6, 2019

So regarding this PR, I agree with @bwplotka: I don't think we need it if we have streaming remote reads.

Regarding:

This can vastly improve query response times if Prometheus has more data than this on disk and long term storage is also uploaded into object storage. Querying the Store component is more efficient than large queries via the Remote Read API.

I don't really agree: if we have streamed remote reads, querying Prometheus should be faster and more efficient than Store, as Prometheus would be hitting local disk in the worst case, while Thanos Store hits the object store.

In my testing of #1077 (comment) I actually saw that query times are shorter when you skip Thanos Store for short periods of data and hit only Prometheus.

@@ -337,8 +343,21 @@ func (s *promMetadata) UpdateLabels(ctx context.Context, logger log.Logger) erro
func (s *promMetadata) UpdateTimestamps(mint int64, maxt int64) {
s.mtx.Lock()
defer s.mtx.Unlock()
var limitt int64
Review comment (Member):
Suggested change:
- var limitt int64
+ var limit int64
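
For context, a hedged sketch of how the clamp inside UpdateTimestamps could look with the suggested rename applied; the retention field and the exact arithmetic are assumptions, not the PR's code:

```go
// Sketch only: assumes promMetadata carries the configured
// --prometheus.retention value as a time.Duration (s.retention) alongside
// its existing mtx/mint/maxt fields.
func (s *promMetadata) UpdateTimestamps(mint int64, maxt int64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	var limit int64
	// A non-zero retention (e.g. 1d) raises the advertised minimum time so
	// that only the most recent window of local TSDB data is exposed.
	if s.retention > 0 {
		limit = maxt - int64(s.retention/time.Millisecond)
	}
	if mint < limit {
		mint = limit
	}

	s.mint = mint
	s.maxt = maxt
}
```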

@@ -37,6 +37,8 @@ func registerSidecar(m map[string]setupFunc, app *kingpin.Application, name stri
promURL := cmd.Flag("prometheus.url", "URL at which to reach Prometheus's API. For better performance use local network.").
Default("http://localhost:9090").URL()

promRetention := modelDuration(cmd.Flag("prometheus.retention", "A limit on how much retention to query from Prometheus. 0d means query all TSDB data found on disk").Default("0d"))
Review comment (Member):
Curious about this. Should we do retention? I would rather see storeapi.min-time with a similar model to specify both relative and absolute time, as here: #1077

This is to keep it consistent with potential store gateway time partitioning.

@jjneely (Contributor, Author) replied:
This was what I struggled with most -- how to name and handle this argument. I'll take a look at implementing the suggestions. I'm all about consistency.

@povilasv (Member) commented Aug 8, 2019:

Yeah, if we are doing this, then --storeapi.min-time and --storeapi.max-time sound like good options.

You can copy-paste the https://github.com/thanos-io/thanos/pull/1077/files#diff-dd29e6298d43e46bb651035051819cfcR14 class from my PR, to make it take the same exact format.

And I will merge once I find time to finish my PR.
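
For reference, a minimal sketch of the kind of flag value #1077 adds, accepting either an absolute RFC 3339 timestamp or a relative duration; the type and method names below are illustrative, not the exact class from that PR:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/common/model"
)

// timeOrDuration is an illustrative kingpin-style flag value: Set accepts
// either an absolute RFC 3339 timestamp or a relative duration such as "2d",
// which is interpreted here as "that long before now" when resolved.
type timeOrDuration struct {
	t   *time.Time
	dur *model.Duration
}

func (v *timeOrDuration) Set(s string) error {
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		v.t = &t
		return nil
	}
	d, err := model.ParseDuration(s)
	if err != nil {
		return err
	}
	v.dur = &d
	return nil
}

func (v *timeOrDuration) String() string {
	if v.t != nil {
		return v.t.Format(time.RFC3339)
	}
	if v.dur != nil {
		return v.dur.String()
	}
	return ""
}

// PrometheusTimestamp resolves the value to milliseconds since the epoch.
func (v *timeOrDuration) PrometheusTimestamp() int64 {
	switch {
	case v.t != nil:
		return v.t.UnixNano() / int64(time.Millisecond)
	case v.dur != nil:
		return time.Now().Add(-time.Duration(*v.dur)).UnixNano() / int64(time.Millisecond)
	default:
		return 0
	}
}

func main() {
	var v timeOrDuration
	_ = v.Set("2d") // relative form: two days before now
	fmt.Println(v.PrometheusTimestamp())
}
```

Since it implements Set and String, such a value could be wired to hypothetical --storeapi.min-time/--storeapi.max-time flags via kingpin's SetValue, keeping the sidecar consistent with the time-partitioning flags proposed in #1077.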

@bwplotka (Member) commented Oct 3, 2019

I think we have some decision mentioned here: #1191

Happy to approve this once it is rebased AND @povilasv's comment is addressed (:

Sorry for the massive lag on this @jjneely ! Are you still around to continue on this?

@bwplotka (Member) commented Oct 9, 2019

Addressed and rebased here: #1619 (:

@bwplotka closed this Oct 9, 2019