Response caching for Thanos #1651

Closed
2 of 6 tasks
bwplotka opened this issue Oct 15, 2019 · 17 comments

Comments

@bwplotka
Member

bwplotka commented Oct 15, 2019

This is an umbrella issue for things we might need to do or contribute in order to get smooth response caching for Thanos.

As you probably know, the Cortex query-frontend now supports the Prometheus API, so it can be put in front of Thanos. Still, there are some areas we have to (or can) improve for Thanos usage:

Must:

  • Understand the underlying block structure and downsampling handling. The query-frontend splits queries by day. This does not fit our downsampling, which exists precisely to make queries over large time ranges cheap. Splitting by day when querying a 2-week block is also inefficient, since most of the index lookup work is duplicated 14 times. On the other hand, splitting allows easier load balancing and distribution of queries across the system. This means we probably want a somewhat smarter query split. One idea:
const day = 24 * time.Hour

// thanosIntervalFn returns the split interval based on the query step. This relies on the fact that
// Thanos in auto-downsampling mode also uses the step when choosing which resolution to query.
//
// Splitting rationale:
// Cortex splits by day, which gives a good ratio of chunks per series to fetch: for 1 day of raw data
// (30s scrape interval) we fetch approx 24 chunks per series. To keep at most ~24 chunks per series,
// the 5m resolution implies a 10-day split and the 1h resolution a 120-day split.
var thanosIntervalFn = func(r Request) time.Duration {
	step := time.Duration(r.Step()) * time.Millisecond
	if step > 5*time.Hour {
		// Auto-downsampling will pick the 1h resolution.
		return 120 * day
	}
	if step > 30*time.Minute {
		// Auto-downsampling will pick the 5m resolution.
		return 10 * day
	}
	return day
}

This might be some improvement, but still, there are some edge cases:

  • Fallback to raw chunks if downsampled blocks are still missing
  • Downsampled blocks appear only after 40h (5m) and 10d (1h), so this logic might need to consider time as well.
  • Split by day efficiency might differ across time, depending on how many blocks are touched.

I think there is still no clear best practice here. Ideally we would split based on the underlying block structure, but there is no easy way to get that information at query-planning time.

Maybe treating query splitting and response splitting as separate steps makes more sense? Right now response caching relies on splitting, because the cache key depends on the queried start time. This avoids lookup logic in the caching layer.

  • Allow setting custom query parameters in the Cortex query-frontend:
    • Auto-downsampling has to be used, together with the step-based interval logic above.
    • Partial responses must NOT be cached. This can be achieved by simply setting `--query.nopartial-response`, but we could also add logic that skips caching when the response carries some special header.
    • NOTE that currently a partial response is not assumed when a StoreAPI is down for a short time for some reason (Info not responding, service discovery not discovering it for a while). We might need to change this to assume a partial response in those cases as well, to avoid caching wrong results. The alternative is to invalidate the cache in those cases, which the point below enables.
  • Adjust for StoreAPI changes: expose some info_hash HTTP API on the Querier that the cacher uses to fetch a hash and include it in the cache key? How do we ensure we are up to date with StoreAPI changes? Maybe a short TTL is fine?
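A rough sketch of the info_hash idea, under heavy assumptions: hash the sorted set of known StoreAPI endpoints and expose it over HTTP; the cacher mixes the hash into its cache keys and re-fetches it with a short TTL to bound staleness. The `infoHash` helper and the `/-/info_hash` path are hypothetical names, not an existing Thanos API:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
	"sort"
)

// infoHash returns a stable hash of the currently known StoreAPI endpoints.
// Endpoints are sorted first so the hash does not depend on discovery order.
// Everything here is a hypothetical sketch, not Thanos code.
func infoHash(stores []string) string {
	sorted := append([]string(nil), stores...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, s := range sorted {
		fmt.Fprintln(h, s)
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	stores := []string{"store-1:10901", "store-0:10901"}
	http.HandleFunc("/-/info_hash", func(w http.ResponseWriter, r *http.Request) {
		// The cacher fetches this value and appends it to its cache keys;
		// re-fetching with a short TTL bounds how stale the key input gets.
		fmt.Fprint(w, infoHash(stores))
	})
	// http.ListenAndServe(":9090", nil) // left commented so the sketch terminates
	fmt.Println(infoHash(stores))
}
```

When the store set changes, the hash changes, so all old cache entries become unreachable, which acts as a cheap whole-cache invalidation.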

Could:
* [ ] Discuss with Cortex moving the query-frontend caching logic out of the Cortex project: cortexproject/cortex#1672

Anything I missed? Feedback, ideas, and help wanted (:

@bwplotka
Member Author

Updated this with recent findings and ideas. Design doc and MVP in progress (:

@GiedriusS
Member

I feel like we should return a partial response if StoreAPI nodes that were added without a discovery mechanism are unhealthy, i.e. do not respond to Info(), kind of like static scrape targets in Prometheus. We shouldn't do the same for dynamically discovered nodes, since they can go away easily. @bwplotka does this logic sound good?

@stale

stale bot commented Feb 27, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 27, 2020
@bwplotka bwplotka removed the stale label Feb 27, 2020
@stale

stale bot commented Mar 28, 2020


@stale stale bot added the stale label Mar 28, 2020
@GiedriusS GiedriusS removed the stale label Mar 30, 2020
@GiedriusS
Member

We should document Cortex's query-frontend as the recommended solution, plus how to enable the most common use cases with --store-strict.

@bwplotka
Member Author

bwplotka commented Apr 14, 2020 via email

@bwplotka
Member Author

Proposal that should solve your concern @GiedriusS (and mine as well): #2434

@stale

stale bot commented May 14, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 14, 2020
@bwplotka
Member Author

@stale stale bot removed the stale label May 18, 2020
@stale

stale bot commented Jun 17, 2020


@stale stale bot added the stale label Jun 17, 2020
@pracucci pracucci removed the stale label Jun 18, 2020
@bwplotka
Member Author

bwplotka commented Jun 18, 2020 via email

@stale

stale bot commented Jul 18, 2020


@stale stale bot added the stale label Jul 18, 2020
@stale

stale bot commented Jul 25, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@yeya24
Contributor

yeya24 commented Aug 21, 2020

#3018 is merged, so I think it is time to get back to this issue.

Just opened a PR on the Cortex side to support a dynamic split interval: cortexproject/cortex#3066. It would be great if anyone could help review it!

@bwplotka Would you mind updating the issue checklist? I think "Partial response must NOT be cached" is already done.

@yeya24
Contributor

yeya24 commented Sep 4, 2020

Updates on this issue:

"Auto downsampling has to be used for the above interval based on step logic" and "Allow setting custom query params to Cortex frontend" should be done.

For the dynamic split interval for downsampled blocks: sorry, I don't really understand how the 10d, 120d, 30m and 5h values here are calculated, can you please explain? Since we already support getting max_source_resolution directly, I think we can use it here directly.

var thanosIntervalFn = func(r Request) time.Duration {
	step := time.Duration(r.Step()) * time.Millisecond
	if step > 5*time.Hour {
		return 120 * day
	}
	if step > 30*time.Minute {
		return 10 * day
	}
	return day
}

Fallback to raw chunks if downsampled blocks are still missing

How does the query frontend know that downsampled blocks are missing? I think we don't need to deal with this in the query frontend.

Downsampled blocks appear only after 40h (5m) and 10d (1h), so this logic might need to consider time as well.

This is something we talked about offline, but it is not very clear. How would the query frontend know that there are downsampled blocks in the bucket? I think it is not doable.

@stale

stale bot commented Nov 3, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 3, 2020
@stale

stale bot commented Nov 20, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Nov 20, 2020