Response caching for Thanos #1651

Closed
2 of 6 tasks
bwplotka opened this issue Oct 15, 2019 · 17 comments

Comments

@bwplotka
Member

bwplotka commented Oct 15, 2019

This is an umbrella issue for things we might need to do or contribute in order to get smooth response caching for Thanos.

As you probably know, the Cortex query-frontend now supports the Prometheus API, so it can be put in front of Thanos. Still, there are some areas we have to (or can) improve for Thanos usage:

Must:

  • Understand the underlying block structure and downsampling handling. The query-frontend splits queries by day. This does not fit our downsampling, which exists precisely to make queries over large time ranges cheap. Splitting by day when querying a 2-week block is also inefficient, since most of the index lookup work is duplicated 14 times. On the other hand, splitting allows easier load balancing and distribution of queries across the system. This means we probably want a somewhat smarter query split. One idea:
const day = 24 * time.Hour

// thanosIntervalFn returns the split interval based on the query step. This relies on the fact that
// Thanos in auto-downsampling mode also uses the step when choosing which resolution to query.
//
// Splitting rationale:
// Cortex splits by day, which gives a good ratio of chunks per series to fetch: for 1 day of raw data
// (30s scrape interval) we fetch approx 24 chunks per series. To keep at most ~24 chunks per series,
// the 5m resolution implies a 10-day split and the 1h resolution a 120-day split.
var thanosIntervalFn = func(r Request) time.Duration {
	step := time.Duration(r.Step()) * time.Millisecond
	if step > 5*time.Hour {
		// Auto-downsampling will pick the 1h resolution.
		return 120 * day
	}
	if step > 30*time.Minute {
		// Auto-downsampling will pick the 5m resolution.
		return 10 * day
	}
	return day
}

This might be some improvement, but still, there are some edge cases:

  • Fallback to raw chunks if downsampled blocks are still missing
  • Downsampled blocks appear only after 40h (5m) and 10d (1h), so this logic might need to consider time as well.
  • Split by day efficiency might differ across time, depending on how many blocks are touched.

I think there is still no clear best practice here. Ideally we would split based on the underlying block structure, but there is no easy way to get that information at query-planning time.

Maybe treating query splitting and response splitting as separate steps makes more sense? Right now response caching relies on splitting, because the cache key depends on the queried start time. This avoids lookup logic in the caching layer.

  • Allow setting custom query parameters in the Cortex query-frontend:
    • Auto-downsampling has to be used, together with the step-based interval logic above.
    • Partial responses must NOT be cached. This can be achieved by simply setting `--query.nopartial-response`, but we could also add logic that skips caching when the response carries some special header.
    • NOTE that currently a partial response is not assumed when a StoreAPI is down for a short time for some reason (Info not responding, service discovery not discovering it for a while). We might need to change this to assume a partial response in those cases as well, to avoid caching wrong results. The alternative is to invalidate the cache in those cases, which the point below enables.
  • Adjust for StoreAPI changes: expose some info_hash HTTP API on the Querier that the cacher uses to fetch a hash and include it in the cache key? How do we ensure we are up to date with StoreAPI changes? Maybe a short TTL is fine?
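A rough sketch of the info_hash idea, under heavy assumptions: hash the sorted set of known StoreAPI endpoints and expose it over HTTP; the cacher mixes the hash into its cache keys and re-fetches it with a short TTL to bound staleness. The `infoHash` helper and the `/-/info_hash` path are hypothetical names, not an existing Thanos API:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
	"sort"
)

// infoHash returns a stable hash of the currently known StoreAPI endpoints.
// Endpoints are sorted first so the hash does not depend on discovery order.
// Everything here is a hypothetical sketch, not Thanos code.
func infoHash(stores []string) string {
	sorted := append([]string(nil), stores...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, s := range sorted {
		fmt.Fprintln(h, s)
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	stores := []string{"store-1:10901", "store-0:10901"}
	http.HandleFunc("/-/info_hash", func(w http.ResponseWriter, r *http.Request) {
		// The cacher fetches this value and appends it to its cache keys;
		// re-fetching with a short TTL bounds how stale the key input gets.
		fmt.Fprint(w, infoHash(stores))
	})
	// http.ListenAndServe(":9090", nil) // left commented so the sketch terminates
	fmt.Println(infoHash(stores))
}
```

When the store set changes, the hash changes, so all old cache entries become unreachable, which acts as a cheap whole-cache invalidation.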

Could:
* [ ] Discuss with Cortex moving the query-frontend caching logic out of the Cortex project: cortexproject/cortex#1672

Anything I missed? Feedback, ideas, and help wanted (:

@bwplotka
Member Author

Updated this with recent findings and ideas. Design doc and MVP in progress (:

@GiedriusS
Member

I feel like we should return a partial response if StoreAPI nodes that were added without a discovery mechanism are unhealthy, i.e. do not respond to Info(), kind of like static scrape targets in Prometheus. We shouldn't do the same for dynamically discovered nodes, since they can go away easily. @bwplotka does this logic sound good?

@stale

stale bot commented Feb 27, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 27, 2020
@bwplotka bwplotka removed the stale label Feb 27, 2020
@stale

stale bot commented Mar 28, 2020


@stale stale bot added the stale label Mar 28, 2020
@GiedriusS GiedriusS removed the stale label Mar 30, 2020
@GiedriusS
Member

We should document Cortex's query-frontend as the recommended solution, plus how to enable the most common use cases with --store-strict.

@bwplotka
Member Author

bwplotka commented Apr 14, 2020 via email

@bwplotka
Member Author

Proposal that should solve your concern @GiedriusS (and mine as well): #2434

@stale

stale bot commented May 14, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 14, 2020
@bwplotka
Member Author

@stale stale bot removed the stale label May 18, 2020
@stale

stale bot commented Jun 17, 2020


@stale stale bot added the stale label Jun 17, 2020
@pracucci pracucci removed the stale label Jun 18, 2020
@bwplotka
Member Author

bwplotka commented Jun 18, 2020 via email

@stale

stale bot commented Jul 18, 2020


@stale stale bot added the stale label Jul 18, 2020
@stale

stale bot commented Jul 25, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@yeya24
Contributor

yeya24 commented Aug 21, 2020

#3018 is merged, so I think it is time to get back to this issue.

Just opened a PR on the Cortex side to support a dynamic split interval: cortexproject/cortex#3066. It would be great if anyone could help review it!

@bwplotka Would you mind updating the issue checklist? I think "Partial response must NOT be cached" is already done.

@yeya24
Contributor

yeya24 commented Sep 4, 2020

Updates on this issue:

"Auto downsampling has to be used for the above interval based on step logic" and "Allow setting custom query params to Cortex frontend" should be done.

For the dynamic split interval for downsampled blocks: sorry, I don't really understand how the 10d, 120d, 30m and 5h values here are calculated, can you please explain? Since we already support getting max_source_resolution directly, I think we can use it here directly.

var thanosIntervalFn = func(r Request) time.Duration {
	step := time.Duration(r.Step()) * time.Millisecond
	if step > 5*time.Hour {
		return 120 * day
	}
	if step > 30*time.Minute {
		return 10 * day
	}
	return day
}

Fallback to raw chunks if downsampled blocks are still missing

How does the query frontend know that downsampled blocks are missing? I think we don't need to deal with this in the query frontend.

Downsampled blocks appear only after 40h (5m) and 10d (1h), so this logic might need to consider time as well.

This is something we talked about offline, but it is not very clear. How would the query frontend know that there are downsampled blocks in the bucket? I think it is not doable.

@stale

stale bot commented Nov 3, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 3, 2020
@stale

stale bot commented Nov 20, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Nov 20, 2020