
[Feature Request] Compact: Replica de-duplication for long term storage #2362

Closed
iridian-ks opened this issue Apr 2, 2020 · 2 comments
Thanos, Prometheus and Golang version used:

Latest

Object Storage Provider:

S3

What happened:

For high availability of metrics we run multiple Prometheus instances along with the Thanos sidecar to ship data to S3 (in our case three instances, though we will probably switch to two in the short term). If a pod OOMs or we're doing upgrades, we don't miss metrics because the other instances are still up. We always want to be as close to realtime as possible.

This does mean that 3x duplicate metrics make their way into S3. We don't actually scrape all that much data, yet the bucket is almost at a terabyte; only about 200-300 GB of that is data we actually need.
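To illustrate the setup, here is a rough sketch of the Prometheus external labels involved (the `replica` label name and its values are assumptions about our config, not anything Thanos-specific):

```yaml
# Sketch: external labels on one of the three Prometheus replicas.
# Each replica's labels differ only in "replica", so every replica
# uploads a near-identical copy of the same blocks to S3.
global:
  external_labels:
    cluster: prod             # assumed shared label
    replica: prometheus-0     # prometheus-1 / prometheus-2 on the others
```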

What you expected to happen:

Thanos Compact could find the duplicate data if it had a --replica flag and average matching series within the downsample. This probably wouldn't make sense for raw data, since samples could be off by seconds between replicas, but for the 5m and 1h downsamples you could probably match duplicate data between replicas within the downsample range.

I suppose another way to do this would be to leverage relabeling in Thanos Compact and simply remove the replica labels, since they aren't actually useful in the context of long term storage (30+ days, or whatever the raw retention is set to). But then Compact would need to effectively run twice on the same block? I'm not sure whether this would actually work as-is, but if it does, it's not documented.

How to reproduce it (as minimally and precisely as possible):

Run more than one replica shipping data with Thanos.

Anything else we need to know:

I don't actually know the internals of Thanos, so beyond what's in the docs I don't have much insight into what's going on. I apologize if this isn't actually an issue or if I'm misunderstanding something.

My real goal is to figure out how to reduce the S3 bucket size of our metrics. I have already tuned retention levels based on existing issues/docs I've read. I didn't see any docs about de-duplication outside of the querier.
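For reference, the query-time de-duplication I found is configured on the querier roughly like this (a sketch as Kubernetes container args; the `replica` label name and the store address are assumptions from our setup). It only de-duplicates at read time, though; it doesn't shrink the bucket:

```yaml
# Sketch: thanos query args. --query.replica-label names the label
# that distinguishes replicas so identical series are merged at
# query time; the data in S3 is left untouched.
args:
  - query
  - --query.replica-label=replica
  - --store=thanos-store-gateway:10901   # assumed store address
```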

Could I simply remove the replica labels altogether and rely on Compact to do 1h downsamples from that?

Thanks in advance!

Environment:

Kubernetes + Kiwigrid Helm chart

@kakkoyun
Member

kakkoyun commented Apr 2, 2020

Hello @iridian-ks, first of all, thank you for reaching out and for the detailed issue. I believe this issue is a duplicate of #1014, so you can track progress on the topic in #1014.

That being said, we've recently merged an experimental feature, naive offline vertical deduplication, which is similar to what you have described: #2250. This is an initial step in our plan to tackle the offline deduplication problem. There's a hidden flag (deduplication.replica-label) to specify the labels that have been used as replica labels and to enable the feature. Check the recent changelog and the PR to learn more.
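For anyone landing here, passing the flag to Compact would look roughly like this (a sketch as Kubernetes container args; the `replica` label name and file paths are assumptions, and the flag is hidden/experimental as noted above):

```yaml
# Sketch: thanos compact args with the experimental offline
# deduplication flag from #2250. Blocks whose series differ only in
# the "replica" label are merged during compaction.
args:
  - compact
  - --wait                                      # run continuously
  - --data-dir=/var/thanos/compact              # assumed scratch dir
  - --objstore.config-file=/etc/thanos/objstore.yaml   # assumed path
  - --deduplication.replica-label=replica
```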

Feel free to close this issue if these satisfy your needs.

@iridian-ks
Author

Hi @kakkoyun, thanks for the swift response. I tried searching through the issues prior to creating my own, but I guess I'm a bad searcher. Sounds like this is getting taken care of. I'll explore the hidden flag! Thanks so much :D
