Allow usage of folders within S3 #697
We wanted to postpone this as it adds mostly unnecessary complexity to the code, but if it is a blocker we will take PRs for it (: Proper tests for this are a must-have.
Also, we probably need that at the bucket level, not the provider level.
You say unnecessary complexity, but what about organisations that have conventions that must be followed for the storage of data? It then becomes quite necessary. I'm happy to look at this if you're happy to take a PR, but could you clarify what you'd expect with regard to proper tests, please?
The amount of additional complexity, maintenance, and testing in comparison to the added value is quite large. (: Can you elaborate on the conventions? Why can't you just create another bucket using whatever naming convention you need? In terms of tests, I just mean being careful not to surprise existing users and having consistent handling of this across all Thanos components, nothing more.
This just comes down to the rules that govern data in the organisation I work for. Simply put, I'm not allowed to create buckets with arbitrary names. I'm now looking to use Thanos for cross-cluster federation, but I'm still getting to grips with how exactly that works:

1. Could I have the Thanos sidecars in every cluster write to the same bucket?
2. If I then have a query deployment in every cluster, plus a monitoring cluster with a querier reading from every other querier, would that mean a Thanos store in the monitoring cluster only?
3. If I can make this work while only needing one new bucket for all clusters, there's more chance of a new bucket being permitted.

Many thanks in advance for your help.
1. Yes.
2. Yes (:
3. What do you mean?
@bwplotka, thanks. I've made some progress on this, but I don't seem to be able to make queries from the querier in my monitoring cluster to each of the other query components. Could you please clarify whether `--store` can be passed to a query component to point at another query component?
Yes, but you need to pass its gRPC endpoint (:
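To make that concrete, here is a sketch of what the monitoring-cluster querier's flags could look like for such a setup. The hostnames are placeholders, and 10901/10902 are simply the conventional Thanos gRPC/HTTP ports, not anything mandated:

```shell
# Global querier in the monitoring cluster, fanning out to the
# queriers in the other clusters via their gRPC endpoints.
thanos query \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --store=querier.cluster-a.internal:10901 \
  --store=querier.cluster-b.internal:10901
```

The key point from the reply above: each `--store` target must be a gRPC (Store API) address, not the querier's HTTP address.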
@bwplotka another reason: in AWS, S3 buckets are limited in number per account and must be uniquely named cloud-wide, so they tend to be managed as a scarce resource.
This is something we would also like to have, since we're planning to partition thanos-store components for scaling purposes. I have a strawman implementation that I've done some minimal testing on (claytono/thanos@7598121). @bwplotka It sounds like you'd prefer to have this at the bucket level, but I think the providers need to be at least aware of it: if you're storing Thanos data under a prefix in a shared bucket, the provider has to pass that prefix down to the underlying list APIs. And if the providers need to know about the configured prefix, then I'm not seeing a lot of benefit in having the block layer also be aware of it. What are your thoughts? I'm willing to put together a PR once I know what it needs to include.
Hm. I am still not sold on those arguments, to be honest (:
OK, but for the current Thanos implementation there is a super easy workaround: use a dedicated bucket for Thanos and put everything else elsewhere. But generally I am not strongly against this either. @domgreen, any thoughts?

Regarding @claytono's implementation: I wonder if we can create a Bucket wrapper (similar to the metrics wrapper) that could be applied to any provider. I think it's doable. What do you think?
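A minimal sketch of what such a wrapper could look like. The `Bucket` interface here is a cut-down stand-in, not the real Thanos `objstore.Bucket` (which has more methods); all names are illustrative. The idea is the same as the metrics wrapper: decorate any provider, prepending the prefix on writes and stripping it on reads/iteration so callers see an unprefixed view:

```go
package main

import (
	"fmt"
	"strings"
)

// Bucket is a cut-down stand-in for an object store interface.
type Bucket interface {
	Upload(name, data string) error
	Get(name string) (string, error)
	Iter(fn func(name string))
}

// memBucket is a toy in-memory Bucket for demonstration.
type memBucket struct{ objs map[string]string }

func (m *memBucket) Upload(name, data string) error { m.objs[name] = data; return nil }
func (m *memBucket) Get(name string) (string, error) {
	d, ok := m.objs[name]
	if !ok {
		return "", fmt.Errorf("not found: %s", name)
	}
	return d, nil
}
func (m *memBucket) Iter(fn func(string)) {
	for name := range m.objs {
		fn(name)
	}
}

// prefixedBucket wraps any Bucket, prepending a fixed prefix on writes
// and stripping it again on iteration, so callers see an unprefixed view.
type prefixedBucket struct {
	b      Bucket
	prefix string // e.g. "thanos/" — assumed to end with "/"
}

func (p *prefixedBucket) Upload(name, data string) error { return p.b.Upload(p.prefix+name, data) }
func (p *prefixedBucket) Get(name string) (string, error) { return p.b.Get(p.prefix + name) }
func (p *prefixedBucket) Iter(fn func(string)) {
	p.b.Iter(func(name string) {
		// Hide objects outside our prefix entirely.
		if strings.HasPrefix(name, p.prefix) {
			fn(strings.TrimPrefix(name, p.prefix))
		}
	})
}

func main() {
	raw := &memBucket{objs: map[string]string{"other-team/report.csv": "..."}}
	pb := &prefixedBucket{b: raw, prefix: "thanos/"}
	pb.Upload("01ABC/meta.json", "{}")

	// The wrapper only sees objects under its prefix.
	pb.Iter(func(name string) { fmt.Println("visible:", name) })
	// The underlying bucket stores the fully prefixed key.
	fmt.Println(raw.objs["thanos/01ABC/meta.json"])
}
```

A real wrapper would also need Exists, Delete, and ranged reads, but the prepend-on-write / strip-on-read symmetry is the core of it. Note this naive version still lists the whole bucket and filters client-side, which is exactly the efficiency concern raised about provider-level awareness below.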
The limit is 100 per account by default, although you can request that AWS raise that limit. Note that bucket names have to be globally unique across all of S3, not just in a single account. I would expect that it's not uncommon to have creation of new buckets require administrative approval of some sort, whereas using a new prefix is effortless.
Imagine that a department at a company has been given one bucket for their use. Because they host data for multiple purposes there, the bucket structure looks something like this:
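For example, a hypothetical layout of that kind (the names are made up for illustration):

```
department-bucket/
├── billing-exports/
├── web-assets/
└── thanos/
    ├── 01D0ABC.../
    └── 01D0XYZ.../
```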
In this scenario, if you ask the S3 API for a list of all objects in the bucket without supplying a prefix, you get back every object in the bucket, not just the Thanos data. Because I don't think this is an unlikely scenario, I think it makes sense to have the object store code be aware of prefixes. This allows the object store provider to pass the prefix to object store APIs that support it. And once the code for each object store has to be updated to support prefixes anyway, there doesn't seem to be much advantage to abstracting it in the bucket provider.
Cool, I think we are happy to take PRs for that (:
Thank you for this project; I'm hoping it will solve a lot of problems for us! One more use case for your consideration: permissions management. Ideally, I'd prefer to use <company_name>-<service_name>/<account_name>/..., so for us that's mycompany-prometheus-thanos/development/.... Permissions for the production Prometheus cluster's IAM role would be granted to the production prefix, and development instances would use the other one. We have a few other accounts as well to isolate potentially high-risk services, one for CI, etc. I can work around this by using extra buckets, but normally I would use prefixes.
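A sketch of the kind of IAM policy this would enable, following the naming in the comment above. The shape is illustrative, not a tested production policy: `s3:prefix` is the condition key that scopes `ListBucket` to a prefix, and the object-level actions are scoped by the resource ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::mycompany-prometheus-thanos",
      "Condition": { "StringLike": { "s3:prefix": ["production/*"] } }
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::mycompany-prometheus-thanos/production/*"
    }
  ]
}
```

With a policy like this, a single bucket can be shared across environments while each IAM role only ever sees its own prefix, which is why prefix support in Thanos matters for this use case.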
A quick bump on this... One really nice thing about allowing bucket prefixes is access speed: S3 scales request throughput per prefix, so multiple Thanos clusters sharing one bucket under different prefixes can achieve better overall download speeds than they would all writing to the bucket root.
We would really appreciate this feature.
Having multiple sidecars uploading to the same bucket with one compactor per cluster would not be recommended, as the compactors would "compete" for the same data, right?
FYI there is a proposal for this -> #1318
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue should not be stale; I would very much still like the ability to use S3 prefixes.
At present I can only specify a bucket name and Thanos writes to the top level; however, I want it to use `<bucket_name>/logging-and-monitoring/prometheus`.
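In objstore-configuration terms, what this issue asks for is roughly the following. The `prefix` field here is hypothetical, since it is exactly the feature being requested rather than something the current config supports; `bucket` and `endpoint` are standard S3 config fields:

```yaml
type: S3
config:
  bucket: <bucket_name>
  endpoint: s3.<region>.amazonaws.com
  # Hypothetical field: the feature requested by this issue.
  prefix: logging-and-monitoring/prometheus
```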