Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for sub-directories in object stores #472

Merged
merged 4 commits into from
Nov 28, 2023

Conversation

gruuya
Copy link
Contributor

@gruuya gruuya commented Nov 24, 2023

This will allow configuring a specific path alongside a bucket where to store the data in.

Closes #470.

@gruuya gruuya requested a review from mildbyte November 24, 2023 14:20
@gruuya gruuya force-pushed the object-store-root-in-sub-folder branch 2 times, most recently from d0b4959 to 5fb7156 Compare November 24, 2023 15:55
@@ -85,6 +84,8 @@ pub fn build_object_store(
bucket,
cache_properties,
}) => {
// Bucket from the config may contain some path so extract the proper bucket
let bucket = bucket.split('/').next().unwrap_or(bucket);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with overloading the bucket config param so that it can also hold a path to stay backward-compatible.

But I think the preferable way to do this is to ditch the type and bucket fields and replace them with a location field which can store all the relevant info (i.e. type, bucket and optionally path) at the expense of breaking backward-compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could keep the bucket and have a separate prefix field that is optional and defaults to no prefix where it doesn't exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that a separate prefix field wasn't worth the increased config complexity, whereas the location approach felt more consolidated than either alternative.

To me it seems like it's natural to think about the full url location as one thing from the user perspective, and leave the parsing of the individual details (type, bucket, path) to Seafowl. This would also neatly get rid of the endpoint field on S3, as this is covered by location too. Moreover, currently we don't support passing the bucket/path in the endpoint, though in principle there's no reason not to. GlareDB uses the location + storage options map approach and does away with config file completely due to the conciseness of this approach, though this wouldn't work for Seafowl as we have a bunch of other configurations as well.

To clarify, I definitely do prefer a separate path (or prefix) field over overloading the bucket field, I just think location would be even better from a usability standpoint. Let me know what are your preferences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I made the change to use the prefix field for now as that's good enough for a merge I believe, but we can revisit whether it makes sense to go with the location approach later on.

@gruuya gruuya self-assigned this Nov 27, 2023
@@ -85,6 +84,8 @@ pub fn build_object_store(
bucket,
cache_properties,
}) => {
// Bucket from the config may contain some path so extract the proper bucket
let bucket = bucket.split('/').next().unwrap_or(bucket);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could keep the bucket and have a separate prefix field that is optional and defaults to no prefix where it doesn't exist?

Comment on lines +327 to +328
// In principle for this test we could use any object store since we only exercise the
// prefix/log store uri logic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're worried about this, you could break out table_prefix into a separate function that operates on (schema::ObjectStore, Url) -> Url and test it without having to create an object store (don't know how much it's worth it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really worried, intended more like a clarification. The test also exercises the log store uri, which does need some object store.

Comment on lines 76 to 87
// In case when there's an endpoint specified, the bucket is not the host but the
// first element in the path, so we need to discard it when creating a prefix.
let mut segments =
self.root_uri.path_segments().unwrap().collect::<Vec<_>>();

// Add the last path segment which is the table UUID
let uuid_str = table_uuid.to_string();
segments.push(&uuid_str);

// Remove the first path segment which is actually the bucket
segments.remove(0);
Path::from(segments.join("/"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe having a prefix as a separate config field could simplify this logic or avoid having to do this on every query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in this case it would.

ObjectStoreType::InMemory,
ObjectStoreType::Local,
ObjectStoreType::S3(None),
ObjectStoreType::S3(Some("/path/to/folder"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with or without the leading slash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, works in both cases, the test_table_location_s3 actually covers that.

@gruuya gruuya force-pushed the object-store-root-in-sub-folder branch from c28c067 to ae09ec3 Compare November 28, 2023 09:27
@gruuya gruuya merged commit 404c480 into main Nov 28, 2023
1 check passed
@gruuya gruuya deleted the object-store-root-in-sub-folder branch November 28, 2023 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for object stores with specific paths
2 participants