Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: replace raw String with Url in Path impl #2230

Closed
Cheappie opened this issue Jul 29, 2022 · 12 comments
Closed

object_store: replace raw String with Url in Path impl #2230

Cheappie opened this issue Jul 29, 2022 · 12 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@Cheappie
Copy link
Contributor

Cheappie commented Jul 29, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have implemented custom TableProvider and ObjectStore in datafusion, what I would like to do is to have more convenient interface that would allow me to pass dynamic parameters to ObjectStore. Currently ObjectStore accepts Path as parameter, and Path impl is a bit restrictive and provides guarantees that are unnecessary(at least for me).

Describe the solution you'd like
Replace Path internals, precisely exchange raw String for Url and then provide AsRef<Url> for Path.

Describe alternatives you've considered
Alternatively ObjectMeta could be provided to ObjectStore instead of just Path, that would allow me to perform some checks like whether last modified time matches. However It might be still good idea to switch internal Path representation to Url to make it more flexible.

Additional context

@Cheappie Cheappie added the enhancement Any new improvement worthy of a entry in the changelog label Jul 29, 2022
@tustvold
Copy link
Contributor

tustvold commented Jul 30, 2022

I'm not sure about this, the Path is very specifically just that a path, not a URL. This keeps it simple and ensures consistent behaviour across object storage implementations. Any additional URL parameters, e.g. host, credentials, ports, etc... are to be handled at the ObjectStore level, i.e. passed to it on construction.

To give an example of where ambiguity would arise, if I call the list API with a URL instead of a prefix, what should the store do, should it ignore the extra arguments, should the returned Path also contain these arguments?

Could you perhaps expand a bit more on your use case?

@Cheappie
Copy link
Contributor Author

Sure, let me briefly explain what I am trying to achieve. I have implemented custom TableProvider in order to embed communication with external service that performs data-source indexing and produces binary stream with complete listing over data-source for given parameters that are part of SQL query expression. In addition to that I have implemented custom ObjectStore that provides just get operations.

As you can see, what I would prefer most is to be able to provide binary blob instead of either Path or Url to get operation. But Url is closest thing that shares similarities with Path and is more flexible.

Well in my case with existing ObjectStore trait, I will have to convert my binary listing to text and I will have to attach arguments to every Url unfortunately.

To give an example of where ambiguity would arise, if I call the list API with a URL instead of a prefix, what should the store do, should it ignore the extra arguments, should the returned Path also contain these arguments?

Another layer of abstraction is my implementation of ObjectStore that embeds tiered storage, I don't know where would be better place to put that. Let me get to the point, I do need at least ObjectMeta struct in get operations of ObjectStore instead of just Path because I do need to invalidate tiered storage when last modified time doesn't match. I wouldn't mind if we could in addition to passing ObjectMeta to ObjectStore get ops, create new field within ObjectMeta like custom_attributes: Option<Box<[u8]>> or Option<Arc<[u8]>>

@tustvold
Copy link
Contributor

tustvold commented Jul 30, 2022

This might be a stupid question, but if you already have a custom TableProvider what is the motivation for using the ObjectStore trait at all? It feels like you're using it to abstract over something that fundamentally is not an object store?

@Cheappie
Copy link
Contributor Author

Cheappie commented Jul 30, 2022

Because I am using parquet format, that allows me to create physical execution plan based on this file format. And in order to provide data I need to communicate with existing implementation through ObjectStore.

What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>> to ObjectMeta ?

@tustvold
Copy link
Contributor

tustvold commented Jul 30, 2022

Ok, so if I understand the issue correctly:

  • You have a catalog service that identifies the files to scan for a query, along with some metadata
  • These files are stored in parquet somewhere and can be fetched to memory

This is a very similar problem that IOx has and it historically solved this by not using DataFusion's parquet support and using the parquet crate directly, in particular it would:

  • Fetch files to Bytes in memory or a file on disk
  • Use parquet crate directly to scan these "files" using a custom ExecutionPlan

A while back I created some tickets related to making DataFusion's abstractions more flexible, but I've not yet had time to finish it up and I would describe the interface currently as rather limiting.

To me the issue here is that DataFusion's ParquetExec is very tightly coupled with both where the data is located and how it is fetched. There are two solutions to this in my mind:

  1. Continue the work to make DataFusion's abstractions more usable
  2. Accelerate plans to lift the object_store logic from DataFusion into parquet so that it can be reused by custom ExecutionPlan

What do you think?

FYI @crepererum @alamb

What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>> to ObjectMeta ?

I think this is at odds with the objectives of the crate to provide a consistent abstraction across object stores, and so I would be extremely reticent to change the API in this way

@tustvold
Copy link
Contributor

I created #2241 to track adding support for request preconditions

@Cheappie
Copy link
Contributor Author

I have taken a look into datafusion codebase and from what I have seen baking my own ExecutionPlan to replace ObjectStore would require quite large effort. Mainly rewriting ParquetExec, FileScanConfig, and some other components along the way that are bound to ObjectStore.

I feel like anyway we go, at the end there will be similar or yet another abstraction of ObjectStore in DataFusion. That will have more relaxed semantics than the one that is being moved into arrow.

I like the idea of request preconditions, but It won't be sufficient for me. I've remembered that I would need to identify who has requested object from ObjectStore. It's good example of why It would be nice to have ability to pass dynamic attributes through something like Box<[u8]>.

What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>> to ObjectMeta ?

I think this is at odds with the objectives of the crate to provide a consistent abstraction across object stores, and so I would be extremely reticent to change the API in this way

Could you elaborate ?

What is wrong in providing ObjectMeta to get operation ? I mean that is the output of list operation. Or are you worried about custom attributes field ?

@Cheappie
Copy link
Contributor Author

Ahh I see now why you do want to uphold existing interface. You just want to keep it simple, because either object stores or file system just require location to get object. Am I right ?

I think that there should be yet another ObjectStore trait in DataFusion, that should use output of list method as an input to get with some additional flexibility for dynamic attributes.

@tustvold
Copy link
Contributor

tustvold commented Jul 31, 2022

You just want to keep it simple, because either object stores or file system just require location to get object. Am I right ?

Yeah, you most definitely shouldn't need the object meta to request a file, although the request preconditions (#2241) would optionally allow restrictions based on a limited subset of the metadata, constrained by what is generally supported.

I think that there should be yet another ObjectStore trait in DataFusion, that should use output of list method as an input to get with some additional flexibility for dynamic attributes.

I agree that this should be handled at the DataFusion layer, although I'm not entirely sure what the interface should look like. Personally I would be reticent to overload the ObjectStore name, as I think that would get very confusing, but the broad idea of introducing more flexibility into the IO operators sounds like a good thing imo 👍

One simple option might be to adapt the existing FormatReader trait to this purpose or something? 🤔 Tbh what you really want is to be able to override the AsyncFileReader that is passed to ParquetRecordBatchStreamBuilder by ParquetOpener. This feels like it should be imminently achievable without major rework. It also occurs to me that having separate ParquetExec, JsonExec, is basically redundant now that we have the FormatReader abstraction.

If you like I'd be happy to raise an issue on DataFusion proposing something to this effect, and depending on feedback potentially bash it out for you? Also happy to leave this with you, just let me know. I'd very much like to help you get this over the line 😄

Edit: I wrote up something apache/datafusion#2992 PTAL and let me know what you think

@crepererum
Copy link
Contributor

My 2 cents regarding path vs URL/URI:

The address of an object within a store is relative, i.e. a path (or a path segment if you want) but a URL/URI is absolute. It is semantically kinda weird that a single object store (e.g. an S3 store configured for one account+bucket) answers URL-based requests. I think if you want something that abstracts over multiple stores, we should introduce another layer on top, e.g.:

trait ObjectStoreRegistry {
    fn register(&mut self, name: &str, store: Arc<dyn ObjectStore>);
    fn map(&self, url: Url) -> Option<(Arc<dyn ObjectStore> Path)>;
}

// usage
let mut registry = make_registry();
registry.register("s3_prod", make_store(...));
registry.register("s3_dev", make_store(...));
registry.register("local", make_store(...));

let (store, path) = registry.map("s3_dev://my/file.parquet".into()).unwrap();
store.get(path)...

@tustvold
Copy link
Contributor

tustvold commented Aug 1, 2022

Agreed, DataFusion in fact already has such an abstraction - https://docs.rs/datafusion/latest/datafusion/datasource/object_store/struct.ObjectStoreRegistry.html

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Sep 23, 2022
@tustvold
Copy link
Contributor

Closing this as I believe it is now resolved, feel free to re-open if I am mistaken

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants