discuss: Introduce datafusion-storage as datafusion's own storage interface #14854

@Xuanwo

Hello everyone, I'm jumping here from [Discussion] Object Store Composition.

Background

DataFusion currently uses ObjectStore as its public storage interface. We have public APIs such as register_object_store:

use std::sync::Arc;
use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::prelude::SessionContext;

let object_store_url = ObjectStoreUrl::parse("file://").unwrap();
let object_store = object_store::local::LocalFileSystem::new();
let ctx = SessionContext::new();
// All files with the file:// url prefix will be read from the local file system
ctx.register_object_store(object_store_url.as_ref(), Arc::new(object_store));

As DataFusion grows, we keep having to add features to object_store, which makes it increasingly difficult to compose, as described in [Discussion] Object Store Composition.

The latest example is adding Extensions to object store GetOptions to allow passing tracing spans within the object store, as requested in Improve use of tracing spans in query path.

It's easy to predict that ObjectStore will move further and further away from its initial position:

Initially the ObjectStore API was relatively simple, consisting of a few methods to interact with object stores. As such many systems took this abstraction and used it as a generic IO abstraction, this is good and what the crate was designed for.

Proposal

So I propose building datafusion-storage, focused primarily on DataFusion's own needs, while maintaining datafusion-storage-object-store and datafusion-storage-opendal separately. The benefit is that users can implement innovative features like datafusion-storage-cudf or datafusion-storage-io_uring without being constrained by the current I/O abstraction of object-store or OpenDAL.

If this becomes a reality, DataFusion can design the abstraction based on its own requirements without having to push everything upstream to object_store. This would allow DataFusion to maintain useful features such as context management and to add further requirements to the trait, while datafusion-storage-object-store and datafusion-storage-opendal handle the extra work.
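
To make the layering concrete, here is a minimal sketch of how it might look. Everything in it is hypothetical: `Storage`, `StorageContext`, `Backend`, `MemoryBackend` and `BackendAdapter` are illustrative names, not existing DataFusion, object_store, or OpenDAL APIs, and the trait is synchronous only to keep the example self-contained (a real one would presumably be async).

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical per-request context (e.g. a tracing span name) that DataFusion
/// could pass through its own trait without changing any backend API.
#[derive(Debug, Clone, Default)]
pub struct StorageContext {
    pub span: Option<String>,
}

/// Hypothetical DataFusion-owned storage interface (synchronous for brevity).
pub trait Storage: Send + Sync {
    fn read(&self, ctx: &StorageContext, path: &str) -> Result<Vec<u8>, String>;
}

/// Stand-in for a backend API such as object_store or OpenDAL.
pub trait Backend: Send + Sync {
    fn get(&self, path: &str) -> Option<Vec<u8>>;
}

/// In-memory backend, present only to make the sketch runnable.
#[derive(Default)]
pub struct MemoryBackend {
    objects: HashMap<String, Vec<u8>>,
}

impl MemoryBackend {
    pub fn with_object(mut self, path: &str, data: &[u8]) -> Self {
        self.objects.insert(path.to_string(), data.to_vec());
        self
    }
}

impl Backend for MemoryBackend {
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.objects.get(path).cloned()
    }
}

/// Plays the role of an adapter crate such as datafusion-storage-object-store:
/// it implements the DataFusion trait on top of a backend.
pub struct BackendAdapter<B: Backend> {
    inner: B,
}

impl<B: Backend> BackendAdapter<B> {
    pub fn new(inner: B) -> Self {
        Self { inner }
    }
}

impl<B: Backend> Storage for BackendAdapter<B> {
    fn read(&self, ctx: &StorageContext, path: &str) -> Result<Vec<u8>, String> {
        // The adapter consumes the context; the backend never sees it.
        let span = ctx.span.as_deref().unwrap_or("no-span");
        self.inner
            .get(path)
            .ok_or_else(|| format!("[{span}] not found: {path}"))
    }
}

/// Builds a storage handle the way an engine might hold one.
pub fn demo_storage() -> Arc<dyn Storage> {
    Arc::new(BackendAdapter::new(
        MemoryBackend::default().with_object("data/a.parquet", b"bytes"),
    ))
}
```

The point of the sketch is that the context lives in the DataFusion-side trait, so adding it does not require any change to the backend trait.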

Implementation

We can start by aliasing the ObjectStore trait within datafusion-storage first. Given sufficient migration time, we can then fine-tune the trait to better align with DF's specific needs.
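
As a rough illustration of that first step, the sketch below uses a plain re-export. The `object_store` module here is a simplified local stand-in for the real crate (its `name` method is invented for the demo), and `datafusion_storage`, `Storage`, and `register` are hypothetical names.

```rust
/// Simplified stand-in for the upstream `object_store` crate (hypothetical).
mod object_store {
    pub trait ObjectStore: Send + Sync {
        fn name(&self) -> &'static str;
    }

    pub struct LocalFileSystem;

    impl ObjectStore for LocalFileSystem {
        fn name(&self) -> &'static str {
            "local"
        }
    }
}

/// Hypothetical `datafusion-storage` crate. Step one only re-exports the
/// trait under DataFusion's own name, so existing implementations keep working.
mod datafusion_storage {
    pub use super::object_store::ObjectStore as Storage;
}

use datafusion_storage::Storage;
use std::sync::Arc;

/// Engine-side code depends only on the DataFusion-owned name.
pub fn register(store: Arc<dyn Storage>) -> &'static str {
    store.name()
}
```

Because `Storage` is the same trait under a new name, callers can migrate their imports gradually; the trait can diverge later once nothing depends on the old path.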
