Support HDFS via hdfs-native package #266

Closed
santosh-d3vpl3x opened this issue Jun 22, 2024 · 6 comments · Fixed by #274

Comments

@santosh-d3vpl3x

delta-rs recently got initial support for hdfs via this PR.

It would be great if we could do the same for delta-rs-kernel.

DuckDB recently introduced support for Delta via the kernel implementation, but it can't be used with HDFS because of this missing integration.

Tagging @Kimahriman to see if they can help out here!

@nicklan
Collaborator

nicklan commented Jun 25, 2024

Yeah, seems like we could add this since https://github.com/datafusion-contrib/hdfs-native-object-store just makes hdfs look like an object_store, thanks!

Not sure when I will have time to work on this, but if someone wants to make a PR I'd be open to it, and I will find time at some point.

@SchutteJan
Contributor

@nicklan I would be interested in contributing to this, but I am completely new to this project.

Let me know if my understanding of the problem is correct, as I see it there are two ways to go about it:

  • Add hdfs support into the upstream object_store crate, but looking at their issue (feat: adding HDFS support in the object_store crate apache/arrow-rs#5638) they seem to prefer to keep hdfs separate.

  • Allow the current implementation to create object stores from object_store or hdfs_native_object_store crates

    • Similar to the implementation in delta-rs, you would match protocol schemes to the correct object store initializers
    • I.e.: hdfs:// -> hdfs_native_object_store::HdfsObjectStore and s3:// -> object_store::ObjectStore
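The scheme matching proposed in the second option could look roughly like the sketch below. It uses only the standard library so it stays self-contained: `Backend` and `backend_for` are hypothetical names invented for illustration, not part of delta-kernel, `object_store`, or `hdfs-native-object-store`. A real integration would construct an actual store in each arm instead of returning a tag.

```rust
/// Hypothetical tag naming which crate would build the store.
#[derive(Debug, PartialEq)]
enum Backend {
    /// Would be constructed via the `hdfs_native_object_store` crate.
    HdfsNative,
    /// Would be constructed via the `object_store` crate (s3, file, ...).
    ObjectStore,
}

/// Pick a backend from the URL scheme.
/// Returns `None` when the input has no `scheme://` prefix.
fn backend_for(url: &str) -> Option<Backend> {
    let (scheme, _rest) = url.split_once("://")?;
    if scheme.is_empty() {
        return None;
    }
    // Schemes are case-insensitive, so normalise before matching.
    match scheme.to_ascii_lowercase().as_str() {
        "hdfs" => Some(Backend::HdfsNative),
        // Everything else is left to the generic object_store handling.
        _ => Some(Backend::ObjectStore),
    }
}
```

With this, `backend_for("hdfs://namenode:9000/tables/t")` selects the HDFS backend, while an `s3://` URL falls through to the generic one.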

@santosh-d3vpl3x
Author

2nd way would be the most straightforward and preferable I believe.

@Kimahriman

Sorry, GitHub randomly decided to stop sending me email notifications. delta-rs just uses a dyn ObjectStore, so it was fairly easy to integrate. I haven't looked much at this repo yet to see how it handles object stores, but hopefully it's straightforward!

@Kimahriman

Kimahriman commented Jul 9, 2024

Ah, it looks like object_store::parse_url_opts is just used directly, so it might be a little more work to integrate. delta-rs already has custom handling of schemes, so it was a little more straightforward there. Here we'd have to do some upfront parsing of the scheme before forwarding to parse_url_opts.
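As a rough illustration of "upfront parsing before forwarding", here is a minimal sketch of a wrapper that intercepts the scheme and only falls back to the default parser for schemes it does not handle itself. Everything in it is a stand-in: the `Store` trait plays the role of a `dyn ObjectStore`, `default_parse` stands in for the call to `object_store::parse_url_opts`, and `HdfsStub`, `DefaultStub`, and `parse_with_hdfs` are hypothetical names. It is not the code from the eventual PR.

```rust
/// Hypothetical stand-in for a `dyn ObjectStore` trait object.
trait Store {
    fn name(&self) -> &'static str;
}

struct HdfsStub;
struct DefaultStub;

impl Store for HdfsStub {
    fn name(&self) -> &'static str {
        "hdfs-native"
    }
}

impl Store for DefaultStub {
    fn name(&self) -> &'static str {
        "object_store"
    }
}

/// Stand-in for forwarding to `object_store::parse_url_opts`.
fn default_parse(_url: &str) -> Result<Box<dyn Store>, String> {
    Ok(Box::new(DefaultStub))
}

/// Inspect the scheme first; only unhandled schemes reach the default parser.
fn parse_with_hdfs(url: &str) -> Result<Box<dyn Store>, String> {
    let scheme = url
        .split_once("://")
        .map(|(s, _)| s.to_ascii_lowercase())
        .ok_or_else(|| format!("no scheme in URL: {}", url))?;
    match scheme.as_str() {
        // A real integration would build the HDFS-backed store here.
        "hdfs" => Ok(Box::new(HdfsStub)),
        _ => default_parse(url),
    }
}
```

Because both branches return the same boxed trait object, callers do not need to know which crate produced the store.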

@SchutteJan
Contributor

@nicklan I've marked my PR as ready, can you take a look?

4 participants