feat: adding HDFS support in the object_store crate #5638

Closed
Silemo opened this issue Apr 12, 2024 · 17 comments
Labels
question Further information is requested

Comments

@Silemo
Contributor

Silemo commented Apr 12, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Young developer here. As part of a larger project I am currently working on, I need to access HDFS using your object_store interface. I am currently looking at a couple of repositories that tried adding support for HDFS, either through a wrapper around libhdfs or through native support.

Describe the solution you'd like

I would like to make these solutions general enough so that they can be added to this repository as part of the object_store crate.

Describe alternatives you've considered

The choice to be made is between using https://github.com/datafusion-contrib/datafusion-objectstore-hdfs/tree/master , which is a wrapper around libhdfs, and trying to generalise the solution of https://github.com/Kimahriman/hdfs-native .

Additional context

As I am a young developer, I have a limited idea of which best practices and design choices are better suited for this task, and I could use some help. Furthermore, I could benefit from the feedback of someone who has contributed directly to the object_store crate.

Here's a list of questions that might help me get started; feel free to add any comment or recommendation:

  • When contributing such a large change (more than 70 changed lines), do you still recommend a single PR? How should I prepare the work for the contribution?
  • HDFS, being a file system, operates more like your local file system implementation than like S3, GCP or Azure. Would you then prefer that I create a single hdfs.rs file, like local.rs, or would you still opt for a folder?
  • Is there something in particular I should keep in mind when implementing for hdfs or contributing to this repo?

Thanks in advance for your response

@Silemo Silemo added the enhancement Any new improvement worthy of a entry in the changelog label Apr 12, 2024
@tustvold
Contributor

tustvold commented Apr 12, 2024

I am not sure we have the spare maintenance capacity to add and maintain a first-party integration for HDFS to object_store at this time.

Is there a particular reason you do not wish to use either of the third-party integrations linked?

@tustvold tustvold added question Further information is requested and removed enhancement Any new improvement worthy of a entry in the changelog labels Apr 12, 2024
@alamb
Contributor

alamb commented Apr 13, 2024

When contributing such a large change (more than 70 changed lines), do you still recommend a single PR? How should I prepare the work for the contribution?

I think what you are doing is the right thing -- have the discussion before you make a large PR

Is there something in particular I should keep in mind when implementing for hdfs or contributing to this repo?

As @tustvold hints at above, the first thing you should consider is figuring out who will be able to maintain the code. As you are hinting at, an ObjectStore implementation for HDFS will be a substantial effort. Unless we can find a maintainer who has the time to help review and maintain the bindings in this repo, I suggest you look elsewhere.

Maybe you can ask in https://github.com/datafusion-contrib/datafusion-objectstore-hdfs if others are interested in helping maintain such a crate. There are clearly others who need this functionality too.

@ion-elgreco

There has also been interest from delta-rs users, perhaps you can find some companions there to help :)

@tustvold
Contributor

FWIW, whilst deprecated, the MinIO gateway can, I believe, proxy HDFS to S3 and is likely more performant than webhdfs.

@Silemo
Contributor Author

Silemo commented Apr 17, 2024

There has also been interest from delta-rs users, perhaps you can find some companions there to help :)

That is where I am coming from. Given the limited interest there is right now (and my lack of experience to carry out a whole first-party implementation alone), I will probably resort to (and fix) the third-party implementation already present: https://github.com/datafusion-contrib/datafusion-objectstore-hdfs .

FWIW, whilst deprecated, the MinIO gateway can, I believe, proxy HDFS to S3 and is likely more performant than webhdfs.

Having had some discussion with my colleagues I don't think this approach would actually be more performant. But thanks for the heads up.

@milenkovicm

milenkovicm commented Apr 17, 2024

First of all, I'd agree with @tustvold and @alamb: arrow maintainers should not take on this responsibility, since an HDFS store is a bit more complicated than object stores.

IMHO, there are two directions which could be taken:

  1. Implement HDFS support based on C++ libhdfs. To my knowledge there are two somewhat maintained repositories

and there are a few bindings generated for it, one of which is https://github.com/datafusion-contrib/hdfs-native

pros:

cons:

  • libhdfs needs some effort to bring it to parity with the latest HDFS interface
  2. The second approach is to write a native Rust HDFS library, and I believe @Kimahriman's https://github.com/Kimahriman/hdfs-native is on the right track. I haven't used the library and can't tell how performant it is, but IMHO it looks like he's on the right track.

pros:

  • we'd have an up-to-date HDFS library in Rust

cons:

  • we need to invest some effort to get there

@tustvold
Contributor

tustvold commented Apr 17, 2024

Having had some discussion with my colleagues I don't think this approach would actually be more performant. But thanks for the heads up

TBC I was referring to webhdfs which is a Java jetty-based proxy with notoriously poor performance. Minio gateway likely would not perform favourably against a native implementation but it might be good enough, at least for a POC.

That being said, object stores in general have high first-byte latencies, so clients are already set up to not be too sensitive to this. Assuming the gateway is provisioned appropriately and isn't throughput constrained, it therefore might not be a terrible approach. I wonder if you could even run it as a sidecar 🤔

I dunno, HDFS is from the era of fat Java clients, so implementing a native client is a fairly significant undertaking, which I suspect will be hard given the size of the Rust community and the dwindling HDFS install base.

@Kimahriman

  2. The second approach is to write a native Rust HDFS library, and I believe @Kimahriman's https://github.com/Kimahriman/hdfs-native is on the right track. I haven't used the library and can't tell how performant it is, but IMHO it looks like he's on the right track.

Thanks for the call out! I agree there's no need to have HDFS support directly in this repo since the trait is public and it's a tricky thing to support. I actually have an object_store implementation on top of my library already https://github.com/Kimahriman/hdfs-native/tree/master/crates/hdfs-native-object-store.
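
To illustrate why a public trait is enough for an out-of-tree backend, here is a minimal sketch. It assumes a simplified, synchronous stand-in trait: `MiniStore`, `FakeHdfs` and `total_bytes` are hypothetical names made up for this example and are not the real `ObjectStore` trait (which is async and has a larger surface) or any hdfs-native API. The point is only that caller code written against the trait does not need to know which crate provides the backend.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a public storage trait.
/// (Assumption: the real `ObjectStore` trait is async and much larger.)
trait MiniStore {
    fn put(&mut self, path: &str, data: Vec<u8>);
    fn get(&self, path: &str) -> Option<Vec<u8>>;
    fn list(&self, prefix: &str) -> Vec<String>;
}

/// Hypothetical third-party backend; an in-memory map stands in for HDFS.
#[derive(Default)]
struct FakeHdfs {
    files: BTreeMap<String, Vec<u8>>,
}

impl MiniStore for FakeHdfs {
    fn put(&mut self, path: &str, data: Vec<u8>) {
        self.files.insert(path.to_string(), data);
    }

    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.files.get(path).cloned()
    }

    fn list(&self, prefix: &str) -> Vec<String> {
        // BTreeMap keeps keys sorted, so the listing is in path order.
        self.files
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
}

/// Caller code depends only on the trait, not on the concrete backend.
fn total_bytes(store: &dyn MiniStore, prefix: &str) -> usize {
    store
        .list(prefix)
        .iter()
        .filter_map(|p| store.get(p))
        .map(|d| d.len())
        .sum()
}
```

Any other backend that implements the same trait could be passed to `total_bytes` unchanged, which is the property that lets an HDFS integration live in its own repository.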

I've gotten pretty far with it at this point. I have some benchmarks that show reading/writing is at least on par with the libhdfs-based client, and RPC calls are even faster. I suspect performance would be even better in real scenarios, since the JVM client makes heavy use of multi-threading, which would help single-task benchmarks compared to my async setup.

The only major feature I'm tracking that is not supported right now is file encryption support via KMS. Not sure how widely that is used or not. The other limitations right now are:

  • It dynamically links to libgssapi_krb5 native lib (via the libgssapi crate), which makes cross compiling tricky/impossible with Kerberos support. I know there are other libs (like compression libraries) that I think use their native implementation, so I'd be curious how those work for cross compiling (compiled and statically linked instead of dynamically linked?).
  • Reading and writing data isn't quite as resilient to failures as the Java client right now. Reading was a bit of an oversight I'm trying to fix now, writing is more complicated so it's currently just a "retry the whole thing if it fails" setup
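
The "retry the whole thing if it fails" strategy for writes can be sketched roughly as below. This is not hdfs-native's code: `retry_whole` and `write_with_retries` are hypothetical names, and the simulated failure (controlled by `fail_first`) stands in for a real write error. The sketch only shows the shape of the approach, where each attempt starts from scratch and nothing partial is reused.

```rust
/// Runs `op` up to `max_attempts` times, restarting it from scratch after
/// each failure. `op` receives the 1-based attempt number.
/// Returns the first success, or the error from the final attempt.
fn retry_whole<T, E>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise discard the error and redo the whole operation.
            Err(_) => attempt += 1,
        }
    }
}

/// Hypothetical write that fails on the first `fail_first` attempts.
/// A fresh buffer is created per attempt, so no partial state carries over.
fn write_with_retries(data: &[u8], fail_first: u32) -> Result<Vec<u8>, String> {
    retry_whole(3, |attempt| {
        let mut sink = Vec::new();
        if attempt <= fail_first {
            return Err(format!("attempt {attempt} failed"));
        }
        sink.extend_from_slice(data);
        Ok(sink)
    })
}
```

The trade-off is simplicity versus wasted work: a failure late in a large write costs a full rewrite, which is why a more resilient client would try to recover mid-stream instead.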

It's also not super heavily battle tested in various HDFS setups, but I haven't heard much yet of things not working for the few people who might be using it.

I've been meaning to try to get it integrated with delta-rs, but haven't gotten around to it since ideally I want it included in the Python wheels, but the libgssapi thing has had me stuck for a while.

@milenkovicm

I'm not sure what @alamb's and @tustvold's opinion is, but would it make sense to have your repo in datafusion-contrib, @Kimahriman?

@alamb
Contributor

alamb commented Apr 18, 2024

I'm not sure what @alamb's and @tustvold's opinion is, but would it make sense to have your repo in datafusion-contrib, @Kimahriman?

If so, I would be happy to create a repo in datafusion-contrib with admin rights for @Kimahriman -- just let me know what your desired name is

@Silemo
Contributor Author

Silemo commented Apr 23, 2024

Thanks for contributing to the discussion. I am currently writing an ObjectStore implementation for HDFS that is compatible with object_store v0.10. I am currently waiting for the refactor here in delta-rs to pick up this updated dependency.

If you want to take a look at my project and give me some feedback/contribute, feel free to do so in my repository here!

@Silemo Silemo closed this as completed Apr 23, 2024
@Kimahriman

I'm not sure what @alamb's and @tustvold's opinion is, but would it make sense to have your repo in datafusion-contrib, @Kimahriman?

If so, I would be happy to create a repo in datafusion-contrib with admin rights for @Kimahriman -- just let me know what your desired name is

Might make sense to have the object store implementation part of datafusion-contrib, probably wouldn't do the whole native library since it's not datafusion/arrow specific. Would have to pull out the object store crate and make sure the testing setup still works.

@alamb
Contributor

alamb commented Apr 24, 2024

@Kimahriman

I'll probably just keep the hdfs-native-object-store crate, so maybe just use that same name for the repo? don't think datafusion really needs to be in the project name since it's under the datafusion-contrib group?

@alamb
Contributor

alamb commented Apr 24, 2024

Created https://github.com/datafusion-contrib/hdfs-native-object-store and invited you as an admin

@Kimahriman

K, it'll probably take me a bit to pull out the object store part and move it over. Also just started working on the 0.10 update, have to re-build the multipart upload implementation

@alamb
Contributor

alamb commented Apr 24, 2024

K, it'll probably take me a bit to pull out the object store part and move it over. Also just started working on the 0.10 update, have to re-build the multipart upload implementation

No worries and no rush -- you can do it in whatever timescale makes the most sense to you


6 participants