feat: adding HDFS support in the object_store crate #5638
Comments
I am not sure we have the spare maintenance capacity to add and maintain a first-party integration for HDFS to object_store at this time. Is there a particular reason you do not wish to use either of the third-party integrations linked?
I think what you are doing is the right thing -- have the discussion before you make a large PR
As @tustvold hints at above, the first thing you should consider is figuring out who will be able to maintain the code. As you are hinting at, an Maybe you can ask in https://github.com/datafusion-contrib/datafusion-objectstore-hdfs if others are interested in helping maintain such a crate. There are clearly others who need this functionality too.
There has also been interest from delta-rs users; perhaps you can find some companions there to help :)
FWIW, whilst it is deprecated, I believe the MinIO gateway can proxy HDFS to S3 and is likely more performant than webhdfs.
This is where I am coming from. Seeing how little interest there is right now (and given my lack of experience to carry out a whole first-party implementation alone), I will probably resort to (and fix) the third-party implementation already present: https://github.com/datafusion-contrib/datafusion-objectstore-hdfs .
Having had some discussion with my colleagues, I don't think this approach would actually be more performant. But thanks for the heads up.
First of all, I agree with @tustvold and @alamb: arrow maintainers should not take on this responsibility, as an HDFS store is a bit more complicated than object stores. IMHO, there are two directions which could be taken:
and there are few bindings generated for it, one of which is https://github.com/datafusion-contrib/hdfs-native pros:
cons:
pros:
cons:
To be clear, I was referring to webhdfs, which is a Java Jetty-based proxy with notoriously poor performance. The MinIO gateway likely would not perform favourably against a native implementation, but it might be good enough, at least for a POC. That being said, in general object stores have high first-byte latencies, so clients are already set up to not be too sensitive to this. Assuming the gateway is provisioned appropriately and isn't throughput constrained, it therefore might not be a terrible approach. I wonder if you could even run it as a sidecar 🤔 I dunno, HDFS is from the era of fat Java clients, so implementing a native client is a fairly significant undertaking, which I suspect will be hard given the size of the Rust community and the dwindling HDFS install base.
Thanks for the call out! I agree there's no need to have HDFS support directly in this repo, since the trait is public and it's a tricky thing to support. I actually have an object_store implementation on top of my library already: https://github.com/Kimahriman/hdfs-native/tree/master/crates/hdfs-native-object-store. I've gotten pretty far with it at this point. I have some benchmarks that show reading/writing is at least on par with the libhdfs-based client, and RPC calls are even faster. I suspect performance would be even better in real scenarios, since the JVM client makes heavy use of multi-threading, which would help single-task benchmarks compared to my async setup. The only major feature I'm tracking that is not supported right now is file encryption support via KMS. Not sure how widely that is used. The other limitations right now are
It's also not super heavily battle tested in various HDFS setups, but I haven't heard much yet of things not working for the few people who might be using it. I've been meaning to try to get it integrated with
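The point that a public trait lets HDFS support live outside the main repository can be illustrated with a minimal sketch. It uses a hypothetical, heavily simplified `SimpleStore` trait, not the real `object_store::ObjectStore` trait (which is async and has many more methods), and a hypothetical `FakeHdfsStore` backed by an in-memory map instead of a real HDFS client. All names below are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch (not the object_store error type).
#[derive(Debug, PartialEq)]
pub enum StoreError {
    NotFound(String),
}

/// Hypothetical, simplified stand-in for a public storage trait.
/// The real `ObjectStore` trait is async and considerably larger.
pub trait SimpleStore {
    fn put(&mut self, path: &str, data: Vec<u8>) -> Result<(), StoreError>;
    fn get(&self, path: &str) -> Result<Vec<u8>, StoreError>;
}

/// Hypothetical out-of-tree backend. A real HDFS backend would talk to the
/// cluster; this one only keeps bytes in a HashMap.
#[derive(Default)]
pub struct FakeHdfsStore {
    files: HashMap<String, Vec<u8>>,
}

impl SimpleStore for FakeHdfsStore {
    fn put(&mut self, path: &str, data: Vec<u8>) -> Result<(), StoreError> {
        self.files.insert(path.to_string(), data);
        Ok(())
    }

    fn get(&self, path: &str) -> Result<Vec<u8>, StoreError> {
        self.files
            .get(path)
            .cloned()
            .ok_or_else(|| StoreError::NotFound(path.to_string()))
    }
}

/// Downstream code depends only on the trait, so it works with any backend,
/// whether it ships in the main crate or in a third-party one.
/// Returns the number of bytes copied.
pub fn copy_object(
    store: &mut dyn SimpleStore,
    from: &str,
    to: &str,
) -> Result<usize, StoreError> {
    let data = store.get(from)?;
    let len = data.len();
    store.put(to, data)?;
    Ok(len)
}
```

Because `copy_object` only sees `dyn SimpleStore`, swapping the in-memory backend for a libhdfs-based or native one would not require changes to the calling code, which is the property this thread relies on when suggesting a separately maintained crate.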
I'm not sure what @alamb's and @tustvold's opinion is; would it make sense to have your repo in datafusion-contrib, @Kimahriman?
If so, I would be happy to create a repo in datafusion-contrib with admin rights for @Kimahriman -- just let me know what your desired name is
Thanks for contributing to the discussion. I am currently writing an ObjectStore implementation for HDFS that is compatible with object_store v0.10. I am currently waiting for the refactor here in delta-rs to pick up this updated dependency. If you want to take a look at my project and give me some feedback or contribute, feel free to do so in my repository here!
Might make sense to have the object store implementation be part of datafusion-contrib; I probably wouldn't move the whole native library, since it's not datafusion/arrow specific. I would have to pull out the object store crate and make sure the testing setup still works.
https://github.com/datafusion-contrib/datafusion-objectstore-hdfs already exists. Would you like to make one like https://github.com/datafusion-contrib/datafusion-objectstore-hdfs-native ?
I'll probably just keep the hdfs-native-object-store crate, so maybe just use that same name for the repo? don't think
Created https://github.com/datafusion-contrib/hdfs-native-object-store and invited you as an admin
K, it'll probably take me a bit to pull out the object store part and move it over. Also just started working on the 0.10 update, have to re-build the multipart upload implementation
No worries and no rush -- you can do it in whatever timescale makes the most sense to you |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Young developer here. As part of a larger project I am currently working on, I need to access HDFS using your object_store interface. I am currently looking at a couple of repositories that tried adding support for HDFS, either through a wrapper around libhdfs or through native support.
Describe the solution you'd like
I would like to make these solutions general enough so that they can be added to this repository as part of the object_store crate.
Describe alternatives you've considered
The choice that needs to be made is between using https://github.com/datafusion-contrib/datafusion-objectstore-hdfs/tree/master , which is a wrapper around libhdfs, and trying to generalise the solution of https://github.com/Kimahriman/hdfs-native .
Additional context
As I am a young developer, I have a limited idea of which best practices and design choices are best suited for this task, and I could use some help. Furthermore, I could benefit from the feedback of someone who has contributed directly to the object_store crate.
Here's a list of questions that might help me get started; feel free to add any comment or recommendation:
Thanks in advance for your response