Adopt influxdata/object_store_rs? #610
Comments
I think this is a great idea. As you said, in a first step it would reduce complexity in this crate. When adopted by DataFusion, it would also greatly simplify the DataFusion interop in the TableProvider and the operations. One thing to mention is that right now it does not support ADLS Gen2, which we need for atomic rename operations.
+1 from me to push the Rust data eng community towards a single, feature-rich, and highly optimized object store crate!
We still need to add the dynamodb lock integration to object_store_rs before it's ready for us to start integrating. I suspect we will have to write some adapters for the object stores to instantiate them with the various configuration options we have. Anything else we need to do?

One important difference between our implementation and object_store_rs is that it implements `rename` as two separate calls (copy, then delete) rather than a single atomic operation. In the Azure case, I think if we keep it as two calls we can drop the requirement for ADLS Gen2 and just support Blob storage generally, right? Would we see that as an improvement? cc @roeap @thovoll
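A rough sketch of the difference being discussed, assuming `copy`/`delete` methods shaped like those in today's object_store crate (the names and signatures here are assumptions, not the crate's confirmed API at the time):

```rust
use object_store::{path::Path, ObjectStore};

// Rename emulated as two calls: works against plain Azure Blob storage
// (no ADLS Gen2 needed for a server-side copy), but is not atomic:
// a crash between the two calls leaves both objects behind.
async fn rename_via_copy(
    store: &dyn ObjectStore,
    from: &Path,
    to: &Path,
) -> object_store::Result<()> {
    store.copy(from, to).await?;
    store.delete(from).await // contrast: a true atomic rename is one call
}
```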
I think we only ever use `rename_obj_noreplace`.
With the broader support, we also get the ability to use Azurite for integration tests, which is a big win. The biggest loss might be the directory semantics for list operations. Then again, we just made that call recursive for the vacuum operations. When listing the delta log there should not be much difference, since we filter by prefix and it's flat. When listing files it would be nice to exclude all log and index files etc., but we need to traverse into partition folders anyhow.

Another open point in object_store_rs is the authorization options. Right now I think Azure and S3 only support authorization using a master key, while GCP uses an OAuth flow based on a client certificate. From what I can see, we could try to make the token provider trait-based over there and keep using the client libraries over here, or implement more OAuth flows. This would leave managed identities, which AFAIK in Azure do not count as an official OAuth flow. Given what people have been asking for, and also what's needed on my end, master key, client secrets, client certificates, and managed identities should all be supported.

More generally speaking, I have been hoping to maybe provide more dedicated stores (with a shared object_store): one for the log and one for interacting with the data files. But that's probably a larger discussion. However, we should probably think about how we want to leverage the …
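For illustration, a sketch of the flat prefix listing of the delta log mentioned above, assuming the object_store crate's list API (the exact signature has varied across versions):

```rust
use futures::TryStreamExt;
use object_store::{path::Path, ObjectMeta, ObjectStore};

// The delta log lives flat under `_delta_log/`, so a recursive prefix
// listing returns exactly the commit files; no directory semantics needed.
async fn list_delta_log(
    store: &dyn ObjectStore,
    table_root: &Path,
) -> object_store::Result<Vec<ObjectMeta>> {
    let log_prefix = table_root.child("_delta_log");
    store.list(Some(&log_prefix)).try_collect().await
}
```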
I think HDFS will also be able to leverage this single-atomic-operation optimization. With the wrapper trait design you proposed in the description, we should still be able to keep our current atomic rename implementation while migrating the rest to object_store_rs, no?

For local files, the performance difference probably won't matter too much, because a hardlink only incurs a metadata copy, so it's cheap. The only minor downside is leftover files on a crash. But I am not so sure about Azure and HDFS: if the copy call in those object stores requires copying the actual object data, then it will increase the commit latency by a lot.

Considering the code to perform atomic rename for both local files and Azure is so small (<100 LoC), I would suggest we keep it for a better user experience. Getting rid of it doesn't bring any value to our users; it's a net negative for them. It will delete 150 lines of code from our point of view, but I don't think that specialized code actually increases the maintenance burden for us.
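A sketch of the local-file rename-without-replace trick described above (a hypothetical helper, not the crate's actual code):

```rust
use std::{fs, io, path::Path};

// hard_link fails with AlreadyExists if `to` exists, giving us
// rename-without-replace semantics via a cheap metadata-only operation.
fn rename_noreplace(from: &Path, to: &Path) -> io::Result<()> {
    fs::hard_link(from, to)?;
    // The minor downside mentioned above: a crash before this line
    // leaves the source file behind, but the commit itself is intact.
    fs::remove_file(from)
}
```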
In an ideal world, I think it's best if we don't require ADLS Gen2 for Azure, but make it an optional feature that users can enable if they want lower commit latency, similar to how we make the dynamodb lock an optional feature for the S3 store. So yeah, if we can drop the requirement for ADLS2, I would agree it's an improvement :)
I'm now thinking a good transition path is to: …
Agreed! Also very much looking forward to all the dependencies we are going to lose :) The AWS migration already landed, and Azure is on its way (apache/arrow-rs#2509).

Do we already know what the target design for the object store looks like in general? To the best of my recollection we are somewhere at: …
With the second point I am not sure what the best approach would be: have it handled in a wrapper store, or just wrap the S3 store to handle that? Writing this, I think the former might be better.

As for when the object_store_rs stores are ready for use here: the main point I see is authorization. From what I can tell, S3 already supports a sufficient number of options, and Azure supports access key and service principal auth, which is on par with what we have here. Soon I hope to add managed identity support as well.
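A sketch of the wrapper-store option, with `LockClient` as a hypothetical stand-in for the dynamodb lock client (everything here is illustrative, not settled design):

```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectStore};

// Wrapper store: delegate all plain operations to the inner ObjectStore
// and layer rename-without-replace on top, backed by an external lock
// where the backend (e.g. S3) cannot provide atomicity natively.
struct DeltaObjectStore<L> {
    inner: Arc<dyn ObjectStore>,
    lock: Option<L>, // e.g. a dynamodb-based lock client for S3
}

impl<L> DeltaObjectStore<L> {
    async fn rename_obj_noreplace(
        &self,
        from: &Path,
        to: &Path,
    ) -> object_store::Result<()> {
        // Real code would acquire and release `self.lock` around this,
        // and handle lock renewal and failure; elided in this sketch.
        self.inner.copy_if_not_exists(from, to).await?;
        self.inner.delete(from).await
    }
}
```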
Description
This crate maintains its own notion of `StorageBackend`, but it might be more than we need to maintain for Delta Lake support. influxdata/object_store_rs implements an `ObjectStore` trait that covers the basics we need and provides a point of collaboration with the rest of the Rust data ecosystem. It looks likely that DataFusion will also adopt this crate eventually instead of its internal `ObjectStore` trait, which is less complete (see apache/datafusion#2489).

If `ObjectStore` exposed the underlying client, it seems like we could simply make `StorageBackend` depend on that trait and only have to maintain the `rename_obj_noreplace()` method:
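The original snippet isn't reproduced above; a minimal sketch of the idea, assuming the `async_trait` crate and object_store_rs's `ObjectStore` trait, might look like:

```rust
use async_trait::async_trait;
use object_store::{path::Path, ObjectStore};

// StorageBackend shrinks to a thin extension of ObjectStore: everything
// else (get, put, list, delete, ...) comes from the shared trait, and we
// only maintain the one operation object_store_rs doesn't provide.
#[async_trait]
trait StorageBackend: ObjectStore {
    /// Atomically rename `from` to `to`, failing if `to` already exists.
    async fn rename_obj_noreplace(
        &self,
        from: &Path,
        to: &Path,
    ) -> object_store::Result<()>;
}
```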
What do we think of that?