-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add write and delete operations to ObjectStore #2246
Conversation
This would also close #1777 task 1 |
71b9bb0
to
9207569
Compare
I've written the API to take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far, I've left some relatively minor comments.
unimplemented!(); | ||
} | ||
|
||
async fn create_dir( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was curious what this corresponded to given that object stores don't have a concept of a directory. It would appear that at least in the case of the S3 implementation, it creates empty objects as pretend directories - https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc
I'm not sure how I feel about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I didn't know that was the behavior in C++ Arrow. I started a sibling PR to implement these methods for S3 in datafusion-contrib/datafusion-objectstore-s3#54, so I might prototype some more there and see if that's necessary in any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like fsspec implementation doesn't create empty objects as directories, so I think it is sensible to do the same.
It's not totally a no-op though for cases like S3. You might want to have create_dir
create a bucket if it doesn't exist (though I'd like for there to be an option to turn off bucket creation). And it should probably check that there isn't an existing file at the path you are trying to create a directory at.
I think what I will do is document the expected behavior and create a generic test that can be run on an arbitrary filesystem to help implementation make sure they follow expected behavior.
unimplemented!(); | ||
} | ||
|
||
async fn rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any object stores that support native rename, I think this will need to be implemented as a copy and remove. Perhaps we could provide this as the default implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, in Rust you can't override implementations from a trait, so I'm reluctant to provide a default implementation. But agreed that as far as I can see that does seem to be the pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok(Box::pin(file)) | ||
} | ||
|
||
fn sync_writer(&self) -> Result<Box<dyn Write + Send + Sync>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI currently writing parquet requires Seek - but there is a suggestion here on how that could be removed apache/arrow-rs#937
@@ -101,4 +110,57 @@ pub trait ObjectStore: Sync + Send + Debug { | |||
|
|||
/// Get object reader for one file | |||
fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to define somewhere the atomicity requirements of implementations. For example, S3 does not provide create if not exists APIs, and therefore cannot guarantee atomic writes.
Similarly if rename, copy, etc... is called concurrently with a writer, the behaviour would likely be implementation defined. Renaming a local file would rename the writer, deleting a local file will vary if Windows or POSIX, etc...
I think it is fine to just say this is not supported, but if people use this trait and expect it to behave like a filesystem, they're likely to be surprised 😅
Closing in favor of work on https://github.com/influxdata/object_store_rs |
Which issue does this PR close?
Closes #2185.
Rationale for this change
This PR adds other filesystem operations so we can build writers without each writer having to re-implement details of each filesystem. The API design was based on the C++ Arrow filesystem with the method names altered to match canonical Rust names from
std::fs
.What changes are included in this PR?
ObjectStore
traitObjectStore
traitAre there any user-facing changes?