Add write and delete operations to ObjectStore #2246

Closed
wants to merge 14 commits into from

wjones127
Member

@wjones127 wjones127 commented Apr 16, 2022

Which issue does this PR close?

Closes #2185.

Rationale for this change

This PR adds other filesystem operations so we can build writers without each writer having to re-implement details of each filesystem. The API design was based on the C++ Arrow filesystem with the method names altered to match canonical Rust names from std::fs.

What changes are included in this PR?

  • Add writer to ObjectStore trait
  • Implement writer for local filesystem
  • Add delete functions to ObjectStore trait
  • Implement delete functions for local filesystem
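As a rough illustration of the shape of these additions, here is a minimal synchronous sketch. The trait `WritableStore`, the type `LocalStore`, and the method names are hypothetical stand-ins chosen to echo `std::fs`; the PR itself extends the async `ObjectStore` trait, and its exact signatures are not reproduced here.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::PathBuf;

/// Hypothetical, simplified stand-in for the write/delete part of `ObjectStore`.
pub trait WritableStore {
    /// Open a writer for the object at `path`, truncating any existing content.
    fn file_writer(&self, path: &str) -> io::Result<Box<dyn Write + Send>>;
    /// Remove a single object.
    fn remove_file(&self, path: &str) -> io::Result<()>;
    /// Remove a directory and everything under it.
    fn remove_dir_all(&self, path: &str) -> io::Result<()>;
}

/// Local filesystem store rooted at `root` (hypothetical type).
pub struct LocalStore {
    pub root: PathBuf,
}

impl WritableStore for LocalStore {
    fn file_writer(&self, path: &str) -> io::Result<Box<dyn Write + Send>> {
        let full = self.root.join(path);
        // Create missing parent directories so callers need not do it.
        if let Some(parent) = full.parent() {
            fs::create_dir_all(parent)?;
        }
        Ok(Box::new(File::create(full)?))
    }

    fn remove_file(&self, path: &str) -> io::Result<()> {
        fs::remove_file(self.root.join(path))
    }

    fn remove_dir_all(&self, path: &str) -> io::Result<()> {
        fs::remove_dir_all(self.root.join(path))
    }
}
```

A writer built on such a trait only needs the `Write` handle and never touches `std::fs` directly, which is the decoupling the rationale describes.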

Are there any user-facing changes?

@wjones127 wjones127 changed the title from "Sketch out write operations" to "Add write and delete operations to ObjectStore" Apr 16, 2022
@matthewmturner
Contributor

This would also close #1777 task 1

@alamb
Contributor

alamb commented Apr 17, 2022

cc @yjshen @tustvold

@wjones127 wjones127 force-pushed the object-store-write branch from 71b9bb0 to 9207569 Compare April 17, 2022 22:46
@wjones127
Member Author

I've written the API to take &str for paths, but maybe the better thing to do is to take AsRef<Path>? I think that would mean you couldn't pass in paths that include the scheme (e.g. file://path/to/thing), but is that already the expectation?
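To make the trade-off concrete, here is a small sketch of what an `AsRef<Path>` signature would accept. Both helpers (`resolve` and `strip_file_scheme`) are hypothetical and not part of the PR. A generic `AsRef<Path>` parameter takes `&str`, `String`, `&Path`, and `PathBuf` alike, but it treats its argument purely as a path, so a scheme prefix would have to be stripped by the caller first.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: accepts anything path-like and joins it onto `root`.
pub fn resolve<P: AsRef<Path>>(root: &Path, path: P) -> PathBuf {
    root.join(path)
}

/// Hypothetical helper: strip a leading `file://` scheme if one is present,
/// otherwise return the input unchanged.
pub fn strip_file_scheme(uri: &str) -> &str {
    uri.strip_prefix("file://").unwrap_or(uri)
}
```

With this split, scheme handling lives in one place (for example, wherever the store is selected), and the store methods themselves only ever see plain paths.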

@github-actions github-actions bot added the ballista and datafusion labels Apr 19, 2022
@wjones127 wjones127 marked this pull request as ready for review April 19, 2022 04:32
Contributor

@tustvold tustvold left a comment

Looking good so far, I've left some relatively minor comments.

unimplemented!();
}

async fn create_dir(
Contributor

I was curious what this corresponded to given that object stores don't have a concept of a directory. It would appear that at least in the case of the S3 implementation, it creates empty objects as pretend directories - https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc

I'm not sure how I feel about this

Member Author

That's a good point. I didn't know that was the behavior in C++ Arrow. I started a sibling PR to implement these methods for S3 in datafusion-contrib/datafusion-objectstore-s3#54, so I might prototype some more there and see if that's necessary in any way.

Member Author

It looks like the fsspec implementation doesn't create empty objects as directories, so I think it is sensible to do the same.

It's not totally a no-op though for cases like S3. You might want to have create_dir create a bucket if it doesn't exist (though I'd like for there to be an option to turn off bucket creation). And it should probably check that there isn't an existing file at the path you are trying to create a directory at.

I think what I will do is document the expected behavior and create a generic test that can be run on an arbitrary filesystem to help implementations make sure they follow the expected behavior.
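A generic check of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `DirStore` trait, the in-memory `MemStore`, and the specific expectations (that `create_dir` creates no placeholder object, can be called twice, and fails when a file already occupies the path). The PR does not specify these rules.

```rust
use std::collections::BTreeMap;
use std::io;

/// Hypothetical minimal store interface used only for this sketch.
pub trait DirStore {
    fn put(&mut self, path: &str, data: &[u8]) -> io::Result<()>;
    fn exists(&self, path: &str) -> bool;
    fn create_dir(&mut self, path: &str) -> io::Result<()>;
}

/// In-memory store with flat keys, similar in spirit to an object store.
#[derive(Default)]
pub struct MemStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl DirStore for MemStore {
    fn put(&mut self, path: &str, data: &[u8]) -> io::Result<()> {
        self.objects.insert(path.to_string(), data.to_vec());
        Ok(())
    }

    fn exists(&self, path: &str) -> bool {
        self.objects.contains_key(path)
    }

    fn create_dir(&mut self, path: &str) -> io::Result<()> {
        // No placeholder object is written; only reject a clash with a file.
        if self.objects.contains_key(path.trim_end_matches('/')) {
            return Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                "a file exists at this path",
            ));
        }
        Ok(())
    }
}

/// Generic check that any `DirStore` implementation could run.
pub fn check_create_dir<S: DirStore>(store: &mut S) -> io::Result<()> {
    store.create_dir("dir/")?;
    store.create_dir("dir/")?; // assumed: calling it twice is not an error
    store.put("dir/file", b"x")?;
    assert!(store.exists("dir/file"));
    store.put("plain", b"y")?;
    // Assumed: creating a directory where a file exists must fail.
    assert!(store.create_dir("plain").is_err());
    Ok(())
}
```

Because the check is generic over the trait, the same function could be reused by a local filesystem store and by an S3-backed store.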

unimplemented!();
}

async fn rename(
Contributor

I'm not aware of any object stores that support native rename; I think this will need to be implemented as a copy and remove. Perhaps we could provide this as the default implementation?

Member Author

AFAIK, in Rust you can't override implementations from a trait, so I'm reluctant to provide a default implementation. But agreed that as far as I can see that does seem to be the pattern.
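For reference, Rust does let an implementor override a trait method that has a default body, so a copy-then-remove default would not prevent a store with native rename from supplying its own. The sketch below uses hypothetical names (`Store`, `CopyingStore`, `NativeRenameStore`) and a synchronous, in-memory model rather than the PR's async trait.

```rust
use std::collections::HashMap;
use std::io;

fn not_found() -> io::Error {
    io::Error::new(io::ErrorKind::NotFound, "no such object")
}

/// Hypothetical trait with a default `rename` built from `copy` + `remove_file`.
pub trait Store {
    fn copy(&mut self, from: &str, to: &str) -> io::Result<()>;
    fn remove_file(&mut self, path: &str) -> io::Result<()>;

    /// Default: copy, then remove the source. This is not atomic.
    fn rename(&mut self, from: &str, to: &str) -> io::Result<()> {
        self.copy(from, to)?;
        self.remove_file(from)
    }
}

/// Store that relies on the default `rename`.
#[derive(Default)]
pub struct CopyingStore {
    pub objects: HashMap<String, Vec<u8>>,
}

impl Store for CopyingStore {
    fn copy(&mut self, from: &str, to: &str) -> io::Result<()> {
        let data = self.objects.get(from).cloned().ok_or_else(not_found)?;
        self.objects.insert(to.to_string(), data);
        Ok(())
    }

    fn remove_file(&mut self, path: &str) -> io::Result<()> {
        self.objects.remove(path).map(|_| ()).ok_or_else(not_found)
    }
}

/// Store that overrides the default with a single move operation.
#[derive(Default)]
pub struct NativeRenameStore(pub CopyingStore);

impl Store for NativeRenameStore {
    fn copy(&mut self, from: &str, to: &str) -> io::Result<()> {
        self.0.copy(from, to)
    }

    fn remove_file(&mut self, path: &str) -> io::Result<()> {
        self.0.remove_file(path)
    }

    // Overrides the default body; no intermediate copy is made.
    fn rename(&mut self, from: &str, to: &str) -> io::Result<()> {
        let data = self.0.objects.remove(from).ok_or_else(not_found)?;
        self.0.objects.insert(to.to_string(), data);
        Ok(())
    }
}
```

The remaining design question is whether a non-atomic default is an acceptable fallback, which ties into the atomicity discussion further down the thread.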

Contributor

Ok(Box::pin(file))
}

fn sync_writer(&self) -> Result<Box<dyn Write + Send + Sync>> {
Contributor

FYI currently writing parquet requires Seek - but there is a suggestion here on how that could be removed apache/arrow-rs#937
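One general way a writer can do without `Seek` is to count the bytes it has written, so that offsets can be recorded without querying the stream position. The wrapper below is a generic sketch of that idea; the name `TrackedWrite` is hypothetical, and whether this matches the suggestion in the linked issue is an assumption.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper that counts bytes passed through to `inner`.
pub struct TrackedWrite<W: Write> {
    inner: W,
    bytes_written: u64,
}

impl<W: Write> TrackedWrite<W> {
    pub fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Current logical offset, usable where a seek-based position query
    /// would otherwise be needed.
    pub fn bytes_written(&self) -> u64 {
        self.bytes_written
    }

    pub fn into_inner(self) -> W {
        self.inner
    }
}

impl<W: Write> Write for TrackedWrite<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only what the inner writer actually accepted.
        let n = self.inner.write(buf)?;
        self.bytes_written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

This only helps formats that append sequentially and never rewrite earlier bytes, which fits the upload model of most object stores.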

@@ -101,4 +110,57 @@ pub trait ObjectStore: Sync + Send + Debug {

/// Get object reader for one file
fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;

Contributor

I think it would be good to define somewhere the atomicity requirements of implementations. For example, S3 does not provide a create-if-not-exists API, and therefore cannot guarantee atomic writes.

Similarly, if rename, copy, etc. are called concurrently with a writer, the behaviour would likely be implementation-defined. Renaming a local file would carry the open writer along with it, deleting a local file behaves differently on Windows and POSIX, etc.

I think it is fine to just say this is not supported, but if people use this trait and expect it to behave like a filesystem, they're likely to be surprised 😅
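To illustrate the gap being described: on a local filesystem, `std::fs::OpenOptions::create_new` gives an atomic create-if-not-exists, which is the kind of guarantee the comment says S3 lacks. The helper name `write_if_absent` is hypothetical. Only the existence check and the creation are atomic here; the write of the contents that follows is not.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: write `data` to `path` only if nothing exists there yet.
/// Fails with `ErrorKind::AlreadyExists` if the path is already taken.
pub fn write_if_absent(path: &Path, data: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // check-and-create happens as one atomic step
        .open(path)?;
    file.write_all(data)
}
```

A trait that wants to stay implementable on every backend would either have to leave this guarantee out of its contract or document it as optional.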

@wjones127 wjones127 marked this pull request as draft April 25, 2022 02:05
@wjones127
Member Author

Closing in favor of work on https://github.com/influxdata/object_store_rs

@wjones127 wjones127 closed this Jun 3, 2022
Labels
datafusion Changes in the datafusion crate
Development

Successfully merging this pull request may close these issues.

ObjectStore write support
4 participants