Python filesystems rewrite? #580
Alternatively, maybe what we should do is create a
I think the interface I'd like to aim for is having
I think there are three different solutions we can implement:
Personally I think your approach makes a lot of sense. There may, however, be an opportunity to support file systems that pyarrow does not support out of the box, without requiring users to provide a custom implementation. Once #573 lands we should be able to write
I haven't really thought yet about whether this could be fully transparent to Python users, but thanks to Arrow, moving record batches across language boundaries should be very cheap :).
This would require ser/de between Rust bytes and Python bytes across the language runtime on every write, wouldn't it? The current storage backend abstraction in the Rust core is not very well designed, because it requires reading the whole object into memory before passing the data down to the caller. Ideally, the abstraction should support streaming reads so we can parse JSON/Parquet in a streaming fashion. So this is another thing to consider in the future if we want to go this route.
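To make the distinction in the comment above concrete, here is a minimal, language-agnostic sketch written in Python. The function names `read_all` and `read_chunks` are hypothetical and are not part of delta-rs; the sketch only contrasts loading a whole object into memory with handing it to the caller piece by piece.

```python
import io
from typing import BinaryIO, Iterator


def read_all(obj: BinaryIO) -> bytes:
    """Whole-object read: the full payload is held in memory at once."""
    return obj.read()


def read_chunks(obj: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Streaming read: yield at most chunk_size bytes at a time."""
    while True:
        chunk = obj.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


# Hypothetical usage with an in-memory object standing in for a stored file.
payload = b'{"add": 1}\n' * 1000
chunks = list(read_chunks(io.BytesIO(payload), chunk_size=256))
```

With the streaming variant a parser can start work on the first chunk while later chunks are still being fetched, and peak memory is bounded by the chunk size rather than the object size.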
@houqp I'm now thinking it might be better to simply work on improving the delta-rs filesystems and continue wrapping those in the Python module. For streaming reads and writes, is there a particular trait we should be implementing? It's unclear to me what the
I think async read/write was added to the parquet2 crate a long time ago. Async read has been added to parquet-rs recently, see apache/arrow-rs#1154. I have been working on the parquet2 migration on and off at #465. We will likely need to support both parquet crates behind a feature flag for some time, until we are confident that parquet2 can cover all our use cases. It's supposed to be the fastest parquet implementation out there, so I am hoping that we will get a decent performance boost from the migration.
Description
This PR would deprecate the existing DeltaStorageHandler that subclasses pyarrow.FileSystem in favor of a DeltaFileSystem that has a to_pyarrow() method to get the equivalent PyArrow filesystem. The idea is that for reading and writing data files we should use the PyArrow ones, while the delta log is read by the delta-rs one.

The other approach I looked at initially is just filling in the methods
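As a rough illustration of the shape described above, the following sketch shows a wrapper object that exposes a to_pyarrow() method. Only the names DeltaFileSystem and to_pyarrow() come from the description; the table_uri field, the scheme property, and the method body are assumptions made for this example and are not the PR's actual implementation. It relies on pyarrow.fs.FileSystem.from_uri, which returns a (filesystem, path) pair.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DeltaFileSystem:
    """Hypothetical stand-in for the class proposed in this PR."""

    table_uri: str  # assumed field, e.g. "s3://bucket/table" or "/tmp/table"

    @property
    def scheme(self) -> str:
        # Plain POSIX paths have no scheme; treat them as local files.
        return urlparse(self.table_uri).scheme or "file"

    def to_pyarrow(self):
        """Return the (filesystem, path) pair PyArrow infers from the URI."""
        # Imported lazily so the wrapper can be built without pyarrow installed.
        from pyarrow import fs

        return fs.FileSystem.from_uri(self.table_uri)


# Hypothetical usage: data files would go through the PyArrow filesystem,
# while the delta log would still be read by the delta-rs side.
local = DeltaFileSystem("/tmp/table")
remote = DeltaFileSystem("s3://bucket/table")
```

Calling `local.to_pyarrow()` requires pyarrow to be installed; for an object store URI it also requires a pyarrow build with support for that store.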
Some notes on each filesystem:
So if we went with this, we'd soon have 2 of 3 file systems already supported in the writer, and one that could be supported if another library is added as an optional dependency.
Related Issue(s)
Documentation