feat: integrate object_store for read/write with pyarrow #799

Merged
3 commits merged on Sep 12, 2022

Collaborator

@roeap roeap commented Sep 11, 2022

Description

This PR adopts the object_store crate on the Python side as well and, at least for the current test suite, supports reading and writing through the dataset and other pyarrow APIs. Thanks to @wjones127's multipart upload in object_store, implementing the write functionality was actually quite straightforward. We now implement ObjectInputFile and ObjectOutputStream, which, if wrapped in pyarrow.PythonFile, will work with the Arrow ecosystem (so far ;))
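To illustrate the kind of object that pyarrow.PythonFile can wrap for reading, here is a minimal, standard-library-only sketch of a seekable file-like reader. The class name `InMemoryObjectFile` and its in-memory byte buffer are hypothetical stand-ins for illustration; this is not the ObjectInputFile added in this PR, which is backed by object_store.

```python
import io


class InMemoryObjectFile(io.RawIOBase):
    """Hypothetical stand-in for a store-backed reader (NOT the deltalake class)."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._data = data
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Resolve the new position relative to start, current position, or end.
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = len(self._data) + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer) -> int:
        # Copy as many bytes as fit into the caller's buffer; 0 signals EOF.
        chunk = self._data[self._pos : self._pos + len(buffer)]
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


handle = InMemoryObjectFile(b"PAR1 example bytes")
handle.seek(5)          # skip the first five bytes
tail = handle.read()    # read everything from the current position
```

Assuming pyarrow is installed, an object with this shape could then be passed to `pyarrow.PythonFile(handle, mode="r")` to obtain an Arrow-compatible input file.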

There was one bigger, breaking design decision, but I hope people agree :).

Essentially I just accepted that working exclusively with the relative delta paths makes life much more convenient. So rather than adding and removing path prefixes all the time, I thought it would be reasonable to ask users to wrap their own filesystems in a pyarrow.fs.SubTreeFileSystem that points at the table root.

import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)

Of course the hope is to eventually reach performance comparable to (or higher than :)) that of the C++ file systems. Then there would be little reason (I guess) to still provide a "custom" file system at all.

Related Issue(s)

closes (at least to some degree) #570
closes #574
closes #696
closes #689
towards #542

Documentation

Member

@houqp houqp left a comment

LGTM, great work!

@roeap roeap merged commit 58c0d0f into delta-io:main Sep 12, 2022
@roeap roeap deleted the arrow-fs branch September 12, 2022 06:43