Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Finish filesystem bindings #570

Closed
wjones127 opened this issue Mar 2, 2022 · 9 comments
Closed

Python: Finish filesystem bindings #570

wjones127 opened this issue Mar 2, 2022 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@wjones127
Copy link
Collaborator

wjones127 commented Mar 2, 2022

Description

The PyArrow filesystem bindings are only partially implemented: https://github.com/delta-io/delta-rs/blob/main/python/deltalake/fs.py

Use Case

We will need the rest of the methods implemented for the PyArrow writer (write_deltalake()) to work.

Related Issue(s)

Blocks: #542

@wjones127 wjones127 added the enhancement New feature or request label Mar 2, 2022
@wjones127
Copy link
Collaborator Author

In addition, it probably makes sense to also add in support for Windows and "older" Linux. See

unsafe fn platform_specific_rename(from: *const libc::c_char, to: *const libc::c_char) -> i32 {
cfg_if::cfg_if! {
if #[cfg(all(target_os = "linux", target_env = "gnu"))] {
cfg_if::cfg_if! {
if #[cfg(glibc_renameat2)] {
libc::renameat2(libc::AT_FDCWD, from, libc::AT_FDCWD, to, libc::RENAME_NOREPLACE)
} else {
// target has old glibc (< 2.28), we would need to invoke syscall manually
unimplemented!()
}
}
} else if #[cfg(target_os = "macos")] {
libc::renamex_np(from, to, libc::RENAME_EXCL)
} else {
unimplemented!()
}
}
}
}

@wjones127
Copy link
Collaborator Author

I am working on this one.

@wjones127
Copy link
Collaborator Author

FYI I am working on this, but will take a while. The route I am taking is working upstream to get the ObjectStore trait to a place where it can be wrapped as a PyArrow filesystem. Then we can use that in the dataset writer as well as in the delta log writer. See:

@wjones127
Copy link
Collaborator Author

Update: current plan is to use object-store-rs (soon to be moved into arrow-rs) and then wrap that as a PyArrow filesystem for the writer.

So blocked on #610 for now.

@wjones127
Copy link
Collaborator Author

We can now pass ObjectStores into PyArrow writer. But we need to be able to provide PyArrow / fsspec filesystems into DeltaTable.

fs = S3Filesystem()
dt = DeltaTable("s3://path/to/table", filesystem=fs)
dt.to_pyarrow_table() # Should use fs
write_deltalake(dt, pa.table({'x': [1,2,3]})) # should use fs
write_deltalake("s3://path/to/table", pa.table({'x': [1,2,3]}), filesystem=fs)  # should use fs

@Lundez
Copy link

Lundez commented Oct 2, 2023

Is there any ETA until this is done?

@ion-elgreco
Copy link
Collaborator

@wjones127 can you clarify what is still needed for this? I can work on it if it's still relevant

@bjornandre
Copy link

I am also waiting for this to be implemented:

raise NotImplementedError("Filesystem support is not yet implemented. #570")

@wjones127
Copy link
Collaborator Author

We aren't working on this, and might never implement this. Copying from another discussion where I explained why we didn't peruse this:

We don't write the log files with PyArrow, and that's the really important part to get right. If you mess that up, you mess up atomicity of transactions. The particular operation we need is an atomic replace if not exist or copy if not exist. That is how we write the log file to commit a transaction. ObjectStore has those, but I don't think fsspec has them.

I had considered just having two filesystems, letting the user pass a PyArrow / fsspec one for the data files and keep using object store for the log files. But then you have to keep the configuration of the two in sync. That seemed like more trouble than it was worth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants