Inconsistent behaviour of get_file using compression with different filesystems #1758

Open
mraspaud opened this issue Nov 28, 2024 · 4 comments

@mraspaud
Contributor

I am having trouble getting consistent behaviour from get_file across different filesystems when passing the compression parameter. My understanding from the AbstractFileSystem implementation of that method is that kwargs are forwarded to the open method, but on some filesystems the parameter is silently ignored.
My goal is to fetch files and decompress them on the fly. Is there a better-suited function for this?

Minimal example:

import fsspec
import bz2
from zipfile import ZipFile

# create data file
filename = "/tmp/important_data.txt.bz2"
data = b"very important data."
with open(filename, "wb") as fd:
    fd.write(bz2.compress(data))

# open with compression
print(fsspec.open(filename, compression="infer").open().read())
# prints "b'very important data.'"

# fetch from local filesystem
fsspec.filesystem("file").get_file(filename, "/tmp/new", compression="infer")
print(open("/tmp/new", "rb").read())
# prints "b'BZh91AY&SY\x85\xf4|P\x00\x00\t\x11\x80@\x01&#\xd5  \x00"\x9e\x93i\x06\xca\x10\x00\x02\xdc\xc6\x0c\xb1\xc2\xbc\xad\x16\xc7\xc5\xdc\x91N\x14$!}\x1f\x14\x00'"

# fetch from ssh filesystem
fsspec.filesystem("ssh", host="localhost").get_file(filename, "/tmp/new", compression="infer")
print(open("/tmp/new", "rb").read())
# prints "b'BZh91AY&SY\x85\xf4|P\x00\x00\t\x11\x80@\x01&#\xd5  \x00"\x9e\x93i\x06\xca\x10\x00\x02\xdc\xc6\x0c\xb1\xc2\xbc\xad\x16\xc7\xc5\xdc\x91N\x14$!}\x1f\x14\x00'"

# fetch from zip filesystem
zfile = filename + ".zip"
with ZipFile(zfile, 'w') as zipf:
    zipf.write(filename)
of = fsspec.open("zip://" + filename + "::file://" + zfile)
of.fs.get_file(filename, "/tmp/new", compression="infer")
print(open("/tmp/new", "rb").read())
# prints "b'very important data.'"
@martindurant
Member

The fallback implementation of get_file goes through open(), so extra kwargs like compression get passed down. However, many filesystem backends have more specialised get_file methods that allow better operation, such as parallel downloading. In such cases we are not necessarily streaming the bytes, so on-the-fly decompression would not be possible anyway.
I think we should say that only open() is guaranteed to layer file-like objects for decompression or text mode.
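
Since only open() is described above as reliably layering decompression, one way to fetch and decompress in a single pass is to stream through open() instead of calling get_file. The following is a minimal sketch, not an fsspec API: the helper name fetch_decompressed and the temporary paths are made up for illustration, and it assumes the backend's open() accepts the compression keyword as AbstractFileSystem.open does.

```python
import bz2
import os
import shutil
import tempfile

import fsspec


def fetch_decompressed(fs, rpath, lpath, compression="infer"):
    """Copy rpath from fs to the local path lpath, decompressing while streaming.

    Hypothetical helper: it goes through fs.open() rather than fs.get_file(),
    so the compression argument is applied by the file-like layer.
    """
    with fs.open(rpath, "rb", compression=compression) as src, open(lpath, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Build a small bz2 file in a temporary directory (paths are illustrative).
tmpdir = tempfile.mkdtemp()
source = os.path.join(tmpdir, "important_data.txt.bz2")
target = os.path.join(tmpdir, "new")
with open(source, "wb") as fd:
    fd.write(bz2.compress(b"very important data."))

# Local filesystem here; any filesystem instance with the same open() signature
# should behave the same way.
fetch_decompressed(fsspec.filesystem("file"), source, target)
with open(target, "rb") as fd:
    result = fd.read()
print(result)  # b'very important data.'
```

The trade-off is the one mentioned above: this path streams the bytes through a single file-like object, so it gives up whatever optimisations (such as parallel downloading) a backend's specialised get_file provides.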

@mraspaud
Contributor Author

mraspaud commented Dec 5, 2024

@martindurant thanks for the clarification! I understand that I will have to implement a custom solution for my use case.
But I think my point about the kwargs being silently ignored still stands. Wouldn’t it be better to raise an error in such a case?

@martindurant
Member

A general problem throughout the fsspec code is that there are many places kwargs can get passed to, including general-purpose arguments to the third-party backend libraries. Therefore, most methods only extract the arguments they need and pass everything else along, and whether you get an exception or not depends on how the third-party package is called and what it expects.
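
Because the library itself cannot easily tell which kwargs a backend will honour, a caller who wants a loud failure can add a guard on their own side. The sketch below is a heuristic only, and the name get_file_strict is hypothetical: it assumes that a backend which does not override AbstractFileSystem.get_file uses the open()-based fallback (and so applies compression), while a backend with its own get_file may ignore the argument.

```python
import os
import tempfile

import fsspec
from fsspec.spec import AbstractFileSystem


def get_file_strict(fs, rpath, lpath, **kwargs):
    """Call fs.get_file, refusing `compression` when it may be silently dropped.

    Heuristic, user-side check (not part of fsspec): a filesystem class that
    defines its own get_file is treated as not supporting `compression`.
    """
    uses_fallback = type(fs).get_file is AbstractFileSystem.get_file
    if "compression" in kwargs and not uses_fallback:
        raise TypeError(
            f"{type(fs).__name__}.get_file is a specialised implementation and "
            "may ignore 'compression'; stream through open() instead"
        )
    return fs.get_file(rpath, lpath, **kwargs)


fs = fsspec.filesystem("file")
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.bin")
dst = os.path.join(tmpdir, "copy.bin")
with open(src, "wb") as fd:
    fd.write(b"payload")

# Without `compression` the call is passed straight through.
get_file_strict(fs, src, dst)

# With `compression` on the local filesystem the guard raises instead of
# silently copying the raw bytes.
try:
    get_file_strict(fs, src, dst, compression="infer")
    guard_raised = False
except TypeError as exc:
    guard_raised = True
    print(exc)
```

The check can be wrong in both directions (a specialised get_file might still honour the argument), so it is best read as a conservative safety net for one's own code rather than a statement about what any backend supports.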

@mraspaud
Contributor Author

mraspaud commented Dec 6, 2024

I understand, thanks for the explanation. Feel free to close this then.
