Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating ZipStore with file-like object #1018

Open
oeway opened this issue May 3, 2022 · 22 comments
Open

Creating ZipStore with file-like object #1018

oeway opened this issue May 3, 2022 · 22 comments

Comments

@oeway
Copy link
Contributor

oeway commented May 3, 2022

The current ZipStore implementation assumes the input is always a path which becomes a bit limiting when dealing with in-memory file or virtual file object (e.g. in Pyodide in the browser). The virtual file object support is crucial for making zarr library useful in the browser environment, since the emscripten file system itself are rather limited at the moment.

It would be nice if we can allow passing a file-like object. I did a test with a small modification in the init function and it seems to be working nicely in Pyodide/JupyterLite:

from threading import Lock, RLock
import zipfile

class InMemoryZipStore(zarr.ZipStore):
    def __init__(self, path, compression=zipfile.ZIP_STORED, allowZip64=True, mode='a',
                 dimension_separator=None):

        # store properties
        self.path = None # TODO: This need to be handled properly for os.PathLike or file-like object
        self.compression = compression
        self.allowZip64 = allowZip64
        self.mode = mode
        self._dimension_separator = dimension_separator

        # Current understanding is that zipfile module in stdlib is not thread-safe,
        # and so locking is required for both read and write. However, this has not
        # been investigated in detail, perhaps no lock is needed if mode='r'.
        self.mutex = RLock()

        # open zip file
        self.zf = zipfile.ZipFile(path, mode=mode, compression=compression,
                                  allowZip64=allowZip64)

I will maintain this piece of code somewhere for now, but it would be great if we can support this from upstream and eventually have it in pyodide.

@joshmoore
Copy link
Member

Hey @oeway. Thanks for the suggestion! In your mind is what we need the addition of an optional argument for path? Seems like the only possible downside of that would be if someone was expecting the exception from os.path.abspath(path) to detect None.

@oeway
Copy link
Contributor Author

oeway commented May 3, 2022

Hey @oeway. Thanks for the suggestion! In your mind is what we need the addition of an optional argument for path? Seems like the only possible downside of that would be if someone was expecting the exception from os.path.abspath(path) to detect None.

How about we rename the first argument path to file_or_path? Then we try to infer the path based on the types:

if isinstance(file_or_path, (str, os.PathLike)):
    self.path = file_or_path
else: # file-like object
    if hasattr(file_or_path, 'name'):
        # Normal file object has the name property which contains the path
        self.path = file_or_path.name
    else:
        self.path = "" # the default is empty? or None?
 
self.zf = zipfile.ZipFile(file_or_path, mode=mode, compression=compression,
                                 allowZip64=allowZip64)

@joshmoore
Copy link
Member

I wouldn't suggest renaming since existing code could be using ZipStore(path=). If permitting a FileLike for something named "path" is too confusing, then perhaps we add a static initializer ZipStore.from_file(file, *args, **kwargs) (alternative names: wrap(), ...)

@oeway
Copy link
Contributor Author

oeway commented May 3, 2022

I wouldn't suggest renaming since existing code could be using ZipStore(path=). If permitting a FileLike for something named "path" is too confusing, then perhaps we add a static initializer ZipStore.from_file(file, *args, **kwargs) (alternative names: wrap(), ...)

Good point! ZipStore.from_file sounds good to me!

@jakirkham
Copy link
Member

Think it would be ok to just accept ZipFile instances in path. We can just clarify the docs on this point.

We have similar flexibility in Dask.

@jakirkham
Copy link
Member

As a side note we might consider using position only and keyword only arguments in Zarr 3 to avoid these issues around renaming arguments.

@joshmoore
Copy link
Member

Issue created, @jakirkham.

Think it would be ok to just accept ZipFile instances in path.

So the logic would be:

  • is_zip_file()
  • is_file_like()
  • is_path_or_string_like()

Or would you just pass a ZipFile, @oeway ?

@oeway
Copy link
Contributor Author

oeway commented May 4, 2022

Hi, for my case it's a fake file object making http requests to a file with range header, so not a zipfile.

@jakirkham
Copy link
Member

Thanks Josh! 😄

Intriguing would be curious to look at the object if you don't mind sharing 🙂

Does it support one of the file ABCs? Or could it? If so, then it would be pretty easy to check if it isinstance of one of those and use it

@oeway
Copy link
Contributor Author

oeway commented May 4, 2022

Intriguing would be curious to look at the object if you don't mind sharing 🙂

Does it support one of the file ABCs? Or could it? If so, then it would be pretty easy to check if it isinstance of one of those and use it

Sure, here it is: https://github.com/imjoy-team/imjoy-rpc/blob/af739ec829d984da35bc5b87b93aa1a553944fe3/python/imjoy_rpc/utils.py#L672-L842

It's a class inherit from io.IOBase, I guess fsspec also has something similar.

@jakirkham
Copy link
Member

cc @martindurant (in case this is of interest especially given our ZIP discussion recently)

@martindurant
Copy link
Member

martindurant commented May 4, 2022

I would point out that fsspec supports a zarr-compatible key-value store over any fs it can instantiate, including ZIP and in-memory. You can pass these directly to zarr.

# this bit just for example
data = fsspec.filesystem("http").cat("https://github.com/fsspec/filesystem_spec/archive/refs/heads/master.zip")
fsspec.filesystem("memory").pipe("mem.zip", data)

# make storage mapper like this
store = fsspec.get_mapper("zip::memory://mem.zip")

@martindurant
Copy link
Member

martindurant commented May 4, 2022

Obviously the thing above won't work in pyidide, since the HTTP part uses aiohttp - I should have said. But you can replace this part or wait until I write HTTP-for-pyodide (maybe next week?!).

The separate issue I was talking with @jakirkham about, is that accessing this way uses a file - so access to zarr chunks would be always serial. With kerchunk, we can index the ZIP (i.e., translate the existing index embedded in the file) and attain concurrent access to zarr chunks. This is pretty simple, but not yet done, but should work in pyodide/pyscript too.

@TomasPuverle
Copy link

I was just looking at the same problem. Creating from another ZipFile, or file (e.g. io.BytesIO) would be super useful.

I was actually looking at potentially contributing this but there are a number of places in ZipFile where the .zf is recreated from self.path again. That may be an issue if en external file (instead of a path) is provided, no?

@martindurant
Copy link
Member

FSStore already allows for this. In fsspec, you can pass a URL like "zip::s3://bucket/file.zip", or you can pass explicit arguments to the ZIP backend if you like. You don't need to rewrite ZipStore.

@TomasPuverle
Copy link

Hi @martindurant, thanks so much for your response!

I have been looking at the documentation for fsspec and the "zip" filesystem is documented as being read-only: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.zip.ZipFileSystem. I was able to successfully read from an fsspec-based zip file store, but according to my reading of the docs, I would not be able to create new in-memory zip file with the combination of fsspec and zarr.

But perhaps I'm misunderstanding your something, in which case your feedback is much appreciated!

Thank you.

@martindurant
Copy link
Member

Good point. There's no principled reason that ZipFileSystem should be read-only, except that writing would be a terrible idea on a key-value storage (every update would need to rewrite the file). However, it'd work fine for local, in-memory or cached-to-remote files.

@martindurant
Copy link
Member

martindurant commented Aug 7, 2022

You can try!

--- a/fsspec/implementations/zip.py
+++ b/fsspec/implementations/zip.py
@@ -44,10 +44,8 @@ class ZipFileSystem(AbstractArchiveFileSystem):
             a string.
         """
         super().__init__(self, **kwargs)
-        if mode != "r":
-            raise ValueError("Only read from zip files accepted")
         if isinstance(fo, str):
-            files = open_files(fo, protocol=target_protocol, **(target_options or {}))
+            files = open_files(fo, mode=mode+"b", protocol=target_protocol, **(target_options or {}))
             if len(files) != 1:
                 raise ValueError(
                     'Path "{}" did not resolve to exactly'
@@ -55,7 +53,7 @@ class ZipFileSystem(AbstractArchiveFileSystem):
                 )
             fo = files[0]
         self.fo = fo.__enter__()  # the whole instance is a context
-        self.zip = zipfile.ZipFile(self.fo)
+        self.zip = zipfile.ZipFile(self.fo, mode=mode)
         self.block_size = block_size
         self.dir_cache = None```

Should fail on any file object that doesn't allow seek while writing.

@martindurant
Copy link
Member

@jakirkham
Copy link
Member

Intriguing would be curious to look at the object if you don't mind sharing 🙂
Does it support one of the file ABCs? Or could it? If so, then it would be pretty easy to check if it isinstance of one of those and use it

Sure, here it is: https://github.com/imjoy-team/imjoy-rpc/blob/af739ec829d984da35bc5b87b93aa1a553944fe3/python/imjoy_rpc/utils.py#L672-L842

It's a class inherit from io.IOBase, I guess fsspec also has something similar.

Just a note, @oeway, think we could just test for io.IOBase then. Am guessing this is a 4-5 line change here. Namely check type, use it, protect against closing for user-provided files (as presumably users would want to handle closing).

@jhamman
Copy link
Member

jhamman commented Jan 3, 2023

Cross posting a related SO post and my answer there. https://stackoverflow.com/questions/74127357/how-to-create-and-return-a-zarr-file-from-xarray-dataset/74148410?noredirect=1#comment132344162_74148410

I wish I had seen this issue first 🤷. But in the end, I came up with a similar solution to @oeway.

@adanb13
Copy link

adanb13 commented Jan 4, 2023

Hey @jhamman thanks for pointing me to this issue, here from SO 👋

In the time between the update, I tried the following and it seems to do the trick:

import zarr
import tempfile
import os

mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')

with tempfile.SpooledTemporaryFile(max_size= int((mem_bytes+1)), dir= TMP_PLACEHOLDER,  suffix= 'zip') as f:
        with tempfile.NamedTemporaryFile() as f2:
            # where `data` is a xarray dataset
            data.to_zarr(zarr.ZipStore(f2.name), mode='w')
            return f2.read()

Although, I'm not 100% certain that this does not write to disk somewhere (I've had instances in the past where tempfile behaves in a weird manner and writes to disk in a place I did not expect), from my observations, it does not write to disk.

Note that max_size is set to be greater than all of the memory on machine. So it should never write to disk. The data sets I'm working with currently aren't large enough to test the bounds but I'd assume a try/except block would cover that scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants