Reading from remote .zarr stores #275

Closed
5 of 8 tasks
berombau opened this issue May 22, 2023 · 3 comments · Fixed by #278

@berombau
Contributor

berombau commented May 22, 2023

Right now, reading from a remote .zarr URL fails:

In [7]: sdata = sd.SpatialData.read('https://s3.embl.de/spatialdata/spatialdata-sandbox/cosmx_io.zarr/')
---------------------------------------------------------------------------
PathNotFoundError                         Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 sdata = sd.SpatialData.read('https://s3.embl.de/spatialdata/spatialdata-sandbox/cosmx_io.zarr/')

File ~/Documents/GitHub/spatialdata/src/spatialdata/_core/spatialdata.py:1183, in SpatialData.read(file_path)
   1179 @staticmethod
   1180 def read(file_path: str) -> SpatialData:
   1181     from spatialdata import read_zarr
-> 1183     return read_zarr(file_path)

File ~/Documents/GitHub/spatialdata/src/spatialdata/_io/io_zarr.py:22, in read_zarr(store)
     19 if isinstance(store, str):
     20     store = Path(store)
---> 22 f = zarr.open(store, mode="r")
     23 images = {}
     24 labels = {}

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/zarr/convenience.py:122, in open(store, mode, zarr_version, path, **kwargs)
    120     return open_group(_store, mode=mode, **kwargs)
    121 else:
--> 122     raise PathNotFoundError(path)

PathNotFoundError: nothing found at path ''

Ideally, this would just work the same way as for a local dataset, or fail with informative errors. Some of the issues causing this:

  • URL strings are parsed using os.path or Path, which does not work for:
    • .exists(), e.g. images_store.exists() in io_zarr.py
    • iterating over sub-elements of the remote store the way a local path allows, e.g. `for k in f` in io_raster.py
  • URL strings are sometimes required to be str | Path:
    • e.g. _read_multiscale in io_raster.py
  • reading IPv6 URLs (e.g. http://[::]:8000/) also fails, probably due to a string handling error
  • failing to read a remote store or an unconsolidated store returns an empty store instead of raising an error
  • remote Points elements (.parquet) are currently read in completely instead of only the metadata, which is really slow for large datasets (see the metadata-only sketch below this list)
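One way to avoid that slow full read is to fetch only the Parquet footer, which already contains the row count and schema. A minimal sketch, not the spatialdata code path, using fsspec and pyarrow (the URL is hypothetical):

import fsspec
import pyarrow.parquet as pq

# Hypothetical URL; any remote .parquet file served over HTTP would do.
url = "https://example.org/data.zarr/points/points.parquet"
with fsspec.open(url, "rb") as f:
    # ParquetFile only reads the footer here, not the row data.
    num_rows = pq.ParquetFile(f).metadata.num_rows
print(num_rows)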

The library urlpath extends pathlib to also work with URLs, but ideally this would be handled by the URL parsing of the zarr library itself.
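For comparison, zarr (backed by fsspec) can already take a URL when it is wrapped in an FSStore, so the path handling would not need os.path or Path at all. A minimal sketch, assuming zarr v2 with fsspec installed; whether sub-elements can be listed still depends on the HTTP backend or consolidated metadata:

import zarr
from zarr.storage import FSStore

url = "https://s3.embl.de/spatialdata/spatialdata-sandbox/cosmx_io.zarr/"
store = FSStore(url)           # fsspec handles the URL, no os.path/Path involved
root = zarr.open(store, mode="r")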

The list of sub-elements can be provided by consolidate_metadata, the same approach as in kevinyamauchi/ome-ngff-tables-prototype#12.
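A minimal sketch of that approach, assuming zarr v2 with fsspec installed (paths are placeholders): consolidate the metadata once on the writable store, after which a remote reader can list every sub-element from the single .zmetadata key:

import zarr

# One-time step, run where the store is writable: writes a .zmetadata key.
zarr.consolidate_metadata("data.zarr")

# Remote readers then get the full tree from .zmetadata in one request.
root = zarr.open_consolidated("https://example.org/data.zarr", mode="r")
print(list(root.group_keys()))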

@berombau
Contributor Author

ZarrLocation solves some issues, like adding .exists(), but it does not support iteration or a Path-like syntax such as store / 'images'.
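For reference, a minimal sketch of how ZarrLocation from ome-zarr-py is used; the exact method names here are an assumption from memory and may differ between versions, and the URL is hypothetical:

from ome_zarr.io import ZarrLocation

loc = ZarrLocation("https://example.org/data.zarr")
if loc.exists():                   # works for local paths and URLs alike
    images = loc.create("images")  # child location via a method, not `loc / "images"`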

@berombau
Contributor Author

Stuck on reading remote .parquet:

ArrowInvalid                              Traceback (most recent call last)
File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/IPython/core/formatters.py:707, in PlainTextFormatter.__call__(self, obj)
    700 stream = StringIO()
    701 printer = pretty.RepresentationPrinter(stream, self.verbose,
    702     self.max_width, self.newline,
    703     max_seq_length=self.max_seq_length,
    704     singleton_pprinters=self.singleton_printers,
    705     type_pprinters=self.type_printers,
    706     deferred_pprinters=self.deferred_printers)
--> 707 printer.pretty(obj)
    708 printer.flush()
    709 return stream.getvalue()

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/Documents/GitHub/spatialdata/src/spatialdata/_core/spatialdata.py:1239, in SpatialData.__repr__(self)
   1238 def __repr__(self) -> str:
-> 1239     return self._gen_repr()

File ~/Documents/GitHub/spatialdata/src/spatialdata/_core/spatialdata.py:1290, in SpatialData._gen_repr(self)
   1288     assert len(t) == 1
   1289     parquet_file = t[0]
-> 1290     table = read_table(parquet_file)
   1291     length = len(table)
   1292 else:
   1293     # length = len(v)

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/parquet/core.py:2926, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2919     raise ValueError(
   2920         "The 'metadata' keyword is no longer supported with the new "
   2921         "datasets-based implementation. Specify "
   2922         "'use_legacy_dataset=True' to temporarily recover the old "
   2923         "behaviour."
   2924     )
   2925 try:
-> 2926     dataset = _ParquetDatasetV2(
   2927         source,
   2928         schema=schema,
   2929         filesystem=filesystem,
   2930         partitioning=partitioning,
   2931         memory_map=memory_map,
   2932         read_dictionary=read_dictionary,
   2933         buffer_size=buffer_size,
   2934         filters=filters,
   2935         ignore_prefixes=ignore_prefixes,
   2936         pre_buffer=pre_buffer,
   2937         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   2938         thrift_string_size_limit=thrift_string_size_limit,
   2939         thrift_container_size_limit=thrift_container_size_limit,
   2940     )
   2941 except ImportError:
   2942     # fall back on ParquetFile for simple cases when pyarrow.dataset
   2943     # module is not available
   2944     if filters is not None:

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/parquet/core.py:2452, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2450     except ValueError:
   2451         filesystem = LocalFileSystem(use_mmap=memory_map)
-> 2452 finfo = filesystem.get_file_info(path_or_paths)
   2453 if finfo.is_file:
   2454     single_file = path_or_paths

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/_fs.pyx:571, in pyarrow._fs.FileSystem.get_file_info()

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Expected a local filesystem path, got a URI: 'https://dl01.irc.ugent.be/spatial/cosmx/data.zarr/points/1_points//points.parquet'
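A sketch of one possible workaround: hand pyarrow a file-like object opened through fsspec instead of the raw URI string, so it never tries to interpret the URL as a local filesystem path. The URL is the one from the traceback above (with the doubled slash removed):

import fsspec
import pyarrow.parquet as pq

url = "https://dl01.irc.ugent.be/spatial/cosmx/data.zarr/points/1_points/points.parquet"
with fsspec.open(url, "rb") as f:
    table = pq.read_table(f)   # read via a file-like object, not a local path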

@berombau
Contributor Author

Consolidated store to test with: https://dl01.irc.ugent.be/spatial/cosmx/data.zarr
