Reading from remote .zarr stores #275

Closed
5 of 8 tasks
berombau opened this issue May 22, 2023 · 3 comments · Fixed by #278

@berombau
Contributor

berombau commented May 22, 2023

Right now, reading from a remote .zarr URL fails:

In [7]: sdata = sd.SpatialData.read('https://s3.embl.de/spatialdata/spatialdata-sandbox/cosmx_io.zarr/')
---------------------------------------------------------------------------
PathNotFoundError                         Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 sdata = sd.SpatialData.read('https://s3.embl.de/spatialdata/spatialdata-sandbox/cosmx_io.zarr/')

File ~/Documents/GitHub/spatialdata/src/spatialdata/_core/spatialdata.py:1183, in SpatialData.read(file_path)
   1179 @staticmethod
   1180 def read(file_path: str) -> SpatialData:
   1181     from spatialdata import read_zarr
-> 1183     return read_zarr(file_path)

File ~/Documents/GitHub/spatialdata/src/spatialdata/_io/io_zarr.py:22, in read_zarr(store)
     19 if isinstance(store, str):
     20     store = Path(store)
---> 22 f = zarr.open(store, mode="r")
     23 images = {}
     24 labels = {}

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/zarr/convenience.py:122, in open(store, mode, zarr_version, path, **kwargs)
    120     return open_group(_store, mode=mode, **kwargs)
    121 else:
--> 122     raise PathNotFoundError(path)

PathNotFoundError: nothing found at path ''

Ideally, this would just work the same way as for a local dataset, or fail with informative errors. Some of the issues causing this:

  • URL strings are parsed using os.path or Path, which does not work for:
    • .exists(), e.g. images_store.exists() in io_zarr.py
    • iterating over sub-elements of the remote store the way a local path allows, e.g. `for k in f` in io_raster.py
  • URL strings are sometimes required to be str | Path:
    • e.g. _read_multiscale in io_raster.py
  • reading IPv6 URLs (e.g. http://[::]:8000/) also fails, probably due to a string handling error
  • failing to read a remote store or an unconsolidated store returns an empty store instead of raising an error
  • remote Points elements (.parquet) are currently read in completely instead of only the metadata, which is really slow for large datasets (see the metadata-only sketch below this list)
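One way to avoid that slow full read is to fetch only the Parquet footer, which already contains the row count and schema. A minimal sketch, not the spatialdata code path, using fsspec and pyarrow (the URL is hypothetical):

import fsspec
import pyarrow.parquet as pq

# Hypothetical URL; any remote .parquet file served over HTTP would do.
url = "https://example.org/data.zarr/points/points.parquet"
with fsspec.open(url, "rb") as f:
    # ParquetFile only reads the footer here, not the row data.
    num_rows = pq.ParquetFile(f).metadata.num_rows
print(num_rows)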

The library urlpath extends pathlib to also work with URLs, but ideally this would be handled by the URL parsing of the zarr library itself.
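For comparison, zarr (backed by fsspec) can already take a URL when it is wrapped in an FSStore, so the path handling would not need os.path or Path at all. A minimal sketch, assuming zarr v2 with fsspec installed; whether sub-elements can be listed still depends on the HTTP backend or consolidated metadata:

import zarr
from zarr.storage import FSStore

url = "https://s3.embl.de/spatialdata/spatialdata-sandbox/cosmx_io.zarr/"
store = FSStore(url)           # fsspec handles the URL, no os.path/Path involved
root = zarr.open(store, mode="r")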

The list of sub-elements can be provided by consolidate_metadata, the same approach as in kevinyamauchi/ome-ngff-tables-prototype#12.
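A minimal sketch of that approach, assuming zarr v2 with fsspec installed (paths are placeholders): consolidate the metadata once on the writable store, after which a remote reader can list every sub-element from the single .zmetadata key:

import zarr

# One-time step, run where the store is writable: writes a .zmetadata key.
zarr.consolidate_metadata("data.zarr")

# Remote readers then get the full tree from .zmetadata in one request.
root = zarr.open_consolidated("https://example.org/data.zarr", mode="r")
print(list(root.group_keys()))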

@berombau
Contributor Author

ZarrLocation solves some issues, like adding .exists(), but it does not support iteration or a Path-like syntax such as store / 'images'.
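For reference, a minimal sketch of how ZarrLocation from ome-zarr-py is used; the exact method names here are an assumption from memory and may differ between versions, and the URL is hypothetical:

from ome_zarr.io import ZarrLocation

loc = ZarrLocation("https://example.org/data.zarr")
if loc.exists():                   # works for local paths and URLs alike
    images = loc.create("images")  # child location via a method, not `loc / "images"`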

@berombau
Contributor Author

Stuck on reading remote .parquet:

ArrowInvalid                              Traceback (most recent call last)
File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/IPython/core/formatters.py:707, in PlainTextFormatter.__call__(self, obj)
    700 stream = StringIO()
    701 printer = pretty.RepresentationPrinter(stream, self.verbose,
    702     self.max_width, self.newline,
    703     max_seq_length=self.max_seq_length,
    704     singleton_pprinters=self.singleton_printers,
    705     type_pprinters=self.type_printers,
    706     deferred_pprinters=self.deferred_printers)
--> 707 printer.pretty(obj)
    708 printer.flush()
    709 return stream.getvalue()

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/Documents/GitHub/spatialdata/src/spatialdata/_core/spatialdata.py:1239, in SpatialData.__repr__(self)
   1238 def __repr__(self) -> str:
-> 1239     return self._gen_repr()

File ~/Documents/GitHub/spatialdata/src/spatialdata/_core/spatialdata.py:1290, in SpatialData._gen_repr(self)
   1288     assert len(t) == 1
   1289     parquet_file = t[0]
-> 1290     table = read_table(parquet_file)
   1291     length = len(table)
   1292 else:
   1293     # length = len(v)

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/parquet/core.py:2926, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2919     raise ValueError(
   2920         "The 'metadata' keyword is no longer supported with the new "
   2921         "datasets-based implementation. Specify "
   2922         "'use_legacy_dataset=True' to temporarily recover the old "
   2923         "behaviour."
   2924     )
   2925 try:
-> 2926     dataset = _ParquetDatasetV2(
   2927         source,
   2928         schema=schema,
   2929         filesystem=filesystem,
   2930         partitioning=partitioning,
   2931         memory_map=memory_map,
   2932         read_dictionary=read_dictionary,
   2933         buffer_size=buffer_size,
   2934         filters=filters,
   2935         ignore_prefixes=ignore_prefixes,
   2936         pre_buffer=pre_buffer,
   2937         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   2938         thrift_string_size_limit=thrift_string_size_limit,
   2939         thrift_container_size_limit=thrift_container_size_limit,
   2940     )
   2941 except ImportError:
   2942     # fall back on ParquetFile for simple cases when pyarrow.dataset
   2943     # module is not available
   2944     if filters is not None:

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/parquet/core.py:2452, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2450     except ValueError:
   2451         filesystem = LocalFileSystem(use_mmap=memory_map)
-> 2452 finfo = filesystem.get_file_info(path_or_paths)
   2453 if finfo.is_file:
   2454     single_file = path_or_paths

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/_fs.pyx:571, in pyarrow._fs.FileSystem.get_file_info()

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/homebrew/Caskroom/mambaforge/base/envs/spatialdata/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Expected a local filesystem path, got a URI: 'https://dl01.irc.ugent.be/spatial/cosmx/data.zarr/points/1_points//points.parquet'
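A sketch of one possible workaround: hand pyarrow a file-like object opened through fsspec instead of the raw URI string, so it never tries to interpret the URL as a local filesystem path. The URL is the one from the traceback above (with the doubled slash removed):

import fsspec
import pyarrow.parquet as pq

url = "https://dl01.irc.ugent.be/spatial/cosmx/data.zarr/points/1_points/points.parquet"
with fsspec.open(url, "rb") as f:
    table = pq.read_table(f)   # read via a file-like object, not a local path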

@berombau
Contributor Author

Consolidated store to test with: https://dl01.irc.ugent.be/spatial/cosmx/data.zarr
