Empty AnnData returned when reading Zarr from S3 #1056

Closed
2 of 3 tasks
LucaMarconato opened this issue Jul 13, 2023 · 2 comments · Fixed by #1057

@LucaMarconato
Member

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Description

Maybe related to #951.

The example below shows four attempts at reading a remote Zarr AnnData; only one of the four succeeds. The data can be browsed at https://dl01.irc.ugent.be/spatial/mibitof/data.zarr (the table is at table/table). Please note that consolidated metadata is present in the root folder (created by calling zarr.consolidate_metadata()).
The URL https://s3.embl.de/spatialdata/test_remote/data.zarr contains the same data, but it is hosted on S3 storage where the data is "not discoverable", so zarr needs the consolidated metadata to find the files.

We see that:

  • In attempts 1 and 2 we pass a zarr.Group to read_zarr(), and this doesn't work, possibly because the zarr.Group is returned by open_consolidated() rather than by open(). Both are groups, but they differ internally: for instance, g.store is a zarr.storage.FSStore for open() and a zarr.storage.ConsolidatedMetadataStore for open_consolidated(). Some of these internal differences may be what breaks read_zarr().
  • In attempts 3 and 4 we pass a URL. Here read_zarr() is most likely ignoring the zmetadata present two folders up, so it is no surprise that attempt 4 fails, while attempt 3 works because the folders under that URL are "discoverable".

Note: the store contains a zmetadata file rather than .zmetadata because of the bug zarr-developers/zarr-python#1121.
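To illustrate why a "not discoverable" store depends on consolidated metadata, here is a minimal sketch in plain Python. It is not zarr's implementation: the children() helper is hypothetical, and the keys in ZMETADATA are an abbreviated, assumed layout following the zarr v2 consolidated-metadata format (in particular, X being a dense array is an assumption). The point is only that the child names of table/table can be recovered from the consolidated document alone, without listing anything on the server.

```python
import json

# Abbreviated, hypothetical consolidated-metadata document in the zarr v2
# layout: a "metadata" mapping from store keys to their JSON contents.
ZMETADATA = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "table/.zgroup": {"zarr_format": 2},
        "table/table/.zgroup": {"zarr_format": 2},
        "table/table/X/.zarray": {"zarr_format": 2, "shape": [3309, 36]},
        "table/table/obs/.zgroup": {"zarr_format": 2},
    },
})


def children(zmetadata_text, group_path):
    """Return the direct children of group_path using only the consolidated metadata."""
    keys = json.loads(zmetadata_text)["metadata"]
    stripped = group_path.strip("/")
    prefix = stripped + "/" if stripped else ""
    names = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        # A direct child looks like "<name>/.zgroup" or "<name>/.zarray".
        if len(parts) == 2 and parts[1] in (".zgroup", ".zarray"):
            names.add(parts[0])
    return sorted(names)


print(children(ZMETADATA, "table/table"))  # ['X', 'obs']
```

A reader that skips the consolidated document and tries to list table/table directly has no equivalent source of names on such a store, which is consistent with the empty result seen in attempt 4.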

Reproducing

Code:

import zarr
import os
import pytest

from anndata import read_zarr

store0 = 'https://dl01.irc.ugent.be/spatial/mibitof/data.zarr'
store1 = 'https://s3.embl.de/spatialdata/test_remote/data.zarr'
# workaround: .zmetadata is being written as zmetadata (https://github.com/zarr-developers/zarr-python/issues/1121)
f0 = zarr.open_consolidated(store0, mode="r", metadata_key="zmetadata")
f1 = zarr.open_consolidated(store1, mode="r", metadata_key="zmetadata")

# this shows that the data is seen correctly by zarr
print(dict(f0['table/table']))
print(dict(f1['table/table']))

# attempt 1, fails
group0 = f0["table/table"]
with pytest.raises(zarr.errors.PathNotFoundError):
    table0 = read_zarr(group0)

# attempt 2, fails
group1 = f1["table/table"]
with pytest.raises(zarr.errors.PathNotFoundError):
    table0 = read_zarr(group1)

# attempt 3, works
table0 = read_zarr(os.path.join(store0, 'table/table'))
print(table0)

# attempt 4, fails
# this returns an empty table instead (makes sense since no consolidated metadata is passed here)
table1 = read_zarr(os.path.join(store1, 'table/table'))
print(table1)
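A side note on the URL construction in attempts 3 and 4: os.path.join uses the host OS path separator, so on Windows it would insert a backslash into the URL. A small sketch of an OS-independent alternative follows; join_url is a hypothetical helper, not part of anndata or zarr.

```python
import posixpath


def join_url(base, *parts):
    """Join URL path segments with forward slashes, regardless of the host OS."""
    return posixpath.join(base.rstrip("/"), *(p.strip("/") for p in parts))


print(join_url("https://s3.embl.de/spatialdata/test_remote/data.zarr", "table/table"))
# https://s3.embl.de/spatialdata/test_remote/data.zarr/table/table
```

This does not affect the behaviour reported here (the report was produced on macOS, where os.path.join already uses forward slashes); it only makes the reproduction portable.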

Traceback (for the first of the with pytest.raises...):

Traceback (most recent call last):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-0d02b753fc78>", line 1, in <module>
    table0 = read_zarr(group1)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_io/zarr.py", line 65, in read_zarr
    f = zarr.open(store, mode="r")
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/zarr/convenience.py", line 122, in open
    raise PathNotFoundError(path)
zarr.errors.PathNotFoundError: nothing found at path ''

Versions

>>> import anndata, session_info; session_info.show()

-----
anndata             0.9.1
session_info        1.0.0
-----
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ]
macOS-13.4.1-arm64-arm-64bit
-----
Session information updated at 2023-07-13 21:23

@ivirshup
Member

ivirshup commented Jul 14, 2023

@LucaMarconato, can reproduce.

As a workaround: ad.experimental.read_elem seems to work for both cases.

I would broadly also suggest using that API when working with zarr.Group. But read_zarr does say it should support this, so I'll fix that.

Example:

import zarr
from anndata.experimental import read_elem

store0 = 'https://dl01.irc.ugent.be/spatial/mibitof/data.zarr'
store1 = 'https://s3.embl.de/spatialdata/test_remote/data.zarr'

f0 = zarr.open_consolidated(store0, mode="r", metadata_key="zmetadata")
f1 = zarr.open_consolidated(store1, mode="r", metadata_key="zmetadata")

display(read_elem(f0["table/table"]))
# AnnData object with n_obs × n_vars = 3309 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     uns: 'spatialdata_attrs'
#     obsm: 'X_scanorama', 'X_umap', 'spatial'

display(read_elem(f1["table/table"]))
# AnnData object with n_obs × n_vars = 3309 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     uns: 'spatialdata_attrs'
#     obsm: 'X_scanorama', 'X_umap', 'spatial'

@ivirshup
Member

@LucaMarconato, could you try #1057 and make sure it fixes your issue? We're not set up to test against S3 directly, but I think this should fix the issue.
