FileNotFoundError when using zarr with dask/s3fs version >= 0.5.0 #649

Closed
AliceBalfanz opened this issue Nov 11, 2020 · 12 comments

@AliceBalfanz

When reading a Zarr store from an S3 bucket in which uninitialized chunks are not physically stored, as permitted by the Zarr specification (https://zarr.readthedocs.io/en/stable/spec/v2.html#chunks), a FileNotFoundError occurs. This behavior is new as of s3fs version 0.5.0.

Minimal, reproducible code sample, a copy-pastable example if possible

import s3fs
import zarr

# AWS S3 path
s3_path = 's3://dcs4cop/bc-cmems-spm-1997-2018_1x704x640.zarr' 
# Initialize the S3 file system
s3 = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root=s3_path, s3=s3, check=False)
# Read Zarr file
ds = zarr.open(store=store, mode='r')

# Get some values:
ds.SPM[5]

This results in FileNotFoundError: The specified key does not exist.
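
The expectation described above can be sketched without S3 at all. The following is a minimal illustration, not Zarr's or s3fs's actual code: `read_chunk`, `RaisingStore`, `FILL_VALUE`, and `CHUNK_LEN` are hypothetical names, and chunks are modeled as plain lists. It shows how a reader can treat an absent chunk key as "uninitialized" and substitute the fill value, and how a store that signals a missing key with a different exception type would bypass that fallback.

```python
from collections.abc import Mapping

FILL_VALUE = float("nan")  # assumed fill value for illustration
CHUNK_LEN = 4              # hypothetical chunk size


def read_chunk(store: Mapping, key: str) -> list:
    """Return the stored chunk, or a chunk of fill values if the key is absent."""
    try:
        return store[key]
    except KeyError:
        # An absent key means "uninitialized chunk", not an error.
        return [FILL_VALUE] * CHUNK_LEN


class RaisingStore(dict):
    """Hypothetical store that reports missing keys with FileNotFoundError."""

    def __getitem__(self, key):
        if key not in self:
            raise FileNotFoundError(key)
        return super().__getitem__(key)


store = {"SPM/0.0.0": [1.0, 2.0, 3.0, 4.0]}  # chunk "SPM/5.0.0" was never written
present = read_chunk(store, "SPM/0.0.0")
missing = read_chunk(store, "SPM/5.0.0")

# A store raising FileNotFoundError instead of KeyError bypasses the fallback.
try:
    read_chunk(RaisingStore(store), "SPM/5.0.0")
    leaked = False
except FileNotFoundError:
    leaked = True
```

In this sketch the plain dict yields a chunk of fill values for the missing key, while the second store lets the error reach the caller.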

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.5.0
  • Value of s3fs.__version__: 0.5.1
  • Value of fsspec.__version__: 0.8.0
  • Version of Python interpreter: 3.7.6
  • Operating system: Linux
  • Zarr was installed using conda

Working as expected with the following versions:

  • Value of zarr.__version__: 2.5.0
  • Value of s3fs.__version__: 0.4.0
  • Value of fsspec.__version__: 0.6.2

@AliceBalfanz changed the title from "FileNotFoundError when using zarr with dask/s3fs version > 0.5.0" to "FileNotFoundError when using zarr with dask/s3fs version >= 0.5.0" on Nov 11, 2020
@joshmoore
Member

Hi @AliceBalfanz, this sounds like it might be related to an issue I was seeing as well: fsspec/filesystem_spec#342

@forman

forman commented Nov 11, 2020

Hi @joshmoore, fsspec/filesystem_spec#342 is closed, but I cannot see how its resolution fixes the actual issue described here. Currently S3Map in combination with Zarr behavior seems to violate the Zarr Storage Spec 2.5.

I'm not sure if the root cause originates from Zarr or whether we should implement or use another S3-capable MutableMapping instead of using S3Map / S3FileSystem.

We really need "missing chunks" on S3 as we have to deal with large sparse data cubes comprising a large number of NaN-chunks. In fact we use a tool xcube prune to erase NaN chunks from datasets as this drastically reduces the number of files (and uploads to S3).
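
As a rough illustration of what erasing NaN chunks means, here is a minimal sketch. It is not xcube's implementation: `prune_nan_chunks` is a hypothetical helper, the store is a plain dict, and chunks are assumed to be already decoded into lists of floats.

```python
import math


def prune_nan_chunks(store):
    """Delete every chunk whose values are all NaN; return the removed keys."""
    removed = [
        key for key, chunk in store.items()
        if all(math.isnan(value) for value in chunk)
    ]
    for key in removed:
        del store[key]
    return removed


store = {
    "SPM/0.0.0": [1.0, float("nan")],           # partly valid, kept
    "SPM/1.0.0": [float("nan"), float("nan")],  # all NaN, pruned
}
removed = prune_nan_chunks(store)
```

After pruning, a reader that fills missing chunks with a NaN fill value would reconstruct the same data from fewer stored objects.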

@rabernat
Contributor

rabernat commented Nov 11, 2020

Hi @forman and @AliceBalfanz. Sorry for the friction you have experienced here! The transition to the fsspec-based storage backends brings some major benefits (e.g. async, see #536 (comment)), but clearly there are also some transitional challenges to overcome. I agree that we must be able to preserve the earlier behavior of correctly filling empty chunks, rather than raising an error.

I urge you to not give up on s3fs and instead hold out for a bug fix. We appreciate your support and patience.

I'm wondering if @martindurant can weigh in on this.

@martindurant
Member

It seems that this "indirect route" of the old usage may have a hole. The simplified version appears to work

In [22]: ds = zarr.open_consolidated('s3://dcs4cop/bc-cmems-spm-1997-2018_1x704x640.zarr')

In [23]: list(ds)
Out[23]: ['SPM', 'lat', 'lat_bnds', 'lon', 'lon_bnds', 'time', 'time_bnds']

In [24]: ds.SPM[5]
Out[24]:
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])

Obviously old and new ought to work. I'll look into it.

@rabernat
Contributor

> It seems that this "indirect route" of the old usage may have a hole. The simplified version appears to work

Could you clarify / define the following terms so that we are better able to participate in the discussion

  • "indirect route"
  • "hole"
  • "simplified version"

Thanks!

@martindurant
Member

I mean that in my version of the call to zarr, I'm passing a URL, so this gets routed to the new FSStore, rather than a bare S3Map. There is an optional argument to getitems on what to do with errored/missing keys, and from zarr's point of view, the argument should be "omit", as FSStore does, but the default is "raise".
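
The difference between the two policies can be sketched in plain Python. This is an emulation of the semantics described above, not fsspec's code: the `getitems` function below is a hypothetical stand-in operating on a dict, and only the "raise" and "omit" options mentioned in this comment are modeled.

```python
def getitems(store, keys, on_error="raise"):
    """Hypothetical bulk read mimicking the "raise" and "omit" policies."""
    if on_error not in ("raise", "omit"):
        raise ValueError(f"unsupported on_error: {on_error!r}")
    out = {}
    for key in keys:
        try:
            out[key] = store[key]
        except KeyError:
            if on_error == "raise":
                raise
            # on_error == "omit": leave the key out of the result
    return out


store = {"SPM/0.0.0": b"chunk-bytes"}
keys = ["SPM/0.0.0", "SPM/5.0.0"]  # the second chunk was never written

omitted = getitems(store, keys, on_error="omit")

try:
    getitems(store, keys)  # default policy is "raise"
    raised = False
except KeyError:
    raised = True
```

With "omit", the caller receives only the chunks that exist and can fill the rest itself; with "raise", a single missing chunk aborts the whole read.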

Note that there is a further problem, in that something is turning the paths lower-case, but I think this must be in zarr. Bear with me.

@rabernat
Contributor

Thanks Martin for your fast reply!

@martindurant
Member

So, apparently lower-case paths are the canonical norm in zarr - this is probably documented somewhere. To open without consolidated, you need to do

In [3]: ds = zarr.open_group('s3://dcs4cop/bc-cmems-spm-1997-2018_1x704x640.zarr', storage_options={'normalize_keys': False})

In [4]: ds.SPM[5]
Out[4]:
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])

I need to fix the fact that this doesn't work with zarr.open (doesn't pass on kwargs).
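
To see why key normalization matters here, consider this minimal sketch. It assumes that normalization lower-cases keys, as described in this thread; `lookup` is a hypothetical helper and the store is a plain dict standing in for a case-sensitive object store.

```python
def lookup(store, key, normalize_keys=True):
    """Hypothetical lookup that optionally lower-cases the key before reading."""
    if normalize_keys:
        key = key.lower()
    return store.get(key)


# Keys exactly as they were written to a case-sensitive store.
store = {"SPM/.zarray": b"metadata"}

with_normalization = lookup(store, "SPM/.zarray", normalize_keys=True)
without_normalization = lookup(store, "SPM/.zarray", normalize_keys=False)
```

With normalization on, the request becomes "spm/.zarray", which does not exist in the store, so an array with an upper-case name such as SPM cannot be found.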

@rabernat
Contributor

Can we make storage_options={'normalize_keys': False} the default?

@martindurant
Member

I would need to investigate where this comes from - it appears to be default True elsewhere.
PR for the rest coming.

@forman

forman commented Nov 11, 2020

@joshmoore, @rabernat, @martindurant thanks so much for your immediate responses; it looks like this will be resolved soon. Don't hesitate to tell us how we can best support you!

@Carreau added this to the v2.6 milestone on Dec 1, 2020
@fwrite

fwrite commented Mar 15, 2021

FYI: pydata/xarray#5028 is (likely) a related issue, caused by normalize_keys being True by default.
