
Opening fsspec s3 file twice results in invalid start byte #6813

@wroberts4

Description


What happened?

When I call xr.open_dataset twice on the same fsspec s3 file object, the second call fails with the error "file-like object read/write pointer not at the start of the file" (or, when the file is opened in text mode, a UnicodeDecodeError about an invalid start byte).

Here's a Dockerfile I used for the environment:

FROM condaforge/mambaforge:4.12.0-0
RUN mamba install -y --strict-channel-priority -c conda-forge python=3.10 dask h5netcdf xarray fsspec s3fs

Input1:

import fsspec
import xarray as xr
fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})
xr.open_dataset(data, engine='h5netcdf', chunks={})

Output1:

Traceback (most recent call last):
  File "//example.py", line 26, in <module>
    xr.open_dataset(data, engine='h5netcdf', chunks={})
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 389, in open_dataset
    store = H5NetCDFStore.open(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 157, in open
    magic_number = read_magic_number_from_file(filename)
  File "/opt/conda/lib/python3.10/site-packages/xarray/core/utils.py", line 645, in read_magic_number_from_file
    raise ValueError(
ValueError: cannot guess the engine, file-like object read/write pointer not at the start of the file, please close and reopen, or use a context manager
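
Presumably read_magic_number_from_file raises this because the first open_dataset call leaves the read pointer of the shared file object past position 0. As a point of reference only, rewinding the object between calls is a sketch of a workaround (assuming the s3fs file object is seekable, which buffered fsspec files generally are), though ideally open_dataset would not require it:

import fsspec
import xarray as xr

fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})

# Rewind the shared file object so the magic-number check sees position 0 again.
# This only sidesteps the check above; it is not a proposed fix.
data.seek(0)
xr.open_dataset(data, engine='h5netcdf', chunks={})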

Input2:

import fsspec
import xarray as xr
fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp, mode='r')
xr.open_dataset(data, engine='h5netcdf', chunks={})
xr.open_dataset(data, engine='h5netcdf', chunks={})

Output2:

Traceback (most recent call last):
  File "//example.py", line 25, in <module>
    xr.open_dataset(data, engine='h5netcdf', chunks={})
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 389, in open_dataset
    store = H5NetCDFStore.open(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 157, in open
    magic_number = read_magic_number_from_file(filename)
  File "/opt/conda/lib/python3.10/site-packages/xarray/core/utils.py", line 650, in read_magic_number_from_file
    magic_number = filename_or_obj.read(count)
  File "/opt/conda/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

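The UnicodeDecodeError in this variant appears to come from mode='r' opening the object in text mode: the magic-number read then goes through a UTF-8 decoder, which fails on the 0x89 byte that begins the HDF5 signature (b'\x89HDF\r\n\x1a\n'). A sketch of the binary-mode equivalent (mode='rb' is the fsspec default, so this behaves like Input1 and still hits the pointer error on a second call):

import fsspec
import xarray as xr

fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'

# Binary mode returns raw bytes, so the HDF5 magic number can be read directly.
data = fs.open(fp, mode='rb')
xr.open_dataset(data, engine='h5netcdf', chunks={})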

What did you expect to happen?

I expect both calls to open_dataset to yield the same result without raising. For comparison, the following, which re-opens the file between calls, runs without errors:

import fsspec
import xarray as xr
fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})
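
The error message also suggests using a context manager; below is a sketch of that form, assuming the data is loaded while the file object is still open (otherwise the lazily chunked variables would try to read from a closed file):

import fsspec
import xarray as xr

fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'

# Open a fresh file object for each call and let the context manager close it.
for _ in range(2):
    with fs.open(fp) as data:
        ds = xr.open_dataset(data, engine='h5netcdf', chunks={})
        ds.load()  # read while the file object is still open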

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

I see the same error mentioned in other issues, e.g. #3991, but in that case it was determined to be a problem with the input data.

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.20.1.el8_5.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: None

xarray: 2022.6.0rc0
pandas: 1.4.3
numpy: 1.23.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: 1.0.1
h5py: 3.7.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.7.0
distributed: 2022.7.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.0.4
conda: 4.13.0
pytest: None
IPython: None
sphinx: None
