Entries powered by the intake_xarray driver do not lazily read metadata from the files.
# %%
import intake
import xarray as xr

ds = xr.Dataset(
    {
        "test_var": [0],
    },
    attrs={"xarray_metadata": "The metadata in the xarray file"},
)
ds.to_netcdf("test_metadata.nc")
ds.to_zarr("test_metadata.zarr", mode="w")

# %%
catalog_content = """
sources:
  netcdf:
    driver: netcdf
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.nc'
    metadata:
      catalog_metadata: The metadata in the catalog entry
  zarr_intake_xarray:
    description: zarr archive read by intake_xarray
    driver: zarr
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.zarr'
    metadata:
      catalog_metadata: The metadata in the catalog entry
  zarr_intake:
    description: zarr archive read by intake
    driver: zarr_cat
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.zarr'
    metadata:
      catalog_metadata: The metadata in the catalog entry
"""
with open("catalog.yml", "w") as f:
    f.write(catalog_content)

cat = intake.open_catalog("catalog.yml")
print(f"{cat.netcdf.metadata=}")
print(f"{cat.zarr_intake_xarray.metadata=}")
print(f"{cat.zarr_intake.metadata=}")
As you can see from the output, only the metadata of the entry powered by the intake driver includes the field from the zarr file:
cat.netcdf.metadata = {'catalog_metadata': 'The metadata in the catalog entry'}
cat.zarr_intake_xarray.metadata = {'catalog_metadata': 'The metadata in the catalog entry'}
cat.zarr_intake.metadata = {'catalog_metadata': 'The metadata in the catalog entry', 'xarray_metadata': 'The metadata in the xarray file'}
However, after reading the files, the metadata is complete:
cat.netcdf.read()
cat.zarr_intake_xarray.read()
print(f"Netcdf metadata after reading: {cat.netcdf.metadata}")
print(f"Zarr metadata after reading: {cat.zarr_intake_xarray.metadata}")
Output:
Netcdf metadata after reading: {'catalog_metadata': 'The metadata in the catalog entry', 'dims': {'test_var': 1}, 'data_vars': {}, 'coords': ('test_var',), 'xarray_metadata': 'The metadata in the xarray file'}
Zarr metadata after reading: {'catalog_metadata': 'The metadata in the catalog entry', 'dims': {'test_var': 1}, 'data_vars': {}, 'coords': ('test_var',), 'xarray_metadata': 'The metadata in the xarray file'}
What do you think the right behaviour should be? Catalog entries are special in Intake (<2.0) in that they get their subentries eagerly, so they have access to the file metadata immediately. Is this what you are getting at?
I expected cat.netcdf.metadata to also include the metadata from the file, like this: {'catalog_metadata': 'The metadata in the catalog entry', 'xarray_metadata': 'The metadata in the xarray file'}.
But now, the xarray_metadata key appears only after reading the whole file by executing cat.netcdf.read().
I think it would be better to have "lazy" metadata reading from files, because they could also contain some useful information... What do you think?
The .discover() method is meant exactly for this purpose: to get information from the file with a minimum of reads. Its usefulness varies by file type.
Actually, xarray is lazy by default, so even if you do a .read(), you do not load all the data into memory, only enough for xarray to be able to understand the file's layout (typically the attributes and coordinate arrays).
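To illustrate the point about xarray's laziness outside of Intake, here is a small sketch using plain xarray. It assumes a netCDF backend (netCDF4, h5netcdf, or scipy) is installed; the helper name attrs_without_loading and the dimension name "x" are invented for the example:

```python
import os
import tempfile

import numpy as np
import xarray as xr


def attrs_without_loading(workdir):
    """Hypothetical helper: read global attributes without requesting any values."""
    path = os.path.join(workdir, "test_metadata.nc")
    ds = xr.Dataset(
        {"test_var": ("x", np.arange(4.0))},
        attrs={"xarray_metadata": "The metadata in the xarray file"},
    )
    ds.to_netcdf(path)

    # open_dataset parses the file layout and attributes; variable values
    # are only fetched once they are accessed (e.g. via .values or .load()).
    with xr.open_dataset(path) as opened:
        return dict(opened.attrs)


if __name__ == "__main__":
    print(attrs_without_loading(tempfile.mkdtemp()))
```

The attributes are copied into a plain dict before the file is closed, so the result stays usable after the `with` block ends.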
OS: Windows 10
python 3.11.5
intake 0.7.0
intake_xarray 0.7.0
xarray 2023.8.0
zarr 2.16.1