
kerchunk interop #2889

Open
martindurant opened this issue Aug 27, 2024 · 21 comments
@martindurant

Ref:

Feature Request

I was recently sniped into considering HDF4 as a target for kerchunk, largely because of the amount of NASA archival data in the format. I blindly set about decoding the format and made a decent amount of progress (see linked PR) to the point of seeing the arrays and attributes in a file.

After this, it was mentioned that this repo has a lot of relevant code (along with the pyhdf helper), and indeed! You seem to have solved not only hdf4 but all the peculiarities of many more specific archive conventions.

So, how hard would it be to extract out kerchunk references given that you can already pipe dask chunks into xarray?

cc @maxrjones @TomNicholas

@mraspaud
Member

Sorry for my ignorance, but what is kerchunk? What does it need?

@mraspaud
Member

Regarding pyhdf, I recently looked into it again, and one problem is that it does not expose the chunking information of the underlying file (that was a choice made when the implementation was done, see here). Other than that it seems to be working nicely for our needs.

@djhoese
Member

djhoese commented Aug 27, 2024

See #2884 for recent discussion of when dask changes broke our pyhdf wrappers.

@martindurant
Author

but what is kerchunk

I should have started with this :)
kerchunk allows zarr (and so xarray) to read any dataset where the binary buffers can be found and described. So, if we can infer that a file has an array (x, y) of dtype, and chunks (dx, dy), and also we know the offset/length and compression of each chunk, then we can construct a zarr array.
The point of doing this, is

  • avoid needing another third-party library once the scan has been done (and its result posted somewhere as a file)
  • allow concurrent fetches of the chunks with fsspec and parallel processing of partitions with dask
  • allow access to any of the filesystems fsspec supports, making it "cloud native"
  • combine potentially thousands of original archival files into a single logical dataset, so that xarray slicing can select only the needed bytes out of the files for a given calculation
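For concreteness, here is a minimal sketch of what such a reference set can look like (the kerchunk version 1 JSON layout; the file name, shapes, and byte ranges below are invented for illustration):

```python
import json

# Hypothetical kerchunk reference set (version 1 layout). Zarr metadata is
# inlined as JSON strings; each chunk key maps to [url, offset, length],
# i.e. a raw byte range inside the original archival file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps({
            "shape": [2030, 1354],
            "chunks": [1015, 1354],
            "dtype": "<f4",
            "compressor": None,
            "filters": None,
            "fill_value": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # invented byte ranges in an invented file:
        "temp/0.0": ["s3://bucket/archive.hdf", 4096, 5497240],
        "temp/1.0": ["s3://bucket/archive.hdf", 5501336, 5497240],
    },
}

# zarr/xarray can then read this through fsspec's "reference://" filesystem,
# fetching only the byte ranges a given slice needs, e.g.:
#   xr.open_dataset("reference://", engine="zarr",
#                   backend_kwargs={"storage_options": {"fo": refs},
#                                   "consolidated": False})
print(len(refs["refs"]))  # → 4
```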

it does not expose the chunking information

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

@djhoese
Member

djhoese commented Aug 27, 2024

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

I'm not sure what you mean. Are you allowed to use a library like pyhdf to do the initial parsing? If so then h.select(<variable name>).dimensions() gives you the dimensions for that variable.

@martindurant
Author

Are you allowed to use a library like pyhdf

Certainly, I just didn't know about it when I got going. I also have an irrational itch to solve it all in one place in pure python, but ...

@maxrjones

See #2884 for recent discussion of when dask changes broke our pyhdf wrappers.

This was interesting. Would a pyhdf xarray backend be helpful for satpy even after the fix in #2886? I think this would be helpful for xarray workflows outside of satpy (e.g., https://discourse.pangeo.io/t/reading-modis-thermal-anomalies-into-xarray/4437) and may be able to find some time to work on it.

it does not expose the chunking information

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

could this get implemented in pyhdf instead of kerchunk using SDgetchunkinfo()?

@djhoese
Member

djhoese commented Aug 27, 2024

Another thing to consider for parsing, if you're already using NetCDF libraries for parsing NetCDF (.nc) files, you could also use it to parse HDF4 files if the NetCDF C library was compiled with HDF4 support. This is the case for the conda-forge build of netcdf4.

@mraspaud
Member

mraspaud commented Aug 27, 2024

See #2884 for recent discussion of when dask changes broke our pyhdf wrappers.

This was interesting. Would a pyhdf xarray backend be helpful for satpy even after the fix in #2886? I think this would be helpful for xarray workflows outside of satpy (e.g., discourse.pangeo.io/t/reading-modis-thermal-anomalies-into-xarray/4437) and may be able to find some time to work on it.

I for one would love this.

it does not expose the chunking information

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

could this get implemented in pyhdf instead of kerchunk using SDgetchunkinfo()?

Yes, it absolutely could, but I got lost in the SWIG interface and didn't know where to start to add it to pyhdf...

@mraspaud
Member

Are you allowed to use a library like pyhdf

Certainly, I just didn't know about it when I got going. I also have an irrational itch to solve it all in one place in pure python, but ...

I definitely share that itch, but I'm always worried about performance...

@martindurant
Author

if you're already using NetCDF libraries for parsing NetCDF (.nc) files

I'm sure it can read them, but we want the chunks information rather than the data. For HDF5, it makes data access MUCH faster (depending on access pattern, of course), e.g., https://nbviewer.org/github/cgentemann/cloud_science/blob/master/zarr_meta/cloud_mur_v41_benchmark.ipynb

@martindurant
Author

I'm always worried about performance...

Seeking around a remote file to get all those 4-byte definitions will give poor performance no matter how you do it, so it seems to me that extracting what you need once and then reading it one-shot at access time should be much faster. However, HDF4 does tend to have tiny chunks (since the conventions predate the cloud era), so it would mean storing lots of chunk references - even though the chunks are probably contiguous in the file.
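Since many of those tiny chunks do end up contiguous in the file, one mitigation is to coalesce adjacent (offset, length) ranges before storing or fetching them. A rough sketch (pure Python, function name invented):

```python
def merge_ranges(ranges, max_gap=0):
    """Coalesce byte ranges [(offset, length), ...] that touch, overlap, or sit
    within max_gap bytes of each other, so many tiny HDF4 chunks become a few
    larger reads."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, offset + length - prev_off))
        else:
            merged.append((offset, length))
    return merged

# three contiguous 4 kB chunks and one far-away chunk collapse to two ranges
print(merge_ranges([(0, 4096), (4096, 4096), (8192, 4096), (100000, 4096)]))
# → [(0, 12288), (100000, 4096)]
```

Allowing a small `max_gap` trades a few wasted bytes per read for even fewer requests, which is usually a win against object storage.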

@mraspaud
Member

I'm always worried about performance...

Seeking around a remote file to get all those 4-byte definitions will give poor performance no matter how you do it, so it seems to me that extracting what you need once and then reading it one-shot at access time should be much faster.

Sorry for going off-topic, but that does not apply to real-time processing, I suppose? Where you just need to read the data once, process it, save the result, and in the end delete the data file? Not that this is a show-stopper in any way, just curious for my main use of data :)

@martindurant
Author

that does not apply to real-time processing

No. In that case, you may as well grab the whole file locally (in memory) and you end up scanning the bits of metadata just the once, and probably processing every byte of array data.
Kerchunk allows scanning the metadata once and putting it into a convenient format, so that others don't have to.
Perhaps there are scenarios where the file structure is byte-identical across some range, and so you can apply what you find in one to the others; that's unlikely to be true in the presence of compression.

@TomNicholas

TomNicholas commented Aug 27, 2024

Seeking around a remote file to get all those 4-byte definitions will give poor performance no matter how you do it, so it seems to me that extracting what you need once and then reading it one-shot at access time should be much faster.

Sorry for going off-topic, but that does not apply to real-time processing I suppose?

FYI I see kerchunk / virtualizarr as a form of caching. You're caching the result of the step where given a so-far-unseen file you have to seek through it to read the metadata and find the positions and lengths of the byte ranges for the chunks inside the file. On local filesystems this step is quick and so there isn't much benefit to caching, but on object storage there is no seek, and LIST and GET requests take long enough that it's well worth caching the result of that step.
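That caching framing can be sketched in a few lines: the expensive scan runs once, its result lands in a small JSON file, and every later open is a pure cache hit (the scanner below is a fabricated stand-in, not kerchunk's actual scanner):

```python
import json
import os
import tempfile

def scan_file(path):
    # Stand-in for the expensive step: seeking through a (possibly remote)
    # file to collect chunk byte ranges. Here we fabricate a tiny result.
    return {"version": 1, "refs": {"var/0.0": [path, 0, 1024]}}

def open_refs(path, cache_path):
    # Reuse the cached scan if present; otherwise scan once and cache it.
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    refs = scan_file(path)
    with open(cache_path, "w") as f:
        json.dump(refs, f)
    return refs

cache = os.path.join(tempfile.mkdtemp(), "refs.json")
first = open_refs("s3://bucket/archive.hdf", cache)   # scans and writes cache
second = open_refs("s3://bucket/archive.hdf", cache)  # pure cache hit
print(first == second)  # → True
```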

With virtualizarr's design you can also think of this workflow as caching the result of xarray.open_mfdataset, therefore saving you the time that step can take on every subsequent access to the data.

@martindurant
Author

a form of caching

That's a decent way to describe it; but some (C) backend libraries won't read from remote or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.

@martindurant
Author

After poking around the trial dataset (MOD14.A2024226.2345.061.2024227034233.hdf - I don't actually know which variant this is within satpy) with pyhdf, I found that my code actually did find exactly the right set of tags and hierarchy.

My question is, when opened with xarray/rasterio/gdal, I get coordinates

  * x            (x) float64 0.5 1.5 2.5 3.5 ... 1.352e+03 1.352e+03 1.354e+03
  * y            (y) float64 0.5 1.5 2.5 3.5 ... 2.028e+03 2.028e+03 2.03e+03

which are not arrays in the file, but clearly generated somehow. Is it just pixel centres??

@mraspaud
Member

a form of caching

That's a decent way to describe it; but some (C) backend libraries won't read from remote or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.

The remote and/or parallel reading is a good point. I don't think hdf4 supports remote reading, and I highly doubt any new development on the library would be done to support it.
And we are using remote reading more and more, so it definitely makes sense to have it implemented for these files somehow.

@mraspaud
Member

After poking around the trial dataset (MOD14.A2024226.2345.061.2024227034233.hdf - I don't actually know which variant this is within satpy) with pyhdf, I found that my code actually did find exactly the right set of tags and hierarchy.

My question is, when opened with xarray/rasterio/gdal, I get coordinates

  * x            (x) float64 0.5 1.5 2.5 3.5 ... 1.352e+03 1.352e+03 1.354e+03
  * y            (y) float64 0.5 1.5 2.5 3.5 ... 2.028e+03 2.028e+03 2.03e+03

which are not arrays in the file, but clearly generated somehow. Is it just pixel centres??

I don't really know what gdal does tbh, but the convention is pixel centres, yes. That being said, I don't think this is something an hdf4 library should generate; that should be more on the user-library side, like rioxarray or satpy really.
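Those printed values are indeed consistent with pixel centres: each coordinate is the pixel index plus 0.5, in pixel units. A tiny sketch of generating such coordinates (pure Python; the grid sizes are taken from the printout above):

```python
# Pixel-centre coordinates for a 1354 x 2030 pixel grid: each value is the
# centre of a unit-sized pixel, i.e. index + 0.5.
x = [i + 0.5 for i in range(1354)]
y = [j + 0.5 for j in range(2030)]
print(x[:4], x[-1])  # → [0.5, 1.5, 2.5, 3.5] 1353.5
```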

@TomNicholas

a form of caching

That's a decent way to describe it; but some (C) backend libraries won't read from remote or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.

The remote and/or parallel reading is a good point.

I'm not sure I understand this point. Once you have the byte offsets and lengths, you don't need to use any specialized backend libraries to read the data, you just read those byte ranges (e.g. via http range requests to object storage). You can do that with as much parallelism as you like, because object storage doesn't have file locks.

@martindurant
Author

Once you have the byte offsets and lengths, you don't need to use any specialized backend libraries to read the data

Exactly true, but if you use one third-party library (like netcdf4), it very probably acts on a file-like object with seek/read and serial, blocking access. Otherwise, you need another layer that knows what to do with those byte blocks.
Only more modern IO libraries (like zarr!) know that not everything is a disk. In one case, fsspec does this offset-finding in order to concurrently fetch many byte ranges, and can speed up reading of parquet by a decent factor in some cases - and that's for the modern arrow library.
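The one-shot, concurrent range reading being described can be sketched with a thread pool over many independent reads (here a local file stands in for object storage, where each range would be an HTTP range request; fsspec's `cat_ranges` does essentially this against remote stores):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length):
    # One independent GET-style read: open, seek, read `length` bytes.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def cat_ranges(path, ranges, max_workers=8):
    # Fetch many byte ranges concurrently instead of one blocking
    # seek/read loop over a single file handle.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: read_range(path, *r), ranges))

# demo against a local stand-in file containing bytes 0..255
path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)))
chunks = cat_ranges(path, [(0, 4), (100, 4), (252, 4)])
print(chunks)  # → [b'\x00\x01\x02\x03', b'defg', b'\xfc\xfd\xfe\xff']
```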
