
Allow HDF Groups #424

Merged: 10 commits into fsspec:main, Feb 29, 2024

Conversation

martindurant
Member

@valeriupredoi, first guess at something that might work. I am not sure, but it's possible that a dataset is still read even if it is not included - to be determined. We also need to understand how this kind of thing behaves for nested HDF groups (you don't normally see these in netCDF-style data).

@valeriupredoi
Contributor

Hi @martindurant very many thanks for starting to look into this so quickly in a PR, mate 🍺

Your code change addresses one issue (in terms of performance): the kerchunking/translation. In my test case it reduces the runtime from about 100s to 75s - which is great, but not ideal 😁 The main problem is that visititems() takes the bulk of the time, and it is run over the entire HDF5/netCDF4 file. Here is a very, very rough POC of how this can be addressed: in translate(self), when it comes to visiting items (building the B-tree), we should do that only for the bit of the HDF5 we need to look at, ie just the Dataset of interest. I could not think of a better way to go from a Dataset back to an HDF5 object than a nasty write-to-disk, but surely there are more elegant ways to convert a Dataset to an HDF5 file in memory:

        if self.var_pattern and self._h5f[self.var_pattern]:
            # grab just the dataset of interest, copy it into a new throwaway
            # HDF5 file, and visit only that file
            dset = self._h5f[self.var_pattern]
            self._h5f = h5py.File('data.hdf5', 'w')
            self._h5f.create_dataset(self.var_pattern, data=dset)
            self._h5f.visititems(self._translator)

this makes my visititems() time drop from 75s to 12s, with most of those 12s clearly spent writing the silly file to disk 🍺
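For reference, a minimal sketch of the same idea without touching disk, using h5py's "core" (in-memory) driver; the function name here is illustrative only, and the dataset values still get copied, so this only helps time the visititems() walk - it does not preserve the original chunk offsets that kerchunk ultimately needs:

import h5py

def visit_single_dataset(h5f, var_pattern, callback):
    # copy just the dataset of interest into an in-memory HDF5 image;
    # backing_store=False means nothing is ever written to disk
    dset = h5f[var_pattern]
    mem = h5py.File("inmem.h5", "w", driver="core", backing_store=False)
    mem.create_dataset(var_pattern, data=dset)
    mem.visititems(callback)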

@valeriupredoi
Contributor

and indeed that ~12s is file IO: if one uses a cached File, that time drops to about 1s just for visititems() 👍

@valeriupredoi
Contributor

Hi @martindurant, I've struggled with h5py all day today, and the best I came up with, in order to restrict visiting items to only the bits of the file one needs (ie only the needed Datasets/variables), is to create an empty Group and store those datasets in it, inside translate():

        lggr.debug("Translation begins")
        self._transfer_attrs(self._h5f, self._zroot)
        if self.var and self._h5f[self.var]:
            if isinstance(self._h5f[self.var], h5py.Dataset):
                import time
                t1 = time.time()
                self._h5f.create_group(self.var + "_adhoc")
                self._h5f[self.var + "_adhoc"][self.var] = self._h5f[self.var]
                self._h5f = self._h5f[self.var + "_adhoc"]
                self._h5f.visititems(self._translator)
                t2 = time.time()
                print("New group creation and visititems took:", t2 - t1)
            else:
                self._h5f = dset  # already a Group
                self._h5f.visititems(self._translator)
        else:
            self._h5f.visititems(self._translator)

thing is, this is ugly (because I don't really know HDF5 all too well, sorry) and not particularly fast; it is, in fact, slower than writing out a file, even though we don't want to transfer any data at all. I hope to High Heavens and trust you have a much more efficient and elegant solution 😃 🍻

@martindurant
Member Author

Would perhaps an even easier solution be to allow passing an arbitrary h5py object to SingleHdf5ToZarr? It already accepts a file path or open file-like. I'll push a possible new version into this PR, see what you think.
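To make that concrete, here is a hedged sketch of how such a call might look if the PR lands that way (the URL and group name below are placeholders):

import fsspec
import h5py
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://bucket/file.nc"  # placeholder URL
with fsspec.open(url, "rb") as f:
    h5 = h5py.File(f, mode="r")
    group = h5["some_group"]  # placeholder: the only part we want scanned
    refs = SingleHdf5ToZarr(group, url, inline_threshold=0).translate()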

@martindurant martindurant marked this pull request as draft February 22, 2024 19:24
kerchunk/hdf.py Outdated
@@ -47,7 +48,7 @@ class SingleHdf5ToZarr:
to BinaryIO is optional), in which case must also provide url. If a str,
file will be opened using fsspec and storage_options.
url : string
URI of the HDF5 file, if passing a file-like object
URI of the HDF5 file, if passing a file-like object or h5py File/dataset
Contributor

hi @martindurant, many thanks for looking into this! Great minds think alike - this is how I made it ingest my subset of the multi-variate file myself earlier today, on a scratch dev version of Kerchunk in my env: I passed an already extracted h5py.Group object. The only hitch with this approach is that if one passes an h5py.Dataset instead, Kerchunk (well, h5py in reality) will complain, since visititems is a valid method only of File or Group objects, not of a Dataset. So in my case I constructed an empty group and plopped the Dataset of interest inside it. The issue with that approach is that one needs to name the new Group something other than the Dataset, hence introducing some extra unwanted overhead
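A rough sketch of that wrapping trick, with illustrative names (as noted further down the thread, the file handle has to be writable for the group creation to be allowed):

import h5py

def wrap_dataset_in_group(dset: h5py.Dataset) -> h5py.Group:
    # a Dataset has no visititems(), so hard-link it into a fresh group whose
    # name does not collide with the dataset's own name (no data is copied)
    grp = dset.file.create_group(dset.name + "_adhoc")
    grp[dset.name.split("/")[-1]] = dset
    return grp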

Member Author

Do you want to put it in a separate PR?

Contributor

the changes I made to Kerchunk are literally the ones you did here (including passing the variable name and looking for it), so there's not much done on the Kerchunk side; most of the other stuff (creating the new Group etc) I did at our end. But if you think it's useful, I'll plop it in Kerchunk, no problemo. I still think it's a bit ugly TBF 😁

Member Author

Right, so we need something to cope with Dataset vs File - maybe just put the diff in here? Yes, I think it's useful.

Contributor

really, just a peasanty workaround to get Kerchunk to be able to run visititems(callback)

Contributor

good man! That's exactly the thing. I'll post them up tomorrow; I have not committed them off my work machine yet, and am home now - dinner time 🍕

Contributor

@valeriupredoi valeriupredoi Feb 23, 2024

hi @martindurant here is my approach in my package (PyActiveStorage):

    elif storage_type == "s3" and storage_options is not None:
        storage_options = storage_options.copy()
        storage_options['default_fill_cache'] = False
        # storage_options['default_cache_type'] = "none"  # big time drain this one
        fs = s3fs.S3FileSystem(**storage_options)
        fs2 = fsspec.filesystem('')
        with fs.open(file_url, 'rb') as s3file:
            s3file = h5py.File(s3file, mode="w")
            if isinstance(s3file[varname], h5py.Dataset):
                print("Looking only at a single Dataset", s3file[varname])
                s3file.create_group(varname + " ")
                s3file[varname + " "][varname] = s3file[varname]
            elif isinstance(s3file[varname], h5py.Group):
                print("Looking only at a single Group", s3file[varname])
                s3file = s3file[varname]
            h5chunks = SingleHdf5ToZarr(s3file, file_url, var=varname,
                                        inline_threshold=0)

and the bit changed in kerchunk/hdf.py is pretty much all you did here, with the added bit that the object becomes just the Group I want kerchunked, so in translate() I plopped this hacky bit:

        if self.var and self._h5f[self.var + " "]:
            self._h5f = self._h5f[self.var + " "]
        print("Visiting the following object", self._h5f)
        self._h5f.visititems(self._translator)

Cheers 🍺

Contributor

a couple more details: I am using kerchunk==0.2.0 in my conda/mamba env (installed from conda-forge), so I can bypass the dependency issue with the pinned numcodecs. Here are some timing results of this approach (changed Kerchunk + conversion to Group and limiting kerchunking to it) vs bog-standard kerchunking of my entire file (which has some 100 variables, with all manner of dimensions, but the bigger ones are shape (30, 30, 350, 420)):

With changed Kerchunk + conversion Dataset to Group
---------------------------------------------------
Visititems took: 2.5403971672058105
Time to Translate and Dump Kerchunks to json file 4.393939018249512
Visititems took: 1.9200255870819092
Time to Translate and Dump Kerchunks to json file 2.7312347888946533
Visititems took: 2.005722761154175
Time to Translate and Dump Kerchunks to json file 2.588365316390991
Visititems took: 1.9823436737060547
Time to Translate and Dump Kerchunks to json file 2.7559237480163574
Visititems took: 1.9835329055786133
Time to Translate and Dump Kerchunks to json file 2.5909011363983154

With regular Kerchunk
---------------------
Visititems took: 4.841791152954102
Time to Translate and Dump Kerchunks to json file 5.548096656799316
Visititems took: 4.454912900924683
Time to Translate and Dump Kerchunks to json file 5.720059156417847
Visititems took: 3.8621530532836914
Time to Translate and Dump Kerchunks to json file 4.593475580215454
Visititems took: 4.457882881164551
Time to Translate and Dump Kerchunks to json file 5.079823732376099
Visititems took: 4.275482177734375
Time to Translate and Dump Kerchunks to json file 4.894218444824219

Kerchunking on a restricted space does indeed improve timings - a factor of 2 or so, it appears, in my particular test case 👍

Contributor

the JSON file containing the Kerchunk indices/Zarr ref data drops from 300k normally to 8k with the restricted approach (this would matter if we were in 1992, though 🤣 )

@valeriupredoi
Contributor

also, worth mentioning that, from my tests, making sure to select the needed variable/Dataset/Group does make a pretty hefty difference in terms of speedup, ie something of order 2-3x (we just found a massive caching issue with our s3fs loader, so we managed to bring the runtime down from 100s to about 10s; that includes about 5-6s of kerchunking for the entire file, which drops to 2-3s when kerchunking only the variable of interest) 👍

@martindurant
Member Author

storage_options['default_cache_type'] = "none"

Type "first" is usually the best option for HDF5.

@valeriupredoi
Contributor

whoa, and what a difference it makes - cheers muchly, Martin! Have a look at these numbers (for the same test as above):

Normal Kerchunk with default cache "first"
------------------------------------------
Visititems took: 1.6179678440093994
Time to Translate and Dump Kerchunks to json file 2.377903461456299
Visititems took: 1.6005499362945557
Time to Translate and Dump Kerchunks to json file 2.2975118160247803
Visititems took: 1.6153967380523682
Time to Translate and Dump Kerchunks to json file 2.6384167671203613
Visititems took: 1.5885121822357178
Time to Translate and Dump Kerchunks to json file 2.504279136657715

Restricted to one Group/Dataset Kerchunk with default cache "first"
-------------------------------------------------------------------
Visititems took: 0.10222649574279785
Time to Translate and Dump Kerchunks to json file 0.7997150421142578
Visititems took: 0.10846853256225586
Time to Translate and Dump Kerchunks to json file 0.7331216335296631
Visititems took: 0.11911702156066895
Time to Translate and Dump Kerchunks to json file 0.82962965965271
Visititems took: 0.10615754127502441
Time to Translate and Dump Kerchunks to json file 0.8147380352020264

@martindurant
Member Author

I tried to adapt your version in the latest commit.

@valeriupredoi
Contributor

ooh that looks very promising, let me take it for a spin 🍺

kerchunk/hdf.py Outdated
self.input_file = fs.open(path, "rb")
elif isinstance(h5f, h5py.Dataset):
group = h5f.file.create_group(f"{h5f.name} ")
group[h5f.name] = h5f
Contributor

this barfs, unfortunately:

activestorage/netcdf_to_zarr.py:46: in gen_json
    h5chunks = SingleHdf5ToZarr(_dataset, file_url,
../miniconda3/envs/pyactive/lib/python3.12/site-packages/kerchunk/hdf.py:108: in __init__
    group[h5f.name] = h5f
../miniconda3/envs/pyactive/lib/python3.12/site-packages/h5py/_hl/group.py:468: in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: Unable to create link (name already exists)

h5py/h5o.pyx:201: OSError

Contributor

sorry, forgot to mention how I am creating the call:

    elif storage_type == "s3" and storage_options is not None:
        storage_options = storage_options.copy()
        storage_options['default_fill_cache'] = False
        storage_options['default_cache_type'] = "first"
        fs = s3fs.S3FileSystem(**storage_options)
        fs2 = fsspec.filesystem('')
        tk1 = time.time()
        with fs.open(file_url, 'rb') as s3file:
            _file = h5py.File(s3file, mode="w")
            _dataset = _file[varname]
            h5chunks = SingleHdf5ToZarr(_dataset, file_url,
                                        inline_threshold=0)

Member Author

This tried to rewrite the file??
I thought there would be a way to update the in-memory version without changing the file at all. If that's not true, it leaves us in a pickle, since remote files can't be rewritten ever (without copying to local, which we don't want).

Contributor

AFAIK, and from my tests, it's not trying to write to the file; the only way it allows group creation is if the file object is opened in write mode. I have not seen any actual data transfers or writes, though - it's just an annoyance that it won't allow new groups with existing dataset names

Member Author

So hacking the name fixes this?

Contributor

Sorry, am on a bus in the English countryside, my typing skills are impacted by terrible roads 🤣

Contributor

Yeah, just giving it any name that doesn't already exist, could be cow-in-field for the matter 😁

Member Author

OK, I did that, let me know what happens.

Contributor

Cheers! This should work now, I'll test on Monday. HDF5 is really strict with its names and such - probably because it's a fairly thin border between a Dataset and a Group, but then again, they should support similar APIs and methods on both

@valeriupredoi
Contributor

valeriupredoi commented Feb 26, 2024

hi @martindurant - the last implementation didn't work either, HDF5 is still complaining that the name exists - a pain in the butt. But tell you what: let the user supply a Group rather than try and cater for the user's Dataset, which is clearly problematic. I made it work really nicely at my (user) end with just a minor change to the __init__() func:

    def __init__(
        self,
        h5f: "BinaryIO | str",
        url: str = None,
        spec=1,
        inline_threshold=500,
        storage_options=None,
        error="warn",
        vlen_encode="embed",
    ):

        # Open HDF5 file in read mode...
        lggr.debug(f"HDF5 file: {h5f}")

        if isinstance(h5f, str):
            fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
            self.input_file = fs.open(path, "rb")
            url = h5f
            self._h5f = h5py.File(self.input_file, mode="r")
        elif isinstance(h5f, io.IOBase):
            self.input_file = h5f
            self._h5f = h5py.File(self.input_file, mode="r")
        elif isinstance(h5f, (h5py.File, h5py.Group)):
            # the minor change: accept an already-open h5py File or Group as-is
            self._h5f = h5f

        self.spec = spec
        self.inline = inline_threshold
        if vlen_encode not in ["embed", "null", "leave", "encode"]:
            raise NotImplementedError
        self.vlen = vlen_encode

        self.store = {}
        self._zroot = zarr.group(store=self.store, overwrite=True)

        self._uri = url
        self.error = error
        lggr.debug(f"HDF5 file URI: {self._uri}")

that's all I need to get restricted kerchunking, since I am building the dummy Group myself, putting the Dataset inside it, and then just supplying that to SingleHdf5ToZarr 😃

@martindurant
Member Author

So you're suggesting removing the Dataset possibility?

@valeriupredoi
Contributor

valeriupredoi commented Feb 27, 2024

So you're suggesting removing the Dataset possibility?

indeed, I think it's too much of a headache to make that work at your end, and as far as I can see it works well at my (user's) end, so probably best to turn it off and leave only the Group input possibility? 🍺

@martindurant martindurant changed the title Allow regex filter in HDF Allow HDF Groups Feb 27, 2024
@martindurant martindurant marked this pull request as ready for review February 27, 2024 14:54
@martindurant
Member Author

Right you are - awaiting your OK.

Contributor

@valeriupredoi valeriupredoi left a comment

cheers muchly, Martin, a quick review from me 🍺

# assume h5py object (File or group/dataset)
self._h5f = h5f
fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
self.input_file = fs.open(path, "rb")
Contributor

I don't think you need these two lines anymore (they certainly mess up my use case, where the file is an S3 object): the file is already loaded as a File object in the first branch of the conditional, and if h5f is an h5py.Group it should be kept that way, with self._h5f set to it

Member Author

_h5f is indeed set to the input two lines above. This exists for any inlining that might happen, which requires getting bytes directly from the original file, not going via h5py.

mess up my use case

What happens? I think providing the URL/options will certainly be required.

Contributor

in my case it's looking for a local file even if I pass valid S3 storage_options - leave it like this for now, I'll need to do a wee bit more testing to understand what's going on, and will get back to you if Kerchunk needs changing 👍

Member Author

The URL starts with "s3://"?

Contributor

yes and no 🤣 It's a very peculiar bucket; the storage_options dict that s3fs recognizes is

{'key': 'xxxx', 'secret': "xxxx", 'client_kwargs': {'endpoint_url': 'https://uor-aces-o.s3-ext.jc.rl.ac.uk'}, 'default_fill_cache': False, 'default_cache_type': 'first'}

and the call to s3fs that is able to read such a strange bucket is as follows:

fs = s3fs.S3FileSystem(**storage_options)
with fs.open(file_url, 'rb') as s3file:
...

but file_url needs to be the truncated form (bucket + file name), ie bnl/da193a_25_day__198807-198807.nc in this case, and s3fs assembles its full URL from the endpoint URL and that truncated bucket + filename. It's odd, and I'm not 100% sure why this type of S3 storage wants that configuration, but the bottom line is that when Kerchunk tries to open it as a regular s3 file it doesn't work: even if I prepend a correct full s3:// path to the file, I get Forbidden access since the storage identification is done wrongly

Member Author

s3://uor-aces-o.s3-ext.jc.rl.ac.uk/bnl/da193a_25_day__198807-198807.nc

This is definitely not the right URL: the first part should be the bucket, not a server name (I'm surprised it even attempts to connect). The URL should be "s3://bnl/da193a_25_day__198807-198807.nc", as the server/endpoint is already included in the storage options.
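In terms of the earlier snippet, the corrected call would then look something like this (credentials are placeholders; the endpoint and path are the ones from this thread):

import s3fs

storage_options = {
    "key": "xxxx", "secret": "xxxx",
    "client_kwargs": {"endpoint_url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk"},
    "default_fill_cache": False,
    "default_cache_type": "first",
}
fs = s3fs.S3FileSystem(**storage_options)
# bucket + key only; the endpoint is supplied via client_kwargs above
with fs.open("s3://bnl/da193a_25_day__198807-198807.nc", "rb") as s3file:
    signature = s3file.read(8)  # e.g. peek at the HDF5 magic bytes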

Contributor

blast! That worked! I knew I'm not doing something right 😆

Contributor

though I am getting fairly long times from visititems() - very much comparable to the times when kerchunking is not restricted to a single Group but is run on the entire file

Contributor

ah that's because this self._h5f = h5py.File(self.input_file, mode="r") is a few lines down 😁

Member Author

(oops, fixed)

martindurant and others added 4 commits February 27, 2024 10:20
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
@valeriupredoi
Contributor

valeriupredoi commented Feb 28, 2024

@martindurant this is cool! So all works fine, up to the point where the kerchunked/Zarr-ed indices are read back from the JSON I dump them to. In this case (and not just for this PR, but for main as well) I am getting a mix-up related to filters: I am seeing both shuffle and zlib(level=1) where I am fairly sure just shuffle should be there. Here is the JSON (a tiny one, since we trim to only the Group of interest):

{"version":1,"refs":{".zgroup":"{\"zarr_format\":2}","m01s06i247_4 \/.zgroup":"{\"zarr_format\":2}","m01s06i247_4 \/m01s06i247_4\/.zarray":"{\"chunks\":[1,39,325,432],\"compressor\":null,\"dtype\":\"<f4\",\"fill_value\":-1073741824.0,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":1}],\"order\":\"C\",\"shape\":[30,39,325,432],\"zarr_format\":2}","m01s06i247_4 \/m01s06i247_4\/.zattrs":"{\"_ARRAY_DIMENSIONS\":[\"time_counter\",\"um_atmos_PLEV39\",\"lat_um_atmos_grid_uv\",\"lon_um_atmos_grid_uv\"],\"cell_measures\":\"area: areacella\",\"cell_methods\":\"area: mean time: mean (interval: 900 s)\",\"coordinates\":\"\",\"interval_offset\":\"0ts\",\"interval_operation\":\"900 s\",\"interval_write\":\"1 d\",\"long_name\":\"U-ACCEL FROM SATURATED STRESS P LEVS\",\"missing_value\":-1073741824.0,\"online_operation\":\"average\",\"standard_name\":\"tendency_of_eastward_wind_due_to_orographic_gravity_wave_drag\",\"units\":\"m s-2\"}","m01s06i247_4 \/m01s06i247_4\/0.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",135877492,4624899],"m01s06i247_4 \/m01s06i247_4\/1.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",216523493,4611615],"m01s06i247_4 \/m01s06i247_4\/2.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",299013844,4522744],"m01s06i247_4 \/m01s06i247_4\/3.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",381537798,4605694],"m01s06i247_4 \/m01s06i247_4\/4.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",464161817,4750355],"m01s06i247_4 \/m01s06i247_4\/5.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",548452685,4796850],"m01s06i247_4 \/m01s06i247_4\/6.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",632710844,4550181],"m01s06i247_4 \/m01s06i247_4\/7.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",716577841,4535335],"m01s06i247_4 \/m01s06i247_4\/8.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",800332289,4734064],"m01s06i247_4 \/m01s06i247_4\/9.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",884381873,4868195],"m01s06i247_4 \/m01s06i247_4\/10.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",968638674,4772854],"m01s06i247_4 \/m01s06i247_4\/11.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1052835756,4572836],"m01s06i247_4 \/m01s06i247_4\/12.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1136776291,4735299],"m01s06i247_4 \/m01s06i247_4\/13.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1220916882,4804409],"m01s06i247_4 \/m01s06i247_4\/14.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1305183232,4832397],"m01s06i247_4 \/m01s06i247_4\/15.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1389488221,4887386],"m01s06i247_4 \/m01s06i247_4\/16.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1473840827,4904544],"m01s06i247_4 \/m01s06i247_4\/17.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1558181189,4866530],"m01s06i247_4 \/m01s06i247_4\/18.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1642467979,4836856],"m01s06i247_4 \/m01s06i247_4\/19.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1726656904,4810839],"m01s06i247_4 \/m01s06i247_4\/20.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1810853207,4901871],"m01s06i247_4 \/m01s06i247_4\/21.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1895164069,4999675],"m01s06i247_4 \/m01s06i247_4\/22.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",1979662794,4850825],"m01s06i247_4 \/m01s06i247_4\/23.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2064149371,4798183],"m01s06i247_4 
\/m01s06i247_4\/24.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2148587291,4796226],"m01s06i247_4 \/m01s06i247_4\/25.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2232967613,4834154],"m01s06i247_4 \/m01s06i247_4\/26.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2321954744,4821715],"m01s06i247_4 \/m01s06i247_4\/27.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2780354968,4666006],"m01s06i247_4 \/m01s06i247_4\/28.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2785020974,4615784],"m01s06i247_4 \/m01s06i247_4\/29.0.0.0":["s3:\/\/bnl\/da193a_25_day__198807-198807.nc",2789636758,4687499]}}

Any ideas what's going on?

@valeriupredoi
Contributor

attaching the file, so it's more readable
test_file.json

@martindurant
Member Author

Is this different behaviour than without filtering the HDF?

@martindurant
Member Author

Also, since it's just JSON: can you edit out the offending filter and see if that's a fix?

@valeriupredoi
Contributor

hi @martindurant, the problem here is that Kerchunk's translator misidentifies the compressor as a filter; see how my Zarr metadata looks when I run the same file through kerchunk=0.2.0:

{"version":1,"refs":{".zgroup":"{\"zarr_format\":2}","m01s06i247_4 \/.zgroup":"{\"zarr_format\":2}","m01s06i247_4 \/m01s06i247_4\/.zarray":"{\"chunks\":[1,39,325,432],\"compressor\":{\"id\":\"zlib\",\"level\":1},\"dtype\":\"<f4\",\"fill_value\":-1073741824.0,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"}],\"order\":\"C\",\"shape\":[30,39,325,432],\"zarr_format\":2}","m01s06i247_4 \/m01s06i247_4\/.zattrs":"{\"_ARRAY_DIMENSIONS\":[\"time_counter\",\"um_atmos_PLEV39\",\"lat_um_atmos_grid_uv\",\"lon_um_atmos_grid_uv\"],\"cell_measures\":\"area: areacella\",\"cell_methods\":\"area: mean time: mean (interval: 900 s)\",\"coordinates\":\"\",\"interval_offset\":\"0ts\",

It finds that my netCDF4 file is indeed compressed with Zlib compression, level=1, but that's not a filter. This is not a problem with this branch, though: it's something that has crept into your main for a while, I reckon, surely after 0.2.0. Incidentally, 0.2.2 can no longer be installed with Python 3.12 since it needs an old numcodecs that's not Py3.12-compatible 🍺

@martindurant
Member Author

is indeed compressed with Zlib compression, level=1, but that's not a filter

In zarr, a compressor is just a special type of filter. So having zlib in filters instead of compressor= is fine, so long as the order of those filters is correct.
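A small numcodecs sketch of that equivalence (the chunk file name is hypothetical; the codec parameters mirror the JSON above): with compressor=null and filters=[shuffle, zlib], decoding applies the filters in reverse order, which is exactly what compressor=zlib plus filters=[shuffle] would do.

import numpy as np
from numcodecs import Shuffle, Zlib

# hypothetical raw chunk bytes, read at the offset/length given in the refs
raw = open("chunk_0.0.0.0.bin", "rb").read()

# filters are applied in reverse on decode: zlib first, then shuffle
decoded = Shuffle(elementsize=4).decode(Zlib(level=1).decode(raw))
arr = np.frombuffer(decoded, dtype="<f4")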

0.2.2 can no longer be installed with Python=3.12 since it needs an old numcodecs that's not Py3.12-compatible

the numcodecs pin has been dropped, maybe not released yet

@valeriupredoi
Contributor

About the numcodecs situation - awesome, cheers! I can help on the feedstock if you need me to, to get the release out. But about the compressor thing, I'm 'fraid that's breaking our bit of the spiel, because we have an s3-reduction engine that runs with a select number of recognizable filters, and it barfs for Zlib(level=1) 😁 Are you guys keen to keep the current implementation that assigns null to the compressor value and adds Zlib to the filters list? If so, we'll have to get the engine changed then 👍

@martindurant
Member Author

You guys keen to keep the current implementation that assigns a null to compressor value and adds Zlib to the filters list

It is certainly convenient in code to manipulate a single list rather than handle multiple kwargs variables. So a change would be needed somewhere. This happened when it became clear that having multiple stages in an HDF decode pipeline was pretty widespread.

@valeriupredoi
Contributor

hi @martindurant, apols for the radio silence - I took the time to fix the wiggles that came up from this PR (and the newer Kerchunk) at our end, and it works really nicely. If you mark this PR ready for review I can approve any time (as long as there are no more API changes that need testing at my end). Very many thanks for the great communication and work done here, mate! I'll sign myself up for kerchunk feedstock maintenance, if that's OK with you, so I can help a bit with the package too 🍺 🖖

@martindurant
Member Author

The feedstock needs zero maintenance, since it's pure python and almost all dependencies are optional and unpinned. Glad to have your help wherever you have capacity, though.

@martindurant martindurant merged commit 7377869 into fsspec:main Feb 29, 2024
5 checks passed
@martindurant martindurant deleted the hdf_filter branch February 29, 2024 21:00
@valeriupredoi
Contributor

brilliant, cheers muchly, mate! 🍺
