Allow HDF Groups #424
Conversation
Hi @martindurant very many thanks for starting to look into this so quickly in a PR, mate 🍺 Your code change addresses one issue (in terms of performance): the kerchunking/translation, and does so, in my test case, by reducing the runtime from about 100s to 75s - which is great, but not ideal 😁 The main problem here is `visititems`. With:

```python
if self.var_pattern and self._h5f[self.var_pattern]:
    dset = self._h5f[self.var_pattern]
    self._h5f = h5py.File('data.hdf5', 'w')
    self._h5f.create_dataset(self.var_pattern, data=dset)
    self._h5f.visititems(self._translator)
```

this makes my `visititems` drop from 75s to 12s, clearly with most of those 12s spent writing the silly file to disk 🍺
and indeed that ~12s is file IO, if one uses a cached
Hi @martindurant I've struggled with h5py all day today, and the best I came up with, in order to restrict visiting items only to the bits of the file one needs (ie only the needed Datasets/variables), is to create an empty Group and store those datasets in it:

```python
lggr.debug("Translation begins")
self._transfer_attrs(self._h5f, self._zroot)
if self.var and self._h5f[self.var]:
    if isinstance(self._h5f[self.var], h5py.Dataset):
        import time
        t1 = time.time()
        self._h5f.create_group(self.var + "_adhoc")
        self._h5f[self.var + "_adhoc"][self.var] = self._h5f[self.var]
        self._h5f = self._h5f[self.var + "_adhoc"]
        self._h5f.visititems(self._translator)
        t2 = time.time()
        print("New group creation and visititems took:", t2 - t1)
    else:
        self._h5f = self._h5f[self.var]  # already a Group
        self._h5f.visititems(self._translator)
else:
    self._h5f.visititems(self._translator)
```

Thing is, this is ugly (because I don't really know HDF5 all too well, sorry), and not particularly fast; it is, in fact, slower than writing out a file, though we don't want to transfer any data at all. I hope to High Heavens and trust you have a much more efficient and elegant solution 😃 🍻
Would perhaps an even easier solution be to allow passing an arbitrary h5py object to SingleHdf5ToZarr? It already accepts a file path or open file-like. I'll push a new possible version into this PR, see what you think.
kerchunk/hdf.py
Outdated
```diff
@@ -47,7 +48,7 @@ class SingleHdf5ToZarr:
     to BinaryIO is optional), in which case must also provide url. If a str,
     file will be opened using fsspec and storage_options.
 url : string
-    URI of the HDF5 file, if passing a file-like object
+    URI of the HDF5 file, if passing a file-like object or h5py File/dataset
```
hi @martindurant many thanks for looking into this! Great minds think alike - this is how I made it ingest my subset of the multi-variate file myself, earlier today, on a scratch dev version of Kerchunk in my env: I passed an already extracted `h5py.Group` object. The only hitch with this approach is that if one passes an `h5py.Dataset` instead, Kerchunk (well, `h5py` in reality) will complain, since `visititems` is not a valid method of a `Dataset` but only of `File` or `Group` objects. So in my case, I constructed an empty group where I plopped the `Dataset` of interest. The issue with that approach is that one needs to name the new `Group` something else than the `Dataset`, hence introducing some extra unwanted overhead.
Do you want to put it in a separate PR?
the changes I made to Kerchunk are literally the ones you did here (including passing the variable name and looking for it), so it's not much done on the Kerchunk side; most of the other stuff (creating the new Group etc) I did at our end, but if you think that's useful, I'll plop it in Kerchunk, no problemo. I still think it's a bit ugly TBF 😁
Right, so we need something to cope with Dataset vs File - maybe just put the diff in here? Yes, I think it's useful.
really, just a peasanty workaround to get Kerchunk to be able to run `visititems(callback)`
good man! That's exactly the thing. I'll post them up tomorrow, have not committed them off my work machine yet, and am home now, dinner time 🍕
hi @martindurant here is my approach in my package (PyActiveStorage):

```python
elif storage_type == "s3" and storage_options is not None:
    storage_options = storage_options.copy()
    storage_options['default_fill_cache'] = False
    # storage_options['default_cache_type'] = "none"  # big time drain this one
    fs = s3fs.S3FileSystem(**storage_options)
    fs2 = fsspec.filesystem('')
    with fs.open(file_url, 'rb') as s3file:
        s3file = h5py.File(s3file, mode="w")
        if isinstance(s3file[varname], h5py.Dataset):
            print("Looking only at a single Dataset", s3file[varname])
            s3file.create_group(varname + " ")
            s3file[varname + " "][varname] = s3file[varname]
        elif isinstance(s3file[varname], h5py.Group):
            print("Looking only at a single Group", s3file[varname])
            s3file = s3file[varname]
        h5chunks = SingleHdf5ToZarr(s3file, file_url, var=varname,
                                    inline_threshold=0)
```

and the bit changed in `kerchunk/hdf.py` is pretty much all you did here, with the added bit that the object becomes just the Group I want to get kerchunked, so in `translate()` I plopped this hacky bit:

```python
if self.var and self._h5f[self.var + " "]:
    self._h5f = self._h5f[self.var + " "]
    print("Visiting the following object", self._h5f)
    self._h5f.visititems(self._translator)
```

Cheers 🍺
a couple more details: I am using kerchunk==0.2.0 in my conda/mamba env (installed from conda-forge), so I can bypass the dep issue with pinned numcodecs. Here are some timing results of this approach (changed Kerchunk + conversion to Group and limiting kerchunking to it) vs bog-standard Kerchunking of my entire file (which has some 100 variables, with all manner of dimensions, but the bigger ones are shape (30, 30, 350, 420)):
With changed Kerchunk + conversion Dataset to Group:

```
Visititems took: 2.5403971672058105
Time to Translate and Dump Kerchunks to json file 4.393939018249512
Visititems took: 1.9200255870819092
Time to Translate and Dump Kerchunks to json file 2.7312347888946533
Visititems took: 2.005722761154175
Time to Translate and Dump Kerchunks to json file 2.588365316390991
Visititems took: 1.9823436737060547
Time to Translate and Dump Kerchunks to json file 2.7559237480163574
Visititems took: 1.9835329055786133
Time to Translate and Dump Kerchunks to json file 2.5909011363983154
```

With regular Kerchunk:

```
Visititems took: 4.841791152954102
Time to Translate and Dump Kerchunks to json file 5.548096656799316
Visititems took: 4.454912900924683
Time to Translate and Dump Kerchunks to json file 5.720059156417847
Visititems took: 3.8621530532836914
Time to Translate and Dump Kerchunks to json file 4.593475580215454
Visititems took: 4.457882881164551
Time to Translate and Dump Kerchunks to json file 5.079823732376099
Visititems took: 4.275482177734375
Time to Translate and Dump Kerchunks to json file 4.894218444824219
```
Kerchunking on a restricted space does indeed improve timings - of order a factor of 2, it appears, in my particular test case 👍
the JSON file containing the Kerchunk indices/Zarr ref file data drops from 300k normal to 8k when I do the restricted approach (this would matter if we were in 1992, though 🤣 )
also, worth mentioning that, from my tests, making sure to select the needed variable/Dataset/Group does make a pretty hefty difference in terms of speedup, ie something of order 2-3x (we just found a massive caching issue with our s3fs loader, so managed to bring the runtime down from 100s to about 10s; that includes about 5-6s for Kerchunking the entire file, and that time drops to 2-3s when kerchunking only the variable of interest) 👍
Cache type "first" is usually the best option for HDF5.
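The "first" cache type mentioned here is one of fsspec's registered buffered-file caching strategies; it keeps the first block of a file in memory, which suits HDF5 because the superblock and much of the metadata live at the start of the file. A small sketch, assuming fsspec is installed (the `memory://` path and fake header bytes are made up for illustration):

```python
import fsspec
from fsspec.caching import caches

# fsspec registers cache strategies by name; "first" is the first-chunk cache.
# It is selected per-open via cache_type=..., or as a filesystem default,
# e.g. storage_options = {"default_cache_type": "first"} for s3fs.
assert "first" in caches

# Demonstrate a write/read roundtrip on the in-memory filesystem
with fsspec.open("memory://demo.bin", "wb") as fh:
    fh.write(b"\x89HDF\r\n" + b"\x00" * 100)  # fake header-ish bytes

with fsspec.open("memory://demo.bin", "rb") as fh:
    print(fh.read(4))  # b'\x89HDF'
```

For remote (e.g. s3) buffered files, picking "first" means repeated metadata reads near the file start hit the cache instead of issuing new range requests.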
whoa and a difference it makes, cheers muchly, Martin! Have a look at these numbers (for the same test above):
I tried to adapt your version in the latest commit.
ooh that looks very promising, let me take it for a spin 🍺
kerchunk/hdf.py
Outdated
```python
    self.input_file = fs.open(path, "rb")
elif isinstance(h5f, h5py.Dataset):
    group = h5f.file.create_group(f"{h5f.name} ")
    group[h5f.name] = h5f
```
this barfs, unfortunately:

```
activestorage/netcdf_to_zarr.py:46: in gen_json
    h5chunks = SingleHdf5ToZarr(_dataset, file_url,
../miniconda3/envs/pyactive/lib/python3.12/site-packages/kerchunk/hdf.py:108: in __init__
    group[h5f.name] = h5f
../miniconda3/envs/pyactive/lib/python3.12/site-packages/h5py/_hl/group.py:468: in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   OSError: Unable to create link (name already exists)
h5py/h5o.pyx:201: OSError
```
sorry, forgot to mention how I am creating the call:

```python
elif storage_type == "s3" and storage_options is not None:
    storage_options = storage_options.copy()
    storage_options['default_fill_cache'] = False
    storage_options['default_cache_type'] = "first"
    fs = s3fs.S3FileSystem(**storage_options)
    fs2 = fsspec.filesystem('')
    tk1 = time.time()
    with fs.open(file_url, 'rb') as s3file:
        _file = h5py.File(s3file, mode="w")
        _dataset = _file[varname]
        h5chunks = SingleHdf5ToZarr(_dataset, file_url,
                                    inline_threshold=0)
```
This tried to rewrite the file??
I thought there would be a way to update the in-memory version without changing the file at all. If that's not true, it leaves us in a pickle, since remote files can't be rewritten ever (without copying to local, which we don't want).
AFAIK and from my tests it's not trying to write to file; the only way it allows group creation is if the file object is opened in write mode - have not seen any actual data transfers or writes though, it's just an annoyance that it won't allow new groups with existing dataset names
So hacking the name fixes this?
Sorry, am on a bus in the English countryside, my typing skills are impacted by terrible roads 🤣
Yeah, just giving it any name that doesn't already exist - could be `cow-in-field` for that matter 😁
OK, I did that, let me know what happens.
Cheers! This should work now, I'll test on Monday. HDF5 is really strict with its names and such - probably bc it's a fairly thin border between a Dataset and a Group, but then again, they should support similar APIs and methods on both
hi @martindurant - the last implementation didn't work either, HDF5 still complaining the name exists - pain in the butt, but, tell you what, let the user supply a ready-made h5py object:

```python
def __init__(
    self,
    h5f: "BinaryIO | str",
    url: str = None,
    spec=1,
    inline_threshold=500,
    storage_options=None,
    error="warn",
    vlen_encode="embed",
):
    # Open HDF5 file in read mode...
    lggr.debug(f"HDF5 file: {h5f}")
    if isinstance(h5f, str):
        fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
        self.input_file = fs.open(path, "rb")
        url = h5f
        self._h5f = h5py.File(self.input_file, mode="r")
    elif isinstance(h5f, io.IOBase):
        self.input_file = h5f
        self._h5f = h5py.File(self.input_file, mode="r")
    elif isinstance(h5f, (h5py.File, h5py.Group)):
        self._h5f = h5f

    self.spec = spec
    self.inline = inline_threshold
    if vlen_encode not in ["embed", "null", "leave", "encode"]:
        raise NotImplementedError
    self.vlen = vlen_encode
    self.store = {}
    self._zroot = zarr.group(store=self.store, overwrite=True)
    self._uri = url
    self.error = error
    lggr.debug(f"HDF5 file URI: {self._uri}")
```

that's all I need to get it to do restricted kerchunking, since I am myself building the dummy Group, and putting the Dataset inside it, then I am just supplying that to SingleHdf5ToZarr 😃
So you're suggesting removing the Dataset possibility?
indeed, I think it's too much of a headache to make that work at your end, and as far as I can see it works well at my end (user's end), so prob best to turn it off and only leave the
Right you are - awaiting your OK.
cheers muchly, Martin, a quick review from me 🍺
```python
# assume h5py object (File or group/dataset)
self._h5f = h5f
fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
self.input_file = fs.open(path, "rb")
```
I don't think you need these two lines anymore (they certainly mess up my use case where the file is an S3 object), since the file is loaded as a `File` object up in the first branch of the conditional; if `h5f` is an `h5py.Group` then it should be kept that way, with `self._h5f` set to it.
`_h5f` is indeed set to the input two lines above. This exists for any inlining that might happen, which requires getting bytes directly from the original file, not going via h5py.

> mess up my use case

What happens? I think providing the URL/options will certainly be required.
in my case it's looking for a local file even if I pass valid S3 `storage_options` - leave it like this for now, I'll need to do a wee bit more testing to understand what's going on, and will get back to you if Kerchunk needs changing 👍
The URL starts with "s3://"?
yes and no 🤣 It's a very peculiar bucket; the storage options dict that s3fs recognizes is

```python
{'key': 'xxxx', 'secret': 'xxxx',
 'client_kwargs': {'endpoint_url': 'https://uor-aces-o.s3-ext.jc.rl.ac.uk'},
 'default_fill_cache': False, 'default_cache_type': 'first'}
```

the call to `s3fs` to be able to read such a strange bucket is as follows:

```python
fs = s3fs.S3FileSystem(**storage_options)
with fs.open(file_url, 'rb') as s3file:
    ...
```

but `file_url` needs to be the truncated (bucket + file-name), ie `bnl/da193a_25_day__198807-198807.nc` in this case, and s3fs is assembling its full URL via the endpoint URL and that truncated bucket + filename - it's odd, not 100% sure why this type of s3 storage wants that configuration, but bottom line is, in the case of Kerchunk trying to open it as a regular s3 file, it's not working - even if I prepend a correct full s3://... path to the file, I get Forbidden access since the storage identification is done wrongly
> `s3://uor-aces-o.s3-ext.jc.rl.ac.uk/bnl/da193a_25_day__198807-198807.nc`
This is definitely not the right URL: the first part should be the bucket, not a server name (I'm surprised it even attempts to connect). The URL should be "s3://bnl/da193a_25_day__198807-198807.nc", as the server/endpoint is already included in the storage options.
blast! That worked! I knew I'm not doing something right 😆
though am getting fairly long times from `visititems()` - very much comparable times to the ones where there is no kerchunking done on a single Group, but rather, on the entire file
ah that's because this `self._h5f = h5py.File(self.input_file, mode="r")` is a few lines down 😁
(oops, fixed)
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
@martindurant this is cool! So all works fine, up to the point where the Kerchunked/Zarr-ed indices are being read from the JSON I am dumping them to - in this case (and not just for this PR, but for
Any ideas what's going on?
attaching the file, so it's more readable
Is this different behaviour than without filtering the HDF?
Also, since it's just JSON: can you edit out the offending filter and see if that's a fix?
hi @martindurant the problem here is that Kerchunk's translator misidentifies the
It finds out that my netCDF4 file is indeed compressed with Zlib compression, level=1, but that's not a filter. But this is not a problem from this branch; it is something that's crept up in your
In zarr, a compressor is just a special type of filter. So having zlib in filters instead of compressor= is fine, so long as the order of those filters is correct.
the numcodecs pin has been dropped, maybe not released yet
About the numcodecs situation - awesome, cheers! I can help on the feedstock if you need me to, get the release out. But about the compressor thing, am 'fraid that's breaking our bit of the spiel, because we have an s3-reduction engine that runs with a select number of recognizable filters, and it barfs for
It is certainly convenient in code to manipulate a single list rather than handle multiple kwargs variables. So a change would be needed somewhere. This happened when it became clear that having multiple stages in an HDF decode pipeline was pretty widespread.
hi @martindurant apols for the radio silence, I took the time to fix the wiggles that came up from this PR (and the newer Kerchunk) at our end, and it works really nicely - if you make this PR RfR I can approve any time (as long as there are no more API changes that need testing at my end). Very many thanks for the great communication and work done here, mate! I'll sign myself up for kerchunk feedstock maintenance, if that's OK with you, so I can help a bit with the package too 🍺 🖖
The feedstock needs zero maintenance, since it's pure python and almost all dependencies are optional and unpinned. Glad to have your help wherever you have capacity, though.
brilliant, cheers muchly, mate! 🍺
@valeriupredoi, first guess at something that might work. I am not sure, but it's possible that even if a dataset is not included, it is still read - to be determined. We also need to understand how this kind of thing behaves for nested HDF groups (you don't normally see these in netCDF-style data).