Allow HDF Groups #424
Conversation
Hi @martindurant very many thanks for starting to look into this so quickly in a PR, mate 🍺 Your code change addresses one issue (in terms of performance): the kerchunking/translation, and does so, in my test case, by reducing the runtime from about 100s to 75s - which is great, but not ideal 😁 The main problem here is `visititems`. With:

```python
if self.var_pattern and self._h5f[self.var_pattern]:
    dset = self._h5f[self.var_pattern]
    self._h5f = h5py.File('data.hdf5', 'w')
    self._h5f.create_dataset(self.var_pattern, data=dset)
    self._h5f.visititems(self._translator)
```

this makes my `visititems` drop from 75s to 12s, clearly with most of those 12s spent writing the silly file to disk 🍺
and indeed that ~12s is file IO, if one uses a cached
Hi @martindurant I've struggled with h5py all day today, and the best I came up with, in order to restrict visiting items only to the bits of the file one needs (ie only the needed Datasets/variables), is to create an empty Group and store those datasets in it:

```python
lggr.debug("Translation begins")
self._transfer_attrs(self._h5f, self._zroot)
if self.var and self._h5f[self.var]:
    if isinstance(self._h5f[self.var], h5py.Dataset):
        import time
        t1 = time.time()
        self._h5f.create_group(self.var + "_adhoc")
        self._h5f[self.var + "_adhoc"][self.var] = self._h5f[self.var]
        self._h5f = self._h5f[self.var + "_adhoc"]
        self._h5f.visititems(self._translator)
        t2 = time.time()
        print("New group creation and visititems took:", t2 - t1)
    else:
        self._h5f = self._h5f[self.var]  # already a Group
        self._h5f.visititems(self._translator)
else:
    self._h5f.visititems(self._translator)
```

Thing is, this is ugly (because I don't really know HDF5 all too well, sorry), and not particularly fast; it is, in fact, slower than writing out a file, though we don't want to transfer any data at all. I hope to High Heavens and trust you have a much more efficient and elegant solution 😃 🍻
Would perhaps an even easier solution be to allow passing an arbitrary h5py object to SingleHdf5ToZarr? It already accepts a file path or open file-like. I'll push a new possible version into this PR, see what you think.
kerchunk/hdf.py
Outdated
```diff
@@ -47,7 +48,7 @@ class SingleHdf5ToZarr:
     to BinaryIO is optional), in which case must also provide url. If a str,
     file will be opened using fsspec and storage_options.
 url : string
-    URI of the HDF5 file, if passing a file-like object
+    URI of the HDF5 file, if passing a file-like object or h5py File/dataset
```
hi @martindurant many thanks for looking into this! Great minds think alike - this is how I made it ingest my subset of the multi-variate file myself, earlier today, on a scratch dev version of Kerchunk in my env: I passed an already extracted `h5py.Group` object. The only hitch with this approach is that if one passes an `h5py.Dataset` instead, Kerchunk (well, `h5py` in reality) will complain, since `visititems` is not a valid method of a `Dataset` but only of `File` or `Group` objects. So in my case, I constructed an empty group where I plopped the `Dataset` of interest. The issue with that approach is that one needs to name the new `Group` something else than the `Dataset`, hence introducing some extra unwanted overhead.
Do you want to put it in a separate PR?
the changes I made to Kerchunk are literally the ones you did here (including passing the variable name and looking for it), so it's not much done on the Kerchunk side; most of the other stuff (creating the new Group etc) I did at our end, but if you think that's useful, I'll plop it in Kerchunk, no problemo. I still think it's a bit ugly TBF 😁
Right, so we need something to cope with Dataset vs File - maybe just put the diff in here? Yes, I think it's useful.
really, just a peasanty workaround to get Kerchunk to be able to run `visititems(callback)`
good man! That's exactly the thing. I'll post them up tomorrow, have not committed them off my work machine yet, and am home now, dinner time 🍕
hi @martindurant here is my approach in my package (PyActiveStorage):

```python
elif storage_type == "s3" and storage_options is not None:
    storage_options = storage_options.copy()
    storage_options['default_fill_cache'] = False
    # storage_options['default_cache_type'] = "none"  # big time drain this one
    fs = s3fs.S3FileSystem(**storage_options)
    fs2 = fsspec.filesystem('')
    with fs.open(file_url, 'rb') as s3file:
        s3file = h5py.File(s3file, mode="w")
        if isinstance(s3file[varname], h5py.Dataset):
            print("Looking only at a single Dataset", s3file[varname])
            s3file.create_group(varname + " ")
            s3file[varname + " "][varname] = s3file[varname]
        elif isinstance(s3file[varname], h5py.Group):
            print("Looking only at a single Group", s3file[varname])
            s3file = s3file[varname]
        h5chunks = SingleHdf5ToZarr(s3file, file_url, var=varname,
                                    inline_threshold=0)
```

and the bit changed in `kerchunk/hdf.py` is pretty much all you did here, with the added bit that the object becomes just the Group I want to get kerchunked, so in `translate()` I plopped this hacky bit:

```python
if self.var and self._h5f[self.var + " "]:
    self._h5f = self._h5f[self.var + " "]
    print("Visiting the following object", self._h5f)
    self._h5f.visititems(self._translator)
```

Cheers 🍺
a couple more details: I am using kerchunk==0.2.0 in my conda/mamba env (installed from conda-forge), so I can bypass the dep issue with pinned numcodecs. Here are some timing results of this approach (changed Kerchunk + conversion to Group and limiting kerchunking to it) vs bog-standard Kerchunking of my entire file (which has some 100 variables, with all manner of dimensions, but the bigger ones are shape (30, 30, 350, 420)):
With changed Kerchunk + conversion Dataset to Group:

```
Visititems took: 2.5403971672058105
Time to Translate and Dump Kerchunks to json file 4.393939018249512
Visititems took: 1.9200255870819092
Time to Translate and Dump Kerchunks to json file 2.7312347888946533
Visititems took: 2.005722761154175
Time to Translate and Dump Kerchunks to json file 2.588365316390991
Visititems took: 1.9823436737060547
Time to Translate and Dump Kerchunks to json file 2.7559237480163574
Visititems took: 1.9835329055786133
Time to Translate and Dump Kerchunks to json file 2.5909011363983154
```

With regular Kerchunk:

```
Visititems took: 4.841791152954102
Time to Translate and Dump Kerchunks to json file 5.548096656799316
Visititems took: 4.454912900924683
Time to Translate and Dump Kerchunks to json file 5.720059156417847
Visititems took: 3.8621530532836914
Time to Translate and Dump Kerchunks to json file 4.593475580215454
Visititems took: 4.457882881164551
Time to Translate and Dump Kerchunks to json file 5.079823732376099
Visititems took: 4.275482177734375
Time to Translate and Dump Kerchunks to json file 4.894218444824219
```
Kerchunking on a restricted space does indeed improve timings - of order a factor of 2, it appears, in my particular test case 👍
the JSON file containing the Kerchunk indices/Zarr ref file data drops from 300k normal to 8k when I do the restricted approach (this would matter if we were in 1992, though 🤣 )
also, worth mentioning that, from my tests, making sure to select the needed variable/Dataset/Group does make a pretty hefty difference in terms of speedup, ie something of order 2-3x (we just found a massive caching issue with our s3fs loader, so managed to bring the runtime down from 100s to about 10s; that includes about 5-6s for Kerchunking the entire file, and that time drops to 2-3s when kerchunking only the variable of interest) 👍
Cache type "first" is usually the best option for HDF5.
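The "first" cache type mentioned here is one of fsspec's registered buffered-file caching strategies; it keeps the first block of a file in memory, which suits HDF5 because the superblock and much of the metadata live at the start of the file. A small sketch, assuming fsspec is installed (the `memory://` path and fake header bytes are made up for illustration):

```python
import fsspec
from fsspec.caching import caches

# fsspec registers cache strategies by name; "first" is the first-chunk cache.
# It is selected per-open via cache_type=..., or as a filesystem default,
# e.g. storage_options = {"default_cache_type": "first"} for s3fs.
assert "first" in caches

# Demonstrate a write/read roundtrip on the in-memory filesystem
with fsspec.open("memory://demo.bin", "wb") as fh:
    fh.write(b"\x89HDF\r\n" + b"\x00" * 100)  # fake header-ish bytes

with fsspec.open("memory://demo.bin", "rb") as fh:
    print(fh.read(4))  # b'\x89HDF'
```

For remote (e.g. s3) buffered files, picking "first" means repeated metadata reads near the file start hit the cache instead of issuing new range requests.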
whoa and a difference it makes, cheers muchly, Martin! Have a look at these numbers (for the same test above):
I tried to adapt your version in the latest commit.
ooh that looks very promising, let me take it for a spin 🍺
kerchunk/hdf.py
Outdated
```python
    self.input_file = fs.open(path, "rb")
elif isinstance(h5f, h5py.Dataset):
    group = h5f.file.create_group(f"{h5f.name} ")
    group[h5f.name] = h5f
```
this barfs, unfortunately:

```
activestorage/netcdf_to_zarr.py:46: in gen_json
    h5chunks = SingleHdf5ToZarr(_dataset, file_url,
../miniconda3/envs/pyactive/lib/python3.12/site-packages/kerchunk/hdf.py:108: in __init__
    group[h5f.name] = h5f
../miniconda3/envs/pyactive/lib/python3.12/site-packages/h5py/_hl/group.py:468: in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   OSError: Unable to create link (name already exists)
h5py/h5o.pyx:201: OSError
```
sorry, forgot to mention how I am creating the call:

```python
elif storage_type == "s3" and storage_options is not None:
    storage_options = storage_options.copy()
    storage_options['default_fill_cache'] = False
    storage_options['default_cache_type'] = "first"
    fs = s3fs.S3FileSystem(**storage_options)
    fs2 = fsspec.filesystem('')
    tk1 = time.time()
    with fs.open(file_url, 'rb') as s3file:
        _file = h5py.File(s3file, mode="w")
        _dataset = _file[varname]
        h5chunks = SingleHdf5ToZarr(_dataset, file_url,
                                    inline_threshold=0)
```
This tried to rewrite the file??
I thought there would be a way to update the in-memory version without changing the file at all. If that's not true, it leaves us in a pickle, since remote files can't be rewritten ever (without copying to local, which we don't want).
AFAIK and from my tests it's not trying to write to file; the only way it allows group creation is if the file object is opened in write mode - have not seen any actual data transfers or writes though, it's just an annoyance that it won't allow new groups with existing dataset names
So hacking the name fixes this?
Sorry, am on a bus in the English countryside, my typing skills are impacted by terrible roads 🤣
Yeah, just giving it any name that doesn't already exist - could be `cow-in-field` for that matter 😁
OK, I did that, let me know what happens.
Cheers! This should work now, I'll test on Monday. HDF5 is really strict with its names and such - probably bc it's a fairly thin border between a Dataset and a Group, but then again, they should support similar APIs and methods on both
hi @martindurant - the last implementation didn't work either, HDF5 still complaining the name exists - pain in the butt, but, tell you what, let the user supply a ready-made h5py object:

```python
def __init__(
    self,
    h5f: "BinaryIO | str",
    url: str = None,
    spec=1,
    inline_threshold=500,
    storage_options=None,
    error="warn",
    vlen_encode="embed",
):
    # Open HDF5 file in read mode...
    lggr.debug(f"HDF5 file: {h5f}")
    if isinstance(h5f, str):
        fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
        self.input_file = fs.open(path, "rb")
        url = h5f
        self._h5f = h5py.File(self.input_file, mode="r")
    elif isinstance(h5f, io.IOBase):
        self.input_file = h5f
        self._h5f = h5py.File(self.input_file, mode="r")
    elif isinstance(h5f, (h5py.File, h5py.Group)):
        self._h5f = h5f

    self.spec = spec
    self.inline = inline_threshold
    if vlen_encode not in ["embed", "null", "leave", "encode"]:
        raise NotImplementedError
    self.vlen = vlen_encode
    self.store = {}
    self._zroot = zarr.group(store=self.store, overwrite=True)
    self._uri = url
    self.error = error
    lggr.debug(f"HDF5 file URI: {self._uri}")
```

that's all I need to get it to do restricted kerchunking, since I am myself building the dummy Group, and putting the Dataset inside it, then I am just supplying that to SingleHdf5ToZarr 😃
So you're suggesting removing the Dataset possibility?
indeed, I think it's too much of a headache to make that work at your end, and as far as I can see it works well at my end (user's end), so prob best to turn it off and only leave the
Right you are - awaiting your OK.
cheers muchly, Martin, a quick review from me 🍺
```python
# assume h5py object (File or group/dataset)
self._h5f = h5f
fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
self.input_file = fs.open(path, "rb")
```
I don't think you need these two lines anymore (they certainly mess up my use case where the file is an S3 object), since the file is loaded as a `File` object up in the first branch of the conditional; if `h5f` is an `h5py.Group` then it should be kept that way, with `self._h5f` set to it.
`_h5f` is indeed set to the input two lines above. This exists for any inlining that might happen, which requires getting bytes directly from the original file, not going via h5py.

> mess up my use case

What happens? I think providing the URL/options will certainly be required.
in my case it's looking for a local file even if I pass valid S3 `storage_options` - leave it like this for now, I'll need to do a wee bit more testing to understand what's going on, and will get back to you if Kerchunk needs changing 👍
The URL starts with "s3://"?
yes and no 🤣 It's a very peculiar bucket; the storage options dict that s3fs recognizes is

```python
{'key': 'xxxx', 'secret': 'xxxx',
 'client_kwargs': {'endpoint_url': 'https://uor-aces-o.s3-ext.jc.rl.ac.uk'},
 'default_fill_cache': False, 'default_cache_type': 'first'}
```

the call to `s3fs` to be able to read such a strange bucket is as follows:

```python
fs = s3fs.S3FileSystem(**storage_options)
with fs.open(file_url, 'rb') as s3file:
    ...
```

but `file_url` needs to be the truncated (bucket + file-name), ie `bnl/da193a_25_day__198807-198807.nc` in this case, and s3fs is assembling its full URL via the endpoint URL and that truncated bucket + filename - it's odd, not 100% sure why this type of s3 storage wants that configuration, but bottom line is, in the case of Kerchunk trying to open it as a regular s3 file, it's not working - even if I prepend a correct full s3://... path to the file, I get Forbidden access since the storage identification is done wrongly
> `s3://uor-aces-o.s3-ext.jc.rl.ac.uk/bnl/da193a_25_day__198807-198807.nc`
This is definitely not the right URL: the first part should be the bucket, not a server name (I'm surprised it even attempts to connect). The URL should be "s3://bnl/da193a_25_day__198807-198807.nc", as the server/endpoint is already included in the storage options.
blast! That worked! I knew I'm not doing something right 😆
though am getting fairly long times from `visititems()` - very much comparable times to the ones where there is no kerchunking done on a single Group, but rather, on the entire file
ah that's because this `self._h5f = h5py.File(self.input_file, mode="r")` is a few lines down 😁
(oops, fixed)
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
@martindurant this is cool! So all works fine, up to the point where the Kerchunked/Zarr-ed indices are being read from the JSON I am dumping them to - in this case (and not just for this PR, but for
Any ideas what's going on?
attaching the file, so it's more readable
Is this different behaviour than without filtering the HDF?
Also, since it's just JSON: can you edit out the offending filter and see if that's a fix?
hi @martindurant the problem here is that Kerchunk's translator misidentifies the
It finds out that my netCDF4 file is indeed compressed with Zlib compression, level=1, but that's not a filter. But this is not a problem from this branch; it is something that's crept up in your
In zarr, a compressor is just a special type of filter. So having zlib in filters instead of compressor= is fine, so long as the order of those filters is correct.
the numcodecs pin has been dropped, maybe not released yet
About the numcodecs situation - awesome, cheers! I can help on the feedstock if you need me to, get the release out. But about the compressor thing, am 'fraid that's breaking our bit of the spiel, because we have an s3-reduction engine that runs with a select number of recognizable filters, and it barfs for
It is certainly convenient in code to manipulate a single list rather than handle multiple kwargs variables. So a change would be needed somewhere. This happened when it became clear that having multiple stages in an HDF decode pipeline was pretty widespread.
hi @martindurant apols for the radio silence, I took the time to fix the wiggles that came up from this PR (and the newer Kerchunk) at our end, and it works really nicely - if you make this PR RfR I can approve any time (as long as there are no more API changes that need testing at my end). Very many thanks for the great communication and work done here, mate! I'll sign myself up for kerchunk feedstock maintenance, if that's OK with you, so I can help a bit with the package too 🍺 🖖
The feedstock needs zero maintenance, since it's pure python and almost all dependencies are optional and unpinned. Glad to have your help wherever you have capacity, though.
brilliant, cheers muchly, mate! 🍺
@valeriupredoi, first guess at something that might work. I am not sure, but it's possible that even if a dataset is not included, it is still read - to be determined. We also need to understand how this kind of thing behaves for nested HDF groups (you don't normally see these in netCDF-style data).