Support for reading HDF5 virtual datasets #1828
Conversation
In case HDF5 adds more storage specifications, netcdf4 should be able to cope with them by default. Further specializations could be added nonetheless.
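To illustrate the idea of this commit, here is a minimal sketch of such a catch-all default, assuming a switch over the HDF5 layout in the library's storage/chunking lookup. The surrounding variable and function names are illustrative rather than the actual netcdf-c code; NC_VIRTUAL and NC_UNKNOWN_STORAGE are the constants discussed in this PR.

```c
/* Sketch only: map the HDF5 storage layout onto a netCDF storage kind and
 * fall back to a generic "unknown" value for any layout netCDF does not
 * (yet) know about, instead of failing. */
H5D_layout_t layout;

if ((layout = H5Pget_layout(propid)) < 0)
    BAIL(NC_EHDFERR);

switch (layout)
{
case H5D_CONTIGUOUS:
    var->storage = NC_CONTIGUOUS;
    break;
case H5D_CHUNKED:
    var->storage = NC_CHUNKED;
    break;
case H5D_COMPACT:
    var->storage = NC_COMPACT;
    break;
case H5D_VIRTUAL:                     /* the case this PR is about */
    var->storage = NC_VIRTUAL;
    break;
default:                              /* any other current or future layout */
    var->storage = NC_UNKNOWN_STORAGE;
    break;
}
```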
It seems to be part of the design of HDF5 virtual datasets that objects within a file remain open while the file is already "closed". Setting the fclose degree to SEMI would cause the library to bail out. This commit makes nc_test4/tst_virtual_dataset succeed. See also Unidata#1799
So it seems that changing the fclose-degree does not go unnoticed by all the tests. And honestly, changing it didn't feel quite right while doing it, but I didn't see a better way out of that situation. Does anyone who is more knowledgeable about the codebase have some suggestions?
Well, you are seeing the following error:
Did you run the tests on your machine? It's not clear to me why the change you made would have this effect.
```diff
 if ((fapl_id = H5Pcreate(H5P_FILE_ACCESS)) < 0)
     BAIL(NC_EHDFERR);

-if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_SEMI) < 0)
+if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_WEAK) < 0)
```
This is a somewhat drastic step. Can we discuss?
Can you tell me more about why a virtual file might have objects left open?
I also find the step a bit too drastic. I don't really know the thought process in HDF5 regarding the implementation of virtual datasets. But if you have a look at the links above, this line opens a new reference to the linked file, and almost immediately afterwards the file is closed again, but the previous reference is not released. This leads to an error if fclose_degree is SEMI. Also, since the corresponding function is called H5D__virtual_open_source_dset, I assumed that something is supposed to be left in an open state at the end of the function... The above change was a way to get rid of the error without having to dig further into HDF5, and since h5netcdf could handle this, I thought there should be a way to handle it without changing HDF5.
But yes, I'd be happy to find a different solution.
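For context, a small self-contained sketch of the difference between the two close degrees (plain HDF5, no netCDF involved; the file and dataset names are made up):

```c
#include <hdf5.h>

/* With H5F_CLOSE_SEMI, H5Fclose() fails while any object in the file is
 * still open; with H5F_CLOSE_WEAK, the actual close is simply delayed
 * until the last object is released. Virtual datasets internally keep
 * source-file objects open, which is why SEMI trips over them. */
int main(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fclose_degree(fapl, H5F_CLOSE_SEMI);   /* or H5F_CLOSE_WEAK */

    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, fapl);
    hid_t dset = H5Dopen2(file, "/some_dataset", H5P_DEFAULT);

    /* dset is still open here:
     *   SEMI -> H5Fclose() returns a negative value (an error),
     *   WEAK -> H5Fclose() succeeds and the file really goes away once
     *           dset is closed as well. */
    herr_t status = H5Fclose(file);

    H5Dclose(dset);
    H5Pclose(fapl);
    return status < 0;
}
```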
Did you find a different solution for this?
No, I currently have no idea how to do this differently without changing the virtual dataset implementation of HDF5. Maybe we would have to reach out to the HDF5 developers and ask them about their intention with this option. But I can't start that right now, as I'll be off for a week in a couple of hours.
I've sent a request to the HDF Group to clarify whether the need for H5F_CLOSE_WEAK is intended behaviour or if I am missing something.
I just wanted to briefly get back on this. It turns out that this behaviour is indeed a bug in HDF5. I just got confirmation that a fix is almost ready and should be released soon. However, this fix will effectively force source datasets to be opened as WEAK internally.
This raises the question of how to proceed in this PR:
- Should there be a backwards-compatible solution which runs with current HDF5 by setting the file-close degree to WEAK outside of HDF5 (see the sketch after this list)?
- Should there be a requirement to use HDF5 >= the upcoming release?
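To make the first option concrete, here is a rough sketch of a backwards-compatible choice, assuming a compile-time check against the HDF5 version that will ship the fix. The version number below is only a placeholder (that release is not known yet), and NC4_FCLOSE_DEGREE is an invented name.

```c
/* Sketch: pick the file-close degree depending on the HDF5 version.
 * H5_VERSION_GE is a standard HDF5 macro; 1.10.7 is a placeholder for
 * whichever release first contains the VDS close-degree fix. */
#if H5_VERSION_GE(1, 10, 7)
#define NC4_FCLOSE_DEGREE H5F_CLOSE_SEMI   /* fixed HDF5: keep the stricter setting */
#else
#define NC4_FCLOSE_DEGREE H5F_CLOSE_WEAK   /* older HDF5: work around the VDS issue */
#endif

if (H5Pset_fclose_degree(fapl_id, NC4_FCLOSE_DEGREE) < 0)
    BAIL(NC_EHDFERR);
```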
I think these are good changes. We just need to get them testing correctly...
Yes, but due to a lack of cleverness, I only ran the test while preparing the draft PR. I get the same error as above on my machine and it is triggered by 9f88977. I also don't understand why this happens.
The nc_sync test fails if the settings are different for file creation and opening.
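A sketch of what that implies, assuming the file-access property list is built separately in the create and open paths of libhdf5 (the helper name is invented):

```c
/* Sketch: both the create and the open path must agree on the close
 * degree; HDF5 refuses to open a file that is already open with a
 * different fclose degree, which is presumably what the nc_sync test
 * runs into when it reopens a freshly created file. */
static int nc4_set_fclose_degree(hid_t fapl_id)
{
    if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_WEAK) < 0)
        return NC_EHDFERR;
    return NC_NOERR;
}
/* ...call this from both the file-create and the file-open setup. */
```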
So now Travis is happy, but AppVeyor is not. I am not sure how to interpret its results, though. The other issue is whether Travis is now happy for the right reasons (e.g. …)
Hi @d70-t, thanks for this work. Before we go too much further, I wanted to share a few of our thoughts here internally at Unidata, to provide some context vis-a-vis deciding how to proceed.
"What is netCDF" (see issue Unidata/netcdf#50) and the surrounding discussion has been pretty robust, as we consider the balance between the data storage format ('files', traditionally, but we are being forced to expand our terminology as we move into object storage), the data model, and the software library. One of the historic, oft-emphasized properties of a netCDF data set is that it is self-describing. If we expand to using something like an HDF5 virtual dataset, where data is spread beyond a single file, is it still a netCDF file / is the file still self-describing? What does the corresponding ncdump output look like?
These are clearly a mix of practical and philosophical questions, but they describe a good chunk of what we need to think about before adding support for a feature, even when it is 'read-only support for a particular storage option'; anything netCDF can read, it will need to be able to write using other, supported storage options. Questions are further compounded by the fact that we have had very few requests to support this functionality. The community has instead been pressing us to add support for object storage and Zarr (which is in line with our current efforts).
I say all this not to be discouraging, but rather to start a discussion before you invest too much more time into this. Since putting up this PR, there has been an active internal discussion surrounding some of these questions, and what the potential implications are of adding this support. Thanks -Ward
Hi @WardF, thank you for this wrap up. I think all of these questions are quite important to think about. In particular, the "what is netCDF" issue is indeed quite interesting; I've had this question in my mind several times as well, and I think it is a very important one. For many of my previous works, the netCDF data model has been awesome, and providing that interface to users of the data has mostly been a guarantee that the data could be ingested quickly and robustly. As long as it was possible to read in the data using the netcdf-c library, everyone was happy. However, for several reasons, the actual netCDF storage format (i.e. classic or HDF5) sometimes created larger challenges. I certainly don't know a perfect answer to these problems, but I can share my issues and thoughts about them.
Regarding the self-describing nature of the file, I'd argue that it is still self-describing (apart from the possibility to define the location of the virtual dataset source files via the current working directory or environment variables, which I find a bit disturbing). However, there is a clear departure from the "one file" philosophy. But I'd argue that this is also the case for opendap or zarr resources, and maybe the abstraction of a single file is not the best one for all purposes anyway...
An interesting comment in the "what is netCDF" thread is this one from @lesserwhirls, with a reference to these conventions, as it relates to why I started to try the virtual dataset thing at all. My current problem is that I have to deal with a large set of field campaign data (somewhere in the 100TB ... 1PB range, depending on desired processing levels), so there is some pressure to use disk space efficiently. For a bunch of processing steps, computing (or just reshaping) the data does not take a lot of processing power, but it hurts a lot to store a second copy of the data. So I'd be really interested in having some way to store the data processing steps instead of the actual data. To cope with that problem, I've previously created a custom opendap server, which stores compute graphs for potentially interesting products in a database and subsequently recomputes the actual data upon user request. A simplified and not very polished version of it, which uses xarray, potentially combined with dask, for defining the data sources can be found here, and an application of it here. But running a separate server just to combine some datasets is maybe a bit of an overkill, and maybe also not what I would want due to performance issues (see below).
So I also came across the above-referenced conventions, and I had also heard about the possibility of virtual datasets as well as external datasets in HDF5. I think that in those netCDF conventions there is nothing which would contradict the use of either external storage or virtual datasets in HDF5 to create a compatible netCDF4 dataset, so I just assumed it would work out of the box and hoped that this could help to solve parts of my issues in a simpler way. From my testing, I saw that HDF5 with external storage works flawlessly with netcdf-c, but I got stuck when trying virtual datasets. So based on these observations, I'd argue that it is already possible to define valid netCDF "files" which span across multiple files, and based on the conventions, virtual datasets should also be allowed.
On the other hand, I am also not completely satisfied with the way file references in HDF5 work. The way of locating source files relative to the current working directory (manual and release notes) is kind of weird. Furthermore, it would be great to have the possibility to also add small computations to the data access pipeline (e.g. application of calibration factors, interpolation to a more useful grid). In my particular example, but maybe that applies to other datasets as well, I need way less storage and file transfer bandwidth if the data is stored uncalibrated and on impractical grids, but that particular format makes it really hard for new people to dive into the data. Hosting the transforming code on a server is also only part of the deal, as it makes the data more user friendly and saves on disks, but file transfer bandwidth could still be a limiting factor. On the other hand, implementing data transformation within the file format could also escalate quickly, so that step should definitely be decided with great care. And there might also be the option to run an opendap server on the client side as well. That way users can still access the datasets using netcdf-c in a nice-to-digest format, but storage and file transfer bandwidth is saved.
The option for zarr is also quite interesting, but I think it is a bit orthogonal to the virtual dataset issues. However, that might imply that data concatenation / transformation / etc. with respect to netCDF should maybe also include the possibility to refer to opendap or zarr resources. So there are many things to consider, and some of my problems may not be other people's problems... I really would like to have some way of storing and transporting more or less simple ways of transforming existing datasets, such that users can access these as if they were real datasets in whatever programming language they are used to using, and netCDF seems to be pretty close to that. I am undecided whether the possibilities provided by HDF5 are really how it should be done. I am happy to discuss and hear more about how you think about these issues at Unidata. Cheers,
@ward hard to see that difficulties deciding on an ncdump implementation would torpedo this feature! Ncdump can be figured out.
The philosophy with which I wrote netCDF-4 is that every feature of HDF5 which can be exposed, should be exposed. Somehow that has gotten turned around to imply that I chose all the best features of HDF5, and whatever I left out was left out for a good reason, and so should continue to be left out. But that is not correct. I made every HDF5 feature that I could (time permitting) visible in netCDF. The goal was to allow netCDF users to get all the benefit possible from HDF5 without having to switch their I/O code to HDF5.
This PR presents a change which allows netCDF to access whole new aggregations of data, just like netCDF-Java does. This has proven very popular with the Java community, and I know there will be great interest in the HPC world. This is read-only access that will have no effect on any other netCDF application, so it's hard to see the objection. Just as with the addition of the COMPACT storage type, this change is just filling in some of the missing code of my netCDF-4 implementation. I would have accommodated this in the first release of netCDF-4, if I had had the time and knowledge to do so then.
My 2 cents. p.s. Questions about NC_VIRTUAL|NC_UNKNOWN_STORAGE:
Perhaps it is time to consider the general problem of passing additional information to nc_open/nc_create.
#2 means that we probably cannot use varargs signatures (e.g. f(x,...)).
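Purely as an illustration of the kind of interface being discussed (none of these names exist in netcdf-c; this is a hypothetical sketch, not an agreed proposal), an "extra info" struct avoids the varargs problem because its size and layout are known to the caller and to layers such as PIO:

```c
/* Hypothetical sketch only: a plain struct of simple C types carrying
 * optional open-time information, instead of a varargs signature. */
typedef struct NC_open_info
{
    int fclose_degree;        /* e.g. a WEAK/SEMI preference */
    const char *vds_prefix;   /* where to look for VDS source files */
} NC_open_info;

/* Hypothetical variant of nc_open() taking the extra information. */
int nc_open_with_info(const char *path, int mode,
                      const NC_open_info *info, int *ncidp);
```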
I have questions. If the virtual dataset is composed of other datasets, what happens when one of those underlying files is moved or deleted?
For all errors with the underlying data set (file moved or deleted, etc.) we simply will rely on HDF5 to return an error, and then netCDF will throw up its hands and be done. Whether the underlying HDF5 files can be legally moved or not is really not important to us. We will read a legal virtual file and error out on anything else.
I suggest we hold off introducing a new open/create with additional information. That may be needed in some cases, but not yet. When it is needed we can discuss how to handle it. Using a void pointer is not a great option. I am just confronting this with the filter_actions() function. It's super convenient when we can rely on all parameters to be simple C types. Then we can easily calculate the size of each parameter, which PIO needs to do, for example. (We have a plan to introduce more mode flags by moving the flag to 64 bits, which impacts this perhaps.)
What this PR is shaping up to be is a minor change in libhdf5 which then allows netCDF-4 to read virtual HDF5 files, as if they were just regular HDF5 files, without any netCDF API changes (except the introduction of two new constants). That's a great new feature! Since such files are currently being created by HDF5 systems (not netCDF systems), there is no strong need to add write capability here. We will read these files and let the HDF5 library handle the complexities of writing them. If there is strong desire to write such files in netcdf-c, that will be more of a change, which probably will require some API changes to support the additional info needed.
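To illustrate the "no API changes" point from the consumer side, here is a sketch of reading such a file through the ordinary netCDF API. The file and variable names are made up, the buffer size assumes a small variable, and the idea that the new constants would surface through the chunking inquiry is an assumption, not something stated in the PR.

```c
#include <stdio.h>
#include <netcdf.h>

int main(void)
{
    int ncid, varid, storage;
    float data[100];   /* assumes the (made-up) variable has <= 100 values */

    /* A variable backed by an HDF5 virtual dataset is opened and read
     * exactly like any other netCDF-4 variable. */
    if (nc_open("virtual_example.nc", NC_NOWRITE, &ncid)) return 1;
    if (nc_inq_varid(ncid, "temperature", &varid)) return 1;
    if (nc_get_var_float(ncid, varid, data)) return 1;

    /* Presumably, with this PR the storage inquiry would report NC_VIRTUAL
     * (or NC_UNKNOWN_STORAGE) instead of failing on the unknown layout. */
    if (nc_inq_var_chunking(ncid, varid, &storage, NULL)) return 1;

    printf("storage kind: %d, first value: %g\n", storage, data[0]);
    return nc_close(ncid);
}
```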
I think I am seeing the situation pretty much the way @edhartnett described it. From my perspective, the support for virtual datasets is primarily a way to simplify data usage for people consuming the datasets, thus reading the data is the primary concern. It is true that writing virtual datasets and keeping them in a valid state is not necessarily straightforward. Furthermore, pulling that feature into netCDF may be even more difficult, and writing those datasets is probably equally complicated, either by using HDF5 directly or via a to-be-invented netCDF API. But I'd argue that forwarding the writing feature to netCDF is maybe not needed (at least not right now, maybe never). About NC_UNKNOWN_STORAGE:
One goal of HDF5 virtual datasets is that they can be read as if they were any other old-fashioned standard dataset. If I consider netCDF as a user of the HDF5 API, this translates to: netCDF should not care which kind of storage is used for the dataset, as long as only reading is considered. Thus, if netCDF does not know about the storage kind (no matter if that storage kind is already defined or will be defined in the future), there should be a way of reading the dataset independently of any information about the actual storage kind. That is what NC_UNKNOWN_STORAGE is meant for. So the primary concern is that
There is a possibility to do so using the HDF5_VDS_PREFIX environment variable or the H5P_SET_VIRTUAL_PREFIX HDF5 function. But in general, I think users of such virtual datasets should be made aware that moving these files can be tricky. Personally, I prefer to use relative paths for my datasets, such that I can move files in groups. But again, most likely the virtual dataset feature will be used when the datasets are rather large, which implies that moving the datasets probably needs a bit of thinking anyway. Also, as the virtual dataset file tends to be quite small, re-writing the entire file might well be an option. To conclude: while I see that this change raises a lot of questions, which are definitely worth thinking about (especially in the context of "what is netCDF?"), I'd prefer not to expand this PR discussion too much and instead look at it more as a small fix for the missing
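For reference, a sketch of that relocation mechanism at the plain-HDF5 level (file names, dataset names, and the prefix path are invented; H5Pset_virtual_prefix applies to a dataset access property list and prepends the prefix to the source-file paths stored in the VDS mapping):

```c
#include <hdf5.h>

int main(void)
{
    /* Alternatively, export HDF5_VDS_PREFIX=/new/location/ in the
     * environment before opening the file. */
    hid_t file = H5Fopen("virtual_example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_virtual_prefix(dapl, "/new/location/of/source/files/");

    /* Source files referenced by the virtual dataset are now resolved
     * with the prefix prepended to the stored source-file names. */
    hid_t dset = H5Dopen2(file, "/temperature", dapl);

    H5Dclose(dset);
    H5Pclose(dapl);
    return H5Fclose(file) < 0;
}
```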
This is failing on the Windows build like this: Does the Windows build of HDF5 not include H5D_VIRTUAL?
Thus, there should probably be a check if the installed HDF5 version supports virtual datasets.
Older HDF5 libraries do not support virtual datasets but could otherwise be supported by netCDF4. This change removes the special case to handle HDF5 virtual datasets if the installed HDF5 version does not support virtual datasets.
Yes, because it is required that HDF5 1.8.x continue to work as well. There may already be a pre-processor define that will work for this, or you could just check with #ifdef H5D_VIRTUAL...
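A sketch of such a guard; note, as an assumption on my part, that H5D_VIRTUAL is an enumerator rather than a #define in the HDF5 headers, so a version macro such as H5_VERSION_GE may be the more reliable compile-time check:

```c
/* Sketch: inside the storage-layout handling (see the switch sketched
 * earlier), only compile the virtual-dataset case when building against
 * HDF5 >= 1.10.0, where virtual datasets were introduced. HDF5 1.8.x
 * builds skip it entirely and fall through to the default case. */
#if H5_VERSION_GE(1, 10, 0)
    case H5D_VIRTUAL:
        var->storage = NC_VIRTUAL;   /* constant proposed in this PR */
        break;
#endif
```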
So now I've disabled the check for
which probably is a temporary issue?
The bug in HDF5 preventing use of H5F_CLOSE_SEMI with virtual datasets has been fixed, and the fix will be in all future releases of HDF5. Note that for now, if you explicitly open the source file while simultaneously accessing it through the VDS, you must open the source file with H5F_CLOSE_WEAK. Tobias has assured me this won't be a problem here.
Hey @fortnern, thanks for fixing the bug!
Yes, if NetCDF opens the source file (with H5F_CLOSE_SEMI) and the virtual dataset at the same time, it will fail. Fixing this will require internally changing the scope of the file close flag. I have been pushing for this but we need more discussion on it. I will file a ticket.
I will add that using H5F_CLOSE_STRONG was only an extra precaution on my part. H5F_CLOSE_SEMI will work exactly the same for all netCDF-4 files. I have not been following this issue in detail, but I hope it can be merged - I think it's an excellent improvement for a small amount of code change.
I would also like to support mainlining this into netCDF for a reason not explicitly mentioned above. One of the long-standing struggles with CF conventions is their inability to handle references to external datasets. This is most concretely illustrated by the clumsiness of CMIP (the poster child for CF) datasets at referencing the
I think we are reasonably close to the finish line, so I've un-drafted the PR again. Still, there's the open topic of how we should handle the change to
As you have outlined the choices, I will suggest that changing it to WEAK is the lesser evil.
Nice one 🤣. So then, let's go for the WEAK option.
Reviewing PRs for merging in an attempt to get caught up, tagging @lesserwhirls to make sure I'm not missing some angle as applies to netcdf-java and the divergence from netcdf-c/4 functionality, and also @DennisHeimbigner as a second pair of eyes from the C side of things.
I think adding support for reading HDF5 virtual datasets is reasonable. NetCDF-Java currently does not support that, although I would like to extend it to do so, but I will need to gather up some test data files to take a shot at it. @ethanrd - thoughts?
@lesserwhirls in #1799 (comment), there's a Python script to generate two small example files. The files generated by this script are also attached to the comment. That's not a whole lot of test data, but it should cover the core idea.
Excellent - thank you @d70-t!
@lesserwhirls - I agree, adding support to netCDF-Java for reading HDF5 virtual datasets seems like a useful addition. @WardF - Is there documentation on what non-netCDF formats the C library can read? (Sorry, I may have just missed it.) I think I remember that it can read some HDF4 files and some non-netCDF-4 HDF5 files. Are there others? Does this PR support HDF5 virtual datasets for both netCDF-4 files and non-nc4 HDF5 files? And, Sean, are you thinking the netCDF-Java implementation would support this for both?
@ethanrd - the
@ethanrd HDF4 is one. There are no guarantees that netCDF can read non-netCDF4 HDF5 files, and for those it can read, there are no guarantees that they will be interpreted correctly.
@WardF I must disagree. Reading non-netCDF-4 HDF5 files is a documented feature of netCDF, and very few HDF5 files are not readable by netCDF. We should continue this feature as it is very useful, and pretty easy. Those that can be read are indeed guaranteed to be read correctly. The only exceptions are files with circular groups, those that use some outdated and very rarely used HDF5 types, and also those that use HDF5 references, which I have not figured out how to express in netCDF.
This PR is an attempt in solving #1799. HDF5 recently introduced the possibility to create virtual datasets. These kinds of datasets are especially useful if new datasets should be derived from other datasets (e.g. processing measurement data to higher levels). With virtual datasets some variables can be reused from existing files. This can significantly reduce the need for storage space while improving the user experience for dataset consumers.
It is not yet possible to create virtual datasets using netCDF4. However, in principle it should be possible to create netCDF4-compliant HDF5 files which use this feature (see added test case). In fact, the python library h5netcdf is already able to read such specially crafted HDF5-files, so netcdf-c should be able to do the same.
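For readers unfamiliar with the HDF5 feature, here is a rough sketch of how such a file is produced with plain HDF5 (file and dataset names invented). A real netCDF4-compliant file additionally needs dimension scales and the associated attributes, which are omitted here, so this only shows the virtual-dataset mechanism itself.

```c
#include <hdf5.h>

int main(void)
{
    hsize_t dims[1] = {4};
    double data[4] = {1.0, 2.0, 3.0, 4.0};

    /* 1. An ordinary source file holding the actual data. */
    hid_t src_space = H5Screate_simple(1, dims, NULL);
    hid_t src_file = H5Fcreate("source.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t src_dset = H5Dcreate2(src_file, "temperature", H5T_NATIVE_DOUBLE, src_space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(src_dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(src_dset);
    H5Fclose(src_file);

    /* 2. A second file whose dataset is virtual: it stores only a mapping
     *    onto the source dataset, not the data itself. */
    hid_t vspace = H5Screate_simple(1, dims, NULL);
    H5Sselect_all(vspace);
    H5Sselect_all(src_space);   /* map the whole extent onto the whole source */

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_virtual(dcpl, vspace, "source.h5", "temperature", src_space);

    hid_t vfile = H5Fcreate("virtual.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t vdset = H5Dcreate2(vfile, "temperature", H5T_NATIVE_DOUBLE, vspace,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(vdset);
    H5Pclose(dcpl);
    H5Fclose(vfile);
    H5Sclose(vspace);
    H5Sclose(src_space);
    return 0;
}
```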
To my current knowledge, the compatibility issues between netCDF and HDF5 in this regard are twofold. First, get_chunking_info has no catch-all else case which could cover, in a default way, any new storage mechanism which has been or will be invented by HDF5. Second, the SEMI fclose-degree does not play well with the current way of opening virtual datasets in HDF5. In particular, the HDF5 library opens a name from another file without closing it before the file is closed again. In this case, the reference count to the file is not 0, and if the fclose-degree is set to SEMI, the library will complain.
I am not sure if this is an elegant or complete way of solving the issue, but maybe it is a good start for further discussion.