
Support for reading HDF5 virtual datasets #1828

Merged — 10 commits, Jul 29, 2021

Conversation

@d70-t (Contributor) commented Sep 2, 2020

This PR is an attempt at solving #1799. HDF5 recently introduced the possibility to create virtual datasets. These kinds of datasets are especially useful when new datasets are derived from existing ones (e.g. processing measurement data to higher levels): with virtual datasets, some variables can be reused from existing files. This can significantly reduce the need for storage space while improving the user experience for dataset consumers.

It is not yet possible to create virtual datasets using netCDF4. However, in principle it should be possible to create netCDF4-compliant HDF5 files which use this feature (see the added test case). In fact, the Python library h5netcdf is already able to read such specially crafted HDF5 files, so netcdf-c should be able to do the same.
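
For reference, the core of declaring a virtual dataset with the plain HDF5 C API looks roughly like the sketch below. This is an illustration only: the file names, dataset name, and shape are made up, and the dimension scales and attributes needed for a fully netCDF4-compliant file are omitted.

#include <hdf5.h>

/* Sketch: create virtual.h5 containing a dataset "x" whose data lives in source.h5. */
static hid_t create_virtual_file(void)
{
    hsize_t dims[1] = {100};                           /* illustrative shape */
    hid_t vspace   = H5Screate_simple(1, dims, NULL);  /* extent of the virtual dataset */
    hid_t srcspace = H5Screate_simple(1, dims, NULL);  /* extent inside the source file */
    hid_t dcpl     = H5Pcreate(H5P_DATASET_CREATE);

    /* Map the whole virtual extent onto dataset "x" stored in source.h5. */
    H5Sselect_all(vspace);
    H5Sselect_all(srcspace);
    H5Pset_virtual(dcpl, vspace, "source.h5", "x", srcspace);

    /* The resulting file exposes "x" without storing its data itself. */
    hid_t file = H5Fcreate("virtual.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, vspace, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    return file;
}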

To my current knowledge, the compatibility issues between netCDF and HDF5 in this regard are twofold. First, get_chunking_info has no catch-all else case that would handle, in a default way, any new storage mechanism that HDF5 has introduced or will introduce. Second, the SEMI fclose-degree does not play well with the way HDF5 currently opens virtual datasets: the HDF5 library opens a name from another file without closing it before that file is closed again. The reference count of the file is therefore not 0 at close time, and if the fclose-degree is set to SEMI, the library will complain.
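
The first point amounts to mapping any layout that netCDF does not recognize onto a generic storage value instead of failing. A minimal sketch of that idea follows (not the exact netcdf-c function: datasetid stands for the open HDF5 dataset, var->storage for the netCDF-internal variable metadata, and NC_UNKNOWN_STORAGE / NC_VIRTUAL are the constants proposed in this PR):

/* Determine the HDF5 storage layout and fall back to a generic value
 * for anything netCDF does not know about (virtual or future layouts). */
hid_t propid;
H5D_layout_t layout;

if ((propid = H5Dget_create_plist(datasetid)) < 0)
    return NC_EHDFERR;
if ((layout = H5Pget_layout(propid)) < 0)
    return NC_EHDFERR;

if (layout == H5D_CHUNKED)
    var->storage = NC_CHUNKED;
else if (layout == H5D_CONTIGUOUS)
    var->storage = NC_CONTIGUOUS;
else if (layout == H5D_COMPACT)
    var->storage = NC_COMPACT;
else
    var->storage = NC_UNKNOWN_STORAGE;   /* catch-all instead of an error */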

I am not sure if this is an elegant or complete way of solving the issue, but maybe it is a good start for further discussion.

d70-t added 4 commits July 22, 2020 17:55
In case HDF5 adds more storage specifications, netcdf4 should be able to
cope with them by default. Further specializations could be added
nonetheless.
It seems to be part of the design of HDF5 virtual datasets that
objects within a file remain open while the file is already "closed".
Setting the fclose degree to SEMI would cause the library to bail out.
This commit makes nc_test4/tst_virtual_dataset succeed.

See also Unidata#1799
@CLAassistant commented Sep 2, 2020

CLA assistant check
All committers have signed the CLA.

@d70-t (Contributor, Author) commented Sep 2, 2020

So it seems that changing the fclose-degree does not go unnoticed by the rest of the tests. And honestly, changing it didn't feel right while doing it either, but I didn't see a more proper way out of that situation. Does anyone who is more knowledgeable about the codebase have some suggestions?

@edwardhartnett (Contributor)

Well you are seeing the following error:

12180*** testing nc_redef ...  1386 good comparisons.  34 good comparisons.  1386 good comparisons.  34 good comparisons. ok
12181*** testing nc_sync ... 
12182	FAILURE at line 365 of /home/tester/build-netcdf-c/nc_test/test_write.c: open: NetCDF: HDF error
12183	FAILURE at line 368 of /home/tester/build-netcdf-c/nc_test/test_write.c: sync of ncidr failed: NetCDF: Not a valid ID
12184	FAILURE at line 956 of /home/tester/netcdf-c/nc_test/util.c: nc_inq_dim: NetCDF: Not a valid ID
12185	FAILURE at line 958 of /home/tester/netcdf-c/nc_test/util.c: Unexpected name of dimension 0: '����', expected: 'Dr'
12186	FAILURE at line 960 of /home/tester/netcdf-c/nc_test/util.c: Unexpected length 0 of dimension 0, expected 2
12187	FAILURE at line 956 of /home/tester/netcdf-c/nc_test/util.c: nc_inq_dim: NetCDF: Not a valid ID
12188	FAILURE at line 958 of /home/tester/netcdf-c/nc_test/util.c: Unexpected name of dimension 1: '����', expected: 'D1'
12189	FAILURE at line 960 of /home/tester/netcdf-c/nc_test/util.c: Unexpected length 0 of dimension 1, expected 1nc_test: /home/tester/netcdf-c/nc_test/util.c:438: hash: Assertion `0' faile

Did you run tests on your machine? It's not clear to me why the change you made would have this effect.

  if ((fapl_id = H5Pcreate(H5P_FILE_ACCESS)) < 0)
      BAIL(NC_EHDFERR);

- if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_SEMI) < 0)
+ if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_WEAK) < 0)
@edwardhartnett (Contributor)

This is a somewhat drastic step. Can we discuss?

Can you tell me more about why a virtual file might have objects left open?

@d70-t (Contributor, Author) — Sep 2, 2020

I also find the step a bit too drastic. I don't really know the thought process within HDF5 regarding the implementation of virtual datasets, but if you look at the links above, this line opens a new reference to the linked file, and almost immediately afterwards the file is closed again while the previous reference is not released. This leads to an error if fclose_degree is SEMI. Also, since the corresponding function is called H5D__virtual_open_source_dset, I assumed that something is supposed to be left in an open state at the end of the function... The above change was a way to get rid of the error without having to dig further into HDF5, and since h5netcdf can handle this, I thought there should be a way to handle it without changing HDF5.
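
To illustrate the difference, here is a minimal sketch using plain HDF5 (not the netcdf-c code; the helper name is made up). With H5F_CLOSE_SEMI, H5Fclose() fails while any object in the file is still open; with H5F_CLOSE_WEAK, the handle is invalidated but the file stays open until the last object referencing it is closed.

#include <hdf5.h>

/* Open a file read-only with an explicit file-close degree. */
static hid_t open_with_degree(const char *path, H5F_close_degree_t degree)
{
    hid_t fapl, file;

    if ((fapl = H5Pcreate(H5P_FILE_ACCESS)) < 0)
        return -1;
    if (H5Pset_fclose_degree(fapl, degree) < 0) {
        H5Pclose(fapl);
        return -1;
    }
    file = H5Fopen(path, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}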

But yes, I'd be happy to find a different solution.

Contributor

Did you find a different solution for this?

@d70-t (Contributor, Author)

No, I have currently no clue how to do that differently without changing the virtual dataset implementation of HDF5. Maybe we would have to reach out to the HDF5 developers and ask them about their intention on this option. But I can't start that right now, as I'll be off for a week in a couple of hours.

@d70-t (Contributor, Author)

I've sent a request to HDFgroup to clarify if the need for H5F_CLOSE_WEAK is intended behaviour or if I am missing something.

@d70-t (Contributor, Author)

I just wanted to briefly get back on this. It turns out that this behaviour is indeed a bug in HDF5. I just got confirmation that a fix is almost ready and should be released soon. This fix, however, will effectively force source datasets to be opened as WEAK internally.

This raises the question of how to proceed with this PR:

  • Should there be a backwards-compatible solution which runs with current HDF5 by setting file opening WEAK outside of HDF5?
  • Should there be a requirement to use HDF5 >= the upcoming release?

@edwardhartnett (Contributor)

I think these are good changes. We just need to get them testing correctly...

@d70-t (Contributor, Author) commented Sep 2, 2020

> Did you run tests on your machine? It's not clear to me why the change you made would have this effect.

Yes, but due to a lack of cleverness, I only ran the test while preparing the draft PR. I get the same error as above on my machine and it is triggered by 9f88977. I also don't understand why this happens.

@d70-t (Contributor, Author) commented Sep 2, 2020

So now travis is happy, but appveyor is not. I am not sure how to interpret its results, though.

The other question is whether travis is now happy for the right reasons (i.e. because of H5F_CLOSE_WEAK) or whether it is a bad idea to introduce this change.

@WardF (Member) commented Sep 2, 2020

Hi @d70-t, thanks for this work. Before we go too much further, I wanted to share a few of our thoughts here internally at Unidata, to provide some context vis-a-vis deciding how to proceed.

"What is netCDF" (see issue Unidata/netcdf#50) and the surrounding discussion has been pretty robust, as we consider the balance between the data storage format ('files', traditionally, but we are being forced to expand our terminology as we move into object storage), the data model, and the software library. One of the historic, oft-emphasized properties of a netCDF data set is that it is self-describing. If we expand to using something like an HDF5 virtual dataset, where data is spread beyond a single file, is it still a netCDF file/is the file still self-describing? What does the corresponding .cdl representation of a virtual data set look like? What are the best practices when reading a virtual dataset and converting via nccopy, or converting via the ncdump/ncgen pipeline? What happens if some portion of the virtual dataset is no long accessible or cannot be found? Is the entire data set invalidated, or is it just treated as a partial data set? What is the overhead for supporting the virtual dataset API moving forward, and what are the HDF group plans for supporting this functionality in libhdf5 (in terms of API/ABI compatibility, I'm not worried that they will drop this feature).

These are clearly a mix of practical and philosophical questions, but they describe a good chunk of what we need to think about before adding support for a feature, even when it is 'read-only support for a particular storage option'; anything netCDF can read, it will need to be able to write using other, supported storage options. The questions are further compounded by the fact that we have had very few requests to support this functionality. The community has instead been pressing us to add support for object storage and Zarr (which is in line with our current efforts).

I say all this not to be discouraging, but rather to start a discussion before you invest too much more time into this. Since this PR was put up, there has been an active internal discussion surrounding some of these questions and the potential implications of adding this support.

Thanks -Ward

@d70-t (Contributor, Author) commented Sep 2, 2020

Hi @WardF,

thank you for this wrap-up. I think all of these questions are quite important to think about; in particular, the "what is netCDF" issue is one I've had on my mind several times as well. For many of my previous works, the netCDF data model has been awesome, and providing that interface to users of the data has mostly been a guarantee that the data could be ingested quickly and robustly. As long as it was possible to read in the data using the netcdf-c library, everyone was happy. However, for several reasons, the actual netCDF storage format (i.e. classic or HDF5) sometimes created larger challenges. I certainly don't know a perfect answer to these problems, but I can share my issues and thoughts about them.

Regarding the self-describing nature of the file, I'd argue that it is still self-describing (apart from the possibility to define the location of the virtual dataset source files via the current working directory or environment variables, which I find a bit disturbing). However, there is a clear departure from the "one file" philosophy. But I'd argue that this is also the case for opendap or zarr resources, and maybe the abstraction of a single file is not the best one for all purposes anyway...

Regarding nccopy, that is kind of a tricky one. My basic motivation has been not to copy files at all, due to issues with storage space (see below). But maybe it would indeed be interesting to add the option for "shallow copies" to the nccopy command. That would probably facilitate the creation of such files for the typical users as well.

An interesting comment in the "what is netCDF" thread is the one from @lesserwhirls with a reference to these conventions, as it relates to why I started to try the virtual dataset approach at all. My current problem is that I have to deal with a large set of field campaign data (somewhere in the 100TB ... 1PB range, depending on the desired processing levels), so there is some pressure to use disk space efficiently. For a bunch of processing steps, computing (or just reshaping) the data does not take a lot of processing power, but it hurts a lot to store a second copy of the data. So I'd be really interested in having some way to store the data processing steps instead of the actual data. To cope with that problem, I've previously created a custom opendap server, which stores compute graphs for potentially interesting products in a database and subsequently recomputes the actual data upon user request. A simplified and not very polished version of it, which uses xarray (potentially combined with dask) for defining the data sources, can be found here, and an application of it here. But running a separate server just to combine some datasets is maybe a bit of overkill, and maybe also not what I would want due to performance issues (see below). So I also came across the above-referenced conventions, and I had heard about the possibility of virtual datasets as well as external datasets in HDF5. I think that in those netCDF conventions there is nothing which would contradict the use of either external storage or virtual datasets in HDF5 to create a compatible netCDF4 dataset, so I just assumed it would work out of the box and hoped that this could help solve parts of my issues in a simpler way. From my testing, I saw that HDF5 with external storage works flawlessly with netcdf-c, but I got stuck when trying virtual datasets. So based on these observations, I'd argue that it is already possible to define valid netCDF "files" which span multiple files, and based on the conventions, virtual datasets should also be allowed.

On the other hand, I am also not completely satisfied with the way file references in HDF5 work. The way of locating source files relative to the current working directory (manual and release notes) is kind of weird.

Furthermore, it would be great to have the possibility to also add small computations in the data access pipeline (e.g. application of calibration factors, interpolation to a more useful grid). In my particular example, but maybe that applies to other datasets as well, I need far less storage and file transfer bandwidth if the data is stored uncalibrated and on impractical grids, but that particular format makes it really hard for new people to dive into the data. Hosting the transforming code on a server is also only part of the deal, as it makes the data more user friendly and saves on disks, but file transfer bandwidth could still be a limiting factor. On the other hand, implementing data transformation within the file format could also escalate quickly, so that step should definitely be decided with great care. And there might also be the option to run an opendap server on the client side. That way users can still access the datasets using netcdf-c in a nice-to-digest format, while storage and file transfer bandwidth are saved.

The option for zarr is also quite interesting, but I think it is a bit orthogonal to the virtual dataset issues. However, it might imply that data concatenation / transformation / etc. with respect to netCDF should maybe also include the possibility to refer to opendap or zarr resources.

So there are many things to consider, and some of my problems may not be other people's problems... I really would like to have some way of storing and transporting more or less simple transformations of existing datasets, such that users can access these as if they were real datasets, in whatever programming language they are used to using, and netCDF seems to be pretty close to that. I am undecided whether the possibilities provided by HDF5 are really how it should be done. I am happy to discuss and hear more about how you think about these issues at Unidata.

Cheers,
Tobi

@edwardhartnett (Contributor) commented Sep 3, 2020

@WardF it's hard to see that difficulties deciding on an ncdump implementation would torpedo this feature! Ncdump can be figured out.

The philosophy with which I wrote netCDF-4 is that every feature of HDF5 which can be exposed, should be exposed.

Somehow that has gotten turned around to imply that I chose all the best features of HDF5, and whatever I left out was left out for a good reason and so should continue to be left out. But that is not correct. I made every HDF5 feature that I could (time permitting) visible in netCDF. The goal was to allow netCDF users to get all the benefit possible from HDF5 without having to switch their I/O code to HDF5.

This PR presents a change which allows netCDF to access whole new aggregations of data, just like netCDF-Java does. This has proven very popular with the Java community, and I know there will be great interest in the HPC world. This is read-only access that will have no effect on any other netCDF application, so it's hard to see the objection.

Just as with the addition of the COMPACT storage type, this change is just filling in some of the missing code of my netCDF-4 implementation. I would have accommodated this in the first release of netCDF-4, if I had the time and knowledge to do so then.

@DennisHeimbigner (Collaborator)

My 2 cents. I have no problem supporting HDF5 features for reading as long as they can be discovered by libhdf5 and used. As a rule, such features will be invisible wrt reading. The issue for me is supporting the writing of HDF5 files using special features such as virtual. It is unlikely that such writing will be invisible, because you need to pass feature-specific information from the client down into e.g. libhdf5 to tell it what to do. I have addressed this problem for NCZarr by requiring that the path be a URL and using the standard URL fragment syntax (https://.../...#key=value...) as the means for passing information through to the NCZarr dispatch implementation. Cannot say I am particularly happy with this solution, but it is pretty general.
=Dennis Heimbigner

p.s. Questions about NC_VIRTUAL|NC_UNKNOWN_STORAGE:

  1. Is the reason this is defined so that nc_inq_var_storage can return the value?
  2. Is it the case that reading virtual data disables chunking? If not, then it seems that using NC_UNKNOWN_STORAGE is not a solution.

@DennisHeimbigner (Collaborator) commented Sep 3, 2020

Perhaps it is time to consider the general problem of passing information from the client to the dispatch code. As I mention above, I currently do this for NCZarr, but I should do so also for DAP2 and DAP4. One possibility is to provide extended versions of nc_open/nc_create that take arbitrary extra info that is passed pretty much unexamined and sent into the dispatcher to control its operation. Things we need to consider for this:

  1. it has to be pretty general, because we cannot know what future dispatchers will require
  2. it needs to be set up in such a way as to be usable by other programming languages that wrap netcdf-c.

#2 means that we probably cannot use varargs signatures (e.g. f(x,...)) and that using structures may be a problem also. One possibility is to use some variant of the def_var_filter API that requires the caller to shoe-horn the extra info into a simple array of either unsigned ints or strings.
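
Purely as an illustration of that last idea, an extended open call taking a flat array of key=value strings might look something like the following. This is entirely hypothetical: no such function exists in netcdf-c, and the option name in the example is made up.

#include <netcdf.h>

/* Hypothetical extended open: dispatcher-specific options are passed as a
 * NULL-terminated array of "key=value" strings, keeping everything in
 * simple C types that other language bindings can construct easily. */
int nc_open_with_params(const char *path, int mode,
                        const char **params, int *ncidp);

/* Example call a client might make (option name is illustrative only): */
const char *params[] = { "hdf5.fclose_degree=weak", NULL };
int ncid;
/* nc_open_with_params("data.nc", NC_NOWRITE, params, &ncid); */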

@DennisHeimbigner (Collaborator)

I have questions. If the virtual dataset is composed of other datasets, is that invisible to the reader of the virtual dataset? Also, I infer from above that moving one or more of the underlying datasets can cripple the virtual dataset. Is there a way to update the virtual dataset to tell it where the underlying datasets were moved to?

@edwardhartnett (Contributor)

For all errors with the underlying data set (file moved or deleted, etc.) we will simply rely on HDF5 to return an error, and then netCDF will throw up its hands and be done. Whether the underlying HDF5 files can be legally moved or not is really not important to us. We will read a legal virtual file and error out on anything else.

I suggest we hold off on introducing a new open/create with additional information. That may be needed in some cases, but not yet; when it is needed we can discuss how to handle it. Using a void pointer is not a great option. I am just confronting this with the filter_actions() function. It's super convenient when we can rely on all parameters being simple C types; then we can easily calculate the size of each parameter, which PIO needs to do, for example.

(We have a plan to introduce more mode flags by moving the flag to 64-bits, which impacts this perhaps.)

What this PR is shaping up to be is a minor change in libhdf5 which then allows netCDF-4 to read virtual HDF5 files, as if they were just regular HDF5 files, without any netCDF API changes (except the introduction of two new constants). That's a great new feature!

Since such files are currently being created by HDF5 systems (not netCDF systems) there is no strong need to add write capability here. We will read these files and let the HDF5 library handle the complexities of writing them. If there is strong desire to write such files in netcdf-c, that will be more of a change, which probably will require some API changes to support the additional info needed.

@d70-t (Contributor, Author) commented Sep 14, 2020

I think I see the situation pretty much as @edhartnett described it.

From my perspective, support for virtual datasets is primarily a way to simplify data usage for people consuming the datasets, so reading the data is the primary concern. It is true that writing virtual datasets and keeping them in a valid state is not necessarily straightforward. Furthermore, pulling that feature into netCDF may be even more difficult, and writing those datasets is probably equally complicated whether one uses HDF5 directly or a to-be-invented netCDF API. But I'd argue that forwarding the writing feature to netCDF is maybe not needed (at least not right now, maybe never).

About

p.s. Questions about NC_VIRTUAL|NC_UNKNOWN_STORAGE:

  1. Is the reason this is defined so that nc_inq_var_storage can return the value?
  2. Is it the case that reading virtual data disables chunking? If not, then it seems that using NC_UNKNOWN_STORAGE is not a solution.

One goal of HDF5 virtual datasets is that they can be read as if they were any other old-fashioned standard dataset. If I consider netCDF as a user of the HDF5 API, this translates to: netCDF should not care which kind of storage is used for the dataset, as long as only reading is concerned. Thus, if netCDF does not know about the storage kind (no matter if that storage kind is already defined or will be defined in the future), there should be a way of reading the dataset independently of any information about the actual storage kind. That is what NC_UNKNOWN_STORAGE is for, and this netCDF storage mode should be able to read all of the already existing HDF5 storage modes (albeit maybe less efficiently or less verbosely). Boldly speaking, as far as reading is concerned, all the other (netCDF) storage modes are more or less for convenience (e.g. better ncdump printout, more efficient access patterns, etc.) but should not alter the resulting data. From this point of view, NC_VIRTUAL is just another convenience type, which currently is only used for the printout in ncdump.

So the primary concern is that get_chunking_info should have a valid and sensible else case, which is triggered if H5Pget_layout terminates successfully but returns something which netCDF is not aware of. I didn't find a nc_inq_var_storage function, but nc_inq_var_chunking would indeed return NC_UNKNOWN_STORAGE or NC_VIRTUAL.
I am not really sure how to answer the second part of the question. Variables which are in a virtual dataset could also be stored with chunking, and even with a mixture of chunked, contiguous, compact and virtually stored sub-parts of the variable. So it doesn't disable chunking, but accessing the chunking information is probably quite complicated. I'd argue again that this is not needed (right) now.
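
As a small sketch of what that looks like from the reader's side (assuming the NC_VIRTUAL and NC_UNKNOWN_STORAGE constants proposed in this PR are reported through the storage argument of nc_inq_var_chunking; ncid and varid refer to an already opened file and variable):

#include <stdio.h>
#include <netcdf.h>

/* Report how a variable is stored; unknown layouts no longer cause an error. */
static int report_storage(int ncid, int varid)
{
    int storage, ret;
    size_t chunksizes[NC_MAX_VAR_DIMS];

    if ((ret = nc_inq_var_chunking(ncid, varid, &storage, chunksizes)))
        return ret;

    if (storage == NC_CHUNKED)
        printf("chunked\n");
    else if (storage == NC_VIRTUAL)
        printf("virtual (HDF5 VDS)\n");
    else if (storage == NC_UNKNOWN_STORAGE)
        printf("storage mechanism not known to netCDF\n");
    return NC_NOERR;
}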

Is there a way to update the virtual dataset to tell it where the underlying datasets were moved to?

There is a possibility to do so using the HDF5_VDS_PREFIX environment variable or the H5P_SET_VIRTUAL_PREFIX HDF5 function. But in general, I think users of such virtual datasets should be made aware that moving these files can be tricky. Personally, I prefer to use relative paths for my datasets, such that I can move files in groups. But again, most likely the virtual dataset feature will be used when the datasets are rather large, which implies that moving the datasets probably needs a bit of thinking anyway. Also, as the virtual dataset file tends to be quite small, re-writing the entire file might well be an option.
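
For reference, the property-list route looks roughly like this in plain HDF5 (a sketch outside of netCDF; the prefix and dataset name are made up, and H5Pset_virtual_prefix requires HDF5 >= 1.10.2):

#include <hdf5.h>

/* Open a variable from a VDS file, pointing HDF5 at a new source-file prefix. */
static hid_t open_with_prefix(hid_t fileid, const char *name, const char *prefix)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_virtual_prefix(dapl, prefix);   /* where to look for the source files */
    hid_t dset = H5Dopen2(fileid, name, dapl);
    H5Pclose(dapl);
    return dset;
}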

To conclude: while I see that this change raises a lot of questions which are definitely worth thinking about (especially in the context of "what is netCDF?"), I'd prefer not to expand this PR discussion too much and instead look at it more as a small fix for the missing else statement.

@edwardhartnett (Contributor)

This is failing on the windows build like this:
C:\projects\netcdf-c\libhdf5\hdf5open.c(1144): error C2065: 'H5D_VIRTUAL': undeclared identifier [C:\projects\netcdf-c\build\libhdf5\netcdfhdf5.vcxproj]

Does the windows build of HDF5 not include H5D_VIRTUAL?

@d70-t (Contributor, Author) commented Sep 14, 2020

H5D_VIRTUAL is defined here. It doesn't look like there is a special case for windows. But the test installs HDF5 v1.8.18 which does not yet include H5D_VIRTUAL 🤷‍♂️

Thus, there should probably be a check for whether H5D_VIRTUAL is defined.

Older HDF5 libraries do not support virtual datasets but should otherwise
still be supported by netCDF4. This change disables the special handling
of HDF5 virtual datasets when the installed HDF5 version does not support
them.
@edwardhartnett (Contributor)

Yes, because it is required that HDF5 1.8.x continue to work as well. There may already be a pre-processor define that will work for this, or you could just check with #ifdef H5D_VIRTUAL...
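
A guarded layout check might look roughly like the sketch below (illustrative only, not the code in this PR; H5_VERSION_GE is HDF5's version-comparison macro, and since H5D_VIRTUAL is an enum value rather than a macro, a version check is used here instead of #ifdef):

/* Only compile the virtual-layout branch when the HDF5 headers are new
 * enough to define H5D_VIRTUAL (added in HDF5 1.10.0); otherwise fall
 * through to the generic catch-all. */
#if H5_VERSION_GE(1, 10, 0)
    else if (layout == H5D_VIRTUAL)
        var->storage = NC_VIRTUAL;
#endif
    else
        var->storage = NC_UNKNOWN_STORAGE;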

@d70-t (Contributor, Author) commented Sep 14, 2020

So now I've disabled the check for H5D_VIRTUAL if that is not defined and disabled the corresponding unit test if HDF5 is older than 1.10.
The remaining issue is on LGTM:

Could not connect to archive.ubuntu.com:80 (91.189.88.142), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.88.152), connection timed out

which is probably a temporary issue?

@fortnern

The bug in HDF5 preventing use of H5F_CLOSE_SEMI with virtual datasets has been fixed, and the fix will be in all future releases of HDF5. Note that, for now, if you explicitly open the source file while simultaneously accessing it through the VDS, you must open the source file with H5F_CLOSE_WEAK. Tobias has assured me this won't be a problem here.

@d70-t (Contributor, Author) commented Feb 15, 2021

Hey @fortnern, thanks for fixing the bug!
Now that I am thinking more about it, I am not completely sure anymore that it won't be a problem here. The current idea for opening the VDS is that neither the user nor the netCDF4 library knows which files are accessed by the VDS system. Thus it could be the case that, within one program, a user opens one of the source files as well as the virtual dataset file simultaneously, without even knowing that they are related in this way. Will the program fail in this case?

@fortnern

Yes, if NetCDF opens the source file (with H5F_CLOSE_SEMI) and the virtual dataset at the same time it will fail. Fixing this will require internally changing the scope of the file close flag. I have been pushing for this but we need more discussion on this. I will file a ticket.

@edwardhartnett (Contributor)

I will add that using H5F_CLOSE_STRONG was only an extra precaution on my part. H5F_CLOSE_SEMI will work exactly the same for all netCDF-4 files. I have not been following this issue in detail, but I hope it can be merged - I think it's an excellent improvement for a small amount of code change.

@czender (Contributor) commented Mar 2, 2021

I would also like to support mainlining this into netCDF for a reason not explicitly mentioned above. One of the long-standing struggles with CF conventions is their inability to handle references to external datasets. This is most concretely illustrated by the clumsiness of CMIP (the poster child for CF) datasets at referencing the area variable or any external cell_measures dataset. For reasons of space, CMIP does not include area-like variables with each dataset, though each dataset needs to be associated with an area-like variable. If my understanding is correct, then virtual datasets could eventually lead to a netCDF-compliant way of referencing external datasets, and thus help beautify some ugly workarounds that large netCDF datastores currently employ. Please correct me if I am mistaken. For reference, here is the relevant CF documentation:

2.6.3. External Variables
The global external_variables attribute is a blank-separated list of the names of variables which are named by attributes in the file but which are not present in the file. These variables are to be found in other files (called "external files") but CF does not provide conventions for identifying the files concerned. The only attribute for which CF standardises the use of external variables is cell_measures

@d70-t marked this pull request as ready for review — March 2, 2021 19:47
@d70-t requested a review from WardF as a code owner — March 2, 2021 19:47
@d70-t (Contributor, Author) commented Mar 2, 2021

I think we are reasonably close to the finish line, so I've un-drafted the PR again.

Still, there's the open topic of how we should handle the change to H5F_CLOSE_WEAK, which I added to this PR to make the checks pass. We have to decide between the two options, and I don't think that I am the right person to do so. The options (as I see them) are:

  • The change to HDF5 by @fortnern should make the checks happy even without touching all the instances of H5F_CLOSE_SEMI. But that will add the requirement of at least that version of HDF5. Plus, if any file which is referenced as a source for a virtual dataset (which currently is invisible to netCDF) is opened not only through the virtual dataset API but also directly by netCDF within one executable, this will crash. We could call that a user error, but 🤷
  • If we instead keep the change to H5F_CLOSE_WEAK (as currently suggested by the PR), those datasets won't crash and it will also work on previous versions of HDF5. However, I cannot estimate how bad this change is for the rest of netCDF. @edwardhartnett called it "drastic" above.

@edwardhartnett (Contributor)

As you have outlined the choices, I will suggest that changing it to WEAK is the lesser evil.
https://www.youtube.com/watch?v=440l8poSQiA

@d70-t (Contributor, Author) commented Mar 2, 2021

Nice one 🤣 . So then, let's go for the WEAKer evil.

@WardF (Member) commented Jun 28, 2021

Reviewing PRs for merging in an attempt to get caught up. Tagging @lesserwhirls to make sure I'm not missing some angle as it applies to netcdf-java and the divergence from netcdf-c/4 functionality, and also @DennisHeimbigner as a second pair of eyes from the C side of things.

@lesserwhirls (Contributor)

I think adding support for reading HDF5 virtual datasets is reasonable. NetCDF-Java currently does not support that; I would like to extend it to do so, but I will need to gather up some test data files to take a shot at it. @ethanrd - thoughts?

@d70-t (Contributor, Author) commented Jun 28, 2021

@lesserwhirls in #1799 (comment), there's a Python script to generate two small example files. The files generated by this script are also attached to the comment. That's not a whole lot of test data, but it should cover the core idea.

@lesserwhirls (Contributor)

Excellent - thank you @d70-t!

@ethanrd (Member) commented Jun 29, 2021

@lesserwhirls - I agree, adding support to netCDF-Java for reading HDF5 virtual datasets seems like a useful addition.

@WardF - Is there documentation on what non-netCDF formats the C library can read? (Sorry, I may have just missed it.) I think I remember that it can read some HDF4 files and some non-netCDF-4 HDF5 files. Are there others?

Does this PR support HDF5 virtual datasets for both netCDF-4 files and non-nc4 HDF5 files? And, Sean, are you thinking the netCDF-Java implementation would support this for both?

@lesserwhirls (Contributor)

Does this PR support HDF5 virtual datasets for both netCDF-4 files and non-nc4 HDF5 files? And, Sean, are you thinking the netCDF-Java implementation would support this for both?

@ethanrd - the HDF5Iosp is where support would be added, so if someone wrote a netCDF-4 file using something other than netCDF-C, we would be able to read it.

@WardF (Member) commented Jun 29, 2021

@ethanrd HDF4 is one. There are no guarantees that netCDF can read non-netCDF4 HDF5 files, and for those it can read, there are no guarantees that they will be interpreted correctly.

@edwardhartnett (Contributor)

@WardF I must disagree. Reading non-netCDF-4 HDF5 files is a documented feature of netCDF, and very few HDF5 files are not readable by netCDF. We should continue this feature as it is very useful, and pretty easy. Those that can be read are indeed guaranteed to be read correctly. The only exceptions are files with circular groups, those that use some outdated and very rarely used HDF5 types, and those that use HDF5 references, which I have not figured out how to express in netCDF.

@WardF merged commit 9f798e2 into Unidata:master on Jul 29, 2021
Successfully merging this pull request may close these issues.

segfault while reading virtual datasets