
Move HDF4 to its own dispatch layer #849

Merged

Conversation

edhartnett
Contributor

@edhartnett edhartnett commented Feb 8, 2018

In this PR the HDF4 code is moved to its own dispatch layer. This cleans a bunch of code out of the libsrc4 directory.

This is a very helpful precursor to the user-defined binary format.

Fixes #748.
Part of #177.

(I also fixed the parallel build problems of tst_filters.sh, in order to get this past my CI. I can take that out if you need me to.)

@DennisHeimbigner
Collaborator

There is an interesting question here. AFAIK, the HDF4 library does not depend on the HDF5 library. This, to me, implies a gap in our configuration. It would be possible, I think, to enable the netCDF-4 data model without enabling use of HDF5. Currently, the two are synonymous.

@edhartnett
Contributor Author

Yes, that became clear to me as well. However, the current PR just continues the current state of affairs, which is that you can't get HDF4 without HDF5.

The user defined format feature I am going to introduce next also uses the netCDF-4 data model without using HDF5.

After that, a future modification of interest might be to separate the HDF5 from the netCDF-4 internal data model code in such a way that we could build in HDF4 and user-format support, without HDF5 being required.

The netCDF-4 internal data model is sufficiently general that many different user formats can make use of it. The advantage is that lots of functions (like all the inq functions) just run off the internal data model, so they will work for user-defined formats automatically, which is great. (They already do for HDF4.)

@WardF
Member

WardF commented Feb 8, 2018

So maybe I missed it in the deluge of other information, but regarding a user-defined data format, is there an issue open with more information? We touched on it briefly when you were here discussing PIO; remind me, what is the connection between the two? Or is it a wholly separate feature we want to add to libnetcdf?

Additionally, why is it we can't get HDF4 without HDF5?

@edhartnett
Contributor Author

edhartnett commented Feb 8, 2018

User-defined formats are described in #177. They will allow model groups which use proprietary formats to convert to netCDF, without breaking all their tools or losing access to data in the old format. Essentially they allow outside users to do what the HDF4 dispatch layer does for HDF4 - make it seem like netCDF.

The HDF4 read-only code has always been a part of the libsrc4 code, and always required --enable-netcdf-4 at configure time. It was actually written before the dispatch layer code was written. So to use HDF4 you also must have HDF5 installed.

After this PR is merged, with a little more hanky-panky with the code, it will be possible to allow HDF4 to be used without HDF5. I'm not sure if this will really be of much value in the case of HDF4. I don't know how many people are using it.

However, it would be very useful to do, because it would mean that user-defined formats could also be used without HDF5. I think this would be a common scenario. For example, a model group might want to convert from their internal format to netCDF CDF5; they should be able to do so without needing to install HDF5.

@WardF
Member

WardF commented Feb 8, 2018

Thanks for helping organize my thoughts. I hadn't realized there was the dependency between hdf4/hdf5.

Thanks!

@DennisHeimbigner
Collaborator

Separating the netCDF-4 model from HDF5 is, IMO, pretty challenging, but useful. I suspect that the right solution is to make the netCDF-4 model the norm, which means that we need to eliminate all uses of USE_NETCDF4 and instead start relying more on the error codes NC_ENOTNC4 and NC_ENOTBUILT.

@DennisHeimbigner
Collaborator

I should note that I do something like this for DAP4 already. Basically, I create a hidden in-memory netCDF-4 file in which to store the netCDF-4-translated metadata for the DAP4 model. Then I can redirect all of the metadata-related dispatch operations to that in-memory file. So I hypothesize that the key is to separate out a netCDF-4 metadata subsystem with some hooks that allow implementation-specific operations to be used.

@edhartnett
Contributor Author

edhartnett commented Feb 8, 2018

I don't think this will actually be that hard.

Basically the data structs (NC_VAR_INFO, etc.), and the functions that manipulate them, form an internal API that is used by different format codes, like the HDF4 and HDF5 code.

These functions are already pretty separate, with the HDF5 code in nc4hdf.c and most of the data layer code in ncfunc.c and nc4internal.c. So I think it will mostly be moving a bunch of what we call "internal" functions into a separate directory (libint) which will be the data layer code.

The remaining code in libsrc4 will be the HDF5 code. Then it should be pretty easy to ensure the build works with the internal model, and without HDF5.

I would like to get the currently working user-defined format code merged first, and then circle around and separate out the HDF5 code.

@DennisHeimbigner
Collaborator

There may be an additional benefit to isolating the metadata code. We have wanted to do lazy metadata access and have hypothesized that an internal metadata module is necessary to do that.

@edhartnett
Contributor Author

edhartnett commented Feb 8, 2018

I have actually been pondering how to do lazy metadata access.

I am thinking that if we read a group at a time, that would work...

But I have not seriously tried it yet. I agree that separating out the internal data code would be a helpful step, because isolating that code would be a nice simplification of the libsrc4 code. The less there is, the less chance of breaking something.

After I get the user-defined data code merged in, I will take a stab at separating out the data model internals.

@DennisHeimbigner
Collaborator

I disagree about group-at-a-time. Once you have the mechanism in place, you can read any single item -- type, var, dim, attr, group -- at a time, taking dependencies into account. My speculation is that lazy evaluation of just vars and attrs is the big payoff.

@DennisHeimbigner
Collaborator

Also, we have a specific case: a file with only the root group and 14 MB of metadata. So reading per-group would buy nothing for this file.

@edhartnett
Contributor Author

@DennisHeimbigner and other interested readers, I have created two new issues, #856 and #857, where the issues of separating the netCDF-4 internal data code from HDF5, and implementing lazy reads, can be discussed.

@edhartnett
Contributor Author

OK, something funny was going on with the cmake build, which was causing the failures.

The CMakeLists.txt file has this:

# Option to use HDF4
OPTION(ENABLE_HDF4 "Build netCDF-4 with HDF4 read capability(HDF4, HDF5 and Zlib required)." OFF)

So HDF4 should be off by default. But that was not the case. Not sure why, but after cmake is run, I see:

# Features
--------
NetCDF-2 API:		yes
HDF4 Support:		yes
NetCDF-4 API:		no

Not sure why this is happening.

So I added the else section below to expressly turn off HDF4 if netcdf-4 is not enabled:

# Build netCDF4
OPTION(ENABLE_NETCDF_4 "Enable netCDF-4" ON)
IF(ENABLE_NETCDF_4)
  SET(USE_NETCDF4 ON CACHE BOOL "")
  SET(ENABLE_NETCDF_4 ON CACHE BOOL "")
  SET(ENABLE_NETCDF4 ON CACHE BOOL "")
ELSE()
  SET(USE_HDF4_FILE_TESTS OFF)
  SET(USE_HDF4 OFF)  
  SET(ENABLE_HDF4_FILE_TESTS OFF)
  SET(ENABLE_HDF4 OFF)  
ENDIF()

@WardF if you know of a better way to deal with this, let me know.
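One possible alternative, sketched here rather than tested against this tree, is CMake's stock CMakeDependentOption module, which expresses the dependency directly: ENABLE_HDF4 can only be ON when ENABLE_NETCDF_4 is ON, and is forced OFF otherwise.

```cmake
# Sketch only: make ENABLE_HDF4 a dependent option so it is forced
# OFF whenever ENABLE_NETCDF_4 is OFF. CMakeDependentOption ships
# with CMake itself.
INCLUDE(CMakeDependentOption)
CMAKE_DEPENDENT_OPTION(ENABLE_HDF4
  "Build netCDF-4 with HDF4 read capability (HDF4, HDF5 and Zlib required)."
  OFF "ENABLE_NETCDF_4" OFF)
```

This would replace both the OPTION() call and the manual ELSE() cleanup, though any cached ENABLE_HDF4 value from earlier configure runs would still need to be cleared once.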

@DennisHeimbigner
Collaborator

Did you check to see if it gets turned off if the library is not available?
I do not recall seeing this problem before.

@edhartnett
Contributor Author

OK, the one travis test that failed did so due to a timeout. So I believe this PR is good to merge, @WardF, whenever you are ready.

@WardF
Member

WardF commented Feb 9, 2018

It will probably be next week some time.

@edhartnett
Contributor Author

@WardF can we get this merged soon?

I would like to get the user-defined format work up for review. It will probably lead to some discussions, so I want to get it going.

Thanks,
Ed

WardF added a commit that referenced this pull request Feb 16, 2018
@WardF WardF merged commit 40805e5 into Unidata:master Feb 16, 2018
WardF added a commit that referenced this pull request Feb 16, 2018
Pull #849 plus a note in the release notes.
@WardF
Member

WardF commented Feb 16, 2018

There we go!

@edhartnett
Contributor Author

Thanks!

@edhartnett edhartnett deleted the ejh_hdf4_dispatch_unidata2 branch February 26, 2018 16:30
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request May 16, 2018
Upstream changes:
## 4.6.1 - March 15, 2018

* [Bug Fix] Corrected an issue which could result in a dap4 failure. See [Github #888](Unidata/netcdf-c#888) for more information.
* [Bug Fix][Enhancement] Allow `nccopy` to control output filter suppression.  See [Github #894](Unidata/netcdf-c#894) for more information.
* [Enhancement] Reverted some new behaviors that, while in line with the netCDF specification, broke existing workflows.  See [Github #843](Unidata/netcdf-c#843) for more information.
* [Bug Fix] Improved support for CRT builds with Visual Studio, improves zlib detection in hdf5 library. See [Github #853](Unidata/netcdf-c#853) for more information.
* [Enhancement][Internal] Moved HDF4 into a distinct dispatch layer. See [Github #849](Unidata/netcdf-c#849) for more information.

## 4.6.0 - January 24, 2018
* [Enhancement] Full support for using HDF5 dynamic filters, both for reading and writing. See the file docs/filters.md.
* [Enhancement] Added an option to enable strict null-byte padding for headers; this padding was specified in the spec but was not enforced.  Enabling this option will allow you to check your files, as it will return an E_NULLPAD error.  It is possible for these files to have been written by older versions of libnetcdf.  There is no effective problem caused by this lack of null padding, so enabling these options is informational only.  The options for `configure` and `cmake` are `--enable-strict-null-byte-header-padding` and `-DENABLE_STRICT_NULL_BYTE_HEADER_PADDING`, respectively.  See [Github #657](Unidata/netcdf-c#657) for more information.
* [Enhancement] Reverted behavior/handling of out-of-range attribute values to pre-4.5.0 default. See [Github #512](Unidata/netcdf-c#512) for more information.
* [Bug] Fixed error in tst_parallel2.c. See [Github #545](Unidata/netcdf-c#545) for more information.
* [Bug] Fixed handling of corrupt files + proper offset handling for hdf5 files. See [Github #552](Unidata/netcdf-c#552) for more information.
* [Bug] Corrected a memory overflow in `tst_h_dimscales`, see [Github #511](Unidata/netcdf-c#511), [Github #505](Unidata/netcdf-c#505), [Github #363](Unidata/netcdf-c#363) and [Github #244](Unidata/netcdf-c#244) for more information.

## 4.5.0 - October 20, 2017

* Corrected an issue which could potential result in a hang while using parallel file I/O. See [Github #449](Unidata/netcdf-c#449) for more information.
* Addressed an issue with `ncdump` not properly handling dates on a 366 day calendar. See [GitHub #359](Unidata/netcdf-c#359) for more information.

### 4.5.0-rc3 - September 29, 2017

* [Update] Due to ongoing issues, native CDF5 support has been disabled by **default**.  You can use the options mentioned below (`--enable-cdf5` or `-DENABLE_CDF5=TRUE` for `configure` or `cmake`, respectively).  Just be aware that for the time being, Reading/Writing CDF5 files on 32-bit platforms may result in unexpected behavior when using extremely large variables.  For 32-bit platforms it is best to continue using `NC_FORMAT_64BIT_OFFSET`.
* [Bug] Corrected an issue where older versions of curl might fail. See [GitHub #487](Unidata/netcdf-c#487) for more information.
* [Enhancement] Added options to enable/disable `CDF5` support at configure time for autotools and cmake-based builds.  The options are `--enable/disable-cdf5` and `ENABLE_CDF5`, respectively.  See [Github #484](Unidata/netcdf-c#484) for more information.
* [Bug Fix] Corrected an issue when subsetting a netcdf3 file via `nccopy -v/-V`. See [Github #425](Unidata/netcdf-c#425) and [Github #463](Unidata/netcdf-c#463) for more information.
* [Bug Fix] Corrected `--has-dap` and `--has-dap4` output for cmake-based builds. See [GitHub #473](Unidata/netcdf-c#473) for more information.
* [Bug Fix] Corrected an issue where `NC_64BIT_DATA` files were being read incorrectly by ncdump, despite the data having been written correctly.  See [GitHub #457](Unidata/netcdf-c#457) for more information.
* [Bug Fix] Corrected a potential stack buffer overflow.  See [GitHub #450](Unidata/netcdf-c#450) for more information.

### 4.5.0-rc2 - August 7, 2017

* [Bug Fix] Addressed an issue with how cmake was implementing large file support on 32-bit systems. See [GitHub #385](Unidata/netcdf-c#385) for more information.
* [Bug Fix] Addressed an issue where ncgen would not respect keyword case. See [GitHub #310](Unidata/netcdf-c#310) for more information.

### 4.5.0-rc1 - June 5, 2017

* [Enhancement] DAP4 is now included. Since DAP2 is the default for URLs, DAP4 must be specified by (1) using "dap4:" as the URL protocol, (2) appending "#protocol=dap4" to the end of the URL, or (3) appending "#dap4" to the end of the URL. Note that DAP4 is enabled by default, but remote testing is disabled until the testserver situation is resolved.
* [Enhancement] The remote testing server can now be specified with the `--with-testserver` option to ./configure.
* [Enhancement] Modified netCDF4 to use ASCII for NC_CHAR.  See [Github Pull request #316](Unidata/netcdf-c#316) for more information.
* [Bug Fix] Corrected an error with how dimsizes might be read. See [Github #410](Unidata/netcdf-c#410) for more information.
* [Bug Fix] Corrected an issue where 'make check' would fail if 'make' or 'make all' had not run first.  See [Github #339](Unidata/netcdf-c#339) for more information.
* [Bug Fix] Corrected an issue on Windows with Large file tests. See [Github #385](Unidata/netcdf-c#385) for more information.
* [Bug Fix] Corrected an issue with diskless file access, see [Pull Request #400](Unidata/netcdf-c#400) and [Pull Request #403](Unidata/netcdf-c#403) for more information.
* [Upgrade] The bash based test scripts have been upgraded to use a common test_common.sh include file that isolates build specific information.
* [Refactor] The oc2 library is no longer independent of the main netcdf-c library. For example, it now uses ncuri, nclist, and ncbytes instead of its homegrown equivalents.
* [Bug Fix] `NC_EGLOBAL` is now properly returned when attempting to set a global `_FillValue` attribute. See [GitHub #388](Unidata/netcdf-c#388) and [GitHub #389](Unidata/netcdf-c#389) for more information.
* [Bug Fix] Corrected an issue where data loss would occur when `_FillValue` was mistakenly allowed to be redefined.  See [Github #390](Unidata/netcdf-c#390), [GitHub #387](Unidata/netcdf-c#387) for more information.
* [Upgrade][Bug] Corrected an issue regarding how "orphaned" DAS attributes were handled. See [GitHub #376](Unidata/netcdf-c#376) for more information.
* [Upgrade] Update utf8proc.[ch] to use the version now maintained by the Julia Language project (https://github.com/JuliaLang/utf8proc/blob/master/LICENSE.md).
* [Bug] Addressed conversion problem with Windows sscanf.  This primarily affected some OPeNDAP URLs on Windows.  See [GitHub #365](Unidata/netcdf-c#365) and [GitHub #366](Unidata/netcdf-c#366) for more information.
* [Enhancement] Added support for HDF5 collective metadata operations when available. Patch submitted by Greg Sjaardema, see [Pull request #335](Unidata/netcdf-c#335) for more information.
* [Bug] Addressed a potential type punning issue. See [GitHub #351](Unidata/netcdf-c#351) for more information.
* [Bug] Addressed an issue where netCDF wouldn't build on Windows systems using MSVC 2012. See [GitHub #304](Unidata/netcdf-c#304) for more information.
* [Bug] Fixed an issue related to potential type punning, see [GitHub #344](Unidata/netcdf-c#344) for more information.
* [Enhancement] Incorporated an enhancement provided by Greg Sjaardema, which may improve read/write times for some complex files.  Basically, linked lists were replaced in some locations where it was safe to use an array/table.  See [Pull request #328](Unidata/netcdf-c#328) for more information.
Successfully merging this pull request may close these issues.

Move HDF4 code to its own dispatch layer
3 participants