Merge pull request #2362 from edwardhartnett/ejh_doc
adding quantize documentation
WardF authored Jun 6, 2022
2 parents bef049d + 31c7b7a commit 054c392
Showing 8 changed files with 313 additions and 16 deletions.
2 changes: 1 addition & 1 deletion docs/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -95,6 +95,6 @@ obsolete/fan_utils.html bestpractices.md filters.md indexing.md
inmemory.md DAP2.dox FAQ.md
known_problems.md
COPYRIGHT.dox user_defined_formats.md DAP4.md DAP4.dox
testserver.dox byterange.dox filters.md nczarr.md auth.md)
testserver.dox byterange.dox filters.md nczarr.md auth.md quantize.md)

ADD_EXTRA_DIST("${CUR_EXTRA_DIST}")
5 changes: 4 additions & 1 deletion docs/Doxyfile.in
@@ -758,6 +758,7 @@ INPUT = \
@abs_top_srcdir@/docs/inmeminternal.dox \
@abs_top_srcdir@/docs/indexing.dox \
@abs_top_srcdir@/docs/testserver.dox \
@abs_top_srcdir@/docs/quantize.md \
@abs_top_srcdir@/include/netcdf.h \
@abs_top_srcdir@/include/netcdf_mem.h \
@abs_top_srcdir@/include/netcdf_par.h \
@@ -906,7 +907,9 @@ IMAGE_PATH = @abs_top_srcdir@/docs/images/chunking2.png \
@abs_top_srcdir@/docs/images/netcdf_architecture.png \
@abs_top_srcdir@/docs/images/pnetcdf.png \
@abs_top_srcdir@/docs/images/deptree.jpg \
@abs_top_srcdir@/docs/images/InstallTreeWindows.png
@abs_top_srcdir@/docs/images/InstallTreeWindows.png \
@abs_top_srcdir@/docs/images/quantize_pi.png \
@abs_top_srcdir@/docs/images/quantize_performance.png

# The INPUT_FILTER tag can be used to specify a program that doxygen should
# invoke to filter for each input file. Doxygen will invoke the filter program
2 changes: 1 addition & 1 deletion docs/Makefile.am
@@ -13,7 +13,7 @@ windows-binaries.md dispatch.md building-with-cmake.md CMakeLists.txt groups.dox
notes.md install-fortran.md credits.md auth.md filters.md \
obsolete/fan_utils.html indexing.dox inmemory.md FAQ.md \
known_problems.md COPYRIGHT.md inmeminternal.dox testserver.dox \
byterange.dox nczarr.md
byterange.dox nczarr.md quantize.md

# Turn off parallel builds in this directory.
.NOTPARALLEL:
9 changes: 5 additions & 4 deletions docs/images/Makefile.am
@@ -5,7 +5,8 @@

# See netcdf-c/COPYRIGHT file for more info.

EXTRA_DIST = aqua.jpg chunking2.png compatibility3.png compression.png \
groups.png nc4-model.png ncatts.png nc-classic-uml.png nccoords.png \
ncfile.png pnetcdf.png terra.jpg netcdf_architecture.png \
deptree.jpg InstallTreeWindows.png uniLogo.png
EXTRA_DIST = aqua.jpg chunking2.png compatibility3.png \
compression.png groups.png nc4-model.png ncatts.png \
nc-classic-uml.png nccoords.png ncfile.png pnetcdf.png terra.jpg \
netcdf_architecture.png deptree.jpg InstallTreeWindows.png \
uniLogo.png quantize_pi.png quantize_performance.png
Binary file added docs/images/quantize_performance.png
Binary file added docs/images/quantize_pi.png
290 changes: 290 additions & 0 deletions docs/quantize.md
@@ -0,0 +1,290 @@
# Lossy Compression with Quantize

## Introduction {#quantize}

The quantize feature was initially developed as part of the Community
Codec Repository (CCR) [2]. The CCR project allows netCDF users to
make use of HDF5 plugins (a.k.a. “filters”) which can add new
compression and other algorithms to the HDF5 library. As part of CCR,
the quantization algorithms were implemented as HDF5 filters.

However, one aspect of implementing the quantization as a filter is
that the filter is also required when reading the data [1]. Although
this makes sense for compression/decompression algorithms, the
quantize algorithms are only needed when data are written. Requiring
that the readers of the data also install the filters places an
unnecessary burden on data readers. Furthermore, using the quantize
filter results in data that cannot be read by netCDF-Java or versions
of netcdf-c before 4.8.0, when support for multiple HDF5 filters was
added. For these reasons, it was decided to merge the quantize
algorithms into the netcdf-c library [5].

As part of the netcdf-c library, the quantize algorithms are available
for netCDF/HDF5 files and the new ncZarr format. They produce data
files that are fully backward compatible with all versions of netcdf-c
since 4.0, and also fully compatible with netCDF-Java.

## The Quantize Feature

The quantize algorithms assist with lossy compression by setting
excess bits to all zeros or all ones (in alternate array values). This
allows a subsequent compression algorithm, like the zlib-based
deflation built into netCDF-4, to better compress the data.

The quantize feature is applied to a variable in a netCDF file, and
may only be used with single or double precision floating point
(netCDF types NC_FLOAT and NC_DOUBLE). Attempting to turn on quantize
for any other type of netCDF variable will result in an error.

It should be noted that turning on quantize does not, by itself,
reduce the size of the data. Only if subsequent compression is used
will setting the quantize feature result in additional compression.

![Quantization of Pi](quantize_pi.png)
Figure 1: The value of Pi expressed as a 32-bit floating point number,
with different levels of quantization applied, from Number of
Significant Digits equal to 8 (no quantization), to 1 (maximum
quantization). The least significant bits of the significand are
replaced with zeros, to the extent possible, while preserving the
desired number of significant digits. In this example the Bit Grooming
quantization algorithm is used.

## Quantization Algorithms

Three different quantization algorithms are provided in the netcdf-c
quantize feature. Each does a somewhat different calculation to
determine the number of bits that can be set to zeros (or ones), while
preserving the number of significant digits specified by the user.

Two of the algorithms, Bit Grooming and Granular Bit Round, accept the
number of decimal digits to be preserved in the data. One algorithm,
Bit Round, accepts the number of binary bits to preserve.

### Bit Grooming

The Bit Grooming algorithm determines the number of bits which are
necessary for the required number of significant decimal digits. This
determination is made once, at the beginning of processing, and is
applied to all values.

Bit Grooming then sets excess bits of the first array value to zero,
then excess bits of the next array value to one, and continues
alternating between zero and one for the excess bits of every other
array value. In this way, the average value of the array is preserved.

For the Bit Grooming algorithm, the NSD parameter refers to the number
of significant decimal digits that will be preserved. The number of
significant digits may be 1-7 for single precision floating point, or
1-15 for double precision floating point.
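The alternating shave/set behavior can be sketched in C as follows. This is a simplified illustration (the real library also handles fill values, double precision, and special cases), and `bitgroom_sketch` is a hypothetical name; `keep_bits` is assumed to be in the range 1-22:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the alternating Bit Groom idea: shave excess significand
 * bits to 0 on even-indexed elements and set them to 1 on odd-indexed
 * elements, so the quantization errors roughly cancel in the array
 * mean. Assumes 1 <= keep_bits <= 22. */
void bitgroom_sketch(float *data, size_t n, int keep_bits)
{
    uint32_t mask = 0xFFFFFFFFu << (23 - keep_bits);
    for (size_t i = 0; i < n; i++) {
        uint32_t b;
        memcpy(&b, &data[i], sizeof(b));
        if (i % 2 == 0)
            b &= mask;       /* shave: excess bits -> 0 */
        else
            b |= ~mask;      /* set: excess bits -> 1 */
        memcpy(&data[i], &b, sizeof(b));
    }
}
```

Because one element is rounded down and the next is rounded up, the errors alternate in sign instead of accumulating.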

### Granular Bit Round

Granular Bit Round determines the number of required bits for each
value in the array, and uses IEEE rounding to change the data
value. It achieves a better overall compression ratio by more
aggressively determining the minimum number of bits required to
preserve the specified number of decimal digit precision.

For the Granular Bit Round algorithm, the NSD parameter refers to the
number of significant decimal digits that will be preserved (as with
the Bit Grooming algorithm). The number of significant digits may be
1-7 for single precision floating point, or 1-15 for double precision
floating point.

### Bit Round

The Bit Round algorithm allows the user to directly specify the number
of bits of the significand which will be preserved, and then sets
excess bits to zero or one for alternate array values.

For the Bit Round algorithm, the NSD parameter refers to the number of
significant binary digits that will be preserved. The number of
significant digits may be 1-23 for single precision floating point, or
1-52 for double precision floating point.
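The rounding idea behind bit-count quantization can be sketched in C as follows. This is illustrative only: it uses round-half-up on the bit pattern, whereas the library's rounding follows IEEE conventions; `bitround_sketch` is a hypothetical name, and `keep_bits` is assumed to be in the range 1-22:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: round a float to keep_bits significand bits by adding half
 * of the last kept bit's place value to the bit pattern, then clearing
 * the excess bits. A carry may propagate into the exponent, which is
 * the correct behavior at a power-of-two boundary.
 * Assumes 1 <= keep_bits <= 22. */
float bitround_sketch(float v, int keep_bits)
{
    uint32_t b;
    memcpy(&b, &v, sizeof(b));
    uint32_t half = 1u << (23 - keep_bits - 1);
    uint32_t mask = 0xFFFFFFFFu << (23 - keep_bits);
    b += half;      /* round half up in the last kept bit position */
    b &= mask;      /* excess bits -> 0 */
    memcpy(&v, &b, sizeof(v));
    return v;
}
```

For example, rounding pi to 8 significand bits gives 3.140625, the nearest representable value at that precision.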

## Quantize Attribute

When the quantize feature is used, an integer attribute is added to
the variable which contains the NSD setting. Without this attribute it
would be impossible for readers to know that quantize had been applied
to the data. The name of the attribute reflects the quantize algorithm
used. In accordance with the conventions established by the NetCDF
Users Guide, these attribute names begin with an underscore,
indicating that they are added by the library and should not be
modified or deleted by users [6].

Algorithm | Attribute Name
----------|---------------
Bit Groom | _QuantizeBitGroomNumberOfSignificantDigits
Granular Bit Round | _QuantizeGranularBitRoundNumberOfSignificantDigits
Bit Round | _QuantizeBitRoundNumberOfSignificantBits

Figure 2: Table showing the names of the attribute added to a variable
after the quantize feature has been applied. The name of the attribute
indicates the algorithm used; the integer value represents the number
of significant decimal digits (for Bit Groom and Granular Bit Round),
or the number of significand bits retained (for Bit Round).

## Handling of Fill Values

In a netCDF file, fill values refer to the value used for elements of
the data not written by the user. For example, if a variable contains
an array of 10 values, and the user only writes 8 of them, the other
two values will be set to the fill value for that variable.

The fill value of a variable may be set by the user by adding an
attribute of the same type as the variable with the name
"_FillValue". If present, the value of this attribute will be used as
the fill value for that variable. If not specified, a default value
for each type is used as the fill value. The default fill values may
be found in the netcdf.h file.

When using the quantize feature, any fill values will remain
unquantized. That is, the excess bits of any array element will not be
changed, if that element is the fill value. This is necessary if the
fill value is to retain its purpose as an indicator of values that
have not been written.
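This skip-the-fill-value rule can be sketched in C as follows. This is an illustrative fragment, not the netcdf-c code; `quantize_skip_fill` is a hypothetical name, and `mask` stands for whatever bitmask the chosen algorithm uses to select the significand bits to keep:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: whatever quantization is applied, elements equal to the
 * variable's fill value must pass through untouched, so that the fill
 * value keeps marking unwritten data. */
void quantize_skip_fill(float *data, size_t n, float fill, uint32_t mask)
{
    for (size_t i = 0; i < n; i++) {
        if (data[i] == fill)
            continue;                  /* never quantize the fill value */
        uint32_t b;
        memcpy(&b, &data[i], sizeof(b));
        b &= mask;                     /* shave excess bits */
        memcpy(&data[i], &b, sizeof(b));
    }
}
```

For a float variable with the default fill value (NC_FILL_FLOAT, 9.9692099683868690e+36 in netcdf.h), that element passes through unchanged while written values are quantized.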

## Using the Quantize Feature

Turning on the quantize feature must be done on a per-variable basis,
after the variable has been defined, and before nc_enddef() (or its
Fortran equivalent) has been called. (Recall that for netCDF/HDF5
files, nc_enddef() is automatically called when data are written to or
read from a variable.)

In accordance with the usual NetCDF API practice, an inquiry function
is also provided which may be called to check whether quantize has been
turned on for a variable. Calling the inquiry function is not required
when reading the data; it is provided for user convenience.

### Using Quantize with the NetCDF C API

Quantize is available in the main branch of the netcdf-c library, and
will be part of the next release (netcdf-c-4.9.0).

To turn on the quantize feature, call the nc_def_var_quantize()
function. To inquire whether quantize has been turned on for a
variable, use the nc_inq_var_quantize() function.

@code
/* Create two variables, one float, one double. Quantization
* may only be applied to floating point data. */
if (nc_def_var(ncid, "var1", NC_FLOAT, NDIM1, &dimid, &varid1)) ERR;
if (nc_def_var(ncid, "var2", NC_DOUBLE, NDIM1, &dimid, &varid2)) ERR;

/* Set up quantization. This will not make the data any
* smaller, unless compression is also turned on. In this
* case, we will set 3 significant digits. */
if (nc_def_var_quantize(ncid, varid1, NC_QUANTIZE_BITGROOM, NSD_3)) ERR;
if (nc_def_var_quantize(ncid, varid2, NC_QUANTIZE_BITGROOM, NSD_3)) ERR;

/* Set up zlib compression. This will work better because the
* data are quantized, yielding a smaller output file. We will
* set compression level to 1, which is usually the best
* choice. */
if (nc_def_var_deflate(ncid, varid1, 0, 1, 1)) ERR;
if (nc_def_var_deflate(ncid, varid2, 0, 1, 1)) ERR;
@endcode

Figure 3: Example of using the quantize feature in C. Note that the
example also demonstrates adding zlib (a.k.a. deflate) compression to
the variables. Without turning on the compression, use of quantize
alone will not result in smaller data output.

### Using Quantize with the NetCDF Fortran 90 API

Quantize is available on a branch of the netcdf-fortran library. It
will be merged to main after the next netcdf-c release (4.9.0), and
released as part of the netCDF Fortran 90 API in the subsequent
netcdf-fortran release.

In the Fortran 90 API, quantization is turned on by using two new
optional arguments to nf90_def_var(), the quantize_mode and the nsd
arguments.

@code
! Define some variables.
call check(nf90_def_var(ncid, VAR1_NAME, NF90_FLOAT, dimids, varid1&
&, deflate_level = DEFLATE_LEVEL, quantize_mode =&
& nf90_quantize_bitgroom, nsd = 3))
call check(nf90_def_var(ncid, VAR2_NAME, NF90_DOUBLE, dimids,&
& varid2, contiguous = .TRUE., quantize_mode =&
& nf90_quantize_bitgroom, nsd = 3))
@endcode

Figure 4: In the Fortran 90 netCDF API, two additional optional
parameters are available for the quantize feature, the quantize_mode
and nsd parameters.

### Using Quantize with the NetCDF Fortran 77 API

Quantize is available on a branch of the netcdf-fortran library. It
will be merged to main after the next netcdf-c release (4.9.0), and
released as part of the netCDF Fortran 77 API in the subsequent
netcdf-fortran release.

@code
C Create some variables.
do x = 1, NVARS
retval = nf_def_var(ncid, var_name(x), var_type(x), NDIM1,
$ dimids, varid(x))
if (retval .ne. nf_noerr) stop 3

C Turn on quantize.
retval = nf_def_var_quantize(ncid, varid(x),
$ NF_QUANTIZE_BITGROOM, NSD_3)
if (retval .ne. nf_noerr) stop 3

C Turn on zlib compression.
retval = nf_def_var_deflate(ncid, varid(x), 0, 1, 1)
if (retval .ne. nf_noerr) stop 3
end do
@endcode

Figure 5: In the Fortran 77 netCDF API, nf_def_var_quantize() and
nf_inq_var_quantize() are provided, which wrap the quantize functions
from the C API.

## Performance

![Quantization Performance](quantize_performance.png)

Figure 6: Compression ratio of the E3SM Atmosphere Model (EAM) v2
default monthly dataset (raw size 445 MB) compressed with the default
netCDF lossless compression algorithm (DEFLATE, compression level 1)
alone (leftmost), or after pre-filtering with one of three lossy codecs
(BitGroom, Granular BitGroom, or BitRound), with quantization
increasing (and precision decreasing) to the right.

## References

1. HDF5 Dynamically Loaded Filters, The HDF Group, retrieved on
December 2, 2021 from
https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf.

2. Hartnett, E., and Zender, C. S. (2020), Additional netCDF
Compression Options with the Community Codec Repository (CCR), American
Meteorological Society (AMS) Annual Meeting, retrieved on July 3, 2021
from
https://www.researchgate.net/publication/347726695_ADDITIONAL_NETCDF_COMPRESSION_OPTIONS_WITH_THE_COMMUNITY_CODEC_REPOSITORY_CCR.

3. Zender, C. S. (2016), Bit Grooming: Statistically accurate
precision-preserving quantization with compression, evaluated in the
netCDF Operators (NCO, v4.4.8+), Geosci. Model Dev., 9, 3199-3211,
doi:10.5194/gmd-9-3199-2016, retrieved on Sep 21, 2020 from
https://www.researchgate.net/publication/301575383_Bit_Grooming_Statistically_accurate_precision-preserving_quantization_with_compression_evaluated_in_the_netCDF_Operators_NCO_v448.

4. Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of
lossless and lossy algorithms for the compression of scientific
datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9),
4099-4113, doi:10.5194/gmd-2018-250, retrieved on Sep 21, 2020 from
https://www.researchgate.net/publication/335987647_Evaluation_of_lossless_and_lossy_algorithms_for_the_compression_of_scientific_datasets_in_netCDF-4_or_HDF5_files.

5. Hartnett, E., et al., "Provide a way to do bit grooming before
compression", netcdf-c GitHub Issue #1548,
https://github.com/Unidata/netcdf-c/issues/1548.

6. Rew, R., et al., NetCDF Users Guide, Appendix A: Attribute
Conventions, Unidata,
https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html.
21 changes: 12 additions & 9 deletions libdispatch/dvar.c
@@ -518,15 +518,18 @@ nc_def_var_deflate(int ncid, int varid, int shuffle, int deflate, int deflate_le
variable which has been quantized is readable by older versions of
the netCDF libraries, and by netCDF-Java.
For more information about quantization and the BitGroom filter, see
Zender, C. S. (2016), Bit Grooming: Statistically accurate
precision-preserving quantization with compression, evaluated in
the netCDF Operators (NCO, v4.4.8+), Geosci. Model Dev., 9,
3199-3211, doi:10.5194/gmd-9-3199-2016 Retrieved on Sep 21, 2020
from
https://www.researchgate.net/publication/301575383_Bit_Grooming_Statistically_accurate_precision-preserving_quantization_with_compression_evaluated_in_the_netCDF_Operators_NCO_v448.
For more information about quantization and the BitGroom filter,
see @ref quantize.
@note Users new to quantization should start with Granular Bit
Round, which results in the best compression. The Bit Groom
algorithm does not compress as effectively, but is faster than
Granular Bit Round. The Bit Round algorithm accepts a number of
bits to maintain, rather than a number of decimal digits, and is
provided for users who are already performing some bit-based
quantization and wish to turn this task over to the netCDF
library.
@param ncid File ID.
@param varid Variable ID. ::NC_GLOBAL may not be used.
@param quantize_mode Quantization mode. May be ::NC_NOQUANTIZE or
