Merge pull request #2362 from edwardhartnett/ejh_doc
adding quantize documentation
WardF authored Jun 6, 2022
2 parents bef049d + 31c7b7a commit 054c392
Showing 8 changed files with 313 additions and 16 deletions.
2 changes: 1 addition & 1 deletion docs/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -95,6 +95,6 @@ obsolete/fan_utils.html bestpractices.md filters.md indexing.md
inmemory.md DAP2.dox FAQ.md
known_problems.md
COPYRIGHT.dox user_defined_formats.md DAP4.md DAP4.dox
testserver.dox byterange.dox filters.md nczarr.md auth.md)
testserver.dox byterange.dox filters.md nczarr.md auth.md quantize.md)

ADD_EXTRA_DIST("${CUR_EXTRA_DIST}")
5 changes: 4 additions & 1 deletion docs/Doxyfile.in
@@ -758,6 +758,7 @@ INPUT = \
@abs_top_srcdir@/docs/inmeminternal.dox \
@abs_top_srcdir@/docs/indexing.dox \
@abs_top_srcdir@/docs/testserver.dox \
@abs_top_srcdir@/docs/quantize.md \
@abs_top_srcdir@/include/netcdf.h \
@abs_top_srcdir@/include/netcdf_mem.h \
@abs_top_srcdir@/include/netcdf_par.h \
@@ -906,7 +907,9 @@ IMAGE_PATH = @abs_top_srcdir@/docs/images/chunking2.png \
@abs_top_srcdir@/docs/images/netcdf_architecture.png \
@abs_top_srcdir@/docs/images/pnetcdf.png \
@abs_top_srcdir@/docs/images/deptree.jpg \
@abs_top_srcdir@/docs/images/InstallTreeWindows.png
@abs_top_srcdir@/docs/images/InstallTreeWindows.png \
@abs_top_srcdir@/docs/images/quantize_pi.png \
@abs_top_srcdir@/docs/images/quantize_performance.png

# The INPUT_FILTER tag can be used to specify a program that doxygen should
# invoke to filter for each input file. Doxygen will invoke the filter program
2 changes: 1 addition & 1 deletion docs/Makefile.am
@@ -13,7 +13,7 @@ windows-binaries.md dispatch.md building-with-cmake.md CMakeLists.txt groups.dox
notes.md install-fortran.md credits.md auth.md filters.md \
obsolete/fan_utils.html indexing.dox inmemory.md FAQ.md \
known_problems.md COPYRIGHT.md inmeminternal.dox testserver.dox \
byterange.dox nczarr.md
byterange.dox nczarr.md quantize.md

# Turn off parallel builds in this directory.
.NOTPARALLEL:
9 changes: 5 additions & 4 deletions docs/images/Makefile.am
@@ -5,7 +5,8 @@

# See netcdf-c/COPYRIGHT file for more info.

EXTRA_DIST = aqua.jpg chunking2.png compatibility3.png compression.png \
groups.png nc4-model.png ncatts.png nc-classic-uml.png nccoords.png \
ncfile.png pnetcdf.png terra.jpg netcdf_architecture.png \
deptree.jpg InstallTreeWindows.png uniLogo.png
EXTRA_DIST = aqua.jpg chunking2.png compatibility3.png \
compression.png groups.png nc4-model.png ncatts.png \
nc-classic-uml.png nccoords.png ncfile.png pnetcdf.png terra.jpg \
netcdf_architecture.png deptree.jpg InstallTreeWindows.png \
uniLogo.png quantize_pi.png quantize_performance.png
Binary file added docs/images/quantize_performance.png
Binary file added docs/images/quantize_pi.png
290 changes: 290 additions & 0 deletions docs/quantize.md
@@ -0,0 +1,290 @@
# Lossy Compression with Quantize

## Introduction {#quantize}

The quantize feature was initially developed as part of the Community
Codec Repository (CCR) [2]. The CCR project allows netCDF users to
make use of HDF5 plugins (a.k.a. “filters”) which can add new
compression and other algorithms to the HDF5 library. As part of CCR,
the quantization algorithms were implemented as HDF5 filters.

However, one aspect of implementing the quantization as a filter is
that the filter is also required when reading the data [1]. Although
this makes sense for compression/decompression algorithms, the
quantize algorithms are only needed when data are written. Requiring
that the readers of the data also install the filters places an
unnecessary burden on data readers. Furthermore, using the quantize
filter results in data that cannot be read by netCDF-Java or versions
of netcdf-c before 4.8.0, when support for multiple HDF5 filters was
added. For these reasons, it was decided to merge the quantize
algorithms into the netcdf-c library [5].

As part of the netcdf-c library, the quantize algorithms are available
for netCDF/HDF5 files and the new ncZarr format. They produce data
files that are fully backward compatible with all versions of netcdf-c
since 4.0, and also fully compatible with netCDF-Java.

## The Quantize Feature

The quantize algorithms assist with lossy compression by setting
excess bits to all zeros or all ones (in alternate array values). This
allows a subsequent compression algorithm, like the zlib-based
deflation built into netCDF-4, to better compress the data.

The quantize feature is applied to a variable in a netCDF file, and
may only be used with single or double precision floating point
(netCDF types NC_FLOAT and NC_DOUBLE). Attempting to turn on quantize
for any other type of netCDF variable will result in an error.

It should be noted that turning on quantize does not, by itself,
reduce the size of the data. Only if subsequent compression is used
will setting the quantize feature result in additional compression.

![Quantization of Pi](quantize_pi.png)
Figure 1: The value of Pi expressed as a 32-bit floating point number,
with different levels of quantization applied, from Number of
Significant Digits equal to 8 (no quantization), to 1 (maximum
quantization). The least significant bits of the significand are
replaced with zeros, to the extent possible, while preserving the
desired number of significant digits. In this example the Bit Grooming
quantization algorithm is used.

## Quantization Algorithms

Three different quantization algorithms are provided in the netcdf-c
quantize feature. Each does a somewhat different calculation to
determine the number of bits that can be set to zeros (or ones), while
preserving the number of significant digits specified by the user.

Two of the algorithms, Bit Grooming and Granular Bit Round, accept the
number of decimal digits to be preserved in the data. One algorithm,
Bit Round, accepts the number of binary bits to preserve.

### Bit Grooming

The Bit Grooming algorithm determines the number of bits which are
necessary for the required number of significant decimal digits. This
determination is made once, at the beginning of processing, and is
applied to all values.

Bit Grooming then sets excess bits of the first array value to zero,
then excess bits of the next array value to one, and continues
alternating between zero and one for the excess bits of every other
array value. In this way, the average value of the array is preserved.

For the Bit Grooming algorithm, the NSD parameter refers to the number
of significant decimal digits that will be preserved. The number of
significant digits may be 1-7 for single precision floating point, or
1-15 for double precision floating point.
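The alternating shave/set behavior can be sketched in C as follows. This is a simplified illustration (the real library also handles fill values, double precision, and special cases), and `bitgroom_sketch` is a hypothetical name; `keep_bits` is assumed to be in the range 1-22:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the alternating Bit Groom idea: shave excess significand
 * bits to 0 on even-indexed elements and set them to 1 on odd-indexed
 * elements, so the quantization errors roughly cancel in the array
 * mean. Assumes 1 <= keep_bits <= 22. */
void bitgroom_sketch(float *data, size_t n, int keep_bits)
{
    uint32_t mask = 0xFFFFFFFFu << (23 - keep_bits);
    for (size_t i = 0; i < n; i++) {
        uint32_t b;
        memcpy(&b, &data[i], sizeof(b));
        if (i % 2 == 0)
            b &= mask;       /* shave: excess bits -> 0 */
        else
            b |= ~mask;      /* set: excess bits -> 1 */
        memcpy(&data[i], &b, sizeof(b));
    }
}
```

Because one element is rounded down and the next is rounded up, the errors alternate in sign instead of accumulating.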

### Granular Bit Round

Granular Bit Round determines the number of required bits for each
value in the array, and uses IEEE rounding to change the data
value. It achieves a better overall compression ratio by more
aggressively determining the minimum number of bits required to
preserve the specified number of decimal digit precision.

For the Granular Bit Round algorithm, the NSD parameter refers to the
number of significant decimal digits that will be preserved (as with
the Bit Grooming algorithm). The number of significant digits may be
1-7 for single precision floating point, or 1-15 for double precision
floating point.

### Bit Round

The Bit Round algorithm allows the user to directly specify the number
of bits of the significand which will be preserved, and then sets
excess bits to zero or one for alternate array values.

For the Bit Round algorithm, the NSD parameter refers to the number of
significant binary digits that will be preserved. The number of
significant digits may be 1-23 for single precision floating point, or
1-52 for double precision floating point.
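The rounding idea behind bit-count quantization can be sketched in C as follows. This is illustrative only: it uses round-half-up on the bit pattern, whereas the library's rounding follows IEEE conventions; `bitround_sketch` is a hypothetical name, and `keep_bits` is assumed to be in the range 1-22:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: round a float to keep_bits significand bits by adding half
 * of the last kept bit's place value to the bit pattern, then clearing
 * the excess bits. A carry may propagate into the exponent, which is
 * the correct behavior at a power-of-two boundary.
 * Assumes 1 <= keep_bits <= 22. */
float bitround_sketch(float v, int keep_bits)
{
    uint32_t b;
    memcpy(&b, &v, sizeof(b));
    uint32_t half = 1u << (23 - keep_bits - 1);
    uint32_t mask = 0xFFFFFFFFu << (23 - keep_bits);
    b += half;      /* round half up in the last kept bit position */
    b &= mask;      /* excess bits -> 0 */
    memcpy(&v, &b, sizeof(v));
    return v;
}
```

For example, rounding pi to 8 significand bits gives 3.140625, the nearest representable value at that precision.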

## Quantize Attribute

When the quantize feature is used, an integer attribute is added to
the variable which contains the NSD setting. Without this attribute it
would be impossible for readers to know that quantize had been applied
to the data. The name of the attribute reflects the quantize algorithm
used. In accordance with the conventions established by the NetCDF
Users Guide, these attribute names begin with an underscore,
indicating that they are added by the library and should not be
modified or deleted by users [6].

Algorithm | Attribute Name
----------|---------------
Bit Groom | _QuantizeBitGroomNumberOfSignificantDigits
Granular Bit Round | _QuantizeGranularBitRoundNumberOfSignificantDigits
Bit Round | _QuantizeBitRoundNumberOfSignificantBits

Figure 2: Table showing the names of the attribute added to a variable
after the quantize feature has been applied. The name of the attribute
indicates the algorithm used; the integer value represents the number
of significant decimal digits (for Bit Groom and Granular Bit Round),
or the number of significand bits retained (for Bit Round).

## Handling of Fill Values

In a netCDF file, fill values refer to the value used for elements of
the data not written by the user. For example, if a variable contains
an array of 10 values, and the user only writes 8 of them, the other
two values will be set to the fill value for that variable.

The fill value of a variable may be set by the user by adding an
attribute of the same type as the variable with the name
"_FillValue". If present, the value of this attribute will be used as
the fill value for that variable. If not specified, a default value
for each type is used as the fill value. The default fill values may
be found in the netcdf.h file.

When using the quantize feature, any fill values will remain
unquantized. That is, the excess bits of any array element will not be
changed, if that element is the fill value. This is necessary if the
fill value is to retain its purpose as an indicator of values that
have not been written.
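This skip-the-fill-value rule can be sketched in C as follows. This is an illustrative fragment, not the netcdf-c code; `quantize_skip_fill` is a hypothetical name, and `mask` stands for whatever bitmask the chosen algorithm uses to select the significand bits to keep:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: whatever quantization is applied, elements equal to the
 * variable's fill value must pass through untouched, so that the fill
 * value keeps marking unwritten data. */
void quantize_skip_fill(float *data, size_t n, float fill, uint32_t mask)
{
    for (size_t i = 0; i < n; i++) {
        if (data[i] == fill)
            continue;                  /* never quantize the fill value */
        uint32_t b;
        memcpy(&b, &data[i], sizeof(b));
        b &= mask;                     /* shave excess bits */
        memcpy(&data[i], &b, sizeof(b));
    }
}
```

For a float variable with the default fill value (NC_FILL_FLOAT, 9.9692099683868690e+36 in netcdf.h), that element passes through unchanged while written values are quantized.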

## Using the Quantize Feature

Turning on the quantize feature must be done on a per-variable basis,
after the variable has been defined, and before nc_enddef() (or its
Fortran equivalent) has been called. (Recall that for netCDF/HDF5
files, nc_enddef() is automatically called when data are written to or
read from a variable.)

In accordance with the usual NetCDF API practice, an inquiry function
is also provided which may be called to check whether quantize has been
turned on for a variable. Calling the inquiry function is not required
when reading the data; it is provided for user convenience.

### Using Quantize with the NetCDF C API

Quantize is available in the main branch of the netcdf-c library, and
will be part of the next release (netcdf-c-4.9.0).

To turn on the quantize feature, call the nc_def_var_quantize()
function. To inquire whether quantize has been turned on for a
variable, use the nc_inq_var_quantize() function.

@code
/* Create two variables, one float, one double. Quantization
* may only be applied to floating point data. */
if (nc_def_var(ncid, "var1", NC_FLOAT, NDIM1, &dimid, &varid1)) ERR;
if (nc_def_var(ncid, "var2", NC_DOUBLE, NDIM1, &dimid, &varid2)) ERR;

/* Set up quantization. This will not make the data any
* smaller, unless compression is also turned on. In this
* case, we will set 3 significant digits. */
if (nc_def_var_quantize(ncid, varid1, NC_QUANTIZE_BITGROOM, NSD_3)) ERR;
if (nc_def_var_quantize(ncid, varid2, NC_QUANTIZE_BITGROOM, NSD_3)) ERR;

/* Set up zlib compression. This will work better because the
* data are quantized, yielding a smaller output file. We will
* set compression level to 1, which is usually the best
* choice. */
if (nc_def_var_deflate(ncid, varid1, 0, 1, 1)) ERR;
if (nc_def_var_deflate(ncid, varid2, 0, 1, 1)) ERR;
@endcode

Figure 3: Example of using the quantize feature in C. Note that the
example also demonstrates adding zlib (a.k.a. deflate) compression to
the variables. Without turning on the compression, use of quantize
alone will not result in smaller data output.

### Using Quantize with the NetCDF Fortran 90 API

Quantize is available on a branch of the netcdf-fortran library. It
will be merged to main after the next netcdf-c release (4.9.0), and
released as part of the netCDF Fortran 90 API in the subsequent
netcdf-fortran release.

In the Fortran 90 API, quantization is turned on by using two new
optional arguments to nf90_def_var(), the quantize_mode and the nsd
arguments.

@code
! Define some variables.
call check(nf90_def_var(ncid, VAR1_NAME, NF90_FLOAT, dimids, varid1&
&, deflate_level = DEFLATE_LEVEL, quantize_mode =&
& nf90_quantize_bitgroom, nsd = 3))
call check(nf90_def_var(ncid, VAR2_NAME, NF90_DOUBLE, dimids,&
& varid2, contiguous = .TRUE., quantize_mode =&
& nf90_quantize_bitgroom, nsd = 3))
@endcode

Figure 4: In the Fortran 90 netCDF API, two additional optional
parameters are available for the quantize feature, the quantize_mode
and nsd parameters.

### Using Quantize with the NetCDF Fortran 77 API

Quantize is available on a branch of the netcdf-fortran library. It
will be merged to main after the next netcdf-c release (4.9.0), and
released as part of the netCDF Fortran 77 API in the subsequent
netcdf-fortran release.

@code
C Create some variables.
do x = 1, NVARS
retval = nf_def_var(ncid, var_name(x), var_type(x), NDIM1,
$ dimids, varid(x))
if (retval .ne. nf_noerr) stop 3

C Turn on quantize.
retval = nf_def_var_quantize(ncid, varid(x),
$ NF_QUANTIZE_BITGROOM, NSD_3)
if (retval .ne. nf_noerr) stop 3

C Turn on zlib compression.
retval = nf_def_var_deflate(ncid, varid(x), 0, 1, 1)
if (retval .ne. nf_noerr) stop 3
end do
@endcode

Figure 5: In the Fortran 77 netCDF API, nf_def_var_quantize() and
nf_inq_var_quantize() are provided, which wrap the quantize functions
from the C API.

## Performance

![Quantization Performance](quantize_performance.png)

Figure 6: Compression ratio of the E3SM Atmosphere Model (EAM) v2
default monthly dataset (raw size 445 MB) compressed with the default
netCDF lossless compression algorithm (DEFLATE, compression level 1)
alone (leftmost), or after pre-filtering with one of three lossy codecs
(BitGroom, Granular BitGroom, or BitRound), with quantization
increasing (and precision decreasing) to the right.

## References

1. HDF5 Dynamically Loaded Filters, The HDF Group, retrieved on
December 2, 2021 from
https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf.

2. Hartnett, E., and Zender, C. S. (2020), Additional netCDF
Compression Options with the Community Codec Repository (CCR), American
Meteorological Society (AMS) Annual Meeting, retrieved on July 3, 2021
from
https://www.researchgate.net/publication/347726695_ADDITIONAL_NETCDF_COMPRESSION_OPTIONS_WITH_THE_COMMUNITY_CODEC_REPOSITORY_CCR.

3. Zender, C. S. (2016), Bit Grooming: Statistically accurate
precision-preserving quantization with compression, evaluated in the
netCDF Operators (NCO, v4.4.8+), Geosci. Model Dev., 9, 3199-3211,
doi:10.5194/gmd-9-3199-2016, retrieved on Sep 21, 2020 from
https://www.researchgate.net/publication/301575383_Bit_Grooming_Statistically_accurate_precision-preserving_quantization_with_compression_evaluated_in_the_netCDF_Operators_NCO_v448.

4. Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of
lossless and lossy algorithms for the compression of scientific
datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9),
4099-4113, doi:10.5194/gmd-2018-250, retrieved on Sep 21, 2020 from
https://www.researchgate.net/publication/335987647_Evaluation_of_lossless_and_lossy_algorithms_for_the_compression_of_scientific_datasets_in_netCDF-4_or_HDF5_files.

5. Hartnett, E., et al., "Provide a way to do bit grooming before
compression", netcdf-c GitHub Issue #1548,
https://github.com/Unidata/netcdf-c/issues/1548.

6. Rew, R., et al., NetCDF Users Guide, Appendix A: Attribute
Conventions, Unidata,
https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html.
21 changes: 12 additions & 9 deletions libdispatch/dvar.c
@@ -518,15 +518,18 @@ nc_def_var_deflate(int ncid, int varid, int shuffle, int deflate, int deflate_le
variable which has been quantized is readable by older versions of
the netCDF libraries, and by netCDF-Java.
For more information about quantization and the BitGroom filter, see
Zender, C. S. (2016), Bit Grooming: Statistically accurate
precision-preserving quantization with compression, evaluated in
the netCDF Operators (NCO, v4.4.8+), Geosci. Model Dev., 9,
3199-3211, doi:10.5194/gmd-9-3199-2016 Retrieved on Sep 21, 2020
from
https://www.researchgate.net/publication/301575383_Bit_Grooming_Statistically_accurate_precision-preserving_quantization_with_compression_evaluated_in_the_netCDF_Operators_NCO_v448.
For more information about quantization and the BitGroom filter,
see @ref quantize.
@note Users new to quantization should start with Granular Bit
Round, which results in the best compression. The Bit Groom
algorithm does not compress as effectively, but is faster than
Granular Bit Round. The Bit Round algorithm accepts a number of
bits to maintain, rather than a number of decimal digits, and is
provided for users who are already performing some bit-based
quantization and wish to turn this task over to the netCDF
library.
@param ncid File ID.
@param varid Variable ID. ::NC_GLOBAL may not be used.
@param quantize_mode Quantization mode. May be ::NC_NOQUANTIZE or
