
Overcomplicated and rigid interface to Quantization #2369

Closed
rkouznetsov opened this issue May 30, 2022 · 23 comments · Fixed by #2362

Comments

@rkouznetsov
Contributor

This is a follow-up to my question at EGU (on https://meetingorganizer.copernicus.org/EGU22/EGU22-13259.html). Unfortunately, the slides are only available to participants, so you might wish to consider sharing them elsewhere.

According to the presentation, the current implementation has three modes of trimming precision, each controlled by a number of significant digits, i.e. relative precision. I believe a single floating-point parameter, e.g. a quantum (or an error margin), would provide a more flexible, fair, and transparent way to control the relative quantization error than a number of significant digits does.

The implementation also lacks a way to trim absolute precision, which is the right approach for the majority of meteorological fields (see slide 8 of http://silam.fmi.fi/roux/PrecisionEGU2022slides-ShownAtEGU2022.pdf). That would require a second floating-point parameter.

Here is more rationale:

int nsd is a very coarse granule of control, and it is not obvious which error margin one gets by requesting a specific number of decimals. Specifying an explicit margin or a quantum (= 2x the margin) would give better control.

Of the three modes, only rounding is needed:

  • "_QuantizeBitGroomNumberOfSignificantDigits" has been shown to be sub-optimal in terms of precision, and generating artifacts in multi-point statistics https://doi.org/10.5194/gmd-14-377-2021. I could not find any example where BitGroom should be preferred to rounding. Please let me know if you have any.

  • "_QuantizeGranularBitRoundNumberOfSignificantDigits" is sub-optimal unless one really wants to print the numbers as decimals. If one wants to constrain the error margin, rounding is the right method (see my slide 4 from the link above). With this method for N digits one gets an error margin and compression efficiency equivalent to those of rounding to N-1 digits (slide 5). Again, I would be happy to see any example disproving this statement.

  • "_QuantizeBitRoundNumberOfSignificantBits" Actually, explicit error margin would be more clear for those who are not fluent with the binary representation.

To summarize: giving users explicit control over the error margins, in terms of absolute or relative error, would be more flexible and much less confusing. Please consider implementing it, and let me know if I can be of any help.
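
For illustration, a minimal sketch (a hypothetical helper, not an existing library call) of rounding to an explicit quantum q, where the error margin is q/2:

```c
#include <math.h>

/* Hypothetical helper: round x to the nearest multiple of the
 * quantum q (= 2x the error margin), so that |error| <= q/2. */
float quantize_to_quantum(float x, float q)
{
    return q * roundf(x / q);
}
```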

@edwardhartnett
Contributor

What would the desired API look like?

@rkouznetsov
Contributor Author

I would suggest using e.g. float abs_error_margin and float rel_error_margin parameters, with zero defaults, and getting attributes for them into the CF conventions.
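
A hypothetical signature along these lines (the name and semantics are illustrative only, not an existing netCDF call) might be:

```c
/* Hypothetical API sketch -- not an existing netCDF function.
 * A zero margin would mean "do not constrain that kind of error". */
int nc_def_var_quantize_margins(int ncid, int varid,
                                float abs_error_margin,
                                float rel_error_margin);
```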

@edwardhartnett
Contributor

@czender your thoughts on this?

@rkouznetsov
Contributor Author

BTW, it would be great to avoid using _underscore attributes. They seem to cause a lot of trouble, since each piece of software has to decide (not always correctly) whether to keep them, discard them, or reset them. I guess @czender can tell more about it...

@edwardhartnett
Contributor

Underscore attributes should never be removed or changed. That seems like a fairly straightforward rule and is laid down in the NetCDF Users Guide.

@rkouznetsov
Contributor Author

One might still wish to trim the precision further later, if the application is less demanding than the original precision. If we accept that "never", then these attributes should not be underscored.

@DennisHeimbigner
Collaborator

Let me clarify a bit. The NUG (at least the attribute conventions document) says:

Attribute names commencing with underscore ('_') are reserved for use by the netCDF library.

The point is to reserve part of the attribute namespace for our use, to avoid potential conflicts with user-defined attributes. That is why it would be inadvisable to remove those leading underscores.

@czender
Contributor

czender commented May 31, 2022

@rkouznetsov Thank you for your input. The metadata attributes planned for the netCDF library ensure complete provenance and reproducibility, because they identify the algorithms and the inputs to the algorithms. Moreover, participants in the 2021 netCDF-CF workshop concurred that unique identification of the algorithms and their inputs was essential. That does not preclude adding other algorithms or attributes in the future. Should algorithms that take the relative and/or absolute error metrics you suggest as input be incorporated in future netCDF library releases, it would make sense for the library to write those metrics instead. I encourage you to participate, virtually or in person, in the 2022 netCDF-CF workshop in Santander, Spain, from 9/13 to 9/15. I am leading a session there on metadata for lossy compression where we will discuss exactly the issues you raise. I would be happy to work with you beforehand to draft suggestions for consideration at that meeting. Unfortunately, it is probably too late to change what is currently staged for adoption in netCDF 4.9.0.

As we discussed in Vienna, your characterization of GranularBitRound is incomplete. Please acknowledge that GranularBitRound obtains a significantly better compression ratio for the same NSD than BitGroom followed by BitRound (which, as you note, NCO implemented as the default for a while). Then your concerns would be balanced by an appreciation of the point of view of others. However, this GitHub issue is not the place for that discussion. The implementation of metadata by the netCDF library for the imminently supported quantization algorithms does not preclude implementing related algorithms that would more naturally be designed to store the metadata you suggest. These algorithms were all recently invented, and I hope and expect that improvements such as you suggest will be implemented and tested for possible inclusion in later versions of NCO (which it is easy to get an algorithm into) and netCDF (which necessarily lags a bit).

@edwardhartnett
Contributor

Well, good commentary from @DennisHeimbigner and @czender!

I would like to address the question of why the three algorithms are provided.

  • Bit Groom is not as compressive as Granular Bit Round, but is the fastest.
  • Granular Bit Round is the most compressive, and is what most users new to quantization should use.
  • Bit Round takes a number of bits instead of decimal digits, and is for users already doing their own bit-based quantization. It was added at the specific request of NOAA, which has already conducted significant analysis of the scientific impact of its bit-based quantization. No doubt other users who are already doing some quantization, and have determined what number of bits they need to retain, will do the same (the call shape for all three modes is sketched below).
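
For reference, all three modes are selected through the same call in the staged 4.9.0 API; a minimal sketch using the mode constants from the netcdf-c headers:

```c
#include <netcdf.h>

/* Sketch: enable quantization on an existing float variable,
 * using the mode constants defined in netcdf.h for 4.9.0. */
static int enable_quantize(int ncid, int varid)
{
    /* Granular Bit Round keeping 3 significant decimal digits:
     *   nc_def_var_quantize(ncid, varid, NC_QUANTIZE_GRANULARBR, 3);
     * or Bit Round keeping 12 significant bits: */
    return nc_def_var_quantize(ncid, varid, NC_QUANTIZE_BITROUND, 12);
}
```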

WRT the CF conventions, I note that the netcdf-c library generally does not attempt to write CF convention attributes - that is left for the user to do. It is not usually possible for the library to know everything it needs to know for CF conventions. (For example, long_name and units attributes are required, yet netcdf-c does not attempt to provide them.)

As @czender points out, the CCR and the NCO projects are good proving grounds for advances in quantization. Once new approaches are demonstrated there, and the details worked out, they may be considered for inclusion in the netcdf-c library.

I suggest this issue be closed, as we are not planning any immediate changes to the quantize API. However, I will update the documentation to make the choice between the three available algorithms clearer.

@rkouznetsov
Contributor Author

rkouznetsov commented May 31, 2022

@czender, thank you for the correction. I'd be happy to participate in the workshop.

I hope we agree that all lossy compression is about trading some precision for compression ratio. My main point was that NSD is a very vague measure of precision, unlike an error margin (or, e.g., RMSE). I never denied, and hereby fully acknowledge, that GranularBitRound obtains a significantly better compression ratio for the same formal NSD. My problem is that the same NSD, while keeping the same number of decimals, translates to a 10x larger error margin for GranularBitRound than for BitRound. It is no surprise that larger errors correspond to higher compression. If we constrain the error margin, N NSD of GranularBitRound should be compared to N-1 NSD of BitRound, and then there is no benefit in compression ratio. Therefore I still insist that the only use case where GranularBitRound is preferable to BitRound with N-1 NSD is printing the numbers in decimal form. Implementing a specific algorithm for such an exotic use case in a general-purpose library is probably not worth it. As I wrote above, any more realistic use case is welcome.

What bugs me about the current implementation is that a user has to get into the details of the algorithms. If they do not, they may suddenly get a 10x higher error after a minor-version update (here I refer again to slide 4). I have scripts with many nco calls running on a few dozen computers, and keeping an eye on which versions are used where is quite a bit of additional load. The alternative is to always remember to pass all the cryptic flags that force a specific algorithm, a specific number of bits, etc.

What I suggest is an algorithm-agnostic measure of precision that does not depend on a specific binary or decimal representation of a number. Moreover, that would allow proper error-propagation estimates. Consider two fields with NSD=1 as in GranularBitRound: their sum would no longer have NSD=1. By contrast, if you sum two same-sign fields that each have a relative error margin of 0.1 (i.e. NSD=1 as in BitRound), the sum is guaranteed to have the same relative error margin, since |Δ(a+b)| <= |Δa| + |Δb| <= 0.1(|a| + |b|) = 0.1|a + b|.

I fully recognize production cycles and release schedules. However, introducing all this mess with decimals and specific algorithms into the main NetCDF library will create a backward-compatibility nightmare in the future. So for now it might make sense to put this feature on hold and work out a more universal approach.

@rkouznetsov
Contributor Author

@edwardhartnett Thank you!

Once new approaches are demonstrated there, and the details worked out, they may be considered for inclusion in the netcdf-c library.

This is a good statement. I just draw your attention to the fact that BitGroom has been shown to be sub-optimal and is no longer the default in nco, while GranularBitRound became the nco default only a few months ago and immediately caused problems in our operational applications. Non-default options are rarely used: they can certainly be considered "demonstrated", but not well tested.

@rkouznetsov
Contributor Author

As a suggestion: would it help to add a helper function that translates a number of decimals to an approximate number of bits for BitRound, plus another that translates a relative error margin to a number of bits? Then the parameter selecting the algorithm would no longer be needed, and a user would still have the option of specifying decimals or an error margin if they wish. int nsd could then be replaced with int nsb.
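
A sketch of what such helpers could look like (hypothetical names; the exact counts would need cross-checking against BitRound's convention for the implicit bit):

```c
#include <math.h>

/* Hypothetical helper: decimal digits -> significant bits,
 * at log2(10) ~= 3.32 bits per decimal digit, rounded up. */
int nsd_to_nsb(int nsd)
{
    return (int)ceil(nsd * 3.321928);
}

/* Hypothetical helper: relative error margin -> significant bits.
 * Keeping n significant bits bounds the relative error by roughly
 * 2^-n, so invert that (approximate). */
int rel_margin_to_nsb(float rel_error_margin)
{
    return (int)ceil(-log2f(rel_error_margin));
}
```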

@edwardhartnett
Contributor

It would be helpful to see some C code that accepts the two floating-point parameters you suggest and applies them to some data, so we can see exactly how that would work.

@rkouznetsov
Contributor Author

It has been done already. Fortran and Python implementations, along with the code that applies them (including data generation and plotting of statistics), can be found in my GMD paper https://doi.org/10.5194/gmd-14-377-2021 and its supplement. A C implementation can be found in nco. If you have a specific wrapper that you used to test the existing algorithms, I could copy-paste the implementations from nco into it...
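
For a compact illustration of how the two parameters could combine, here is a simplified sketch (an assumption on my part: the coarser of the two quanta wins, so each value satisfies at least one of the two margins; the NCO/GMD-paper implementations instead operate on the binary representation directly):

```c
#include <math.h>

/* Simplified sketch: quantize x so the error stays within
 * abs_error_margin or rel_error_margin * |x|, whichever permits
 * the coarser quantum (quantum = 2x the margin). */
float quantize_margins(float x, float abs_error_margin,
                       float rel_error_margin)
{
    float q_abs = 2.0f * abs_error_margin;
    float q_rel = 2.0f * rel_error_margin * fabsf(x);
    float q = fmaxf(q_abs, q_rel);
    return (q > 0.0f) ? q * roundf(x / q) : x;
}
```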

@rkouznetsov
Contributor Author

I could try to implement this directly in netcdf-c, but that would break backward compatibility with the currently implemented interface.

@edwardhartnett
Contributor

edwardhartnett commented May 31, 2022

I will take a look at this code.

However, even if we add more quantize modes, including ones based on percent error, that does not obviate the need for the existing API. For example, NOAA is planning to use the Bit Round algorithm in the Unified Forecast System as soon as it becomes available. I suspect other large data producers currently using a bit-oriented approach will do the same.

So while we continue to investigate your suggestions, the existing API should be released with the imminent 4.9.0 release, so that it can get into the hands of users.

Do we need to provide two floats as parameters (abs_error_margin and rel_error_margin) for your approach? Can we reduce it to one? Are these values percentages?

@rkouznetsov
Contributor Author

BitRound makes full sense to me, except for the name of the formal parameter nsd; it should be nsb or keepbits or something like that. Just check whether it counts the implicit bit or not. My reservations concern the two other algorithms and the argument that selects the algorithm. If we could postpone those for now, it would be great.

For future releases, it would probably be good to separate the low-level interface that actually does the bit rounding from the layer that interprets floating-point error margins (which do not map exactly onto binary) or the much vaguer integer decimals.

We would still need two parameters, since they express rather different ways of setting the error margin and can be applied in any combination (see slide 8). Assuming we already have one for relative precision (keepbits), one more will be needed for absolute precision.

I could implement those within the coming month or so. Does that make sense?
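
To make the proposed split concrete, here is a sketch of what the low-level kernel might look like (assumptions: keepbits counts explicit mantissa bits, inputs are finite, and ties round away from zero; a production kernel would round ties to even):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a low-level BitRound kernel for IEEE-754 floats.
 * keepbits counts explicit mantissa bits; whether the implicit
 * leading bit is included is exactly the convention to pin down. */
static float bitround(float x, int keepbits)
{
    uint32_t u;
    int shift = 23 - keepbits;       /* mantissa bits to discard */

    if (shift <= 0)
        return x;                    /* nothing to trim */
    memcpy(&u, &x, sizeof u);
    u += 1u << (shift - 1);          /* add 0.5 ulp: round to nearest */
    u &= ~((1u << shift) - 1u);      /* clear the discarded bits */
    memcpy(&x, &u, sizeof u);
    return x;
}
```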

@edwardhartnett
Contributor

edwardhartnett commented May 31, 2022

We can evaluate your proposal as an additional API function. That is no reason not to release what we have now, which is a vast improvement over nothing and allows users to apply lossy compression.

I understand you believe your way to be best, and perhaps it is for some users, but undoubtedly many users will find the current implementation very useful.

Since a separate API function will be required to support two floats, there is no reason to delay release of the current algorithms, which require one int.

@rkouznetsov
Contributor Author

Thank you! It is not my way, and it is not a belief, but rather something I can prove, and anyone is free to disprove it if they can. A single example would be sufficient to disprove it, but I have not seen one so far.

Could you at least avoid passing two different quantities under the same parameter name nsd, and explicitly state that there is no promise of backward compatibility for BitGroom and GranularBitRound, since those are highly likely to be deprecated? I would also avoid encouraging people to use them; the current documentation in #2362 is misleading.

@edwardhartnett
Contributor

WRT the parameter name, nsd stands for number of significant digits. Digits may be decimal digits or binary digits. Obviously we cannot add a new function merely to make that distinction clearer, so we will keep the parameter named as it is.

Nor is there any plan to deprecate the nsd quantization. All engineers and scientists understand what a number of significant digits means. Many quantities in meteorology (and elsewhere) are immediately understood in those terms. I trust scientists will use this feature responsibly; no great mathematical understanding is needed to do so.

What I would suggest is that, instead of trying to pull down what is already present, you put up a PR with a new, additional API, using error percentages. We can then evaluate that, and if all find it useful, add it to the API. Users will then have additional choices when choosing a quantization scheme.

If you have a specific note about the documentation, that would be welcome.

You have raised many issues in this thread, and I believe we have dealt with most of them. I look forward to seeing your contribution and helping evaluate its fitness for the netCDF API.

@czender
Contributor

czender commented May 31, 2022

@rkouznetsov It may be easier to prototype the kind of metadata functionality you want in NCO than in the netCDF-C library. I invite you to submit a patch or open a suggestion there.

@rkouznetsov
Contributor Author

rkouznetsov commented May 31, 2022

@czender Thank you! No problem, we'll do that.
@edhartnett For now, I believe we have to at least make sure that the recommendation to use BitGroom mentions the unbounded error in two-point statistics introduced by the method, and that the recommendation to use GranularBitRound mentions that NSD=1 introduces up to 50% relative error (where one would expect some 5%). These two issues are by no means immediately understood by scientists in the field. At least, I spent quite some time realizing that they were not bugs in my code, but rather inherent features of the methods.

@rkouznetsov
Contributor Author

rkouznetsov commented Jun 1, 2022

@edwardhartnett Sorry, I was wrong about 50%. The error margin for GranularBitRound in nco is 15% for NSD=1, 2% for NSD=2, 0.22% for NSD=3, etc. I am probably not too experienced with decimals, but I would expect the worst-case relative error for a given number of decimals to be less than the relative value of one unit in the last decimal place, i.e. at most 5%, 0.5%, and 0.05% for NSD of 1, 2, and 3 respectively (as it is for BitGroom, and as it was in nco before GranularBitRound).

So for the netcdf-c documentation I would recommend that those who prefer to think in decimals use BitRound with the number of bits given by nsb = ceil(3.32 * NSD) (please cross-check the equation; 3.32 approximates log2(10)). This method is the least creative in terms of introduced errors, and I would also bet that it is no slower than BitGroom.

I would still suggest using nsb (number of significant bits) as the parameter name, to avoid the confusion between "number of significant decimals" and "number of significant digits" and the need to always keep in mind whether the method works in the decimal or the binary system.
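
As a quick arithmetic check of that conversion (with 3.32 approximating log2(10)):

```c
#include <math.h>
#include <stdio.h>

/* Evaluate nsb = ceil(3.32 * NSD) for a few values of NSD. */
int main(void)
{
    for (int nsd = 1; nsd <= 6; nsd++)
        printf("NSD=%d -> nsb=%d\n", nsd, (int)ceil(3.32 * nsd));
    return 0;   /* prints 4, 7, 10, 14, 17, 20 */
}
```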
