Overcomplicated and rigid interface to Quantization #2369
Comments
What would the desired API look like?
I would suggest using, e.g.,
@czender, your thoughts on this?
BTW, it would be great to avoid using
Underscore attributes should never be removed or changed. That seems like a fairly straightforward rule and is laid down in the NetCDF Users Guide.
You might still wish to trim the precision further later if your application is less demanding than the original precision. Then, if we accept that "never", these attributes should not be underscored.
Let me clarify a bit. The NUG (at least the attribute conventions document) says:
The point is to reserve part of the attribute namespace for our use to avoid potential conflicts
@rkouznetsov Thank you for your input. The metadata attributes planned for the netCDF library ensure complete provenance and reproducibility, because they identify the algorithms and the inputs to the algorithms. Moreover, participants in the 2021 netCDF-CF workshop concurred that unique identification of the algorithms and their inputs was essential. That does not preclude adding other algorithms or attributes in the future. Should algorithms that take as input the relative and/or absolute error metrics you suggest be incorporated in future netCDF library releases, then it would make sense for the library to write those metrics instead. I encourage you to participate virtually or directly in the 2022 netCDF-CF workshop in Santander, Spain from 9/13-9/15. I am leading a session there on metadata for lossy compression where we will discuss exactly the issues you raise. I would be happy to work with you beforehand to draft suggestions for consideration at that meeting. Unfortunately, it is probably too late to change what is currently staged for adoption in netCDF 4.9.0. As we discussed in Vienna, your characterization of GranularBitRound is incomplete. Please acknowledge that GranularBitRound obtains a significantly better compression ratio for the same NSD than BitGroom followed by BitRound (which, as you note, NCO implemented as the default for a while). Then your concerns would be balanced by an appreciation of the point of view of others. However, this GitHub issue is not the place for that discussion. The implementation of metadata by the netCDF library for the imminently supported quantization algorithms does not preclude implementing related algorithms that would more naturally be designed to store the metadata you suggest.
These algorithms were all invented recently, and I hope and expect that improvements such as you suggest will be implemented and tested for possible inclusion in later versions of NCO (which is easy to get an algorithm into) and netCDF (which necessarily lags a bit).
Well, good commentary from @DennisHeimbigner and @czender! I would like to address the question of why the three algorithms are provided.
WRT the CF conventions, I note that the netcdf-c library generally does not attempt to write CF convention attributes; that is left for the user to do. It is not usually possible for the library to know everything it needs to know for CF conventions. (For example, long_name and units attributes are required, yet netcdf-c does not attempt to provide them.) As @czender points out, the CCR and NCO projects are good proving grounds for advances in quantization. Once new approaches are demonstrated there, and the details worked out, they may be considered for inclusion in the netcdf-c library. I suggest this issue be closed, as we are not planning any immediate changes to the quantize API. However, I will update the documentation to make the choice between the three available algorithms clearer.
@czender, thank you for the correction. I'd be happy to participate in the workshop. I hope we agree that all lossy compression is about trading some precision for compression ratio. What bugs me in the current implementation is that a user has to get into the details of the algorithms. If they do not, they suddenly get a 10x higher error after a minor-version update (here I refer again to slide 4). I have scripts with many NCO calls running on a few dozen computers, and keeping an eye on which versions are used where is quite a bit of additional load. An alternative would be to always remember to pass all the cryptic flags forcing a specific algorithm, a specific number of bits, etc.
What I suggest is an algorithm-agnostic measure of precision that does not depend on a specific binary or decimal representation of a number. Moreover, that would allow proper error-propagation estimates. Consider two fields with NSD=1 as in GranularBitRound. Their sum would no longer have NSD=1. If you have a sum of two fields with a relative error margin of 0.1 (or NSD=1 as in BitRound), you are guaranteed that their sum has the same error margin.
I fully recognize production cycles and release schedules. However, introducing all the mess with decimals and specific algorithms into the main NetCDF library will create a backward-compatibility nightmare in the future. So for now, it might make sense to put this feature on hold and elaborate a more universal approach.
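The error-propagation point about relative margins can be checked with a small sketch (illustrative C with a hypothetical helper name; not from any library):

```c
#include <math.h>

/* Worst-case relative error of a sum of two same-sign values that
 * each carry a relative error of at most r. The bound follows from
 * |da + db| <= r*(|a| + |b|) = r*|a + b| for same-sign a and b,
 * so the sum stays within the same relative margin r.
 * Hypothetical helper for illustration only. */
double rel_err_of_sum(double a, double b, double r) {
    double a_q = a * (1.0 + r);  /* a perturbed to its margin      */
    double b_q = b * (1.0 - r);  /* b perturbed the opposite way   */
    return fabs((a_q + b_q) - (a + b)) / fabs(a + b);
}
```

No analogous guarantee holds for a decimal NSD: adding two fields that each have NSD=1 gives a sum whose NSD is not bounded by 1.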
@edwardhartnett Thank you!
This is a good statement. I just draw your attention to the fact that BitGroom has been shown to be sub-optimal and is no longer used in NCO by default, and that GranularBitRound was enabled in NCO by default only a few months ago and immediately caused problems in our operational applications. Non-default options are rarely used. The algorithms can certainly be considered "demonstrated", but they cannot be considered well tested.
As a suggestion: would just adding a helper function that translates a number of decimals to an approximate number of bits for BitRound help? Also a function that translates a relative error margin to a number of bits.
One thing that would be helpful would be some C code that accepts the two floating-point parameters you suggest and applies them to some data, so we can see exactly how that would work.
It has been done already. Fortran and Python implementations, and the code that actually applies them (including the generation of data and the plotting of statistics), can be found in my GMD paper https://doi.org/10.5194/gmd-14-377-2021 and its supplement. A C implementation of the code can be found in NCO. If you have a specific wrapper that you used to test all the algorithms, I could copy the implementations from NCO into it.
I could try to implement these things directly in netcdf-c, but that would break backward compatibility with the currently implemented interface.
I will take a look at this code. However, even if we add more quantize modes, including ones based on percent error, that does not obviate the need for the existing API. For example, NOAA is planning to use the Bit Round algorithm in the Unified Forecast System as soon as it becomes available. I suspect other large data producers currently using a bit-oriented approach will do the same. So while we continue to investigate your suggestions, the existing API should be released with the imminent 4.9.0 release, so that it can get into the hands of users. Do we need to provide two floats as parameters (abs_error_margin and rel_error_margin) for your approach? Can we reduce it to one? Are these values percentages?
BitRound makes full sense to me, except for the naming of the formal parameter. For future releases it would probably be good to separate the low-level interface that actually does the bit rounding from the one that implements the interpretation of floating-point error margins, which do not map exactly into binary, or the interpretation of the much vaguer integer decimals. We would still need two parameters, since they express rather different ways of setting the error margin and can be applied in any combination (see slide 8). Assuming we already have one for relative precision (keepbits), one more will be needed for absolute precision. I could implement those in the coming month or so. Does that make sense?
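For reference, the low-level bit-rounding step under discussion can be sketched with the standard IEEE 754 integer trick (round-to-nearest, ties to even). This is an illustration of the technique, not the netcdf-c implementation:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Round a float's mantissa to `keepbits` explicit bits, with
 * round-to-nearest and ties to even, by integer arithmetic on the
 * bit pattern. Illustrative sketch, not the netcdf-c code. */
float bit_round(float x, int keepbits) {
    uint32_t bits;
    if (!isfinite(x)) return x;        /* leave NaN/Inf alone      */
    if (keepbits >= 23) return x;      /* nothing to discard       */
    memcpy(&bits, &x, sizeof bits);    /* type-pun safely          */
    int drop = 23 - keepbits;          /* low mantissa bits to cut */
    uint32_t half = 1u << (drop - 1);  /* 0.5 ULP of the result    */
    uint32_t mask = (1u << drop) - 1u;
    /* Add just under half an ULP, plus the current result LSB, so
     * that exact ties round to an even last kept bit. */
    bits += ((bits >> drop) & 1u) + half - 1u;
    bits &= ~mask;                     /* zero the dropped bits    */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

A carry from the addition can propagate into the exponent field, which is exactly the desired behavior when a value rounds up to the next power of two.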
We can evaluate your proposal as an additional API function. This is not a reason not to release what we have now, which is a vast improvement on nothing and allows users to use lossy compression. I understand you believe your way to be best, and perhaps it is for some users, but undoubtedly many users will find the current implementation very useful. Since a separate API function will be required to support two floats, there is no reason to delay the release of the current algorithms, which require one int.
Thank you! It is not my way, and it is not a belief, but rather something I can prove, and anyone is free to disprove if they can. Could you at least avoid passing two different parameters under the same name?
WRT the parameter name, nsd stands for number of significant digits. Digits may be decimal digits or binary digits. Obviously we cannot add a new function merely to make that distinction clearer, so we will keep the parameter named as it is. Nor is there any plan to deprecate the nsd quantization. All engineers and scientists understand what a number of significant digits means. Many quantities in meteorology (and elsewhere) will be immediately understood in those terms. I trust scientists will use this feature responsibly; no great mathematical understanding is needed for them to do so. What I would suggest is that, instead of trying to pull down what is already present, you put up a PR with a new, additional API using error percentages. We can then evaluate it and, if all find it useful, add it to the API. Users will then have additional choices when choosing a quantization scheme. If you have a specific note about the documentation, that would be welcome. You have raised many issues in this thread, and I believe we have dealt with most of them. I look forward to seeing your contribution and helping evaluate its fitness for the netCDF API.
@rkouznetsov It may be easier to prototype the kind of metadata functionality you want in NCO than in the netCDF-C library. I invite you to submit a patch or open a suggestion there.
@czender Thank you! No problem, we'll do that.
@edwardhartnett Sorry, I was wrong about the 50%. The error margin for GranularBitRound in NCO is 15% for NSD=1, 2% for NSD=2, 0.22% for NSD=3, etc. I am probably not too experienced with decimals, but I would expect the relative error specified in decimals to be less than the relative value of 1 at the last decimal place in the worst-case scenario, so it should be at most 5%, 0.5%, and 0.05% for NSD of 1, 2, and 3 respectively (as it is for BitGroom, and as it was in NCO before GranularBitRound). So for the netcdf-c documentation, to those who prefer to think in decimals I would recommend using BitRound with the number of bits defined as
I would still suggest using
This is just a follow-up to my question at EGU (to https://meetingorganizer.copernicus.org/EGU22/EGU22-13259.html). Unfortunately, the slides are only available to participants, so you might wish to consider sharing them elsewhere.
According to the presentation, the current implementation has three modes of trimming precision and a control for the number of significant digits, i.e. relative precision. I believe a single floating-point parameter, e.g. a quantum (or an error margin), would provide a more flexible, fair, and transparent way to control the quantization of relative error than a number of significant digits.
Also, the implementation lacks a way to trim the absolute precision, which is the right approach for the majority of meteorological fields (see slide 8 of http://silam.fmi.fi/roux/PrecisionEGU2022slides-ShownAtEGU2022.pdf). That would need a second floating-point parameter.
Here is more rationale:
The int nsd parameter is a very coarse-grained control, and it is not really obvious which error margin one gets by requesting a specific number of decimals. Specifying an explicit margin or a quantum (= 2x the margin) would give better control.
Of the three modes, only rounding is needed:
"_QuantizeBitGroomNumberOfSignificantDigits" has been shown to be sub-optimal in terms of precision, and generating artifacts in multi-point statistics https://doi.org/10.5194/gmd-14-377-2021. I could not find any example where BitGroom should be preferred to rounding. Please let me know if you have any.
"_QuantizeGranularBitRoundNumberOfSignificantDigits" is sub-optimal unless one really wants to print the numbers as decimals. If one wants to constrain the error margin, rounding is the right method (see my slide 4 from the link above). With this method for N digits one gets an error margin and compression efficiency equivalent to those of rounding to N-1 digits (slide 5). Again, I would be happy to see any example disproving this statement.
"_QuantizeBitRoundNumberOfSignificantBits" Actually, explicit error margin would be more clear for those who are not fluent with the binary representation.
To summarize: giving the user explicit control over the error margins in terms of absolute or relative error would be more flexible and much less confusing. Please consider implementing it, and let me know if I can be of any help.