
Meaning of the number of significant digits #2389

Closed
rkouznetsov opened this issue Jun 10, 2022 · 4 comments

@rkouznetsov
Contributor

rkouznetsov commented Jun 10, 2022

I believe this is an important topic that should be openly discussed. Please let me know if there is a better place for it; it can be moved there.

@edhartnett in his comment #2369 (comment) wrote

All engineers and scientists understand what number of significant digits means. Many quantities in meteorology (and elsewhere) will be immediately understood in those terms. I trust scientists will use this feature responsibly and no great mathematical understanding is needed for them to do so.

I am afraid that this point of view is shared by many in this community. I see two issues here that have been causing problems and will continue to cause them.

  1. The assumption that all engineers and scientists understand what the number of significant digits means seems overoptimistic. Nine out of ten papers I get for review report pollution trends, model-measurement statistics, or other numbers with seven (sic!) decimals, while the numbers at best have uncertainties of several percent. Judging from the fraction of published peer-reviewed papers with such omissions, a substantial fraction of reviewers and editors also considers NSD issues to be of minor importance.
  2. NetCDF (NCO, etc.) uses NSD as a measure of the distortion introduced by lossy compression methods. It is straightforward to figure out how many figures one needs to report a value with a given uncertainty in decimal form (see e.g. https://en.wikipedia.org/wiki/Significant_figures). The translation from NSD to a relative error margin, i.e. the maximum error introduced by trimming the precision, is by no means unique or obvious. For that reason the netcdf-c developers had to introduce algorithm-specific kinds of NSD. The translation from NSD to an error margin varies among the algorithms by up to an order of magnitude, and led to a breaking change on a minor-version update of NCO: a ~10x larger error margin for the same spell after nco/nco#256 (a rough sketch below illustrates the spread).
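
To make that spread concrete, here is a rough sketch. The NSD-to-keepbits mappings in it are illustrative assumptions only, not the actual formulas used by netcdf-c or NCO (those should be checked in their sources); the point is that a few-bit difference for the same nominal NSD already changes the margin by nearly an order of magnitude:

```python
import math

def worst_case_rel_error(keepbits: int) -> float:
    """Worst-case relative error of round-to-nearest when only
    `keepbits` explicit mantissa bits of an IEEE-754 float are kept."""
    return 2.0 ** -(keepbits + 1)

nsd = 3
bits_per_digit = math.log2(10)       # ~3.32 bits per decimal digit
# Hypothetical NSD -> keepbits mappings, differing by a few "safety" bits:
for extra in (0, 1, 3):
    keepbits = math.ceil(nsd * bits_per_digit) + extra
    print(f"keepbits={keepbits:2d}  max rel. error={worst_case_rel_error(keepbits):.1e}")
# A spread of 3 keepbits changes the guaranteed margin by 2**3 = 8x,
# i.e. close to an order of magnitude, for the same nominal NSD.
```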

I have arranged a small poll among qualified researchers around me who work with data and have quite impressive scientific merits and publication records. The question was:

Suppose you have got a "perfect" (fair, precise, etc.) binary dataset, and its metadata says "NumberOfSignificantDecimals=2". What margin for the relative error of a single value in the dataset (in percent) would you assume?

So far I have got 11 replies:

  • 6 answered "1%"
  • 3 answered "0.5%"
  • 1 answered "5%"
  • 1 answered "Unknown, can be whatever"

The one who answered 5% was a person with whom I had previously discussed at length the distortions originating from GranularBitRound, recently introduced in NCO.
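
For what it is worth, all three numeric answers can be defended, depending on how one reads NSD. A tiny check, assuming plain decimal rounding to two significant digits (my illustration, not a statement about any particular quantization algorithm):

```python
from math import floor, log10

def round_sig(x: float, nsd: int) -> float:
    """Round x to nsd significant decimal digits."""
    return round(x, nsd - 1 - floor(log10(abs(x))))

for x in (1.049, 5.049, 9.949):
    r = round_sig(x, 2)
    print(f"{x} -> {r}   rel. error = {abs(r - x) / abs(x):.2%}")
# ~4.7% just above a power of ten (the true worst case, close to "5%"),
# ~1%  for mid-range values (the naive 10**-NSD reading),
# ~0.5% near 9.x (the optimistic reading).
```

The spread in the poll mirrors exactly this ambiguity, before any binary rounding enters the picture.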

I believe it is way too much to demand that every scientist or engineer get into all the details of precision trimming of IEEE 754 numbers, the various ways to interpret NSD, and all the terminology around it. Therefore I would propose a method- and system-agnostic (binary, decimal, etc.) means to convey the magnitude of the distortion introduced by a precision-trimming procedure.

The variable attributes storage_abs_error_margin (in the units of the variable) and storage_rel_error_margin (a dimensionless fraction) could serve the purpose. They should be clearly distinct from the actual error margins, which can be much larger. To avoid ambiguity and round-off errors, the rounding algorithm itself can be fed with two integers: the number of keep-bits and the binary logarithm of the value of the least-significant bit kept. Eight bits should be sufficient for each of them. (@edhartnett, I hope this answers your question.)
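
A minimal sketch of what I have in mind, in Python/NumPy. The bit-rounding helper and the attribute names are illustrative assumptions only; a real implementation would live in the quantize code of netcdf-c/NCO:

```python
import numpy as np

def bit_round(x, keepbits):
    """Keep `keepbits` explicit mantissa bits of float32 values,
    rounding to nearest (illustrative sketch; 0 < keepbits < 23)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    drop = 23 - keepbits                          # float32: 23 explicit mantissa bits
    half = np.uint32(1 << (drop - 1))
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return ((bits + half) & mask).view(np.float32)

keepbits = 7
data = np.random.default_rng(0).uniform(0.1, 10.0, 1000).astype(np.float32)
trimmed = bit_round(data, keepbits)

# The proposed attribute: a guaranteed bound, distinct from the actual error.
storage_rel_error_margin = 2.0 ** -(keepbits + 1)
print("storage_rel_error_margin :", storage_rel_error_margin)
print("observed max rel. error  :", float(np.max(np.abs(trimmed - data) / np.abs(data))))
# For absolute trimming one would instead fix the binary logarithm of the
# least-significant bit kept, lsb_log2, and report
# storage_abs_error_margin = 2.0 ** (lsb_log2 - 1) in the variable's units.
```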

I would be happy to hear other opinions on the subject and on the best ways to implement it. Thank you to those who got to this point.

@edwardhartnett
Contributor

So your proposal is that we add these attributes when quantization is used?

How would we calculate them?

@rkouznetsov
Contributor Author

Adding attributes and calculating them is not a big deal. Formulating them is a bigger challenge.

My proposal for now is to think and share opinions, views, and concerns. Then we might come to some balanced solution for conveying the uncertainty that is understandable, unambiguous, and not too destructive for users. At the moment we have a bunch of terms and concepts whose meanings are too vague or even misleading to be of any use in communication. For instance, the meaning of NSD in NetCDF is different for every specific method, and very different from what people think it is. I bet in a year there will probably be only a couple of people in the world capable of interpreting the corresponding attributes properly without googling through a bunch of inconsistent documents and tons of even more diverse opinions in forums. So, I guess, the term NSD has already been irreversibly spoiled.

Besides that, there are quite a few other misleading terms around: "precision-preserving compression" (also my fault), which preserves precision in exactly the same sense in which shopping preserves money: one trades some precision for something one considers more valuable; a "statistically accurate method" that introduces unlimited errors in two-point statistics; etc. So some tedious work is ahead to clean up the mess.

The correspondence between the margins and the way precision is trimmed is specified in my slides at EGU2022 (slide 3) and in my GMD paper (2021). I prefer to start from the error margins (defined by the data and applications); then one can absolutely unequivocally select the best method, number of bits, etc.
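
As an illustration of that direction, here is a sketch that derives the number of keep-bits from a target relative margin, assuming plain round-to-nearest bit rounding with a worst-case relative error of 2**-(keepbits+1) (as in the sketch above); the exact correspondence for a particular method is in the paper and slides:

```python
import math

def keepbits_for_rel_margin(rel_margin: float) -> int:
    """Smallest number of explicit mantissa bits whose worst-case
    round-to-nearest relative error, 2**-(keepbits+1), stays within rel_margin."""
    return max(0, math.ceil(-math.log2(rel_margin) - 1))

for margin in (0.05, 0.01, 0.005, 1e-4):
    k = keepbits_for_rel_margin(margin)
    print(f"margin={margin:g}  keepbits={k}  guaranteed error <= {2.0 ** -(k + 1):.1e}")
```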

@edwardhartnett
Contributor

Ok I think a good starting point is going to be to move the discussion of error from filters to quantize.

Perhaps at the upcoming CF meeting a consensus will be hammered out for how to best express this information...

@WardF
Member

WardF commented Jun 15, 2022

I'm going to convert this over to a discussion, as that feels more appropriate for the (anticipated) long-form discussion we'll be having around this. Thanks!

@Unidata Unidata locked and limited conversation to collaborators Jun 15, 2022
@WardF WardF converted this issue into discussion #2406 Jun 15, 2022
