Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r.quantile/r.stats.quantile/libstats: fix quantile algorithm #2108

Merged
merged 4 commits into from
Jan 21, 2022

Conversation

metzm
Copy link
Contributor

@metzm metzm commented Jan 19, 2022

Hyndman and Fan (1996) (https://doi.org/10.2307/2684934) list 9 different algorithms to the corresponding rank of a sorted list for a given quantile. The algorithm used in GRASS is not listed. Therefore I decided to use the algorithm type 7 also used by R and numpy. In R, see ?quantile for a description of the different algorithms.

There was an independent bug in r.quantile and r.stats.quantile: if the corresponding rank belonged to the last entry of a slot, the result was wrong because the first value of the next slot needs to be used to calculate the correct value for the given quantile.

@metzm metzm added bug Something isn't working backport_needed C Related code is in C labels Jan 19, 2022
@metzm metzm added this to the 8.0.1 milestone Jan 19, 2022
@metzm metzm requested review from marisn and veroandreo January 19, 2022 17:35
@veroandreo
Copy link
Contributor

I tested using the example in r.quantile manual:

# grass8.1dev
r.quantile input=elevation percentiles=0.1,1,10,25,50,75,90,99,99.9
Computing quantiles
0:0.100000:59.870384
1:1.000000:68.065155
2:10.000000:84.523346
3:25.000000:94.789986
4:50.000000:108.879921
5:75.000000:126.791985
6:90.000000:138.660385
7:99.000000:149.824585
8:99.900000:153.375854

# mm-branch
Computing quantiles
0:0.100000:59.870381
1:1.000000:68.065153
2:10.000000:84.523341
3:25.000000:94.789986
4:50.000000:108.879906
5:75.000000:126.791973
6:90.000000:138.660303
7:99.000000:149.824419
8:99.900000:153.375565

Changes appear in the 5th and 6th decimal positions. Is that ok, @metzm ?

After exporting the raster (as we still don't have rgrass for grass8) and testing in R, I get:

quantile(elev, probs=c(0.1,1,10,25,50,75,90,99,99.9)/100, type=7)
     0.1%        1%       10%       25%       50%       75%       90%       99% 
 59.87038  68.06515  84.52334  94.78999 108.87991 126.79197 138.66030 149.82442 
    99.9% 
153.37556 

but it's rounded to the 5th decimal by default. In any case, I'm in favor of consistency with other software packages, so +1 for this change.

@metzm
Copy link
Contributor Author

metzm commented Jan 20, 2022

Yes, changes are expected, particularly for low and high quantiles, and particularly for small samples. Importantly, the 9 different algorithms listed in literature and software implementations produce different results in these cases. The 50% percentile should be identical, however. That was a real bug in lib/stats/c_percentile.c.

 * adjust test for r.neighbors
 * sync r.stats quantile to r.quantile
@metzm
Copy link
Contributor Author

metzm commented Jan 20, 2022

A simple example with a list of 10 sorted values and their indices (zero-based ranks):

value:  1  2  3  4  5  6  7  8  9  10
index:  0  1  2  3  4  5  6  7  8   9

The 50% percentile would produce a split between the 5 lowest values 1, 2, 3, 4, 5 and the 5 highest values 6, 7, 8, 9, 10, the correct result is 5 * 0.5 + 6 * 0.5 = 5.5.
The previous code with n * quant, thus the index 10 * 0.5 = 5 would return the value 6 for the 50% percentile.
The new code with quant * (n - 1) correctly uses the index 4.5 to return the value 5.5 for the 50% percentile.

@neteler
Copy link
Member

neteler commented Jan 20, 2022

One test is failing, fixed with attached diff (I don't know how to commit against your PR?):
test_r_neighbors.zip

Running ./raster/r.neighbors/testsuite/test_r_neighbors.py...
========================================================================
...F...
======================================================================
FAIL: test_standard_options_quantile (__main__.TestNeighbors)
Test output with standard options (size=3, quadratic moving window,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "raster/r.neighbors/testsuite/test_r_neighbors.py", line 480, in test_standard_options_quantile
    self.check_univar(test_case)
  File "raster/r.neighbors/testsuite/test_r_neighbors.py", line 429, in check_univar
    precision=1e-5,
  File "etc/python/grass/gunittest/case.py", line 308, in assertRasterFitsUnivar
    precision=precision,
  File "etc/python/grass/gunittest/case.py", line 282, in assertModuleKeyValue
    self.fail(self._formatMessage(msg, stdMsg))
AssertionError: r.univar map=test_standard_options_quantile_raster_0.03 percentile=90.0 separator== -g difference:
mismatch values (key, reference, actual): [('coeff_var', 18.5370455239192, 18.5383334276472), ('max', 156.208636016846, 156.20787109375), ('mean', 109.552850010781, 109.544974331288), ('mean_of_abs', 109.552850010781, 109.544974331288), ('min', 55.600062789917, 55.5965177536011), ('range', 100.608573226929, 100.611353340149), ('stddev', 20.3078616792495, 20.3078125947647), ('sum', 221844521.271832, 221828573.020859), ('variance', 412.409245983529, 412.407252384084)]
command: r.univar map=test_standard_options_quantile_raster_0.03 percentile=90.0 separator== -g {'map': 'test_standard_options_quantile_raster_0.03', 'separator': '=', 'flags': 'g'}

----------------------------------------------------------------------
Ran 7 tests in 24.097s
FAILED (failures=1)
========================================================================
FAILED ./raster/r.neighbors/testsuite/test_r_neighbors.py (1 test failed)

@metzm
Copy link
Contributor Author

metzm commented Jan 20, 2022

I am busy updating the tests for r.neighbors, WIP.

 * update test results
@metzm metzm merged commit c700a8a into OSGeo:main Jan 21, 2022
@metzm metzm deleted the quantiles branch January 21, 2022 18:44
@metzm
Copy link
Contributor Author

metzm commented Jan 21, 2022

TODO: backport c700a8a to G 8.0.1

neteler pushed a commit that referenced this pull request Jan 29, 2022
* use type 7 algorithm of Hyndman and Fan (1996) for quantiles, as is the default in R and numpy 
 * update manuals for `r.quantile`, `r.stats.quantile`
 * sync `r.stats quantile` to `r.quantile`
 * update test results for `r.neighbors`
@neteler
Copy link
Member

neteler commented Jan 29, 2022

TODO: backport c700a8a to G 8.0.1

Backport done.

@ecodiv
Copy link
Contributor

ecodiv commented Apr 24, 2022

@metzm , did this perhaps cause the bug #2259 ?

ninsbl pushed a commit to ninsbl/grass that referenced this pull request Oct 26, 2022
…o#2108)

* use type 7 algorithm of Hyndman and Fan (1996) for quantiles, as is the default in R and numpy 
 * update manuals for `r.quantile`, `r.stats.quantile`
 * sync `r.stats quantile` to `r.quantile`
 * update test results for `r.neighbors`
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Feb 17, 2023
…o#2108)

* use type 7 algorithm of Hyndman and Fan (1996) for quantiles, as is the default in R and numpy 
 * update manuals for `r.quantile`, `r.stats.quantile`
 * sync `r.stats quantile` to `r.quantile`
 * update test results for `r.neighbors`
neteler pushed a commit to nilason/grass that referenced this pull request Nov 7, 2023
…o#2108)

* use type 7 algorithm of Hyndman and Fan (1996) for quantiles, as is the default in R and numpy 
 * update manuals for `r.quantile`, `r.stats.quantile`
 * sync `r.stats quantile` to `r.quantile`
 * update test results for `r.neighbors`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working C Related code is in C
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants