Percentile/quantile estimation types #1121
Labels: enhancement (New feature or request), research (This requires a deeper dive to gather a better understanding)
Helps #961
Follow up for #543 and #1060
Our current percentile implementation is based on the Nearest Rank with rounding down for non-numbers and a form of linear interpolation for number types.
However, this implementation makes some bold assumptions that can influence the expected outcome, especially for smaller data sizes.
Small background into the topic:
Quantiles are the generic form of cutting up a range into continuous intervals with equal probabilities (sub-range size). A percentile is thus a 100-quantile: a subdivision of a range into a hundred pieces of equal size. A quartile is a 4-quantile. The median can thus be seen as the 2-quantile, the 2nd quartile, the 50th percentile, etc.
There are, however, 9 commonly used algorithms to calculate the i-th percentile, quartile, etc. They were collected in Hyndman, Rob & Fan, Yanan. (1996). Sample Quantiles in Statistical Packages. The American Statistician. 50. 361-365. 10.1080/00031305.1996.10473566, and are used by almost all libraries and tools that can calculate quantiles, like R, NumPy, SciPy, Apache commons-math (legacy), Apache commons-statistics, Wolfram Mathematica, and Matlab.
Adapted from Wikipedia:
All 9 methods compute Q_p, the estimate for the p-quantile.
(When talking about the k-th q-quantile, you get p = k/q. So for the median, seen as the 50th percentile, this is p = 50/100 = 0.5, which is the same value as for the 1st 2-quantile, p = 1/2.)
Q_p is computed from a sample of size N by computing a real-valued index h. When h is a whole number, the h-th smallest of the N values, x_h, is the quantile estimate. Otherwise, a rounding or interpolation scheme is used to compute the quantile estimate from h, x_⌊h⌋, and x_⌈h⌉, the values found by rounding h down or up, respectively.
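To make the role of the index h concrete, here is a minimal Python sketch of two of the nine methods, following the definitions in the Wikipedia table: R-1 (h = N·p, take x_⌈h⌉) and R-7 (h = (N − 1)·p + 1, interpolate linearly between x_⌊h⌋ and x_⌈h⌉). The function names `quantile_r1` and `quantile_r7` are made up for illustration and are not part of any existing API; this is not the current implementation.

```python
import math


def quantile_r1(values, p):
    """Hyndman & Fan type 1 (inverse ECDF / nearest rank): h = N*p, result x_ceil(h)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    n = len(xs)
    h = n * p
    # Clamp the 1-based rank into [1, N] so that p = 0 yields the minimum.
    k = min(max(math.ceil(h), 1), n)
    return xs[k - 1]


def quantile_r7(values, p):
    """Hyndman & Fan type 7 (linear interpolation): h = (N-1)*p + 1."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    n = len(xs)
    h = (n - 1) * p + 1              # real-valued, 1-based index
    lo, hi = math.floor(h), math.ceil(h)
    x_lo, x_hi = xs[lo - 1], xs[hi - 1]
    # When h is a whole number, lo == hi and the result is simply x_h.
    return x_lo + (h - lo) * (x_hi - x_lo)


sample = [1, 2, 3, 4]
print(quantile_r1(sample, 0.5))  # nearest rank picks an existing value
print(quantile_r7(sample, 0.5))  # interpolation lands between two values
```

For a sample as small as `[1, 2, 3, 4]`, the two methods already disagree on the median, which is the kind of difference the choice of method introduces for small data sizes.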
It might also make sense to create a `quantile` function as the main entry point and let percentile, median, quartile, (decile?) call into it with the right value for q in q-quantile.

Of course, it might take some time to fully implement this. Until then, we should at least be ready for it and take into account that there are multiple options. This could be done by, say, sticking to R-3 and R-7 depending on the data type, and later offering users more choice. These choices should be mentioned in the documentation of the functions `median` and `percentile`.
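A rough Python sketch of what such a single entry point could look like, purely as an illustration of the design. All names here (`QuantileMethod`, `quantile`, `percentile`, `quartile`, `median`) are hypothetical, and only R-1 and R-7 are filled in to keep the example short; R-3 and the other methods would become further branches of the same function.

```python
import math
from enum import Enum


class QuantileMethod(Enum):
    """Hypothetical selector; only two of the nine methods are sketched."""
    R1 = "inverse ECDF / nearest rank"
    R7 = "linear interpolation"


def quantile(values, p, method=QuantileMethod.R7):
    """Single entry point: estimate the p-quantile, with p in [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    n = len(xs)
    if method is QuantileMethod.R1:
        k = min(max(math.ceil(n * p), 1), n)   # h = N*p, take x_ceil(h)
        return xs[k - 1]
    if method is QuantileMethod.R7:
        h = (n - 1) * p + 1                    # real-valued, 1-based index
        lo, hi = math.floor(h), math.ceil(h)
        return xs[lo - 1] + (h - lo) * (xs[hi - 1] - xs[lo - 1])
    raise ValueError(f"unsupported method: {method}")


def percentile(values, k, method=QuantileMethod.R7):
    """k-th 100-quantile, so p = k / 100."""
    return quantile(values, k / 100, method)


def quartile(values, k, method=QuantileMethod.R7):
    """k-th 4-quantile, so p = k / 4."""
    return quantile(values, k / 4, method)


def median(values, method=QuantileMethod.R7):
    """The 1st 2-quantile, so p = 1 / 2."""
    return quantile(values, 0.5, method)
```

With this shape, the wrappers only translate k and q into p = k/q, and the choice of estimation method becomes a parameter that can start with a fixed default and be exposed to users later.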