Skip to content

Percentile/quantile estimation types #1121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Jolanrensen opened this issue Apr 9, 2025 · 1 comment
Open

Percentile/quantile estimation types #1121

Jolanrensen opened this issue Apr 9, 2025 · 1 comment
Labels
enhancement New feature or request research This requires a deeper dive to gather a better understanding
Milestone

Comments

@Jolanrensen
Copy link
Collaborator

Helps #961
Follow up for #543 and #1060

Our current percentile implementation is based on the Nearest Rank with rounding down for non-numbers and a form of linear interpolation for number types.
However, this implementation makes some bold assumptions that can influence the expected outcome, especially for smaller data sizes.

Small background into the topic:
If I mention quantiles; they are the generic form of cutting up a range into continuous intervals with equal probabilities (sub-range size). A percentile is thus a 100-quantile; a subdivision of a range into a hundred pieces of equal size. A quartile is like a 4-quantile. A median can thus be seen as 2-quantile or the 2nd quartile, or 50th percentile, etc.

There are, however, 9 commonly used algorithms to calculate the i-th percentile/quartile/etc. They were collected by Hyndman, Rob & Fan, Yanan. (1996). Sample Quantiles in Statistical Packages. The American Statistician. 50. 361-365. 10.1080/00031305.1996.10473566. and used by almost all libraries/tools that can calculate quantiles, like R, Numpy, SciPy, Apache commons-math (legacy), Apache commons-statistics, Wolfram Mathematica, Matlab

Adapted from Wikipedia:

All 9 methods compute Qp, the (estimate for) the p-quantile.
(When talking about the k-th q-quantile, you get "p = k/q". So for the median, this can be p = 50 / 100 = 2, aka, the 2-quantile)

This is computed from a sample of size N by computing a real valued index h. When h is a whole number, the h-th smallest of the N values, xh, is the quantile estimate. Otherwise a rounding or interpolation scheme is used to compute the quantile estimate from h, x⌊h⌋, and x⌈h⌉, so by rounding down or up from h, respectively.

Type h Qp Notes
R‑1, SAS‑3, Maple‑1 Np x⌈h⌉ Inverse of empirical distribution function.
R‑2, SAS‑5, Maple‑2, Stata Np + 1/2 (x⌈h – 1/2⌉ + x⌊h + 1/2⌋) / 2 The same as R-1, but with averaging at discontinuities.
R‑3, SAS‑2 Np − 1/2 x⌊h⌉ The observation numbered closest to Np. Here, ⌊h⌉ indicates rounding to the nearest integer, choosing the even integer in the case of a tie.
R‑4, SAS‑1, SciPy‑(0,1), Julia‑(0,1), Maple‑3 Np x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) Linear interpolation of the inverse of the empirical distribution function.
R‑5, SciPy‑(1/2,1/2), Julia‑(1/2,1/2), Maple‑4 Np + 1/2 x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) Piecewise linear function where the knots are the values midway through the steps of the empirical distribution function.
R‑6, Excel, Python, SAS‑4, SciPy‑(0,0), Julia-(0,0), Maple‑5, Stata‑altdef (N + 1)p x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) Linear interpolation of the expectations for the order statistics for the uniform distribution on [0,1]. That is, it is the linear interpolation between points (ph, xh), where ph = h/(N+1) is the probability that the last of (N+1) randomly drawn values will not exceed the h-th smallest of the first N randomly drawn values.
R‑7, Excel, Python, SciPy‑(1,1), Julia-(1,1), Maple‑6, NumPy (N − 1)p + 1 x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1].
R‑8, SciPy‑(1/3,1/3), Julia‑(1/3,1/3), Maple‑7 (N + 1/3)p + 1/3 x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) Linear interpolation of the approximate medians for order statistics.
R‑9, SciPy‑(3/8,3/8), Julia‑(3/8,3/8), Maple‑8 (N + 1/4)p + 3/8 x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed.
  • R‑1 through R‑3 are piecewise constant, with discontinuities. (useful for self-comparables)
  • R‑4 and following are piecewise linear, without discontinuities, but differ in how h is computed. (useful for primitive numbers)
  • R‑3 and R‑4 are not symmetric in that they do not give h = (N + 1) / 2 when p = 1/2.

It might also make sense to create a quantile function as main entry-point and let percentile, median, quartile, (decile?), call into it with the right value for q in q-quantile.

Now, of course it might take some time to fully implement this. So, until we fully implement this, we should at least be ready for it and take into account that there are multiple options. This could be done by, say, sticking to R-3 and R-7 dependent on the data type and later offer users more choice. These choices should be mentioned in the functions median and percentile.

@Jolanrensen Jolanrensen added the enhancement New feature or request label Apr 9, 2025
@Jolanrensen Jolanrensen added this to the Backlog milestone Apr 9, 2025
@Jolanrensen Jolanrensen added the research This requires a deeper dive to gather a better understanding label Apr 9, 2025
@Jolanrensen
Copy link
Collaborator Author

Partly used by #1149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request research This requires a deeper dive to gather a better understanding
Projects
None yet
Development

No branches or pull requests

1 participant