Percentile/quantile estimation types #1121

Jolanrensen · 2025-04-09T12:39:25Z

Helps #961
Follow up for #543 and #1060

Our current percentile implementation is based on the Nearest Rank with rounding down for non-numbers and a form of linear interpolation for number types.
However, this implementation makes some bold assumptions that can influence the expected outcome, especially for smaller data sizes.

Small background into the topic:
If I mention quantiles; they are the generic form of cutting up a range into continuous intervals with equal probabilities (sub-range size). A percentile is thus a 100-quantile; a subdivision of a range into a hundred pieces of equal size. A quartile is like a 4-quantile. A median can thus be seen as 2-quantile or the 2nd quartile, or 50th percentile, etc.

There are, however, 9 commonly used algorithms to calculate the i-th percentile/quartile/etc. They were collected by Hyndman, Rob & Fan, Yanan. (1996). Sample Quantiles in Statistical Packages. The American Statistician. 50. 361-365. 10.1080/00031305.1996.10473566. and used by almost all libraries/tools that can calculate quantiles, like R, Numpy, SciPy, Apache commons-math (legacy), Apache commons-statistics, Wolfram Mathematica, Matlab

Adapted from Wikipedia:

All 9 methods compute Q_p, the (estimate for) the p-quantile.
(When talking about the k-th q-quantile, you get "p = k/q". So for the median, this can be p = 50 / 100 = 2, aka, the 2-quantile)

This is computed from a sample of size N by computing a real valued index h. When h is a whole number, the h-th smallest of the N values, x_h, is the quantile estimate. Otherwise a rounding or interpolation scheme is used to compute the quantile estimate from h, x_⌊h⌋, and x_⌈h⌉, so by rounding down or up from h, respectively.

Type	h	Q_p	Notes
R‑1, SAS‑3, Maple‑1	Np	x_⌈h⌉	Inverse of empirical distribution function.
R‑2, SAS‑5, Maple‑2, Stata	Np + 1/2	(x_{⌈h – 1/2⌉} + x_{⌊h + 1/2⌋}) / 2	The same as R-1, but with averaging at discontinuities.
R‑3, SAS‑2	Np − 1/2	x_⌊h⌉	The observation numbered closest to Np. Here, ⌊h⌉ indicates rounding to the nearest integer, choosing the even integer in the case of a tie.
R‑4, SAS‑1, SciPy‑(0,1), Julia‑(0,1), Maple‑3	Np	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the inverse of the empirical distribution function.
R‑5, SciPy‑(1/2,1/2), Julia‑(1/2,1/2), Maple‑4	Np + 1/2	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Piecewise linear function where the knots are the values midway through the steps of the empirical distribution function.
R‑6, Excel, Python, SAS‑4, SciPy‑(0,0), Julia-(0,0), Maple‑5, Stata‑altdef	(N + 1)p	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the expectations for the order statistics for the uniform distribution on [0,1]. That is, it is the linear interpolation between points (p_h, x_h), where p_h = h/(N+1) is the probability that the last of (N+1) randomly drawn values will not exceed the h-th smallest of the first N randomly drawn values.
R‑7, Excel, Python, SciPy‑(1,1), Julia-(1,1), Maple‑6, NumPy	(N − 1)p + 1	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1].
R‑8, SciPy‑(1/3,1/3), Julia‑(1/3,1/3), Maple‑7	(N + 1/3)p + 1/3	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the approximate medians for order statistics.
R‑9, SciPy‑(3/8,3/8), Julia‑(3/8,3/8), Maple‑8	(N + 1/4)p + 3/8	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed.

R‑1 through R‑3 are piecewise constant, with discontinuities. (useful for self-comparables)
R‑4 and following are piecewise linear, without discontinuities, but differ in how h is computed. (useful for primitive numbers)
R‑3 and R‑4 are not symmetric in that they do not give h = (N + 1) / 2 when p = 1/2.

It might also make sense to create a quantile function as main entry-point and let percentile, median, quartile, (decile?), call into it with the right value for q in q-quantile.

Now, of course it might take some time to fully implement this. So, until we fully implement this, we should at least be ready for it and take into account that there are multiple options. This could be done by, say, sticking to R-3 and R-7 dependent on the data type and later offer users more choice. These choices should be mentioned in the functions median and percentile.

The text was updated successfully, but these errors were encountered:

Jolanrensen · 2025-04-26T10:50:21Z

Partly used by #1149

Jolanrensen added the enhancement New feature or request label Apr 9, 2025

Jolanrensen added this to the Backlog milestone Apr 9, 2025

Jolanrensen mentioned this issue Apr 9, 2025

☂ Statistics streamlining #961

Open

9 tasks

Jolanrensen added the research This requires a deeper dive to gather a better understanding label Apr 9, 2025

Jolanrensen mentioned this issue Apr 26, 2025

Percentile #1149

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Percentile/quantile estimation types #1121

Percentile/quantile estimation types #1121

Jolanrensen commented Apr 9, 2025

Jolanrensen commented Apr 26, 2025

Uh oh!

Percentile/quantile estimation types #1121

Percentile/quantile estimation types #1121

Comments

Jolanrensen commented Apr 9, 2025

Jolanrensen commented Apr 26, 2025

Uh oh!