Inverse quantile algorithm is non-contiguous #72

ElectronicRU · 2016-09-19T15:29:09Z

Not only it gives strange results at the end (and indeed, python implementation fixes that; sorry, not fluent at Java), but also the ranges that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
  if q \leq t + c_i.count
    if i = 1 : return c_i.mean
    if i = m : return c_i.mean
    low <- (c_{i-1}.mean + c_i.mean)/2
    high <- (c_{i+1}.mean + c_i.mean)/2
    return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for $q = 0, 1$.

The text was updated successfully, but these errors were encountered:

tdunning · 2016-09-20T11:38:36Z

I think that this algorithm doesn't account for situations with small
counts well. Whenever a centroid has a single sample, we know that the
quantile increments discontinuously exactly at the mean value for the
centroid.

Also, there are two approaches for thinking about quantiles in a q-digest
framework. In one approach (yours), the thought is that the samples for a
centroid are uniformly spaced from some minimum value to a maximum value.
The question becomes how we decide what the minimum and maximum are. One
way to do this is to assume that the extent of a centroid on one side is
proportional to the number of samples in that centroid, scaled by the total
number of samples in that centroid and the next centroid on that side. Note
that the centroid will no longer necessarily be in the center of this
distribution. I think it is also good practice to account for a small gap
between centroids which also coincidentally deals with the fact that n
samples should properly be considered to be n-1 units wide.

The second approach would be to consider the samples to be uniform between
centroids. This means that we know what the upper and lower bounds are. We
can then assume that there are (n_left -1)/2 + (n_right - 1)/2 samples plus
a gap.

Which approach is better is not clear to me.

On Mon, Sep 19, 2016 at 5:29 PM, Alexander Sedov notifications@github.com
wrote:

Not only it gives strange results at the end (and indeed, python
implementation fixes that; sorry, not fluent at Java), but also the ranges
that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
if q \leq t + c_i.count
if i = 1 : return c_i.mean
if i = m : return c_i.mean
low <- (c_{i-1}.mean + c_i.mean)/2
high <- (c_{i+1}.mean + c_i.mean)/2
return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for
$q = 0, 1$.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#72, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAPSeiXRTQn94Wu_w-a1uSu1hF-0xGPMks5qrqpGgaJpZM4KAn8h
.

ElectronicRU · 2016-09-20T11:56:58Z

That’s all fine and dandy, but your formulae don’t reflect your approach of it being uniform between centroids, as you still center it on the centroid.
And while I can understand it being non-contiguous, it being non-monotonic (which it currently is) is completely beyond me.

20 сент. 2016 г., в 14:38, Ted Dunning notifications@github.com написал(а):

I think that this algorithm doesn't account for situations with small
counts well. Whenever a centroid has a single sample, we know that the
quantile increments discontinuously exactly at the mean value for the
centroid.

Also, there are two approaches for thinking about quantiles in a q-digest
framework. In one approach (yours), the thought is that the samples for a
centroid are uniformly spaced from some minimum value to a maximum value.
The question becomes how we decide what the minimum and maximum are. One
way to do this is to assume that the extent of a centroid on one side is
proportional to the number of samples in that centroid, scaled by the total
number of samples in that centroid and the next centroid on that side. Note
that the centroid will no longer necessarily be in the center of this
distribution. I think it is also good practice to account for a small gap
between centroids which also coincidentally deals with the fact that n
samples should properly be considered to be n-1 units wide.

The second approach would be to consider the samples to be uniform between
centroids. This means that we know what the upper and lower bounds are. We
can then assume that there are (n_left -1)/2 + (n_right - 1)/2 samples plus
a gap.

Which approach is better is not clear to me.

On Mon, Sep 19, 2016 at 5:29 PM, Alexander Sedov notifications@github.com
wrote:

Not only it gives strange results at the end (and indeed, python
implementation fixes that; sorry, not fluent at Java), but also the ranges
that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
if q \leq t + c_i.count
if i = 1 : return c_i.mean
if i = m : return c_i.mean
low <- (c_{i-1}.mean + c_i.mean)/2
high <- (c_{i+1}.mean + c_i.mean)/2
return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for
$q = 0, 1$.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#72, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAPSeiXRTQn94Wu_w-a1uSu1hF-0xGPMks5qrqpGgaJpZM4KAn8h
.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub #72 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AAW_XUeZAIo8lMWtoh3tRg0jC0Jq3OKSks5qr8W_gaJpZM4KAn8h.

tdunning · 2016-09-20T12:34:48Z

On Tue, Sep 20, 2016 at 1:56 PM, Alexander Sedov notifications@github.com
wrote:

it being non-monotonic (which it currently is) is completely beyond me.

non-monotonic is not the intent.

Need to check that.

tdunning · 2017-04-20T06:35:41Z

Alex,

The latest implementation is considerable better behaved. I am working towards a release soon and will include a test for your pathology.

The new quantile/cdf algorithm uses the following diagram as an intuitive basis:

The fundamental idea here is that the solid red line represents our desired result. At the extremes, we interpolate between the first (or last) centroid and the recorded min and max values ever seen. It is assumed that each centroid is collocated with the median of the original data for the centroid and that the data is uniformly distributed between the centroids.

By definition, this form should give monotonic functions for both x -> quantile and quantile -> x.

tdunning closed this as completed Apr 20, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inverse quantile algorithm is non-contiguous #72

Inverse quantile algorithm is non-contiguous #72

ElectronicRU commented Sep 19, 2016

tdunning commented Sep 20, 2016

ElectronicRU commented Sep 20, 2016

tdunning commented Sep 20, 2016

tdunning commented Apr 20, 2017

Inverse quantile algorithm is non-contiguous #72

Inverse quantile algorithm is non-contiguous #72

Comments

ElectronicRU commented Sep 19, 2016

tdunning commented Sep 20, 2016

ElectronicRU commented Sep 20, 2016

tdunning commented Sep 20, 2016

tdunning commented Apr 20, 2017