Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inverse quantile algorithm is non-contiguous #72

Closed
ElectronicRU opened this issue Sep 19, 2016 · 4 comments
Closed

Inverse quantile algorithm is non-contiguous #72

ElectronicRU opened this issue Sep 19, 2016 · 4 comments

Comments

@ElectronicRU
Copy link

Not only it gives strange results at the end (and indeed, python implementation fixes that; sorry, not fluent at Java), but also the ranges that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
  if q \leq t + c_i.count
    if i = 1 : return c_i.mean
    if i = m : return c_i.mean
    low <- (c_{i-1}.mean + c_i.mean)/2
    high <- (c_{i+1}.mean + c_i.mean)/2
    return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for $q = 0, 1$.

@tdunning
Copy link
Owner

I think that this algorithm doesn't account for situations with small
counts well. Whenever a centroid has a single sample, we know that the
quantile increments discontinuously exactly at the mean value for the
centroid.

Also, there are two approaches for thinking about quantiles in a q-digest
framework. In one approach (yours), the thought is that the samples for a
centroid are uniformly spaced from some minimum value to a maximum value.
The question becomes how we decide what the minimum and maximum are. One
way to do this is to assume that the extent of a centroid on one side is
proportional to the number of samples in that centroid, scaled by the total
number of samples in that centroid and the next centroid on that side. Note
that the centroid will no longer necessarily be in the center of this
distribution. I think it is also good practice to account for a small gap
between centroids which also coincidentally deals with the fact that n
samples should properly be considered to be n-1 units wide.

The second approach would be to consider the samples to be uniform between
centroids. This means that we know what the upper and lower bounds are. We
can then assume that there are (n_left -1)/2 + (n_right - 1)/2 samples plus
a gap.

Which approach is better is not clear to me.

On Mon, Sep 19, 2016 at 5:29 PM, Alexander Sedov notifications@github.com
wrote:

Not only it gives strange results at the end (and indeed, python
implementation fixes that; sorry, not fluent at Java), but also the ranges
that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
if q \leq t + c_i.count
if i = 1 : return c_i.mean
if i = m : return c_i.mean
low <- (c_{i-1}.mean + c_i.mean)/2
high <- (c_{i+1}.mean + c_i.mean)/2
return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for
$q = 0, 1$.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#72, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAPSeiXRTQn94Wu_w-a1uSu1hF-0xGPMks5qrqpGgaJpZM4KAn8h
.

@ElectronicRU
Copy link
Author

That’s all fine and dandy, but your formulae don’t reflect your approach of it being uniform between centroids, as you still center it on the centroid.
And while I can understand it being non-contiguous, it being non-monotonic (which it currently is) is completely beyond me.

20 сент. 2016 г., в 14:38, Ted Dunning notifications@github.com написал(а):

I think that this algorithm doesn't account for situations with small
counts well. Whenever a centroid has a single sample, we know that the
quantile increments discontinuously exactly at the mean value for the
centroid.

Also, there are two approaches for thinking about quantiles in a q-digest
framework. In one approach (yours), the thought is that the samples for a
centroid are uniformly spaced from some minimum value to a maximum value.
The question becomes how we decide what the minimum and maximum are. One
way to do this is to assume that the extent of a centroid on one side is
proportional to the number of samples in that centroid, scaled by the total
number of samples in that centroid and the next centroid on that side. Note
that the centroid will no longer necessarily be in the center of this
distribution. I think it is also good practice to account for a small gap
between centroids which also coincidentally deals with the fact that n
samples should properly be considered to be n-1 units wide.

The second approach would be to consider the samples to be uniform between
centroids. This means that we know what the upper and lower bounds are. We
can then assume that there are (n_left -1)/2 + (n_right - 1)/2 samples plus
a gap.

Which approach is better is not clear to me.

On Mon, Sep 19, 2016 at 5:29 PM, Alexander Sedov notifications@github.com
wrote:

Not only it gives strange results at the end (and indeed, python
implementation fixes that; sorry, not fluent at Java), but also the ranges
that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
if q \leq t + c_i.count
if i = 1 : return c_i.mean
if i = m : return c_i.mean
low <- (c_{i-1}.mean + c_i.mean)/2
high <- (c_{i+1}.mean + c_i.mean)/2
return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for
$q = 0, 1$.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#72, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAPSeiXRTQn94Wu_w-a1uSu1hF-0xGPMks5qrqpGgaJpZM4KAn8h
.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub #72 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AAW_XUeZAIo8lMWtoh3tRg0jC0Jq3OKSks5qr8W_gaJpZM4KAn8h.

@tdunning
Copy link
Owner

On Tue, Sep 20, 2016 at 1:56 PM, Alexander Sedov notifications@github.com
wrote:

it being non-monotonic (which it currently is) is completely beyond me.

non-monotonic is not the intent.

Need to check that.

@tdunning
Copy link
Owner

Alex,

The latest implementation is considerable better behaved. I am working towards a release soon and will include a test for your pathology.

The new quantile/cdf algorithm uses the following diagram as an intuitive basis:

image

The fundamental idea here is that the solid red line represents our desired result. At the extremes, we interpolate between the first (or last) centroid and the recorded min and max values ever seen. It is assumed that each centroid is collocated with the median of the original data for the centroid and that the data is uniformly distributed between the centroids.

By definition, this form should give monotonic functions for both x -> quantile and quantile -> x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants