
Simple alternate algorithm using maxima, ranks and fixed cumulative weighting #71

Closed
hville opened this issue Sep 11, 2016 · 4 comments

hville commented Sep 11, 2016

Went down a rabbit hole and came out with this. It is a small part of a bigger project in JavaScript, but some of the ideas might be of use to others. Sorry, no Java.

In short, it flips the value-rank compression approach around. Instead of keeping approximated weighted-value centroids for given ranks, this algorithm uses actual values as-is but with approximated ranks (a small sketch follows the lists below).

Maximum value and rank instead of weighted-value centroids:

  • actual values are retained, which is better for discrete or repeated values
  • the cdf is directly implied by a given maximum and rank: F(x) = P(X <= x)
  • ranks are interpolated (i.e. values are exact, ranks are not)

Fixed size, simple array, no tree structure:

  • no compression cycles
  • fixed memory use
  • weighting function calculated only once (fixed target cumulative probability)
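
Roughly, in simplified JavaScript (the names maxs, rnks and count are illustrative, not taken from the actual implementation):

    // Simplified sketch: two fixed-size parallel arrays, no tree.
    // maxs holds actual sample values (bucket maxima), rnks their
    // approximated ranks, count the total number of samples seen.
    function createDigest(size) {
      return {
        maxs: new Float64Array(size),
        rnks: new Float64Array(size),
        count: 0
      }
    }

    // F(x) = P(X <= x): bracket x between two stored maxima and
    // linearly interpolate the rank, then normalize by the count.
    function cdf(digest, x) {
      const { maxs, rnks, count } = digest
      if (count === 0) return NaN
      if (x < maxs[0]) return 0
      const last = maxs.length - 1
      if (x >= maxs[last]) return 1
      let i = 1
      while (maxs[i] < x) i++
      const r = rnks[i - 1] + (rnks[i] - rnks[i - 1]) *
        (x - maxs[i - 1]) / (maxs[i] - maxs[i - 1])
      return r / count
    }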

There are already a number of tests and benchmarks, but it is still a proof of concept at this point since it is fairly new. I can elaborate more if there is any interest (unless it already exists somewhere?).

hville changed the title from "Simple alternate implementation using maxima, ranks and fixed cumulative weighting" to "Simple alternate algorithm using maxima, ranks and fixed cumulative weighting" Sep 11, 2016
tdunning (Owner) commented

I haven't looked at this enough yet, but here are some thoughts:

  • I recently looked into keeping min/max for each centroid/bucket in
    t-digest itself. I convinced myself that the accuracy would be lower
    because during early adaptation or when merging digests, you wind up with
    the ideal square distribution of samples with some small leaky tails. That
    means that min and max are not really good estimators of how wide the
    general mass of the bucket really is.
  • Centroids are really clean and simple to update. The issue above makes most other measures harder.
  • Have you looked at the MergingDigest work? It gives you completely static
    memory allocation and no tree. The addition of new samples is dead simple.
    CDF and quantile computation is nearly the same as for tree digests.
  • Can you say how your insertion algorithm is different from t-digest and
    different from Greenwald-Khanna's algorithm (paper attached)?

hville (Author) commented Sep 12, 2016

Thanks for your time and the pointers, since I am not familiar with the literature.
I have read the Greenwald-Khanna paper and the MergingDigest code, and I think that what I am doing is very close to the MergingDigest idea, but more naive and with the value-rank axis inverted throughout:

min/max vs centroids

I don't follow your point, but in any case the min values are not used and the max values are interpolated:

  max[i-1] < newVal < max[i]
  newRnk = rnk[i] - (rnk[i] - (rnk[i-1] + 1)) * (max[i] - newVal) / (max[i] - max[i-1])
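
As a runnable reading of that formula (function and argument names are illustrative):

    // Interpolate the rank of newVal, given maxs[i-1] < newVal < maxs[i].
    // rnks[i-1] + 1 is the lowest rank newVal could take (just above the
    // previous bucket) and rnks[i] the highest, so interpolate between them.
    function interpolatedRank(maxs, rnks, i, newVal) {
      const frac = (maxs[i] - newVal) / (maxs[i] - maxs[i - 1])
      return rnks[i] - (rnks[i] - (rnks[i - 1] + 1)) * frac
    }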

I think this makes it much closer to t-Digest than to Greenwald-Khanna, since it only uses a single interpolated value: it interpolates the rank instead of the value.

I also have a hunch that the data is better used by increasing the number of buckets and the overall resolution instead of keeping min values.

weighting

The relative target weighting of the buckets is fixed at the onset. Any distribution can be used; this is the current implementation:

  w: target bucket relative weight
  p = idx/(N-1): bucket relative position

Assuming that the bucket error is proportional to the bucket weight and that we want every bucket to have a similar kind of relative error:

  constant RMS edge error: errRMS = sqrt( (w/p)^2 + (w/(1-p))^2 )
                           errRMS = w/(p(1-p)) * sqrt(1 - 2p(1-p))
                           w[i] = errRMS * p(1-p) / sqrt(1 - 2p(1-p))

w[i] ~= k p(1-p) (1 + 2p(1-p)) is a very close approximation using a Maclaurin series and is easy to integrate:

  cumulative target weights = ( 15p^2 + 10p^3 - 30p^4 + 12p^5 ) / 7
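
In code, the weighting step reduces to evaluating that polynomial once per bucket position (names are again illustrative):

    // Normalized cumulative target weight at relative position p in [0, 1]:
    // the integral of k*p(1-p)(1 + 2p(1-p)), scaled so that W(1) = 1.
    function cumulativeWeight(p) {
      return (15 * p ** 2 + 10 * p ** 3 - 30 * p ** 4 + 12 * p ** 5) / 7
    }

    // Precomputed once per digest size N and shareable across instances.
    function targetWeights(N) {
      const w = new Float64Array(N)
      for (let i = 0; i < N; i++) w[i] = cumulativeWeight(i / (N - 1))
      return w
    }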

This relative error approach is actually much closer to t-Digest than to Greenwald-Khanna, where the latter uses absolute error instead.
I thought the t-Digest implementation paper used something like 1/w(1-w), but the MergingDigest uses a cumulative arcsin function that I think is the inverse of the above.
Again, it looks to me like this is very close to t-Digest, but with the value-rank axis reversed and the values precomputed per bucket.

insertion

Since the relative target bucket weights are fixed at the onset, there is relatively little cost to merging values and no buffer is required. Although the target weights need to be stored, they can be shared across many instances of the same size. The steps (a sketch follows the list):

  • for a new value v, find the neighbouring buckets such that v[i-1] < v <= v[i+1]
  • calculate the interpolated rank r for the new value
  • compare r with the precomputed target weights
  • if r/N > w[i+1], replace bucket [i+1] and continue shifting right with the replaced bucket
  • if r/N < w[i], replace bucket [i] and continue shifting left with the replaced bucket
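
A loose sketch of the right-shifting case (the left shift is symmetric; names are illustrative and the real implementation differs in details):

    // One insertion: find the bracket, interpolate the rank, then push the
    // displaced (value, rank) pair outward while it overshoots its target.
    function insert(digest, targets, v) {
      const { maxs, rnks } = digest
      const n = ++digest.count

      // find i such that maxs[i-1] < v <= maxs[i]
      let i = 0
      while (i < maxs.length - 1 && maxs[i] < v) i++

      // interpolated rank of the new value (same formula as above)
      let r = i > 0
        ? rnks[i] - (rnks[i] - (rnks[i - 1] + 1)) * (maxs[i] - v) / (maxs[i] - maxs[i - 1])
        : 1

      // shift right while the carried rank overshoots the target weight
      while (i < maxs.length && r / n > targets[i]) {
        const tv = maxs[i], tr = rnks[i]
        maxs[i] = v
        rnks[i] = r
        v = tv
        r = tr
        i++
      }
      // the symmetric left shift (r/n < targets[i]) is omitted here
    }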

I guess I could take some of the unit-test cases and check where/when the error profile passes or fails.

hville (Author) commented Sep 19, 2016

I'm closing this since it is not an issue and was simply opened to share ideas.

hville closed this as completed Sep 19, 2016
tdunning (Owner) commented
One last comment: inverting the axes might be a very clever thought. I am not clear whether that is entirely true, but it is worth following up on at some point.

The point is that a rational approximation for sin(k) is super easy, while asin(q) is harder.
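
To illustrate (transcribed from memory, so treat as a sketch; delta is the compression parameter and the names are illustrative):

    // The MergingDigest scale function is, up to scaling, an arcsine of the
    // quantile q; its inverse only needs a sine, which admits cheap rational
    // approximations.
    function k(q, delta) {
      return (delta / (2 * Math.PI)) * Math.asin(2 * q - 1) // quantile -> k
    }
    function kInverse(kVal, delta) {
      return (Math.sin((2 * Math.PI * kVal) / delta) + 1) / 2 // k -> quantile
    }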

