
Simple alternate algorithm using maxima, ranks and fixed cumulative weighting #71

Closed
hville opened this issue Sep 11, 2016 · 4 comments

hville commented Sep 11, 2016

Went down a rabbit hole and came out with this. It is a small part of a bigger project in JavaScript, but some of the ideas might be of use to others. Sorry, no Java.

In short, it flips the value-rank compression approach around. Instead of keeping approximated weighted-value centroids for given ranks, this algorithm uses actual values as-is but with approximated ranks (a small sketch follows the lists below).

Maximum value and rank instead of weighted-value centroids:

  • actual values are retained, which is better for discrete or repeated values
  • the cdf is directly implied by a given maximum and rank: F(x) = P(X <= x)
  • ranks are interpolated (i.e. values are exact, ranks are not)

Fixed size, simple array, no tree structure:

  • no compression cycles
  • fixed memory use
  • weighting function calculated only once (fixed target cumulative probability)
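
Roughly, in simplified JavaScript (the names maxs, rnks and count are illustrative, not taken from the actual implementation):

    // Simplified sketch: two fixed-size parallel arrays, no tree.
    // maxs holds actual sample values (bucket maxima), rnks their
    // approximated ranks, count the total number of samples seen.
    function createDigest(size) {
      return {
        maxs: new Float64Array(size),
        rnks: new Float64Array(size),
        count: 0
      }
    }

    // F(x) = P(X <= x): bracket x between two stored maxima and
    // linearly interpolate the rank, then normalize by the count.
    function cdf(digest, x) {
      const { maxs, rnks, count } = digest
      if (count === 0) return NaN
      if (x < maxs[0]) return 0
      const last = maxs.length - 1
      if (x >= maxs[last]) return 1
      let i = 1
      while (maxs[i] < x) i++
      const r = rnks[i - 1] + (rnks[i] - rnks[i - 1]) *
        (x - maxs[i - 1]) / (maxs[i] - maxs[i - 1])
      return r / count
    }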

There are already a number of tests and benchmarks, but it is still a proof of concept at this point since it is fairly new. I can elaborate more if there is any interest (unless it already exists somewhere?).

hville changed the title from "Simple alternate implementation using maxima, ranks and fixed cumulative weighting" to "Simple alternate algorithm using maxima, ranks and fixed cumulative weighting" Sep 11, 2016
tdunning (Owner) commented

I haven't looked at this enough yet, but here are some thoughts:

  • I recently looked into keeping min/max for each centroid/bucket in
    t-digest itself. I convinced myself that the accuracy would be lower
    because during early adaptation or when merging digests, you wind up with
    the ideal square distribution of samples with some small leaky tails. That
    means that min and max are not really good estimators of how wide the
    general mass of the bucket really is.
  • Centroids are really clean and simple to update. The issue above makes most other measures harder.
  • Have you looked at the MergingDigest work? It gives you completely static
    memory allocation and no tree. The addition of new samples is dead simple.
    CDF and quantile computation is nearly the same as for tree digests.
  • Can you say how your insertion algorithm is different from t-digest and
    different from Greenwald-Khanna's algorithm (paper attached)?

hville (Author) commented Sep 12, 2016

Thanks for your time and the pointers, since I am not familiar with the literature.
I have read the Greenwald-Khanna paper and the MergingDigest code, and I think that what I am doing is very close to the MergingDigest idea, but more naive and with the value-rank axis inverted throughout:

min/max vs centroids

I don't follow your point, but in any case the min values are not used and the max values are interpolated:

  max[i-1] < newVal < max[i]
  newRnk = rnk[i] - (rnk[i] - (rnk[i-1] + 1)) * (max[i] - newVal) / (max[i] - max[i-1])
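
As a runnable reading of that formula (function and argument names are illustrative):

    // Interpolate the rank of newVal, given maxs[i-1] < newVal < maxs[i].
    // rnks[i-1] + 1 is the lowest rank newVal could take (just above the
    // previous bucket) and rnks[i] the highest, so interpolate between them.
    function interpolatedRank(maxs, rnks, i, newVal) {
      const frac = (maxs[i] - newVal) / (maxs[i] - maxs[i - 1])
      return rnks[i] - (rnks[i] - (rnks[i - 1] + 1)) * frac
    }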

I think this makes it much closer to t-Digest than to Greenwald-Khanna, since it only uses a single interpolated value: it interpolates the rank instead of the value.

I also have a hunch that the data is better used by increasing the number of buckets and the overall resolution instead of keeping min values.

weighting

The relative target weighting of the buckets is fixed at the onset. Any distribution can be used; this is the current implementation:

  w: target bucket relative weight
  p = idx/(N-1): bucket relative position

Assuming that the bucket error is proportional to the bucket weight and that we want every bucket to have a similar kind of relative error:

  constant RMS edge error: errRMS = sqrt( (w/p)^2 + (w/(1-p))^2 )
                           errRMS = w/(p(1-p)) * sqrt(1 - 2p(1-p))
                           w[i] = errRMS * p(1-p) / sqrt(1 - 2p(1-p))

w[i] ~= k p(1-p) (1 + 2p(1-p)) is a very close approximation using a Maclaurin series and is easy to integrate:

  cumulative target weights = ( 15p^2 + 10p^3 - 30p^4 + 12p^5 ) / 7
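
In code, the weighting step reduces to evaluating that polynomial once per bucket position (names are again illustrative):

    // Normalized cumulative target weight at relative position p in [0, 1]:
    // the integral of k*p(1-p)(1 + 2p(1-p)), scaled so that W(1) = 1.
    function cumulativeWeight(p) {
      return (15 * p ** 2 + 10 * p ** 3 - 30 * p ** 4 + 12 * p ** 5) / 7
    }

    // Precomputed once per digest size N and shareable across instances.
    function targetWeights(N) {
      const w = new Float64Array(N)
      for (let i = 0; i < N; i++) w[i] = cumulativeWeight(i / (N - 1))
      return w
    }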

This relative error approach is actually much closer to t-Digest than to Greenwald-Khanna, where the latter uses absolute error instead.
I thought the t-Digest implementation paper used something like 1/w(1-w), but the MergingDigest uses a cumulative arcsin function that I think is the inverse of the above.
Again, it looks to me like this is very close to t-Digest, but with the value-rank axis reversed and the values precomputed per bucket.

insertion

Since the relative target bucket weights are fixed at the onset, there is relatively little cost to merging values and no buffer is required. Although the target weights need to be stored, they can be shared across many instances of the same size. The steps (a sketch follows the list):

  • for a new value v, find the neighbouring buckets such that v[i-1] < v <= v[i+1]
  • calculate the interpolated rank r for the new value
  • compare r with the precomputed target weights
  • if r/N > w[i+1], replace bucket [i+1] and continue shifting right with the replaced bucket
  • if r/N < w[i], replace bucket [i] and continue shifting left with the replaced bucket
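
A loose sketch of the right-shifting case (the left shift is symmetric; names are illustrative and the real implementation differs in details):

    // One insertion: find the bracket, interpolate the rank, then push the
    // displaced (value, rank) pair outward while it overshoots its target.
    function insert(digest, targets, v) {
      const { maxs, rnks } = digest
      const n = ++digest.count

      // find i such that maxs[i-1] < v <= maxs[i]
      let i = 0
      while (i < maxs.length - 1 && maxs[i] < v) i++

      // interpolated rank of the new value (same formula as above)
      let r = i > 0
        ? rnks[i] - (rnks[i] - (rnks[i - 1] + 1)) * (maxs[i] - v) / (maxs[i] - maxs[i - 1])
        : 1

      // shift right while the carried rank overshoots the target weight
      while (i < maxs.length && r / n > targets[i]) {
        const tv = maxs[i], tr = rnks[i]
        maxs[i] = v
        rnks[i] = r
        v = tv
        r = tr
        i++
      }
      // the symmetric left shift (r/n < targets[i]) is omitted here
    }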

I guess I could take some of the unit-test cases and check where/when the error profile passes or fails.

hville (Author) commented Sep 19, 2016

I'm closing this since it is not an issue and was simply opened to share ideas.

hville closed this as completed Sep 19, 2016
tdunning (Owner) commented
One last comment: inverting the axes might be a very clever thought. I am not clear whether that is entirely true, but it is worth following up on at some point.

The point is that a rational approximation for sin(k) is super easy, while asin(q) is harder.
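
To illustrate (transcribed from memory, so treat as a sketch; delta is the compression parameter and the names are illustrative):

    // The MergingDigest scale function is, up to scaling, an arcsine of the
    // quantile q; its inverse only needs a sine, which admits cheap rational
    // approximations.
    function k(q, delta) {
      return (delta / (2 * Math.PI)) * Math.asin(2 * q - 1) // quantile -> k
    }
    function kInverse(kVal, delta) {
      return (Math.sin((2 * Math.PI * kVal) / delta) + 1) / 2 // k -> quantile
    }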

