Improvement to t-digest stability under merging #78
Comments
Hmm... The more modern implementations (MergingDigest) already use a sorted-merge approach. Check out the faster-merge branch for the implementation, and the new-article branch for some discussion of how this all works. |
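A minimal sketch of what such a sorted-merge compression pass can look like, using the k1 scale function from the t-digest paper. The class and helper names here are hypothetical, and this simplifies what MergingDigest actually does:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of merge-based compression: incoming centroids are assumed to be
// sorted by mean, and adjacent centroids are merged as long as the
// scale-function budget k1(qRight) - k1(qLeft) <= 1 allows it.
class MergeSketch {

    // k1 scale function from the t-digest paper
    static double k1(double q, double delta) {
        return delta / (2 * Math.PI) * Math.asin(2 * q - 1);
    }

    // means/weights are parallel arrays, already sorted by mean
    static List<double[]> compress(double[] means, double[] weights, double delta) {
        double total = 0;
        for (double w : weights) total += w;

        List<double[]> out = new ArrayList<>();  // each entry is {mean, weight}
        double mean = means[0];
        double weight = weights[0];
        double qLeft = 0;                        // quantile at the left edge of the current cluster
        for (int i = 1; i < means.length; i++) {
            double proposed = weight + weights[i];
            double qRight = Math.min(1.0, qLeft + proposed / total);
            if (k1(qRight, delta) - k1(qLeft, delta) <= 1) {
                // merge: weighted average of means, summed weight
                mean = (mean * weight + means[i] * weights[i]) / proposed;
                weight = proposed;
            } else {
                // close the current cluster and start a new one
                out.add(new double[]{mean, weight});
                qLeft += weight / total;
                mean = means[i];
                weight = weights[i];
            }
        }
        out.add(new double[]{mean, weight});
        return out;
    }
}
```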
Aha, that explains the code I was looking at. I'll work on including that algorithm in my comparisons |
Here is a comparison of a number of KS-statistic computations using either a t-digest or the original data:
[image: Inline image 1]
This indicates to me that the t-digest is doing a good job of representing the distribution. This is for \delta = 100.
|
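A sketch of one way a KS-style comparison like that can be computed against a digest, assuming the com.tdunning.math.stats API (createMergingDigest, add, cdf); this is not necessarily how the plot above was produced:

```java
import com.tdunning.math.stats.TDigest;

import java.util.Arrays;
import java.util.Random;

// Sketch: KS-style statistic between the empirical CDF of a sample and the
// CDF reported by a t-digest built from that same sample.
public class KsCheck {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] data = new double[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = rng.nextGaussian();
        }

        TDigest digest = TDigest.createMergingDigest(100);  // \delta = 100
        for (double x : data) {
            digest.add(x);
        }

        Arrays.sort(data);
        double d = 0;
        for (int i = 0; i < data.length; i++) {
            double empirical = (i + 1.0) / data.length;
            d = Math.max(d, Math.abs(empirical - digest.cdf(data[i])));
        }
        System.out.printf("KS-style statistic ~ %.5f%n", d);
    }
}
```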
I implemented the new merge-based combining algorithm for comparison. The code is here
<https://github.com/erikerlandson/isarn-sketches-algebird-api/blob/blog/t_digest_sum_2/src/main/scala/org/isarnproject/sketchesAlgebirdAPI/AlgebirdFactory.scala#L107>
-- I believe I have it working correctly, but it hasn't been reviewed by anybody else. My updated plots look like this (merge-based algorithm added in green):
[image: merge_compare]
<https://cloud.githubusercontent.com/assets/259898/21464321/c1cbf850-c936-11e6-9a27-40b31403bd2f.png>
Assuming my code is all working as expected, largest-to-smallest converges a bit better than merge-based, although either is clearly far better than the older random-insertion algorithm. |
Your code looks wrong here
<https://github.com/erikerlandson/isarn-sketches-algebird-api/blob/blog/t_digest_sum_2/src/main/scala/org/isarnproject/sketchesAlgebirdAPI/AlgebirdFactory.scala#L108-L113>.
The problem is that you are merging clusters with the same centroid. That isn't right. You should only merge them if they satisfy the merge criterion. My guess is that if that code is being triggered, then you are getting clusters that are larger than allowed. I don't see why that code should be triggered in your test case, though.
Also, this form of the algorithm, with k2q(q+dq) - k2q(q), is fine for testing, but keep in mind that you can go about 10x faster with some limit caching and by merging many digests in one go.
All that said, I think that I am seeing the same poor convergence you are: ks.test of sample data versus CDFs estimated from the t-digest shows limited convergence, much like your plots. Doing the biggest-first merge is going to be a pain in the butt with static data structures. I will have to think on it.
|
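The limit-caching idea mentioned above, roughly: instead of evaluating the scale function for every incoming point, compute once per cluster the cumulative weight at which the cluster must be closed. A sketch with hypothetical helper names, not the library code:

```java
// Sketch of the "limit caching" speedup: rather than checking a
// k(q+dq) - k(q) style bound per point, precompute the cumulative weight at
// which the current cluster must be closed, and compare a running counter
// against that cached value in the inner loop.
class LimitCache {

    static double k1(double q, double delta) {
        return delta / (2 * Math.PI) * Math.asin(2 * q - 1);
    }

    static double k1Inverse(double k, double delta) {
        return (Math.sin(2 * Math.PI * k / delta) + 1) / 2;
    }

    // Cumulative-weight limit for a cluster whose left edge sits at qLeft.
    // Computed once per cluster instead of once per point.
    static double weightLimit(double qLeft, double totalWeight, double delta) {
        double kLimit = Math.min(k1(1.0, delta), k1(qLeft, delta) + 1);
        double qLimit = k1Inverse(kLimit, delta);
        return qLimit * totalWeight;
    }
}
```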
I wonder how important it really is, especially if it's inconvenient to change. My original goal was to just find something that didn't keep getting worse, like the older random-insertion algorithm does. The merge-based algorithm clearly meets and exceeds that requirement already. My logic for combining clusters with exactly the same centroid is a by-product of my choice of backing data structure. Since I'm using a map whose key is the centroid location, multiple clusters with the same centroid are inconvenient to work with, so I decided to just live with the possibility of violating the cluster size bound. I predicted it would be a rare event in most scenarios. |
If you look at the test cases for the Java version of t-digest, you will find a number of cases that deal with lots of repeated data values. These tests are in there because of how badly I have been burned by real data like that.
For instance, take payment amounts on a mixed transaction stream. These are often zero, or stereotypical numbers like 9.95 (plus tax). The result of such repeated values is that you get multiple centroids with the same mean. It is important, however, to be able to peel out one of these centroids at a time to express some of the subtlety of the distribution.
A good test case for this has input that is 100,000 zero values followed by 10,000 exponentially distributed values. If you merge centroids, you get one massive centroid from the first part of the data and then singletons. Some of the singletons will combine with the massive centroid and shift it. The resulting KS difference can be very near 1.
|
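The stress test described above is easy to reproduce. A sketch, assuming the com.tdunning.math.stats API (createMergingDigest, add, cdf), with the true mixture CDF written out by hand:

```java
import com.tdunning.math.stats.TDigest;

import java.util.Random;

// Sketch of the repeated-values stress test: 100,000 zeros followed by
// 10,000 exponentially distributed values, then a KS-style check against
// the known mixture CDF.
public class RepeatedValuesTest {
    public static void main(String[] args) {
        Random rng = new Random(1);
        TDigest digest = TDigest.createMergingDigest(100);
        for (int i = 0; i < 100_000; i++) {
            digest.add(0);
        }
        for (int i = 0; i < 10_000; i++) {
            digest.add(-Math.log(1 - rng.nextDouble()));  // Exp(1) sample
        }

        // True CDF of the mixture: mass 10/11 at zero, the rest spread as Exp(1).
        // Evaluate just above zero to sidestep the midpoint convention at the
        // point mass itself.
        double d = 0;
        for (double x = 1e-6; x <= 10; x += 0.01) {
            double trueCdf = (100_000 + 10_000 * (1 - Math.exp(-x))) / 110_000.0;
            d = Math.max(d, Math.abs(trueCdf - digest.cdf(x)));
        }
        System.out.printf("max |F_digest - F_true| = %.4f%n", d);
    }
}
```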
An easy, though slightly hacky, workaround would be to use the Go equivalent of the java.lang.Math.nextAfter() method instead of merging the values. The resulting centroids would not be identical and thus would key differently in your data structure.
My preference in such cases is actually to use an unstable sort so that you get any one of the equal centroids. This avoids a problem where one of the "equal" centroids might shadow the others. In the original t-digest, I did a lot of work to scan all equally distant centroids to avoid this. In the merging digest, I plan to simply destabilize the merge with random numbers.
|
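The nextAfter() workaround, sketched against a plain TreeMap keyed by centroid mean; this is illustrative only, not the data structure used in either implementation:

```java
import java.util.TreeMap;

// Sketch of the nextAfter() workaround: if a centroid's mean is already a key
// in the map, nudge the new key up by one ulp at a time so that distinct
// clusters never collide.
class NudgedInsert {
    static void insert(TreeMap<Double, Double> clusters, double mean, double weight) {
        double key = mean;
        while (clusters.containsKey(key)) {
            key = Math.nextAfter(key, Double.POSITIVE_INFINITY);
        }
        clusters.put(key, weight);
    }
}
```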
I have a theory that my logic could work OK with a big spike of zeros (or other repeated numbers): such a spike would not merge with clusters at different locations (and then shift), on the grounds that it is already larger than the nominal size limit. Clusters only get to grow past their limit by acquiring new mass at exactly the same location. But that needs either proof or disproof via additional unit testing. My original concept involved a sort of back-off sketching model where discrete values were stored in a map, unless and until some sufficient number of unique values was encountered, at which point it would back off to a t-digest. I suppose a hybrid model might be possible, where large spikes get stored exterior to the digest. |
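A rough sketch of that back-off idea; the BackoffSketch class and its threshold are hypothetical, and the t-digest stage assumes createMergingDigest/add from the Java library:

```java
import com.tdunning.math.stats.TDigest;

import java.util.Map;
import java.util.TreeMap;

// Sketch of the back-off model: keep exact counts in a map until the number
// of distinct values crosses a threshold, then spill everything into a
// t-digest and feed it directly from then on.
class BackoffSketch {
    private final int maxDistinct;
    private Map<Double, Long> exact = new TreeMap<>();
    private TDigest digest = null;

    BackoffSketch(int maxDistinct) {
        this.maxDistinct = maxDistinct;
    }

    void add(double x) {
        if (digest != null) {
            digest.add(x);
            return;
        }
        exact.merge(x, 1L, Long::sum);
        if (exact.size() > maxDistinct) {
            // back off: replay the exact counts into a t-digest
            digest = TDigest.createMergingDigest(100);
            for (Map.Entry<Double, Long> e : exact.entrySet()) {
                digest.add(e.getKey(), e.getValue().intValue());
            }
            exact = null;
        }
    }
}
```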
I can't remember if I mentioned this, but the merging version of t-digest effectively does just this.
|
Erik, thanks again for your excellent observation. I just wanted to mention that MergingDigest now has an add function that takes a list of TDigests. This mega merge is handled very efficiently, and accuracy should be better than monoid-style accumulation. |
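A usage sketch of that bulk merge, assuming the add(List&lt;TDigest&gt;) overload on MergingDigest described above:

```java
import com.tdunning.math.stats.MergingDigest;
import com.tdunning.math.stats.TDigest;

import java.util.Arrays;
import java.util.Random;

// Sketch: build per-partition digests, then do one bulk merge instead of
// folding them together pairwise, monoid-style.
public class BulkMerge {
    public static void main(String[] args) {
        Random rng = new Random(7);
        TDigest d1 = new MergingDigest(100);
        TDigest d2 = new MergingDigest(100);
        for (int i = 0; i < 50_000; i++) {
            d1.add(rng.nextGaussian());
            d2.add(rng.nextGaussian());
        }

        MergingDigest combined = new MergingDigest(100);
        combined.add(Arrays.asList(d1, d2));  // single mega merge over all inputs
        System.out.println("median ~ " + combined.quantile(0.5));
    }
}
```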
That sounds very cool! 👍 |
Btw... I think that you may actually have been seeing the thin end of #89: AVLTreeDigest has a serious bug with repeated values. |
This seems better lately. This is from the data produced by TDigestTest.testKSDrift. Note that the algorithms are different in that AVLTreeDigest uses a size limit that allows the number of centroids to grow without bound, roughly as \delta * log(n), while MergingDigest uses a size limit that keeps the number strictly bounded by 2 \delta and, practically speaking, by about 1.3 \delta. In any case, the KS statistic is very small for both. |
Nice, it looks like that cut the |
Well, it isn't clear that this is all apples to apples. But, yes, this is encouraging. The MergingDigest has different size limits that make things behave very differently in terms of centroid count. When I goosed the compression limit to give a similar centroid count, MergingDigest and AVLTreeDigest have very similar D. The |
Hmm... my last comment got cut off. I was going to say that there was some not-yet-understood initial convergence behavior. Not significant as far as I can tell. |
I think that this is resolved. It wasn't a single issue, but the work resulted in significant improvements. Thanks to all! |
I noticed that merging t-digests appeared to cause some drift from the underlying distribution, at least using the method mentioned in the original paper where clusters are randomly shuffled and inserted. Inserting combined clusters from largest-to-smallest gave a big improvement in "convergence" of the merged clusters. I wrote up the results here:
http://erikerlandson.github.io/blog/2016/12/19/converging-monoid-addition-for-t-digest/
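The largest-to-smallest strategy from the write-up, sketched against the Java API (centroids(), mean(), count(), add(x, w)); this illustrates the idea rather than reproducing the Scala code from the post:

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of monoid-style combination with largest-to-smallest insertion:
// pool the clusters of both digests, then re-insert them in decreasing
// order of weight instead of in random order.
public class LargestFirstCombine {
    static TDigest combine(TDigest a, TDigest b, double compression) {
        List<Centroid> pool = new ArrayList<>();
        pool.addAll(a.centroids());
        pool.addAll(b.centroids());
        pool.sort(Comparator.comparingInt(Centroid::count).reversed());

        TDigest result = TDigest.createAvlTreeDigest(compression);
        for (Centroid c : pool) {
            result.add(c.mean(), c.count());
        }
        return result;
    }
}
```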