Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram #42

Closed
gregakespret opened this issue Mar 15, 2015 · 5 comments
Closed

Histogram #42

gregakespret opened this issue Mar 15, 2015 · 5 comments

Comments

@gregakespret
Copy link

Hi! Not sure if opening a new Issue here is the right way to go about this. I'd like to understand why the histogram created through t-digest loses so much information.

The original histogram with 500 breaks (every 0.05). Double peak is clearly visible.
histogram.

Histogram created using t-digest centroids's means and counts. Double peak is not visible.
histogram-from-tdigest

The raw data that I used includes 49220071 data points from 0 to 60.

This is the code I used to produce histogram-from-tdigest

Is using Centroid's mean() and count() even the right way to go about this?

Thank you.

@tdunning
Copy link
Owner

Ah. I can see something from your code.

Don't use the array digest. Use the avltree digest.

Sent from my iPhone

On Mar 15, 2015, at 13:54, Grega Kespret notifications@github.com wrote:

Hi! Not sure if opening a new Issue here is the right way to go about this. I'd like to understand why the histogram created through t-digest loses so much information.

The original histogram with 500 breaks (every 0.05). Double peak is clearly visible.
.

Histogram created using t-digest centroids's means and counts. Double peak is not visible.

The raw data that I used includes 49220071 data points from 0 to 60.

This is the code I used to produce histogram-from-tdigest


Reply to this email directly or view it on GitHub.

@tdunning
Copy link
Owner

Another phoned in comment, t-digest makes no pretense of recreating densities. It recreates the cdf. Try plotting cdf's against each other. Also make sure compression factor is relatively high (200 or more) since you are interested in behavior far from the tails.

Sent from my iPhone

On Mar 15, 2015, at 13:54, Grega Kespret notifications@github.com wrote:

Hi! Not sure if opening a new Issue here is the right way to go about this. I'd like to understand why the histogram created through t-digest loses so much information.

The original histogram with 500 breaks (every 0.05). Double peak is clearly visible.
.

Histogram created using t-digest centroids's means and counts. Double peak is not visible.

The raw data that I used includes 49220071 data points from 0 to 60.

This is the code I used to produce histogram-from-tdigest


Reply to this email directly or view it on GitHub.

@tdunning
Copy link
Owner

tdunning commented Apr 1, 2015

I think I figured this out. The plot that you have here obscures the spacing between centroids. To recover the PDF from the t-digest you have to adjust the weight of the centroid according to the distance to the neighbors.

@gregakespret
Copy link
Author

I tried with your previous suggestions, that is using avltree digest with high compression factor (200). This is the histogram (pdf) that it creates:

avltree

Indeed, the cdf is captured really well.

I didn't really fully understand your last comment. If I still want to plot pdf, how do I do what you described? Thanks!

@tdunning
Copy link
Owner

Finishing this off ... yes, weighting the counts correctly is crucial to proper presentation of a PDF given a t-digest. The basic idea is that each centroid approximates the count over a particular range. To convert this to a rectangle in an approximated PDF, you need to convert the count and the range to get the slope of the CDF. Getting the bounds of the centroid is a bit tricky, but there is a pretty good way to do it:

Given: centroid index i, number of centroids n, counts k[i], positions x[i], smallest x_min and largest x_max samples seen

Wanted: left x_left and right x_right bounds of centroid

if i == 0:
x_left = x_min
else:
x_left = (x[i] * k[i-1] + x[i-1] * k[i]) / (k[i-1] + k[i])
if i == n-1:
x_right = x_max
else:
x_right = (x[i] * k[i+1] + x[i+1] * k[i]) / (k[i+1] + k[i])

Then the rectangle of the PDF can be drawn from x_left to x_right with height k[i] / (x_right-x_left)

Closing this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants