Histogram #42

gregakespret · 2015-03-15T19:54:19Z

Hi! Not sure if opening a new Issue here is the right way to go about this. I'd like to understand why the histogram created through t-digest loses so much information.

The original histogram with 500 breaks (every 0.05). The double peak is clearly visible.

Histogram created using the t-digest centroids' means and counts. The double peak is not visible.

The raw data that I used includes 49220071 data points from 0 to 60.

This is the code I used to produce histogram-from-tdigest

Is using Centroid's mean() and count() even the right way to go about this?

Thank you.


tdunning · 2015-03-15T23:40:15Z

Ah. I can see something from your code.

Don't use the array digest. Use the avltree digest.


tdunning · 2015-03-15T23:45:10Z

Another phoned-in comment: t-digest makes no pretense of recreating densities. It recreates the CDF. Try plotting the CDFs against each other. Also make sure the compression factor is relatively high (200 or more), since you are interested in behavior far from the tails.
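For readers who want to try the CDF comparison suggested here, the following is a minimal sketch rather than the library's own method. It assumes the centroid means and counts have already been extracted into plain Python lists with strictly increasing means. The function names `empirical_cdf` and `centroid_cdf` are hypothetical, and the interpolation rule (half of each centroid's count on either side of its mean, linear in between, clamped to 0 and 1 outside the first and last means) is an assumption made for illustration.

```python
from bisect import bisect_right


def empirical_cdf(sorted_samples, q):
    """Fraction of raw samples <= q. The samples must be sorted ascending."""
    return bisect_right(sorted_samples, q) / len(sorted_samples)


def centroid_cdf(means, counts, q):
    """Approximate CDF at q from centroid means and counts.

    Assumption (not taken from the thread): half of each centroid's count
    lies on either side of its mean, with linear interpolation between
    adjacent means. Values outside the centroid means are crudely clamped.
    """
    total = sum(counts)
    if q < means[0]:
        return 0.0
    if q >= means[-1]:
        return 1.0
    cum = 0.0  # total count of all centroids strictly before centroid i
    for i in range(len(means) - 1):
        left = cum + counts[i] / 2.0
        right = cum + counts[i] + counts[i + 1] / 2.0
        if means[i] <= q < means[i + 1]:
            frac = (q - means[i]) / (means[i + 1] - means[i])
            return (left + frac * (right - left)) / total
        cum += counts[i]
    return 1.0
```

Evaluating both functions on the same grid of `q` values and plotting the two curves together gives the comparison described above.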


tdunning · 2015-04-01T23:40:48Z

I think I figured this out. The plot that you have here obscures the spacing between centroids. To recover the PDF from the t-digest you have to adjust the weight of the centroid according to the distance to the neighbors.

gregakespret · 2015-04-05T20:16:01Z

I tried your previous suggestions, that is, using the avltree digest with a high compression factor (200). This is the histogram (PDF) that it creates:

Indeed, the cdf is captured really well.

I didn't fully understand your last comment. If I still want to plot the PDF, how do I do what you described? Thanks!

tdunning · 2017-04-19T21:56:27Z

Finishing this off ... yes, weighting the counts correctly is crucial to proper presentation of a PDF given a t-digest. The basic idea is that each centroid approximates the count over a particular range. To convert this to a rectangle in an approximated PDF, you need to divide the count by the width of that range, which gives the slope of the CDF. Getting the bounds of the centroid is a bit tricky, but there is a pretty good way to do it:

Given: centroid index i, number of centroids n, counts k[i], positions x[i], smallest x_min and largest x_max samples seen

Wanted: left x_left and right x_right bounds of centroid

    if i == 0:
        x_left = x_min
    else:
        x_left = (x[i] * k[i-1] + x[i-1] * k[i]) / (k[i-1] + k[i])
    if i == n-1:
        x_right = x_max
    else:
        x_right = (x[i] * k[i+1] + x[i+1] * k[i]) / (k[i+1] + k[i])

Then the rectangle of the PDF can be drawn from x_left to x_right with height k[i] / (x_right-x_left)
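The recipe above can be sketched in Python as follows. This is an illustrative sketch, not code from the t-digest library: the function names `centroid_bounds` and `pdf_rectangles` are hypothetical, and the inputs are assumed to be plain lists of centroid positions `x` (sorted ascending) and counts `k`. One addition that is not in the recipe: the height is also divided by the total count so that the rectangles integrate to 1. Drop that factor to get the raw `k[i] / (x_right - x_left)` height.

```python
def centroid_bounds(x, k, x_min, x_max):
    """Return (x_left, x_right) for every centroid, following the recipe above."""
    n = len(x)
    bounds = []
    for i in range(n):
        if i == 0:
            x_left = x_min
        else:
            # Boundary between centroids i-1 and i, weighted by the counts.
            x_left = (x[i] * k[i - 1] + x[i - 1] * k[i]) / (k[i - 1] + k[i])
        if i == n - 1:
            x_right = x_max
        else:
            # Boundary between centroids i and i+1, weighted by the counts.
            x_right = (x[i] * k[i + 1] + x[i + 1] * k[i]) / (k[i + 1] + k[i])
        bounds.append((x_left, x_right))
    return bounds


def pdf_rectangles(x, k, x_min, x_max):
    """Return (x_left, x_right, height) rectangles approximating the PDF.

    Assumption: heights are normalized by the total count, which the
    recipe in the thread does not do.
    """
    total = sum(k)
    rects = []
    for (x_left, x_right), count in zip(centroid_bounds(x, k, x_min, x_max), k):
        width = x_right - x_left
        if width <= 0:
            continue  # skip degenerate centroids with no usable width
        rects.append((x_left, x_right, count / (total * width)))
    return rects
```

Because the right bound of one centroid equals the left bound of the next, the rectangles tile the interval from `x_min` to `x_max` without gaps, and each can be drawn directly as a bar.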

Closing this for now.

tdunning closed this as completed Apr 19, 2017
