Streaming histogram #2590

Merged
merged 4 commits into from
Mar 29, 2018

@jamesmcclain jamesmcclain commented Mar 25, 2018

Overview

Fixes StreamingHistogram.itemCount. As a side effect, this changes the behavior of StreamingHistogram.binCounts so that the reported bin counts are no longer always zero.

Checklist

  • docs/CHANGELOG.rst updated, if necessary
  • docs guides update, if necessary
  • New user API has useful Scaladoc strings
  • Unit tests added for bug-fix or new feature

Demo

Previously, binCounts on a streaming histogram returned counts of zero. The bin labels were generated by the values function, which produced numbers that did not match the internal bins of the streaming histogram. The root cause was that itemCount was incorrect.

New behavior:

scala> import geotrellis.raster._
import geotrellis.raster._

scala> val tile = DoubleArrayTile(Array[Double](52, 54, 61, 32, 52, 50, 11, 21, 18), 3, 3)
tile: geotrellis.raster.DoubleArrayTile = DoubleConstantNoDataArrayTile([D@40ed78d,3,3)

scala> val result = tile.histogramDouble(3)
result: geotrellis.raster.histogram.Histogram[Double] = geotrellis.raster.histogram.StreamingHistogram@2f13b4ba

scala> result.binCounts.foreach(println(_))
(16.666666666666668,3)
(32.0,1)
(53.8,5)

scala> println(result.median().get)
34.18

Notes

As stated above, the previous behavior was to use bin labels generated by values, which did not (and still do not) line up with the internal bins used by the streaming histogram. When the count for a value that is not a bucket label is requested, zero is returned, because by construction and intent the streaming histogram does not have access to that information.
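
The mismatch can be illustrated with a minimal, library-free sketch. It assumes that counts are looked up by exact bucket label; the bucket labels are copied from the demo output, while `BucketLookupSketch`, `evenlySpaced`, and the evenly spaced labels are hypothetical stand-ins (they are not what values actually returned, and this is not the GeoTrellis implementation).

```scala
object BucketLookupSketch {
  // Internal buckets copied from the demo output: label -> count.
  val buckets: Map[Double, Long] =
    Map(16.666666666666668 -> 3L, 32.0 -> 1L, 53.8 -> 5L)

  // Assumed exact-label lookup: a value that is not a bucket label yields 0.
  def itemCount(label: Double): Long = buckets.getOrElse(label, 0L)

  // Hypothetical evenly spaced labels over [min, max]; requires n >= 2.
  def evenlySpaced(min: Double, max: Double, n: Int): Seq[Double] =
    (0 until n).map(i => min + i * (max - min) / (n - 1))

  def main(args: Array[String]): Unit = {
    // Labels that do not coincide with any bucket label all report zero.
    evenlySpaced(11.0, 61.0, 3).foreach(l => println((l, itemCount(l))))
    // The bucket labels themselves report the stored counts.
    buckets.keys.toSeq.sorted.foreach(l => println((l, itemCount(l))))
  }
}
```

Under these assumptions, every evenly spaced label misses the internal buckets, which is the always-zero symptom described above.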

The new behavior is simply to return the internal buckets used by the streaming histogram. The interpretation of these bin counts is therefore somewhat different from that of other histogram types.
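
To make "internal buckets" concrete, here is a rough sketch of one common streaming-histogram scheme: insert each value as its own bucket, then merge the two closest buckets (using a count-weighted mean label) whenever the bucket limit is exceeded. This is an assumption about the general technique, not a copy of the GeoTrellis code; `StreamingBucketsSketch`, `Bucket`, `insert`, and `build` are hypothetical names.

```scala
object StreamingBucketsSketch {
  final case class Bucket(label: Double, count: Long)

  // Merge two buckets; the new label is the count-weighted mean of the two labels.
  def merge(a: Bucket, b: Bucket): Bucket = {
    val n = a.count + b.count
    Bucket((a.label * a.count + b.label * b.count) / n, n)
  }

  // Add one value, then merge the closest adjacent pair until at most maxBuckets remain.
  def insert(buckets: Vector[Bucket], x: Double, maxBuckets: Int): Vector[Bucket] = {
    require(maxBuckets >= 1, "maxBuckets must be positive")
    val idx = buckets.indexWhere(_.label == x)
    var bs =
      if (idx >= 0) buckets.updated(idx, Bucket(x, buckets(idx).count + 1))
      else (buckets :+ Bucket(x, 1L)).sortBy(_.label)
    while (bs.length > maxBuckets) {
      // Index of the left bucket of the adjacent pair with the smallest gap.
      val i = (0 until bs.length - 1).minBy(j => bs(j + 1).label - bs(j).label)
      bs = (bs.take(i) :+ merge(bs(i), bs(i + 1))) ++ bs.drop(i + 2)
    }
    bs
  }

  // Feed the data through one value at a time, in the order given.
  def build(data: Seq[Double], maxBuckets: Int): Vector[Bucket] =
    data.foldLeft(Vector.empty[Bucket])((bs, x) => insert(bs, x, maxBuckets))
}
```

Fed the nine demo values in order with a limit of three buckets, this sketch ends with buckets at approximately 16.67, 32.0, and 53.8 holding 3, 1, and 5 items, which agrees with the demo output. The labels are weighted means of merged inputs rather than fixed bin edges, which is why these counts read differently from those of other histogram types.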

Note that the median value of 34.18 above is "correct" (expected) for an approximation using three buckets. Because not all of the input data are available, the median has to be approximated. If the approximate histogram is viewed as a curve, the median is the value at which half of the area under the curve lies to the left and half to the right.
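
One interpolation scheme that reproduces the 34.18 figure is linear interpolation on cumulative counts: treat each bucket label as the point at which its cumulative fraction is reached, and interpolate linearly between adjacent labels. Whether the library computes it exactly this way is an assumption; `MedianSketch` and `quantile` are hypothetical names, and the input must be non-empty and sorted by label.

```scala
object MedianSketch {
  // bins: (label, count) pairs sorted by label, as printed by binCounts in the demo.
  def quantile(bins: Seq[(Double, Long)], q: Double): Double = {
    val total = bins.map(_._2).sum.toDouble
    // Cumulative fraction of items reached at each bucket label.
    val fractions = bins.scanLeft(0L)((acc, b) => acc + b._2).tail.map(_ / total)
    val cdf = bins.map(_._1).zip(fractions)
    // Find the segment whose cumulative fractions bracket q and interpolate inside it.
    cdf.zip(cdf.tail).collectFirst {
      case ((x0, c0), (x1, c1)) if q >= c0 && q <= c1 && c1 > c0 =>
        x0 + (q - c0) / (c1 - c0) * (x1 - x0)
    }.getOrElse(if (q <= cdf.head._2) cdf.head._1 else cdf.last._1)
  }
}
```

For the demo buckets the cumulative fractions are 3/9, 4/9, and 9/9, so q = 0.5 falls one tenth of the way between 32.0 and 53.8, giving 32.0 + 0.1 × 21.8 = 34.18 (up to floating-point rounding).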

Closes #2274

@jamesmcclain jamesmcclain changed the title Streaming histogram [WiP] Streaming histogram Mar 26, 2018
@jamesmcclain jamesmcclain changed the title [WiP] Streaming histogram Streaming histogram Mar 26, 2018
Development

Successfully merging this pull request may close these issues.

Potential Error With Histogram[Double]