StreamingHistogram reports incorrect bin counts #2664

Closed
metasim opened this issue May 9, 2018 · 4 comments

@metasim
Member

metasim commented May 9, 2018

Unless I'm misunderstanding the purpose of binCounts, it appears to be working incorrectly.

Test code:

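// Imports assumed for GeoTrellis 1.2.x; `spark` is an existing SparkSession.
import geotrellis.raster._
import geotrellis.raster.histogram._
import geotrellis.spark._

val tileSize = 5
// Builds a tileSize x tileSize tile of Gaussian noise.
def rndTile = {
  val data = Array.fill(tileSize * tileSize)(scala.util.Random.nextGaussian())
  ArrayTile(data, tileSize, tileSize): Tile
}

val rdd = spark.sparkContext.makeRDD(Seq((1, rndTile), (2, rndTile), (3, rndTile)))
val h = rdd.histogram()
println(h.totalCount()) // 75 (3 tiles x 25 cells)
println(h.binCounts().map(_._2).sum) // 1200 (expected 75)
println(h.asInstanceOf[StreamingHistogram].buckets().map(_._2).sum) // 75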

@metasim metasim added the bug label May 9, 2018

@pomadchin
Member

pomadchin commented May 11, 2018

Hey @metasim, can you provide more details on how to reproduce it?

I implemented this unit test (see numbers in comments):

import geotrellis.spark._
import geotrellis.spark.testkit._
import geotrellis.raster._
import geotrellis.raster.histogram._
import org.scalatest._

class HistogramSpec extends FunSpec
  with Matchers
  with TestEnvironment {
  describe("HistogramSpec") {
    it("should count bins number correct") {
      val tileSize = 5
      def rndTile = {
        val data = Array.fill(tileSize * tileSize)(scala.util.Random.nextGaussian())
        ArrayTile(data, tileSize, tileSize): Tile
      }

      val rdd = sc.makeRDD(Seq((1, rndTile), (2, rndTile), (3, rndTile)))
      val h = rdd.histogram()
      println(h.totalCount()) // 75
      println(h.binCounts().map(_._2).sum) // 75
      println(h.asInstanceOf[StreamingHistogram].buckets().map(_._2).sum) // 75
    }
  }
}

@metasim
Member Author

metasim commented May 11, 2018

That's really odd. For println(h.binCounts().map(_._2).sum) we're definitely not getting the same number. We're using GT 1.2.1. What version did you use?

@pomadchin
Member

@metasim I used current master. Take a look at issue #2590.

@moradology
Contributor

Looks to me like this issue is resolved on master and in the upcoming 2.0 release. Closing for now.
