
Conversation

@yanboliang
Contributor

Note: I have a new implementation for this issue at #10806; let's move the discussion there and review that code.

  • Use BLAS Level 3 matrix-matrix multiplications to compute pairwise distances in k-means (a sketch of the idea follows this list).
  • Remove the runs-related code completely; it will have no effect after this change.
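
For context, here is a minimal sketch (not the PR's actual code) of the blockified idea: expand ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2, so that all m * k cross terms come from a single GEMM instead of m * k vector operations. The variable names are illustrative, and the snippet assumes it runs inside Spark's own code base, since org.apache.spark.mllib.linalg.BLAS is private[spark]:

    import java.util.Random
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices}
    import org.apache.spark.mllib.linalg.BLAS.gemm

    val m = 1000  // number of points
    val k = 10    // number of centers
    val d = 100   // feature dimension
    val random = new Random()

    // Points and centers stored row-wise.
    val points = Matrices.randn(m, d, random).asInstanceOf[DenseMatrix]
    val centers = Matrices.randn(k, d, random).asInstanceOf[DenseMatrix]

    // Squared row norms ||x_i||^2 and ||c_j||^2, computed once and reused.
    val pointNorms = Array.tabulate(m)(i => (0 until d).map(l => points(i, l) * points(i, l)).sum)
    val centerNorms = Array.tabulate(k)(j => (0 until d).map(l => centers(j, l) * centers(j, l)).sum)

    // All m * k cross terms -2 * <x_i, c_j> in a single Level 3 call.
    val cross = DenseMatrix.zeros(m, k)
    gemm(-2.0, points, centers.transpose, 0.0, cross)

    // ||x_i - c_j||^2 = ||x_i||^2 - 2 * <x_i, c_j> + ||c_j||^2
    val distances = Array.tabulate(m, k)((i, j) => pointNorms(i) + centerNorms(j) + cross(i, j))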

Update:

  • I ran a performance test, but found the new version is about 1.5x slower than the old one.
    Furthermore, I traced the call stack and found that constructing the pointMatrix, centerMatrix, and distanceMatrix is expensive. I tried computing and caching pointMatrix and centerMatrix in advance, but it still showed no benefit. Looking forward to others' comments.
  • I also found that gemm is slower than axpy in the mllib.linalg package. Consider the following code, which abstracts the k-means distance computation in the new and old versions:
    import java.util.Random
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Vectors}
    // BLAS is private[spark]; this snippet runs inside Spark's own code base.
    import org.apache.spark.mllib.linalg.BLAS.{axpy, gemm}

    val n = 3000
    val random = new Random()

    // New version: a single Level 3 call on n x n matrices.
    val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]
    val a1 = Matrices.randn(n, n, random).asInstanceOf[DenseMatrix]
    val b1 = Matrices.randn(n, n, random).asInstanceOf[DenseMatrix]

    val start1 = System.nanoTime()
    gemm(2.0, a1, b1, 2.0, c1)
    println("gemm elapsed time: = %.3f".format((System.nanoTime() - start1) / 1e9) + " seconds.")

    // Old version: n * n Level 1 calls on length-n vectors.
    val a2 = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val aa = Array.fill(n)(a2)
    val b2 = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val bb = Array.fill(n)(b2)

    val start2 = System.nanoTime()
    for (i <- 0 until n; j <- 0 until n) {
      axpy(1.0, bb(j), aa(j))
    }
    println("axpy elapsed time: = %.3f".format((System.nanoTime() - start2) / 1e9) + " seconds.")

I got the following performance results on my Mac:

gemm elapsed time: = 26.519 seconds.
axpy elapsed time: = 20.660 seconds.

This suggests we cannot get a benefit from BLAS Level 3 matrix-matrix multiplications when computing pairwise distances. I also found complaints from others that are similar to this issue (OpenMathLib/OpenBLAS#528).
If I replace axpy(1.0, bb(j), aa(j)) with dot(bb(j), aa(j)), I get:

gemm elapsed time: = 28.938 seconds.
dot elapsed time: = 28.574 seconds.

Please correct me if I have misunderstood something.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47724 has finished for PR 10306 at commit 46816f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47809 has finished for PR 10306 at commit 347d3ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Disable this test case because it was blocked by SPARK-12363.

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47912 has finished for PR 10306 at commit 8f76116.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [SPARK-8519] [ML] [MLlib] Blockify distance computation in k-means [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation Dec 28, 2015
@mengxr
Contributor

mengxr commented Jan 15, 2016

@yanboliang Regarding your local performance test:

  1. Make sure you have installed an optimized BLAS on your system and loaded it correctly in the JVM via netlib-java. The difference should be significant at 3000x3000 (with or without multi-threading).
  2. Your test of GEMM and AXPY is not equivalent. First, they do not use the same matrices for the multiplication. Second, axpy(1.0, bb(j), aa(j)) should be axpy(1.0, bb(j), aa(i)); otherwise you get some benefit from local caching.
  3. There are some issues with JVM performance testing. Usually you need to warm up the virtual machine (with some trial runs) and then run the test multiple times and take the average (see the sketch after this list).
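
To make point 3 concrete, here is a minimal harness sketch of that methodology (the helper name bench is hypothetical, not from the PR):

    // Warm up the JVM so the JIT compiles the hot path, then average several timed runs.
    def bench(name: String, warmups: Int = 10, runs: Int = 10)(body: => Unit): Unit = {
      for (_ <- 0 until warmups) body  // trial runs, timings discarded
      val start = System.nanoTime()
      for (_ <- 0 until runs) body
      val avgSeconds = (System.nanoTime() - start) / 1e9 / runs
      println(name + " elapsed time: = " + "%.3f".format(avgSeconds) + " seconds (avg of " + runs + " runs).")
    }

Usage would look like bench("gemm") { gemm(2.0, a1, b1, 2.0, c1) }, which is essentially the pattern the updated test below follows.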

Could you re-run the test? I will take a look at your implementation.

@yanboliang
Contributor Author

@mengxr Thanks for the reminder. I will check my environment and re-run the test.

@yanboliang
Contributor Author

@mengxr I found a misconfiguration in my test environment and fixed it, thanks! I also updated the test cases based on your advice. Now gemm is about 20-30 times faster than axpy/dot in the updated tests.

    import java.util.Random
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Vectors}
    // BLAS is private[spark]; this snippet runs inside Spark's own code base.
    import org.apache.spark.mllib.linalg.BLAS.{dot, gemm}

    // Print which BLAS backend netlib-java loaded.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)

    val n = 3000
    val count = 10
    val random = new Random()

    val a = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val aa = Array.fill(n)(a)
    val b = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val bb = Array.fill(n)(b)

    val a1 = new DenseMatrix(n, n, aa.flatMap(_.toArray), true)
    val b1 = new DenseMatrix(n, n, bb.flatMap(_.toArray), false)
    val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]

    // Warm-up (trial) runs for the JIT.
    for (i <- 0 until 10) {
      gemm(2.0, a1, b1, 2.0, c1)
    }

    var total1 = 0.0
    for (i <- 0 until count) {
      val start = System.nanoTime()
      gemm(2.0, a1, b1, 2.0, c1)
      total1 += (System.nanoTime() - start) / 1e9
    }
    total1 = total1 / count
    println("gemm elapsed time: = %.3f".format(total1) + " seconds.")

    // Warm-up (trial) runs for the JIT.
    for (m <- 0 until 10) {
      for (i <- 0 until n; j <- 0 until n) {
        dot(bb(j), aa(i))
      }
    }

    var total2 = 0.0
    for (m <- 0 until count) {
      val start = System.nanoTime()
      for (i <- 0 until n; j <- 0 until n) {
        // axpy(1.0, bb(j), aa(i))
        dot(bb(j), aa(i))
      }
      total2 += (System.nanoTime() - start) / 1e9
    }
    total2 = total2 / count
    println("dot elapsed time: = %.3f".format(total2) + " seconds.")

The output is:

com.github.fommil.netlib.NativeRefBLAS
gemm elapsed time: = 1.022 seconds.
dot elapsed time: = 29.017 seconds.

@yanboliang
Contributor Author

@mengxr I have a new, improved implementation for this issue at #10806; let's move the discussion there and review that code. I'm closing this PR now.
