
Conversation

@yanboliang
Contributor

Note: I have a new implementation for this issue at #10806; let's move the discussion there and review that code.

  • Use BLAS Level 3 matrix-matrix multiplications to compute pairwise distances in k-means (a sketch of the idea follows this list).
  • Remove the runs-related code completely; it will have no effect after this change.
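
For context, here is a minimal sketch (not the PR's actual code) of the blockified idea: expand ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2, so that all m * k cross terms come from a single GEMM instead of m * k vector operations. The variable names are illustrative, and the snippet assumes it runs inside Spark's own code base, since org.apache.spark.mllib.linalg.BLAS is private[spark]:

    import java.util.Random
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices}
    import org.apache.spark.mllib.linalg.BLAS.gemm

    val m = 1000  // number of points
    val k = 10    // number of centers
    val d = 100   // feature dimension
    val random = new Random()

    // Points and centers stored row-wise.
    val points = Matrices.randn(m, d, random).asInstanceOf[DenseMatrix]
    val centers = Matrices.randn(k, d, random).asInstanceOf[DenseMatrix]

    // Squared row norms ||x_i||^2 and ||c_j||^2, computed once and reused.
    val pointNorms = Array.tabulate(m)(i => (0 until d).map(l => points(i, l) * points(i, l)).sum)
    val centerNorms = Array.tabulate(k)(j => (0 until d).map(l => centers(j, l) * centers(j, l)).sum)

    // All m * k cross terms -2 * <x_i, c_j> in a single Level 3 call.
    val cross = DenseMatrix.zeros(m, k)
    gemm(-2.0, points, centers.transpose, 0.0, cross)

    // ||x_i - c_j||^2 = ||x_i||^2 - 2 * <x_i, c_j> + ||c_j||^2
    val distances = Array.tabulate(m, k)((i, j) => pointNorms(i) + centerNorms(j) + cross(i, j))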

Update:

  • I ran a performance test, but found the new version is about 1.5x slower than the old one.
    Furthermore, I traced the call stack and found that constructing the pointMatrix, centerMatrix, and distanceMatrix is expensive. I tried computing and caching pointMatrix and centerMatrix in advance, but it still showed no benefit. Looking forward to others' comments.
  • I also found that gemm is slower than axpy in the mllib.linalg package. Consider the following code, which abstracts the k-means distance computation in the new and old versions:
    import java.util.Random
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Vectors}
    // BLAS is private[spark]; this snippet runs inside Spark's own code base.
    import org.apache.spark.mllib.linalg.BLAS.{axpy, gemm}

    val n = 3000
    val random = new Random()

    // New version: a single Level 3 call on n x n matrices.
    val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]
    val a1 = Matrices.randn(n, n, random).asInstanceOf[DenseMatrix]
    val b1 = Matrices.randn(n, n, random).asInstanceOf[DenseMatrix]

    val start1 = System.nanoTime()
    gemm(2.0, a1, b1, 2.0, c1)
    println("gemm elapsed time: = %.3f".format((System.nanoTime() - start1) / 1e9) + " seconds.")

    // Old version: n * n Level 1 calls on length-n vectors.
    val a2 = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val aa = Array.fill(n)(a2)
    val b2 = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val bb = Array.fill(n)(b2)

    val start2 = System.nanoTime()
    for (i <- 0 until n; j <- 0 until n) {
      axpy(1.0, bb(j), aa(j))
    }
    println("axpy elapsed time: = %.3f".format((System.nanoTime() - start2) / 1e9) + " seconds.")

I got the following performance results on my Mac:

gemm elapsed time: = 26.519 seconds.
axpy elapsed time: = 20.660 seconds.

This suggests we cannot get a benefit from BLAS Level 3 matrix-matrix multiplications when computing pairwise distances. I also found complaints from others that are similar to this issue (OpenMathLib/OpenBLAS#528).
If I replace axpy(1.0, bb(j), aa(j)) with dot(bb(j), aa(j)), I get:

gemm elapsed time: = 28.938 seconds.
dot elapsed time: = 28.574 seconds.

Please correct me if I have misunderstood something.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47724 has finished for PR 10306 at commit 46816f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47809 has finished for PR 10306 at commit 347d3ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Disable this test case because it was blocked by SPARK-12363.

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47912 has finished for PR 10306 at commit 8f76116.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [SPARK-8519] [ML] [MLlib] Blockify distance computation in k-means [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation Dec 28, 2015
@mengxr
Contributor

mengxr commented Jan 15, 2016

@yanboliang Regarding your local performance test:

  1. Make sure you have installed an optimized BLAS on your system and loaded it correctly in the JVM via netlib-java. The difference should be significant at 3000x3000 (with or without multi-threading).
  2. Your test of GEMM and AXPY is not equivalent. First, they do not use the same matrices for the multiplication. Second, axpy(1.0, bb(j), aa(j)) should be axpy(1.0, bb(j), aa(i)); otherwise you get some benefit from local caching.
  3. There are some issues with JVM performance testing. Usually you need to warm up the virtual machine (with some trial runs) and then run the test multiple times and take the average (see the sketch after this list).
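
To make point 3 concrete, here is a minimal harness sketch of that methodology (the helper name bench is hypothetical, not from the PR):

    // Warm up the JVM so the JIT compiles the hot path, then average several timed runs.
    def bench(name: String, warmups: Int = 10, runs: Int = 10)(body: => Unit): Unit = {
      for (_ <- 0 until warmups) body  // trial runs, timings discarded
      val start = System.nanoTime()
      for (_ <- 0 until runs) body
      val avgSeconds = (System.nanoTime() - start) / 1e9 / runs
      println(name + " elapsed time: = " + "%.3f".format(avgSeconds) + " seconds (avg of " + runs + " runs).")
    }

Usage would look like bench("gemm") { gemm(2.0, a1, b1, 2.0, c1) }, which is essentially the pattern the updated test below follows.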

Could you re-run the test? I will take a look at your implementation.

@yanboliang
Contributor Author

@mengxr Thanks for the reminder. I will check my environment and re-run the test.

@yanboliang
Contributor Author

@mengxr I found a misconfiguration in my test environment and fixed it, thanks! I also updated the test cases based on your advice. Now gemm is about 20-30 times faster than axpy/dot in the updated tests.

    import java.util.Random
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Vectors}
    // BLAS is private[spark]; this snippet runs inside Spark's own code base.
    import org.apache.spark.mllib.linalg.BLAS.{dot, gemm}

    // Print which BLAS backend netlib-java loaded.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)

    val n = 3000
    val count = 10
    val random = new Random()

    val a = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val aa = Array.fill(n)(a)
    val b = Vectors.dense(Array.fill(n)(random.nextDouble()))
    val bb = Array.fill(n)(b)

    val a1 = new DenseMatrix(n, n, aa.flatMap(_.toArray), true)
    val b1 = new DenseMatrix(n, n, bb.flatMap(_.toArray), false)
    val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]

    // Warm-up (trial) runs for the JIT.
    for (i <- 0 until 10) {
      gemm(2.0, a1, b1, 2.0, c1)
    }

    var total1 = 0.0
    for (i <- 0 until count) {
      val start = System.nanoTime()
      gemm(2.0, a1, b1, 2.0, c1)
      total1 += (System.nanoTime() - start) / 1e9
    }
    total1 = total1 / count
    println("gemm elapsed time: = %.3f".format(total1) + " seconds.")

    // Warm-up (trial) runs for the JIT.
    for (m <- 0 until 10) {
      for (i <- 0 until n; j <- 0 until n) {
        dot(bb(j), aa(i))
      }
    }

    var total2 = 0.0
    for (m <- 0 until count) {
      val start = System.nanoTime()
      for (i <- 0 until n; j <- 0 until n) {
        // axpy(1.0, bb(j), aa(i))
        dot(bb(j), aa(i))
      }
      total2 += (System.nanoTime() - start) / 1e9
    }
    total2 = total2 / count
    println("dot elapsed time: = %.3f".format(total2) + " seconds.")

The output is:

com.github.fommil.netlib.NativeRefBLAS
gemm elapsed time: = 1.022 seconds.
dot elapsed time: = 29.017 seconds.

@yanboliang
Contributor Author

@mengxr I have a new, improved implementation for this issue at #10806; let's move the discussion there and review that code. I'm closing this PR now.
