[SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans implementation #10806

yanboliang · 2016-01-18T14:54:23Z

Use BLAS Level 3 matrix-matrix multiplications to compute pairwise distances in k-means.
Remove the runs-related code completely; the runs parameter will have no effect after this change.

Note: This is the new implementation that replaces #10306. cc @mengxr
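
For readers unfamiliar with the idea, the following is a minimal sketch of why a matrix-matrix product can replace per-pair distance loops. It relies on the identity ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x·c, where the cross terms for all points and centers form one matrix product. This is plain Python for illustration, not the Scala code in this PR; the function names `matmul_transposed` and `pairwise_sq_dists` are hypothetical, and a real implementation would hand the product to a BLAS gemm routine.

```python
# Sketch only: plain-Python stand-in for a GEMM-based distance computation.
# All function names here are hypothetical and do not come from the PR.

def matmul_transposed(a, b):
    """Return a * b^T for lists of rows; this is the step BLAS gemm would perform."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]

def pairwise_sq_dists(points, centers):
    """Squared Euclidean distance from every point to every center."""
    point_norms = [sum(x * x for x in p) for p in points]
    center_norms = [sum(x * x for x in c) for c in centers]
    cross = matmul_transposed(points, centers)  # all dot products in one product
    return [
        [point_norms[i] + center_norms[j] - 2.0 * cross[i][j]
         for j in range(len(centers))]
        for i in range(len(points))
    ]

points = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
centers = [[0.0, 0.0], [3.0, 3.0]]
dists = pairwise_sq_dists(points, centers)
# Index of the closest center for each point.
assignments = [min(range(len(row)), key=row.__getitem__) for row in dists]
```

The squared norms are computed once per point and once per center, so the only step that scales with points × centers × dimensions is the product itself.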

SparkQA · 2016-01-18T15:47:16Z

Test build #49596 has finished for PR 10806 at commit d0653cb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-18T16:54:28Z

Test build #49598 has finished for PR 10806 at commit 68d830c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-02-26T21:58:36Z

cc: @avulanov

avulanov · 2016-03-03T00:40:11Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

Collection methods might be expensive in this case: fill creates an array of arrays and copies the data into it, and then flatten creates another array and copies the data again. Instead, copy each array directly with System.arraycopy in an explicit loop.
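
As a rough illustration of the suggestion, here is a Python analog, assuming the goal is to pack equally sized rows into one flat buffer. On the JVM the per-row copy would be System.arraycopy; here a slice assignment on a preallocated `array` plays that role. The name `flatten_rows` is hypothetical.

```python
from array import array

def flatten_rows(rows, dim):
    """Pack rows of length dim into one preallocated flat buffer."""
    flat = array("d", [0.0]) * (len(rows) * dim)  # single allocation up front
    for i, row in enumerate(rows):
        # One bulk copy per row, no intermediate array of arrays.
        flat[i * dim:(i + 1) * dim] = row
    return flat

rows = [array("d", [1.0, 2.0]), array("d", [3.0, 4.0]), array("d", [5.0, 6.0])]
flat = flatten_rows(rows, 2)
```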

avulanov · 2016-03-03T01:55:39Z

@yanboliang @mengxr I made one pass.

yanboliang · 2016-03-03T07:19:55Z

@avulanov Thanks for your comments. I will update this PR soon.

NarineK · 2016-03-09T00:16:19Z

Hi everyone,

@yanboliang, thanks for optimizing KMeans.

I have one question.
Is it possible to add the within-cluster sum of squares (in total and for each individual cluster) and the between-cluster sum of squares? Since kmeans has been exposed in SparkR and R supports those, it would be good if we could expose them too.
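
To make the request concrete, here is a minimal sketch of those statistics in plain Python. The result names follow the components returned by R's kmeans (withinss, tot.withinss, betweenss, totss); the helper names are hypothetical, and the sketch assumes every cluster has at least one point.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_sums_of_squares(points, labels, k):
    """Return (withinss per cluster, total withinss, betweenss, totss)."""
    clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(k)]
    centers = [mean(c) for c in clusters]  # assumes no empty cluster
    withinss = [sum(sq_dist(p, ctr) for p in c) for c, ctr in zip(clusters, centers)]
    grand = mean(points)
    betweenss = sum(len(c) * sq_dist(ctr, grand) for c, ctr in zip(clusters, centers))
    totss = sum(sq_dist(p, grand) for p in points)
    return withinss, sum(withinss), betweenss, totss

points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
labels = [0, 0, 1, 1]
withinss, tot_withinss, betweenss, totss = cluster_sums_of_squares(points, labels, 2)
```

When the centers are the cluster means, the total sum of squares equals the total within-cluster sum plus the between-cluster sum, which gives a quick consistency check.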

We can also work on it in a separate jira/PR. I just want to know your opinion on that.

Thanks,
Narine

mengxr · 2016-03-21T19:43:00Z

@yanboliang Do you have time to update this PR? Could you also post the latest performance results after the update? Thanks!

@NarineK Yes, I think it is useful to add those statistics (tracked in a separate JIRA). So please create a JIRA for it and include links to sklearn and R methods.

yanboliang · 2016-03-22T02:11:06Z

@mengxr Sorry for the late response. I will update it and post the latest performance results soon.

SparkQA · 2016-03-23T10:41:29Z

Test build #53924 has finished for PR 10806 at commit e166e86.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-23T10:42:27Z

Test build #53926 has finished for PR 10806 at commit 85b4122.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-23T12:11:40Z

Test build #53922 has finished for PR 10806 at commit 5b76bd9.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

yanboliang · 2016-03-23T13:22:53Z

@mengxr @avulanov After the update, KMeans runs about 2.4 times as fast as the original version (35.689 s vs. 14.773 s for 100 iterations). Here are the latest performance test results (I ran the test multiple times and took the average):
After this PR (with optimized BLAS installed):

Iterations took 14.773 seconds.
KMeans reached the max number of iterations: 100.

The original version:

Iterations took 35.689 seconds.
KMeans reached the max number of iterations: 100.

NarineK · 2016-03-23T22:41:06Z

Thanks @yanboliang, I'll create the jira soon!

mengxr · 2016-03-25T02:49:52Z

I will make another pass soon:)

thunterdb · 2016-04-12T20:03:42Z

I see that the size of the blocks can be tuned and is fairly small by default (128). Out of curiosity, how did you pick this number, instead of the full size of the partition for example?

avulanov · 2016-04-13T02:07:12Z

Good point. According to the matrix multiplication benchmark linked below, peak performance on modern CPUs is reached with square matrices somewhere between 4Kx4K and 8Kx8K. So it is worth using a bigger batch than 128.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
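
As a sketch of what the block size controls, the following plain-Python example groups points into blocks and computes the point-center cross terms one block at a time, so each block corresponds to one matrix product. The names `blocks`, `nearest_center_blocked`, and `block_size` are hypothetical and only loosely mirror the blockSize setting discussed here. The result does not depend on the block size; only the shape of each product changes, and that shape is what determines BLAS throughput.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def blocks(rows, block_size):
    """Yield consecutive slices of at most block_size rows."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]

def nearest_center_blocked(points, centers, block_size):
    """Assign each point to its closest center, processing one block at a time."""
    center_norms = [dot(c, c) for c in centers]
    assignments = []
    for block in blocks(points, block_size):
        # Stand-in for one gemm call: (block rows) x (centers) cross terms.
        cross = [[dot(p, c) for c in centers] for p in block]
        for p, row in zip(block, cross):
            pn = dot(p, p)
            d = [pn + cn - 2.0 * x for cn, x in zip(center_norms, row)]
            assignments.append(min(range(len(d)), key=d.__getitem__))
    return assignments

points = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [4.0, 3.0], [0.0, 1.0]]
centers = [[0.0, 0.0], [3.0, 3.0]]
small = nearest_center_blocked(points, centers, 2)
large = nearest_center_blocked(points, centers, 128)
```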

SparkQA · 2016-06-13T10:53:40Z

Test build #60396 has finished for PR 10806 at commit 85b4122.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

yanboliang · 2016-09-02T13:05:09Z

Closing this one and moving the updated version to #14937; let's discuss there. Thanks!

yanboliang mentioned this pull request Jan 18, 2016

[SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation #10306

Closed

avulanov reviewed Mar 3, 2016

yanboliang force-pushed the spark-8519-new branch from 68d830c to 5b76bd9 March 23, 2016 09:43

yanboliang added 7 commits March 23, 2016 17:54

Initial draft of KMeans optimization

a7edbeb

Disable one test of PIC

950c6d0

Mark stack size configured

52cdf8e

use first() rather than take()

89826de

optimize pointsNormArray and broadcast centers array

15c33df

support set/get blockSize

69beb5a

update docs

e166e86

yanboliang force-pushed the spark-8519-new branch from 5b76bd9 to e166e86 March 23, 2016 09:54

fix typos

85b4122

yanboliang mentioned this pull request Apr 22, 2016

[SPARK-11559] [MLlib] Make runs no effect in mllib.KMeans #12608

Closed

yanboliang changed the title ~~[SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation~~ [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans implementation Apr 22, 2016

yanboliang mentioned this pull request Sep 2, 2016

[SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans implementation. #14937

Closed

yanboliang closed this Sep 2, 2016

yanboliang deleted the spark-8519-new branch September 2, 2016 13:05

6 participants