Conversation

@yanboliang
Contributor

  • Use BLAS Level 3 matrix-matrix multiplication to compute pairwise distances in k-means.
  • Remove the runs-related code completely, since runs has no effect after this change.

Note: This is the new implementation to replace #10306. cc @mengxr
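
For readers unfamiliar with the trick, here is a minimal sketch of how pairwise squared distances reduce to one matrix product, using the identity ||x - c||^2 = ||x||^2 + ||c||^2 - 2 (x . c). It is an illustration in plain Python, not the code in this PR: the function names are hypothetical, and `matmul_transposed` is only a stand-in for the BLAS Level 3 gemm call that an optimized implementation would use.

```python
# Pairwise squared Euclidean distances from a single matrix product:
#   ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c)
# All names here are illustrative assumptions, not Spark APIs.

def matmul_transposed(a, b):
    """Return a @ b^T for lists of rows (stand-in for a BLAS gemm call)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]

def pairwise_sq_dists(points, centers):
    """Squared distance from every point to every center."""
    point_norms = [sum(v * v for v in p) for p in points]
    center_norms = [sum(v * v for v in c) for c in centers]
    cross = matmul_transposed(points, centers)  # the only O(n * k * d) step
    # Clamp at zero because round-off can make the expansion slightly negative.
    return [[max(point_norms[i] + center_norms[j] - 2.0 * cross[i][j], 0.0)
             for j in range(len(centers))]
            for i in range(len(points))]

def assign(points, centers):
    """Index of the nearest center for each point."""
    dists = pairwise_sq_dists(points, centers)
    return [min(range(len(row)), key=row.__getitem__) for row in dists]

points = [[0.0, 0.0], [1.0, 0.0], [9.0, 10.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]
labels = assign(points, centers)
```

The norms are computed once per point and once per center, so the dominant cost is the matrix product, which is the part an optimized BLAS accelerates.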

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49596 has finished for PR 10806 at commit d0653cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49598 has finished for PR 10806 at commit 68d830c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Feb 26, 2016

cc: @avulanov

Contributor

Collection methods might be expensive in that case, because fill creates an array of arrays and copies the data into it, and then flatten creates another array and copies the data again. Copy the arrays directly with System.arraycopy in an additional loop instead.
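
The cost difference described above can be sketched with a Python analog. This is not the JVM code under review: the function names are hypothetical, and slice assignment into a preallocated `array` plays the role that System.arraycopy would play in the actual patch.

```python
from array import array

def flatten_via_nested(rows):
    """Build an intermediate list of lists, then flatten it: two full copies."""
    nested = [list(r) for r in rows]                    # copy 1: array of arrays
    return array("d", [v for r in nested for v in r])   # copy 2: flattened result

def flatten_direct(rows, width):
    """Preallocate the flat buffer once and copy each row into its slot."""
    flat = array("d", [0.0]) * (len(rows) * width)
    for i, r in enumerate(rows):
        # Same-length slice assignment overwrites in place without resizing.
        flat[i * width:(i + 1) * width] = array("d", r)
    return flat

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
direct = flatten_direct(rows, 3)
nested = flatten_via_nested(rows)
```

Both functions return the same flat buffer; the direct version avoids materializing the whole data set a second time.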

@avulanov
Contributor

avulanov commented Mar 3, 2016

@yanboliang @mengxr I made one pass.

@yanboliang
Contributor Author

@avulanov Thanks for your comments. I will update this PR soon.

@NarineK
Contributor

NarineK commented Mar 9, 2016

Hi everyone,

@yanboliang, thanks for optimizing Kmeans.

I have one question.
Is it possible to add the within-cluster sum of squares (in total and for each individual cluster) and the between-cluster sum of squares? Since kmeans has been exposed in SparkR and R supports those, it would be good if we could expose them too.

We can also work on it in a separate jira/PR. I just want to know your opinion on that.

Thanks,
Narine
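
To make the request concrete, here is a minimal sketch of the statistics in question, loosely mirroring the withinss, tot.withinss, and betweenss fields that R's kmeans reports. The function and variable names are hypothetical, and the sketch assumes each center is the mean of its assigned points; it is not an existing Spark API.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_sums_of_squares(points, centers, labels):
    """Return (within per cluster, total within, between) sums of squares."""
    k = len(centers)
    within = [0.0] * k
    counts = [0] * k
    for p, label in zip(points, labels):
        within[label] += sq_dist(p, centers[label])
        counts[label] += 1
    dim = len(points[0])
    grand_mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Between-cluster term: each center's distance to the grand mean,
    # weighted by the number of points assigned to that center.
    between = sum(counts[j] * sq_dist(centers[j], grand_mean) for j in range(k))
    return within, sum(within), between

points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
centers = [[1.0, 0.0], [11.0, 0.0]]
labels = [0, 0, 1, 1]
within, total_within, between = cluster_sums_of_squares(points, centers, labels)
```

When the centers are the cluster means, total within plus between equals the total sum of squares around the grand mean, which gives a quick sanity check.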

@mengxr
Contributor

mengxr commented Mar 21, 2016

@yanboliang Do you have time to update this PR? Could you also post the latest performance results after the update? Thanks!

@NarineK Yes, I think it is useful to add those statistics (tracked in a separate JIRA). So please create a JIRA for it and include links to sklearn and R methods.

@yanboliang
Contributor Author

@mengxr Sorry for the late response. I will update it and post the latest performance results soon.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53924 has finished for PR 10806 at commit e166e86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53926 has finished for PR 10806 at commit 85b4122.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53922 has finished for PR 10806 at commit 5b76bd9.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

@mengxr @avulanov After the update, KMeans runs about 2.4 times as fast as the original version. Here are the latest performance test results (I ran the test multiple times and took the average):
After this PR (with optimized BLAS installed):

Iterations took 14.773 seconds.
KMeans reached the max number of iterations: 100.

The original version:

Iterations took 35.689 seconds.
KMeans reached the max number of iterations: 100.
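
As a quick check of the arithmetic, the two timings above can be turned into a speedup factor and a per-iteration cost. The variable names are illustrative; the only inputs are the reported totals and the iteration count of 100.

```python
original_seconds = 35.689    # reported total for the original version
optimized_seconds = 14.773   # reported total after this PR
iterations = 100             # both runs hit the max number of iterations

speedup = original_seconds / optimized_seconds
original_per_iter = original_seconds / iterations
optimized_per_iter = optimized_seconds / iterations

print(f"speedup: {speedup:.2f}x")
print(f"per iteration: {original_per_iter:.3f}s -> {optimized_per_iter:.3f}s")
```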

@NarineK
Contributor

NarineK commented Mar 23, 2016

Thanks @yanboliang, I'll create the jira soon!

@mengxr
Contributor

mengxr commented Mar 25, 2016

I will make another pass soon:)

@thunterdb
Contributor

I see that the size of the blocks can be tuned and is fairly small by default (128). Out of curiosity, how did you pick this number, instead of the full size of the partition for example?

@avulanov
Contributor

Good point. According to the matrix multiplication benchmark linked below, we can get peak performance on modern CPUs with square matrices somewhere between 4Kx4K and 8Kx8K. So it is worth using a batch larger than 128.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
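
The block size under discussion controls how many rows go into each matrix product. A minimal sketch of that batching, assuming a hypothetical `row_blocks` helper and `block_size` parameter (neither name comes from the PR), shows how the choice changes the number and shape of the blocks:

```python
def row_blocks(rows, block_size=128):
    """Yield consecutive slices of at most block_size rows."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]

rows = list(range(300))  # stand-in for 300 data points in one partition
sizes_small = [len(b) for b in row_blocks(rows, 128)]
sizes_large = [len(b) for b in row_blocks(rows, 4096)]
```

With a block size of 128 the 300 rows become three small products; with 4096 the whole partition fits in one. Larger blocks give the BLAS routine more work per call, at the cost of a larger temporary distance matrix per block.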

@yanboliang changed the title from "[SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation" to "[SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans implementation" on Apr 22, 2016

@SparkQA

SparkQA commented Jun 13, 2016

Test build #60396 has finished for PR 10806 at commit 85b4122.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Closing this one and moving the updated version to #14937; let's discuss there. Thanks!

@yanboliang closed this on Sep 2, 2016
@yanboliang deleted the spark-8519-new branch on September 2, 2016 13:05

6 participants