[SPARK-5987] [MLlib] Save/load for GaussianMixtureModels #4986

MechCoder · 2015-03-11T20:11:10Z

Should be self-explanatory.

MechCoder · 2015-03-11T20:12:37Z

cc: @mengxr @jkbradley

SparkQA · 2015-03-11T20:12:49Z

Test build #28481 has started for PR 4986 at commit 4898d57.

This patch merges cleanly.

SparkQA · 2015-03-11T21:33:16Z

Test build #28481 has finished for PR 4986 at commit 4898d57.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(weights: Array[Double], mus: Array[Vector], sigmas: Array[Array[Double]])

AmplabJenkins · 2015-03-11T21:33:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28481/
Test PASSed.

mengxr · 2015-03-12T23:43:51Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala

Should be private.

Agreed it should be private, but then it should be private in all other files as well.

AmplabJenkins · 2015-03-13T19:52:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28583/
Test FAILed.

shaneknapp · 2015-03-13T19:52:58Z

jenkins, test this please

MechCoder · 2015-03-13T19:55:00Z

@mengxr I am not sure whether we should flatten it. Would it be worth it if the number of clusters is large? Also, I think it would be better to deal with MatrixUDT after this PR is done. wdyt?

SparkQA · 2015-03-13T19:58:06Z

Test build #28584 has started for PR 4986 at commit 9aaa535.

This patch merges cleanly.

mengxr · 2015-03-13T20:27:25Z

The number of clusters won't be very large. Flattening an Array[Array[Double]] doesn't copy the data, so there is no overhead. The content of the Parquet file is easy to inspect if we list each center as a record. I think we should just use Array[Double] instead of being blocked by MatrixUDT. GMM models are usually dense.
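
For readers following the flattening discussion, here is a minimal sketch of storing a dense square covariance matrix as a single Array[Double] and restoring it. It is not the code from this PR: the object name `SigmaFlattening`, both helper names, and the column-major ordering are assumptions chosen for illustration, and this version builds a new array on each call.

```scala
object SigmaFlattening {
  // Flatten a square matrix, given as an array of rows, into one
  // column-major array: entry (i, j) lands at index j * n + i.
  def flatten(rows: Array[Array[Double]]): Array[Double] = {
    val n = rows.length
    require(rows.forall(_.length == n), "expected a square matrix")
    Array.tabulate(n * n) { k => rows(k % n)(k / n) }
  }

  // Rebuild the n x n matrix (as an array of rows) from the flat array.
  def unflatten(values: Array[Double], n: Int): Array[Array[Double]] = {
    require(values.length == n * n, "length must equal n * n")
    Array.tabulate(n, n) { (i, j) => values(j * n + i) }
  }
}
```

Because the matrix is square, the dimension can be recovered from the flat length, so a record only needs the flat array itself.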

AmplabJenkins · 2015-03-13T20:46:38Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28584/
Test FAILed.

shaneknapp · 2015-03-13T20:48:22Z

jenkins, test this please

SparkQA · 2015-03-13T20:53:14Z

Test build #28588 has started for PR 4986 at commit 9aaa535.

This patch merges cleanly.

SparkQA · 2015-03-13T22:15:28Z

Test build #28588 has finished for PR 4986 at commit 9aaa535.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(weights: Array[Double], mus: Array[Vector], sigmas: Array[Array[Double]])

AmplabJenkins · 2015-03-13T22:15:34Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28588/
Test PASSed.

MechCoder · 2015-03-14T05:38:48Z

@mengxr I think I have addressed your comments. sigmas is now stored as an Array of Doubles. Do you have any more comments? Thanks!

SparkQA · 2015-03-14T05:43:08Z

Test build #28607 has started for PR 4986 at commit 4321743.

This patch merges cleanly.

SparkQA · 2015-03-14T07:04:02Z

Test build #28607 has finished for PR 4986 at commit 4321743.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(weights: Array[Double], mus: Array[Vector], sigmas: Array[Double])

AmplabJenkins · 2015-03-14T07:04:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28607/
Test PASSed.

MechCoder · 2015-03-21T06:59:50Z

@mengxr I rebased over master and used MatrixUDT. Please review! :)

SparkQA · 2015-03-21T07:03:09Z

Test build #28937 has started for PR 4986 at commit 23d707e.

This patch merges cleanly.

mengxr · 2015-03-24T18:59:41Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala

This is not efficient because it may trigger multiple passes over the Parquet file. Let's call collect() directly.
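
The difference between several passes and a single collect() can be sketched without Spark. Everything below is a hypothetical stand-in rather than the PR's code: `Component` mimics one stored record using plain arrays in place of MLlib's Vector and Matrix, and `CountingSource` imitates a lazily evaluated dataset in which every action re-reads the storage.

```scala
object SinglePassLoad {
  // Hypothetical stand-in for one stored mixture component.
  final case class Component(weight: Double, mu: Array[Double], sigma: Array[Double])

  // Hypothetical stand-in for a lazily evaluated dataset: each call to
  // collect() counts as one full pass over the underlying storage.
  final class CountingSource(records: Seq[Component]) {
    private var passCount = 0
    def passes: Int = passCount
    def collect(): Array[Component] = { passCount += 1; records.toArray }
  }

  type Loaded = (Array[Double], Array[Array[Double]], Array[Array[Double]])

  // One action per field: three passes over the source.
  def loadPerField(src: CountingSource): Loaded =
    (src.collect().map(_.weight), src.collect().map(_.mu), src.collect().map(_.sigma))

  // Collect once, then split the local array: a single pass.
  def loadOnce(src: CountingSource): Loaded = {
    val rows = src.collect()
    (rows.map(_.weight), rows.map(_.mu), rows.map(_.sigma))
  }
}
```

Since the number of mixture components is small, holding all records in a local array is cheap, and every later transformation then runs on local data.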

MechCoder · 2015-03-24T19:31:08Z

@mengxr fixed!

SparkQA · 2015-03-24T19:33:25Z

Test build #29101 has started for PR 4986 at commit e7a14cb.

This patch merges cleanly.

mengxr · 2015-03-24T20:15:30Z

docs/mllib-clustering.md

Please also update the Java example.

SparkQA · 2015-03-24T20:50:23Z

Test build #29101 has finished for PR 4986 at commit e7a14cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(weight: Double, mu: Vector, sigma: Matrix)

AmplabJenkins · 2015-03-24T20:50:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29101/
Test PASSed.

MechCoder · 2015-03-25T04:47:56Z

@mengxr I have addressed your comments. Please have a look!

SparkQA · 2015-03-25T04:48:15Z

Test build #29148 has started for PR 4986 at commit 7d2cd56.

This patch merges cleanly.

SparkQA · 2015-03-25T06:08:49Z

Test build #29148 has finished for PR 4986 at commit 7d2cd56.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(weight: Double, mu: Vector, sigma: Matrix)

AmplabJenkins · 2015-03-25T06:08:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29148/
Test PASSed.

mengxr · 2015-03-25T21:45:44Z

LGTM. Merged into master. Thanks!!

MechCoder · 2015-03-26T17:13:24Z

@mengxr thanks for the merge! For supporting this in PySpark, we would need support for MatrixUDT, which would need support for sparse matrices, right? I could not find any existing JIRA related to sparse matrix support. If you are able to, please link me to it.

mengxr reviewed Mar 12, 2015

MechCoder mentioned this pull request Mar 18, 2015

[SPARK-6364] [MLlib] Implement equals and hashcode for Matrix #5081

Closed

MechCoder added 3 commits March 21, 2015 03:06

[SPARK-5987] Save/load for GaussianMixtureModels

cb77095

Minor

b9794e4

Store sigmas as Array[Double] instead of Array[Array[Double]]

7422bb4

MechCoder force-pushed the spark-5987 branch from 4321743 to 23d707e Compare March 21, 2015 06:58

Rebased over master and used MatrixUDT

505bd57

mengxr reviewed Mar 24, 2015

Minor

e7a14cb

mengxr reviewed Mar 24, 2015

Iterate over dataframe in a better way

7d2cd56

MechCoder force-pushed the spark-5987 branch from 1706b8e to 7d2cd56 Compare March 25, 2015 04:45

asfgit closed this in 4fc4d03 Mar 25, 2015

MechCoder deleted the spark-5987 branch March 26, 2015 03:01
