[SPARK-6113] [ml] Tree ensembles for Pipelines API #5626

jkbradley · 2015-04-22T05:10:06Z

This is a continuation of apache#5530 (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees. Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.

This PR follows the example set by the previous PR for Decision Trees. It includes a few cleanups to Decision Trees.

Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap. I plan to submit a separate PR which makes those values in Model be Options. It does not matter much which PR gets merged first.
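To illustrate the note above, here is a minimal sketch of what "those values in Model be Options" could look like. Every type here (`EstimatorSketch`, `ParamMapSketch`, `TreeModelSketch`, `EnsembleModelSketch`) is a hypothetical stand-in invented for illustration, not the spark.ml API, and the follow-up PR may well take a different shape.

```scala
// Hypothetical stand-ins; these are NOT the spark.ml Estimator, ParamMap, or Model types.
trait EstimatorSketch { def uid: String }
final case class ParamMapSketch(values: Map[String, Any])

// A component tree inside an ensemble: it was never fit on its own,
// so its parent and fitting params default to None.
final case class TreeModelSketch(
    numNodes: Int,
    parent: Option[EstimatorSketch] = None,
    fittingParamMap: Option[ParamMapSketch] = None)

// The ensemble itself carries the same two optional fields.
final case class EnsembleModelSketch(
    trees: Seq[TreeModelSketch],
    parent: Option[EstimatorSketch],
    fittingParamMap: Option[ParamMapSketch])

object OptionalParentDemo {
  private val forest: EstimatorSketch = new EstimatorSketch { val uid = "rf_sketch" }

  // The ensemble knows its estimator; its component trees do not.
  val ensemble: EnsembleModelSketch = EnsembleModelSketch(
    trees = Seq(TreeModelSketch(numNodes = 3), TreeModelSketch(numNodes = 5)),
    parent = Some(forest),
    fittingParamMap = Some(ParamMapSketch(Map("numTrees" -> 2))))

  // Callers must handle the missing-parent case explicitly.
  def describeParent(p: Option[EstimatorSketch]): String =
    p.map(_.uid).getOrElse("<no parent>")
}
```

With `Option`, a component tree can state that it has no parent instead of holding a null or a placeholder value.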

CC: @mengxr @manishamde @codedeft @chouqin

SparkQA · 2015-04-22T05:15:08Z

Test build #30729 has finished for PR 5626 at commit ea3d901.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- case class Params(
- final class GBTClassificationModel(
- final class GBTRegressionModel(
- trait TreeEnsembleModel
- case class Explode(child: Expression)
This patch does not change any dependencies.

SparkQA · 2015-04-22T17:59:43Z

Test build #30764 has finished for PR 5626 at commit 855aa9a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- case class Params(
- final class GBTClassificationModel(
- final class GBTRegressionModel(
- trait TreeEnsembleModel
This patch does not change any dependencies.

mengxr · 2015-04-23T18:07:16Z

mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala

Should it go to shared params? I see the problem with the doc. If we want to put something special, we can put it in the JavaDoc. No strong preference about this. But it makes me wonder whether we should mark shared params final.

mengxr · 2015-04-23T18:15:53Z

@jkbradley I made one pass on the public APIs. There are some issues from the ml.DT PR:

- Node.prediction should say "leaf" node instead of "internal": https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L33
- CategoricalSplit.getLeftCategories: maybe this should be a property since it is immutable. No strong preference. Btw, I'm thinking about using BitSet to store left category indices to save storage.

We can address those in a separate PR. I will take another pass on the implementation.
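A rough sketch of the BitSet idea mentioned above, assuming categories are indexed 0 until numCategories. `CategoricalSplitSketch` and its members are hypothetical names for illustration; this is not the spark.ml `CategoricalSplit` class. The category lists are exposed as immutable `val` properties rather than getter methods, along the lines of the other suggestion.

```scala
import scala.collection.immutable.BitSet

// Hypothetical sketch; NOT the spark.ml CategoricalSplit class.
final class CategoricalSplitSketch(
    val featureIndex: Int,
    leftCategoryIndices: Seq[Int],
    val numCategories: Int) {

  require(
    leftCategoryIndices.forall(c => c >= 0 && c < numCategories),
    s"Left categories must lie in [0, $numCategories)")

  // One bit per category index instead of one boxed value per left category.
  private val left: BitSet = BitSet(leftCategoryIndices: _*)

  // Membership test used when routing an instance down the tree.
  def goesLeft(category: Int): Boolean = left.contains(category)

  // Exposed as properties since the split is immutable.
  // A BitSet iterates in ascending order, so both arrays come out sorted.
  val leftCategories: Array[Double] = left.toArray.map(_.toDouble)
  val rightCategories: Array[Double] =
    (0 until numCategories).filterNot(left.contains).map(_.toDouble).toArray
}
```

The right-hand categories are derived from the complement, so only the left set needs to be stored.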

jkbradley · 2015-04-23T22:55:59Z

Updated! I think the only thing I didn't do was make stepSize a shared param. Copying from the comment above:

I'm hesitating about putting it in sharedParams since the intended range can differ between algorithms. For GBTs, it should be in (0, 1], but it could be different for other algs.

I updated the doc in Node.prediction, as well as getLeft/RightCategories. I'll make a JIRA for using BitSet internally for categories.
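To make the stepSize concern concrete, here is a small sketch in which each algorithm declares its own valid range for a parameter that happens to share a name. `RangedParam` is a hypothetical helper written for this example, not the spark.ml `Param` API, and the second algorithm with a "any positive value" range is an assumption made up for contrast.

```scala
// Hypothetical sketch; NOT the spark.ml Param API.
final case class RangedParam(name: String, doc: String, isValid: Double => Boolean) {
  // Returns the value unchanged, or throws IllegalArgumentException if out of range.
  def validate(value: Double): Double = {
    require(isValid(value), s"$name got invalid value $value. $doc")
    value
  }
}

object StepSizeDemo {
  // For GBTs, the step size should be in (0, 1], as stated in the comment above.
  val gbtStepSize: RangedParam =
    RangedParam("stepSize", "Step size in (0, 1].", v => v > 0.0 && v <= 1.0)

  // Assumed for illustration: some other algorithm that accepts any positive step size.
  val otherStepSize: RangedParam =
    RangedParam("stepSize", "Step size, must be positive.", v => v > 0.0)
}
```

A single shared definition would have to pick one of these ranges and one doc string, which is the hesitation about moving it into sharedParams.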

SparkQA · 2015-04-24T00:30:04Z

Test build #30882 has finished for PR 5626 at commit bbae2a2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- case class Params(
- final class GBTClassificationModel(
- trait HasSeed extends Params
- final class GBTRegressionModel(
This patch does not change any dependencies.

mengxr · 2015-04-24T05:36:24Z

mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala

Mention StringIndexer? Btw, we should pair TODOs with JIRAs.

mengxr · 2015-04-24T05:37:09Z

mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala

val arr = Array( LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0, 3.0, 1.0)), ...)

mengxr · 2015-04-24T05:39:22Z

LGTM except some minor inline comments.

jkbradley · 2015-04-24T21:08:36Z

Updated. The only remaining question is about the (private[ml]) notes. (See comment above.)

mengxr · 2015-04-24T21:55:41Z

test this please

mengxr · 2015-04-24T23:33:41Z

test this please

SparkQA · 2015-04-25T10:13:14Z

Test build #698 has finished for PR 5626 at commit 729167a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- case class Params(
- final class GBTClassificationModel(
- trait HasSeed extends Params
- final class GBTRegressionModel(
This patch adds the following new dependencies:
- tachyon-0.6.4.jar
- tachyon-client-0.6.4.jar
This patch removes the following dependencies:
- tachyon-0.5.0.jar
- tachyon-client-0.5.0.jar

mengxr · 2015-04-25T19:27:32Z

Merged into master. Thanks!

jkbradley · 2015-04-25T21:36:48Z

@mengxr Curious: Why does it say there are unmerged commits? (I checked, and the last commit was merged correctly.)

See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).

There are some notes:

1. I added `learningRate` to sharedParams since it is a common parameter for ML algorithms.
2. We will not support the transform for finding synonyms from a `Vector`; that will be supported in later JIRA issues.
3. Word2Vec differs from other ML models in that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5596 from yinxusen/SPARK-6529 and squashes the following commits:

- ee2b37a [Xusen Yin] merge with former HEAD
- 4945462 [Xusen Yin] merge with #5626
- 3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
- 5dd4ee7 [Xusen Yin] fix scala style
- 743e0d5 [Xusen Yin] fix comments and code style
- 04c48e9 [Xusen Yin] ensure the functionality
- a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
- 02848fa [Xusen Yin] refine comments
- 34a55c0 [Xusen Yin] fix errors
- 109d124 [Xusen Yin] add test suite and pass it
- 04dde06 [Xusen Yin] add shared params
- c594095 [Xusen Yin] add word2vec transformer
- 23d77fa [Xusen Yin] merge with #5626
- e8cfaf7 [Xusen Yin] fix conflict with master
- 66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
- 566ec20 [Xusen Yin] fix scala style
- b54399f [Xusen Yin] fix comments and code style
- 1211e86 [Xusen Yin] ensure the functionality
- 6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
- 7cde18f [Xusen Yin] rm sharedParams
- 618abd0 [Xusen Yin] refine comments
- e29680a [Xusen Yin] fix errors
- fe3afe9 [Xusen Yin] add test suite and pass it
- 02767fb [Xusen Yin] add shared params
- 6a514f1 [Xusen Yin] add word2vec transformer

This is a continuation of [apache#5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees. Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.

This PR follows the example set by the previous PR for Decision Trees. It includes a few cleanups to Decision Trees.

Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap. I plan to submit a separate PR which makes those values in Model be Options. It does not matter much which PR gets merged first.

CC: mengxr manishamde codedeft chouqin

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5626 from jkbradley/dt-api-ensembles and squashes the following commits:

- 729167a [Joseph K. Bradley] small cleanups based on code review
- bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
- 855aa9a [Joseph K. Bradley] scala style fix
- ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
- c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml. Not tested yet. Need to add example as well
- d045ebd [Joseph K. Bradley] some more updates, but far from done
- ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.


jkbradley added 4 commits April 21, 2015 22:02

Added files from old PR and did some initial updates.

ee1a10b

some more updates, but far from done

d045ebd

Added random forests and test suites to spark.ml. Not tested yet. Need to add example as well

c0f30c1

Added GBT to spark.ml, with tests and examples

ea3d901

scala style fix

855aa9a

mengxr reviewed Apr 23, 2015

Updated per all comments in code review

bbae2a2

mengxr reviewed Apr 24, 2015

mengxr mentioned this pull request Apr 24, 2015

[ML][SPARK-6529] Add Word2Vec transformer #5596

Closed

small cleanups based on code review

729167a

asfgit closed this in a7160c4 Apr 25, 2015

yinxusen added a commit to yinxusen/spark that referenced this pull request Apr 26, 2015

merge with apache#5626

23d77fa

yinxusen added a commit to yinxusen/spark that referenced this pull request Apr 29, 2015

merge with apache#5626

4945462

jkbradley deleted the dt-api-ensembles branch May 4, 2015 23:02
