Skip to content

Conversation

@jkbradley
Copy link
Member

This is a continuation of [https://github.com//pull/5530](which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees. Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.

This PR follows the example set by the previous PR for Decision Trees. It includes a few cleanups to Decision Trees.

Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap. I plan to submit a separate PR which makes those values in Model be Options. It does not matter much which PR gets merged first.

CC: @mengxr @manishamde @codedeft @chouqin

@SparkQA
Copy link

SparkQA commented Apr 22, 2015

Test build #30729 has finished for PR 5626 at commit ea3d901.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • case class Params(
    • final class GBTClassificationModel(
    • final class GBTRegressionModel(
    • trait TreeEnsembleModel
    • case class Explode(child: Expression)
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 22, 2015

Test build #30764 has finished for PR 5626 at commit 855aa9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • case class Params(
    • final class GBTClassificationModel(
    • final class GBTRegressionModel(
    • trait TreeEnsembleModel
  • This patch does not change any dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it go to shared params? I see the problem with the doc. If we want to put something special, we can put it in the JavaDoc. No strong preference about this. But it makes me think that whether we should mark shared params final.

@mengxr
Copy link
Contributor

mengxr commented Apr 23, 2015

@jkbradley I made one pass on the public APIs. There are some issues from the ml.DT PR:

  1. Node.prediction should say "leaf" node instead of "internal":
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L33
  2. CategoricalSplit.getLeftCategories. Maybe this should be a property since it is immutable. No strong preference. Btw, I'm thinking about using BitSet to store left category indices to save storage.

We can address those in a separate PR. I will take another pass on the implementation.

@jkbradley
Copy link
Member Author

Updated! I think the only thing I didn't do was make stepSize a shared param. Copying from the comment above:

I'm hesitating about putting it in sharedParams since the intended range can differ between algorithms. For GBTs, it should be in (0, 1], but it could be different for other algs.

I updated the doc in Node.prediction, as well as getLeft/RightCategories. I'll make a JIRA for using BitSet internally for categories.

@SparkQA
Copy link

SparkQA commented Apr 24, 2015

Test build #30882 has finished for PR 5626 at commit bbae2a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • case class Params(
    • final class GBTClassificationModel(
    • trait HasSeed extends Params
    • final class GBTRegressionModel(
  • This patch does not change any dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention StringIndexer? Btw, we should pair TODOs with JIRAs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val arr = Array(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0, 3.0, 1.0)),
  ...)

@mengxr
Copy link
Contributor

mengxr commented Apr 24, 2015

LGTM except some minor inline comments.

@jkbradley
Copy link
Member Author

Updated. The only remaining question is about the (private[ml]) notes. (See comment above.)

@mengxr
Copy link
Contributor

mengxr commented Apr 24, 2015

test this please

1 similar comment
@mengxr
Copy link
Contributor

mengxr commented Apr 24, 2015

test this please

@SparkQA
Copy link

SparkQA commented Apr 25, 2015

Test build #698 has finished for PR 5626 at commit 729167a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • case class Params(
    • final class GBTClassificationModel(
    • trait HasSeed extends Params
    • final class GBTRegressionModel(
  • This patch adds the following new dependencies:
    • tachyon-0.6.4.jar
    • tachyon-client-0.6.4.jar
  • This patch removes the following dependencies:
    • tachyon-0.5.0.jar
    • tachyon-client-0.5.0.jar

@mengxr
Copy link
Contributor

mengxr commented Apr 25, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in a7160c4 Apr 25, 2015
@jkbradley
Copy link
Member Author

@mengxr Curious: Why does it say there are unmerged commits? (I checked, and the last commit was merged correctly.)

yinxusen added a commit to yinxusen/spark that referenced this pull request Apr 26, 2015
yinxusen added a commit to yinxusen/spark that referenced this pull request Apr 29, 2015
asfgit pushed a commit that referenced this pull request Apr 29, 2015
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).

There are some notes:

1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms.
2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues.
3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5596 from yinxusen/SPARK-6529 and squashes the following commits:

ee2b37a [Xusen Yin] merge with former HEAD
4945462 [Xusen Yin] merge with #5626
3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
5dd4ee7 [Xusen Yin] fix scala style
743e0d5 [Xusen Yin] fix comments and code style
04c48e9 [Xusen Yin] ensure the functionality
a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
02848fa [Xusen Yin] refine comments
34a55c0 [Xusen Yin] fix errors
109d124 [Xusen Yin] add test suite and pass it
04dde06 [Xusen Yin] add shared params
c594095 [Xusen Yin] add word2vec transformer
23d77fa [Xusen Yin] merge with #5626
e8cfaf7 [Xusen Yin] fix conflict with master
66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
566ec20 [Xusen Yin] fix scala style
b54399f [Xusen Yin] fix comments and code style
1211e86 [Xusen Yin] ensure the functionality
6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
7cde18f [Xusen Yin] rm sharedParams
618abd0 [Xusen Yin] refine comments
e29680a [Xusen Yin] fix errors
fe3afe9 [Xusen Yin] add test suite and pass it
02767fb [Xusen Yin] add shared params
6a514f1 [Xusen Yin] add word2vec transformer
@jkbradley jkbradley deleted the dt-api-ensembles branch May 4, 2015 23:02
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
This is a continuation of [apache#5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees.  Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.

This PR follows the example set by the previous PR for Decision Trees.  It includes a few cleanups to Decision Trees.

Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap.  I plan to submit a separate PR which makes those values in Model be Options.  It does not matter much which PR gets merged first.

CC: mengxr manishamde codedeft chouqin

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5626 from jkbradley/dt-api-ensembles and squashes the following commits:

729167a [Joseph K. Bradley] small cleanups based on code review
bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
855aa9a [Joseph K. Bradley] scala style fix
ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml.  Not tested yet.  Need to add example as well
d045ebd [Joseph K. Bradley] some more updates, but far from done
ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).

There are some notes:

1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms.
2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues.
3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.

Author: Xusen Yin <yinxusen@gmail.com>

Closes apache#5596 from yinxusen/SPARK-6529 and squashes the following commits:

ee2b37a [Xusen Yin] merge with former HEAD
4945462 [Xusen Yin] merge with apache#5626
3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
5dd4ee7 [Xusen Yin] fix scala style
743e0d5 [Xusen Yin] fix comments and code style
04c48e9 [Xusen Yin] ensure the functionality
a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
02848fa [Xusen Yin] refine comments
34a55c0 [Xusen Yin] fix errors
109d124 [Xusen Yin] add test suite and pass it
04dde06 [Xusen Yin] add shared params
c594095 [Xusen Yin] add word2vec transformer
23d77fa [Xusen Yin] merge with apache#5626
e8cfaf7 [Xusen Yin] fix conflict with master
66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
566ec20 [Xusen Yin] fix scala style
b54399f [Xusen Yin] fix comments and code style
1211e86 [Xusen Yin] ensure the functionality
6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
7cde18f [Xusen Yin] rm sharedParams
618abd0 [Xusen Yin] refine comments
e29680a [Xusen Yin] fix errors
fe3afe9 [Xusen Yin] add test suite and pass it
02767fb [Xusen Yin] add shared params
6a514f1 [Xusen Yin] add word2vec transformer
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).

There are some notes:

1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms.
2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues.
3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.

Author: Xusen Yin <yinxusen@gmail.com>

Closes apache#5596 from yinxusen/SPARK-6529 and squashes the following commits:

ee2b37a [Xusen Yin] merge with former HEAD
4945462 [Xusen Yin] merge with apache#5626
3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
5dd4ee7 [Xusen Yin] fix scala style
743e0d5 [Xusen Yin] fix comments and code style
04c48e9 [Xusen Yin] ensure the functionality
a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
02848fa [Xusen Yin] refine comments
34a55c0 [Xusen Yin] fix errors
109d124 [Xusen Yin] add test suite and pass it
04dde06 [Xusen Yin] add shared params
c594095 [Xusen Yin] add word2vec transformer
23d77fa [Xusen Yin] merge with apache#5626
e8cfaf7 [Xusen Yin] fix conflict with master
66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
566ec20 [Xusen Yin] fix scala style
b54399f [Xusen Yin] fix comments and code style
1211e86 [Xusen Yin] ensure the functionality
6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
7cde18f [Xusen Yin] rm sharedParams
618abd0 [Xusen Yin] refine comments
e29680a [Xusen Yin] fix errors
fe3afe9 [Xusen Yin] add test suite and pass it
02767fb [Xusen Yin] add shared params
6a514f1 [Xusen Yin] add word2vec transformer
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This is a continuation of [apache#5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees.  Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.

This PR follows the example set by the previous PR for Decision Trees.  It includes a few cleanups to Decision Trees.

Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap.  I plan to submit a separate PR which makes those values in Model be Options.  It does not matter much which PR gets merged first.

CC: mengxr manishamde codedeft chouqin

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5626 from jkbradley/dt-api-ensembles and squashes the following commits:

729167a [Joseph K. Bradley] small cleanups based on code review
bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
855aa9a [Joseph K. Bradley] scala style fix
ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml.  Not tested yet.  Need to add example as well
d045ebd [Joseph K. Bradley] some more updates, but far from done
ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).

There are some notes:

1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms.
2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues.
3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.

Author: Xusen Yin <yinxusen@gmail.com>

Closes apache#5596 from yinxusen/SPARK-6529 and squashes the following commits:

ee2b37a [Xusen Yin] merge with former HEAD
4945462 [Xusen Yin] merge with apache#5626
3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
5dd4ee7 [Xusen Yin] fix scala style
743e0d5 [Xusen Yin] fix comments and code style
04c48e9 [Xusen Yin] ensure the functionality
a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
02848fa [Xusen Yin] refine comments
34a55c0 [Xusen Yin] fix errors
109d124 [Xusen Yin] add test suite and pass it
04dde06 [Xusen Yin] add shared params
c594095 [Xusen Yin] add word2vec transformer
23d77fa [Xusen Yin] merge with apache#5626
e8cfaf7 [Xusen Yin] fix conflict with master
66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
566ec20 [Xusen Yin] fix scala style
b54399f [Xusen Yin] fix comments and code style
1211e86 [Xusen Yin] ensure the functionality
6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
7cde18f [Xusen Yin] rm sharedParams
618abd0 [Xusen Yin] refine comments
e29680a [Xusen Yin] fix errors
fe3afe9 [Xusen Yin] add test suite and pass it
02767fb [Xusen Yin] add shared params
6a514f1 [Xusen Yin] add word2vec transformer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants