SPARK-7579 [ML] [DOC] User guide update for OneHotEncoder #6126

sryza · 2015-05-13T18:43:45Z

No description provided.

AmplabJenkins · 2015-05-13T18:47:11Z

Merged build triggered.

AmplabJenkins · 2015-05-13T18:47:18Z

Merged build started.

SparkQA · 2015-05-13T18:48:08Z

Test build #32634 has started for PR 6126 at commit 3f0af41.

SparkQA · 2015-05-13T20:35:43Z

Test build #32634 has finished for PR 6126 at commit 3f0af41.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-13T20:35:48Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-13T20:35:48Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32634/
Test PASSed.

jkbradley · 2015-05-13T22:15:47Z

docs/ml-features.md

Can you please add something like, "This encoding allows algorithms which expect continuous features to use categorical features as well; for example Logistic Regression requires continuous features, but it can use categorical features after one-hot encoding."

Could you please add a Wikipedia link? [http://en.wikipedia.org/wiki/One-hot]

"the includeFirst" --> "the includeFirst parameter"

jkbradley · 2015-05-13T22:16:10Z

@sryza Thanks for the PR!

Can you please add tags "[ml]" (not mllib) and "[doc]" to the title?

Also, a Python API has been added, so could you please add a Python example?

AmplabJenkins · 2015-05-14T19:52:10Z

Merged build triggered.

AmplabJenkins · 2015-05-14T19:52:19Z

Merged build started.

SparkQA · 2015-05-14T19:52:38Z

Test build #32724 has started for PR 6126 at commit 95e0908.

SparkQA · 2015-05-14T21:41:59Z

Test build #32724 has finished for PR 6126 at commit 95e0908.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features as well. The [OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) class provides this functionality. By default, the resulting binary vector has a component for each category, so with 5 categories, an input value of 2.0 would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the `includeFirst` is set to false, the first category is omitted, so the output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would map to a vector of all zeros. Including the first category makes the vector columns linearly dependent because they sum up to one.
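
For anyone following the guide text quoted above, here is a minimal plain-Python sketch of the encoding it describes. It is not Spark's `OneHotEncoder` API: the function `one_hot` and its `include_first` argument are hypothetical names, chosen only to mirror the `includeFirst` parameter mentioned in the text.

```python
from typing import List


def one_hot(index: float, num_categories: int, include_first: bool = True) -> List[float]:
    """Encode a label index as a binary vector with at most a single 1.0.

    Hypothetical helper for illustration; not part of Spark.
    """
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError(f"index {index} is outside [0, {num_categories})")

    if include_first:
        # One component per category.
        vec = [0.0] * num_categories
        vec[i] = 1.0
        return vec

    # First category omitted: index 0 becomes the all-zeros vector,
    # and every other index shifts down by one position.
    vec = [0.0] * (num_categories - 1)
    if i > 0:
        vec[i - 1] = 1.0
    return vec


print(one_hot(2.0, 5))                       # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot(2.0, 5, include_first=False))  # [0.0, 1.0, 0.0, 0.0]
print(one_hot(0.0, 5, include_first=False))  # [0.0, 0.0, 0.0, 0.0]

# With the first category included, every encoded row sums to one,
# which is the linear dependence the guide text mentions.
assert all(sum(one_hot(k, 5)) == 1.0 for k in range(5))
```

The three printed vectors match the examples given in the guide text for 5 categories.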

AmplabJenkins · 2015-05-14T21:42:04Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-14T21:42:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32724/
Test PASSed.

jkbradley · 2015-05-15T21:32:56Z

docs/ml-features.md

Need {% endhighlight %}

Q: Have you tried generating this using jekyll? This is the only issue I spotted. Thanks!

sryza · 2015-05-19

Posting a patch that fixes this. My jekyll efforts have been thwarted by errors like:
/home/sandy/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:318: polymorphic expression cannot be instantiated to expected type;

Any idea how to get past these?

AmplabJenkins · 2015-05-19T21:52:11Z

Build triggered.

AmplabJenkins · 2015-05-19T21:52:19Z

Build started.

SparkQA · 2015-05-19T21:53:05Z

Test build #33101 has started for PR 6126 at commit 4f5376e.

AmplabJenkins · 2015-05-19T21:57:11Z

Merged build triggered.

AmplabJenkins · 2015-05-19T21:57:19Z

Merged build started.

SparkQA · 2015-05-19T21:59:15Z

Test build #33102 has started for PR 6126 at commit 5af803d.

SparkQA · 2015-05-19T23:46:00Z

Test build #33102 has finished for PR 6126 at commit 5af803d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DataFrameReader(object):
- class DataFrameWriter(object):

AmplabJenkins · 2015-05-19T23:46:05Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-19T23:46:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33102/
Test PASSed.

SparkQA · 2015-05-20T00:17:23Z

Test build #33101 has finished for PR 6126 at commit 4f5376e.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features as well. The [OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) class provides this functionality. By default, the resulting binary vector has a component for each category, so with 5 categories, an input value of 2.0 would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the `includeFirst` is set to false, the first category is omitted, so the output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would map to a vector of all zeros. Including the first category makes the vector columns linearly dependent because they sum up to one.

AmplabJenkins · 2015-05-20T00:17:28Z

Build finished. Test PASSed.

AmplabJenkins · 2015-05-20T00:17:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33101/
Test PASSed.

jkbradley · 2015-05-20T02:07:37Z

@sryza The unclean merge is probably from another feature transformer PR, but it should be easy to fix.

I'm not sure about the jekyll issue; I haven't seen that happen. The only thing I know to recommend is the usual: Check jekyll version (I'm using 2.4.0), and do a clean build. If you can't get it to work, I can check on my side too.

sryza · 2015-05-20T02:28:49Z

The unclean merge is fixed in the current version of the patch. Re: Jekyll, I can try a clean build tomorrow.

jkbradley · 2015-05-20T02:51:30Z

Hm, the last test says the merge is unclean.

SparkQA · 2015-05-20T02:52:24Z

Test build #835 has started for PR 6126 at commit 5af803d.

SparkQA · 2015-05-20T04:41:07Z

Test build #835 has finished for PR 6126 at commit 5af803d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DataFrameReader(object):
- class DataFrameWriter(object):

jkbradley · 2015-05-20T20:09:38Z

LGTM. Merging into master and branch-1.4.
@sryza Thanks!

Author: Sandy Ryza <sandy@cloudera.com>
Closes #6126 from sryza/sandy-spark-7579 and squashes the following commits:
5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
(cherry picked from commit 829f1d9)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

Author: Sandy Ryza <sandy@cloudera.com>
Closes apache#6126 from sryza/sandy-spark-7579 and squashes the following commits:
5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder

jkbradley reviewed May 13, 2015

sryza changed the title ~~SPARK-7579 [MLLIB] User guide update for OneHotEncoder~~ SPARK-7579 [ML] [DOC] User guide update for OneHotEncoder May 13, 2015

sryza force-pushed the sandy-spark-7579 branch from 3f0af41 to 95e0908 Compare May 14, 2015 19:51

jkbradley reviewed May 15, 2015

SPARK-7579 [MLLIB] User guide update for OneHotEncoder

5af803d

sryza force-pushed the sandy-spark-7579 branch from 4f5376e to 5af803d Compare May 19, 2015 21:56

asfgit closed this in 829f1d9 May 20, 2015
