@sryza
Contributor

sryza commented May 13, 2015

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 13, 2015

Test build #32634 has started for PR 6126 at commit 3f0af41.

@SparkQA

SparkQA commented May 13, 2015

Test build #32634 has finished for PR 6126 at commit 3f0af41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32634/

Member

Can you please add something like, "This encoding allows algorithms which expect continuous features to use categorical features as well; for example Logistic Regression requires continuous features, but it can use categorical features after one-hot encoding."

Could you please add a Wikipedia link? [http://en.wikipedia.org/wiki/One-hot]

"the includeFirst" --> "the includeFirst parameter"

@jkbradley
Member

@sryza Thanks for the PR!

Can you please add tags "[ml]" (not mllib) and "[doc]" to the title?

Also, a Python API has been added, so could you please add a Python example?

@sryza changed the title from "SPARK-7579 [MLLIB] User guide update for OneHotEncoder" to "SPARK-7579 [ML] [DOC] User guide update for OneHotEncoder" on May 13, 2015
@sryza force-pushed the sandy-spark-7579 branch from 3f0af41 to 95e0908 on May 14, 2015 19:51
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 14, 2015

Test build #32724 has started for PR 6126 at commit 95e0908.

@SparkQA

SparkQA commented May 14, 2015

Test build #32724 has finished for PR 6126 at commit 95e0908.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features as well. The [OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) class provides this functionality. By default, the resulting binary vector has a component for each category, so with 5 categories, an input value of 2.0 would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the `includeFirst` is set to false, the first category is omitted, so the output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would map to a vector of all zeros. Including the first category makes the vector columns linearly dependent because they sum up to one.
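
For readers following the thread, the mapping described in the quoted guide text can be sketched in plain Python. This is only an illustration under stated assumptions: `one_hot`, `num_categories`, and `include_first` are hypothetical names chosen here to mirror the description, and the function is not the Spark `OneHotEncoder` API.

```python
def one_hot(index, num_categories, include_first=True):
    """Encode a category index as a list of 0.0/1.0 values.

    Hypothetical helper for illustration only; it mirrors the behaviour
    described in the guide text and is not part of Spark.
    """
    i = int(index)  # the guide text uses double-valued indices such as 2.0
    if not 0 <= i < num_categories:
        raise ValueError("category index out of range")
    vector = [0.0] * num_categories
    vector[i] = 1.0
    # Dropping the first component leaves category 0 encoded as all zeros.
    return vector if include_first else vector[1:]


print(one_hot(2.0, 5))                       # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot(2.0, 5, include_first=False))  # [0.0, 1.0, 0.0, 0.0]
print(one_hot(0.0, 5, include_first=False))  # [0.0, 0.0, 0.0, 0.0]

# With the first category included, every encoded vector sums to one,
# which is the linear dependence the guide text mentions.
print(all(sum(one_hot(i, 5)) == 1.0 for i in range(5)))  # True
```

Omitting the first component removes that dependence, since the all-zeros vector then stands for category 0.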

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32724/

Member

Need {% endhighlight %}

Q: Have you tried generating this using jekyll? This is the only issue I spotted. Thanks!

Contributor Author

Posting a patch that fixes this. My jekyll efforts have been thwarted with errors like:
/home/sandy/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:318: polymorphic expression cannot be instantiated to expected type;

Any idea how to get past these?

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@SparkQA

SparkQA commented May 19, 2015

Test build #33101 has started for PR 6126 at commit 4f5376e.

@sryza force-pushed the sandy-spark-7579 branch from 4f5376e to 5af803d on May 19, 2015 21:56
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 19, 2015

Test build #33102 has started for PR 6126 at commit 5af803d.

@SparkQA

SparkQA commented May 19, 2015

Test build #33102 has finished for PR 6126 at commit 5af803d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameReader(object):
    • class DataFrameWriter(object):

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33102/

@SparkQA

SparkQA commented May 20, 2015

Test build #33101 has finished for PR 6126 at commit 4f5376e.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features as well. The [OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) class provides this functionality. By default, the resulting binary vector has a component for each category, so with 5 categories, an input value of 2.0 would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the `includeFirst` is set to false, the first category is omitted, so the output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would map to a vector of all zeros. Including the first category makes the vector columns linearly dependent because they sum up to one.

@AmplabJenkins

Build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33101/

@jkbradley
Member

@sryza The unclean merge is probably from another feature transformer PR, but it should be easy to fix.

I'm not sure about the jekyll issue; I haven't seen that happen. The only thing I know to recommend is the usual: Check jekyll version (I'm using 2.4.0), and do a clean build. If you can't get it to work, I can check on my side too.

@sryza
Contributor Author

sryza commented May 20, 2015

The unclean merge is fixed in the current version of the patch. Re: Jekyll, can try a clean build tomorrow.

@jkbradley
Member

Hm, the last test says the merge is unclean.

@SparkQA

SparkQA commented May 20, 2015

Test build #835 has started for PR 6126 at commit 5af803d.

@SparkQA

SparkQA commented May 20, 2015

Test build #835 has finished for PR 6126 at commit 5af803d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameReader(object):
    • class DataFrameWriter(object):

@jkbradley
Member

LGTM. Merging into master and branch-1.4.
@sryza Thanks!

asfgit pushed a commit that referenced this pull request May 20, 2015
Author: Sandy Ryza <sandy@cloudera.com>

Closes #6126 from sryza/sandy-spark-7579 and squashes the following commits:

5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder

(cherry picked from commit 829f1d9)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 829f1d9 May 20, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#6126 from sryza/sandy-spark-7579 and squashes the following commits:

5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#6126 from sryza/sandy-spark-7579 and squashes the following commits:

5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#6126 from sryza/sandy-spark-7579 and squashes the following commits:

5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder