[SPARK-15509][Follow-up][ML][SparkR] R MLlib algorithms should support input columns "features" and "label" #14993

yanboliang · 2016-09-07T10:08:04Z

What changes were proposed in this pull request?

#13584 resolved the conflict between the features and label columns and the RFormula default ones when loading libsvm data, but it left some issues that should still be resolved:

1, It’s not necessary to check and rename the label column.
By design, RFormula can handle the case where the label column already exists (with the restriction that the existing label column must be of numeric/boolean type), so it’s not necessary to change the column name to avoid a conflict. If the label column is not of numeric/boolean type, RFormula will throw an exception.

2, We should rename the features column to a new name if there is a conflict, but appending a random value is enough since the name is only used internally. We did similar work when implementing SQLTransformer.

3, We should set the correct new features column for the estimators. Take GLM as an example:
The GLM estimator should set its features column to the changed one (rFormula.getFeaturesCol) rather than the default “features”. Although this makes no difference when training the model, it causes problems when predicting. The following is the prediction result of GLM before this PR:

![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)

We should drop the internally used features column; otherwise it will appear in the prediction DataFrame, which will confuse users. This behavior is the same as in other scenarios where there is no column name conflict.
After this PR:

![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)
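
Points 2 and 3 above can be illustrated with a minimal sketch. This is not the code from this PR (the actual change lives in the Spark ML wrappers); it models a DataFrame as a plain list of column names, and the helper names `resolve_features_col` and `prediction_columns` are hypothetical. It only shows the idea: pick a non-conflicting internal features column by appending a random suffix, use that name for prediction, and drop it from the result.

```python
import uuid


def resolve_features_col(existing_columns, default="features"):
    """Return a features column name that does not clash with existing_columns.

    Hypothetical helper: keeps the default name when it is free, otherwise
    appends a random suffix, since the name is only used internally.
    """
    if default not in existing_columns:
        return default
    while True:
        candidate = f"{default}_{uuid.uuid4().hex[-12:]}"
        if candidate not in existing_columns:
            return candidate


def prediction_columns(input_columns, features_col):
    """Model the columns of a prediction result using plain lists.

    The formula step appends features_col, the model appends "prediction",
    and the internal features column is dropped before returning.
    """
    transformed = list(input_columns) + [features_col]
    predicted = transformed + ["prediction"]
    return [c for c in predicted if c != features_col]


# libsvm-style input already has "label" and "features" columns.
cols = ["label", "features"]
internal = resolve_features_col(cols)
result = prediction_columns(cols, internal)
print(internal != "features", result)
```

In this model the user's own "features" column survives untouched, and the randomly named internal column never reaches the prediction output.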

How was this patch tested?

Existing unit tests.

SparkQA · 2016-09-07T11:12:41Z

Test build #65035 has finished for PR 14993 at commit 56c6873.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-09-07T11:38:24Z

cc @jkbradley @mengxr @shivaram @felixcheung @junyangq

shivaram · 2016-09-07T16:57:31Z

cc @keypointt

keypointt · 2016-09-08T20:39:50Z

+1 for appending a random_sequence; it is concise, and I believe there will almost certainly be no collision for this random_sequence.

yanboliang · 2016-09-09T12:50:43Z

@keypointt Thanks for the review. Then this should be good to go?

keypointt · 2016-09-09T16:56:53Z

hi @yanboliang it looks good to me, but I don't have the right to merge to master; maybe you have to ping the other reviewers :p

felixcheung · 2016-09-09T17:32:26Z

LGTM

yanboliang · 2016-09-10T07:25:59Z

Merged into master, thanks for all your reviews.

…t input columns "features" and "label"

## What changes were proposed in this pull request?

apache#13584 resolved the conflict between the features and label columns and the `RFormula` default ones when loading libsvm data, but it left some issues that should still be resolved:

1, It’s not necessary to check and rename the label column. By design, `RFormula` can handle the case where the label column already exists (with the restriction that the existing label column must be of numeric/boolean type), so it’s not necessary to change the column name to avoid a conflict. If the label column is not of numeric/boolean type, `RFormula` will throw an exception.

2, We should rename the features column to a new name if there is a conflict, but appending a random value is enough since the name is only used internally. We did similar work when implementing `SQLTransformer`.

3, We should set the correct new features column for the estimators. Take `GLM` as an example: the `GLM` estimator should set its features column to the changed one (rFormula.getFeaturesCol) rather than the default “features”. Although this makes no difference when training the model, it causes problems when predicting. The following is the prediction result of GLM before this PR:

![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)

We should drop the internally used features column; otherwise it will appear in the prediction DataFrame, which will confuse users. This behavior is the same as in other scenarios where there is no column name conflict. After this PR:

![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)

## How was this patch tested?

Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#14993 from yanboliang/spark-15509.

yanboliang added 3 commits September 7, 2016 02:29

SparkR ML wrapper set correct featuresCol & labelCol.

9d11d43

Optimize generating new features column.

924f117

Update docs

56c6873

asfgit closed this in bcdd259 Sep 10, 2016

yanboliang deleted the spark-15509 branch September 10, 2016 07:29
