Skip to content

Conversation

@yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Sep 7, 2016

What changes were proposed in this pull request?

#13584 resolved the issue of features and label columns conflict with RFormula default ones when loading libsvm data, but it still left some issues should be resolved:

1, It’s not necessary to check and rename label column.
Since we have considerations on the design of RFormula, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, RFormula will throw exception.

2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing SQLTransformer.

3, We should set correct new features column for the estimators. Take GLM as example:
GLM estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:
image
We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict.
After this PR:
image

How was this patch tested?

Existing unit tests.

@SparkQA
Copy link

SparkQA commented Sep 7, 2016

Test build #65035 has finished for PR 14993 at commit 56c6873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

@shivaram
Copy link
Contributor

shivaram commented Sep 7, 2016

cc @keypointt

@keypointt
Copy link
Contributor

keypointt commented Sep 8, 2016

Vote for appending a random_sequence, it is concise and I believe almost definitely no collision for this random_sequence

@yanboliang
Copy link
Contributor Author

@keypointt Thanks for review. Then this should be go?

@keypointt
Copy link
Contributor

hi @yanboliang it looks good to me but I don't have right to merge to master, maybe you have to ping the other reviewers :p

@felixcheung
Copy link
Member

LGTM

@yanboliang
Copy link
Contributor Author

Merged into master, thanks for all your review.

@asfgit asfgit closed this in bcdd259 Sep 10, 2016
@yanboliang yanboliang deleted the spark-15509 branch September 10, 2016 07:29
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…t input columns "features" and "label"

## What changes were proposed in this pull request?
apache#13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved:
1, It’s not necessary to check and rename label column.
Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception.

2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```.

3, We should set correct new features column for the estimators. Take ```GLM``` as example:
```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#14993 from yanboliang/spark-15509.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants