[SPARK-15509][Follow-up][ML][SparkR] R MLlib algorithms should support input columns "features" and "label" #14993
Conversation
Test build #65035 has finished for PR 14993 at commit
cc @keypointt
I vote for appending a random_sequence; it is concise, and I believe there will almost certainly be no collision for this random_sequence.
@keypointt Thanks for the review. So is this good to go?
Hi @yanboliang, it looks good to me, but I don't have the right to merge to master; maybe you have to ping the other reviewers :p
LGTM
Merged into master, thanks for all your reviews.
…t input columns "features" and "label"
Author: Yanbo Liang <ybliang8@gmail.com>
Closes apache#14993 from yanboliang/spark-15509.
What changes were proposed in this pull request?
#13584 resolved the conflict between the features and label columns and the `RFormula` default ones when loading libsvm data, but it left some issues that should be resolved:

1. It is not necessary to check and rename the label column. `RFormula` is designed to handle the case where the label column already exists (with the restriction that the existing label column must be of numeric or boolean type), so there is no need to change the column name to avoid a conflict. If the label column is not of numeric or boolean type, `RFormula` will throw an exception.

2. We should rename the features column to a new name if there is a conflict, but appending a random value is enough since the column is used internally only. We did similar work when implementing `SQLTransformer`.

3. We should set the correct new features column for the estimators. Take `GLM` as an example: the `GLM` estimator should set its features column to the changed one (rFormula.getFeaturesCol) rather than the default "features". This makes no difference when training the model, but it causes problems when predicting.

The following is the prediction result of GLM before this PR:
[image]

We should drop the internally used features column; otherwise it will appear in the prediction DataFrame, which will confuse users. This behavior is the same as in other scenarios where there is no column name conflict.

After this PR:
[image]
How was this patch tested?
Existing unit tests.
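
The renaming and dropping of the internal features column described in points 2 and 3 can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: it does not use Spark, the function names `unique_features_col` and `visible_prediction_cols` are hypothetical, and the `uuid`-based suffix is one possible choice for the "random value" mentioned above.

```python
import uuid


def unique_features_col(existing_cols, base="features"):
    """Return `base` if it is unused, else `base` plus a random suffix.

    Hypothetical helper: the suffix format is an assumption, the PR only
    says that a random value is appended.
    """
    if base not in existing_cols:
        return base
    while True:
        candidate = f"{base}_{uuid.uuid4().hex[:12]}"
        # A collision is extremely unlikely, but checking is cheap.
        if candidate not in existing_cols:
            return candidate


def visible_prediction_cols(prediction_cols, internal_col):
    """Column names the user should see: everything but the internal column."""
    return [c for c in prediction_cols if c != internal_col]


# Input that already has "label" and "features" columns, as with libsvm data.
input_cols = ["label", "features"]
features_col = unique_features_col(input_cols)

# A fitted model would append the internal features column and a prediction.
prediction_cols = input_cols + [features_col, "prediction"]
print(visible_prediction_cols(prediction_cols, features_col))
```

Because the internal column name is chosen so that it never matches a user column, dropping it by name cannot remove the user's own "features" column.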