-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15509][ML][SparkR] R MLlib algorithms should support input columns "features" and "label" #13584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #60261 has finished for PR 13584 at commit
|
|
hi @jkbradley do you mind have a look on this one? thanks a lot :) |
|
@jkbradley Is this important for 2.0 ? |
|
cc @mengxr |
|
@keypointt Is this PR still relevant ? |
|
I'm not sure, I guess this one is skipped and not important anymore? I can close it if it's not going to be merged |
|
@keypointt Can we keep searching (in random or sequential way) until an unused column name has been found? |
|
sure I'll try to scan through all the mllib algorithms |
|
@shivaram Does it sound reasonable to you? Just discussed this with @jkbradley. |
|
Yeah I was going to say that we need to handle cases where |
|
Sounds good. That's also what we meant. |
|
|
||
| test("avoid column name conflicting") { | ||
| val rFormula = new RFormula().setFormula("label ~ features") | ||
| val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I used "../data/", I'm not sure if there is a better way to do it, something like $current_directory/data/mllib/sample_libsvm_data.txt?
All I found is like this val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36
|
Test build #64578 has finished for PR 13584 at commit
|
| */ | ||
| def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { | ||
| if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { | ||
| logWarning("data containing 'label' column, so change its name to avoid conflict") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to include the featurecol name in logging?
…mn name has been found
|
Test build #64764 has finished for PR 13584 at commit
|
|
Test build #64765 has finished for PR 13584 at commit
|
|
LGTM |
| rFormula.setLabelCol(rFormula.getLabelCol + "_output") | ||
| val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) | ||
| logWarning( | ||
| s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sounds a bit like we are renaming the existing label column?
perhaps just change to s"data containing ${rFormula.getLabelCol} column, using new name to $newLabelName instead"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I'll change it
| rFormula.setFeaturesCol(rFormula.getFeaturesCol + "_output") | ||
| val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames) | ||
| logWarning( | ||
| s"data containing ${rFormula.getFeaturesCol} column, changing its name to $newFeaturesName") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here?
|
Test build #64811 has finished for PR 13584 at commit
|
|
Test build #64813 has finished for PR 13584 at commit
|
|
LGTM. @shivaram do you have any other comment? |
|
LGTM - @felixcheung Feel free to merge when its ready |
|
Merged. I could't change the assignee in the JIRA, somehow - @shivaram could you please do that? |
…t input columns "features" and "label" ## What changes were proposed in this pull request? apache#13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved: 1, It’s not necessary to check and rename label column. Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception. 2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```. 3, We should set correct new features column for the estimators. Take ```GLM``` as example: ```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:  We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict. After this PR:  ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#14993 from yanboliang/spark-15509.
…t input columns "features" and "label" ## What changes were proposed in this pull request? apache#13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved: 1, It’s not necessary to check and rename label column. Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception. 2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```. 3, We should set correct new features column for the estimators. Take ```GLM``` as example: ```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:  We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict. After this PR:  ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#14993 from yanboliang/spark-15509.
https://issues.apache.org/jira/browse/SPARK-15509
What changes were proposed in this pull request?
Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
training <- loadDF(sqlContext, ".../mnist", "libsvm")model <- naiveBayes(label ~ features, training)This fails with:
The cause is, when using
loadDF()to generate dataframes, sometimes it’s with default column name“label”and“features”, and these two name will conflict with default column namessetDefault(labelCol, "label")andsetDefault(featuresCol, "features")ofSharedParams.scalaHow was this patch tested?
Test on my local machine.