
Conversation

@keypointt
Contributor

https://issues.apache.org/jira/browse/SPARK-15509

What changes were proposed in this pull request?

Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
training <- loadDF(sqlContext, ".../mnist", "libsvm")
model <- naiveBayes(label ~ features, training)
This fails with:

16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  java.lang.IllegalArgumentException: Output column features already exists.
    at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
    at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
    at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
    at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
The same issue appears for the "label" column once you rename the "features" column.

The cause is that loadDF() sometimes produces DataFrames with the default column names "label" and "features", which conflict with the default column names set by setDefault(labelCol, "label") and setDefault(featuresCol, "features") in SharedParams.scala.
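For reference, the same conflict can be reproduced from Scala with RFormula alone (a minimal sketch, assuming an existing SparkSession `spark` and the bundled sample libsvm data):

```scala
import org.apache.spark.ml.feature.RFormula

// The libsvm source already yields "label" and "features" columns, so RFormula's
// default output column "features" collides when the pipeline schema is checked.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val rFormula = new RFormula().setFormula("label ~ features")
rFormula.fit(data) // IllegalArgumentException: Output column features already exists.
```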

How was this patch tested?

Tested on my local machine.

@SparkQA

SparkQA commented Jun 10, 2016

Test build #60261 has finished for PR 13584 at commit 43b2f8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@keypointt
Contributor Author

Hi @jkbradley, do you mind having a look at this one? Thanks a lot :)

@shivaram
Contributor

@jkbradley Is this important for 2.0?

@shivaram
Contributor

cc @mengxr

@shivaram
Contributor

@keypointt Is this PR still relevant?

@keypointt
Contributor Author

I'm not sure; I guess this one was skipped and isn't important anymore?

I can close it if it's not going to be merged.

@shivaram
Contributor

Hmm - the problem still seems to be relevant. @mengxr @junyangq Would one of you be able to look at this?

@junyangq
Contributor

@keypointt Can we keep searching (in a random or sequential way) until an unused column name is found?

@keypointt
Contributor Author

Sure, I'll try to scan through all the MLlib algorithms.

@junyangq
Contributor

@shivaram Does it sound reasonable to you? Just discussed this with @jkbradley.

@shivaram
Contributor

Yeah, I was going to say that we need to handle cases where labels_output is also used. We can just add a numeric suffix, maybe?

@junyangq
Contributor

Sounds good. That's also what we meant.
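Roughly, the suffix-based renaming discussed here could look like the following sketch (illustrative only; convertToUniqueName is the helper name used later in this PR, but the exact implementation may differ):

```scala
// Sketch of the suffix-based uniquification discussed above: append "_output",
// then a numeric suffix, until the candidate name is not present in the schema.
def convertToUniqueName(originalName: String, fieldNames: Array[String]): String = {
  var candidate = originalName + "_output"
  var suffix = 1
  while (fieldNames.contains(candidate)) {
    candidate = originalName + "_output" + suffix
    suffix += 1
  }
  candidate
}

// e.g. convertToUniqueName("label", Array("label", "features")) -> "label_output";
// if "label_output" were also taken, the result would be "label_output1".
```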


test("avoid column name conflicting") {
val rFormula = new RFormula().setFormula("label ~ features")
val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
Contributor Author

Here I used "../data/". I'm not sure if there is a better way to do it, something like $current_directory/data/mllib/sample_libsvm_data.txt?

All I found is something like val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt"), e.g. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36
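For reference, one way to avoid the hard-coded relative prefix (purely illustrative, not necessarily how Spark's test suites resolve paths) would be to build the path from the working directory explicitly:

```scala
// Illustrative only: resolve the sample file against the current working directory
// instead of depending on a hard-coded "../" prefix.
import java.nio.file.Paths

val dataPath = Paths.get(sys.props("user.dir"), "..", "data", "mllib", "sample_libsvm_data.txt")
  .normalize().toString
val data = spark.read.format("libsvm").load(dataPath)
```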

@SparkQA

SparkQA commented Aug 29, 2016

Test build #64578 has finished for PR 13584 at commit 1bc150f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
  if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    logWarning("data containing 'label' column, so change its name to avoid conflict")
Member

Is it possible to include the featureCol name in the logging?

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64764 has finished for PR 13584 at commit caa4183.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64765 has finished for PR 13584 at commit 1701252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@junyangq
Contributor

junyangq commented Sep 1, 2016

LGTM

rFormula.setLabelCol(rFormula.getLabelCol + "_output")
val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
logWarning(
s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
Member

This sounds a bit like we are renaming the existing label column?
Perhaps just change it to s"data containing ${rFormula.getLabelCol} column, using new name to $newLabelName instead"?

Contributor Author

Sure, I'll change it.

rFormula.setFeaturesCol(rFormula.getFeaturesCol + "_output")
val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames)
logWarning(
s"data containing ${rFormula.getFeaturesCol} column, changing its name to $newFeaturesName")
Member

Same here?

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64811 has finished for PR 13584 at commit d9e3be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64813 has finished for PR 13584 at commit 8bb370e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM. @shivaram, do you have any other comments?

@shivaram
Contributor

shivaram commented Sep 2, 2016

LGTM - @felixcheung feel free to merge when it's ready.

@felixcheung
Member

Merged. I couldn't change the assignee in the JIRA somehow - @shivaram, could you please do that?

@asfgit closed this in 6969dcc Sep 2, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 10, 2016
…t input columns "features" and "label"

## What changes were proposed in this pull request?
apache#13584 resolved the conflict between the features and label columns and the ```RFormula``` defaults when loading libsvm data, but it still left some issues that should be resolved:
1. It's not necessary to check and rename the label column.
By design, ```RFormula``` can handle the case where the label column already exists (with the restriction that the existing label column must be numeric/boolean), so it's not necessary to change the column name to avoid a conflict. If the label column is not numeric/boolean, ```RFormula``` will throw an exception.

2. We should rename the features column to a new name if there is a conflict, but appending a random value is enough since the column is only used internally. We did similar work when implementing ```SQLTransformer```.

3. We should set the correct new features column on the estimators. Take ```GLM``` as an example:
the ```GLM``` estimator should set its features column to the changed name (rFormula.getFeaturesCol) rather than the default "features". Although this makes no difference when training the model, it causes problems when predicting. The following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internally used features column; otherwise it will appear in the prediction DataFrame and confuse users. This matches the behavior in scenarios where there is no column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#14993 from yanboliang/spark-15509.
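To illustrate point 3 of the commit message above, a hedged sketch (not the exact code from apache#14993) of pointing an estimator at whatever features column RFormula ended up using:

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Assumes the wrapper may already have renamed rFormula's featuresCol to avoid a
// conflict; the estimator should use that name rather than the hard-coded default.
val rFormula = new RFormula().setFormula("label ~ .")
val glm = new GeneralizedLinearRegression()
  .setFeaturesCol(rFormula.getFeaturesCol) // the (possibly renamed) features column
  .setLabelCol(rFormula.getLabelCol)
```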
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…t input columns "features" and "label"
