
Conversation

@keypointt
Contributor

https://issues.apache.org/jira/browse/SPARK-15509

What changes were proposed in this pull request?

Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
training <- loadDF(sqlContext, ".../mnist", "libsvm")
model <- naiveBayes(label ~ features, training)
This fails with:

16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  java.lang.IllegalArgumentException: Output column features already exists.
    at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
    at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
    at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
    at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
The same issue appears for the "label" column once you rename the "features" column.

The cause is that loadDF() sometimes produces DataFrames with the default column names "label" and "features", which conflict with the default column names set by setDefault(labelCol, "label") and setDefault(featuresCol, "features") in SharedParams.scala.
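For reference, the same conflict can be reproduced from Scala with RFormula alone (a minimal sketch, assuming an existing SparkSession `spark` and the bundled sample libsvm data):

```scala
import org.apache.spark.ml.feature.RFormula

// The libsvm source already yields "label" and "features" columns, so RFormula's
// default output column "features" collides when the pipeline schema is checked.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val rFormula = new RFormula().setFormula("label ~ features")
rFormula.fit(data) // IllegalArgumentException: Output column features already exists.
```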

How was this patch tested?

Tested on my local machine.

@SparkQA

SparkQA commented Jun 10, 2016

Test build #60261 has finished for PR 13584 at commit 43b2f8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@keypointt
Contributor Author

Hi @jkbradley, do you mind having a look at this one? Thanks a lot :)

@shivaram
Contributor

@jkbradley Is this important for 2.0?

@shivaram
Contributor

cc @mengxr

@shivaram
Contributor

@keypointt Is this PR still relevant?

@keypointt
Contributor Author

I'm not sure; I guess this one was skipped and isn't important anymore?

I can close it if it's not going to be merged.

@shivaram
Contributor

Hmm - the problem still seems to be relevant. @mengxr @junyangq Would one of you be able to look at this?

@junyangq
Contributor

@keypointt Can we keep searching (in a random or sequential way) until an unused column name is found?

@keypointt
Contributor Author

Sure, I'll try to scan through all the MLlib algorithms.

@junyangq
Contributor

@shivaram Does it sound reasonable to you? Just discussed this with @jkbradley.

@shivaram
Contributor

Yeah, I was going to say that we need to handle cases where labels_output is also used. We can just add a numeric suffix, maybe?

@junyangq
Contributor

Sounds good. That's also what we meant.
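Roughly, the suffix-based renaming discussed here could look like the following sketch (illustrative only; convertToUniqueName is the helper name used later in this PR, but the exact implementation may differ):

```scala
// Sketch of the suffix-based uniquification discussed above: append "_output",
// then a numeric suffix, until the candidate name is not present in the schema.
def convertToUniqueName(originalName: String, fieldNames: Array[String]): String = {
  var candidate = originalName + "_output"
  var suffix = 1
  while (fieldNames.contains(candidate)) {
    candidate = originalName + "_output" + suffix
    suffix += 1
  }
  candidate
}

// e.g. convertToUniqueName("label", Array("label", "features")) -> "label_output";
// if "label_output" were also taken, the result would be "label_output1".
```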


test("avoid column name conflicting") {
val rFormula = new RFormula().setFormula("label ~ features")
val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
Contributor Author

Here I used "../data/". I'm not sure if there is a better way to do it, something like $current_directory/data/mllib/sample_libsvm_data.txt?

All I found is something like val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt"), e.g. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36
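For reference, one way to avoid the hard-coded relative prefix (purely illustrative, not necessarily how Spark's test suites resolve paths) would be to build the path from the working directory explicitly:

```scala
// Illustrative only: resolve the sample file against the current working directory
// instead of depending on a hard-coded "../" prefix.
import java.nio.file.Paths

val dataPath = Paths.get(sys.props("user.dir"), "..", "data", "mllib", "sample_libsvm_data.txt")
  .normalize().toString
val data = spark.read.format("libsvm").load(dataPath)
```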

@SparkQA

SparkQA commented Aug 29, 2016

Test build #64578 has finished for PR 13584 at commit 1bc150f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
  if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    logWarning("data containing 'label' column, so change its name to avoid conflict")
Member

Is it possible to include the featureCol name in the logging?

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64764 has finished for PR 13584 at commit caa4183.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64765 has finished for PR 13584 at commit 1701252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@junyangq
Contributor

junyangq commented Sep 1, 2016

LGTM

rFormula.setLabelCol(rFormula.getLabelCol + "_output")
val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
logWarning(
s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
Member

This sounds a bit like we are renaming the existing label column?
Perhaps just change it to s"data containing ${rFormula.getLabelCol} column, using new name to $newLabelName instead"?

Contributor Author

Sure, I'll change it.

rFormula.setFeaturesCol(rFormula.getFeaturesCol + "_output")
val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames)
logWarning(
s"data containing ${rFormula.getFeaturesCol} column, changing its name to $newFeaturesName")
Member

Same here?

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64811 has finished for PR 13584 at commit d9e3be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64813 has finished for PR 13584 at commit 8bb370e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM. @shivaram, do you have any other comments?

@shivaram
Contributor

shivaram commented Sep 2, 2016

LGTM - @felixcheung feel free to merge when it's ready.

@felixcheung
Member

Merged. I couldn't change the assignee in the JIRA somehow - @shivaram, could you please do that?

@asfgit closed this in 6969dcc Sep 2, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 10, 2016
…t input columns "features" and "label"

## What changes were proposed in this pull request?
apache#13584 resolved the conflict between the features and label columns and the ```RFormula``` defaults when loading libsvm data, but it still left some issues that should be resolved:
1. It's not necessary to check and rename the label column.
By design, ```RFormula``` can handle the case where the label column already exists (with the restriction that the existing label column must be numeric/boolean), so it's not necessary to change the column name to avoid a conflict. If the label column is not numeric/boolean, ```RFormula``` will throw an exception.

2. We should rename the features column to a new name if there is a conflict, but appending a random value is enough since the column is only used internally. We did similar work when implementing ```SQLTransformer```.

3. We should set the correct new features column on the estimators. Take ```GLM``` as an example:
the ```GLM``` estimator should set its features column to the changed name (rFormula.getFeaturesCol) rather than the default "features". Although this makes no difference when training the model, it causes problems when predicting. The following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internally used features column; otherwise it will appear in the prediction DataFrame and confuse users. This matches the behavior in scenarios where there is no column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#14993 from yanboliang/spark-15509.
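To illustrate point 3 of the commit message above, a hedged sketch (not the exact code from apache#14993) of pointing an estimator at whatever features column RFormula ended up using:

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Assumes the wrapper may already have renamed rFormula's featuresCol to avoid a
// conflict; the estimator should use that name rather than the hard-coded default.
val rFormula = new RFormula().setFormula("label ~ .")
val glm = new GeneralizedLinearRegression()
  .setFeaturesCol(rFormula.getFeaturesCol) // the (possibly renamed) features column
  .setLabelCol(rFormula.getLabelCol)
```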
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…t input columns "features" and "label"
