Skip to content

Conversation

@actuaryzhang
Copy link
Contributor

What changes were proposed in this pull request?

The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names.

How was this patch tested?

new tests.

@actuaryzhang
Copy link
Contributor Author

An example illustrating the issue:

df <- suppressWarnings(createDataFrame(iris))
# this is error
colnames(df) <- NULL
# this should report error
names(df) <- NULL

@felixcheung @wangmiao1981

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72321 has finished for PR 16794 at commit fce4b6f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

Not sure why the unit test on kmeans summary failed since nothing was changed there. Also, all unit tests passed on my computer.

@felixcheung
Copy link
Member

good catch, seems reasonable (although might be a small breaking behavior change)

Error in tests is with doc generation:

Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
error in evaluating the argument 'object' in selecting a method for function 'summary': Error: object 'kmeansModel' not found

@wangmiao1981
Copy link
Contributor

Warning in FUN(X[[1L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
Warning in FUN(X[[2L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
Warning in FUN(X[[3L]], ...) :
Use Petal_Length instead of Petal.Length as column name
Warning in FUN(X[[4L]], ...) :
Use Petal_Width instead of Petal.Width as column name
Warning in FUN(X[[1L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
Warning in FUN(X[[2L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
Warning in FUN(X[[3L]], ...) :
Use Petal_Length instead of Petal.Length as column name
Warning in FUN(X[[4L]], ...) :
Use Petal_Width instead of Petal.Width as column name
Warning in FUN(X[[1L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
Warning in FUN(X[[2L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
Warning in FUN(X[[3L]], ...) :
Use Petal_Length instead of Petal.Length as column name
Warning in FUN(X[[4L]], ...) :
Use Petal_Width instead of Petal.Width as column name
Warning in FUN(X[[4L]], ...) :
Use resid_ds instead of resid.ds as column name
Warning in FUN(X[[6L]], ...) :
Use ecog_ps instead of ecog.ps as column name
Warning in FUN(X[[1L]], ...) :
Use GNP_deflator instead of GNP.deflator as column name
Warning in FUN(X[[4L]], ...) :
Use Armed_Forces instead of Armed.Forces as column name
Warning in FUN(X[[1L]], ...) :
Use GNP_deflator instead of GNP.deflator as column name
Warning in FUN(X[[4L]], ...) :
Use Armed_Forces instead of Armed.Forces as column name
Quitting from lines 748-753 (sparkr-vignettes.Rmd)
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
error in evaluating the argument 'object' in selecting a method for function 'summary': Error: object 'kmeansModel' not found

What are these warnings for? Seems relevant.

@wangmiao1981
Copy link
Contributor

try {r, warning=FALSE} for the warning cases?

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72327 has finished for PR 16794 at commit 723b0a4.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 4, 2017

Test build #72330 has finished for PR 16794 at commit f7dc3c6.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor

still document failures.

@wangmiao1981
Copy link
Contributor

retest this please.

@actuaryzhang
Copy link
Contributor Author

@felixcheung @wangmiao1981
I spent quite some time on this b/c I could not replicate the results and all tests on my end worked.
Then I updated my local spark with a pull request and found that all these error messages are introduced in #16767. I created another PR to fix this #16799 since this is a separate issue. Please review and merge #16799 first.

@SparkQA
Copy link

SparkQA commented Feb 4, 2017

Test build #72335 has finished for PR 16794 at commit f7dc3c6.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Feb 4, 2017

Test build #72345 has finished for PR 16794 at commit f7dc3c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

I'd prefer not to {r, warning=FALSE} on the whole vignettes block - it could mask unexpected warnings. But good point on it being very noisy - I was concerned to have suppressWarnings(createDataFrame(iris)) because it might seems confusing to user reading the example not understanding why we would have warning there.

To address all these noise, I'd propose either

  • we really should fix the root problem with . in column names SPARK-12191, SPARK-11976
  • or we should really move examples and documentation away from dataset that has . in column names

@felixcheung
Copy link
Member

in any case, this PR LGTM. I'll target master for this fix.

@felixcheung
Copy link
Member

opened SPARK-19460.

@actuaryzhang actuaryzhang reopened this Feb 5, 2017
@actuaryzhang
Copy link
Contributor Author

@felixcheung I am in favor of the first proposal to support dot in names. I'm curious why this was not supported yet since we can create DataFrame in Spark with dot in names?

val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("a.b", "b.c")

@felixcheung
Copy link
Member

felixcheung commented Feb 5, 2017

that's awesome - it's likely a recent change we have missed. If you dig through the history of various related JIRAs you should be able to pinpoint a couple of places (at least 2 or 3, on top of my head) that we are replacing . with _ on the R side. If it works as-is by removing those replacement R code then we are good. But, I wouldn't be surprised if we are going to run into some R bugs - R really like its . and it carries special meaning.

I'd really appreciate it if someone could look into this - the . -> _ has been problematic for a long time.

@actuaryzhang
Copy link
Contributor Author

@felixcheung I'll be happy to look into this and fix it. Do you want me to open a new JIRA or just create a PR against SPARK-19460?

@felixcheung
Copy link
Member

that would be great @actuaryzhang - I think SPARK-11976 would be the place to start.

@felixcheung
Copy link
Member

merged to master.

@asfgit asfgit closed this in b94f4b6 Feb 5, 2017
@actuaryzhang actuaryzhang deleted the sparkRNames branch February 6, 2017 22:49
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?
The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names.

## How was this patch tested?
new tests.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes apache#16794 from actuaryzhang/sparkRNames.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants