[SPARK-19818][SparkR] rbind should check for name consistency of input data frames #17159

actuaryzhang · 2017-03-04T03:00:51Z

What changes were proposed in this pull request?

Added checks for name consistency of input data frames in union.

How was this patch tested?

new test.

actuaryzhang · 2017-03-04T03:01:43Z

The current implementation accepts data frames with different schemas. See issues below:

df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
union(df, df[, c(2, 1)])
     name     age
1 Michael     1.0
2    Andy    30.0
3  Justin    19.0
4     1.0 Michael

@felixcheung

SparkQA · 2017-03-04T03:35:58Z

Test build #73888 has finished for PR 17159 at commit 7697806.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-04T08:42:29Z

Test build #73895 has finished for PR 17159 at commit 293dc35.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-04T09:47:35Z

Test build #73897 has finished for PR 17159 at commit ef84501.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-03-05T05:16:33Z

hmm... this is somewhat by design in Spark - union can take 2 DataFrames that do not match in column names or types. In that case columns are matched by position, and values in one of the DataFrames are coerced to make things fit

>>> d = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
>>> l = spark.createDataFrame([(1, 2)])
>>> d.union(l).head(2)
[Row(age=1, name=u'Alice'), Row(age=1, name=u'2')]

>>> l.dtypes
[('_1', 'bigint'), ('_2', 'bigint')]
>>> d.dtypes
[('age', 'bigint'), ('name', 'string')]

Do you see this as something that might be unexpected for R users (in which case rbind might be the overload to look into) or SQL users (documented as equivalent to SQL UNION ALL)?

actuaryzhang · 2017-03-05T19:11:49Z

@felixcheung OK, did not know it was by design. It does seem that the union behavior is similar to R's SQL (in sqldf), but as you pointed out, the overload method rbind is different from base R, which checks name consistency. See examples below. Should I make the change to rbind, or leave it as is and close this PR? Thanks.

df <- data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19))
df2 <- df
names(df2)[1] <- "name2"

# 1. SQL
library(sqldf)
query <- "select * from df union all select * from df2"
sqldf(query)

     name age
1 Michael   1
2    Andy  30
3  Justin  19
4 Michael   1
5    Andy  30
6  Justin  19

# 2. rbind
rbind(df, df2)
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

felixcheung · 2017-03-05T19:34:28Z

I think it's a good idea to get SparkR rbind to match behavior of R data.frame rbind.
We should clearly indicate the difference between SparkR union and rbind then in documentation.
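To make the distinction concrete, here is a minimal plain-Python sketch of the two behaviors discussed above. It is only a model, not the SparkR or PySpark API: the `Frame` type and the functions `union_by_position` and `rbind_checked` are hypothetical names, and the assumption that the name check compares column names in order is ours, since the thread does not spell out the exact rule.

```python
from typing import Any, List, Tuple

# Toy stand-in for a data frame: ordered column names plus row tuples.
# (Hypothetical type for illustration only; not a Spark class.)
Frame = Tuple[List[str], List[Tuple[Any, ...]]]


def union_by_position(left: Frame, right: Frame) -> Frame:
    """Model of UNION ALL: columns are matched by position, left names win."""
    left_names, left_rows = left
    right_names, right_rows = right
    if len(left_names) != len(right_names):
        raise ValueError("inputs must have the same number of columns")
    return list(left_names), left_rows + right_rows


def rbind_checked(*frames: Frame) -> Frame:
    """Model of an rbind that refuses inputs whose column names differ.

    Assumption: names are compared in order, so swapped columns are rejected.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    first_names = frames[0][0]
    for names, _ in frames[1:]:
        if names != first_names:
            raise ValueError("column names do not match")
    rows = [row for _, frame_rows in frames for row in frame_rows]
    return list(first_names), rows


df = (["name", "age"], [("Michael", 1.0), ("Andy", 30.0)])
swapped = (["age", "name"], [(1.0, "Michael"), (30.0, "Andy")])

# Positional union silently puts ages under "name" and names under "age".
names, rows = union_by_position(df, swapped)

# The checked variant raises instead of mixing the columns.
try:
    rbind_checked(df, swapped)
except ValueError as err:
    print("rejected:", err)
```

In this model the positional union reproduces the mixed-up rows from the first example in the thread, while the checked variant fails fast, which is the behavior being proposed for rbind.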

actuaryzhang · 2017-03-05T22:37:47Z

Makes sense. Made changes to rbind and added tests. Please take a look. Thanks.

SparkQA · 2017-03-05T23:09:25Z

Test build #73941 has finished for PR 17159 at commit decc468.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-06T00:19:09Z

Test build #73947 has finished for PR 17159 at commit cc80de3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-03-06T00:24:27Z

R/pkg/R/DataFrame.R

 #'
-#' Union two or more SparkDataFrames. This is equivalent to \code{UNION ALL} in SQL.
+#' Union two or more SparkDataFrames by row. In contrast to \link{union}, this method
+#' requires that the input SparkDataFrames have the same column names.


I'd just say, as in R's rbind, this method requires...
btw, should we care about data type matching - does R's rbind check?

Thanks. Updated doc. R's rbind seems to do type conversion similarly to union:

df <- data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19))
df2 <- df
df2$age <- as.character(df2$age)

rbind(df, df2)
     name age
1 Michael   1
2    Andy  30
3  Justin  19
4 Michael   1
5    Andy  30
6  Justin  19

str(rbind(df, df2))
'data.frame': 6 obs. of 2 variables:
 $ name: Factor w/ 3 levels "Andy","Justin",..: 3 1 2 3 1 2
 $ age : chr "1" "30" "19" "1" ...

SparkQA · 2017-03-06T03:19:19Z

Test build #73954 has finished for PR 17159 at commit 54427d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-03-07T05:55:27Z

merged to master. thanks

union checks for name consistency

7697806

fix test issue

293dc35

fix equal test

ef84501

check names in rbind rather than union

7ea0c4a

actuaryzhang added 2 commits March 5, 2017 14:34

update doc and test

b8b96d6

update doc

decc468

actuaryzhang changed the title ~~[SPARK-19818][SparkR] union should check for name consistency of input data frames~~ [SPARK-19818][SparkR] rbind should check for name consistency of input data frames Mar 5, 2017

fix test issue

cc80de3

felixcheung reviewed Mar 6, 2017

update doc

54427d5

asfgit closed this in 1f6c090 Mar 7, 2017

actuaryzhang deleted the sparkRUnion branch July 1, 2017 00:36
