Skip to content

Conversation

@actuaryzhang
Copy link
Contributor

What changes were proposed in this pull request?

Added checks for name consistency of input data frames in union.

How was this patch tested?

new test.

@actuaryzhang
Copy link
Contributor Author

The current implementation accepts data frames with different schemas. See issues below:

df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
union(df, df[, c(2, 1)])
     name     age
1 Michael     1.0
2    Andy    30.0
3  Justin    19.0
4     1.0 Michael

@felixcheung

@SparkQA
Copy link

SparkQA commented Mar 4, 2017

Test build #73888 has finished for PR 17159 at commit 7697806.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 4, 2017

Test build #73895 has finished for PR 17159 at commit 293dc35.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 4, 2017

Test build #73897 has finished for PR 17159 at commit ef84501.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

felixcheung commented Mar 5, 2017

hmm... this is somewhat by design in Spark - union could take in 2 DataFrames that might not match in column names or types. In that case values in one of the DataFrame will be coerced to make things fit

>>> d = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
>>> l = spark.createDataFrame([(1, 2)])
>>> d.union(l).head(2)
[Row(age=1, name=u'Alice'), Row(age=1, name=u'2')]

>>> l.dtypes
[('_1', 'bigint'), ('_2', 'bigint')]
>>> d.dtypes
[('age', 'bigint'), ('name', 'string')]

Do you see this as something that might be unexpected for R users (in which case rbind might be the overload to look into) or SQL users (documented as equivalent to SQL UNION ALL)?

@actuaryzhang
Copy link
Contributor Author

@felixcheung OK, did not know it was by design. It does seem that the union behavior is similar to R's SQL (in sqldf), but as you pointed out, the overload method rbind is different from base R, which checks name consistency. See examples below. Should I make the change to rbind, or leave it as is and close this PR? Thanks.

df <- data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19))
df2 <- df
names(df2)[1] <- "name2"

# 1. SQL
library(sqldf)
query <- "select * from df union all select * from df2"
sqldf(query)

     name age
1 Michael   1
2    Andy  30
3  Justin  19
4 Michael   1
5    Andy  30
6  Justin  19

# 2. rbind
rbind(df, df2)
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

@felixcheung
Copy link
Member

felixcheung commented Mar 5, 2017

I think it's a good idea to get SparkR rbind to match behavior of R data.frame rbind.
We should clearly indicate the difference between SparkR union and rbind then in documentation.

@actuaryzhang actuaryzhang changed the title [SPARK-19818][SparkR] union should check for name consistency of input data frames [SPARK-19818][SparkR] rbind should check for name consistency of input data frames Mar 5, 2017
@actuaryzhang
Copy link
Contributor Author

Makes sense. Made changes to rbind and added tests. Please take a look. Thanks.

@SparkQA
Copy link

SparkQA commented Mar 5, 2017

Test build #73941 has finished for PR 17159 at commit decc468.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73947 has finished for PR 17159 at commit cc80de3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' Union two or more SparkDataFrames. This is equivalent to \code{UNION ALL} in SQL.
#' Union two or more SparkDataFrames by row. In constrast to \link{union}, this method
#' requires that the input SparkDataFrames have the same column names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just say, as in R's rbind, this method requires...
btw, should we care about data type matching - does R's rbind check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated doc. R's rbind seems to do type conversion similarly to union:

df <- data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19))
df2 <- df
df2$age <- as.character(df2$age)
rbind(df, df2)
     name age
1 Michael   1
2    Andy  30
3  Justin  19
4 Michael   1
5    Andy  30
6  Justin  19
str(rbind(df, df2))
'data.frame':	6 obs. of  2 variables:
 $ name: Factor w/ 3 levels "Andy","Justin",..: 3 1 2 3 1 2
 $ age : chr  "1" "30" "19" "1" ...

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73954 has finished for PR 17159 at commit 54427d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

merged to master. thanks

@asfgit asfgit closed this in 1f6c090 Mar 7, 2017
@actuaryzhang actuaryzhang deleted the sparkRUnion branch July 1, 2017 00:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants