[SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame #9099
Conversation
|
Jenkins, ok to test |
|
@zero323 Do you have any benchmark results from before this PR vs. after this PR? |
|
Spark 1.5.1: [benchmark chart] After patch on master: [benchmark chart] |
|
Test build #43654 has finished for PR 9099 at commit
|
|
Test build #43656 has finished for PR 9099 at commit
|
|
very cool. Could you run a benchmark on a data.table dataset too (read.csv returns a data.frame)? |
|
cc @sun-rui |
|
@felixcheung Here you are: Spark 1.5.1: [benchmark chart] Patched master: [benchmark chart] It looks a little too good to be true, but as far as I can tell everything works as expected. |
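(For reference, a hypothetical sketch of how a benchmark of this kind could be run; the actual datasets behind the posted charts are not part of the thread, and `sqlContext` is assumed to come from an already-initialized SparkR session:)

```r
# Hypothetical benchmark setup; the real data behind the charts above
# is not included in this thread.
library(SparkR)
library(microbenchmark)

# A local data.frame with an explicit factor column, so that
# createDataFrame has to exercise the factor-dropping code path.
ldf <- data.frame(x = factor(sample(letters, 1e6, replace = TRUE)),
                  y = rnorm(1e6))

# Assumes `sqlContext` exists in the session.
microbenchmark(createDataFrame(sqlContext, ldf), times = 3)
```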
|
that's good. data.table is very fast to begin with... |
|
could you add some tests specifically to make sure iris or some other known/available dataset is "serialized" properly? |
|
Sure. Should I make a separate test for that or simply add to |
|
you could probably add to that. This is just an extra test to be safe (and it should check values too; the current tests don't seem to do that, only column names, data types, and counts). |
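(A value-level check along the lines suggested above could look like the sketch below; the test name and surrounding file are assumptions, loosely following SparkR's testthat conventions. mtcars is used instead of iris to sidestep the separate issue of "." in iris column names:)

```r
# Sketch of a value-level serialization test; assumes a SparkR test
# environment where `sqlContext` already exists, and that row order is
# preserved for a small local data.frame.
test_that("collected values match the local data.frame", {
  sdf <- createDataFrame(sqlContext, mtcars)
  collected <- collect(sdf)
  expect_equal(nrow(collected), nrow(mtcars))
  expect_equal(collected$mpg, mtcars$mpg)
})
```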
|
Test build #43684 has finished for PR 9099 at commit
|
|
Jenkins, retest this please |
|
Test build #43686 has finished for PR 9099 at commit
|
|
OK, I am puzzled here. I've played with different test scenarios and this PR is either buggy, fixes an unreported bug, or exposes a more serious problem. Let's say we have the following local data frame:

```r
ldf <- structure(list(foo = list(structure(list(a = 1, b = 3), .Names = c("a",
    "b")), structure(list(a = 2, c = 6), .Names = c("a", "c"))),
    bar = c(1, 2), baz = c("a", "b")), .Names = c("foo", "bar",
    "baz"), row.names = c("1", "2"), class = "data.frame")

ldf
##    foo bar baz
## 1 1, 3   1   a
## 2 2, 6   2   b

str(ldf)
## 'data.frame':	2 obs. of  3 variables:
##  $ foo:List of 2
##   ..$ :List of 2
##   .. ..$ a: num 1
##   .. ..$ b: num 3
##   ..$ :List of 2
##   .. ..$ a: num 2
##   .. ..$ c: num 6
##  $ bar: num  1 2
##  $ baz: chr  "a" "b"
```

On 1.5.1 an attempt to convert this to a Spark DataFrame fails with the following error:

```r
sdf <- createDataFrame(sqlContext, ldf)
## Error in structField.character(names[[i]], types[[i]], TRUE) :
##   Field type must be a string.
```

while the patched version creates a relatively reasonable schema:

```r
sdf <- createDataFrame(sqlContext, ldf)
printSchema(sdf)
## root
##  |-- foo: array (nullable = true)
##  |    |-- element: double (containsNull = true)
##  |-- bar: double (nullable = true)
##  |-- baz: string (nullable = true)
```

I believe the patched behavior is what we want here, but as far as I can tell it is covered by neither tests nor docs. Still, after transformation on 1.5.1
Please ignore that. I've checked the source once again and found the type mapping. The question remains whether it should fail or not. As far as I can tell this problem affects at least 1.4.1, 1.5.0, and 1.5.1. |
|
@sun-rui -- Could this be related to the StructType change? |
|
it does look like, except for 2-3 lines, the |
|
could you add a test for this structure? |
|
It has been there since 1.4.0. Regarding tests, I would prefer to wait until I get some clarification, because right now I am not sure how to handle this. If the expected mapping is:

```r
ldf1 <- structure(list(foo = list(
    structure(list(foo = "a_foo", bar = 3), .Names = c("foo", "bar")),
    structure(list(foo = 2, bar = 3), .Names = c("foo", "bar"))
  )), .Names = "foo", class = "data.frame", row.names = c("1", "2"))

sdf1 <- createDataFrame(sqlContext, ldf1)
printSchema(sdf1)
## root
##  |-- foo: array (nullable = true)
##  |    |-- element: string (containsNull = true)

dtypes(sdf1)
## [[1]]
## [1] "foo"           "array<string>"
```

in contrast to:

```r
ldf2 <- structure(list(foo = list(
    structure(list(foo = 1, bar = 3), .Names = c("foo", "bar")),
    structure(list(foo = 2, bar = "a_bar"), .Names = c("foo", "bar"))
  )), .Names = "foo", class = "data.frame", row.names = c("1", "2"))

sdf2 <- createDataFrame(sqlContext, ldf2)
printSchema(sdf2)
## root
##  |-- foo: array (nullable = true)
##  |    |-- element: double (containsNull = true)

dtypes(sdf2)
## [[1]]
## [1] "foo"           "array<double>"
```

What is even more confusing, the first will fail on head, while

```r
head(sdf2)
##        foo
## 1     1, 3
## 2 2, a_bar
```

works, but understandably fails on

```r
head(select(sdf2, explode(sdf2$foo)))
## Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
##   java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
```
|
… createDataFrame from a data.frame
|
Test build #43787 has finished for PR 9099 at commit
|
|
Test build #43788 has finished for PR 9099 at commit
|
|
@zero323, I have an alternative proposal. That is as simple as: For the problem you observed regarding sdf, that is because inferring the schema of complex types was previously buggy. That was fixed by my PR for SPARK-10049. The fix is currently on master, not in the 1.5.x releases. For the bug regarding sdf1, I will investigate it (maybe some problem in handling nested arrays) and fix it in another PR. You can go on with this PR. |
|
@sun-rui It is simple but not correct, since:

```r
iris_t <- t(iris)
stopifnot(is.character(iris_t), is.matrix(iris_t))  # Type coercion, everything is character
iris_t_df <- data.frame(iris_t)
stopifnot(all(lapply(iris_t_df, class) == "factor"))  # Now everything is a factor
```

The second coercion can be handled with `stringsAsFactors=FALSE`, but:

```r
data <- read.csv("flights.csv")
microbenchmark::microbenchmark(data.frame(t(data)), times = 10)
## Unit: seconds
##                 expr      min       lq     mean   median       uq      max neval
##  data.frame(t(data)) 22.16711 22.99645 23.25316 23.12763 23.62046 24.14375    10
```

Without string-to-factor conversion it is actually faster, but I still don't see how it can be used here. |
|
Test build #43870 has finished for PR 9099 at commit
|
|
I've prepared some tests and it looks like neither heterogeneous lists nor environments are properly handled. |
|
@sun-rui any more comments on this ? |
|
@sun-rui As far as I can tell this is ready. There is still SPARK-11283, which could be fixed here as well. |
|
Test build #44224 has finished for PR 9099 at commit
|
|
@shivaram Not yet. I hope I'll have some time later this week to take a deeper look at this. And I would still like to hear an official response regarding SPARK-11283, because right now there is really nothing we can test here. |
|
@shivaram Yes, that's right. These problems already exist on master. The only problem is that I have to mimic current behavior to match what is described in SPARK-11283. I removed the tests from this PR, but I strongly believe it should be tested. |
|
Test build #45085 has finished for PR 9099 at commit
|
|
@zero323 you could add the test code to SPARK-11283 so that it can be added back then. |
|
@sun-rui Could you take one more look at this? My idea is to just get a fix for the performance issue in this PR and not change any behavior (for better or for worse). |
R/pkg/R/SQLContext.R
Outdated
`#'` is for Roxygen, so use `#` here.
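(For context, the distinction the reviewer is pointing at; a generic illustration, not the actual SQLContext.R code:)

```r
#' Lines starting with #' are roxygen comments: roxygen2 parses them
#' into Rd documentation for the object that follows.
# Lines starting with a plain # are ordinary comments, invisible to
# roxygen2; the right choice for internal notes like the one here.
```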
|
Test build #45209 has finished for PR 9099 at commit
|
|
@sun-rui Could you take another look? It'll be good to get this into 1.6 if we can |
|
I will look at it tomorrow |
R/pkg/R/SQLContext.R
Outdated
@zero323, I don't understand the meaning here. Could you give me an example? An experiment like:

```r
mapply(list, c(1:2), list(list(3, 4), list(5, 6)), SIMPLIFY = FALSE)
```

works as expected; the combination of an atomic vector and a list works fine for mapply. I would recommend eliminating this code piece so that this PR focuses on the performance improvement only.
ah, I think you were trying to emulate the behavior of a bug when creating a DataFrame from a data.frame having a column of list type. Actually, this bug is https://issues.apache.org/jira/browse/SPARK-11283, reported by you :)
So this PR is not only a performance improvement, but also fixes SPARK-11283.
Please remove the code piece here :)
The bug of SPARK-11283 lies in

```r
lapply(1:m, function(j) { dropFactor(data[i,j]) })
```

in which `data[i,j]` returns a one-element list wrapping the item, instead of the item itself, when the item is in a column of list type. Using mapply fixes it!
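(The wrapping behavior described above can be seen in plain R; a small illustrative repro with made-up variable names:)

```r
# A data.frame with a list column, similar to the SPARK-11283 repro.
data <- data.frame(bar = 1:2)
data$foo <- list(list(3, 4), list(5, 6))

# data[i, j] on a list column returns a one-element list wrapping the
# item, not the item itself:
str(data[1, "foo"])
## List of 1
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4

# mapply iterates over the column's elements, so FUN receives each item
# unwrapped:
rows <- mapply(function(...) list(...), data$bar, data$foo, SIMPLIFY = FALSE)
str(rows[[1]])
## List of 2
##  $ : int 1
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4
```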
@sun-rui Yes, the only reason to add requires_wrapping was to mimic the behavior described in SPARK-11283 and not introduce unexpected changes in behavior. I understand I can safely drop this and add tests assuming that SPARK-11283 is indeed a bug :) If so, I'll try to do it before Monday.
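(For reference, the column-wise approach discussed in this thread can be sketched roughly as follows; this is a simplified rendition, not the actual SQLContext.R code, and `dropFactor` / `localDataFrameToRows` are stand-in names for SparkR's internals:)

```r
# Drop factors in one pass per column, then use mapply to zip the
# columns back into a list of rows. Because mapply iterates over the
# elements of each column, list columns arrive unwrapped, which is the
# fix for SPARK-11283 discussed above.
dropFactor <- function(x) if (is.factor(x)) as.character(x) else x

localDataFrameToRows <- function(df) {
  cols <- setNames(lapply(df, dropFactor), NULL)   # column-wise, not cell-wise
  do.call(mapply, c(list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE), cols))
}

df <- data.frame(x = factor(c("a", "b")), y = c(1, 2))
rows <- localDataFrameToRows(df)
str(rows[[1]])
## List of 2
##  $ : chr "a"
##  $ : num 1
```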
|
Test build #45952 has finished for PR 9099 at commit
|
|
Test build #45953 has finished for PR 9099 at commit
|
|
LGTM |
Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`.

At this moment SparkR createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works, but is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows x 2 columns).

A simple improvement is to apply `dropFactor` column-wise and then reshape the output list. It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9099 from zero323/SPARK-11086.

(cherry picked from commit d7d9fa0)

Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>