[SPARK-8277][SPARKR] Faster createDataFrame using mapply #9234

saurfang · 2015-10-22T22:07:32Z

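By replacing the nested per-cell loops with a single pass using mapply, I'm able to create a DataFrame from an R data.frame much faster. In the benchmark below, the median time for converting 1,000 rows drops from about 336 ms to about 7 ms.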

Please see benchmark results and code:

Unit: milliseconds
 expr        min         lq       mean     median        uq        max neval
  old 284.874542 312.012427 426.822889 336.360288 436.47377 1356.91518   100
  new   4.875089   6.357665   9.013442   6.904729  10.41597   42.41479   100

library(nycflights13)
library(microbenchmark)
data <- head(flights, n = 1000)

# get rid of factor type
dropFactor <- function(x) {
  if (is.factor(x)) {
    as.character(x)
  } else {
    x
  }
}

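createDataFrameNew <- function(data) {
  # One mapply pass over all columns in lockstep; list() is called once per row.
  # USE.NAMES = FALSE keeps the result unnamed even if the first column is character.
  do.call(mapply, c(list, unname(lapply(data, dropFactor)),
                    SIMPLIFY = FALSE, USE.NAMES = FALSE))
}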

createDataFrameOld <- function(data) {
  n <- nrow(data)
  m <- ncol(data)
  # Cell-by-cell extraction: n * m separate data.frame subsetting calls.
  lapply(seq_len(n), function(i) {
    lapply(seq_len(m), function(j) { dropFactor(data[i, j]) })
  })
}

my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}

microbenchmark(old = createDataFrameOld(data), new = createDataFrameNew(data), check = my_check)
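
To make the column-to-row conversion easier to follow, here is a minimal sketch on a toy data.frame. The data and the names `df` and `rows` are hypothetical and not part of this PR; `USE.NAMES = FALSE` is set so that the result stays unnamed even though the first column is character.

```r
# Hypothetical toy input, not taken from the PR.
df <- data.frame(name = c("a", "b", "c"), id = 1:3, stringsAsFactors = FALSE)

# mapply() walks all columns in lockstep and calls list() once per row,
# so a list of columns becomes a list of rows in one pass.
rows <- do.call(mapply, c(list(FUN = list), unname(as.list(df)),
                          list(SIMPLIFY = FALSE, USE.NAMES = FALSE)))

str(rows[[2]])
```

Each element of `rows` is an unnamed list holding one row's values in column order, which is the same shape the nested lapply version produces.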

AmplabJenkins · 2015-10-22T22:32:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44181/
Test FAILed.

felixcheung · 2015-10-22T23:20:53Z

Hi, thanks for the contribution. You might want to check out the ongoing work in #9099 and SPARK-11086.

saurfang · 2015-10-23T01:12:51Z

Ah, thanks for the pointer. I didn't realize this issue was already being worked on. Looks like that PR already had all my brilliant ideas ;) I'm closing this then. My apologies for the duplicate work.

saurfang added 3 commits October 22, 2015 17:59

0bf8b7c  use mapply for faster createDataFrame
397438d  drop no longer used m,n
cb9c106  chop redundant data.frame. unname list to save space

saurfang closed this Oct 23, 2015
