@saurfang
Contributor

With a single loop using mapply, I'm able to create a DataFrame from an R data.frame much faster.

Please see the benchmark results and code below:

Unit: milliseconds
 expr        min         lq       mean     median        uq        max neval
  old 284.874542 312.012427 426.822889 336.360288 436.47377 1356.91518   100
  new   4.875089   6.357665   9.013442   6.904729  10.41597   42.41479   100

library(nycflights13)
library(microbenchmark)
# as.data.frame() so that data[i, j] returns a scalar even if flights is a tibble
data <- as.data.frame(head(flights, n = 1000))

# get rid of factor type
dropFactor <- function(x) {
  if (is.factor(x)) {
    as.character(x)
  } else {
    x
  }
}

createDataFrameNew <- function(data) {
  do.call(mapply, c(list, unname(lapply(data, dropFactor)), SIMPLIFY = FALSE))
}

createDataFrameOld <- function(data) {
  n <- nrow(data)
  m <- ncol(data)
  lapply(1:n, function(i) {
    lapply(1:m, function(j) { dropFactor(data[i,j]) })
  })
}

my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}

microbenchmark(old = createDataFrameOld(data), new = createDataFrameNew(data), check = my_check)
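
To illustrate the transposition that the mapply call performs, here is a minimal sketch on a toy data.frame. The names `df` and `rows` and the sample values are made up for illustration and are not part of the benchmark above.

```r
# Toy data.frame (hypothetical values, not the flights data)
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# mapply() walks all columns in parallel and calls list() once per row,
# so the result is a list of rows, each row being a list of cell values.
# unname() drops the column names so each row is an unnamed list.
rows <- do.call(mapply, c(list, unname(as.list(df)), SIMPLIFY = FALSE))

# Inspect the structure of the second row
str(rows[[2]])
```

The nested lapply version builds the same list-of-lists shape but indexes the data.frame once per cell, which is likely where most of the time goes.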

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44181/
Test FAILed.

@felixcheung
Member

Hi, thanks for the contribution. You might want to check out the ongoing work in #9099 and SPARK-11086.

@saurfang
Contributor Author

Ah, thanks for the pointer. I didn't realize this issue was already being worked on. Looks like that PR already has all my brilliant ideas ;) I'm closing this then. My apologies for the duplicate work.

@saurfang saurfang closed this Oct 23, 2015