Slower utils::View() performance on wide data.table #4497

Closed
matthewgson opened this issue May 26, 2020 · 2 comments

@matthewgson

Hi,

I noticed a large performance difference between data.table and data.frame when calling View() on wide data. My data contains 2.4 million rows and 280 columns, a mix of string and numeric.

Example

d = data.frame(replicate(280,sample(200:50000,2.4e6,rep=TRUE)))
d[,1:140] = lapply(d[,1:140], as.character)
class(d) # "data.frame"
system.time(View(d))
#  user  system elapsed 
# 0.002   0.003   0.005 

library(data.table)
dt = as.data.table(d)
class(dt) #[1] "data.table" "data.frame"
system.time(View(dt))
# user  system elapsed 
# 20.711  16.510  44.928

library(microbenchmark)
microbenchmark(
  View(d), View(dt), times = 10L
)
#Unit: microseconds
#     expr         min           lq         mean       median           uq          max neval
#  View(d) 4.55003e+02     3407.382     7926.779     7806.118     11299.95     20750.27    10
# View(dt) 2.67615e+07 34860488.551 81676854.453 68228162.130 113672257.78 176008641.94    10

# Output of sessionInfo()

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8   

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3   
@MichaelChirico
Member

MichaelChirico commented May 26, 2020

Hi @matthewgson,

We won't be able to do anything about that. See here in the code for utils::View:

https://github.com/wch/r-source/blob/2825c6e39c848ce479289ba5f7521baae955d6f6/src/library/utils/R/edit.R#L63

as.data.frame is creating a copy of your huge table in the data.table case, but not the data.frame case.
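
The copy can be checked on a small table by comparing column memory addresses before and after as.data.frame(). This is only a sketch: the toy table and the variable names df_same and dt_same are made up for illustration, and address() is the helper exported by data.table.

```r
library(data.table)

# Toy stand-in for the wide table in the report (contents are arbitrary)
df = data.frame(a = c(3L, 1L, 2L), b = c("x", "y", "z"), stringsAsFactors = FALSE)
dt = as.data.table(df)

# On a plain data.frame, as.data.frame() returns the same column vectors
df_same = identical(address(df[[1L]]), address(as.data.frame(df)[[1L]]))

# On a data.table, as.data.frame() allocates new column vectors
dt_same = identical(address(dt[[1L]]), address(as.data.frame(dt)[[1L]]))

print(c(data.frame = df_same, data.table = dt_same))
```

If the addresses match for the data.frame but not for the data.table, that is consistent with the copy described above; on a 2.4-million-row, 280-column table that copy is where the time would go.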

Moreover, View is not an S3 generic, so we can't offer a faster View.data.table method.

Some recommendations:

  • Learn the setDT and setDF functions -- with these you can switch between data.frame and data.table roughly instantaneously, with no copies. I would write your example as this:
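library(data.table)

NN = 2.4e5
JJ = 280L
jj = 140L

# build as a data.table so that := and .SD are available
d = setDT(replicate(JJ, sample(200:50000, NN, TRUE), simplify = FALSE))
d[ , (1:jj) := lapply(.SD, as.character), .SDcols = 1:jj]

# switch to data.frame by reference (no copy) before viewing
setDF(d)
system.time(utils::View(d))

setDT(d)
# very slow: system.time(utils::View(d))

# faster
setDF(d)
system.time(utils::View(d))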
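
To see that setDF and setDT switch classes without copying the column data, one can compare a column's address across the round trip. This is a small sketch with a made-up three-row table; the names before, as_df, and as_dt are only for illustration.

```r
library(data.table)

x = data.table(a = c(3L, 1L, 2L), b = c("x", "y", "z"))

before = address(x$a)   # column address while x is a data.table
setDF(x)                # now a plain data.frame, converted by reference
as_df = address(x$a)
setDT(x)                # back to a data.table, again by reference
as_dt = address(x$a)

print(c(before = before, as_df = as_df, as_dt = as_dt))
print(class(x))
```

The three addresses should be identical, which is why wrapping View() between setDF() and setDT() avoids the cost of as.data.frame().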

It may be possible to get R to make View an S3 generic, but I'm not sure.

@matthewgson
Author

matthewgson commented May 26, 2020

@MichaelChirico Thanks for your suggestion. Actually, setDF and setDT were what I was already using for View. I just wanted to let the developers know about this issue, and it seems it's not a data.table issue.
