-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Execution time of bind_rows()
#1396
Comments
Please provide a reproducible example. It matters what's in your columns. Are you using dplyr 0.4.3? |
Very simple example: library(dplyr)
for (n in c(1e4, 1e5, 1e6)) {
ll = lapply(1:n, function(x) as.list(setNames(runif(5), letters[1:5])))
print(n)
print(system.time(bind_rows(ll)))
}
Between 1e4 and 1e5 rows, the time difference is about 100 x. _edit_
|
Thanks, no need to also supply the 1e6 time. |
I get this with the devel version:
which looks perfectly linear. Can somebody else double check. @hadley @kevinushey ? |
Perhaps @mschubert's computer struggles with memory and does a lot of GCs. Might be worthwhile running it through valgrind to make sure we have not forgotten to release objects to the GC. |
It's not memory, that was running on a cluster with 10G requested just for the test. Now I tested it on the following systems with the same symptoms: Gentoo Prefix, R 3.2.0 & dplyr_0.4.3.9000 R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3.9000 modules_0.8.1
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.1 assertthat_0.1 parallel_3.2.0 tools_3.2.0
[6] DBI_0.3.1 Rcpp_0.12.1 Archlinux, R 3.2.1 & dplyr_0.4.3.9000 R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Arch Linux
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3.9000
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.1 assertthat_0.1 parallel_3.2.1 tools_3.2.1
[6] DBI_0.3.1 Rcpp_0.12.1 And for those it worked fine: OS-X, R 3.2.1 & dplyr_0.4.3 R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.1 assertthat_0.1 parallel_3.2.1 DBI_0.3.1 tools_3.2.1
Rcpp_0.12.0 |
Everything looks fine on my machine: library(dplyr)
sizes <- seq(1E3, 1E5, by = 1E3)
times <- lapply(sizes, function(n) {
ll = lapply(1:n, function(x) as.list(setNames(runif(5), letters[1:5])))
print(n)
(system.time(bind_rows(ll)))
})
plot(unlist(lapply(times, "[[", 1)), type = "b") giving which is linear aside from gc spikes. > devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------------
setting value
version R version 3.2.2 Patched (2015-09-09 r69342)
system x86_64, darwin13.4.0
ui RStudio (0.99.484)
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
Packages ---------------------------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
BiocInstaller * 1.18.4 2015-09-03 Bioconductor
crayon 1.3.1 2015-07-13 CRAN (R 3.2.0)
curl 0.9.3 2015-08-25 CRAN (R 3.2.1)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools * 1.8.0 2015-05-09 CRAN (R 3.2.0)
digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
dplyr * 0.4.3.9000 2015-09-15 Github (hadley/dplyr@c0f64cb)
git2r 0.11.0 2015-08-12 CRAN (R 3.2.0)
knitr * 1.11 2015-08-14 CRAN (R 3.2.1)
magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
R6 2.1.1 2015-08-19 CRAN (R 3.2.0)
Rcpp 0.12.1 2015-09-10 CRAN (R 3.2.2)
rversions 1.0.2 2015-07-13 CRAN (R 3.2.0)
testthat * 0.10.0 2015-05-22 CRAN (R 3.2.0)
xml2 0.1.2 2015-09-01 CRAN (R 3.2.0) |
I got a further confirmation on an independent machine running Arch (@deeenes) with the same issues and two Macs (@luzgaral, @akalamara) without. |
I'll try it out on my Ubuntu VM to see if I can reproduce. What compilers + flags are you using in each case, just to check? If you have time, does it make a difference whether you compile with |
I'll check it tonight, will update here and ping you. |
I can repro on my Ubuntu VM (Ubuntu 14.04 64bit, with gcc 4.8.2); also with Maybe some weird performance edge-case when using |
One potential confusing factor is the version of Rcpp. As things appear to be fine with I know for example that |
@kevinushey had |
Right. So glad we can rule that out. That's a weird one. |
My compile args are (R the same, with addition of CPPFLAGS="-D_FORTIFY_SOURCE=2"
CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro" The package linux-vdso.so.1 (0x00007ffd6e5d7000)
libR.so => /usr/lib/R/lib/libR.so (0x00007f8435239000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8434eb7000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f8434bb9000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f84349a2000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f84345fe000)
libblas.so => /usr/lib/libblas.so (0x00007f8433884000)
libreadline.so.6 => /usr/lib/libreadline.so.6 (0x00007f8433638000)
libpcre.so.1 => /usr/lib/libpcre.so.1 (0x00007f84333c8000)
liblzma.so.5 => /usr/lib/liblzma.so.5 (0x00007f84331a2000)
libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x00007f8432f91000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007f8432d7b000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f8432b73000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f843296e000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f843274c000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f843252f000)
/usr/lib64/ld-linux-x86-64.so.2 (0x0000560900506000)
libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007f8432204000)
libncursesw.so.5 => /usr/lib/libncursesw.so.5 (0x00007f8431f9f000)
libquadmath.so.0 => /usr/lib/../lib/libquadmath.so.0 (0x00007f8431d5f000) Tested also with R/dplyr compiled using clang=3.6, same issues. |
I think we'll need to see output from a sampling profiler to see where most of the time is being spent when running |
Ok, next set of tests. Using CPUPROFILE=samples.log R --no-save --no-restore < test.r
pprof --text /usr/lib/R/lib/libR.so samples.log where Using local file /usr/lib/R/lib/libR.so.
Using local file samples.log.
Total: 35 samples
8 22.9% 22.9% 8 22.9% R_BadLongVector
8 22.9% 45.7% 8 22.9% Rf_pmatch
3 8.6% 54.3% 3 8.6% CONS_NR
2 5.7% 60.0% 2 5.7% Rf_allocVector3
2 5.7% 65.7% 2 5.7% __strcmp_sse2_unaligned
2 5.7% 71.4% 2 5.7% do_Rprof
1 2.9% 74.3% 1 2.9% R_do_slot_assign
1 2.9% 77.1% 1 2.9% R_init_stats If I try with Check failed: depth > 0: ProfileData::Add depth <= 0
Aborted (core dumped) |
After Update: recompiled R with debugging symbols and
The main problem seems to be a call to Since at least a part of the delay includes |
I'm pretty sure it was Boost 1.58, but doesn't dplyr just use the version of It definitely seems like some weirdness with garbage collection is going on; using the 'stop it randomly' tactic, I see stack frames of size ~10000 (!!) so perhaps a huge number of temporary R objects are being created, protected, and unprotected in the call to For posterity, the stack trace of one sample stopped.
|
After some more manual debugging, I can confirm that almost all the time spent is to destruct Taken together with the profiling above, something is going seriously wrong with the |
Perhaps the issue is that objects are getting released with a bad ordering? If I understand correctly, we have a large vector of R objects. When a new object is added, it's protected with @romainfrancois, does that sound plausible? Would it be possible to explicitly release the objects in reverse order (so that we just pop off the top of the |
FWIW, it looks like destruction order of elements in a |
ah ok. I actually had something before to control order of destruction, but I ran a few tests and it seemed destruction happened in the reverse order. But I guess not on all implementations. I'll look into it. |
Indeed, explicitly popping items off the chunks list with something like: while (chunks.size()) chunks.pop_back(); fixes this (tested on my Ubuntu VM). There's probably a better / more explicit way of handling this though. |
Confirmed, 1e5 rows down from 95 to 1.8 seconds. |
I pushed some code based on @kevinushey's hint, but embedding the logic in the destructor of the @mschubert can you let me know if this gives you the right performance please. |
1e5 rows in 2.8 seconds, 1e6 in 29s - yes, looks linear (but is a bit slower than the solution above, or more load on the node right now) |
Slower than what solution ? If you mean @kevinushey's that's what I'm doing, the code just is in a destructor rather than at the end of the function. |
Yes, that one. That was probably just more load on the node I tested, feel free to ignore if what you changed can't have caused it. |
Fine. I'll consider this closed then. |
I am trying to bind together a large number of rows, for which I would assume that the time used should be approximately O(n).
With
bind_rows
and N individual (1-row, 6-column)data.frame
objects, this is what I see:Measurements are taken loading a
data.frame
row list into memory, and then usingsystem.time(bind_rows(data))
.Is there any reason performance decreases that much with a higher number of rows?
I guess that
Rcpp
should be able to bind 1M rows in a couple of minutes.The text was updated successfully, but these errors were encountered: