Data.table crashes R session on rbindlist #2340

Closed
peekxc opened this issue Sep 7, 2017 · 8 comments

Comments

@peekxc

peekxc commented Sep 7, 2017

At the following GitHub link is a test data set that should replicate the issue.

In a clean R session, the following seems to crash R completely with a memory violation:

    load(file = "test.rdata")
    data.table::rbindlist(test, idcol = "tid")

I'm running R version 3.4.1.

The issue seems to occur when 'idcol' is used and there are empty data.tables in the list to be rbind'ed. I found that the following workaround seems to produce the expected behaviour:

    test2 <- Filter(function(dt) nrow(dt) != 0, test)
    data.table::rbindlist(test2, idcol = "tid")

But it would be great if data.table handled this.
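One caveat with the workaround above: on an unnamed list, dropping the empty elements shifts the positions that `idcol` reports. A minimal sketch of one way around that, assuming the list is unnamed and using a small hypothetical stand-in for the `test` object from the linked data set:

```r
library(data.table)

# Hypothetical stand-in for `test`: the middle element is an empty data.table.
test <- list(data.table(x = 1, y = 2), data.table(), data.table(x = 3, y = 4))

# Name each element by its original position BEFORE filtering. rbindlist() puts
# list names into the id column when the input list is named.
names(test) <- as.character(seq_along(test))

# Drop the empty tables (the workaround from this comment), then bind.
test2 <- Filter(function(dt) nrow(dt) != 0, test)
res <- rbindlist(test2, idcol = "tid")

# The names come back as character; convert to the original integer positions.
res[, tid := as.integer(tid)]
print(res)
```

With this, `tid` refers to positions in the original list rather than in the filtered one.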

@franknarf1
Contributor

The FAQ, vignette("datatable-faq"), says:

Reading data.table from RDS or RData file

*.RDS and *.RData are file types which can store in-memory R objects on disk efficiently. However, storing data.table into the binary file loses its column over-allocation. This isn't a big deal -- your data.table will be copied in memory on the next by reference operation and throw a warning. Therefore it is recommended to call alloc.col() on each data.table loaded with readRDS() or load() calls.

Does rbindlist(lapply(test, alloc.col), idcol = "tid") also crash?
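For reference, a small sketch of what the FAQ passage describes, assuming a data.table version that still exports `alloc.col()` (newer releases also provide `setalloccol()`). The temporary file and the toy table are placeholders, not the data from this issue:

```r
library(data.table)

# Round-trip a small data.table through an RDS file (hypothetical example data).
f <- tempfile(fileext = ".rds")
saveRDS(data.table(x = 1:3, y = letters[1:3]), f)
dt <- readRDS(f)

# truelength() reports the allocated column slots; after readRDS() the
# over-allocation described in the FAQ is no longer there.
tl_before <- truelength(dt)

# Re-create the over-allocation so later by-reference operations do not
# need to copy the table first.
dt <- alloc.col(dt)
tl_after <- truelength(dt)

print(c(before = tl_before, after = tl_after))
```

For a list of loaded tables, the same call can be applied to each element, as in the `lapply(test, alloc.col)` suggestion above.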

@peekxc
Author

peekxc commented Sep 7, 2017

Hmm I did not notice this in the FAQ. That being said, it does still crash the session.

@franknarf1
Contributor

Oh, hadn't noticed your example was in a gist. No repro here on R 3.3.3:

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-09-06 22:02:57 UTC; travis
load("C:\\Users\\ferickson\\Downloads\\test.rdata")
rbindlist(test, idcol = "tid")
# typical results

sessionInfo()
# R version 3.3.3 (2017-03-06)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1

Btw, I guess you are testing on the devel version. If not, see https://github.com/Rdatatable/data.table/wiki/Support

@peekxc
Author

peekxc commented Sep 7, 2017

Interesting that it doesn't seem to reproduce.

I don't think I'm using a dev version?

sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: OS X El Capitan 10.11.6
library("data.table")
# data.table 1.10.4

@arunsrinivasan
Member

Just got caught with this as well. Here's a simpler example... Run it a couple of times to get a segfault (on Windows with 1.10.4 at the moment):

require(data.table)
ll <- list(data.table(x=1, y=2), data.table(), data.table(x=3, y=4))
dt <- data.table(bla=1:3, ll)

# run multiple times and you should get a segfault
dt[, rbindlist(ll, idcol=".id")]
dt[, rbindlist(ll, idcol=".id")]
dt[, rbindlist(ll, idcol=".id")]
dt[, rbindlist(ll, idcol=".id")]

@jsams
Contributor

jsams commented Sep 21, 2017

I'm not sure this is related only to the use of idcol. I'm getting it with some real world data, but I don't have a replicable use case. My setup reads 1000 data.tables from disk (in 1000 rds files), selects some rows and aggregates, and then rbinds the results.

(Actually, 40 files are read from disk and the selection and aggregation are applied; rbindlist is called on that result, which is stored in a list via lapply. Then rbindlist is called again on the resulting list. This is the result of that process.)

I've verified that all items in the list are of class data.table and have a nonzero number of rows. The list is 30.1 GB as reported by pryr::object_size, on a machine with 1.5 TB of RAM.

uscd = rbindlist(user_song_count_dtlist)

*** caught segfault ***
address 0x7f27442b92c4, cause 'memory not mapped'

Traceback:
1: rbindlist(user_song_count_dtlist)

(sorry for the delete/repost, wanted to use this account, edit on reading from disk turned out to be inaccurate)

@jsams
Contributor

jsams commented Sep 22, 2017

Apologies, after attempting some workarounds, I've discovered that a data.table apparently can't have more than MAXINT rows. I didn't realize that was a limitation, and it is probably what was causing my issue.

@arunsrinivasan arunsrinivasan added this to the v1.10.6 milestone Nov 3, 2017
@mattdowle
Member

mattdowle commented Nov 8, 2017

I can reproduce using @arunsrinivasan's example with current CRAN (1.10.4-3) on Ubuntu.
It looks fixed in dev though. Bug fix 5 in NEWS for 1.10.5:

Seg fault in rbindlist() when one or more items are empty, #2019. Thanks Michael Lang for the pull request.

Thanks to @mllg's PR #2077 merged in May 2017.

@jsams Given the above, I doubt it's related to MAXINT in your case. Can you confirm dev works fine for you? But MAXINT is a good side issue. The rbindlist.c source accumulates n_rows in type size_t, so that looks correct and won't overflow, but I can't see a check that n_rows < MAXINT. A test could construct a list() with multiple references to the same DT, so that the "big" case can be run to check that it fails gracefully (because the result would have > 2bn rows) without actually needing a lot of RAM. [Update: yes there was a segfault there. Fixed and test added.]
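A rough sketch of the kind of test described above, not the test that was actually added to the package. It assumes a data.table version that already contains the fix, where an oversized result is expected to raise an R error; on an affected version such as 1.10.4 the same call may crash the session instead. The sizes are illustrative:

```r
library(data.table)

# One modest table: 2^20 integer rows is only about 4 MB.
DT <- data.table(x = integer(2^20))

# 2^11 references to the SAME table. The list only holds pointers, so it stays
# small, but the combined row count is 2^31, one more than .Machine$integer.max.
l <- rep(list(DT), 2^11)
total_rows <- sum(vapply(l, nrow, numeric(1)))
stopifnot(total_rows > .Machine$integer.max)

# On a fixed version this should fail gracefully instead of allocating the
# result; capture the condition rather than letting it stop the script.
res <- tryCatch(rbindlist(l), error = function(e) e)
print(class(res))
```

Because every list element points at the same `DT`, the check on the total row count is exercised without needing memory for 2bn rows.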
