Skip to content

merge.data.table using character column crashing Rstudio with tables above 65536 rows #5552

@Stikus

Description

@Stikus

Hello, our production code become broken after we one of our dataframes become bigger than 65536 rows.
For about a year, we have a code that merge small data.frame to big data.table with merge. Due to merge.data.table syntax, first argument of merge decided merge manner, and it was data.table, resulting merge.data.table. I understand that it is not correct to mix types, but it was our oversight.

Last week, our "small" data.frame got bigger and Rstudio start to crash not only when merging with big data.table, but even with small ones. We switched to merge.data.frame and all start working again, but in takes 12+ hours for our code to run instead 1+ hour - that's why I'm asking for help.

After some debug I've found that before 65536 rows in data.frame all is good, and after it is not. We tested this example on several machines, and two of our main machines have error, but some others don't. I've posted session infos from broken and working machine FYI.

I've created an example that shows this behavior. As you can see - merge.data.frame working good (a_test and b_test are data.frames), merge.data.table with numeric column working good (a1_test and b1_test are data.tables), merge.data.table with character column fails.

image

# Minimal reproducible example

library(data.table)

a_test <- setNames(as.data.frame(cbind(1:10, 1:10, 11:20)), c('A', 'A_char',  'B'))
a_test$A_char <- as.character(a_test$A_char)
a1_test <- as.data.table(a_test)
b_test <- setNames(as.data.frame(cbind(1:65600, 1:65600, 21:65620)), c('A', 'A_char', 'C'))
b_test$A_char <- as.character(b_test$A_char)
b1_test <- as.data.table(b_test)
c_test <- merge(a_test, b_test, by = 'A')
c1_test <- merge(a1_test, b1_test, by = 'A')
c1_char_test <- merge(a1_test, b1_test, by = 'A_char')

# Output of sessionInfo()

> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.6

loaded via a namespace (and not attached):
[1] compiler_4.2.2 tools_4.2.2 

# Output of sessionInfo() from machine where all is ok

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.6

loaded via a namespace (and not attached):
[1] compiler_4.2.1 tools_4.2.1   

I've found somewhat similar issue #4733, that's why I've linked it.

I can produce any other debug information if you tell me how. Looking for help - we cannot afford 12+ hours of analysis in production.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions