Importing R data.table's gzipped csv files stops early #6522
Comments
JIRA Issue Details
Jira Issue: PUBDEV-8932
This issue concerns how data.table performs compression, which does not seem to follow the same protocol as normal gzip compression. Hence, it is less an issue with H2O and more a problem with data.table.
Reposting the original issue with better formatting to link to
library(data.table)
library(h2o)
sessionInfo()
# R version 4.4.2 (2024-10-31 ucrt)
# Platform: x86_64-w64-mingw32/x64
# Running under: Windows 11 x64 (build 26100)
#
# Matrix products: default
#
#
# locale:
# [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
# [5] LC_TIME=English_United States.utf8
#
# time zone: America/Chicago
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] h2o_3.44.0.3 data.table_1.16.2
#
# loaded via a namespace (and not attached):
# [1] compiler_4.4.2 tools_4.4.2 RCurl_1.98-1.16 rstudioapi_0.16.0 jsonlite_1.8.9 bitops_1.0-9 renv_1.0.10
h2o.init()
set.seed(87)
dt <- data.table(a = rnorm(1e6),
b = sample(x = 0:1, size = 1e6, replace = TRUE))
# write a .csv using data.table's gzip
# uses zlib, I believe, due to SystemRequirements in DESCRIPTION
fwrite(x = dt, file = "fake_data1.csv.gz")
# same as
fwrite(x = dt, file = "fake_data1.csv", compress = "gzip")
# export a normal .csv, then use builtin gzip
fwrite(x = dt, file = "fake_data2.csv")
system2(command = "gzip", args = "fake_data2.csv")
# no "error" but only imports ~6k rows
h2oframe <- h2o.importFile(normalizePath("~/fake_data1.csv.gz"))
nrow(h2oframe)
# [1] 6197
# imports full file correctly
h2oframe <- h2o.importFile(normalizePath("~/fake_data2.csv.gz"))
nrow(h2oframe)
# [1] 1000000
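To tell a truncated file apart from a truncated import, one could count the lines with a reader that does not involve H2O. The following is a minimal sketch in base R. The helper name `count_gz_lines` is hypothetical (it is not part of data.table or h2o), and the sketch assumes R's `gzfile()` connection reads the file in full. It has not been run against the files above.

```r
# Hypothetical helper: count the lines of a gzip-compressed text file
# using only base R, reading in chunks to keep memory use bounded.
count_gz_lines <- function(path, chunk_size = 100000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    n <- n + length(lines)
  }
  n
}

# Self-contained demo on a small temporary file:
# one header line plus three data rows.
tmp <- tempfile(fileext = ".csv.gz")
out <- gzfile(tmp, open = "wt")
writeLines(c("a,b", "0.1,0", "0.2,1", "0.3,0"), out)
close(out)

count_gz_lines(tmp)
```

For the files in this report, a count of 1000001 lines (header plus 1e6 rows) for `fake_data1.csv.gz` would suggest the file itself is complete and that the rows are lost at import time.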
The following example shows that if you try to import a {{.csv.gz}} file created by {{data.table}}, {{h2o}} does not import the full file, whereas it does if you run {{gzip}} as a separate step. I’m guessing there’s a difference in the header that breaks the import logic. I also reproduced this issue on a Linux machine.
{code:r}library(data.table)
library(h2o)
sessionInfo()
# R version 4.2.2 (2022-10-31 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 22621)
# Matrix products: default
# locale:
# [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
# [4] LC_NUMERIC=C LC_TIME=English_United States.utf8
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] h2o_3.38.0.1 data.table_1.14.6
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.9 compiler_4.2.2 later_1.3.0 urlchecker_1.0.1 bitops_1.0-7 prettyunits_1.1.1 profvis_0.3.7
# [8] remotes_2.4.2 tools_4.2.2 digest_0.6.30 pkgbuild_1.4.0 pkgload_1.3.2 jsonlite_1.8.4 memoise_2.0.1
# [15] lifecycle_1.0.3 rlang_1.0.6 shiny_1.7.3 cli_3.4.1 rstudioapi_0.14 fastmap_1.1.0 stringr_1.5.0
# [22] fs_1.5.2 htmlwidgets_1.5.4 devtools_2.4.5 glue_1.6.2 R6_2.5.1 processx_3.8.0 sessioninfo_1.2.2
# [29] callr_3.7.3 purrr_0.3.5 magrittr_2.0.3 ps_1.7.2 promises_1.2.0.1 ellipsis_0.3.2 htmltools_0.5.4
# [36] usethis_2.1.6 mime_0.12 xtable_1.8-4 httpuv_1.6.6 stringi_1.7.8 miniUI_0.1.1.1 RCurl_1.98-1.9
# [43] cachem_1.0.6 crayon_1.5.2
h2o.init()
set.seed(87)
dt <- data.table(a = rnorm(1e6),
b = sample(x = 0:1, size = 1e6, replace = TRUE))
# write a .csv using data.table's gzip
# uses zlib, I believe, due to SystemRequirements in DESCRIPTION
fwrite(x = dt, file = "fake_data1.csv.gz")
# same as
fwrite(x = dt, file = "fake_data1.csv", compress = "gzip")
# export a normal .csv, then use builtin gzip
fwrite(x = dt, file = "fake_data2.csv")
system2(command = "gzip", args = "fake_data2.csv")
# no "error" but only imports ~6k rows
h2oframe <- h2o.importFile(normalizePath("~/fake_data1.csv.gz"))
nrow(h2oframe)
# [1] 6197
# imports full file correctly
h2oframe <- h2o.importFile(normalizePath("~/fake_data2.csv.gz"))
nrow(h2oframe)
# [1] 1000000
{code}
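The guess about a header difference could be checked by comparing the leading bytes of the two files. The following is a minimal sketch in base R. The helper names `gz_header_bytes` and `is_gzip` are hypothetical. The 10-byte fixed header length and the magic bytes `1f 8b` come from the gzip file format (RFC 1952), not from this report.

```r
# Hypothetical helper: read the first n raw (still compressed) bytes of a file.
# Opening in "rb" mode returns the bytes as stored on disk.
gz_header_bytes <- function(path, n = 10L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readBin(con, what = "raw", n = n)
}

# Hypothetical helper: TRUE if the file starts with the gzip magic bytes 1f 8b.
is_gzip <- function(path) {
  hdr <- gz_header_bytes(path, n = 2L)
  identical(hdr, as.raw(c(0x1f, 0x8b)))
}

# Self-contained demo: one gzip-compressed file and one plain-text file.
gz_path <- tempfile(fileext = ".csv.gz")
out <- gzfile(gz_path, open = "wt")
writeLines(c("a,b", "0.1,0"), out)
close(out)

plain_path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "0.1,0"), plain_path)

gz_header_bytes(gz_path)   # fixed-length gzip header of the compressed file
is_gzip(gz_path)
is_gzip(plain_path)
```

For the files in this report, one would compare `gz_header_bytes("fake_data1.csv.gz")` with `gz_header_bytes("fake_data2.csv.gz")`. If both start with a valid gzip header, that would suggest the difference lies further into the stream rather than in the header.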