Importing R data.table's gzipped csv files stops early #6522

exalate-issue-sync · 2023-02-21T22:08:47Z

The following example shows that if you try to import a {{.csv.gz}} file created by {{data.table}}, {{h2o}} does not import the full file, whereas it does if you run {{gzip}} as a separate step. I’m guessing there’s a difference in the header that breaks the import logic. I also reproduced this issue on a Linux machine.

{code:r}library(data.table)
library(h2o)

sessionInfo()
# R version 4.2.2 (2022-10-31 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 22621)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
# [4] LC_NUMERIC=C LC_TIME=English_United States.utf8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] h2o_3.38.0.1 data.table_1.14.6
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.9 compiler_4.2.2 later_1.3.0 urlchecker_1.0.1 bitops_1.0-7 prettyunits_1.1.1 profvis_0.3.7
# [8] remotes_2.4.2 tools_4.2.2 digest_0.6.30 pkgbuild_1.4.0 pkgload_1.3.2 jsonlite_1.8.4 memoise_2.0.1
# [15] lifecycle_1.0.3 rlang_1.0.6 shiny_1.7.3 cli_3.4.1 rstudioapi_0.14 fastmap_1.1.0 stringr_1.5.0
# [22] fs_1.5.2 htmlwidgets_1.5.4 devtools_2.4.5 glue_1.6.2 R6_2.5.1 processx_3.8.0 sessioninfo_1.2.2
# [29] callr_3.7.3 purrr_0.3.5 magrittr_2.0.3 ps_1.7.2 promises_1.2.0.1 ellipsis_0.3.2 htmltools_0.5.4
# [36] usethis_2.1.6 mime_0.12 xtable_1.8-4 httpuv_1.6.6 stringi_1.7.8 miniUI_0.1.1.1 RCurl_1.98-1.9
# [43] cachem_1.0.6 crayon_1.5.2

h2o.init()

set.seed(87)
dt <- data.table(a = rnorm(1e6),
                 b = sample(x = 0:1, size = 1e6, replace = TRUE))

# write a .csv using data.table's gzip
# uses zlib, I believe, due to SystemRequirements in DESCRIPTION
fwrite(x = dt, file = "fake_data1.csv.gz")

# same as
# fwrite(x = dt, file = "fake_data1.csv", compress = "gzip")

# export a normal .csv, then use builtin gzip
fwrite(x = dt, file = "fake_data2.csv")
system2(command = "gzip", args = "fake_data2.csv")

# no "error" but only imports ~6k rows
h2oframe <- h2o.importFile(normalizePath("~/fake_data1.csv.gz"))
nrow(h2oframe)
# [1] 6197

# imports full file correctly
h2oframe <- h2o.importFile(normalizePath("~/fake_data2.csv.gz"))
nrow(h2oframe)
# [1] 1000000
{code}
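
One way to test the header guess is to look at the raw gzip framing of both files. The sketch below is not from the original report. It assumes only the layout in RFC 1952: a 10-byte fixed header starting with the magic bytes `1f 8b`, and an 8-byte trailer whose last four bytes (ISIZE) hold the uncompressed length modulo 2^32, little-endian. The helper name `gz_summary` is hypothetical. If a file contains several gzip members, the trailer read here belongs to the last member only.

```r
# Hypothetical helper: summarise the gzip header and trailer of a file.
# Reads the whole file into memory, which is fine for files of a few MB.
gz_summary <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  stopifnot(length(bytes) >= 18L)  # 10-byte header + 8-byte trailer

  header <- bytes[1:10]
  # ISIZE is the last 4 bytes of the file, least significant byte first
  isize_bytes <- bytes[length(bytes) - 3:0]
  isize <- sum(as.numeric(isize_bytes) * 256^(0:3))

  list(
    magic_ok = identical(header[1:2], as.raw(c(0x1f, 0x8b))),
    method   = as.integer(header[3]),   # 8 means deflate
    flags    = as.integer(header[4]),   # FLG bits (FNAME, FEXTRA, ...)
    os       = as.integer(header[10]),  # OS byte
    isize    = isize                    # claimed uncompressed size mod 2^32
  )
}

# Usage against the two files from the example above (not run here):
# str(gz_summary("fake_data1.csv.gz"))
# str(gz_summary("fake_data2.csv.gz"))
```

Comparing `isize` with the size of the uncompressed `.csv` shows whether the trailer agrees with the real data length. Differences in `flags` or `os` would point at the header instead.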

h2o-ops · 2023-05-10T13:54:13Z

JIRA Issue Details

Jira Issue: PUBDEV-8932
Assignee: New H2O Bugs
Reporter: Paul Donnelly
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

wendycwong · 2023-06-25T00:21:18Z

This issue concerns how data.table performs compression: it does not seem to follow the same protocol as normal gzip compression. Hence, it is not an issue with H2O but rather a problem with data.table.
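
Until the behaviour changes on either side, the separate-step approach from the original report can be done without an external `gzip` binary. The following is a minimal sketch using only base R connections. The helper name `gzip_copy` and its `chunk_bytes` parameter are hypothetical, and whether H2O imports the resulting file in full was not tested in this thread.

```r
# Hypothetical helper: compress an existing plain file with base R's gzfile(),
# copying it in chunks so the whole file never has to sit in memory.
gzip_copy <- function(src, dest = paste0(src, ".gz"), chunk_bytes = 1e6) {
  input <- file(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- gzfile(dest, open = "wb")
  on.exit(close(output), add = TRUE)

  repeat {
    chunk <- readBin(input, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break  # end of input
    writeBin(chunk, output)
  }
  invisible(dest)
}

# Usage with the example data (not run here):
# data.table::fwrite(x = dt, file = "fake_data3.csv")
# gzip_copy("fake_data3.csv")  # writes fake_data3.csv.gz
```

Unlike `system2("gzip", ...)`, this leaves the uncompressed source file in place.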

hutch3232 · 2024-11-15T02:16:17Z

Reposting the original issue with better formatting to link to data.table repo.

library(data.table)
library(h2o)

sessionInfo()
# R version 4.4.2 (2024-10-31 ucrt)
# Platform: x86_64-w64-mingw32/x64
# Running under: Windows 11 x64 (build 26100)
# 
# Matrix products: default
# 
# 
# locale:
#   [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
# [5] LC_TIME=English_United States.utf8    
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] h2o_3.44.0.3      data.table_1.16.2
# 
# loaded via a namespace (and not attached):
#   [1] compiler_4.4.2    tools_4.4.2       RCurl_1.98-1.16   rstudioapi_0.16.0 jsonlite_1.8.9    bitops_1.0-9      renv_1.0.10

h2o.init()

set.seed(87)
dt <- data.table(a = rnorm(1e6),
                 b = sample(x = 0:1, size = 1e6, replace = TRUE))

# write a .csv using data.table's gzip
# uses zlib, I believe, due to SystemRequirements in DESCRIPTION
fwrite(x = dt, file = "fake_data1.csv.gz")

# same as
fwrite(x = dt, file = "fake_data1.csv", compress = "gzip")
# export a normal .csv, then use builtin gzip
fwrite(x = dt, file = "fake_data2.csv")
system2(command = "gzip", args = "fake_data2.csv")

# no "error" but only imports ~6k rows
h2oframe <- h2o.importFile(normalizePath("~/fake_data1.csv.gz"))
nrow(h2oframe)
# [1] 6197

# imports full file correctly
h2oframe <- h2o.importFile(normalizePath("~/fake_data2.csv.gz"))
nrow(h2oframe)
# [1] 1000000
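
To check the row counts above independently of H2O, the lines in each file can be counted through R's own gzip reader. This is a sketch, not part of the original report. The helper name `count_gz_lines` is hypothetical, and how `gzfile()` handles the file written by `fwrite` is exactly what it would reveal, so no expected result is claimed.

```r
# Hypothetical helper: count newline bytes in a gzip-compressed file by
# decompressing it in chunks through a base R gzfile() connection.
count_gz_lines <- function(path, chunk_bytes = 1e6) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))

  newline <- as.raw(0x0a)
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break  # end of decompressed stream
    total <- total + sum(chunk == newline)
  }
  total
}

# Usage (not run here); a complete file has 1e6 data rows plus one header line:
# count_gz_lines("fake_data1.csv.gz")
# count_gz_lines("fake_data2.csv.gz")
```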

wendycwong closed this as completed Jun 25, 2023

hutch3232 mentioned this issue Nov 15, 2024

fwrite with compress="gzip" produces gz files with incorrect uncompressed file sizes Rdatatable/data.table#6356

Open
