fread(integer64 = "double") not working for some data #2607

Open
renkun-ken opened this issue Feb 4, 2018 · 8 comments

@renkun-ken
Member

renkun-ken commented Feb 4, 2018

I'm testing the latest development version of data.table and found that fread does not respect integer64 = "double" for some of my data. The release version does not have this problem.

The test code is:

dt <- fread("test_data.txt", sep = "|", integer64 = "double")

but the resulting data.table still has an integer64 column:

> str(dt)
Classes ‘data.table’ and 'data.frame':	2583 obs. of  15 variables:
 $ V1 : num  0.1 0.01 0.04 0.04 0.02 NA 0.01 0.01 0.03 0.02 ...
 $ V2 : num  0.0509 0.01 0.0138 0.0248 0.0141 ...
 $ V3 : num  0.0451 0.01 0.0153 0.0445 0.0133 ...
 $ V4 : num  0.0386 0.01 0.0153 0.0557 0.0129 ...
 $ V5 :integer64 1020396 55949051 4935942 3668540 11818540 30119787 115742884 0 ... 
 $ V6 : num  1734596 60721591 10311588 1711172 15439786 ...
 $ V7 : num  1541020 46302203 5275696 1276379 13350567 ...
 $ V8 : num  1408261 42321135 3844999 1173221 11545387 ...
 $ V9 : num  9282004 84280297 15006800 14062030 81537656 ...
 $ V10: logi  NA NA NA NA NA NA ...
 $ V11: logi  NA NA NA NA NA NA ...
 $ V12: logi  NA NA NA NA NA NA ...
 $ V13: logi  NA NA NA NA NA NA ...
 $ V14: logi  NA NA NA NA NA NA ...
 $ V15: num  2.35e+09 2.45e+09 1.11e+09 2.28e+09 5.38e+09 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The data is attached below:

test_data.txt

The same happens on both macOS and Ubuntu as I tested.

Here's my session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
 [1] bit_1.1-12      httr_1.3.1      compiler_3.4.3  R6_2.2.2        tools_3.4.3     withr_2.1.1     curl_3.1        yaml_2.1.16    
 [9] memoise_1.1.0   bit64_0.9-7     knitr_1.19      git2r_0.21.0    digest_0.6.15   devtools_1.13.4

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3    yaml_2.1.16   
@chenwq

chenwq commented Feb 7, 2018

DT = fread("A\n1.010203040506070809010203040506\n")  # too precise for double, so read as character
# TODO: add numerals=c("allow.loss", "warn.loss", "no.loss") from base::read.table
typeof(DT$A)=="character"   # TRUE

@renkun-ken
Member Author

I'm now using the latest release, 1.11.0.

Another file triggers exactly the same problem, which in turn causes rbindlist to fail:

fread-issue-sample.txt

> p1 <- fread("~/data/fread-issue-sample.txt", integer64 = "double")
> str(p1)
Classes ‘data.table’ and 'data.frame':	8160 obs. of  2 variables:
 $ volume  : int  13203 8201 8041 5391 7779 6079 5848 7249 6109 5406 ...
 $ turnover:integer64 53907549 33420919 32739957 21940344 31610610 24671207 23693960 29322253 ... 
 - attr(*, ".internal.selfref")=<externalptr> 

Any idea on this @mattdowle?

@st-pasha
Contributor

st-pasha commented May 4, 2018

When verbose mode is turned on, it shows the following:

...
Column 2 ("turnover") bumped from 'int32' to 'int64' due to <<2402620023>> on row 2400

Thus, this issue appears to be caused by #2749: the integer64 parameter is only applied during stage [09] "Apply user overrides on column types", and is not taken into account when an out-of-sample type bump occurs.
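For context on why the bump happens at all: the value in the verbose message, 2402620023, is larger than the largest signed 32-bit integer (2147483647), so a column sampled as int32 cannot hold it. A minimal C sketch of that range check follows; `fits_int32` is a hypothetical helper written for illustration and is not taken from fread.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper (not part of fread.c): returns true if `field` parses
// completely as a decimal integer that fits in a signed 32-bit value.
bool fits_int32(const char *field) {
    char *end = NULL;
    errno = 0;
    long long v = strtoll(field, &end, 10);
    // Reject overflow of long long, empty input, and trailing characters.
    if (errno != 0 || end == field || *end != '\0') return false;
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

With this check, `fits_int32("2402620023")` is false while `fits_int32("53907549")` is true, which matches a column that looks like int32 in the sample rows and only fails on row 2400.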

@mattdowle
Member

Yes, the integer64= control is currently dealt with in userOverride(). It looks like we'll need to pass readInt64As down to fread.c in order for it to be used in the out-of-sample type bump as well. Unfortunately it's not just a matter of disabling the int64 parser, because disabling it would only provide the readInt64As="double" ability, whereas the control allows readInt64As="character" too (skipping double in the type hierarchy).
If there might be similar requirements for other types, perhaps disabled_parsers in fread.c could be expanded. It is currently an int holding 0/1 only. Instead it could hold the number of positions to skip.
So:

readInt64As="integer64"  =>  disabled_parsers[CT_INT64] == 0
readInt64As="double"     =>  disabled_parsers[CT_INT64] == 1
readInt64As="character"  =>  disabled_parsers[CT_INT64] == 4

The 4 comes from needing to skip CT_INT64 itself plus CT_FLOAT64, CT_FLOAT64_HEX and CT_FLOAT64_EXT to get to CT_STRING.
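The skip-count idea could be sketched in C roughly as follows. The enum only mirrors the ordering implied above (CT_INT64, then the three float types, then CT_STRING) and may differ from the real list in fread.h; `next_type_with_skips` is a hypothetical name, not an existing fread.c function.

```c
// Assumed type order for illustration only; the real enum in fread.h may differ.
enum ColType {
    CT_DROP, CT_BOOL8, CT_INT32, CT_INT64,
    CT_FLOAT64, CT_FLOAT64_HEX, CT_FLOAT64_EXT, CT_STRING, NUMTYPE
};

// skips[t] == 0 means parser t is enabled; skips[t] == k > 0 means
// "do not use t, jump k positions forward instead".
// Returns the type to try after `failed_type` could not parse a field.
int next_type_with_skips(const int skips[NUMTYPE], int failed_type) {
    int candidate = failed_type + 1;
    while (candidate < CT_STRING && skips[candidate] > 0)
        candidate += skips[candidate];
    // CT_STRING always succeeds, so never go past it.
    return candidate < CT_STRING ? candidate : CT_STRING;
}
```

Under these assumptions, a bump out of CT_INT32 lands on CT_INT64 when skips[CT_INT64] is 0, on CT_FLOAT64 when it is 1, and on CT_STRING when it is 4. The drawback is that the 4 is tied to the current number of float parsers.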
Alternatively, disabled_parsers could hold which type to use instead. A value of 0 would mean "use that parser", as it does now. A non-zero value in position i would need to be >i, otherwise an infinite loop would occur.

readInt64As="integer64"  =>  disabled_parsers[CT_INT64] == 0
readInt64As="double"     =>  disabled_parsers[CT_INT64] == CT_FLOAT64
readInt64As="character"  =>  disabled_parsers[CT_INT64] == CT_STRING

This approach would save needing to maintain the skip values in disabled_parsers as parsers are added and removed in future.
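A comparable sketch of the redirect variant, under the same assumptions: the enum order is illustrative and may differ from fread.h, and `next_type_with_redirects` is a hypothetical name. The assert encodes the "value must be >i" rule that prevents an infinite loop.

```c
#include <assert.h>

// Assumed type order for illustration only; the real enum in fread.h may differ.
enum ColType {
    CT_DROP, CT_BOOL8, CT_INT32, CT_INT64,
    CT_FLOAT64, CT_FLOAT64_HEX, CT_FLOAT64_EXT, CT_STRING, NUMTYPE
};

// redirect[t] == 0 means parser t is enabled; otherwise redirect[t] is the
// type to use in place of t. Returns the type to try after `failed_type`.
int next_type_with_redirects(const int redirect[NUMTYPE], int failed_type) {
    int candidate = failed_type + 1;
    while (candidate < CT_STRING && redirect[candidate] != 0) {
        // A redirect must point strictly forward, or this loop never ends.
        assert(redirect[candidate] > candidate);
        candidate = redirect[candidate];
    }
    return candidate < CT_STRING ? candidate : CT_STRING;
}
```

Here readInt64As="character" would be expressed as redirect[CT_INT64] = CT_STRING, which stays correct if float parsers are added or removed between the two.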

@mattdowle mattdowle modified the milestones: 1.11.4, 1.11.6 May 24, 2018
@jangorecki jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018
@bhagwataditya

bhagwataditya commented Jun 7, 2018

I am experiencing the same issue in the release version (1.11.4): one of my columns is being bumped to integer64, despite data.table::fread(..., integer64 = 'numeric').

(If useful I could create a reproducible example, but it looks like you have that already.)

@mecoskun

mecoskun commented Feb 9, 2020

I'm having the same problem with integer64 conversions. Did this problem get solved?

@elad663

elad663 commented Feb 19, 2020

Is this related to the int64 issues in rbindlist, or should it be addressed in another issue?

@DavorJ

DavorJ commented Jun 8, 2020

Quick & Dirty workaround:

  # df must be a data.table; converts every integer64 column to double by reference
  coln.int64 <- names(which(vapply(df, bit64::is.integer64, logical(1L))))
  if (length(coln.int64) > 0L)
    df[, (coln.int64) := lapply(.SD, as.numeric), .SDcols = coln.int64]
