
Issue a warning when converting DF to DT and cols have large whole numbers (numeric types) to convert to bit64 with reference to setNumericRounding #1463

Closed
yvanrichard opened this issue Dec 8, 2015 · 7 comments

@yvanrichard

The result from dt[, .N, by = x] can be wrong when dt$x contains large whole numbers stored as numeric (double).
That bit me today, and I was surprised not to get the same counts as with table(dt$x).

Example:

library(data.table)
# Get example data (read.csv() stores the large Num_Acc IDs as numeric/double)
base <- read.csv("https://www.data.gouv.fr/s/resources/base-de-donnees-accidents-corporels-de-la-circulation-sur-6-annees/20150806-153355/vehicules_2014.csv")
dt <- data.table(base)

Now get the number of occurrences of each Num_Acc:

The table() version, head(table(dt$Num_Acc)), returns the expected counts:

201400000001 201400000002 201400000003 201400000004 201400000005 201400000006 
           2            1            2            2            2            2

But the data.table count version, head(dt[, .N, by = Num_Acc]), returns:

        Num_Acc N
1: 201400000001 3
2: 201400000003 4
3: 201400000005 4
4: 201400000007 2
5: 201400000009 3
6: 201400000011 4

In the latter version, the sum of all N equals the number of rows in dt, which is correct, but the even-numbered Num_Acc values appear to have been aggregated with the odd-numbered ones, which is not.

It definitely has something to do with large numbers, since the right answer comes back when Num_Acc is converted to character, or shifted into a smaller range (e.g. by subtracting 201400000000).
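
For reference, a minimal sketch of the two workarounds mentioned above (same dt as in the example; the acc name is just an illustrative label):

# group on a character key instead of the numeric one
dt[, .N, by = as.character(Num_Acc)]

# or shift the values into a smaller range before grouping
dt[, .N, by = .(acc = Num_Acc - 201400000000)]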

Is it possible to make it right?...


sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C               LC_TIME=en_NZ.UTF-8       
 [4] LC_COLLATE=en_NZ.UTF-8     LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8   
 [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] chron_2.3-47
@franknarf1
Contributor

I tried using fread (data.table's file reader), which prompted me to install the bit64 package. With that, no problem:

library(data.table)
install.packages('bit64')
# apparently it is not necessary to attach bit64 explicitly

DT = fread("https://www.data.gouv.fr/s/resources/base-de-donnees-accidents-corporels-de-la-circulation-sur-6-annees/20150806-153355/vehicules_2014.csv")

head(DT[, .N, by = Num_Acc])
#         Num_Acc N
# 1: 201400000001 2
# 2: 201400000002 1
# 3: 201400000003 2
# 4: 201400000004 2 
# 5: 201400000005 2
# 6: 201400000006 2

@arunsrinivasan
Member

See ?setNumericRounding. You'll have to call setNumericRounding(0L) to disable the rounding. But we recommend using integer64 from the bit64 package for such cases.
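
A minimal sketch of both suggestions, assuming the dt built from read.csv above and that bit64 is installed:

setNumericRounding(0L)                          # disable rounding of numeric keys when grouping/joining
head(dt[, .N, by = Num_Acc])                    # counts now match table(dt$Num_Acc)

# or convert the column to integer64 and leave the rounding setting alone
dt[, Num_Acc := bit64::as.integer64(Num_Acc)]
head(dt[, .N, by = Num_Acc])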

Feel free to reopen if that doesn't fix the issue.

There's a feature request (FR) under vignettes to explain numeric rounding.

@yvanrichard
Author

Ah, thank you. Although I have been using data.tables for quite some time, I had never come across setNumericRounding or integer64. Since large integers are quite common, often being used as IDs, shouldn't a warning be issued when converting a data.frame to a data.table in the presence of large integers? It's an easy trap, especially when base methods return the right answer. Now I just hope this didn't introduce mistakes into previous projects... Working collaboratively, I often have to convert data frames to data tables on the fly, with the data read and processed as data frames before my part of the code.
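
As a rough illustration of the kind of check being asked for (purely a sketch; the helper name and logic are hypothetical, not data.table API), the conversion could warn with something like:

# hypothetical helper: warn when a numeric column holds whole numbers larger than
# .Machine$integer.max, since grouping/joining on such doubles is where rounding surprises appear
warn_large_whole_numbers <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.double(x) && all(x == trunc(x), na.rm = TRUE) &&
        any(abs(x) > .Machine$integer.max, na.rm = TRUE)) {
      warning("column '", nm, "' contains large whole numbers stored as numeric; ",
              "consider bit64::as.integer64() or see ?setNumericRounding")
    }
  }
  invisible(df)
}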

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Dec 9, 2015
@arunsrinivasan arunsrinivasan changed the title Different result between table(dt$x) and dt[, .N, by = x] Issue a warning when converting DF to DT and cols have large whole numbers (numeric types) to convert to bit64 with reference to setNumericRounding Dec 9, 2015
@arunsrinivasan arunsrinivasan reopened this Dec 9, 2015
@arunsrinivasan
Member

Good point. We should be able to do that.

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Apr 10, 2016
arunsrinivasan added a commit that referenced this issue Jul 21, 2016
…oining by default, i.e., no rounding.
@arunsrinivasan
Member

The default is not to do any rounding, for now.

@mattdowle
Member

Reopened to look at along with #485 and #1642.

@mattdowle mattdowle reopened this Nov 23, 2016
@mattdowle mattdowle self-assigned this Nov 23, 2016
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico
Member

If OP had used fread() directly on the URL, the columns would have been read as integer64 to begin with. getNumericRounding() is 0 by default, so the problem also shouldn't happen when converting data.frame -> data.table, even if the column is left as numeric:

dt[, .N, by = as.numeric(Num_Acc)]
         as.numeric     N
              <num> <int>
    1: 201400000001     2
    2: 201400000002     1
    3: 201400000003     2
    4: 201400000004     2
    5: 201400000005     2
   ---                   
59850: 201400059850     1
59851: 201400059851     1
59852: 201400059852     1
59853: 201400059853     1
59854: 201400059854     3
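
A quick sketch to verify both points (assumes bit64 is installed and DT is the fread result from the earlier comment):

class(DT$Num_Acc)       # "integer64" when bit64 is installed
getNumericRounding()    # 0, i.e. no rounding of numeric keys by default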

I think we can close here.
