
Issue a warning when converting DF to DT and cols have large whole numbers (numeric types) to convert to bit64 with reference to setNumericRounding #1463

Closed
yvanrichard opened this issue Dec 8, 2015 · 7 comments

@yvanrichard

The result from dt[, .N, by = x] can be wrong when dt$x contains large whole numbers stored as numeric (double).
That bit me today, and I was surprised not to get the same counts as with table(dt$x).

Example:

library(data.table)
# Get example data (read.csv() stores the large Num_Acc IDs as numeric/double)
base <- read.csv("https://www.data.gouv.fr/s/resources/base-de-donnees-accidents-corporels-de-la-circulation-sur-6-annees/20150806-153355/vehicules_2014.csv")
dt <- data.table(base)

Now get the number of occurrences of each Num_Acc:

The table() version, head(table(dt$Num_Acc)), returns the expected counts:

201400000001 201400000002 201400000003 201400000004 201400000005 201400000006 
           2            1            2            2            2            2

But the data.table count version, head(dt[, .N, by = Num_Acc]), returns:

        Num_Acc N
1: 201400000001 3
2: 201400000003 4
3: 201400000005 4
4: 201400000007 2
5: 201400000009 3
6: 201400000011 4

In the latter version, the sum of all N equals the number of rows in dt, which is correct, but the even-numbered Num_Acc values appear to have been aggregated with the odd-numbered ones, which is not.

It definitely has something to do with large numbers, since the right answer comes back when Num_Acc is converted to character, or shifted into a smaller range (e.g. by subtracting 201400000000).
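
For reference, a minimal sketch of the two workarounds mentioned above (same dt as in the example; the acc name is just an illustrative label):

# group on a character key instead of the numeric one
dt[, .N, by = as.character(Num_Acc)]

# or shift the values into a smaller range before grouping
dt[, .N, by = .(acc = Num_Acc - 201400000000)]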

Is it possible to make it right?...


sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C               LC_TIME=en_NZ.UTF-8       
 [4] LC_COLLATE=en_NZ.UTF-8     LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8   
 [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] chron_2.3-47
@franknarf1
Contributor

I tried using fread (data.table's file reader), which prompted me to install the bit64 package. With that, no problem:

library(data.table)
install.packages('bit64')
# apparently it is not necessary to attach bit64 explicitly

DT = fread("https://www.data.gouv.fr/s/resources/base-de-donnees-accidents-corporels-de-la-circulation-sur-6-annees/20150806-153355/vehicules_2014.csv")

head(DT[, .N, by = Num_Acc])
#         Num_Acc N
# 1: 201400000001 2
# 2: 201400000002 1
# 3: 201400000003 2
# 4: 201400000004 2 
# 5: 201400000005 2
# 6: 201400000006 2

@arunsrinivasan
Member

See ?setNumericRounding. You'll have to call setNumericRounding(0L) to disable the rounding. But we recommend using integer64 from the bit64 package for such cases.
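
A minimal sketch of both suggestions, assuming the dt built from read.csv above and that bit64 is installed:

setNumericRounding(0L)                          # disable rounding of numeric keys when grouping/joining
head(dt[, .N, by = Num_Acc])                    # counts now match table(dt$Num_Acc)

# or convert the column to integer64 and leave the rounding setting alone
dt[, Num_Acc := bit64::as.integer64(Num_Acc)]
head(dt[, .N, by = Num_Acc])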

Feel free to reopen if that doesn't fix the issue.

There's a feature request (FR) under vignettes to explain numeric rounding.

@yvanrichard
Author

Ah, thank you. Although I have been using data.tables for quite some time, I had never come across setNumericRounding or integer64. Since large integers are quite common, often being used as IDs, shouldn't a warning be issued when converting a data.frame to a data.table in the presence of large integers? It's an easy trap, especially when base methods return the right answer. Now I just hope this didn't introduce mistakes into previous projects... Working collaboratively, I often have to convert data frames to data tables on the fly, with the data read and processed as data frames before my part of the code.
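
As a rough illustration of the kind of check being asked for (purely a sketch; the helper name and logic are hypothetical, not data.table API), the conversion could warn with something like:

# hypothetical helper: warn when a numeric column holds whole numbers larger than
# .Machine$integer.max, since grouping/joining on such doubles is where rounding surprises appear
warn_large_whole_numbers <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.double(x) && all(x == trunc(x), na.rm = TRUE) &&
        any(abs(x) > .Machine$integer.max, na.rm = TRUE)) {
      warning("column '", nm, "' contains large whole numbers stored as numeric; ",
              "consider bit64::as.integer64() or see ?setNumericRounding")
    }
  }
  invisible(df)
}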

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Dec 9, 2015
@arunsrinivasan arunsrinivasan changed the title Different result between table(dt$x) and dt[, .N, by = x] Issue a warning when converting DF to DT and cols have large whole numbers (numeric types) to convert to bit64 with reference to setNumericRounding Dec 9, 2015
@arunsrinivasan arunsrinivasan reopened this Dec 9, 2015
@arunsrinivasan
Member

Good point. We should be able to do that.

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Apr 10, 2016
arunsrinivasan added a commit that referenced this issue Jul 21, 2016
…oining by default, i.e., no rounding.
@arunsrinivasan
Member

The default is not to do any rounding, for now.

@mattdowle
Member

Reopened to look at along with #485 and #1642.

@mattdowle mattdowle reopened this Nov 23, 2016
@mattdowle mattdowle self-assigned this Nov 23, 2016
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico
Member

If OP had used fread() directly on the URL, the columns would have been read as integer64 to begin with. getNumericRounding() is 0 by default, so the problem also shouldn't happen when converting data.frame -> data.table, even if the column is left as numeric:

dt[, .N, by = as.numeric(Num_Acc)]
         as.numeric     N
              <num> <int>
    1: 201400000001     2
    2: 201400000002     1
    3: 201400000003     2
    4: 201400000004     2
    5: 201400000005     2
   ---                   
59850: 201400059850     1
59851: 201400059851     1
59852: 201400059852     1
59853: 201400059853     1
59854: 201400059854     3
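
A quick sketch to verify both points (assumes bit64 is installed and DT is the fread result from the earlier comment):

class(DT$Num_Acc)       # "integer64" when bit64 is installed
getNumericRounding()    # 0, i.e. no rounding of numeric keys by default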

I think we can close here.
