[R-Forge #5369] Implement integer64 grouping/unique etc or options(datatable.tolerance=0) or both #342

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 0 comments

Submitted by: James Sams; Assigned to: Nobody; R-Forge link

TL;DR: dim(unique(..., by=c(A, B))) reports MORE rows than dim(unique(..., by=c(A, B, C))). Affects duplicated() and merge(). I see this in 1.8.11, not 1.8.10.

I discovered this when a merge that had been working stopped working because it believed itself to be a cartesian join, so the affected code is used by merge() as well. However, I think the problem is clearer with unique(). I have a data.table with 3 columns (double, integer, integer). The double column is read by fread as integer64, but I've found integer64 to be unreliable, so I stick to double/numeric. The values are up to 12 digits, all positive, and always integral. I've also reproduced this problem after coercing the other columns to double, and after reading with read.delim and coercing to data.table.
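
As a sanity check on the claim that double is safe for these values: a double represents every integer up to 2^53 (about 9.0e15) exactly, so 12-digit integral values are stored without rounding and compare exactly with `==`. The sketch below uses made-up values (not taken from the data) and base R only. It also shows that a tolerance-based comparison such as `all.equal()` treats two adjacent 12-digit values as equal, which is the kind of behaviour the `datatable.tolerance` request in the title is about. Whether that is the actual cause here is an assumption, not something established in this report.

```r
# Hypothetical 12-digit values, not taken from the real data
a <- 100000000000
b <- 100000000001

# Both are integral and far below 2^53, so they are stored exactly as doubles
exact_limit <- 2^53
stored_exactly <- a < exact_limit && b < exact_limit &&
  a == round(a) && b == round(b)

# Exact comparison distinguishes the two values
exact_equal <- a == b

# Tolerance-based comparison (relative, default tolerance about 1.5e-8) does not
tolerant_equal <- isTRUE(all.equal(a, b))

print(c(stored_exactly = stored_exactly,
        exact_equal    = exact_equal,
        tolerant_equal = tolerant_equal))
```

If grouping compared doubles with a relative tolerance like this, distinct 12-digit keys could collapse into one group, which is why a zero-tolerance option would matter for data of this shape.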

sapply(DT, class)
#        upc upc_ver_uc panel_year 
#  "numeric"  "integer"  "integer"
# 
str(DT)
# Classes ‘data.table’ and 'data.frame':  779473 obs. of  3 variables:
# <censored>
#  - attr(*, ".internal.selfref")=<externalptr> 
#  - attr(*, "sorted")= chr  "upc" "panel_year"
# 
dim(DT)
# [1] 779473      3
key(DT)
# [1] "upc"        "panel_year"
dim(unique(DT))
# [1] 779473      3
dim(unique(DT, by=c("upc", "panel_year")))
# [1] 779473      3

THIS is where things go wrong. Notice that adding a column to by REDUCES the number of unique rows:

dim(unique(DT, by=c("upc", "upc_ver_uc", "panel_year")))
# [1] 725228      3
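
For reference, the expected invariant is that adding a column to `by` can only keep or increase the number of distinct rows, never reduce it. A minimal sketch of that check on a toy table, using base R `duplicated()` on a data.frame rather than data.table (the helper `n_distinct_by` and the toy values are hypothetical):

```r
# Toy data with the same column names as the report; values are made up
df <- data.frame(
  upc        = c(100000000001, 100000000001, 100000000002, 100000000003),
  upc_ver_uc = c(1L, 2L, 1L, 1L),
  panel_year = c(2004L, 2004L, 2004L, 2005L)
)

# Hypothetical helper: number of distinct rows over the given columns
n_distinct_by <- function(d, cols) {
  sum(!duplicated(d[, cols, drop = FALSE]))
}

n_two   <- n_distinct_by(df, c("upc", "panel_year"))
n_three <- n_distinct_by(df, c("upc", "upc_ver_uc", "panel_year"))

# A superset of columns must give at least as many distinct rows
stopifnot(n_three >= n_two)
```

The output above violates this invariant (725228 < 779473), so it cannot be explained by the data alone.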

There are no NAs or similar values in the data:

DT[,list(sum(is.na(upc), is.na(upc_ver_uc), is.na(panel_year)))]
#    V1
#1:  0
DT[,list(sum(is.nan(upc), is.nan(upc_ver_uc), is.nan(panel_year)))]
#    V1
#1:  0
DT[,list(sum(is.null(upc), is.null(upc_ver_uc), is.null(panel_year)))]
#    V1
#1:  0
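
A more compact way to run the same check, sketched on a toy table with base R (the values are made up). Note that `is.na()` is already TRUE for NaN, and `is.null()` on an existing column returns a single FALSE rather than a per-element result, so the third check above cannot detect anything.

```r
# Toy data with deliberately missing values; not taken from the real data
df <- data.frame(
  upc        = c(100000000001, 100000000002, NA),
  upc_ver_uc = c(1L, NA, 1L),
  panel_year = c(2004L, 2005L, 2005L)
)

# Missing values per column (is.na() also counts NaN)
na_counts <- colSums(is.na(df))
print(na_counts)

# is.null() returns a single FALSE for a column that exists
null_check <- is.null(df$upc)
```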