keys are wrong/don't update if column names aren't unique #4888

Open
magerton opened this issue Feb 3, 2021 · 0 comments

magerton commented Feb 3, 2021

When setkey is called on a data.table whose columns have identical names, and those columns are later renamed with setnames, the key does not appear to update to the new names.
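A quick way to spot this state is to compare the key against the current column names. The helper below is a minimal sketch in base R; `stale_key_cols` is a hypothetical name, not a data.table function, and the inputs are the key and column names reported in MWE 1 of this issue.

```r
# Hypothetical helper (not part of data.table): returns the key columns
# that no longer exist among the table's column names.
stale_key_cols <- function(key_cols, col_names) {
  setdiff(key_cols, col_names)
}

# Values reported in MWE 1 after renaming: key is c("z", "x"),
# while the columns are named c("y", "z").
stale <- stale_key_cols(c("z", "x"), c("y", "z"))
print(stale)
```

On a real table this would presumably be called as stale_key_cols(key(dt), names(dt)); a non-empty result suggests the key needs to be reset with setkey.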

As a result, if you cross-join the row IDs of a dataset and then want to update the cross-join with additional attributes from the original data, you have to (a) reset the key, (b) build the cross-join with CJ(..., sorted=F), or (c) use base::merge.data.frame() to get the merge to work (see MWE 2).
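Another option, not one of the workarounds listed above, is to avoid the duplicate names altogether by naming the arguments to CJ. The sketch below assumes that named arguments become the column names and the key of the result; the columns `x_1` and `x_2` are illustrative names introduced here.

```r
library(data.table)

nobs <- 4
dat <- data.table(id = 1:nobs, x = runif(nobs))

# Name the CJ arguments so the columns (and the key) are unique from the start.
cj <- CJ(id_1 = dat$id, id_2 = dat$id)

# Update joins: attach x from dat for each side of the pair.
cj[dat, on = .(id_1 = id), x_1 := i.x]
cj[dat, on = .(id_2 = id), x_2 := i.x]

print(key(cj))
print(cj)
```

Because the names are unique before the key is set, no rename is needed afterwards, so the key never has to be updated.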

MWE 1 is a silly example to show the key/name issue. I think that it might be what drives the errors in MWE 2, which is based on the issue I ran into today.

I'm running data.table version 1.13.6

MWE 1

library(data.table)

jnk <- data.table(x=1:3, x=4:6)
setkey(jnk, x, x)
setnames(jnk, c("y","z"))
all(key(jnk) %in% c("y","z")) # key(jnk) = c("z", "x")... but there is no "x" anymore
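For comparison, here is a sketch of the same steps with the names made unique before the key is set. It uses base R's make.unique() for the intermediate names, which is just one possible choice; any unique names should do.

```r
library(data.table)

jnk <- data.table(x = 1:3, x = 4:6)

# Detect the duplicate column names before keying.
stopifnot(anyDuplicated(names(jnk)) > 0)

# Give the columns unique names first ("x", "x.1"), then set the key.
setnames(jnk, make.unique(names(jnk)))
setkeyv(jnk, names(jnk))

# With unique names in place, renaming should carry the key along.
setnames(jnk, c("y", "z"))
print(key(jnk))
```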

MWE 2

library(data.table)

nobs = 4
dat = data.table(id = 1:nobs, x = runif(nobs))

cj_sort <- with(dat, CJ(id, id, sorted=TRUE))  # sorted; no fix applied to this one
cj_srt2 <- with(dat, CJ(id, id, sorted=TRUE))  # sorted; re-keyed below, works
cj_unst <- with(dat, CJ(id, id, sorted=FALSE)) # unsorted; keyed below, works b/c we set the key?

# set colnames to be unique
setnames(cj_sort, c("id_1", "id_2"))
setnames(cj_unst, c("id_1", "id_2"))
setnames(cj_srt2, c("id_1", "id_2"))

# fixes the issue
setkey(cj_unst, id_1, id_2)  # key unsorted data to fix
setkey(cj_srt2, id_1, id_2)   # re-key sorted data to fix

stopifnot(key(cj_sort) == c("id_1", "id_2")) # broken, keys are c("id_2","id")
stopifnot(key(cj_unst) == c("id_1", "id_2")) # ok
stopifnot(key(cj_srt2) == c("id_1", "id_2")) # ok

stopifnot(cj_sort[i = dat, on = .(id_1 = id), .N] == nobs^2)  # ok
stopifnot(cj_sort[i = dat, on = .(id_2 = id), .N] == nobs^2)  # broken - won't merge to nobs^2 rows

stopifnot(cj_unst[i = dat, on = .(id_1 = id), .N] == nobs^2)  # ok
stopifnot(cj_unst[i = dat, on = .(id_2 = id), .N] == nobs^2)  # ok - works b/c of setkey?

stopifnot(cj_srt2[i = dat, on = .(id_1 = id), .N] == nobs^2)  # ok
stopifnot(cj_srt2[i = dat, on = .(id_2 = id), .N] == nobs^2)  # ok - works b/c of setkey() workaround?

# data.table::merge
stopifnot(nrow(merge.data.table(cj_sort, dat, by.x="id_2", by.y="id")) == nobs^2) # broken
stopifnot(nrow(merge.data.table(cj_unst, dat, by.x="id_2", by.y="id")) == nobs^2) # ok
stopifnot(nrow(merge.data.table(cj_srt2, dat, by.x="id_2", by.y="id")) == nobs^2) # ok

# base::merge works, even though data.table::merge doesn't
stopifnot(nrow(merge.data.frame(cj_sort, dat, by.x="id_2", by.y="id")) == nobs^2) # workaround: use base::merge

Output of sessionInfo()

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.6

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2   
> 

 
