Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dplyr::arrange] interfering with data.table's auto-indexing #259

Closed
Waldi73 opened this issue Jun 12, 2021 · 2 comments
Closed

[dplyr::arrange] interfering with data.table's auto-indexing #259

Waldi73 opened this issue Jun 12, 2021 · 2 comments

Comments

@Waldi73
Copy link

Waldi73 commented Jun 12, 2021

This is a follow-up on this StackOverflow question/answer.
The issue is documented in data.table/issues/5042, and this is a cross-reference because data.table team suggested there might an issue with dplyr as well, in the way the indexes are reset.
dplyr::arrange seems to interfere with auto-indexing in data.table leading to unexpected wrong results.

MRE :

library(dplyr); 
library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE) # Default
DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table uses the old index where row were rearranged:
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# this works because (...) prevents the parser to use auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC
@romainfrancois romainfrancois transferred this issue from tidyverse/dplyr Jun 16, 2021
@mgirlich
Copy link
Collaborator

mgirlich commented Jul 1, 2021

@romainfrancois This is not really a dtplyr issue as it works correctly as soon as dtplyr is loaded

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(data.table, warn.conflicts = FALSE)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE) # Default
DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.048s elapsed (0.048s cpu) 
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table uses the old index where row were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0.001 sec
#> 0.045s elapsed (0.045s cpu) 
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

# this works because (...) prevents the parser to use auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

Created on 2021-07-01 by the reprex package (v2.0.0)

@markfairbanks
Copy link
Collaborator

Closing as this was fixed on the data.table side. (Also as @mgirlich mentioned it was a dplyr issue, not a dtplyr one)

Rdatatable/data.table#5042

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants