Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dplyr::arrange interfering with data.table's auto-indexing #5042

Closed
Waldi73 opened this issue Jun 11, 2021 · 11 comments · Fixed by #5084
Closed

dplyr::arrange interfering with data.table's auto-indexing #5042

Waldi73 opened this issue Jun 11, 2021 · 11 comments · Fixed by #5084
Milestone

Comments

@Waldi73
Copy link

Waldi73 commented Jun 11, 2021

Minimal reproducible example

This is a follow-up on this StackOverflow question/answer.
dplyr::arrange seems to interfere with auto-indexing, leading to unexpected wrong results.
A work-around is to disable auto-indexing with options(datatable.auto.index = FALSE).
As this issue is quite tricky, a warning, a mention to it in FAQ or a protection against this happening would be useful!

library(dplyr); 
library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE) # Default
DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table uses the old index where row were rearranged:
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# this works because (...) prevents the parser to use auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

Output of sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.0 dplyr_1.0.6   
@MichaelChirico
Copy link
Member

could you report this with dplyr as well? at first glance it seems like a bug on that end

@avimallu
Copy link
Contributor

avimallu commented Jun 12, 2021

I think it might be related somewhat to #4889, and not fixed by #4893. dplyr, AFAIK works with either tibbles or data.frames, so a data.table's index attribute would be ignored during that conversion but retained, leading to an incorrect result. I took a quick look at as.data.table.data.frame, which I believe would be called here, and it doesn't look like it removed the index:

require(data.table)
mtcars = as.data.table(mtcars)
setindex(mtcars, mpg)
mtcars2 = as.data.frame(mtcars)
attributes(mtcars2)$index

The last row returns an output:

integer(0)
attr(,"__mpg")
 [1] 15 16 24  7 17 31 14 23 22 29 12 13 11  6  5 10 25 30  1  2  4 32 21  3  9  8 27 26 19 28 18 20

which would remain unchanged in an operation such as dplyr::arrange followed by a as.data.table. Source codes for easy reference:

as.data.frame.data.table
function (x, ...) 
{
    ans = copy(x)
    setattr(ans, "row.names", .set_row_names(nrow(x)))
    setattr(ans, "class", "data.frame")
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}
<bytecode: 0x00000256511304d0>
<environment: namespace:data.table>
as.data.table.data.frame
function (x, keep.rownames = FALSE, key = NULL, ...) 
{
    if (!identical(keep.rownames, FALSE)) {
        ans = data.table(rn = rownames(x), x, keep.rownames = FALSE, 
            key = key)
        if (is.character(keep.rownames)) 
            setnames(ans, "rn", keep.rownames[1L])
        return(ans)
    }
    if (any(vapply_1i(x, function(xi) length(dim(xi))))) {
        return(as.data.table.list(as.list(x), keep.rownames = keep.rownames, 
            ...))
    }
    ans = copy(x)
    setattr(ans, "row.names", .set_row_names(nrow(x)))
    setattr(ans, "class", .resetclass(x, "data.frame"))
    setalloccol(ans)
    setkeyv(ans, key)
    ans
}
<bytecode: 0x0000025651853d60>
<environment: namespace:data.table>

@jangorecki
Copy link
Member

I don't have dplyr on hand to try that out, but as said by @avimallu, as.data.frame should remove index. I pushed that change to #5043 Let me know if it does fix the problem so I will add NEWS entry.

@tlapak
Copy link
Contributor

tlapak commented Jun 12, 2021

As far as I can tell, dplyr explicitly preserves all attributes of the input. There needs to be a dplyr_row_slice.data.table function to accommodate the attributes that depend on row order, see
https://github.com/tidyverse/dplyr/blob/55917b6194aa8ba5b0ad8b2109ac746193996c96/R/generics.R#L25-L47

@Waldi73
Copy link
Author

Waldi73 commented Jun 12, 2021

Thanks all for analyzing this!
@MichaelChirico , reported on dplyr/issues/5917 as well

@tlapak
Copy link
Contributor

tlapak commented Jun 14, 2021

@jangorecki I've tested your patch in the meantime and as I expected it does not solve the problem. One possibility (with probably minimal future maintainence, but that tends to be wishful thinking) would be to clear the attributes in as.data.table and setDT but I'm not really a fan of that.

The following would fix the problem as well as obviating the need to call as.data.table at all here:

dplyr_row_slice.data.table <- function(data, i, ...) {
  ans <- NextMethod()
  setattr(ans,"sorted",NULL)
  setattr(ans,"index",NULL)
  setalloccol(ans)
  return(ans)
}

but that could start to get hairy as to who has to maintain it and beg the question of how much more should be done to make dplyr play nicely with data.table.

@avimallu
Copy link
Contributor

avimallu commented Jun 14, 2021

Wouldn't getting as.data.table and setDT to recognize an index attribute and updating it at the time of function call work that doesn't depend on other packages and have much maintenance overhead? I don't know much about the source code for data.table, but I do remember reading discussions about a computationally efficient manner of checking if a vector is already ordered that is in use right now.

A potential problem is if a different package or the user specifically creates an index attribute for a data.frame or derived object, that conflicts with how data.table uses that attribute - a chance that is much less likely.

@hadley
Copy link
Contributor

hadley commented Jun 16, 2021

As @tlapak points out, dplyr defaults to preserving all attributes (since a reasonable number of people store arbitrary metadata in attributes and expect it to be maintained). As mentioned, dplyr_row_slice.data.table method would fix the problem, but it's unclear where it should live — since it affects the combination of dplyr and data.table, it feels odd to put it in either. I think it's slightly more correct for it to live in data.table (since y'all know best how to handle your own attributes), but would that require adding dplyr as a suggested package and a little work to conditionally register the method. Given that, I'd be happy to include the method in dplyr.

@tlapak, you'd be welcome to do a PR if you wanted, and it would be great to get a sense from other data.table developers what attributes you think we should preserve/drop.

@mattdowle mattdowle added this to the 1.14.1 milestone Jun 21, 2021
@mattdowle
Copy link
Member

mattdowle commented Jul 27, 2021

Thanks all. #4893 now merged too and I confirmed it doesn't fix this one either, as @avimallu and @tlapak expected.

The root of this issue with dplyr::arrange appears to be vctrs::vec_slice which dplyr:::dplyr_row_slice.data.frame calls.

> DT
    iso3c    country income
   <char>     <char> <char>
1:    MOZ Mozambique    LIC
2:    ZMB     Zambia   LMIC
3:    ALB    Albania   UMIC
> attr(DT, "index")
integer(0)
attr(,"__iso3c")
[1] 3 1 2
> 
> DT2 = vctrs::vec_slice(DT, c(3,1,2))
> 
> DT2
    iso3c    country income
   <char>     <char> <char>
1:    ALB    Albania   UMIC
2:    MOZ Mozambique    LIC
3:    ZMB     Zambia   LMIC
> attr(DT2, "index")
integer(0)
attr(,"__iso3c")
[1] 3 1 2    # index left intact and now incorrect
> class(DT2)
[1] "data.table" "data.frame"  # still a data.table

The intention is that [.data.table detects that dplyr and vctrs are not data.table-aware, and any data.frame-code (in this case [ on a data.table from those not-data-table-aware packages) gets redirected through the cedta() catch which drops the key and indices. That happens here :

if (!missing(i) && is.data.table(ans)) setkey(ans, NULL) # See test 304

where that setkey(ans, NULL) removes the index attribute too as well the key. Test 304 was moved to be the plyr::arrange test in other.Rraw which tests that :
if (loaded[["plyr"]]) {

But for dplyr::arrange (dplyr not plyr), vctrs::vec_slice is implemented in C :

> vctrs::vec_slice
function (x, i) 
{
    .Call(vctrs_slice, x, i)
}
<bytecode: 0x5617bbed46d8>
<environment: namespace:vctrs>

Since vctrs::vec_slice is doing its own subsetting at C level, it's bypassing [ method dispatch.

There's been some attention here to as.data.table apparently because there's two calls to as.data.table in the original reproducible example at the top :

DT <- distinct(DT) %>% as.data.table()
....
# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

But neither of those as.data.table() are needed. dplyr::distinct(DT) and dplyr::arrange(DT) already retain the data.table class so those as.data.table() calls just return their input. Hence, even if any potential fix was applied to as.data.table (for example like checking any index was valid and dropping it if not) wouldn't work if as.data.table() isn't used.

Even more minimal example :

require(data.table)
require(dplyr)
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT[A=="c"]   # correctly returns row 2
DT2 = dplyr::arrange(DT, A)
class(DT2)   # retains data.tabe class
DT2[A=="c"]  # incorrectly returns no rows
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.6       data.table_1.14.1

loaded via a namespace (and not attached):
 [1] fansi_0.4.2      utf8_1.2.1       crayon_1.4.1     R6_2.5.0        
 [5] lifecycle_1.0.0  magrittr_2.0.1   pillar_1.6.0     rlang_0.4.11    
 [9] vctrs_0.3.8      generics_0.1.0   ellipsis_0.3.2   glue_1.4.2      
[13] purrr_0.3.4      compiler_4.1.0   pkgconfig_2.0.3  tidyselect_1.1.1
[17] tibble_3.1.1    

@mattdowle
Copy link
Member

mattdowle commented Jul 27, 2021

The .internal.selfref is detecting the invalid status as intended :

require(data.table)
require(dplyr)
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT[A=="c"]   # correctly returns row 2
data.table:::selfrefok(DT)  # 1 i.e. ok
DT2 = dplyr::arrange(DT, A)
data.table:::selfrefok(DT2)  # 0 i.e. not ok
class(DT2)   # retains data.tabe class
DT2[A=="c", verbose=TRUE]  # incorrectly returns no rows
# ...
# ... using index
# ...

We shouldn't be using indexes when the selfref is not ok.

@mattdowle
Copy link
Member

mattdowle commented Jul 28, 2021

#5084 fixes this and adds a dplyr::arrange test to test.data.table("other.Rraw")

But now we know what we're looking for, I suspect there are other entry points that need catching too. setkey.R::getindex() could be a central place to drop index if selfref isn't ok. Would need to think of a test to trigger that without going through bmerge() which is now caught by this fix to shallow(). [.data.table could be another central place, at the top.

Current thoughts are that it's probably best to merge #5084 so that folk can test dev, and then follow up in a new issue on a more comprehensive review adding tests for dropping indexes when selfref detects the prior copy by another package.

The advantage of this approach over adding dplyr_row_slice.data.table is that it tackles the root more directly and thus would fix other packages that may now or in future do the same thing as vctrs::vec_slice at C level which bypasses [ dispatch. The selfref mechanism was introduced to catch base R copies after all, so it's probably possible to reproduce an invalid index using base R (like the test in #4893) rather than dplyr::arrange (and if so #5084+#5085 would fix that too). We're just using dplyr::arrange as the first known example for the test really. There may be other methods in dplyr that find their way to vctrs::vec_slice, and maybe there are other functions in vctrs that work similarly to vec_slice, or dplyr's method names might change in future. All that should be covered by the selfref approach, so we insulate ourselves from maintenance, and users from specific version incompatibilities.

That's the way I'm thinking currently, anyway. Happy to reconsider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants