[R-Forge #1841] Implement a fast droplevels.data.table method. #647

arunsrinivasan · 2014-06-08T13:17:05Z

Submitted by: Steve Lianoglou; Assigned to: Nobody; R-Forge link

The default droplevels.data.frame function is invoked on a call to droplevels on a data.table. This results in borken output from droplevels, eg. as reported by pchalasani on the ML:

d <- data.table(name = c('a','b','c'), value = 1:3)
dt <- data.table(d)
setkey(dt,'name')
dt1 <- subset(dt,name != 'a') # or dt1 <- dt[ name != 'a' ]

dt1
name value
[1,] b 2
[2,] c 3

droplevels(dt1)
name value
[1,] b 1
[2,] c 3

jangorecki · 2015-12-09T16:33:53Z

Code in the above post is not reproducible any more.

Regarding data.table method... I've made one based on droplevels.data.frame method.
Extended for in.place update, default to FALSE as it is not a set* function

it is not a fast one, as the factor() is the bottleneck. Probably it could be replaced with faster unique to detect levels and construct factor manually. If you are willing to merge below one I can make doc and prepare PR.

library(data.table)
droplevels.data.table <- function(x, except = NULL, in.place = FALSE, ...){
    stopifnot(length(x) > 0L, is.logical(in.place))
    ix = vapply(x, is.factor, NA)
    if(!is.null(except)){
        stopifnot(is.numeric(except), except <= length(x))
        ix[except] = FALSE
    }
    if(!sum(ix)) return(x)
    if(!in.place) x = copy(x)
    for(nx in names(ix)[ix==TRUE]){
        set(x, i = NULL, j = nx, value = factor(x[[nx]]))
    }
    return(x)
}

update: Confirmed by @arunsrinivasan it can be improved with internal fast as_factor function. Maybe it is a matter of replacing factor to as_factor in the above function, not sure yet.

ben-schwen · 2021-08-07T17:11:28Z

I think we could cut this even further by not using as_factor which operates on strings but using a version which directly operates on the integer representation of factors by using tabulate instead.

# for existing factors
drop_levels = function(x) {
  lev = which(tabulate(x, length(levels(x)))>0)
  ans = match(as.integer(x), lev)
  setattr(ans, 'levels', levels(x)[lev])
  setattr(ans, 'class', 'factor')
}

as_factor = data.table:::as_factor

set.seed(373)
N=1e7

x=as_factor(paste0("id",1:N))

y1 = sample(x, 1e6)
c1 = as.character(y1)

system.time(a1<-factor(y1))
# system.time(a1<-factor(y1))
#    user  system elapsed 
#   8.178   0.272   8.451
system.time(a2<-droplevels(y1))
# system.time(a2<-droplevels(y1))
#    user  system elapsed 
#   6.975   0.200   7.176 
system.time(a3<-as_factor(c1))
# system.time(a3<-as_factor(c1))
#    user  system elapsed 
#   2.148   0.004   1.208
system.time(a4<-drop_levels(y1))
# system.time(a4<-drop_levels(y1))
#    user  system elapsed 
#   0.145   0.008   0.154 
identical(a1, a2, a3, a4)
# identical(a1, a2, a3, a4)
# [1] TRUE

ColeMiller1 · 2021-08-13T22:58:35Z

@ben-schwen amazing! That could be proposed to R Core, right? With only slight modifications to use levels(ans) <- levels(x)[lev]; class(ans) <- 'factor' . Also, since ans is newly allocated, it probably would be a similar speed as the setattr(...) method for newish versions of R.

ben-schwen · 2021-08-22T20:27:58Z

@ColeMiller1 we could, but I'm not convinced it has a good chance to really make it into base, since speed is not the main factor.

jangorecki added this to the 1.12.4 milestone Jan 29, 2019

jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

ben-schwen mentioned this issue Aug 27, 2021

fast droplevels.data.table #5116

Merged

5 tasks

mattdowle closed this as completed in #5116 Aug 30, 2021

mattdowle mentioned this issue Oct 5, 2021

droplevels for empty table and ordered factors #5185

Merged

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R-Forge #1841] Implement a fast droplevels.data.table method. #647

[R-Forge #1841] Implement a fast droplevels.data.table method. #647

arunsrinivasan commented Jun 8, 2014

jangorecki commented Dec 9, 2015

ben-schwen commented Aug 7, 2021

ColeMiller1 commented Aug 13, 2021

ben-schwen commented Aug 22, 2021

[R-Forge #1841] Implement a fast droplevels.data.table method. #647

[R-Forge #1841] Implement a fast droplevels.data.table method. #647

Comments

arunsrinivasan commented Jun 8, 2014

jangorecki commented Dec 9, 2015

ben-schwen commented Aug 7, 2021

ColeMiller1 commented Aug 13, 2021

ben-schwen commented Aug 22, 2021