Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-Forge #1841] Implement a fast droplevels.data.table method. #647

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 4 comments · Fixed by #5116
Closed

[R-Forge #1841] Implement a fast droplevels.data.table method. #647

arunsrinivasan opened this issue Jun 8, 2014 · 4 comments · Fixed by #5116

Comments

@arunsrinivasan
Copy link
Member

Submitted by: Steve Lianoglou; Assigned to: Nobody; R-Forge link

The default droplevels.data.frame function is invoked on a call to droplevels on a data.table. This results in borken output from droplevels, eg. as reported by pchalasani on the ML:

d <- data.table(name = c('a','b','c'), value = 1:3)
dt <- data.table(d)
setkey(dt,'name')
dt1 <- subset(dt,name != 'a') # or dt1 <- dt[ name != 'a' ]

dt1
name value
[1,] b 2
[2,] c 3

droplevels(dt1)
name value
[1,] b 1
[2,] c 3

@jangorecki
Copy link
Member

Code in the above post is not reproducible any more.

Regarding data.table method... I've made one based on droplevels.data.frame method.
Extended for in.place update, default to FALSE as it is not a set* function

it is not a fast one, as the factor() is the bottleneck. Probably it could be replaced with faster unique to detect levels and construct factor manually. If you are willing to merge below one I can make doc and prepare PR.

library(data.table)
droplevels.data.table <- function(x, except = NULL, in.place = FALSE, ...){
    stopifnot(length(x) > 0L, is.logical(in.place))
    ix = vapply(x, is.factor, NA)
    if(!is.null(except)){
        stopifnot(is.numeric(except), except <= length(x))
        ix[except] = FALSE
    }
    if(!sum(ix)) return(x)
    if(!in.place) x = copy(x)
    for(nx in names(ix)[ix==TRUE]){
        set(x, i = NULL, j = nx, value = factor(x[[nx]]))
    }
    return(x)
}

update: Confirmed by @arunsrinivasan it can be improved with internal fast as_factor function. Maybe it is a matter of replacing factor to as_factor in the above function, not sure yet.

@jangorecki jangorecki added this to the 1.12.4 milestone Jan 29, 2019
@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@ben-schwen
Copy link
Member

I think we could cut this even further by not using as_factor which operates on strings but using a version which directly operates on the integer representation of factors by using tabulate instead.

# for existing factors
drop_levels = function(x) {
  lev = which(tabulate(x, length(levels(x)))>0)
  ans = match(as.integer(x), lev)
  setattr(ans, 'levels', levels(x)[lev])
  setattr(ans, 'class', 'factor')
}

as_factor = data.table:::as_factor

set.seed(373)
N=1e7

x=as_factor(paste0("id",1:N))

y1 = sample(x, 1e6)
c1 = as.character(y1)

system.time(a1<-factor(y1))
# system.time(a1<-factor(y1))
#    user  system elapsed 
#   8.178   0.272   8.451
system.time(a2<-droplevels(y1))
# system.time(a2<-droplevels(y1))
#    user  system elapsed 
#   6.975   0.200   7.176 
system.time(a3<-as_factor(c1))
# system.time(a3<-as_factor(c1))
#    user  system elapsed 
#   2.148   0.004   1.208
system.time(a4<-drop_levels(y1))
# system.time(a4<-drop_levels(y1))
#    user  system elapsed 
#   0.145   0.008   0.154 
identical(a1, a2, a3, a4)
# identical(a1, a2, a3, a4)
# [1] TRUE

@ColeMiller1
Copy link
Contributor

@ben-schwen amazing! That could be proposed to R Core, right? With only slight modifications to use levels(ans) <- levels(x)[lev]; class(ans) <- 'factor' . Also, since ans is newly allocated, it probably would be a similar speed as the setattr(...) method for newish versions of R.

@ben-schwen
Copy link
Member

@ColeMiller1 we could, but I'm not convinced it has a good chance to really make it into base, since speed is not the main factor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants