Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sum & mean group by operation not optimised for factor types #2458

Open
xiaodaigh opened this issue Nov 2, 2017 · 9 comments
Open

sum & mean group by operation not optimised for factor types #2458

xiaodaigh opened this issue Nov 2, 2017 · 9 comments

Comments

@xiaodaigh
Copy link

xiaodaigh commented Nov 2, 2017

Seems like group by are not optimised where the by-variable is of factor type. Essentially one can perform group-by faster for factors given that it's represented by integers from 1 to n where n is the number of groups.

See Python implementation discussion http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and Julia implementation discussion https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell

Update - Benchmarks

Here are some benchmarks to show that the factors as group-by doesn't seem to receive extra optimisation.

So I ran two benchmarks one for group-by variable with 2.5million groups, and one for 100millions groups. I ran one where the group-by variable is an integer and one where the group-by variable is a factor. In both cases the run time did not get faster in the group by variable is factor case. One can probably confirm by looking into the source code that sum group-by a factor is not optimised for.

@MichaelChirico
Copy link
Member

MichaelChirico commented Nov 2, 2017 via email

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Nov 3, 2017

Seems like group by are not optimisation where the by-variable is of factor type.

Please explain what you mean by this, and why with a minimal example. As Michael mentions, factors are internally represented as integers, and data.table makes use of it while finding groups.

@xiaodaigh
Copy link
Author

xiaodaigh commented Nov 3, 2017

Apologies for bad grammar.

I can't comment on the general group-by case where you can apply arbitrary code to .SD but I am thinking about the case where I am just doing a simple function DT[,sum(v1), by = some_factor] in this case I can write

result <- vector("numeric", length = nrow(DT))
for(r in 1:nrow(DT)) {
  result[DT$some_factor] =   result[DT$some_factor] + DT[r, v1]
}
result # is the group by results

Imagine if the above was in some efficient C/C++ implementation.

Because the number of groups, n, is already known up front one can simply create an array of size n and using the levels of the factor to index into the vector, to do a sum (or any reduction).

If this was implemented in C/C++ as a fast path in data.table then it will be faster than the traditional group by.

This seciton of the post talks about speed ups achieve in Julia which can be replicated in data.table.

@MichaelChirico
Copy link
Member

MichaelChirico commented Nov 3, 2017 via email

@xiaodaigh
Copy link
Author

@MichaelChirico My point is not that factors are slower than integers, the point is that factors can be even faster than it is now in special cases such as sum and mean etc.

Benchmarks for this in other languages to show that optimising for factors will be faster is already known, e.g. in the post about Julia above (also I believe panda optimises this case as well but not sure).

Hope this clears things up.

@jangorecki
Copy link
Member

jangorecki commented Nov 3, 2017

@xiaodaigh any reason why you assume it is not already optimised in data.table? I can imagine it is optimised even better than in Julia or pandas, as many other algorithms. Question about benchmark, in R, in data.table, is very important here. So please provide one. Without the benchmark this issue should not be even created in first place, and definitely not with current title "group by operation not optimised for factor types ".

@xiaodaigh
Copy link
Author

Here are some benchmarks to show that the factors as group-by doesn't seem to receive extra optimisation.

Also updated original issue post with link to benchmarks.

@xiaodaigh xiaodaigh changed the title group by operation not optimised for factor types sum & mean group by operation not optimised for factor types Nov 3, 2017
@xiaodaigh
Copy link
Author

xiaodaigh commented Nov 8, 2017

I decided to write a simple benchmark in R using Rcpp showing a faster sumby is possible for group-by vectors that are factor vectors using Rcpp.

The data.table case takes about 2.2 seconds vs 0.16 seconds for optimised version on my computer.

library(dplyr)
library(data.table)
library(dtplyr)

N = 100000000
K = 100
a <- data.table(grps = sample(as.factor(1:K),N, replace = T), val = runif(N))

library(Rcpp)
cppFunction('NumericVector sumbycpp(IntegerVector grps, NumericVector vals, int ngrps) {
            NumericVector out(ngrps);
            
            int n = grps.size();
            
            for(int i = 0; i < n; ++i) {
            out[grps[i]-1] = out[grps[i]-1] + vals[i];
            }
            return out;
            }')

fastersumby <- function(grps, vals) {
  ngrps <- nlevels(grps)
  data.table(levels(grps), sumbycpp(grps, vals, ngrps))
}

system.time(print(a[,sum(val), grps])) # 2 seconds
system.time(print(a[,fastersumby(grps, val)])) # 0.16 seconds

system.time({
  a %>% 
    group_by(grps) %>% 
    summarise(sum(val))
}) # 2.4

@jangorecki
Copy link
Member

Thanks @xiaodaigh, this is exactly what we needed in the report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants