sum & mean group by operation not optimised for factor types #2458
Could you add a benchmark suggesting this is the case? It's no secret that factors are integers.
Please explain what you mean by this, and why, with a minimal example. As Michael mentions, factors are internally represented as integers, and data.table makes use of that while finding groups.
Apologies for bad grammar. I can't comment on the general group-by case where you can apply arbitrary code to .SD, but I am thinking about the case where I am just doing a simple function, DT[, sum(v1), by = some_factor]. In this case I can write

result <- vector("numeric", length = nlevels(DT$some_factor))
for (r in 1:nrow(DT)) {
  g <- as.integer(DT$some_factor[r])
  result[g] <- result[g] + DT$v1[r]
}
result # the group-by sums

Imagine if the above were in some efficient C/C++ implementation. Because the number of groups, n, is already known up front, one can simply create an array of size n and use the integer codes of the factor to index into the vector to do a sum (or any reduction). If this were implemented in C/C++ as a fast path in data.table then it would be faster than the traditional group-by. This section <https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell#benchmark-pooledarrays-and-therefore-categoricalarrays> of the post talks about speed-ups achieved in Julia which could be replicated in data.table.
I reiterate -- could you provide some benchmarks indicating the code is,
for example, much slower for factors than for integer columns?
@MichaelChirico My point is not that factors are slower than integers; the point is that factors can be made even faster than they are now in special cases such as sum and mean. Benchmarks in other languages showing that optimising for factors is faster already exist, e.g. in the post about Julia above (I also believe pandas optimises this case, but I'm not sure). Hope this clears things up.
@xiaodaigh any reason why you assume it is not already optimised in data.table? I can imagine it is optimised even better than in Julia or pandas, as many other algorithms are. A benchmark, in R, using data.table, is very important here, so please provide one. Without a benchmark this issue should not have been created in the first place, and definitely not with the current title "group by operation not optimised for factor types".
Here are some benchmarks showing that factor group-by columns don't seem to receive extra optimisation. I have also updated the original issue post with a link to the benchmarks.
I decided to write a simple benchmark in R, using Rcpp, showing that a faster sumby is possible when the group-by vector is a factor. The data.table case takes about 2.2 seconds vs 0.16 seconds for the optimised version on my computer.

library(dplyr)
library(data.table)
library(dtplyr)
N = 100000000
K = 100
a <- data.table(grps = sample(as.factor(1:K), N, replace = TRUE), val = runif(N))
library(Rcpp)
cppFunction('NumericVector sumbycpp(IntegerVector grps, NumericVector vals, int ngrps) {
NumericVector out(ngrps);
int n = grps.size();
for(int i = 0; i < n; ++i) {
out[grps[i]-1] = out[grps[i]-1] + vals[i];
}
return out;
}')
fastersumby <- function(grps, vals) {
ngrps <- nlevels(grps)
data.table(levels(grps), sumbycpp(grps, vals, ngrps))
}
system.time(print(a[,sum(val), grps])) # 2 seconds
system.time(print(a[,fastersumby(grps, val)])) # 0.16 seconds
system.time({
a %>%
group_by(grps) %>%
summarise(sum(val))
}) # 2.4
Thanks @xiaodaigh, this is exactly what we needed in the report. |
Seems like group-by is not optimised where the by-variable is of factor type. Essentially one can perform group-by faster for factors, given that they are represented by integers from 1 to n, where n is the number of groups.
See Python implementation discussion http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and Julia implementation discussion https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell
Update - Benchmarks
Here are some benchmarks showing that factor group-by columns don't seem to receive extra optimisation.
I ran two benchmarks, one with a group-by variable of 2.5 million groups and one with 100 million groups. In each, I ran one case where the group-by variable is an integer and one where it is a factor. In both benchmarks the run time did not improve when the group-by variable was a factor. One can probably confirm by looking at the source code that sum grouped by a factor is not given a fast path.