sum & mean group by operation not optimised for factor types #2458
Could you add a benchmark suggesting this is the case? It's no secret that factors are integers.
Please explain what you mean by this, and why, with a minimal example. As Michael mentions, factors are internally represented as integers, and data.table makes use of that while finding groups.
Apologies for bad grammar. I can't comment on the general group-by case where you can apply arbitrary code to .SD, but I am thinking about the case where I am just doing a simple function, DT[, sum(v1), by = some_factor]. In this case I can write

result <- vector("numeric", length = nlevels(DT$some_factor))
for (r in 1:nrow(DT)) {
  g <- as.integer(DT$some_factor[r])
  result[g] <- result[g] + DT$v1[r]
}
result # the group-by sums

Imagine if the above were in some efficient C/C++ implementation. Because the number of groups, n, is already known up front, one can simply create an array of size n and use the integer codes of the factor to index into the vector to do a sum (or any reduction). If this were implemented in C/C++ as a fast path in data.table then it would be faster than the traditional group-by. This section <https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell#benchmark-pooledarrays-and-therefore-categoricalarrays> of the post talks about speed-ups achieved in Julia which could be replicated in data.table.
I reiterate -- could you provide some benchmarks indicating the code is,
for example, much slower for factors than for integer columns?
@MichaelChirico My point is not that factors are slower than integers; the point is that factors can be made even faster than they are now in special cases such as sum and mean. Benchmarks in other languages showing that optimising for factors is faster already exist, e.g. in the post about Julia above (I also believe pandas optimises this case, but I'm not sure). Hope this clears things up.
@xiaodaigh any reason why you assume it is not already optimised in data.table? I can imagine it is optimised even better than in Julia or pandas, as many other algorithms are. A benchmark, in R, using data.table, is very important here, so please provide one. Without a benchmark this issue should not have been created in the first place, and definitely not with the current title "group by operation not optimised for factor types".
Here are some benchmarks showing that factor group-by columns don't seem to receive extra optimisation. I have also updated the original issue post with a link to the benchmarks.
I decided to write a simple benchmark in R, using Rcpp, showing that a faster sumby is possible when the group-by vector is a factor. The data.table case takes about 2.2 seconds vs 0.16 seconds for the optimised version on my computer.

library(dplyr)
library(data.table)
library(dtplyr)
N = 100000000
K = 100
a <- data.table(grps = sample(as.factor(1:K), N, replace = TRUE), val = runif(N))
library(Rcpp)
cppFunction('NumericVector sumbycpp(IntegerVector grps, NumericVector vals, int ngrps) {
NumericVector out(ngrps);
int n = grps.size();
for(int i = 0; i < n; ++i) {
out[grps[i]-1] = out[grps[i]-1] + vals[i];
}
return out;
}')
fastersumby <- function(grps, vals) {
ngrps <- nlevels(grps)
data.table(levels(grps), sumbycpp(grps, vals, ngrps))
}
system.time(print(a[,sum(val), grps])) # 2 seconds
system.time(print(a[,fastersumby(grps, val)])) # 0.16 seconds
system.time({
a %>%
group_by(grps) %>%
summarise(sum(val))
}) # 2.4
Thanks @xiaodaigh, this is exactly what we needed in the report. |
Seems like group-by is not optimised where the by-variable is of factor type. Essentially one can perform group-by faster for factors, given that they are represented by integers from 1 to n, where n is the number of groups.
See Python implementation discussion http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and Julia implementation discussion https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell
Update - Benchmarks
Here are some benchmarks showing that factor group-by columns don't seem to receive extra optimisation.
I ran two benchmarks, one with a group-by variable of 2.5 million groups and one with 100 million groups. In each, I ran one case where the group-by variable is an integer and one where it is a factor. In both benchmarks the run time did not improve when the group-by variable was a factor. One can probably confirm by looking at the source code that sum grouped by a factor is not given a fast path.