computing index could find groups as well #4387
Comments
Memory overhead can be reduced in the most pessimistic case (all unique rows).
attr(data.table:::forderv(sample(10), retGrp=TRUE), "starts")
# [1] 1 2 3 4 5 6 7 8 9 10
## could be
integer()
An initial version of this can be found in 63e7097:
if (length(starts)) xo = xo[starts]
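A minimal sketch of the idea in base R, to show why nothing needs to be stored when every row is its own group. The helpers `group_starts`, `compact_starts` and `expand_starts` are hypothetical names for illustration only; they are not data.table functions and do not reflect what commit 63e7097 does internally. The sketch assumes the input has no NA values.

```r
# Hypothetical helpers (base R); names are illustrative, not data.table API.

# Group starts of x in sorted order; assumes x has no NA values.
group_starts <- function(x) {
  if (!length(x)) return(integer())
  s <- x[order(x)]
  # a group starts at position 1 and wherever the sorted value changes
  which(c(TRUE, s[-1L] != s[-length(s)]))
}

# Store nothing when every row is its own group.
compact_starts <- function(starts, n) {
  if (length(starts) == n) integer() else starts
}

# Recover the full vector on demand.
expand_starts <- function(starts, n) {
  if (!length(starts) && n > 0L) seq_len(n) else starts
}

x <- sample(10)
st <- compact_starts(group_starts(x), length(x))
length(st)                    # 0: nothing stored for the all-unique case
expand_starts(st, length(x))  # 1:10 recovered on demand
```

The empty vector is unambiguous here because any non-empty input has at least one group, so a zero-length `starts` for `n > 0` can only mean "all unique".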
From the R-devel mailing list:
Assuming the pessimistic case of all unique elements, this is for integer
and double
and character
and character, in the optimistic case of 2 unique values
Splitting this out from #4346 so that the discussion of this topic can happen here.
I would like to propose that forderv default to retGrp=TRUE, which means secondary indices would carry that attribute as well. As a result an index will be a little heavier, but it opens up more possibilities to avoid expensive re-computation. One of many examples: #2947
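To illustrate what kind of re-computation could be avoided, here is a rough base R sketch (not data.table internals). It assumes an ordering permutation `o` and group starts `starts` are already cached, the way an index carrying the group information would provide them, and derives distinct values and group sizes from them without sorting or hashing again. All variable names are made up for the example.

```r
# Base R sketch; not data.table internals.
x <- c("b", "a", "b", "c", "a")

o <- order(x)            # ordering permutation, as an index would store it
sorted <- x[o]
# group starts: first position of each run of equal values in sorted order
starts <- which(c(TRUE, sorted[-1L] != sorted[-length(sorted)]))

# With `starts` cached next to `o`, these need no further sorting or hashing:
uniq_sorted <- x[o[starts]]                     # distinct values
group_sizes <- diff(c(starts, length(x) + 1L))  # number of rows per group
```

The cost is that `starts` has one entry per group, which is where the extra weight of the index comes from.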
I made small benchmark...
tl;dr
The differences in timings are significant. My conclusion is that we should not make this the default, but rather keep this information whenever the user computes it anyway, for example when calling unique. In that case there is no extra performance cost, and the information does not have to be recomputed. It could be computed when calling setindex.
Each comment describes a different factor used.
and got the following timings
On average, finding order but no groups takes 82% of the time that order+groups would take.
Importance of the number of unique values (number of groups) is 97% vs 67%. So if there are only 2 groups the difference is not significant, but for all unique rows order alone takes on average 67% of the time of order+groups.
Importance of 40 vs 1 thread is 92% vs 72%.
With 40 threads and all unique rows, calculating order+groups is twice as slow as calculating just order. When using 1 thread it is only around 10% slower.
Regarding memory, the number of threads is no longer a factor.
All unique rows will take twice as much memory, while 2 groups will take almost no extra memory.
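As a small worked conversion, the percentages quoted for the average case and for the number of groups are ratios of time(order only) to time(order+groups). Taking the reciprocal expresses them as the relative slowdown of also computing groups. The vector names are chosen only for this example.

```r
# Ratios quoted above: time(order only) / time(order + groups)
ratio <- c(average = 0.82, two_groups = 0.97, all_unique = 0.67)

# Relative slowdown of order+groups versus order alone
slowdown <- 1 / ratio
round(slowdown, 2)
```

So on average order+groups costs roughly 22% more time than order alone, about 3% more with 2 groups, and about 49% more with all unique rows.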