Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

computing index could find groups as well #4387

Closed
jangorecki opened this issue Apr 20, 2020 · 3 comments · Fixed by #4386
Closed

computing index could find groups as well #4387

jangorecki opened this issue Apr 20, 2020 · 3 comments · Fixed by #4386

Comments

@jangorecki
Copy link
Member

jangorecki commented Apr 20, 2020

Taking that out from #4346 so discussion only about that could be here.
I would like to propose for forderv to have a default retGrp=TRUE, that means secondary indices would carry that attribute as well. As a result it will be a little bit more heavy, but it opens more possibilities to avoid heavy re-computation. One of many examples

# TODO: could check/reuse secondary indices, but we need 'starts' attribute as well!

as well #2947


I made small benchmark...

tl;dr

The difference in timings above are significant. My conclusion is that we should not make that a defaut, but rather keep those information whenever user compute them somehow, for example when calling unique. In such case there is no extra performance cost, and those information doesn't have to be re-computed again. It could be computed when calling setindex.


Each of comment describes a different factor used.

library(data.table)
set.seed(108)
forderv = data.table:::forderv
N = 1e8

## th
setDTthreads(40L)
setDTthreads(1L)

## n unique
DT = data.table(V1 = sample(N, N, FALSE))
DT = data.table(V1 = sample(1:2, N, TRUE))

## fun: order vs order+groups
system.time(o <- forderv(DT, by="V1", sort=TRUE, retGrp=FALSE))
system.time(p <- forderv(DT, by="V1", sort=TRUE, retGrp=TRUE))

and got the following timings

d = fread("
th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")
cube(d, by=c("th","unqn"), j=sprintf("%.2f%%", mean(sec[fun=="o"]/sec[fun=="og"])*100))
#      th  unqn     V1
#1:    40 1e+08 48.38%
#2:    40 2e+00 96.44%
#3:     1 1e+08 87.05%
#4:     1 2e+00 98.70%
#5:    40    NA 72.41%
#6:     1    NA 92.87%
#7:    NA 1e+08 67.72%
#8:    NA 2e+00 97.57%
#9:    NA    NA 82.64%

On average finding order but no groups takes 82% of time that order+groups would take.
Importance of unique value (number of groups) is 97% vs 67%. So if there are only 2 groups, the difference is not significant, but for all unique rows, the average difference is 67%.
Importance of 40 vs 1 thread is 92% vs 72%.
In combination of 40 threads and all unique rows, calculating order+groups is twice slower comparing to just order. When using 1 thread it is only around 10% slower.

Regarding memory, number of threads is not factor anymore.
All unique rows, will take twice as much memory, while 2 groups will take almost no extra memory.

@jangorecki
Copy link
Member Author

jangorecki commented Apr 21, 2020

memory overhead can be reduced in the most pesimistic case (all unq rows).
forderv(, retGrp=TRUE) could return a compact version of starts argument, the same as it is already handled for the retGrp=FALSE on order vector.

attr(data.table:::forderv(sample(10), retGrp=TRUE), "starts")
# [1]  1  2  3  4  5  6  7  8  9 10

## could be
integer()

Initial version of it can be found in 63e7097
It touches many files so don't want to include it now in PRs to avoid conflicts. It would be good to utilize this feature in other places, to avoid extra subset/copy in cases like this:

if (length(starts)) xo = xo[starts]

@jangorecki
Copy link
Member Author

From R-devel mailing list
[Rd] base::order making available retGrp and sortStr options for radix method?, it seems that not only me need better access to those attributes.

@jangorecki
Copy link
Member Author

jangorecki commented May 31, 2020

Assuming pessimistic case of all unique elements...
Interesting that for character type, using retGrp=TRUE did not have much impact.
We can also observe that using more threads for character is slowing it down, unlike integer or double. In optimistic case of 2 unique values , character is faster using more threads.
This might be related to R's issues when handling many unique characters that @MichaelChirico recently experienced as well. Might be as well related to checking encoding, which has to happen on each comparison, because string pointers are never equal in pessimistic case.

this is for integer

       type      size cardinality elapsed    th                 args            cflags
     <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: integer 100000000         unq   4.844     1             DT,""x"" -O3 -mtune=native
 2: integer 100000000         unq   5.528     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: integer 100000000         unq   1.804     4             DT,""x"" -O3 -mtune=native
 4: integer 100000000         unq   3.248     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: integer 100000000         unq   1.017    10             DT,""x"" -O3 -mtune=native
 6: integer 100000000         unq   2.548    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: integer 100000000         unq   0.856    20             DT,""x"" -O3 -mtune=native
 8: integer 100000000         unq   2.312    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: integer 100000000         unq   0.872    30             DT,""x"" -O3 -mtune=native
10: integer 100000000         unq   2.042    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: integer 100000000         unq   0.857    40             DT,""x"" -O3 -mtune=native
12: integer 100000000         unq   1.954    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

and double

      type      size cardinality elapsed    th                 args            cflags
    <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: double 100000000         unq  11.519     1             DT,""x"" -O3 -mtune=native
 2: double 100000000         unq  12.150     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: double 100000000         unq   5.178     4             DT,""x"" -O3 -mtune=native
 4: double 100000000         unq   6.842     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: double 100000000         unq   3.402    10             DT,""x"" -O3 -mtune=native
 6: double 100000000         unq   4.550    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: double 100000000         unq   2.842    20             DT,""x"" -O3 -mtune=native
 8: double 100000000         unq   4.824    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: double 100000000         unq   2.890    30             DT,""x"" -O3 -mtune=native
10: double 100000000         unq   4.254    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: double 100000000         unq   2.893    40             DT,""x"" -O3 -mtune=native
12: double 100000000         unq   4.139    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

and character

         type      size cardinality elapsed    th                 args            cflags
       <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: character 100000000         unq 120.087     1             DT,""x"" -O3 -mtune=native
 2: character 100000000         unq 121.910     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: character 100000000         unq 145.838     4             DT,""x"" -O3 -mtune=native
 4: character 100000000         unq 142.330     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: character 100000000         unq 147.387    10             DT,""x"" -O3 -mtune=native
 6: character 100000000         unq 147.786    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: character 100000000         unq 207.849    20             DT,""x"" -O3 -mtune=native
 8: character 100000000         unq 211.663    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: character 100000000         unq 232.446    30             DT,""x"" -O3 -mtune=native
10: character 100000000         unq 214.417    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: character 100000000         unq 206.653    40             DT,""x"" -O3 -mtune=native
12: character 100000000         unq 213.415    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

and character optimistic case of 2 unq values

         type      size cardinality elapsed    th                 args            cflags
       <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: character 100000000       grpn2   1.472     1             DT,""x"" -O3 -mtune=native
 2: character 100000000       grpn2   1.426     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: character 100000000       grpn2   0.657     4             DT,""x"" -O3 -mtune=native
 4: character 100000000       grpn2   0.553     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: character 100000000       grpn2   0.307    10             DT,""x"" -O3 -mtune=native
 6: character 100000000       grpn2   0.315    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: character 100000000       grpn2   0.289    20             DT,""x"" -O3 -mtune=native
 8: character 100000000       grpn2   0.277    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: character 100000000       grpn2   0.227    30             DT,""x"" -O3 -mtune=native
10: character 100000000       grpn2   0.228    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: character 100000000       grpn2   0.213    40             DT,""x"" -O3 -mtune=native
12: character 100000000       grpn2   0.229    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant