computing index could find groups as well #4387

jangorecki · 2020-04-20T11:32:40Z

Taking that out from #4346 so discussion only about that could be here.
I would like to propose for forderv to have a default retGrp=TRUE, that means secondary indices would carry that attribute as well. As a result it will be a little bit more heavy, but it opens more possibilities to avoid heavy re-computation. One of many examples

# TODO: could check/reuse secondary indices, but we need 'starts' attribute as well!

as well #2947

I made small benchmark...

tl;dr

The difference in timings above are significant. My conclusion is that we should not make that a defaut, but rather keep those information whenever user compute them somehow, for example when calling unique. In such case there is no extra performance cost, and those information doesn't have to be re-computed again. It could be computed when calling setindex.

Each of comment describes a different factor used.

library(data.table)
set.seed(108)
forderv = data.table:::forderv
N = 1e8

## th
setDTthreads(40L)
setDTthreads(1L)

## n unique
DT = data.table(V1 = sample(N, N, FALSE))
DT = data.table(V1 = sample(1:2, N, TRUE))

## fun: order vs order+groups
system.time(o <- forderv(DT, by="V1", sort=TRUE, retGrp=FALSE))
system.time(p <- forderv(DT, by="V1", sort=TRUE, retGrp=TRUE))

and got the following timings

d = fread("
th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")
cube(d, by=c("th","unqn"), j=sprintf("%.2f%%", mean(sec[fun=="o"]/sec[fun=="og"])*100))
#      th  unqn     V1
#1:    40 1e+08 48.38%
#2:    40 2e+00 96.44%
#3:     1 1e+08 87.05%
#4:     1 2e+00 98.70%
#5:    40    NA 72.41%
#6:     1    NA 92.87%
#7:    NA 1e+08 67.72%
#8:    NA 2e+00 97.57%
#9:    NA    NA 82.64%

On average finding order but no groups takes 82% of time that order+groups would take.
Importance of unique value (number of groups) is 97% vs 67%. So if there are only 2 groups, the difference is not significant, but for all unique rows, the average difference is 67%.
Importance of 40 vs 1 thread is 92% vs 72%.
In combination of 40 threads and all unique rows, calculating order+groups is twice slower comparing to just order. When using 1 thread it is only around 10% slower.

Regarding memory, number of threads is not factor anymore.
All unique rows, will take twice as much memory, while 2 groups will take almost no extra memory.

The text was updated successfully, but these errors were encountered:

jangorecki · 2020-04-21T10:33:41Z

memory overhead can be reduced in the most pesimistic case (all unq rows).
forderv(, retGrp=TRUE) could return a compact version of starts argument, the same as it is already handled for the retGrp=FALSE on order vector.

attr(data.table:::forderv(sample(10), retGrp=TRUE), "starts")
# [1]  1  2  3  4  5  6  7  8  9 10

## could be
integer()

Initial version of it can be found in 63e7097
It touches many files so don't want to include it now in PRs to avoid conflicts. It would be good to utilize this feature in other places, to avoid extra subset/copy in cases like this:

if (length(starts)) xo = xo[starts]

jangorecki · 2020-05-09T09:00:40Z

From R-devel mailing list
[Rd] base::order making available retGrp and sortStr options for radix method?, it seems that not only me need better access to those attributes.

jangorecki · 2020-05-31T12:20:24Z

Assuming pessimistic case of all unique elements...
Interesting that for character type, using retGrp=TRUE did not have much impact.
We can also observe that using more threads for character is slowing it down, unlike integer or double. In optimistic case of 2 unique values , character is faster using more threads.
This might be related to R's issues when handling many unique characters that @MichaelChirico recently experienced as well. Might be as well related to checking encoding, which has to happen on each comparison, because string pointers are never equal in pessimistic case.

this is for integer

       type      size cardinality elapsed    th                 args            cflags
     <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: integer 100000000         unq   4.844     1             DT,""x"" -O3 -mtune=native
 2: integer 100000000         unq   5.528     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: integer 100000000         unq   1.804     4             DT,""x"" -O3 -mtune=native
 4: integer 100000000         unq   3.248     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: integer 100000000         unq   1.017    10             DT,""x"" -O3 -mtune=native
 6: integer 100000000         unq   2.548    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: integer 100000000         unq   0.856    20             DT,""x"" -O3 -mtune=native
 8: integer 100000000         unq   2.312    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: integer 100000000         unq   0.872    30             DT,""x"" -O3 -mtune=native
10: integer 100000000         unq   2.042    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: integer 100000000         unq   0.857    40             DT,""x"" -O3 -mtune=native
12: integer 100000000         unq   1.954    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

and double

      type      size cardinality elapsed    th                 args            cflags
    <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: double 100000000         unq  11.519     1             DT,""x"" -O3 -mtune=native
 2: double 100000000         unq  12.150     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: double 100000000         unq   5.178     4             DT,""x"" -O3 -mtune=native
 4: double 100000000         unq   6.842     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: double 100000000         unq   3.402    10             DT,""x"" -O3 -mtune=native
 6: double 100000000         unq   4.550    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: double 100000000         unq   2.842    20             DT,""x"" -O3 -mtune=native
 8: double 100000000         unq   4.824    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: double 100000000         unq   2.890    30             DT,""x"" -O3 -mtune=native
10: double 100000000         unq   4.254    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: double 100000000         unq   2.893    40             DT,""x"" -O3 -mtune=native
12: double 100000000         unq   4.139    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

and character

         type      size cardinality elapsed    th                 args            cflags
       <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: character 100000000         unq 120.087     1             DT,""x"" -O3 -mtune=native
 2: character 100000000         unq 121.910     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: character 100000000         unq 145.838     4             DT,""x"" -O3 -mtune=native
 4: character 100000000         unq 142.330     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: character 100000000         unq 147.387    10             DT,""x"" -O3 -mtune=native
 6: character 100000000         unq 147.786    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: character 100000000         unq 207.849    20             DT,""x"" -O3 -mtune=native
 8: character 100000000         unq 211.663    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: character 100000000         unq 232.446    30             DT,""x"" -O3 -mtune=native
10: character 100000000         unq 214.417    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: character 100000000         unq 206.653    40             DT,""x"" -O3 -mtune=native
12: character 100000000         unq 213.415    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

and character optimistic case of 2 unq values

         type      size cardinality elapsed    th                 args            cflags
       <char>    <char>      <char>   <num> <int>               <char>            <char>
 1: character 100000000       grpn2   1.472     1             DT,""x"" -O3 -mtune=native
 2: character 100000000       grpn2   1.426     1 DT,""x"",retGrp=TRUE -O3 -mtune=native
 3: character 100000000       grpn2   0.657     4             DT,""x"" -O3 -mtune=native
 4: character 100000000       grpn2   0.553     4 DT,""x"",retGrp=TRUE -O3 -mtune=native
 5: character 100000000       grpn2   0.307    10             DT,""x"" -O3 -mtune=native
 6: character 100000000       grpn2   0.315    10 DT,""x"",retGrp=TRUE -O3 -mtune=native
 7: character 100000000       grpn2   0.289    20             DT,""x"" -O3 -mtune=native
 8: character 100000000       grpn2   0.277    20 DT,""x"",retGrp=TRUE -O3 -mtune=native
 9: character 100000000       grpn2   0.227    30             DT,""x"" -O3 -mtune=native
10: character 100000000       grpn2   0.228    30 DT,""x"",retGrp=TRUE -O3 -mtune=native
11: character 100000000       grpn2   0.213    40             DT,""x"" -O3 -mtune=native
12: character 100000000       grpn2   0.229    40 DT,""x"",retGrp=TRUE -O3 -mtune=native

jangorecki added the performance label Apr 20, 2020

jangorecki linked a pull request Apr 20, 2020 that will close this issue

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Merged

3 tasks

jangorecki mentioned this issue Apr 20, 2020

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Merged

3 tasks

jangorecki self-assigned this Apr 20, 2020

jangorecki added the internals label Apr 22, 2020

jangorecki mentioned this issue Jun 4, 2022

Use lookup (join to dictionary) for performance boost #2603

Open

MichaelChirico closed this as completed in #4386 Jul 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

computing index could find groups as well #4387

computing index could find groups as well #4387

jangorecki commented Apr 20, 2020 •

edited

Loading

jangorecki commented Apr 21, 2020 •

edited

Loading

jangorecki commented May 9, 2020

jangorecki commented May 31, 2020 •

edited

Loading

computing index could find groups as well #4387

computing index could find groups as well #4387

Comments

jangorecki commented Apr 20, 2020 • edited Loading

tl;dr

jangorecki commented Apr 21, 2020 • edited Loading

jangorecki commented May 9, 2020

jangorecki commented May 31, 2020 • edited Loading

jangorecki commented Apr 20, 2020 •

edited

Loading

jangorecki commented Apr 21, 2020 •

edited

Loading

jangorecki commented May 31, 2020 •

edited

Loading