New function topn #4187

Closed · wants to merge 19 commits into master from 2005m:topn

Conversation

Contributor

@2005m 2005m commented Jan 17, 2020

Closes #3804

@2005m 2005m added the WIP label Jan 17, 2020

codecov bot commented Jan 18, 2020

Codecov Report

Merging #4187 into master will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #4187    +/-   ##
========================================
  Coverage   99.61%   99.61%            
========================================
  Files          72       72            
  Lines       13916    14046   +130     
========================================
+ Hits        13862    13992   +130     
  Misses         54       54            

Powered by Codecov. Last update b1b1832...37f503e.

@2005m 2005m removed the WIP label Jan 18, 2020
Contributor Author

2005m commented Jan 18, 2020

Tried to use dplyr::top_n but it crashed my session 3 times. Seems buggy. I gave up! If someone has some benchmarks, I am happy to add them.

@2005m 2005m added this to the 1.12.9 milestone Jan 18, 2020
@ColeMiller1
Contributor

For topn(x, 1L), this is a special case which equals which.min(x). which.min is more performant, but I'm not sure whether it's worth accommodating it.

x = rnorm(5e7)
microbenchmark::microbenchmark(
  topn(x, 1L),
  which.min(x), times = 10L
)

##Unit: milliseconds
##         expr      min       lq      mean   median       uq      max neval
##  topn(x, 1L) 128.8612 129.0326 138.75400 130.1807 135.3845 202.9439    10
## which.min(x)  85.8646  86.6461  96.82178  89.4887  91.3402 156.3643    10

identical(topn(x, 1L), which.min(x))
## [1] TRUE

src/fifelse.c Outdated
error("Function 'topn' is not built for large value of 'n'. The algorithm is made for small values. Please prefer the 'order' if you want to proceed with such large value.");
}
if (len0 > len1) {
error("'n' cannot be larger than length of 'vec'.");
Contributor

A very likely use case is topn by group. I think most people familiar with top_n would expect this to work:

set.seed(123L)
dt = data.table(x = rnorm(10), grp = sample(5L, 10L, replace = T))
dt[dt[, .I[topn(x, 2L)], by = grp]$V1, ]

##Error: 'n' cannot be larger than length of 'vec'.

Contributor Author

so you suggest we return something like below?

> dt[dt[, .I[order(x)[1:2]], by = grp]$V1, ]
             x grp
 1: -0.6868529   1
 2: -0.5604756   1
 3: -0.2301775   4
 4:         NA  NA
 5:  0.1292877   5
 6:         NA  NA
 7: -0.4456620   3
 8:  1.7150650   3
 9: -1.2650612   2
10:  0.4609162   2

Contributor

My preference would be to return seq_len(n) and maybe have a warning that there are fewer records than n.

dt[dt[, .I[order(x)[1:min(.N,2)]], by = grp]$V1, ]

            x   grp
        <num> <int>
1: -0.6868529     1
2: -0.5604756     1
3: -0.2301775     4
4:  0.1292877     5
5: -0.4456620     3
6:  1.7150650     3
7: -1.2650612     2
8:  0.4609162     2

P.S., this PR is great.

set.seed(123)
dt = data.table(x = rnorm(1e7), grp = sample(100, 1e7, TRUE))

microbenchmark::microbenchmark(
dt[, .I[topn(x, 5L)], by = grp],
dt[order(-x), .I[seq_len(5L)], by = grp],
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp], ##closest to dplyr::top_n
times = 5L
)

Unit: milliseconds
                                                     expr       min        lq      mean    median        uq       max neval
                          dt[, .I[topn(x, 5L)], by = grp]  157.9859  158.9587  161.4327  160.2456  160.6012  169.3721     5
                 dt[order(-x), .I[seq_len(5L)], by = grp] 1336.8226 1343.3002 1347.5568 1345.3929 1350.1008 1362.1676     5
 dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp] 1432.7179 1459.2029 1529.9473 1498.1096 1626.5848 1633.1211     5

Contributor Author

It should now be working as you asked.

Contributor

Thanks!

This is more for future thought. I was also going to suggest the maintainers look at updating the Database-like ops benchmark for top n by group to use dt[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]. But as the number of groups increases, there seems to be some overhead. I assume this is a memory and GForce type of issue. I can log a separate issue for better grouping performance once this is merged.

Details, benchmark, and verbose message

There are two rows marked with ******* which identify the main performance difference: for topn, collecting discontiguous groups took 0.616s for 10000 groups, versus only 0.016s for the other method.

##remotes::install_github("Rdatatable/data.table", ref = "topn")

library(data.table)
N = 1000000L; K = 100L; set.seed(108L)
DT = data.table(id6 = sample(N/K, N, TRUE), v3 = runif(N, max = 100))

identical(DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)][order(-largest2_v3, id6)],
          DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6][order(-largest2_v3, id6)])
#> [1] TRUE

microbenchmark::microbenchmark(
  DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)],
  DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
  , times = 10L
  )
#> Unit: milliseconds
#>                                                                     expr
#>  DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)]
#>     DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
#>       min       lq     mean   median       uq      max neval
#>  242.2958 248.2904 254.6175 250.4815 255.1830 288.5708    10
#>  674.9867 680.0911 722.0210 682.1045 697.4592 915.3662    10

DT[DT[order(-v3), head(.I, 2L), by = id6, verbose = T]$V1, .(id6, largest2_v3 = v3)]
#> forder.c received 1000000 rows and 1 columns
#> i clause present and columns used in by detected, only these subset: id6 
#> Detected that j uses these columns: <none> 
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'head(.I, 2L)'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   collecting discontiguous groups took 0.016s for 10000 groups *********
#>   eval(j) took 0.090s for 10000 calls
#> 0.110s elapsed (0.110s cpu)

DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6, verbose = TRUE]
#> Detected that j uses these columns: v3 
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(v3[topn(v3, 2L, decreasing = TRUE)])'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   collecting discontiguous groups took 0.616s for 10000 groups  **********
#>   eval(j) took 0.034s for 10000 calls
#> 0.690s elapsed (0.440s cpu)

@@ -7,6 +7,7 @@ setcoalesce = function(...) .Call(Ccoalesce, list(...), TRUE)

fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)

Contributor

Should the default be decreasing = TRUE? When I think "top", I think highest values. If it helps, dplyr::top_n uses the sign of n to determine whether the top n or bottom n are returned.
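
For reference, a tiny sketch of that sign-of-n convention (based on dplyr's documentation; the data frame df is made up here, and the call is not benchmarked since top_n reportedly crashed the author's session):

df = data.frame(wt = c(5, 1, 4, 2, 3))
dplyr::top_n(df, 2, wt)      # rows with the 2 largest wt values (5 and 4)
dplyr::top_n(df, -2, wt)     # rows with the 2 smallest wt values (1 and 2)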

Contributor Author

I totally don't mind. The way I implemented it was to be in line with order and sort. dplyr::top_n does not work for me.

Member

I think decreasing=TRUE may be the more common use case, but practically speaking most sorting functions in R and elsewhere default to decreasing=FALSE.

More practically, in other places in data.table we use lazy evaluation to allow - to mean decreasing=TRUE, switching the argument without actually negating the vector (which should be efficient). In particular, for character input - doesn't work for "flipping the sign".
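
A small R illustration of that last point, relying only on documented data.table behaviour (order() inside DT[...] is optimised to data.table's forder, which accepts - on character columns, whereas base R cannot negate a character vector):

library(data.table)
dt = data.table(x = c("b", "a", "c"))
# order(-dt$x)      # base R errors: invalid argument to unary operator
dt[order(-x)]       # works: '-' is handled lazily by data.table's internal forder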

Contributor Author

2005m commented Jan 19, 2020

To answer your question on which.min / which.max: I am happy to chase corner cases, but these are functions made for a different purpose and, correct me if I am wrong, they don't bother dealing with NA. If you take the topn function I wrote in the issue, which does not care about NA, here is what you get:

x = rnorm(5e7)
microbenchmark::microbenchmark(
  topn(x, 1L),
  which.max(x), times = 10L
)
# Unit: milliseconds
#         expr       min        lq     mean    median        uq       max neval
#  topn(x, 1L)  58.00159  60.38602  63.5568  63.06315  65.64793  71.91218    10
# which.max(x) 135.89879 153.44461 161.4339 163.98896 173.29448 176.23781    10

And it was not even my intention to beat which.max. Just to say that:
- in my view, topn is a more efficient way of getting the same output as order(x)[1:n] (for numeric vectors), which is what I understood from the feature request (and the example provided);
- we can probably rewrite which.min and which.max to get something faster;
- if we want to chase corner cases (which I am totally happy to do :-)), I would recommend we stick to C and forget about R.

NEWS.md Outdated

microbenchmark::microbenchmark(
x[topn(x, 6L)],
sort(x)[1:6],
Contributor

sort has a partial argument, but it returns the full vector, which means it still needs to be subsetted. Your implementation is still about 5 times faster.


x = rnorm(5e7) 
microbenchmark::microbenchmark(
sort(x, partial = 1:6)[1:6],
x[topn(x, 6L)],
times = 10L
)
##Unit: milliseconds
##                        expr      min       lq     mean   median       uq
## sort(x, partial = 1:6)[1:6] 689.2710 724.5292 730.2857 728.4854 735.3350
##              x[topn(x, 6L)] 128.7904 129.3341 151.5083 134.5046 159.8133

identical(sort(x, partial = 1:6)[1:6],x[topn(x, 6L)])
## [1] TRUE

Contributor Author

Thanks. I have updated the benchmark. Your machine is clearly faster than mine :-)

Member

TIL about sort()'s partial argument, pretty cool.

NEWS.md Outdated
@@ -7,6 +7,7 @@ setcoalesce = function(...) .Call(Ccoalesce, list(...), TRUE)

fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
Member

We also have an na.last argument for data.table:::forder; we should include that here if we really want to replace order(x)[1:n].

Contributor Author

NAs are always last. I did not look at forder, but the behaviour of topn should be similar to order when it comes to NA.
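
A small illustration of that intended behaviour (a sketch, assuming the PR's topn() really does mirror order()'s NA-last default):

x = c(3.5, NA, 1.2, 2.8, NA, 0.4)
order(x)[1:3]     # 6 3 4 -- NAs sort last, so they never show up in the top n
# topn(x, 3L)     # expected to return the same indices under this PR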

}
\arguments{
\item{vec}{ A numeric vector of type numeric or integer. Other types are not supported yet. }
\item{n}{ A positive integer value greater or equal to 1. Maximum value is 1000. }
Member

Why do we put such a maximum?

Contributor Author

Because the algorithm is not built for large values of n. Try removing the max and testing the function with a large n: you will notice a massive reduction in performance above a given threshold. It can even become 100 times slower than order for very large n.
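
A rough way to see that crossover for yourself (a sketch only: it assumes the cap has been lifted in the C code, and the timings are machine-dependent and not shown here):

x = rnorm(1e7)
for (n in c(10L, 100L, 1000L, 10000L)) {
  print(microbenchmark::microbenchmark(
    topn(x, n),              # selection-style scan; cost grows with n
    order(x)[seq_len(n)],    # full ordering; cost roughly independent of n
    times = 3L
  ))
}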

src/data.table.h Outdated
@@ -242,3 +242,4 @@ SEXP testMsgR(SEXP status, SEXP x, SEXP k);
//fifelse.c
SEXP fifelseR(SEXP l, SEXP a, SEXP b, SEXP na);
SEXP fcaseR(SEXP na, SEXP rho, SEXP args);
SEXP topnR(SEXP vec, SEXP n, SEXP dec);
Member

missing trailing newline

@@ -343,3 +343,194 @@ SEXP fcaseR(SEXP na, SEXP rho, SEXP args) {
UNPROTECT(nprotect);
return ans;
}

SEXP topnR(SEXP vec, SEXP n, SEXP dec) {
Member

Why is it in fifelse.c? forder.c seems a more natural place; otherwise its own C file would be fine.

Contributor Author

@2005m 2005m Jan 21, 2020

Feel free to move it wherever you wish. I just did not want to create too many files. There are already enough :-)

src/fifelse.c Outdated
if (isS4(vec) && !INHERITS(vec, char_nanotime)) {
error("S4 class objects (excluding nanotime) are not supported.");
}
if (len0 > 1000) {
Member

If the input is 1e8 values, it should still be faster to do topn(x, 1500), no?

I don't think a fixed value makes sense. Maybe something like len1/2 makes more sense.

Member

Or perhaps just re-route with a warning: do order(x)[1:n] if n > length(x)/2.
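
A minimal R-level sketch of that re-route idea (hypothetical wrapper name topn_safe; in the PR the check would live in the C code):

topn_safe = function(vec, n = 6L, decreasing = FALSE) {
  if (n > length(vec) %/% 2L) {
    warning("'n' is large relative to length(vec); falling back to order()")
    return(order(vec, decreasing = decreasing)[seq_len(min(n, length(vec)))])
  }
  topn(vec, n, decreasing)   # assumes the PR's topn() is available
}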

Contributor Author

Please see my comment above. Thanks.

src/fifelse.c Outdated
switch(tvec) {
case INTSXP: {
const int *restrict pvec = INTEGER(vec);
int min_value = pvec[0];
Member

IINM we haven't already returned early for length-0 input, so this will be an out-of-bounds access.

Member

Also, if we initialize this to INT_MAX, we won't have to access pvec[0] both here and within the next for loop. Does that make sense to do?

Contributor Author

Not sure I understand. Do you mean we need to check that vec has non-zero length? If so, I think you are right; I might need to add a check.

Contributor Author

Would you be able to provide an example that fails? Happy to correct it if needed.

src/fifelse.c Outdated
if (pvec[i] > min_value) {
min_value = pvec[i];
pans[idx] = i;
for (j = 0; j <len0; ++j) {
Member

I'll have to read more carefully, but it looks like we do a full re-sort of the len0 indices each time we find a new candidate, is that right? Can we try comparing against a min-heap implementation?

Contributor Author

No, we don't do a full re-sort; we just recompute the min/max value and its position so it can be compared against future values.

Contributor Author

The sorting of the final vector is done in the next loop. Maybe I need to add some comments to explain what I did?
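
For readers following along, here is an R-level sketch of what the author describes (illustrative only: it ignores NAs, assumes 1 <= n < length(x), and is not the C implementation):

topn_sketch = function(x, n) {
  keep  = seq_len(n)                    # indices of the current best (smallest) n values
  worst = keep[which.max(x[keep])]      # index of the worst kept value
  for (i in (n + 1L):length(x)) {
    if (x[i] < x[worst]) {
      keep[keep == worst] = i           # replace the worst kept candidate
      worst = keep[which.max(x[keep])]  # O(n) rescan for the new worst, not a full sort
    }
  }
  keep[order(x[keep])]                  # the n kept indices are sorted once, at the end
}

x = rnorm(100)
identical(topn_sketch(x, 5L), order(x)[1:5])   # TRUE for typical NA-free numeric input (ties aside)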

@2005m 2005m closed this Apr 8, 2020
@2005m 2005m deleted the topn branch April 8, 2020 13:23
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
Successfully merging this pull request may close these issues.

New function: topn for efficiently doing sorted head/tail