New function topn #4187
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #4187     +/-   ##
=========================================
  Coverage   99.61%   99.61%
=========================================
  Files          72       72
  Lines       13916    14046     +130
=========================================
+ Hits        13862    13992     +130
  Misses         54       54
=========================================

Continue to review full report at Codecov.
src/fifelse.c
Outdated
error("Function 'topn' is not built for large value of 'n'. The algorithm is made for small values. Please prefer the 'order' if you want to proceed with such large value."); | ||
} | ||
if (len0 > len1) { | ||
error("'n' cannot be larger than length of 'vec'."); |
A very likely use case is topn by group. I think most people familiar with top_n would expect this to work:
set.seed(123L)
dt = data.table(x = rnorm(10), grp = sample(5L, 10L, replace = T))
dt[dt[, .I[topn(x, 2L)], by = grp]$V1, ]
##Error: 'n' cannot be larger than length of 'vec'.
so you suggest we return something like below?
> dt[dt[, .I[order(x)[1:2]], by = grp]$V1, ]
x grp
1: -0.6868529 1
2: -0.5604756 1
3: -0.2301775 4
4: NA NA
5: 0.1292877 5
6: NA NA
7: -0.4456620 3
8: 1.7150650 3
9: -1.2650612 2
10: 0.4609162 2
My preference would be to return seq_len(n) and maybe have a warning that there are fewer records than n.
dt[dt[, .I[order(x)[1:min(.N,2)]], by = grp]$V1, ]
x grp
<num> <int>
1: -0.6868529 1
2: -0.5604756 1
3: -0.2301775 4
4: 0.1292877 5
5: -0.4456620 3
6: 1.7150650 3
7: -1.2650612 2
8: 0.4609162 2
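As a concrete illustration of that preference, here is a minimal R sketch (a hypothetical wrapper named topn_clamped, not part of this PR) that clamps n to the number of available records and warns instead of erroring:

# Hypothetical helper, not part of this PR: clamp 'n' to length(x) and warn
# instead of erroring when a group has fewer than 'n' records.
topn_clamped = function(x, n, ...) {
  if (n > length(x)) {
    warning("'n' is larger than length(x); returning only ", length(x), " indices")
    n = length(x)
  }
  topn(x, n, ...)
}
# dt[dt[, .I[topn_clamped(x, 2L)], by = grp]$V1, ]   # no error for small groups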
P.S., this PR is great.
set.seed(123)
dt = data.table(x = rnorm(1e7), grp = sample(100, 1e7, TRUE))
microbenchmark::microbenchmark(
dt[, .I[topn(x, 5L)], by = grp],
dt[order(-x), .I[seq_len(5L)], by = grp],
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp], ##closest to dplyr::top_n
times = 5L
)
Unit: milliseconds
expr min lq mean median uq max neval
dt[, .I[topn(x, 5L)], by = grp] 157.9859 158.9587 161.4327 160.2456 160.6012 169.3721 5
dt[order(-x), .I[seq_len(5L)], by = grp] 1336.8226 1343.3002 1347.5568 1345.3929 1350.1008 1362.1676 5
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp] 1432.7179 1459.2029 1529.9473 1498.1096 1626.5848 1633.1211 5
It should now be working like you asked.
Thanks!
This is more for future thought. I was also going to suggest the maintainers look at updating the Database-like ops benchmark for top n by group as dt[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]. But as the number of groups increases, there seems to be some overhead. I assume this is a memory and gforce type of issue. I can log a separate issue once this is merged for better grouping performance.
Details, benchmark, and verbose message
There are two rows marked with ******* which identify the main performance difference: for topn, collecting discontiguous groups took 0.616s for 10000 groups, versus only 0.016s for the other method.
##remotes::install_github("Rdatatable/data.table", ref = "topn")
library(data.table)
N = 1000000L; K = 100L; set.seed(108L)
DT = data.table(id6 = sample(N/K, N, TRUE), v3 = runif(N, max = 100))
identical(DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)][order(-largest2_v3, id6)],
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6][order(-largest2_v3, id6)])
#> [1] TRUE
microbenchmark::microbenchmark(
DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)],
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
, times = 10L
)
#> Unit: milliseconds
#> expr
#> DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)]
#> DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
#> min lq mean median uq max neval
#> 242.2958 248.2904 254.6175 250.4815 255.1830 288.5708 10
#> 674.9867 680.0911 722.0210 682.1045 697.4592 915.3662 10
DT[DT[order(-v3), head(.I, 2L), by = id6, verbose = T]$V1, .(id6, largest2_v3 = v3)]
#> forder.c received 1000000 rows and 1 columns
#> i clause present and columns used in by detected, only these subset: id6
#> Detected that j uses these columns: <none>
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'head(.I, 2L)'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.016s for 10000 groups *********
#> eval(j) took 0.090s for 10000 calls
#> 0.110s elapsed (0.110s cpu)
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6, verbose = TRUE]
#> Detected that j uses these columns: v3
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(v3[topn(v3, 2L, decreasing = TRUE)])'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.616s for 10000 groups **********
#> eval(j) took 0.034s for 10000 calls
#> 0.690s elapsed (0.440s cpu)
@@ -7,6 +7,7 @@ setcoalesce = function(...) .Call(Ccoalesce, list(...), TRUE)

fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
Should the default be decreasing = TRUE? When I think top I think highest values. If it helps, dplyr::top_n uses the sign of n to determine whether the top n or bottom n are returned.
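For reference, a small sketch of how that sign convention could be layered on top of topn (a hypothetical wrapper named topn_signed, not something this PR proposes):

# Hypothetical wrapper, not part of this PR: mimic dplyr::top_n's convention,
# where a positive n returns the n highest values and a negative n the n lowest.
topn_signed = function(x, n) topn(x, abs(n), decreasing = n > 0L)
# topn_signed(x,  3L)   # indices of the 3 largest values
# topn_signed(x, -3L)   # indices of the 3 smallest values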
I totally don't mind. The way I implemented it was to be in line with order and sort. dplyr::top_n does not work for me.
I think decreasing=TRUE may be the more common use case, but practically speaking most sorting functions in R and elsewhere default to decreasing=FALSE.
More practically, in other places in data.table we use lazy evaluation to allow - to mean decreasing=TRUE and switch the argument without requiring the vector to be negated (which should be efficient). In particular, for character input - doesn't work for "flipping the sign".
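A quick illustration of that last point (not code from the PR):

# '-' cannot flip the sort direction for character input,
# whereas decreasing = TRUE works for any type order() supports.
x = c("b", "a", "c")
try(order(-x))                # Error: invalid argument to unary operator
order(x, decreasing = TRUE)   # 3 1 2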
To answer your question, on x = rnorm(5e7):
microbenchmark::microbenchmark(
  topn(x, 1L),
  which.max(x), times = 10L
)
# Unit: milliseconds
#          expr       min        lq     mean    median        uq       max neval
#   topn(x, 1L)  58.00159  60.38602  63.5568  63.06315  65.64793  71.91218    10
#  which.max(x) 135.89879 153.44461 161.4339 163.98896 173.29448 176.23781    10
And it was not even my intention to beat which.max.
NEWS.md
Outdated
microbenchmark::microbenchmark(
  x[topn(x, 6L)],
  sort(x)[1:6],
sort has a partial argument, but it returns the full vector, which means it still needs to be subsetted. Your implementation is still 6 times faster.
x = rnorm(5e7)
microbenchmark::microbenchmark(
sort(x, partial = 1:6)[1:6],
x[topn(x, 6L)],
times = 10L
)
##Unit: milliseconds
## expr min lq mean median uq
## sort(x, partial = 1:6)[1:6] 689.2710 724.5292 730.2857 728.4854 735.3350
## x[topn(x, 6L)] 128.7904 129.3341 151.5083 134.5046 159.8133
identical(sort(x, partial = 1:6)[1:6],x[topn(x, 6L)])
## [1] TRUE
Thanks. I have updated the benchmark. Your machine is clearly faster than mine :-)
TIL about sort's partial argument, pretty cool.
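For anyone else discovering it here, a tiny illustration of what sort()'s partial contract guarantees (illustrative only, not code from the PR):

# Only the positions listed in 'partial' are guaranteed to hold the values a
# full sort would put there; the rest of the vector is in unspecified order.
set.seed(1L)
z = sample(100L)
zp = sort(z, partial = 1:3)
identical(zp[1:3], sort(z)[1:3])   # TRUE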
@@ -7,6 +7,7 @@ setcoalesce = function(...) .Call(Ccoalesce, list(...), TRUE)

fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
We also have an na.last argument for data.table:::forder; we should include that here if we really want to replace order(x)[1:n].
NAs are always last. I did not look at forder, but the behaviour of topn should be similar to order when it comes to NA.
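For comparison, here is what na.last controls in base order() (an illustration only, not code from this PR); keeping NA last matches order()'s default:

v = c(3, NA, 1, 2)
order(v)                    # 3 4 1 2  (default na.last = TRUE: NA sorts last)
order(v, na.last = FALSE)   # 2 3 4 1  (NA sorts first)
order(v, na.last = NA)      # 3 4 1    (NA dropped)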
}
\arguments{
  \item{vec}{ A numeric vector of type numeric or integer. Other types are not supported yet. }
  \item{n}{ A positive integer value greater or equal to 1. Maximum value is 1000. }
Why do we put such a maximum?
Because the algorithm is not built for large numbers. Try removing the max and testing the function with a large number: you will notice a massive reduction in performance above a given threshold. It can even become 100 times slower than order for very large numbers.
src/data.table.h
Outdated
@@ -242,3 +242,4 @@ SEXP testMsgR(SEXP status, SEXP x, SEXP k);
//fifelse.c
SEXP fifelseR(SEXP l, SEXP a, SEXP b, SEXP na);
SEXP fcaseR(SEXP na, SEXP rho, SEXP args);
SEXP topnR(SEXP vec, SEXP n, SEXP dec);
missing trailing newline
@@ -343,3 +343,194 @@ SEXP fcaseR(SEXP na, SEXP rho, SEXP args) {
  UNPROTECT(nprotect);
  return ans;
}

SEXP topnR(SEXP vec, SEXP n, SEXP dec) {
Why is it in fifelse.c? Seems forder.c would be a more natural place; otherwise its own C file should be fine.
Feel free to add it wherever you wish. I just did not want to create too many files. There are already enough :-)
src/fifelse.c
Outdated
  if (isS4(vec) && !INHERITS(vec, char_nanotime)) {
    error("S4 class objects (excluding nanotime) are not supported.");
  }
  if (len0 > 1000) {
If the input is 1e8 values it should still be faster to do topn(x, 1500), no?
I don't think a fixed value makes sense; maybe something like len1/2 makes more sense.
Or perhaps just re-route with a warning: do order(x)[1:n] if n > length(x)/2.
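A rough sketch of that re-routing idea (a hypothetical wrapper, not what the PR currently does):

# Hypothetical re-route, not part of this PR: fall back to order() with a
# warning when 'n' is a large fraction of the input and the scan won't pay off.
topn_or_order = function(x, n, decreasing = FALSE) {
  if (n > length(x) / 2) {
    warning("'n' is large relative to length(x); falling back to order()")
    return(head(order(x, decreasing = decreasing), n))
  }
  topn(x, n, decreasing)
}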
Please see my comment above. Thanks.
src/fifelse.c
Outdated
  switch(tvec) {
  case INTSXP: {
    const int *restrict pvec = INTEGER(vec);
    int min_value = pvec[0];
IINM (if I'm not mistaken) we haven't already returned on length-0 input, so this will be an out-of-bounds access.
Also, if we initialize this to INT_MAX, we won't have to access pvec[0] both here and within the next for loop. Does that make sense to do?
Not sure I understand. Do you mean we need to check that vec is of non-zero length? If yes, I think you are right; I might need to add a check.
Would you be able to provide an example that fails? Happy to correct it if needed.
src/fifelse.c
Outdated
    if (pvec[i] > min_value) {
      min_value = pvec[i];
      pans[idx] = i;
      for (j = 0; j <len0; ++j) {
I'll have to read more carefully, but it looks like we do a full re-sort of the len0 indices each time we find a new candidate, is that right? Can we try comparing it to a min-heap implementation?
No, we don't do a full re-sort, we just recompute the min/max value and its position to be able to compare it to future values.
The sorting of the final vector is done in the next loop. Maybe I need to add some comments to explain what I did?
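For readers following along, here is an R sketch of that scan-and-replace idea (illustrative only; the actual implementation is the C code under review, and this sketch assumes a numeric vector without ties or NAs, using the default increasing order):

topn_sketch = function(vec, n = 6L) {
  ans = seq_len(n)                   # buffer: indices of the n smallest seen so far
  worst = which.max(vec[ans])        # slot holding the largest of the kept values
  if (length(vec) > n) {
    for (i in (n + 1L):length(vec)) {
      if (vec[i] < vec[ans[worst]]) {
        ans[worst] = i                 # replace the current worst candidate
        worst = which.max(vec[ans])    # rescan only the n-slot buffer, not 'vec'
      }
    }
  }
  ans[order(vec[ans])]               # final sort touches just the n kept values
}
# topn_sketch(x, 6L) should match order(x)[1:6] for a numeric x without ties or NAs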
Closes #3804