Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New function topn #4187
New function topn #4187
Changes from 14 commits
0807b1d
aae0158
dfbfb28
051886d
dd1772d
535bae8
656ff75
4e24c8f
8322e9b
8f3c977
ded01ce
d2d1921
6f322e4
c22c140
af37e1c
815729a
467c2c4
f34b759
37f503e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have
na.last
argument fordata.table:::forder
, we should include that here if we really want to replaceorder(x)[1:n]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
na
are always last, I did not look atforder
but the behaviour oftopn
should be similar toorder
when it comes toNA
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default be
decreasing = TRUE
? When I think top I think highest values. If it helps,dplyr::top_n
uses the sign of n to determine whether the top n or bottom n are returned.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally don't mind. The way I implemented it was to be in line with
order
andsort
.dplyr::top_n
does not work for me.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
decreasing=TRUE
may be the more common use case, but practically speaking most sorting functions in R and elsewhere are doingdecreasing=FALSE
by default.More practically, in other places in
data.table
we use lazy evaluation to allow-
to meandecreasing=TRUE
and switch the argument without requiring to negate the vector (should be efficient). In particular for character input-
doesn't work for "flipping the sign".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we put such a maximum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the algorithm is not build fir large number. Try to remove the max and test the function for large number. You will notice a massive reduction in performance above a given threshold. It can even become 100 times slower than
order
for very large number.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it in
fifelse.c
? Seemsforder.c
would be a more natural place, otherwise its own C file should be fineThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to add it wherever you wish. I just did not want to create too many files. There are already enough :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If input is 1e8 values it should still be faster to do
topn(x, 1500)
, no?A fixed value I don't think makes sense. Maybe something like
len1/2
makes more sense.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps just re-route with warning -- do
order(x)[1:n]
ifn>length(x)/2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment above. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IINM we haven't already returned length-0 input, so this will be out-of-memory access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, if we initialize this to
INT_MAX
, we won't have to accesspvec[0]
both here and within the nextfor
loop, does that make sense to do?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure to understand. Do you mean we need to check that
vec
is of non null length? If yes, I think you are right I might need to add a check.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to provide an example that fails? Happy to correct it if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to read more carefully, but it looks like we do a full re-sort of the
len0
indices each time we find a new candidate is that right? Can we try to compare the min-heap implementation?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't do a full re-sort, we just recompute the min/max value and its position to be able to compare it to future values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sorting of the final vector is done in the next loop. Maybe I need to add some comments to explain what I did?