-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup of unique.data.table #2474
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2474 +/- ##
==========================================
- Coverage 91.44% 91.44% -0.01%
==========================================
Files 63 63
Lines 12070 12093 +23
==========================================
+ Hits 11038 11058 +20
- Misses 1032 1035 +3
Continue to review full report at Codecov.
|
@mattdowle any initial thoughts? Copying my comment from the issue for quicker reference Two remaining things can be done:
|
…y logical vector length nrow. Fails two tests but pushing to PR to park it and switch to another PR.
if (is.character(by)) by=chmatch(by, names(x)) | ||
if (is.character(by)) { | ||
w = chmatch(by, names(x)) | ||
if (anyNA(w)) stop("'by' contains '",by[is.na(w)][1],"' which is not a column name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heads up that anyNA
is R 3.1, in case we decide to keep the 3.0 dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here changing to any(is.na(.))
won't hurt much, I don't think grouping by million of columns would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment for completeness ... yep good point, we're now >= R 3.1 for other reasons, so can use anyNA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -2209,9 +2210,10 @@ na.omit.data.table <- function (object, cols = seq_along(object), invert = FALSE | |||
} | |||
cols = as.integer(cols) | |||
ix = .Call(Cdt_na, object, cols) | |||
ans = .Call(CsubsetDT, object, which_(ix, bool = invert), seq_along(object)) | |||
if (any(ix)) setindexv(ans, NULL)[] else ans #1734 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is removing index now? Won't we end up with corrupted index here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jangorecki hmm can't say I know... @mattdowle I think this came from your commit: aacf2b9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK it will make index corrupted. Once #1762 will be solved we don't need to care about that anymore. Could you ensure there is a unit test for this index corruption now? @MichaelChirico update: probably solved already by Matt, details in mentioned issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jangorecki Yes, as you noted in #1762 the index is removed inside CsubsetDT. It's better to remove it in a central place at C level as close to where the update happens, rather than having to remember to clear the index each time we call CsubsetDT. It contains the comment "// clear any index that was copied over by copyMostAttrib() above, e.g. #1760 and #1734 (test 1678)"
Closes #2013
@MichaelChirico started branch.