Speedup of unique.data.table #2474

Merged
merged 8 commits into from
Jan 12, 2018

mattdowle
Member

@mattdowle mattdowle commented Nov 10, 2017

Closes #2013
@MichaelChirico started branch.

@codecov-io

codecov-io commented Nov 10, 2017

Codecov Report

Merging #2474 into master will decrease coverage by <.01%.
The diff coverage is 97.95%.

@@            Coverage Diff             @@
##           master    #2474      +/-   ##
==========================================
- Coverage   91.44%   91.44%   -0.01%     
==========================================
  Files          63       63              
  Lines       12070    12093      +23     
==========================================
+ Hits        11038    11058      +20     
- Misses       1032     1035       +3

Impacted Files      Coverage Δ
src/init.c          93.22% <100%> (+0.11%) ⬆️
src/subset.c        97.34% <100%> (+0.05%) ⬆️
R/data.table.R      97.17% <100%> (ø) ⬆️
R/fcast.R           86.76% <100%> (ø) ⬆️
R/setkey.R          94.16% <100%> (+0.15%) ⬆️
R/foverlaps.R       94.3% <100%> (ø) ⬆️
R/duplicated.R      95.12% <93.75%> (-4.88%) ⬇️
src/forder.c        97.86% <0%> (+0.13%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e871a4f...0178d4b.

@MichaelChirico
Member

MichaelChirico commented Nov 10, 2017

@mattdowle any initial thoughts? Copying my comment from the issue for quicker reference:


Two remaining things can be done:

  • Confirm whether forderv can short-circuit early when we're only checking for uniqueness and have already established it before iterating over all columns. Would this require a new argument to forderv?

  • Running duplicated.data.table within unique.data.table still necessitates declaring/returning the object rep.int(FALSE, nrow(x)), which is probably slow. Better to split the logic of unique.data.table so we can just return(x) instead?
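
The second point can be illustrated in base R, outside data.table. This is a minimal sketch, assuming a hypothetical helper `unique_rows` that works on a plain data.frame; it is not the code in this PR. It checks for duplicates first with `anyDuplicated()` and returns the input untouched when there are none, so the all-FALSE logical vector is never built.

```r
# Hypothetical helper for illustration only; not data.table's implementation.
unique_rows <- function(x) {
  # anyDuplicated() returns 0L when no row repeats, so the common
  # "already unique" case exits without allocating a logical mask.
  if (anyDuplicated(x) == 0L) return(x)
  # Otherwise fall back to the usual mask-and-subset path.
  x[!duplicated(x), , drop = FALSE]
}

df_unique <- data.frame(a = 1:3, b = c("x", "y", "z"))
df_dups   <- data.frame(a = c(1L, 1L, 2L), b = c("x", "x", "y"))

res_unique <- unique_rows(df_unique)  # early return of the input itself
res_dups   <- unique_rows(df_dups)    # repeated row is dropped
```

Whether the early exit pays off in data.table depends on how cheaply uniqueness can be established, which is what the first point about forderv asks.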

…y logical vector length nrow. Fails two tests but pushing to PR to park it and switch to another PR.
if (is.character(by)) by=chmatch(by, names(x))
if (is.character(by)) {
w = chmatch(by, names(x))
if (anyNA(w)) stop("'by' contains '",by[is.na(w)][1],"' which is not a column name")
Member

Heads up that anyNA was introduced in R 3.1.0, in case we decide to keep the R 3.0 dependency.

Member

Changing to any(is.na(.)) here won't hurt much; I don't think grouping by millions of columns would be useful.

Member Author

Comment for completeness ... yep good point, we're now >= R 3.1 for other reasons, so can use anyNA
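
For reference, the two spellings discussed in this thread give the same result on the match vector. A small sketch, using base R `match()` as a stand-in for data.table's `chmatch()`; the column names and the `by` value are made up for illustration.

```r
# match() is a base R stand-in for data.table's chmatch() here.
cols <- c("a", "b", "c")   # hypothetical column names
by   <- c("a", "z")        # "z" is not a column

w <- match(by, cols)       # NA for names that are not columns

# Both forms agree; anyNA() needs R >= 3.1.0.
has_na_new <- anyNA(w)
has_na_old <- any(is.na(w))

# First offending name, as used in the error message under review.
bad <- by[is.na(w)][1]
```

Since `w` has one element per `by` column, the extra logical vector that `any(is.na(.))` allocates is negligible in practice.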

@mattdowle mattdowle added this to the v1.10.6 milestone Jan 12, 2018
@mattdowle mattdowle merged commit 1fd3862 into master Jan 12, 2018
@mattdowle mattdowle deleted the unique_speedup branch January 12, 2018 22:35
Member

@jangorecki jangorecki left a comment

@@ -2209,9 +2210,10 @@ na.omit.data.table <- function (object, cols = seq_along(object), invert = FALSE
}
cols = as.integer(cols)
ix = .Call(Cdt_na, object, cols)
ans = .Call(CsubsetDT, object, which_(ix, bool = invert), seq_along(object))
if (any(ix)) setindexv(ans, NULL)[] else ans #1734
Member

Where is the index removed now? Won't we end up with a corrupted index here?

Member

@MichaelChirico MichaelChirico Jan 15, 2018

@jangorecki hmm can't say I know... @mattdowle I think this came from your commit: aacf2b9

Member

@jangorecki jangorecki Jan 26, 2018

AFAIK it will corrupt the index. Once #1762 is solved we won't need to care about that anymore. Could you ensure there is a unit test for this index corruption now? @MichaelChirico Update: probably already solved by Matt; details are in the mentioned issue.

Member Author

@mattdowle mattdowle Feb 6, 2018

@jangorecki Yes, as you noted in #1762 the index is removed inside CsubsetDT. It's better to remove it in a central place at C level as close to where the update happens, rather than having to remember to clear the index each time we call CsubsetDT. It contains the comment "// clear any index that was copied over by copyMostAttrib() above, e.g. #1760 and #1734 (test 1678)"
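
The design argument above (clear the stale index once, where the subset is made) can be sketched in plain base R. This is only an analogy: `subset_rows` is a hypothetical helper, and the `"index"` attribute on a data.frame merely mirrors the discussion; it is not how CsubsetDT is written.

```r
# Hypothetical helper; plain base R, not CsubsetDT.
subset_rows <- function(x, rows) {
  ans <- x[rows, , drop = FALSE]
  # Clear the cached index in one central place, whether or not the
  # subsetting step carried it over, so callers never have to remember.
  attr(ans, "index") <- NULL
  ans
}

df <- data.frame(a = c(3L, 1L, 2L))
attr(df, "index") <- order(df$a)   # cached ordering of the original rows

ans <- subset_rows(df, c(1L, 3L))  # the old ordering no longer applies
```

Every caller of the helper gets the cleanup for free, which is the point made about not having to clear the index after each call to CsubsetDT.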

4 participants