Speedup of unique.data.table #2474

mattdowle · 2017-11-10T00:24:53Z

Closes #2013
@MichaelChirico started branch.

codecov-io · 2017-11-10T00:24:57Z

Codecov Report

Merging #2474 into master will decrease coverage by <.01%.
The diff coverage is 97.95%.

@@            Coverage Diff             @@
##           master    #2474      +/-   ##
==========================================
- Coverage   91.44%   91.44%   -0.01%     
==========================================
  Files          63       63              
  Lines       12070    12093      +23     
==========================================
+ Hits        11038    11058      +20     
- Misses       1032     1035       +3

Impacted Files      Coverage Δ
src/init.c          93.22% <100%> (+0.11%)     ⬆️
src/subset.c        97.34% <100%> (+0.05%)     ⬆️
R/data.table.R      97.17% <100%> (ø)          ⬆️
R/fcast.R           86.76% <100%> (ø)          ⬆️
R/setkey.R          94.16% <100%> (+0.15%)     ⬆️
R/foverlaps.R       94.3% <100%> (ø)           ⬆️
R/duplicated.R      95.12% <93.75%> (-4.88%)   ⬇️
src/forder.c        97.86% <0%> (+0.13%)       ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e871a4f...0178d4b.

MichaelChirico · 2017-11-10T06:17:14Z

@mattdowle any initial thoughts? Copying my comment from the issue for quicker reference:

Two remaining things can be done:

Confirm whether forderv can short-circuit early when we're only checking for uniqueness and have already established it before iterating over all columns. Would this require a new argument to forderv?
Running duplicated.data.table within unique.data.table still requires allocating and returning rep.int(FALSE, nrow(x)), which is probably slow. Would it be better to split the logic of unique.data.table so that we can just return(x) instead?
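
A minimal sketch of the second point, using base R data frames rather than data.table internals. The function names `unique_via_duplicated` and `unique_early_return` are hypothetical and exist only for illustration; the actual change in this PR calls forderv directly.

```r
# Baseline: always builds a logical vector of length nrow(x),
# even when every row is already unique.
unique_via_duplicated <- function(x) {
  dup <- duplicated(x)
  x[!dup, , drop = FALSE]
}

# Split logic: check for any duplicate first and return the input
# untouched when there is none, skipping the subset entirely.
unique_early_return <- function(x) {
  if (!anyDuplicated(x)) return(x)
  x[!duplicated(x), , drop = FALSE]
}

already_unique <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
has_dups       <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

identical(unique_early_return(already_unique), already_unique)
nrow(unique_early_return(has_dups))
```

The early return only pays off when the input is already unique, which is the case the issue is concerned with.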

MichaelChirico · 2017-11-15T02:18:28Z

R/setkey.R

-    if (is.character(by)) by=chmatch(by, names(x))
+    if (is.character(by)) {
+      w = chmatch(by, names(x))
+      if (anyNA(w)) stop("'by' contains '",by[is.na(w)][1],"' which is not a column name")


Heads up that anyNA requires R 3.1, in case we decide to keep the 3.0 dependency.

jangorecki · Nov 15, 2017

Changing to any(is.na(.)) here won't hurt much; I don't think grouping by millions of columns would be useful.

mattdowle · Jan 12, 2018

Comment for completeness ... yep good point, we're now >= R 3.1 for other reasons, so can use anyNA.
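
For reference, a small sketch of the check discussed in this thread. It assumes base R's `match()` as a stand-in for data.table's `chmatch()`, and `resolve_by` is a hypothetical helper name, not a function in the package.

```r
# Hypothetical helper mirroring the check in the diff above.
resolve_by <- function(by, col_names) {
  if (is.character(by)) {
    w <- match(by, col_names)  # stand-in for chmatch(by, names(x))
    # anyNA(w) needs R >= 3.1.0; any(is.na(w)) gives the same answer on older R
    if (anyNA(w)) stop("'by' contains '", by[is.na(w)][1], "' which is not a column name")
    by <- w
  }
  by
}

resolve_by(c("b", "a"), c("a", "b", "c"))  # column positions: 2 1
```

Since `by` is normally a handful of column names, the two spellings differ in readability far more than in speed here.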

jangorecki

@MichaelChirico

jangorecki · 2018-01-14T01:21:12Z

R/data.table.R

@@ -2209,9 +2210,10 @@ na.omit.data.table <- function (object, cols = seq_along(object), invert = FALSE
  }
  cols = as.integer(cols)
  ix = .Call(Cdt_na, object, cols)
-  ans = .Call(CsubsetDT, object, which_(ix, bool = invert), seq_along(object))
-  if (any(ix)) setindexv(ans, NULL)[] else ans #1734


Where is the index removed now? Won't we end up with a corrupted index here?

MichaelChirico · Jan 15, 2018

@jangorecki hmm can't say I know... @mattdowle I think this came from your commit: aacf2b9

jangorecki · Jan 26, 2018

AFAIK it will corrupt the index. Once #1762 is solved we won't need to care about that anymore. Could you ensure there is a unit test for this index corruption now? @MichaelChirico Update: probably already solved by Matt; details are in the mentioned issue.

mattdowle · Feb 6, 2018

@jangorecki Yes, as you noted in #1762 the index is removed inside CsubsetDT. It's better to remove it in a central place at C level, as close as possible to where the update happens, rather than having to remember to clear the index each time we call CsubsetDT. It contains the comment "// clear any index that was copied over by copyMostAttrib() above, e.g. #1760 and #1734 (test 1678)"

Initial assay of #2013 -- free speedup of unique.data.table

4331dd9

Merge branch 'master' into unique_speedup

5735c09

unique.data.table reworked to call forderv directly to avoid temporar…

96dc495

…y logical vector length nrow. Fails two tests but pushing to PR to park it and switch to another PR.

MichaelChirico reviewed Nov 15, 2017

mattdowle mentioned this pull request Nov 21, 2017

Closes #2046 and closes #2111. Fixes -ve length vectors issue with GForce #2480

Merged

mattdowle added 3 commits January 11, 2018 15:09

Merge branch 'master' into unique_speedup

4f2035f

Pass final 2 tests and na.omit() no longer copies if there are no NAs

aacf2b9

Upgrade all any(is.na(.)) to anyNA(.) now we depend on R 3.1

f9b9619

mattdowle added this to the v1.10.6 milestone Jan 12, 2018

mattdowle added 2 commits January 12, 2018 13:26

Added news item and too-big-for-cran test as comment

84c44d8

Added test for #2013

0178d4b

mattdowle merged commit 1fd3862 into master Jan 12, 2018

mattdowle deleted the unique_speedup branch January 12, 2018 22:35

jangorecki reviewed Jan 14, 2018
