unique(DT) when there are no dups could be much faster #2013

mattdowle · 2017-02-03T08:18:36Z

(The new default of using all columns brings this to the fore.)

DT = data.table(A=1:3, B=4:6)
DT
   A B
1: 1 4
2: 2 5
3: 3 6
debug(duplicated.data.table)
debug(unique.data.table)
unique(DT)

/duplicated.R#22:
Browse[3]> o
integer(0)
attr(,"starts")
[1] 1 2 3
attr(,"maxgrpn")
[1] 1

So at this point it knows that DT is unique and it could return it or a shallow copy straight away. But it doesn't. It carries on to turn all-FALSE into 1:nrow and then subset every column by that 1:nrow.
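A minimal sketch of that early return, for illustration only. `unique_shortcut` is a hypothetical helper, not part of data.table, and `forderv` is an unexported internal whose interface may change between versions. The check relies on the `maxgrpn` attribute seen in the debug output above: a value of 1 means no group has more than one row.

```r
library(data.table)

# Hypothetical helper (not part of data.table): skip the subsetting step
# when the ordering pass has already shown that every row is unique.
unique_shortcut = function(x) {
  # forderv is internal; it is reached via ::: here purely for illustration
  o = data.table:::forderv(x, sort = FALSE, retGrp = TRUE)
  # maxgrpn is the size of the largest group; 1 means there are no duplicates
  if (isTRUE(attr(o, "maxgrpn") == 1L)) return(x)  # a shallow copy is the alternative
  unique(x)
}

DT = data.table(A = 1:3, B = 4:6)
res = unique_shortcut(DT)   # takes the early return: no 1:nrow subset is built
```

Whether to return `x` itself or a shallow copy is a design choice this sketch leaves open, as the comment above does.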

Also, forderv should be timed to make sure it short-circuits correctly once the first few columns have resolved all ties. In this example forderv should not touch B at all, because A alone is enough to establish uniqueness.
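One rough way to run that timing check, sketched under the assumption that the internal `forderv` accepts integer column indices in `by` and a `retGrp` argument. The table size and column contents are made up for illustration. If the short-circuit works, ordering by A and B should cost about the same as ordering by A alone, since A is already unique.

```r
library(data.table)

set.seed(1)
N = 1e6
# A is a permutation of 1:N, so A alone already identifies every row
DT = data.table(A = sample(N), B = sample(N))

# forderv is internal (unexported); calling it via ::: is for investigation only
t_A  = system.time(o_A  <- data.table:::forderv(DT, by = 1L,  retGrp = TRUE))
t_AB = system.time(o_AB <- data.table:::forderv(DT, by = 1:2, retGrp = TRUE))

# Compare elapsed seconds; a large gap would suggest B is still being processed
print(c(A_only = t_A[["elapsed"]], A_and_B = t_AB[["elapsed"]]))
```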

MichaelChirico · 2017-10-19T15:04:44Z

Working on this on branch unique_speedup

For now, just inserted the short-circuit you mentioned in duplicated.data.table. Speed-up from doing this alone seems to be about 30% (regardless of # of rows). Speed testing script:

# dup_timing.R
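# First command-line argument selects the build: 'old' or anything else for new.
# identical() avoids an NA condition when the script is run without arguments.
use_old = identical(commandArgs(trailingOnly = TRUE)[1L], 'old')

# 'old' installs from the repository; otherwise install the locally built tarball
repos = if (use_old) 'http://Rdatatable.github.io/data.table' else NULL
pkgs = if (use_old) 'data.table' else '~/data.table_1.10.5.tar.gz'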

remove.packages('data.table')
install.packages(pkgs, type = 'source', repos = repos)
library(data.table)


set.seed(039203)
NN = 1e8
DT = data.table(
  A = sample(1000, NN, TRUE),
  B = sample(1000, NN, TRUE),
  C = sample(1000, NN, TRUE)
)
DT = unique(DT)

system.time(unique(DT))

# timing_runs.sh
Rscript dup_timing.R old
Rscript dup_timing.R new

This is free and required almost no effort.

Two remaining things can be done:

1. Confirm that forderv can be short-circuited early when we are only checking for uniqueness and have established it before iterating over all columns. Does this require a new argument to forderv?
2. Running duplicated.data.table within unique.data.table still means allocating and returning rep.int(FALSE, nrow(x)), which is probably slow. Would it be better to split the logic of unique.data.table so that it can just return(x) instead?

mattdowle added this to the v1.10.6 milestone Feb 3, 2017

arunsrinivasan added the enhancement label Mar 30, 2017

st-pasha added the performance label Jul 13, 2017

MichaelChirico added a commit that referenced this issue Oct 19, 2017

Initial assay of #2013 -- free speedup of unique.data.table

4331dd9

mattdowle mentioned this issue Nov 10, 2017

Speedup of unique.data.table #2474

Merged

mattdowle added a commit that referenced this issue Jan 12, 2018

Added test for #2013

0178d4b

mattdowle closed this as completed in #2474 Jan 12, 2018
