So at this point it already knows that DT is unique and could return it (or a shallow copy) straight away. But it doesn't: it carries on, converting the all-FALSE duplicated vector into 1:nrow and then subsetting every column by that 1:nrow.
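A minimal sketch of that short-circuit (hypothetical; the real change would live inside data.table's own unique.data.table):

```r
library(data.table)

# Hypothetical sketch: once duplicated() reports no duplicates at all,
# return a copy of DT immediately instead of converting the all-FALSE
# vector into 1:nrow and subsetting every column by it.
unique_sketch <- function(DT) {
  dups <- duplicated(DT)
  if (!any(dups)) return(copy(DT))  # DT already unique: skip the subset
  DT[!dups]
}

DT <- data.table(A = 1:5, B = c(1, 1, 2, 2, 3))
nrow(unique_sketch(DT))  # 5: all rows kept, no per-column subsetting done
```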
Also, forderv should be timed to make sure it short-circuits correctly once the first few columns resolve all ambiguities. In this example forderv should not touch B at all, because A alone is enough to reach uniqueness.
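The intended control flow could be illustrated in plain R like this (only a sketch of the early exit; forderv itself does this at C level, and repeatedly re-checking prefixes as below would of course not be how it is implemented):

```r
library(data.table)

# Sketch of the early exit: test uniqueness one column at a time and stop
# as soon as the columns seen so far already distinguish every row.
# Returns the number of columns needed, or 0L if duplicates remain.
is_unique_early <- function(DT) {
  for (j in seq_along(DT)) {
    if (!anyDuplicated(DT[, seq_len(j), with = FALSE])) {
      return(j)  # columns 1..j suffice; later columns are never inspected
    }
  }
  0L
}

DT <- data.table(A = 1:6, B = rep(1L, 6))
is_unique_early(DT)  # 1: A alone reaches uniqueness, B is never looked at
```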
For now, I've just inserted the short-circuit you mentioned into duplicated.data.table. The speed-up from this alone seems to be about 30%, regardless of the number of rows. Speed-testing script:
# timing_runs.sh
Rscript dup_timing.R old
Rscript dup_timing.R new
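dup_timing.R itself isn't shown above; a minimal version might look like the following. The old/new switch, the table shape, and the row count are all illustrative assumptions, not the actual script:

```r
# dup_timing.R (hypothetical sketch): time unique() on a tall table.
# The command-line argument ("old" vs "new") would select which installed
# build of data.table gets loaded; here it is simply echoed in the output.
args <- commandArgs(trailingOnly = TRUE)
library(data.table)

n  <- 1e6
DT <- data.table(A = seq_len(n), B = sample(2L, n, replace = TRUE))

cat(sprintf("%s: %.3f sec\n", args[1],
            system.time(unique(DT))[["elapsed"]]))
```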
This is free and required almost no effort.
Two remaining things can be done:
Confirm forderv can short-circuit early when we're only checking for uniqueness and that has already been established before iterating over all columns. Requires a new argument to forderv?
Running duplicated.data.table from within unique.data.table still requires allocating and returning rep.int(FALSE, nrow(x)), which is probably slow. Better to split the logic of unique.data.table so we can just return(x) instead?
(The new default of using all columns brings this to the fore.)
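One way the split could look (a hedged sketch, not the actual data.table change): test with anyDuplicated() first, so the all-FALSE logical vector is never allocated on the already-unique path.

```r
library(data.table)

# Sketch of splitting the logic: anyDuplicated() returns the index of the
# first duplicate row (0L if none), so unique() can return(x) directly
# without ever materialising rep.int(FALSE, nrow(x)).
unique_split <- function(x) {
  if (anyDuplicated(x) == 0L) return(x)  # no logical vector, no subsetting
  x[!duplicated(x)]
}

DT <- data.table(A = 1:4, B = 4:1)
identical(unique_split(DT), DT)  # TRUE: x itself comes straight back
```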