unique.data.table could get a cols argument #5243

MichaelChirico · 2021-10-30T22:21:50Z

This would allow taking unique() on a subset of columns, which reduces memory overhead. It's basically equivalent to

DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]

with a more natural API:

unique(DT, by = BY_COLS, cols = KEEP_COLS)
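To make the intended semantics concrete, here is a small sketch on a toy table (the table `toy` and its values are invented for illustration). It runs the `.SD` form from above and, only if the installed data.table's `unique` method already has a `cols` argument, compares the result with the proposed call. Column order is aligned before comparing as a precaution, since nothing above specifies it.

```r
library(data.table)

# Toy data, invented for illustration only
toy = data.table(
  grp = c("a", "b", "a", "c", "b"),
  V1  = 1:5,
  V2  = c(10, 20, 30, 40, 50),
  V3  = letters[1:5]
)
BY_COLS = "grp"
KEEP_COLS = c("V1", "V2")

# Existing formulation: first row per group, restricted to BY_COLS + KEEP_COLS
res_sd = toy[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
print(res_sd)

# Proposed formulation, run only if this data.table version supports `cols`
unique_dt = utils::getS3method("unique", "data.table")
has_cols = "cols" %in% names(formals(unique_dt))
if (has_cols) {
  res_cols = unique(toy, by = BY_COLS, cols = KEEP_COLS)
  # Align column order before comparing, as a precaution
  setcolorder(res_cols, names(res_sd))
  print(all.equal(res_sd, res_cols))
}
```

The `V3` column is deliberately left out of `KEEP_COLS` to show that columns outside `by` and `cols` are dropped from the result.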

I believe other workarounds are still memory-inefficient (as well as clunkier), e.g.

unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)

whereas the first .SD approach (IINM) uses a shallow copy and is thus faster.

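library(data.table)

# 1e7 rows: one grouping column with 62 distinct values plus 100 numeric columns
NN = 1e7
DT = data.table(grp = sample(c(letters, LETTERS, 0:9), NN, TRUE))
JJ = 100
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), rnorm(NN))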

BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)

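# f1: unique() on .SD restricted to the needed columns
f1 <- function() DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
# f2: subset the columns first, then unique()
f2 <- function() unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
# f3: unique() on the full table, then subset the columns
f3 <- function() unique(DT, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
# f4: first row of each group via head()
f4 <- function() DT[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]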

bench::mark(min_iterations = 10L, f1(), f2(), f3(), f4())
# A tibble: 4 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 f1()        39.7ms  56.4ms     19.3         NA     0       10     0   519.25ms
# 2 f2()         223ms 225.7ms      4.43        NA     1.90     7     3      1.58s
# 3 f3()        38.5ms  39.4ms     25.2         NA     2.29    11     1   436.45ms
# 4 f4()        97.6ms 107.3ms      9.26        NA     1.03     9     1   972.44ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
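The benchmark above compares speed only. As a sanity check that the four formulations return the same rows and columns, here is a minimal sketch on a much smaller stand-in table (the name `small`, the size `n`, the seed, and the 10-column width are arbitrary choices for illustration, not part of the benchmark):

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n = 1000L
small = data.table(grp = sample(c(letters, LETTERS, 0:9), n, TRUE))
for (jj in 1:10) set(small, NULL, paste0("V", jj), rnorm(n))

BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)

# Same four formulations as f1()..f4(), applied to the small table
r1 = small[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
r2 = unique(small[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
r3 = unique(small, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
r4 = small[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]

# Compare each alternative against the first formulation
agree = c(
  r2 = isTRUE(all.equal(r1, r2)),
  r3 = isTRUE(all.equal(r1, r3)),
  r4 = isTRUE(all.equal(r1, r4))
)
print(agree)
```

Each formulation keeps the first row encountered for every value of `grp`, which is why they should agree when the table has no key set.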


jangorecki · 2021-10-31T20:07:58Z

There is an internal function, distinct, in the mergelist PR, AFAIR. I can't promise, but I think it also has less overhead than unique.

MichaelChirico mentioned this issue Oct 31, 2021

cols argument for unique.data.table #5244

Merged

mattdowle added this to the 1.14.3 milestone Dec 3, 2021

mattdowle closed this as completed in #5244 Dec 3, 2021

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
