This would allow taking unique() on a subset of columns to save memory overhead. It's basically equivalent to
DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
with a more natural API:
unique(DT, by = BY_COLS, cols = KEEP_COLS)
I believe the other workarounds are still memory-inefficient (as well as clunkier), e.g.
unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
while the first .SD approach (if I'm not mistaken) uses a shallow copy and is thus faster.
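To make the request concrete, here is a minimal sketch of the proposed semantics as a user-level wrapper (the name unique_on and its argument names are hypothetical, not an existing data.table API):

library(data.table)

# Hypothetical helper mirroring the proposed unique(DT, by = ..., cols = ...):
# deduplicate on the by_cols key columns, materializing only by_cols + keep_cols.
unique_on <- function(DT, by_cols, keep_cols) {
  DT[, unique(.SD, by = by_cols), .SDcols = c(by_cols, keep_cols)]
}

# usage, matching the benchmark below:
# unique_on(DT, by_cols = "grp", keep_cols = paste0("V", 1:5))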
NN = 1e7
DT = data.table(grp = sample(c(letters, LETTERS, 0:9), NN, TRUE))
JJ = 100
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), rnorm(NN))

BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)

f1 <- function() DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
f2 <- function() unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
f3 <- function() unique(DT, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
f4 <- function() DT[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]

bench::mark(min_iterations = 10L, f1(), f2(), f3(), f4())
# A tibble: 4 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 f1()        39.7ms  56.4ms     19.3         NA     0       10     0   519.25ms
# 2 f2()         223ms 225.7ms      4.43        NA     1.90     7     3      1.58s
# 3 f3()        38.5ms  39.4ms     25.2         NA     2.29    11     1   436.45ms
# 4 f4()        97.6ms 107.3ms      9.26        NA     1.03     9     1   972.44ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
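The timings are consistent with the memory argument: f2 is the slow outlier and triggers the most garbage collection, presumably because DT[, .SD, .SDcols = ...] deep-copies all six selected columns across 1e7 rows before deduplicating, while f1 avoids that copy. (mem_alloc shows NA, likely because this R build lacks memory profiling, so the timing and gc columns are the available evidence here.)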
There is an internal function distinct in the mergelist PR, AFAIR. I can't promise, but I think it also has less overhead than unique.