Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: .SDcols support list of expressions #6619

Open
r2evans opened this issue Nov 16, 2024 · 2 comments
Open

FR: .SDcols support list of expressions #6619

r2evans opened this issue Nov 16, 2024 · 2 comments

Comments

@r2evans
Copy link

r2evans commented Nov 16, 2024

There are use-cases where I want to use .SDcols=function and add an extra column or two. Currently, I think I need to do something like:

DT[, c(.(chrvar = chrvar), .SD), .SDcols = is.numeric]

(where chrvar is a non-numeric column that I want included in the output). What I'd like to be able to do is

DT[, .SD, .SDcols = .("chrvar", is.numeric)]

The below patch supports this notion. While it might be easy to infer that this is similar to dplyr::select, I didn't intend to follow it perfectly. Some points about my attempt so far:

  • It still works with a non-list .SDcols=.
  • It works with functions, integers, strings, patterns(.), just as a non-list argument would.
  • It defaults to be additive.
  • It can remove selected columns by prefacing with a double-minus, as in --patterns("r2$"); I chose -- because - is already taken for negation, !! is used for other (symbol-evaluation), and I couldn't think of another intuitive and not-otherwise-used unary operator;
  • It continues to support DT's ability to return repeats of a column (e.g., DT[, .SD, .SDcols=c(1,1,2,3)]), so a user can still choose repeated columns, but must do it explicitly within more of the expressions.

I think I have it working completely:

DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
DT[, .SD, .SDcols = .(is.logical, !is.numeric)]
#     lgl chr1 chr2
# 1: TRUE    A    B
DT[, .SD, .SDcols = .(patterns("r2"), c(1L, 1L, 2L, 3L))]
#    chr2 int1 int1 int2 chr1
# 1:    B    1    1    2    A
DT[, .SD, .SDcols = .("lgl", is.numeric)]
#     lgl int1 int2 num1 num2
# 1: TRUE    1    2    1    2
DT[, .SD, .SDcols = .(patterns("1$"), --is.integer)]
#    chr1 num1
# 1:    A    1
DT[, .SD, .SDcols = .(patterns("1$"), !is.integer)]
#    int1 chr1 num1 chr2 num2  lgl
# 1:    1    A    1    B    2 TRUE

All tests in the current repo pass without changes.

Possible extension:

  • If there is a time when we want to remove columns but only if they exist and don't fail otherwise, I have another patch (not included) to implement patterns(..., mustWork=FALSE), similar to dplyr::any_of() semantics, i.e., no error if not found. This also passes all tests but I didn't want to combine with this PR since it's mostly different additive functionality. And because I think it might go against design intentions of the package.
The minimal patch for this.

This does not include documentation, and I minimized indentation changes for the sake of highlighting only. I provide it here only for terse discussion, a PR will be a better place to go over more details.

The vast majority of the code is due to:

  • adding a for loop to iterate over the .SDcols expressions (stored in colsubs, iterated as colsub)
  • changing reassignment back to .SDcols <- to colsub <-
  • preassignment of empty ansvars, sdvars, and ansvals, and then combine-ing the new values correctly (add/subtract) with it
modified   R/data.table.R
@@ -1004,7 +1004,16 @@ replace_dot_alias = function(e) {
           ansvals = chmatchdup(ansvars, names_x)
         } else {
           # FR #355 - negative numeric and character indices for SDcols
-          colsub = substitute(.SDcols)
+          colsubs = substitute(.SDcols)
+          # FR #6619 - list of expressions for .SDcols
+          colsubs = if (colsubs %iscall% c(".", "list")) as.list(colsubs)[-1] else list(colsubs)
+          combine = function(a, b, subtract = FALSE) if (subtract) a[!a %in% b] else c(a, b[!b %in% a])
+          ansvars = sdvars = character()
+          ansvals = integer()
+          for (colsub in colsubs) {
+          # double-minus means to _remove_ columns from selection
+          rem_cols = (colsub %iscall% "-") && as.list(colsub[-1])[[1]] %iscall% "-"
+          if (rem_cols) colsub = colsub[[-1]][[-1]]
           # peel from parentheses before negation so (-1L) works as well: as.data.table(as.list(1:3))[, .SD,.SDcols=(-1L)] #4231
           while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]]
           # fix for R-Forge #5190. colsub[[1L]] gave error when it's a symbol.
@@ -1017,48 +1026,52 @@ replace_dot_alias = function(e) {
           while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]]
           if (colsub %iscall% ':' && length(colsub)==3L && !is.call(colsub[[2L]]) && !is.call(colsub[[3L]])) {
             # .SDcols is of the format a:b, ensure none of : arguments is a call data.table(V1=-1L, V2=-2L, V3=-3L)[,.SD,.SDcols=-V2:-V1] #4231
-            .SDcols = eval(colsub, setattr(as.list(seq_along(x)), 'names', names_x), parent.frame())
+            colsub = eval(colsub, setattr(as.list(seq_along(x)), 'names', names_x), parent.frame())
           } else {
             if (colsub %iscall% 'patterns') {
               patterns_list_or_vector = eval_with_cols(colsub, names_x)
-              .SDcols = if (is.list(patterns_list_or_vector)) {
+              colsub = if (is.list(patterns_list_or_vector)) {
                 # each pattern gives a new filter condition, intersect the end result
                 Reduce(intersect, patterns_list_or_vector)
               } else {
                 patterns_list_or_vector
               }
             } else {
-              .SDcols = eval(colsub, parent.frame(), parent.frame())
+              colsub = eval(colsub, parent.frame(), parent.frame())
               # allow filtering via function in .SDcols, #3950
-              if (is.function(.SDcols)) {
-                .SDcols = lapply(x, .SDcols)
-                if (any(idx <- lengths(.SDcols) > 1L | vapply_1c(.SDcols, typeof) != 'logical' | vapply_1b(.SDcols, anyNA)))
+              if (is.function(colsub)) {
+                colsub = lapply(x, colsub)
+                if (any(idx <- lengths(colsub) > 1L | vapply_1c(colsub, typeof) != 'logical' | vapply_1b(colsub, anyNA)))
                   stopf("When .SDcols is a function, it is applied to each column; the output of this function must be a non-missing boolean scalar signalling inclusion/exclusion of the column. However, these conditions were not met for: %s", brackify(names(x)[idx]))
-                .SDcols = unlist(.SDcols, use.names = FALSE)
+                colsub = unlist(colsub, use.names = FALSE)
               }
             }
           }
-          if (anyNA(.SDcols))
-            stopf(".SDcols missing at the following indices: %s", brackify(which(is.na(.SDcols))))
-          if (is.logical(.SDcols)) {
-            if (length(.SDcols)!=length(x)) stopf(".SDcols is a logical vector of length %d but there are %d columns", length(.SDcols), length(x))
-            ansvals = which_(.SDcols, !negate_sdcols)
-            ansvars = sdvars = names_x[ansvals]
-          } else if (is.numeric(.SDcols)) {
-            .SDcols = as.integer(.SDcols)
-            # if .SDcols is numeric, use 'dupdiff' instead of 'setdiff'
-            if (length(unique(sign(.SDcols))) > 1L) stopf(".SDcols is numeric but has both +ve and -ve indices")
-            if (any(idx <- abs(.SDcols)>ncol(x) | abs(.SDcols)<1L))
+          if (anyNA(colsub))
+            stopf(".SDcols missing at the following indices: %s", brackify(which(is.na(colsub))))
+          if (is.logical(colsub)) {
+            if (length(colsub)!=length(x)) stopf(".SDcols is a logical vector of length %d but there are %d columns", length(colsub), length(x))
+            newvars = which_(colsub, !negate_sdcols)
+            ansvals = combine(ansvals, newvars, rem_cols)
+            ansvars = sdvars = combine(ansvars, names_x[newvars], rem_cols)
+          } else if (is.numeric(colsub)) {
+            colsub = as.integer(colsub)
+            # if colsub is numeric, use 'dupdiff' instead of 'setdiff'
+            if (length(unique(sign(colsub))) > 1L) stopf(".SDcols is numeric but has both +ve and -ve indices")
+            if (any(idx <- abs(colsub)>ncol(x) | abs(colsub)<1L))
               stopf(".SDcols is numeric but out of bounds [1, %d] at: %s", ncol(x), brackify(which(idx)))
-            ansvars = sdvars = if (negate_sdcols) dupdiff(names_x[-.SDcols], bynames) else names_x[.SDcols]
-            ansvals = if (negate_sdcols) setdiff(seq_along(names(x)), c(.SDcols, which(names(x) %chin% bynames))) else .SDcols
+            newvars = if (negate_sdcols) dupdiff(names_x[-colsub], bynames) else names_x[colsub]
+            ansvars = sdvars = combine(ansvars, newvars, rem_cols)
+            ansvals = combine(ansvals, if (negate_sdcols) setdiff(seq_along(names(x)), c(colsub, which(names(x) %chin% bynames))) else colsub, rem_cols)
           } else {
-            if (!is.character(.SDcols)) stopf(".SDcols should be column numbers or names")
-            if (!all(idx <- .SDcols %chin% names_x))
-              stopf("Some items of .SDcols are not column names: %s", brackify(.SDcols[!idx]))
-            ansvars = sdvars = if (negate_sdcols) setdiff(names_x, c(.SDcols, bynames)) else .SDcols
+            if (!is.character(colsub)) stopf(".SDcols should be column numbers or names")
+            if (!all(idx <- colsub %chin% names_x))
+              stopf("Some items of .SDcols are not column names: %s", brackify(colsub[!idx]))
+            newvars = if (negate_sdcols) setdiff(names_x, c(colsub, bynames)) else colsub
+            ansvars = sdvars = combine(ansvars, newvars, rem_cols)
             # dups = FALSE here. DT[, .SD, .SDcols=c("x", "x")] again doesn't really help with which 'x' to keep (and if '-' which x to remove)
-            ansvals = chmatch(ansvars, names_x)
+            ansvals = combine(ansvals, chmatch(newvars, names_x), rem_cols)
+          }
           }
         }
         # fix for long standing FR/bug, #495 and #484
modified   inst/tests/tests.Rraw
@@ -20596,3 +20596,12 @@ test(2295.3, is.data.table(d2))
 
 # #6588: .checkTypos used to give arbitrary strings to stopf as the first argument
 test(2296, d2[x %no such operator% 1], error = '%no such operator%')
+
+# #6619: .SDcols supports list of expressions
+DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
+test(2297.1, DT[, .SD, .SDcols = .(is.logical, !is.numeric)], DT[, .(lgl, chr1, chr2)])
+test(2297.2, DT[, .SD, .SDcols = .(patterns("r2"), c(1L, 1L, 2L, 3L))], DT[, .(chr2, int1, int1, int2, chr1)])
+test(2297.3, DT[, .SD, .SDcols = .("lgl", is.numeric)], DT[, .(lgl, int1, int2, num1, num2)])
+# difference between `--` and `!`
+test(2297.4, DT[, .SD, .SDcols = .(patterns("1$"), --is.integer)], DT[, .(chr1, num1)])
+test(2297.5, DT[, .SD, .SDcols = .(patterns("1$"), !is.integer)], DT[, .(int1, chr1, num1, chr2 ,num2, lgl)])
Test performance comparison (no significant difference with this patch).
### current 'master' from this repo
> test.data.table("inst/tests/tests.Rraw")
getDTthreads(verbose=TRUE):
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            16
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          16
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 8 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /home/r2/Projects/github/data.table/inst/tests/tests.Rraw
Running test id 2296          
Sat Nov 16 14:48:33 2024  endian==little, sizeof(long double)==16, longdouble.digits==64, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/New_York', Sys.getlocale()=='LC_CTYPE=C.UTF-8;LC_NUMERIC=C;LC_TIME=C.UTF-8;LC_COLLATE=C.UTF-8;LC_MONETARY=C.UTF-8;LC_MESSAGES=C.UTF-8;LC_PAPER=C.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C.UTF-8;LC_IDENTIFICATION=C', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201511; omp_get_num_procs()==16; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==16; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 8 threads with throttle==1024. See ?setDTthreads.', .libPaths()=='/home/r2/R/x86_64-pc-linux-gnu-library/4.3','/opt/R/4.3.2/lib/R/site-library','/opt/R/4.3.2/lib/R/library', zlibVersion()==1.3 ZLIB_VERSION==1.3
10 longest running tests took 19s (46% of 42s)
      ID  time nTest
 1: 2155 5.552     5
 2: 1437 3.092    36
 3: 2233 2.063    55
 4: 1438 1.766   738
 5: 1648 1.411    91
 6: 1652 1.404    91
 7: 1650 1.370    91
 8: 1376 1.127    12
 9: 1642 0.927    91
10: 1644 0.927    91
All 11584 tests (last 2296) in inst/tests/tests.Rraw completed ok in 45.9s elapsed (51.8s cpu)

#### with my patch
> setwd("_sdcols/")
> devtools::load_all("~/Projects/github/data.table/_sdcols")
*** output flushed ***
> test.data.table("inst/tests/tests.Rraw")
getDTthreads(verbose=TRUE):
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            16
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          16
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 8 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /home/r2/Projects/github/data.table/_sdcols/inst/tests/tests.Rraw
Running test id 2298.4          
Sat Nov 16 14:50:26 2024  endian==little, sizeof(long double)==16, longdouble.digits==64, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/New_York', Sys.getlocale()=='LC_CTYPE=C.UTF-8;LC_NUMERIC=C;LC_TIME=C.UTF-8;LC_COLLATE=C.UTF-8;LC_MONETARY=C.UTF-8;LC_MESSAGES=C.UTF-8;LC_PAPER=C.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C.UTF-8;LC_IDENTIFICATION=C', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201511; omp_get_num_procs()==16; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==16; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 8 threads with throttle==1024. See ?setDTthreads.', .libPaths()=='/home/r2/R/x86_64-pc-linux-gnu-library/4.3','/opt/R/4.3.2/lib/R/site-library','/opt/R/4.3.2/lib/R/library', zlibVersion()==1.3 ZLIB_VERSION==1.3
10 longest running tests took 19s (46% of 41s)
      ID  time nTest
 1: 2155 5.492     5
 2: 1437 3.112    36
 3: 2233 2.042    55
 4: 1438 1.978   738
 5: 1648 1.374    91
 6: 1650 1.352    91
 7: 1652 1.341    91
 8: 1376 1.116    12
 9: 1642 0.908    91
10: 1646 0.895    91
All 11588 tests (last 2298.4) in inst/tests/tests.Rraw completed ok in 45.6s elapsed (51.5s cpu)

I'm happy to convert this into a PR.

Output from `sessionInfo()`.
> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.16.99 R.oo_1.25.0        R.methodsS3_1.8.2  bit_4.0.5          testthat_3.2.1    
[6] R.utils_2.12.3     bit64_4.0.5       

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 devtools_2.4.5    remotes_2.5.0     processx_3.8.4    callr_3.7.6       vctrs_0.6.5      
 [7] tools_4.3.3       ps_1.7.7          generics_0.1.3    tibble_3.2.1      fansi_1.0.6       pkgconfig_2.0.3  
[13] desc_1.4.3        lifecycle_1.0.4   compiler_4.3.3    stringr_1.5.1     brio_1.1.4        httpuv_1.6.13    
[19] htmltools_0.5.8.1 usethis_2.2.3     later_1.3.2       pillar_1.9.0      urlchecker_1.0.1  ellipsis_0.3.2   
[25] cachem_1.1.0      sessioninfo_1.2.2 mime_0.12         tidyselect_1.2.1  digest_0.6.34     stringi_1.8.3    
[31] dplyr_1.1.4       purrr_1.0.2       rprojroot_2.0.4   fastmap_1.2.0     cli_3.6.3         magrittr_2.0.3   
[37] pkgbuild_1.4.4    utf8_1.2.4        withr_3.0.1       promises_1.3.0    memoise_2.0.1     shiny_1.9.1      
[43] miniUI_0.1.1.1    profvis_0.3.8     rlang_1.1.4       Rcpp_1.0.11       xtable_1.8-4      glue_1.8.0       
[49] pkgload_1.3.4     rstudioapi_0.15.0 R6_2.5.1          fs_1.6.5         
@tdhock
Copy link
Member

tdhock commented Nov 19, 2024

I'm not sure I understand your proposition. A patch of the man page would help.
Can you please provide a more realistic use case? (more than one row, what calculations you want to do on .SD, instead of just returning .SD)
I wonder if you tried melt/dcast? I get the feeling that they could be used instead.

@r2evans
Copy link
Author

r2evans commented Nov 19, 2024

Sorry it wasn't quite clear, I'll try again. (And I think it helps to bring the example of the diff, sorry about that.)

The notion has nothing to do with multiple rows, it is all about selecting columns using more than one of the available data.table column-selection methods. If I am correct, the mechanisms for things we can pass to .SDcols= include: names, integers, patterns(.), and a function that returns logical for each column (having data passed).

My NEWS.md change (in draft) reads:

5. `.SDcols=` now supports a list of expressions. The default action for the second and subsequent expressions will be to add to columns already selected, but if an expression is prefaced with the literal `--` then columns found in that expression will be removed from the columns selected so far. The order matters, columns removed in one `--expr` may be added in the next expression (so order of `--`-expressions within the list of expressions matters). All current forms of expressions normally supported are still allowed in the list of expressions. Thanks to @r2evans for the request and PR.

    ```r
    DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
    DT[, .SD, .SDcols = .(is.logical, !is.numeric)]
    #     lgl chr1 chr2
    # 1: TRUE    A    B
    DT[, .SD, .SDcols = .(patterns("r2"), c(1L, 1L, 2L, 3L))]
    #    chr2 int1 int1 int2 chr1
    # 1:    B    1    1    2    A
    DT[, .SD, .SDcols = .("lgl", is.numeric)]
    #     lgl int1 int2 num1 num2
    # 1: TRUE    1    2    1    2
    DT[, .SD, .SDcols = .(patterns("1$"), --is.integer)]
    #    chr1 num1
    # 1:    A    1
    DT[, .SD, .SDcols = .(patterns("1$"), !is.integer)]
    #    int1 chr1 num1 chr2 num2  lgl
    # 1:    1    A    1    B    2 TRUE
    ```

    Please note in the last two examples that `--` is distinct from `!`/`-`: with `--is.integer`, already-selected columns that are integers will be _removed_ from the set of columns to return, whereas with `!is.integer`, all columns that are not integers will be _added_ to the set of columns to return.

What if I want to collect all numeric columns plus one column that is not numeric? Let's say I have a time column, some numeric columns, some character columns, and perhaps others. Currently, I would need to do something like

DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
DT[, .SD, .SDcols = c("chr1", names(which(sapply(DT, is.numeric))))]
#      chr1  int1  int2  num1  num2
#    <char> <int> <int> <num> <num>
# 1:      A     1     2     1     2

### suggested replacement syntax for that:
DT[, .SD, .SDcols = .("chr1", is.numeric)]

It's all about terse heterogeneous column selection. What calculations are done on the data is not really important.

Another example:

### double all columns that end in "2" that are not strings
DT[, names(.SD) := lapply(.SD, `*`, 2), .SDcols = .(patterns("2$"), --is.character)]
DT
#     int1  int2   chr1   chr2  num1  num2    lgl
# 1:     1     4      A      B     1     4   TRUE

I'm not looking to reshape anything (long-to-wide or wide-to-long), so melt and dcast are not necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants