data.table-style coalesce #3424

MichaelChirico · 2019-02-20T15:05:30Z

Did a bunch of coalesceing today and was sorely missing an efficient version.

@HughParsonage, would you be happy to add hutils::coalesce to data.table? I have the discussion in #2677 in mind...

I made tinkered with hutils::coalesce to come up with:

coalesce <- function(x, ...) {
  if (!any(na_idx <- is.na(x)) || missing(..1)) return(x)
  values <- list(...)
  
  lx <- length(x)
  lengths <- c(lx, vapply(values, length, FUN.VALUE = 0L))
  lengthsn1 <- lengths != 1L
  if (any(lengthsn1 & lengths != lx)) {
    wrong_len_i <- which(lengthsn1 & lengths != lx)
    stop("Argument ", wrong_len_i[1], " had length ", lengths[wrong_len_i[1]], ", ",
         "but length(x) = ", lx, ". ",
         "The only permissible lengths in ... are 1 or the length of `x` (", lx, ").")
  }
  
  typeof_x <- typeof(x)
  x_not_factor <- !inherits(x, what = 'factor')
  lv <- length(values)
  
  for (i in seq_len(lv)) {
    vi <- values[[i]]
    if (typeof(vi) != typeof_x) {
      stop("Argument ", i + 1L, " had type '", typeof(vi), "' but ",
           "typeof(x) was ", typeof_x, ". All types ",
           "in `...` must be the same type.")
    }
    
    if (inherits(vi, what = "factor") && x_not_factor) {
      stop("Argument ", i + 1L, " was a factor, but `x` was not. ",
           "All `...` must be the same type.")
    }
    
    if (lengthsn1[i + 1L]) {
      has_value_idx <- !is.na(vi[na_idx])
      x[na_idx][has_value_idx] <- vi[na_idx][has_value_idx]
      if (all(has_value_idx)) break
      na_idx[has_value_idx] = FALSE
    } else {
      if (is.na(vi)) next 
      else x[na_idx] <- vi
      break
    }
  }
  x
}

main difference being to skip running anyNA every iteration and instead focus on "whittling down" the is.na(x) vector

Benchmarked against hmisc::coalesce and it's hit or miss... maybe need more replications (function evaluation takes at most around 2 seconds so this is doable)? Or there's some extra optimization I'm missing...

Anyway, the same logic would probably be faster in C...

The text was updated successfully, but these errors were encountered:

HughParsonage · 2019-02-20T15:11:36Z

You'll need to go back to GPL-2 for this.

HughParsonage · 2019-02-20T15:20:19Z

Otherwise happy of course.

For the optimization, you'll need to go down to C to do better. anyNA is so much faster than is.na that it's only in perverse circumstances where any(is.na(v[v_not_na])) is going to outperform it (like where ...length() is large and the NAs are rare and at the bottom of all of them).

I've got 128 GB and I've never had hutils::coalesce be a bottleneck.

jangorecki · 2019-02-20T15:24:11Z

I used coalesce a lot of in SQL. I would say it is worth to write it in C.

smingerson · 2019-03-02T16:48:12Z

This would be an excellent addition to data.table. Do you think it would be possible to support patterns(), similar to functionality suggested in #3396 ? Frequently, I have columns which come from sources of varying reliability. I'll have 20 columns from each of 3+ sources, and want to coalesce values in order of reliability.

set.seed(702)
dt <- data.table(first_a = sample(c(1, NA), size = 10, replace = TRUE),
                 second_a = c(rnorm(9), NA),
                 third_a = 10, 
                 first_b = NA_real_)
   first_a      second_a third_a first_b
 1:      NA -0.0004156106      10      NA
 2:       1 -1.7100788510      10      NA
 3:       1 -0.7910541140      10      NA
 4:       1 -0.1093094194      10      NA
 5:       1 -1.1020017136      10      NA
 6:      NA  1.0196378767      10      NA
 7:      NA  0.9189197761      10      NA
 8:       1  1.7176207637      10      NA
 9:       1  0.7649291191      10      NA
10:      NA            NA      10      NA

# from providing, say, `coalesce(dt, patterns = patterns(c("first", "second", "third")))`
# first_b would not match be filled because no other name matched "_b" when the pattern was removed
# from the name.
        first_a      second_a third_a first_b
 1: -0.0004156106 -0.0004156106      10      NA
 2:  1.0000000000 -1.7100788510      10      NA
 3:  1.0000000000 -0.7910541140      10      NA
 4:  1.0000000000 -0.1093094194      10      NA
 5:  1.0000000000 -1.1020017136      10      NA
 6:  1.0196378767  1.0196378767      10      NA
 7:  0.9189197761  0.9189197761      10      NA
 8:  1.0000000000  1.7176207637      10      NA
 9:  1.0000000000  0.7649291191      10      NA
10: 10.0000000000            NA      10      NA

It would take a vector (or list) of patterns, find matches, then group the matches by the remaining portion of the name. It would coalesce each group of matches in the order the patterns were provided. I think this would be very convenient functionality, but not quite sure how it would work in the j-slot. I also may be overlooking a simple way to do this using the proposed version of coalesce.

jangorecki · 2019-05-29T12:24:49Z

@smingerson these kind of interface can be easily and reliably handled on user side.

your_pattern_fun = function() c("second_a","third_a")
jj = c(as.name("coalesce"), as.name("first_a"), lapply(your_pattern_fun(), as.name))
jj = as.call(jj)
print(jj)
#coalesce(first_a, second_a, third_a)
dt[, "first_a" := eval(jj)][]
#          first_a      second_a third_a first_b
# 1: -0.0004156106 -0.0004156106      10      NA
# 2:  1.0000000000 -1.7100788510      10      NA
# 3:  1.0000000000 -0.7910541140      10      NA
# 4:  1.0000000000 -0.1093094194      10      NA
# 5:  1.0000000000 -1.1020017136      10      NA
# 6:  1.0196378767  1.0196378767      10      NA
# 7:  0.9189197761  0.9189197761      10      NA
# 8:  1.0000000000  1.7176207637      10      NA
# 9:  1.0000000000  0.7649291191      10      NA
#10: 10.0000000000            NA      10      NA

I generally agree it would be useful to provide alternative interface to pass ... argument, for example as a list.
So I pushed improvement for that...

cc(F)
set.seed(702)
dt <- data.table(first_a = sample(c(1, NA), size = 10, replace = TRUE),
                 second_a = c(rnorm(9), NA),
                 third_a = 10, 
                 first_b = NA_real_)
your_pattern_fun = function() c("second_a","third_a")
dt[, "first_a" := coalesce(first_a, .dots=.SD), .SDcols=your_pattern_fun()][]

and because you can use patterns helper in .SDcols, you can easily replace your_pattern_fun from above example with data.table patterns helper.

mattdowle · 2019-06-17T20:45:49Z

@HughParsonage

You'll need to go back to GPL-2 for this.

Please note https://www.gnu.org/licenses/gpl-faq.en.html#IfInterpreterIsGPL which contains "The interpreted program, to the interpreter, is just data.". See also my comments about that GPL FAQ at the top of PR #2456 which changed data.table's license from GPL to MPL.
In short: my understanding is that the license of any R-only package (such as your hutils) is not relevant because the R code is considered data by the GPL, not code. According to the official GPL FAQ.
Of course, on the data.table project we will respect your intent. I'm just saying what I believe the license situation to be.

mattdowle · 2019-07-17T21:06:43Z

PR #3696 added complex support to coalesce, part of #3690.

jangorecki added this to the 1.12.4 milestone Mar 7, 2019

jangorecki added the feature request label Mar 7, 2019

MichaelChirico mentioned this issue May 29, 2019

fast coalesce function for data.table #3608

Merged

jangorecki added a commit that referenced this issue May 29, 2019

add alt interface for dots to coalesce, as requested in #3424

0415de5

jangorecki changed the title ~~FR: data.table-style coalesce~~ data.table-style coalesce May 29, 2019

mattdowle closed this as completed in #3608 Jun 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data.table-style coalesce #3424

data.table-style coalesce #3424

MichaelChirico commented Feb 20, 2019

HughParsonage commented Feb 20, 2019

HughParsonage commented Feb 20, 2019

jangorecki commented Feb 20, 2019

smingerson commented Mar 2, 2019

jangorecki commented May 29, 2019 •

edited

Loading

mattdowle commented Jun 17, 2019 •

edited

Loading

mattdowle commented Jul 17, 2019

data.table-style coalesce #3424

data.table-style coalesce #3424

Comments

MichaelChirico commented Feb 20, 2019

HughParsonage commented Feb 20, 2019

HughParsonage commented Feb 20, 2019

jangorecki commented Feb 20, 2019

smingerson commented Mar 2, 2019

jangorecki commented May 29, 2019 • edited Loading

mattdowle commented Jun 17, 2019 • edited Loading

mattdowle commented Jul 17, 2019

jangorecki commented May 29, 2019 •

edited

Loading

mattdowle commented Jun 17, 2019 •

edited

Loading