Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data.table-style coalesce #3424

Closed
MichaelChirico opened this issue Feb 20, 2019 · 7 comments · Fixed by #3608
Closed

data.table-style coalesce #3424

MichaelChirico opened this issue Feb 20, 2019 · 7 comments · Fixed by #3608

Comments

@MichaelChirico
Copy link
Member

Did a bunch of coalesceing today and was sorely missing an efficient version.

@HughParsonage, would you be happy to add hutils::coalesce to data.table? I have the discussion in #2677 in mind...

I made tinkered with hutils::coalesce to come up with:

coalesce <- function(x, ...) {
  if (!any(na_idx <- is.na(x)) || missing(..1)) return(x)
  values <- list(...)
  
  lx <- length(x)
  lengths <- c(lx, vapply(values, length, FUN.VALUE = 0L))
  lengthsn1 <- lengths != 1L
  if (any(lengthsn1 & lengths != lx)) {
    wrong_len_i <- which(lengthsn1 & lengths != lx)
    stop("Argument ", wrong_len_i[1], " had length ", lengths[wrong_len_i[1]], ", ",
         "but length(x) = ", lx, ". ",
         "The only permissible lengths in ... are 1 or the length of `x` (", lx, ").")
  }
  
  typeof_x <- typeof(x)
  x_not_factor <- !inherits(x, what = 'factor')
  lv <- length(values)
  
  for (i in seq_len(lv)) {
    vi <- values[[i]]
    if (typeof(vi) != typeof_x) {
      stop("Argument ", i + 1L, " had type '", typeof(vi), "' but ",
           "typeof(x) was ", typeof_x, ". All types ",
           "in `...` must be the same type.")
    }
    
    if (inherits(vi, what = "factor") && x_not_factor) {
      stop("Argument ", i + 1L, " was a factor, but `x` was not. ",
           "All `...` must be the same type.")
    }
    
    if (lengthsn1[i + 1L]) {
      has_value_idx <- !is.na(vi[na_idx])
      x[na_idx][has_value_idx] <- vi[na_idx][has_value_idx]
      if (all(has_value_idx)) break
      na_idx[has_value_idx] = FALSE
    } else {
      if (is.na(vi)) next 
      else x[na_idx] <- vi
      break
    }
  }
  x
}

main difference being to skip running anyNA every iteration and instead focus on "whittling down" the is.na(x) vector

Benchmarked against hmisc::coalesce and it's hit or miss... maybe need more replications (function evaluation takes at most around 2 seconds so this is doable)? Or there's some extra optimization I'm missing...

Anyway, the same logic would probably be faster in C...

@HughParsonage
Copy link
Member

You'll need to go back to GPL-2 for this.

@HughParsonage
Copy link
Member

Otherwise happy of course.

For the optimization, you'll need to go down to C to do better. anyNA is so much faster than is.na that it's only in perverse circumstances where any(is.na(v[v_not_na])) is going to outperform it (like where ...length() is large and the NAs are rare and at the bottom of all of them).

I've got 128 GB and I've never had hutils::coalesce be a bottleneck.

@jangorecki
Copy link
Member

I used coalesce a lot of in SQL. I would say it is worth to write it in C.

@smingerson
Copy link

This would be an excellent addition to data.table. Do you think it would be possible to support patterns(), similar to functionality suggested in #3396 ? Frequently, I have columns which come from sources of varying reliability. I'll have 20 columns from each of 3+ sources, and want to coalesce values in order of reliability.

set.seed(702)
dt <- data.table(first_a = sample(c(1, NA), size = 10, replace = TRUE),
                 second_a = c(rnorm(9), NA),
                 third_a = 10, 
                 first_b = NA_real_)
   first_a      second_a third_a first_b
 1:      NA -0.0004156106      10      NA
 2:       1 -1.7100788510      10      NA
 3:       1 -0.7910541140      10      NA
 4:       1 -0.1093094194      10      NA
 5:       1 -1.1020017136      10      NA
 6:      NA  1.0196378767      10      NA
 7:      NA  0.9189197761      10      NA
 8:       1  1.7176207637      10      NA
 9:       1  0.7649291191      10      NA
10:      NA            NA      10      NA

# from providing, say, `coalesce(dt, patterns = patterns(c("first", "second", "third")))`
# first_b would not match be filled because no other name matched "_b" when the pattern was removed
# from the name.
        first_a      second_a third_a first_b
 1: -0.0004156106 -0.0004156106      10      NA
 2:  1.0000000000 -1.7100788510      10      NA
 3:  1.0000000000 -0.7910541140      10      NA
 4:  1.0000000000 -0.1093094194      10      NA
 5:  1.0000000000 -1.1020017136      10      NA
 6:  1.0196378767  1.0196378767      10      NA
 7:  0.9189197761  0.9189197761      10      NA
 8:  1.0000000000  1.7176207637      10      NA
 9:  1.0000000000  0.7649291191      10      NA
10: 10.0000000000            NA      10      NA

It would take a vector (or list) of patterns, find matches, then group the matches by the remaining portion of the name. It would coalesce each group of matches in the order the patterns were provided. I think this would be very convenient functionality, but not quite sure how it would work in the j-slot. I also may be overlooking a simple way to do this using the proposed version of coalesce.

@jangorecki
Copy link
Member

jangorecki commented May 29, 2019

@smingerson these kind of interface can be easily and reliably handled on user side.

your_pattern_fun = function() c("second_a","third_a")
jj = c(as.name("coalesce"), as.name("first_a"), lapply(your_pattern_fun(), as.name))
jj = as.call(jj)
print(jj)
#coalesce(first_a, second_a, third_a)
dt[, "first_a" := eval(jj)][]
#          first_a      second_a third_a first_b
# 1: -0.0004156106 -0.0004156106      10      NA
# 2:  1.0000000000 -1.7100788510      10      NA
# 3:  1.0000000000 -0.7910541140      10      NA
# 4:  1.0000000000 -0.1093094194      10      NA
# 5:  1.0000000000 -1.1020017136      10      NA
# 6:  1.0196378767  1.0196378767      10      NA
# 7:  0.9189197761  0.9189197761      10      NA
# 8:  1.0000000000  1.7176207637      10      NA
# 9:  1.0000000000  0.7649291191      10      NA
#10: 10.0000000000            NA      10      NA

I generally agree it would be useful to provide alternative interface to pass ... argument, for example as a list.
So I pushed improvement for that...

cc(F)
set.seed(702)
dt <- data.table(first_a = sample(c(1, NA), size = 10, replace = TRUE),
                 second_a = c(rnorm(9), NA),
                 third_a = 10, 
                 first_b = NA_real_)
your_pattern_fun = function() c("second_a","third_a")
dt[, "first_a" := coalesce(first_a, .dots=.SD), .SDcols=your_pattern_fun()][]

and because you can use patterns helper in .SDcols, you can easily replace your_pattern_fun from above example with data.table patterns helper.

@jangorecki jangorecki changed the title FR: data.table-style coalesce data.table-style coalesce May 29, 2019
@mattdowle
Copy link
Member

mattdowle commented Jun 17, 2019

@HughParsonage

You'll need to go back to GPL-2 for this.

Please note https://www.gnu.org/licenses/gpl-faq.en.html#IfInterpreterIsGPL which contains "The interpreted program, to the interpreter, is just data.". See also my comments about that GPL FAQ at the top of PR #2456 which changed data.table's license from GPL to MPL.
In short: my understanding is that the license of any R-only package (such as your hutils) is not relevant because the R code is considered data by the GPL, not code. According to the official GPL FAQ.
Of course, on the data.table project we will respect your intent. I'm just saying what I believe the license situation to be.

@mattdowle
Copy link
Member

PR #3696 added complex support to coalesce, part of #3690.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants