Convenience method for join+insertion operation #4080

ben519 · 2019-11-27T23:01:59Z

My workflow involves a lot of the following: I have two tables, A & B, and I want to insert some of A's columns into B via a join. So, something like B[A, `:=`(col1 = i.col1, col2 = i.col2, ....), on = c("x", "y")]

A few annoyances led me to write the helper function below

When A has a lot of columns, writing out (col1 = i.col1, col2 = i.col2, ...) feels lengthy and redundant
Sometimes A's columns are dynamically created, so I have to piece together a string for the insertion and then use B[A, eval(parse(text = insertStr)), on = c("x", "y")]
I often want to do a rolling join + insertion. To my knowledge, there is no simple way to accomplish this

So, I made the following convenience/helper function for myself

data.table_join_insertion <- function(fromDT, intoDT, cols, on, roll = FALSE, mult = NULL){
  # Insert the specificed columns from 'fromDT' into 'intoDT' via a join+insertion using the 'on' columns
  # optionally, do a rolling join
  # if fromDT has a duplicated key, set mult = 'first'/'last' to pick which record to join+insert
  # 'on' can be a named vector like c("Foo"="Bar", "Baz"="Bingo") in which case the names will match
  # columns of 'fromDT' and the values will match columns of 'intoDT'
  
  # roll = TRUE or Inf
  # fromDT: --1---2------3----4--5---
  # intoDT: --|--|------|------|--|-|-
  # output: --1--1------2------4--5-5
  
  # roll = -Inf
  # fromDT: --1---2------3----4--5---
  # intoDT: --|--|------|------|--|-|-
  # output: --1--2------3------5-NA-NA
  
  # Force on to be a named vector
  names(on) <-if(is.null(names(on))) on else ifelse(names(on) == "", on, names(on))
  
  # what about ties?
  if(any(duplicated(fromDT, by = names(on)))){
    if(is.null(mult)) stop("duplicate keys found in fromDT but 'mult' = NULL gives no indication how to handle")
    if(mult == "first"){
      fromDT <- fromDT[fromDT[, .I[1L], by = names(on)]$V1]
    } else if(mult == "last"){
      fromDT <- fromDT[fromDT[, .I[.N], by = names(on)]$V1]
    } else{
      stop("'mult' should be 'first' or 'last'")
    }
  }
  
  # join+insertion
  insertStr <- paste0("`:=`(", paste0(cols, "=i.", cols, collapse = ","), ")")
  if(roll == FALSE){
    intoDT[fromDT, eval(parse(text = insertStr)), on = on]
  } else{
    
    # Setup
    fromDT[, FromRowId := .I]
    intoDT[, IntoRowId := .I]
    tempFrom <- fromDT[, c(names(on), "FromRowId"), with = FALSE]
    tempInto <- intoDT[, c(on, "IntoRowId"), with = FALSE]
    
    # rolling join
    temp <- tempFrom[tempInto, on = on, roll = roll]
    
    # back insert
    intoDT[temp, FromRowId := i.FromRowId, on = "IntoRowId"]
    intoDT[fromDT, eval(parse(text = insertStr)), on = "FromRowId"]
    
    # clean up
    fromDT[, c("FromRowId") := NULL]
    intoDT[, c("IntoRowId", "FromRowId") := NULL]
  }
}

I find it extremely useful as it saves me a lot of time and makes my code cleaner. Allow me to demonstrate.

foo <- data.table(x = c(1L, 3L, 9L, 10L, 10L))
bar <- data.table(x = c(1L, 2L, 7L, 10L, 10L), w = c("a","b","c","d","e"), z = c(0.5, 1.1, 3.1, 9.2, 7.8))

> print(foo)
    x w   z
1:  1 a 0.5
2:  3 b 1.1
3:  9 c 3.1
4: 10 d 9.2
5: 10 d 9.2

> print(bar)
    x w   z
1:  1 a 0.5
2:  2 b 1.1
3:  7 c 3.1
4: 10 d 9.2
5: 10 e 7.8

# Insert columns w and z from bar into foo, via x
# In case bar has dupes, use the first matching row
data.table_join_insertion(fromDT = bar, intoDT = foo, cols = c("w", "z"), on = c("x"), mult = "first")
> print(foo)
    x    w   z
1:  1    a 0.5
2:  3 <NA>  NA
3:  9 <NA>  NA
4: 10    d 9.2
5: 10    d 9.2

# Insert columns w and z from bar into foo, via rolling join on x
# In case bar has dupes, use the first matching row
data.table_join_insertion(fromDT = bar, intoDT = foo, cols = c("w", "z"), on = c("x"), mult = "first", roll = T)
> print(foo)
    x w   z
1:  1 a 0.5
2:  3 b 1.1
3:  9 c 3.1
4: 10 d 9.2
5: 10 d 9.2

I know this function has much room for improvement, but I wanted to share in case others found it useful and would like it to become a part of data.table. ..or maybe this functionality already exists and somehow I missed it.

Thanks

The text was updated successfully, but these errors were encountered:

jangorecki · 2019-11-28T02:53:11Z

Thanks for sharing Ben.

eval+parse is not really justified here, should be avoided in favor of meta programming, this SO is related: R data.table join: SQL “select *” alike syntax in joined tables? - this comment explains why: improvement on programmatically substituting columns in expressions #2655 (comment)
maybe some of rolling joins can be replaced by nafill? it is quite low overhead

What we actually need to make it simple is

B[A, names(A) := i.SD, on=c("x","y")]

related #935

jangorecki · 2020-06-17T19:51:26Z

Closing as duplicate of #3184. Please let me know @ben519 if that issue does not address your case.

jangorecki added the feature request label Nov 29, 2019

jangorecki added joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins programming parameterizing queries: get, mget, eval, env labels Apr 5, 2020

jangorecki added the duplicate label Jun 17, 2020

jangorecki closed this as completed Jun 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convenience method for join+insertion operation #4080

Convenience method for join+insertion operation #4080

ben519 commented Nov 27, 2019 •

edited

Loading

jangorecki commented Nov 28, 2019 •

edited

Loading

jangorecki commented Jun 17, 2020

Convenience method for join+insertion operation #4080

Convenience method for join+insertion operation #4080

Comments

ben519 commented Nov 27, 2019 • edited Loading

jangorecki commented Nov 28, 2019 • edited Loading

jangorecki commented Jun 17, 2020

ben519 commented Nov 27, 2019 •

edited

Loading

jangorecki commented Nov 28, 2019 •

edited

Loading