Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convenience method for join+insertion operation #4080

Closed
ben519 opened this issue Nov 27, 2019 · 2 comments
Closed

Convenience method for join+insertion operation #4080

ben519 opened this issue Nov 27, 2019 · 2 comments
Labels
duplicate feature request joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins programming parameterizing queries: get, mget, eval, env

Comments

@ben519
Copy link

ben519 commented Nov 27, 2019

My workflow involves a lot of the following: I have two tables, A & B, and I want to insert some of A's columns into B via a join. So, something like B[A, `:=`(col1 = i.col1, col2 = i.col2, ....), on = c("x", "y")]

A few annoyances led me to write the helper function below

  1. When A has a lot of columns, writing out (col1 = i.col1, col2 = i.col2, ...) feels lengthy and redundant
  2. Sometimes A's columns are dynamically created, so I have to piece together a string for the insertion and then use B[A, eval(parse(text = insertStr)), on = c("x", "y")]
  3. I often want to do a rolling join + insertion. To my knowledge, there is no simple way to accomplish this

So, I made the following convenience/helper function for myself

data.table_join_insertion <- function(fromDT, intoDT, cols, on, roll = FALSE, mult = NULL){
  # Insert the specificed columns from 'fromDT' into 'intoDT' via a join+insertion using the 'on' columns
  # optionally, do a rolling join
  # if fromDT has a duplicated key, set mult = 'first'/'last' to pick which record to join+insert
  # 'on' can be a named vector like c("Foo"="Bar", "Baz"="Bingo") in which case the names will match
  # columns of 'fromDT' and the values will match columns of 'intoDT'
  
  # roll = TRUE or Inf
  # fromDT: --1---2------3----4--5---
  # intoDT: --|--|------|------|--|-|-
  # output: --1--1------2------4--5-5
  
  # roll = -Inf
  # fromDT: --1---2------3----4--5---
  # intoDT: --|--|------|------|--|-|-
  # output: --1--2------3------5-NA-NA
  
  # Force on to be a named vector
  names(on) <-if(is.null(names(on))) on else ifelse(names(on) == "", on, names(on))
  
  # what about ties?
  if(any(duplicated(fromDT, by = names(on)))){
    if(is.null(mult)) stop("duplicate keys found in fromDT but 'mult' = NULL gives no indication how to handle")
    if(mult == "first"){
      fromDT <- fromDT[fromDT[, .I[1L], by = names(on)]$V1]
    } else if(mult == "last"){
      fromDT <- fromDT[fromDT[, .I[.N], by = names(on)]$V1]
    } else{
      stop("'mult' should be 'first' or 'last'")
    }
  }
  
  # join+insertion
  insertStr <- paste0("`:=`(", paste0(cols, "=i.", cols, collapse = ","), ")")
  if(roll == FALSE){
    intoDT[fromDT, eval(parse(text = insertStr)), on = on]
  } else{
    
    # Setup
    fromDT[, FromRowId := .I]
    intoDT[, IntoRowId := .I]
    tempFrom <- fromDT[, c(names(on), "FromRowId"), with = FALSE]
    tempInto <- intoDT[, c(on, "IntoRowId"), with = FALSE]
    
    # rolling join
    temp <- tempFrom[tempInto, on = on, roll = roll]
    
    # back insert
    intoDT[temp, FromRowId := i.FromRowId, on = "IntoRowId"]
    intoDT[fromDT, eval(parse(text = insertStr)), on = "FromRowId"]
    
    # clean up
    fromDT[, c("FromRowId") := NULL]
    intoDT[, c("IntoRowId", "FromRowId") := NULL]
  }
}

I find it extremely useful as it saves me a lot of time and makes my code cleaner. Allow me to demonstrate.

foo <- data.table(x = c(1L, 3L, 9L, 10L, 10L))
bar <- data.table(x = c(1L, 2L, 7L, 10L, 10L), w = c("a","b","c","d","e"), z = c(0.5, 1.1, 3.1, 9.2, 7.8))

> print(foo)
    x w   z
1:  1 a 0.5
2:  3 b 1.1
3:  9 c 3.1
4: 10 d 9.2
5: 10 d 9.2

> print(bar)
    x w   z
1:  1 a 0.5
2:  2 b 1.1
3:  7 c 3.1
4: 10 d 9.2
5: 10 e 7.8

# Insert columns w and z from bar into foo, via x
# In case bar has dupes, use the first matching row
data.table_join_insertion(fromDT = bar, intoDT = foo, cols = c("w", "z"), on = c("x"), mult = "first")
> print(foo)
    x    w   z
1:  1    a 0.5
2:  3 <NA>  NA
3:  9 <NA>  NA
4: 10    d 9.2
5: 10    d 9.2

# Insert columns w and z from bar into foo, via rolling join on x
# In case bar has dupes, use the first matching row
data.table_join_insertion(fromDT = bar, intoDT = foo, cols = c("w", "z"), on = c("x"), mult = "first", roll = T)
> print(foo)
    x w   z
1:  1 a 0.5
2:  3 b 1.1
3:  9 c 3.1
4: 10 d 9.2
5: 10 d 9.2

I know this function has much room for improvement, but I wanted to share in case others found it useful and would like it to become a part of data.table. ..or maybe this functionality already exists and somehow I missed it.

Thanks

@jangorecki
Copy link
Member

jangorecki commented Nov 28, 2019

Thanks for sharing Ben.


What we actually need to make it simple is

B[A, names(A) := i.SD, on=c("x","y")]

related #935

@jangorecki jangorecki added joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins programming parameterizing queries: get, mget, eval, env labels Apr 5, 2020
@jangorecki
Copy link
Member

Closing as duplicate of #3184. Please let me know @ben519 if that issue does not address your case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate feature request joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins programming parameterizing queries: get, mget, eval, env
Projects
None yet
Development

No branches or pull requests

2 participants