Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed Up When Simple Join Works for Most Columns #69

Open
billdenney opened this issue Jun 11, 2020 · 3 comments
Open

Speed Up When Simple Join Works for Most Columns #69

billdenney opened this issue Jun 11, 2020 · 3 comments

Comments

@billdenney
Copy link

Thanks for writing the package.

I have data where most columns should match exactly, but two of the columns require fuzzy matching. The data take a while to join on the 6 by columns. One data.frame is 2597 rows and the other is 44929 rows.

I started running the following code to try to match the 6 columns, and it ran several minutes before crashing because R ran out of memory.

foo <-
  fuzzyjoin::fuzzy_left_join(
    x=sdtab %>% rename(USUBJIDN=ID) %>% mutate(ROWID=row_number()),
    y=d_combined,
    by=c("USUBJIDN", intersect(names(sdtab), names(d_combined))),
    match_fun=
      list(
        USUBJIDN=function(x, y) x == y,
        PARAMN=function(x, y) x == y,
        TSFM=function(x, y) abs(x/y - 1) < 0.0001,
        AVALN=function(x, y) abs(x/y - 1) < 0.0001,
        EVID=function(x, y) x == y,
        MDV=function(x, y) x == y
      )
  )

Is there a more efficient way to perform an exact join on several columns and only perform the fuzzy join on the remaining two?

@billdenney
Copy link
Author

Here is a (much less feature-rich) consideration for a faster and less memory intensive algorithm:

fuzzy_join_2 <- function(x, y, match_fun, ...) {
  idx_df <- tibble()
  for (row_x in seq_len(nrow(x))) {
    current_mask <-
      fuzzy_join_match(
        rep(TRUE, nrow(y)),
        x=x[row_x, , drop=FALSE],
        y=y,
        match_fun=match_fun,
        ...
      )
    current_idx <- which(current_mask)
    idx_df <-
      bind_rows(
        idx_df,
        if (length(current_idx)) {
          tibble(
            x=row_x,
            y=current_idx
          )
        } else {
          tibble(
            x=row_x,
            y=NA_integer_
          )
        }
      )
  }
  extra_rows <- setdiff(seq_len(nrow(y)), idx_df$y)
  bind_rows(
    idx_df,
    if (length(extra_rows)) {
      tibble(
        x=NA_integer_,
        y=extra_rows
      )
    }
  )
}

fuzzy_join_match <- function(current_mask, x, y, match_fun, ...) {
  current_name <- names(match_fun)[1]
  current_mask[current_mask] <-
    match_fun[[1]](
      x[[current_name]],
      y[[current_name]][current_mask],
      ...
    )
  current_mask[is.na(current_mask)] <- FALSE
  if (any(current_mask) & length(match_fun) > 1) {
    current_mask <- fuzzy_join_match(current_mask, x, y, match_fun[-1], ...)
  }
  current_mask
}

@espinielli
Copy link

Do you have some examples and benchmarks on your proposal?
I recently tried fuzzyjoin on a small (31k and 19700) dataset, exact matching on 1 column, one <= and one >= match, but always hit out of memory 😢

@billdenney
Copy link
Author

@espinielli, I have not significantly tested my proposal. The example should be similar to my initial example (just giving an x and y along with appropriate match_fun values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants