-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed Up When Simple Join Works for Most Columns #69
Comments
Here is a (much less feature-rich) consideration for a faster and less memory intensive algorithm: fuzzy_join_2 <- function(x, y, match_fun, ...) {
idx_df <- tibble()
for (row_x in seq_len(nrow(x))) {
current_mask <-
fuzzy_join_match(
rep(TRUE, nrow(y)),
x=x[row_x, , drop=FALSE],
y=y,
match_fun=match_fun,
...
)
current_idx <- which(current_mask)
idx_df <-
bind_rows(
idx_df,
if (length(current_idx)) {
tibble(
x=row_x,
y=current_idx
)
} else {
tibble(
x=row_x,
y=NA_integer_
)
}
)
}
extra_rows <- setdiff(seq_len(nrow(y)), idx_df$y)
bind_rows(
idx_df,
if (length(extra_rows)) {
tibble(
x=NA_integer_,
y=extra_rows
)
}
)
}
fuzzy_join_match <- function(current_mask, x, y, match_fun, ...) {
current_name <- names(match_fun)[1]
current_mask[current_mask] <-
match_fun[[1]](
x[[current_name]],
y[[current_name]][current_mask],
...
)
current_mask[is.na(current_mask)] <- FALSE
if (any(current_mask) & length(match_fun) > 1) {
current_mask <- fuzzy_join_match(current_mask, x, y, match_fun[-1], ...)
}
current_mask
} |
Do you have some examples and benchmarks on your proposal? |
@espinielli, I have not significantly tested my proposal. The example should be similar to my initial example (just giving an |
Thanks for writing the package.
I have data where most columns should match exactly, but two of the columns require fuzzy matching. The data take a while to join on the 6 by columns. One data.frame is 2597 rows and the other is 44929 rows.
I started running the following code to try to match the 6 columns, and it ran several minutes before crashing because R ran out of memory.
Is there a more efficient way to perform an exact join on several columns and only perform the fuzzy join on the remaining two?
The text was updated successfully, but these errors were encountered: