Join using merge #45

christophsax · 2017-05-10T19:42:44Z

In data.table v1.9.6, merge.data.table gained by.x and by.y arguments. It also no longer copies its inputs and does not need a key, so using merge to perform joins should be more efficient and more convenient.

From ?merge.data.table:

In versions <= v1.9.4, if the specified columns in by was not the key (or head of the key) of x or y, then a copy is first rekeyed prior to performing the merge. This was less performant and memory inefficient. The concept of secondary keys (implemented in v1.9.4) was used to overcome this limitation from v1.9.6+. No deep copies are made anymore and therefore very performant and memory efficient. Also there is better control for providing the columns to merge on with the help of newly implemented by.x and by.y arguments.
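As a rough illustration of the by.x and by.y arguments described in the quoted documentation, here is a minimal sketch. The tables and column names (id, key_y, x_val, y_val) are made up for the example and do not come from this PR.

```r
library(data.table)

# Hypothetical example tables; the join columns have different names
x <- data.table(id = c(1L, 2L, 3L), x_val = c("a", "b", "c"))
y <- data.table(key_y = c(2L, 3L, 4L), y_val = c("B", "C", "D"))

# Inner join on x$id == y$key_y; no setkey() call is needed beforehand
inner <- merge(x, y, by.x = "id", by.y = "key_y")

# Left join: all.x = TRUE keeps every row of x, filling y_val with NA
left <- merge(x, y, by.x = "id", by.y = "key_y", all.x = TRUE)
```

The result takes the join column name from x (here id), which is why the merge-based joins need no renaming step.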

For the non-merge joins, semi_join.data.table and anti_join.data.table, renaming columns and using the on = argument removes the need for setkey and copying.
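A semi join without setkey could look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: the tables are invented, and the on = vector is written inline instead of being derived from by.

```r
library(data.table)

# Hypothetical example tables
x <- data.table(a = c(1L, 2L, 2L, 3L), v = c("p", "q", "r", "s"))
y <- data.table(b = c(2L, 2L, 3L, 5L))

# Row numbers of x with at least one match in y (x$a == y$b).
# nomatch = 0L drops rows of y without a match; unique() collapses
# rows of x that are matched more than once.
w <- unique(x[y, which = TRUE, on = c(a = "b"), nomatch = 0L,
              allow.cartesian = TRUE])

# Keep the matched rows in their original order
semi <- x[sort(w)]
```

Neither x nor y is keyed or modified here; the match is driven entirely by on =.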

This should fix #20, #21.

@krlmlr, we checked this out together, but you may want to have another look. I had to remove some uses of with = FALSE, which only works for the j argument, not for i. Even without with = FALSE, the current solution works if a column is named y_labels, so we should be fine. Also, nomatch = 0L cannot be used in an anti join.

krlmlr

Looks good.

I have seen you opened two PRs that are based on this PR. This will be difficult to handle. You could add these commits to this PR's branch and push, or wait for this to be merged.

krlmlr · 2017-05-10T19:45:43Z

R/joins.R

-    if (!identical(by$x, by$y)) {
-      stop("Data table joins must be on same key", call. = FALSE)
-    }
-    y <- dplyr::auto_copy(x, y, copy = copy)


We might want to keep this auto_copy() call.

krlmlr · 2017-05-10T19:46:51Z

R/joins.R

-  w <- unique(x[y, which = TRUE, allow.cartesian = TRUE])
-  w <- w[!is.na(w)]
-  x[w]
+   y <- as.data.table(y)


auto_copy() should take care of that.

Without copy = TRUE we still would get:

> library(dplyr)
> library(dplyr)
> dt <- data.table::data.table(x = 1:3, y = 3:1)
> df <- data.frame(x = 3:1, z = 1:3)
> anti_join(dt, df)
Joining, by = "x"
Error in `[.data.frame`(y, , by_y, with = FALSE) : unused argument (with = FALSE)

This could be avoided with y <- as.data.table(y)

But wouldn't #13 also take care of that case? I think it's better to have automatic coercion from data frame to data table under our control, and perhaps give a message.

Fair enough, will remove it.

That means #14 is not fixed yet; it will only be fixed once #13 is done.

I think that's reasonable.
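For reference, a controlled coercion with a message, as suggested above, might be sketched as follows. The helper name coerce_to_dt is hypothetical; it is not part of dplyr, data.table, or this PR, and the actual handling is left to #13.

```r
library(data.table)

# Hypothetical helper, not an existing function in any package
coerce_to_dt <- function(y) {
  if (!is.data.table(y)) {
    message("Coercing `y` from data.frame to data.table")
    y <- as.data.table(y)
  }
  y
}

df <- data.frame(x = 3:1, z = 1:3)
y <- coerce_to_dt(df)

# Column selection with with = FALSE now works, because y is a data.table
y_cols <- y[, "x", with = FALSE]
```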

krlmlr · 2017-05-10T19:47:35Z

R/joins.R

-  w <- unique(x[y, which = TRUE, allow.cartesian = TRUE])
-  w <- w[!is.na(w)]
-  x[w]
+   y <- as.data.table(y)


Perhaps we could refactor the common parts in a function?

Perhaps with separate templates for merge and non-merge joins? As here?

For the suffix argument, we need to treat them differently anyway.

Actually, I was thinking about a simple function that computes filter_y.

Templating: We could also do something like

join_dt <- function(op, suffix = TRUE) {
  if (suffix) suffix_op <- ... else suffix_op <- NULL
  template <- substitute({
    ...
    suffix_op
    ...
    op
    ...
  })
  f <- ...
  if (!suffix) formals(f) <- formals(f)[names(formals(f)) != "suffix"]
  ...
}

Both solutions are feasible, but they really look too complicated for the task at hand. In the end it seems simpler to admit some duplication and get rid of the templating mechanism. But this doesn't have to be in that PR.

Scratch that, suffix_op affects op. The conclusion still holds -- let's expand the templates.

My last comment may have been ambiguous, sorry for that. I was thinking about expanding the templates to six simple functions that have a few lines of code in common.

OK, done. I also like it better that way. I also added a small test.

christophsax · 2017-05-10T20:20:18Z

Closed the two other PRs; I will open them again once this is merged.

krlmlr

Thanks, looks much simpler now without the templates.

krlmlr · 2017-05-10T22:45:39Z

R/joins.R


 #' @rdname join.tbl_dt
-anti_join.data.table <- join_dt({x[!y, allow.cartesian = TRUE]})
+full_join.data.table <- function(x, y, by = NULL, copy = FALSE, ...){


Can you please move the full join above the semi join? I think this is the order we use elsewhere, too.

krlmlr · 2017-05-10T22:48:57Z

R/joins.R

+  y <- dplyr::auto_copy(x, y, copy = copy)
+  by_x <- by$x
+  by_y <- by$y
+  y_filter <- y[, by_y, with = FALSE]


Is the first comma still necessary with data.table?

Without it, it would take it for the i, not the j argument. So I guess, yes.
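The difference between the two argument positions can be seen in a small sketch. The table and the value of by_y are invented for illustration.

```r
library(data.table)

# Hypothetical example table
y <- data.table(b = 1:3, z = c("u", "v", "w"))
by_y <- "b"

# j position (after the comma): with = FALSE selects the columns named in by_y
y_filter <- y[, by_y, with = FALSE]

# i position (no comma): the argument selects rows, not columns
first_row <- y[1L]
```

Here y_filter is a one-column data.table, while first_row is a single row with all columns.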

krlmlr · 2017-05-10T22:55:01Z

R/joins.R

+  by_x <- by$x
+  by_y <- by$y
+  y_filter <- y[, by_y, with = FALSE]
+  names(y_filter) <- by_x


Perhaps y_trimmed <- trim_y_for_semi_join(y, by), or even a better verb than "trim", instead of y_filter? This removes duplication and adds readability IMO.

krlmlr

Looks good. I wouldn't mind adding tests to this PR, so that we're sure it works as expected.

@lionel-: Could you please take a look, too?

krlmlr · 2017-05-11T06:31:11Z

R/joins.R

+  by <- dplyr::common_by(by, x, y)
+  y <- dplyr::auto_copy(x, y, copy = copy)
+  by_x <- by$x
+  by_x <- by$x


Can you please double-check, also if we could perhaps write by$x instead? The docs for the on argument suggest that we could also avoid renaming the columns in y altogether:

on <- rlang::set_names(by$y, by$x)
w <- x[!y, which = TRUE, on = on]
...

Would that work with data.table 1.9.6?
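A self-contained sketch of an anti join with a named on vector might look like this. It is an illustration under assumptions: the tables are invented, by is built by hand with the same x and y elements used in the diff, and base R's stats::setNames stands in for rlang::set_names so the example needs no extra package.

```r
library(data.table)

# Hypothetical example tables
x <- data.table(a = 1:4, v = letters[1:4])
y <- data.table(b = c(2L, 4L, 6L))

# Hand-built stand-in for the `by` list used in the diff (by$x, by$y)
by <- list(x = "a", y = "b")

# Names are columns of x, values are the matching columns of y
on <- stats::setNames(by$y, by$x)

# Anti join: rows of x without a match in y; y is not renamed or copied
anti <- x[!y, on = on]
```

Because on carries the mapping between differently named columns, the renaming step for y is not needed.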

dplyr already imports rlang, so no extra dependency added, but we need to explicitly declare the import in DESCRIPTION to be able to use it here.

You can also use @import rlang and call rlang functions unqualified. I started on a tidyevalish implementation (not fully tidy because we'll need to flatten quosures) a couple of weeks ago, so we'll import it in the future anyway.

Yes, indeed. I wasn't aware of that. And it was already introduced in data.table 1.9.6.

Since by$y can be used directly as well, I removed the one-line function trim_y_for_semi_join altogether.

Also added @import rlang and called set_names unqualified.

lionel- · 2017-05-11T07:15:43Z

Thanks, looks much simpler now without the templates.

Agreed. Note that in cases where it makes sense, you can now do:

expr <- quote(a <- b)
exprs <- list(quote(foo(a)), quote(foo(b)))

fn <- expr_interp(function(...) {
  !! expr
  !!! exprs
})

krlmlr

Awesome, thanks! Ready to merge if you could please add a bullet to NEWS.

krlmlr · 2017-05-11T11:17:46Z

Thanks!

christophsax added 2 commits May 10, 2017 20:06

Merge remote-tracking branch 'hadley/master'

6a2378a

Use by.x and by.y from merge.data.table

21b734b

krlmlr reviewed May 10, 2017

re-add auto copy

010ad84

christophsax added 4 commits May 10, 2017 22:29

separate templates for merge and non-merge joins

a81bfa0

Merge branch 'separate-template-merge-non-merge' into join-using-merge

80a1368

do not use function templates

f44ffb0

add a small test

97c1204

krlmlr reviewed May 10, 2017

move full_join in before semi_join, trim_y_for_semi_join()

8877e9c

krlmlr reviewed May 11, 2017

krlmlr requested a review from lionel- May 11, 2017 06:33

lionel- approved these changes May 11, 2017

christophsax added 2 commits May 11, 2017 10:03

no renaming, import rlang

a312d42

add rlang to imports

6166cc7

krlmlr approved these changes May 11, 2017

bullet to NEWS

4fe5a97

krlmlr merged commit 6c81a9a into tidyverse:master May 11, 2017

christophsax deleted the join-using-merge branch May 11, 2017 19:05

hadley mentioned this pull request Jul 25, 2017

auto_copy should distinguish non-tbl data.tables from data.frames #13

Closed
