-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join using merge #45
Join using merge #45
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
I have seen you opened two PRs that are based on this PR. This will be difficult to handle. You could add these commits to this PR's branch and push, or wait for this to be merged.
R/joins.R
Outdated
if (!identical(by$x, by$y)) { | ||
stop("Data table joins must be on same key", call. = FALSE) | ||
} | ||
y <- dplyr::auto_copy(x, y, copy = copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to keep this auto_copy()
call.
R/joins.R
Outdated
w <- unique(x[y, which = TRUE, allow.cartesian = TRUE]) | ||
w <- w[!is.na(w)] | ||
x[w] | ||
y <- as.data.table(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto_copy()
should take care of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without copy = TRUE
we still would get:
> library(dplyr)
library(dplyr)
dt <- data.table::data.table(x = 1:3, y = 3:1)
df <- data.frame(x = 3:1, z = 1:3)
anti_join(dt, df)
Joining, by = "x"
Error in `[.data.frame`(y, , by_y, with = FALSE) :
unused argument (with = FALSE)
This could be avoided with y <- as.data.table(y)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But wouldn't #13 also take care of that case? I think it's better to have automatic coercion from data frame to data table under our control, and perhaps give a message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, will remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's reasonable.
R/joins.R
Outdated
w <- unique(x[y, which = TRUE, allow.cartesian = TRUE]) | ||
w <- w[!is.na(w)] | ||
x[w] | ||
y <- as.data.table(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could refactor the common parts in a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps with separate templates for merge and non-merge joins? As here?
For the suffix
argument, we need to treat them differently anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I was thinking about a simple function that computes filter_y
.
Templating: We could also do something like
join_dt <- function(op, suffix = TRUE) {
if (suffix) suffix_op <- ...
else suffix_op <- NULL
template <- substitute({
...
suffix_op
...
op
...
})
f <- ...
if (!suffix) formals(f) <- formals(f)[names(formals(f)) != "suffix"]
...
}
Both solutions are feasible, but really look too complicated for the task at hand. In the end it seems simpler to admin some duplication and get rid of the templating mechanism. But this doesn't have to be in that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch that, suffix_op
affects op
. The conclusion still holds -- let's expand the templates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My last comment may have been ambiguous, sorry for that. I was thinking about expanding the templates to six simple functions that have a few lines of code in common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done, I also like it better that way. Also added a small test.
Closed the two other PR and will open again if this is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks much simpler now without the templates.
R/joins.R
Outdated
|
||
#' @rdname join.tbl_dt | ||
anti_join.data.table <- join_dt({x[!y, allow.cartesian = TRUE]}) | ||
full_join.data.table <- function(x, y, by = NULL, copy = FALSE, ...){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please move the full join above the semi join? I think this is the order we use elsewhere, too.
R/joins.R
Outdated
y <- dplyr::auto_copy(x, y, copy = copy) | ||
by_x <- by$x | ||
by_y <- by$y | ||
y_filter <- y[, by_y, with = FALSE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the first comma still necessary with data.table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without it, it would take it for the i
, not the j
argument. So I guess, yes.
R/joins.R
Outdated
by_x <- by$x | ||
by_y <- by$y | ||
y_filter <- y[, by_y, with = FALSE] | ||
names(y_filter) <- by_x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps y_trimmed <- trim_y_for_semi_join(y, by)
, or even a better verb than "trim", instead of y_filter
? This removes duplication and adds readability IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I wouldn't mind adding tests to this PR, so that we're sure it works as expected.
@lionel-: Could you please take a look, too?
R/joins.R
Outdated
by <- dplyr::common_by(by, x, y) | ||
y <- dplyr::auto_copy(x, y, copy = copy) | ||
by_x <- by$x | ||
by_x <- by$x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please double-check, also if we could perhaps write by$x
instead? The docs for the on
argument suggest that we could also avoid renaming the columns in y
altogether:
on <- rlang::set_names(by$y, by$x)
w <- x[!y, which = TRUE, on = on]
...
Would that work with data.table 1.9.6?
dplyr already imports rlang, so no extra dependency added, but we need to explicitly declare the import in DESCRIPTION to be able to use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also use @import rlang
and call rlang functions unqualified. I have started on a tidyevalish implementation (not fully tidy because we'll need to flatten quosures) a couple weeks ago, so we'll import it in the future anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. I wasn't aware of that. And it was already introduced in data.table 1.9.6.
Since using by$y
in trim_y_for_semi_join
can be used as well, I removed the one line function altogether.
Also added @import rlang
and called set_names
unqualified.
Agreed. Note that in cases where it makes sense, you can now do: expr <- quote(a <- b)
exprs <- list(quote(foo(a)), quote(foo(b)))
fn <- expr_interp(function(...) {
!! expr
!!! exprs
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks! Ready to merge if you could please add a bullet to NEWS.
Thanks! |
In data.table v1.9.6,
merge.data.table
has gained aby.x
andby.y
argument. Also it does not copy anymore, and does not need a key, so usingmerge
to perform joins should be more efficient and more convenient.From
?merge.data.table
:Using column renaming and the
on =
argument for non-merge
joins,semi_join.data.table
andanti_join.data.table
, saves the need forsetkey
and copying.This should fix #20, #21.
@krlmlr, we checked this out together, you may have another look. I had to remove some
with = FALSE
, which only work for thej
, not for thei
argument. But also withoutwith = FALSE
, the current solution works if a column is namedy_labels
, so we should be fine. Alsonomatch = 0L
cannot be used in an anti join.