Issues when joining by named vector #222

Closed
mgirlich opened this issue Mar 9, 2021 · 1 comment · Fixed by #240

Comments

@mgirlich
Collaborator

mgirlich commented Mar 9, 2021

I wanted to fix #198 and discovered some more issues when joining by a named vector:

  • left/full/inner_join(): tbl_vars() reports the wrong variables
  • left/full/inner_join(): the result columns come back in the wrong order
  • right_join() fails, complaining about a missing variable
  • semi/anti_join() fail, complaining about a missing variable

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(x = 1, y = 1, z = 2)
df2 <- data.frame(x = 1, y = 2)
dt1 <- lazy_dt(df1, "dt1")
dt2 <- lazy_dt(df2, "dt2")

# left/full/inner
left_join(dt1, dt2, by = c("x", z = "y"))
#> Source: local data table [1 x 3]
#> Call:   merge(dt1, dt2, all.x = TRUE, all.y = FALSE, by.x = c("x", "z"
#> ), by.y = c("x", "y"), allow.cartesian = TRUE)
#> 
#>       x     z     y
#>   <dbl> <dbl> <dbl>
#> 1     1     2     1
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
left_join(df1, df2, by = c("x", z = "y"))
#>   x y z
#> 1 1 1 2

# note the wrong column order above (dplyr returns x, y, z); tbl_vars() is wrong too
left_join(dt1, dt2, by = c("x", z = "y")) %>% tbl_vars()
#> <dplyr:::vars>
#> [1] "x"   "z"   "y.x" "y.y"

# right
right_join(dt1, dt2, by = c("x", z = "y"))
#> Error: `by` can't contain join column `z` which is missing from LHS.
right_join(df1, df2, by = c("x", z = "y"))
#>   x y z
#> 1 1 1 2

# semi/anti
semi_join(dt1, dt2, by = c("x", z = "y"))
#> Error: argument specifying columns specify non existing column(s): cols[2]='z'
semi_join(df1, df2, by = c("x", z = "y"))
#>   x y z
#> 1 1 1 2

anti_join(dt1, dt2, by = c("x", z = "y"))
#> Error in colnamesInt(i, unname(on), check_dups = FALSE): argument specifying columns specify non existing column(s): cols[2]='z'
anti_join(df1, df2, by = c("x", z = "y"))
#> [1] x y z
#> <0 rows> (or 0-length row.names)

Created on 2021-03-09 by the reprex package (v1.0.0)
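
For readers less familiar with the named-vector form of `by`: in `by = c("x", z = "y")`, an unnamed element means the column has the same name on both sides, while a named element maps a left-hand column (the name) to a right-hand column (the value). The sketch below, in base R, shows one way that split could be computed. `split_by()` is a hypothetical helper written for illustration only; it is not a function from dtplyr or dplyr and is not claimed to be how either package does it.

```r
# Hypothetical helper (not part of dtplyr or dplyr): split a dplyr-style
# `by` vector into the left-hand (x) and right-hand (y) join columns.
split_by <- function(by) {
  by_y <- unname(by)   # values are always the right-hand column names
  by_x <- names(by)    # names, where present, are the left-hand column names
  if (is.null(by_x)) {
    # Fully unnamed vector: same column names on both sides
    by_x <- by_y
  } else {
    # Partially named vector: unnamed elements have "" as their name
    unnamed <- by_x == ""
    by_x[unnamed] <- by_y[unnamed]
  }
  list(x = by_x, y = by_y)
}

res <- split_by(c("x", z = "y"))
res$x  # "x" "z"
res$y  # "x" "y"
```

These two vectors correspond to the `by.x = c("x", "z")` and `by.y = c("x", "y")` arguments visible in the `merge()` call printed for `left_join()` above.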

@hadley
Member

hadley commented Mar 9, 2021

I have great difficulty correctly reasoning about joins so this doesn't surprise me 😞
