Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: join_many_to_one() and join_one_to_one() #6307

Closed
billdenney opened this issue Jun 24, 2022 · 5 comments
Closed

Feature Request: join_many_to_one() and join_one_to_one() #6307

billdenney opened this issue Jun 24, 2022 · 5 comments

Comments

@billdenney
Copy link
Contributor

Related to sfirke/janitor#427

I often need to perform join operations which confirm that the join is either 1:1 or many:1. Is there interest in this and similar functions to confirmation the result of join operations?

The underlying idea is to ensure a connection between rows in a data.frame on both the left and right side of the join with a join_controller() function. The join_controller() function ensures that a row is matched on the left or right data.frame:

  • "any": Any outcome is acceptable; this overrides all other options.
  • "all": Each row from the input must appear in the output at least one time.
  • "unique": A row may appear in the output zero or one time.
  • "missing": At least one row must not match in the new dataset (the values must be missing). This option is rarely used.
  • "nomissing": All rows must match in the new dataset (the values must not be missing).

join_one_to_one() and join_many_to_one() are special cases of the above controls.

I've implemented it here: sfirke/janitor#480

library(bsd.report)

join_many_to_one(
  data.frame(x=rep(1:2, 2)),
  data.frame(x=1:4, y=5:8)
)
#> Joining, by = "x"
#>   x y
#> 1 1 5
#> 2 2 6
#> 3 1 5
#> 4 2 6

join_many_to_one(
  data.frame(x=rep(1:2, 2)),
  data.frame(x=c(1, 1:3), y=5:8)
)
#> Joining, by = "x"
#>   x
#> 2 1
#> Error in join_control_detect(x = ret, control = x_control, detect_column = x[, : `x`: Rows are not unique in the new dataset. Keys for duplicated rows are above.

join_one_to_one(
  data.frame(x=rep(1:2, 2)),
  data.frame(x=1:4, y=5:8)
)
#> Joining, by = "x"
#> Error in join_control_detect(x = ret, control = y_control, detect_column = y[, : `y`: All rows were are not in the new dataset. Missing rows: 3, 4

join_one_to_one(
  data.frame(x=rep(1:2, 2)),
  data.frame(x=1:2, y=5:6)
)
#> Joining, by = "x"
#>   x
#> 3 1
#> 4 2
#> Error in join_control_detect(x = ret, control = y_control, detect_column = y[, : `y`: Rows are not unique in the new dataset. Keys for duplicated rows are above.

join_one_to_one(
  data.frame(x=1:2),
  data.frame(x=1:2, y=5:6)
)
#> Joining, by = "x"
#>   x y
#> 1 1 5
#> 2 2 6

Created on 2022-06-24 by the reprex package (v2.0.1)

@DavisVaughan
Copy link
Member

You might like the new multiple and unmatched arguments in the dev version
https://dplyr.tidyverse.org/dev/reference/mutate-joins.html

@billdenney
Copy link
Contributor Author

@DavisVaughan , Thanks for pointing those out. Those are very close to the request!

In my ideal world, this would have a bit more granularity in the required matching. As an example, I see that "error" would be the desired behavior for requiring a 1:1 match. But, I sometimes want matches to have the following characteristics at the same time:

Overall, there are a lot of potential scenarios between the x and y arguments of joining. I will try to enumerate them below and indicate what I see as likely being possible in the current dev version-- and what I do not see is possible. Can you confirm that understanding?

  • match 0 or more rows from x and drop rows from y: left_join(x, y, multiple=[anything], unmatched="drop")
  • match 0 or more rows from x and insert rows from y: right_join(x, y, multiple=[anything], unmatched="drop")
  • match exactly 1 row from x and require exactly one matched from y: full_join(x, y, multiple="error", unmatched="error")
    • It's not clear that this would require 1:1 matching or if it would allow many:1 matching.
  • match exactly 1 row from x and require zero or one matched from y
    • I don't see how to do that. The closest that I see is: left_join(x, y, multiple="error", unmatched="error"), but that doesn't allow 0 matched rows; it only allows ` matched row.
  • match 1 or more rows from x and require exactly one matched from y: full_join(x, y, multiple="error", unmatched="error")
  • match 1 or more rows from x and require zero or one matched from y
    • I don't see how to do that. The closest that I see is: left_join(x, y, multiple="error", unmatched="error"), but that doesn't allow 0 matched rows; it only allows 1 matched row, I think.

Am I missing control abilities?

@eutwt
Copy link
Contributor

eutwt commented Jul 19, 2022

@billdenney

Obviously not as convenient as having it built into a join function, but you might check out these functions from the {dm} package if you haven't already.

https://cynkra.github.io/dm/reference/examine_cardinality.html

Edit: I missed that you already implemented it, so probably aren't looking for solutions in general but just proposing a dplyr feature. Oops. Oh well, I'll leave the link in case its useful to others.

@billdenney
Copy link
Contributor Author

@eutwt , Thanks for the pointer to {dm}. I like several of the things that the package provides, and it seems to cover some of the space of what I'm looking for.

I do prefer the join to be combined with the check so that it has a combined success or failure. (Though I understand the benefits of the separation, too.)

Though your subsequent read is correct, I have implemented it for myself and was hoping to make it available to others.

@hadley
Copy link
Member

hadley commented Jul 21, 2022

Thanks for filing this issue! Unfortunately, I think it's unlikely we'll work on it in the short term because we've already spend a considerable amount of time analysing the features that folks need and implementing what we believe is the best tradeoff for payoff and implementation effort.

@hadley hadley closed this as completed Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants