Feature Request: join_many_to_one() and join_one_to_one() #427

billdenney · 2021-02-05T02:56:41Z

I often need to perform join operations which confirm that the join is either 1:1 or many:1. I have written functions for this in my personal package, but I think that they may be of general interest. That said, they are a bit of a stretch for normal janitor functionality, so if you don't think it's a good fit, no worries.

The text was updated successfully, but these errors were encountered:

billdenney · 2022-04-15T15:03:50Z

@sfirke , If I make a PR for this, do you think it would be of interest? (And no worries if you think it's out of scope.)

sfirke · 2022-04-15T15:10:35Z

Sorry I missed this the first time. Can you explain a little more? I'm reading that as a wrapper for dplyr::left_join that fails if a record from the left table is matched to more than one in the right table?

billdenney · 2022-04-15T15:17:05Z

You've effectively got it. For join_one_to_one() it is a wrapper for full_join() that ensures that each row in both the left and right arguments is matched exactly once. For join_many_to_one() it is a wrapper for left_join() that ensures that each row in the left side matches exactly one row in the right table (but, it does not require that all rows in the right table are matched).

In my personal package, I have a more complex join_control() function that lets you specify several kinds of requirements on the join like no missing rows, no duplicated rows, all rows must be present, etc. and you can combine the requirements.

jzadra · 2022-04-15T16:05:11Z

If you don't put this in janitor, I'd love to see your personal code. I also run into these questions after a join. Looking at the row numbers before and after tells you something, but this sounds like it's much more specific.

billdenney · 2022-04-15T19:51:15Z

@jzadra , I just made a PR for them. Yes, the personal code is much more extensive in its testing. It creates a row number column for each of the input data.frames and then checks that the output row numbering is as-expected for missingness, uniqueness, etc.

jzadra · 2022-04-15T19:57:42Z

@billdenney Sounds great. It would be cool if it also provided a summary of which variables were used in the join match that ended up producing multiple rows (and showing the value of those variables), or that didn't have a match and resulted in NAs for the variables from the table being added. Happy to help with this, I think I'd use it a lot.

billdenney · 2022-04-15T19:59:32Z

@jzadra , It does that summary reporting, too. 😄

jzadra · 2022-04-15T20:04:27Z

@billdenney Sweet! Is your personal package public?

billdenney · 2022-04-15T20:08:54Z

It's kinda, sorta public. It is public in that it's on GitHub. You're welcome to check it out (it's my bsd.report repo). But, it should not be relied upon to be stable. I change things in it all the time and do not consider the API to be stable at all. And, I'm aware of the parts of it that are well-thought-through and tested vs the ones that are one-off hacks that got some job done for a given day.

billdenney · 2022-04-15T20:14:01Z

@jzadra , For this particular feature, I'd suggest going to PR #480.

sda030 · 2022-06-24T12:45:38Z

Any reason why no suggesting this for the dplyr package? I think many people have similar issues when joining, and I prefer the idea of having all join functions in the same package - mentally easier to remember that one package does everything related to a topic rather than splitting across several. I also think that you join_control is the better selling point than the join_one_to_one(). Perhaps the latter convenience wrapper is better suited for jantitor?

billdenney · 2022-06-24T16:17:40Z

@sda030 , That's a good question. I thought I had proposed something like this in dplyr in the past, but I had not. I'll put a feeler out over there to see if it would be of interest for dplyr.

billdenney · 2022-06-24T16:28:01Z

@sda030 , And I've now added an issue over there to see if it would be of interest.

billdenney mentioned this issue Apr 15, 2022

implement join controlling functions #480

Closed

billdenney mentioned this issue Jun 24, 2022

Feature Request: join_many_to_one() and join_one_to_one() tidyverse/dplyr#6307

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: join_many_to_one() and join_one_to_one() #427

Feature Request: join_many_to_one() and join_one_to_one() #427

billdenney commented Feb 5, 2021

billdenney commented Apr 15, 2022 •

edited

Loading

sfirke commented Apr 15, 2022

billdenney commented Apr 15, 2022

jzadra commented Apr 15, 2022

billdenney commented Apr 15, 2022

jzadra commented Apr 15, 2022

billdenney commented Apr 15, 2022

jzadra commented Apr 15, 2022

billdenney commented Apr 15, 2022

billdenney commented Apr 15, 2022

sda030 commented Jun 24, 2022

billdenney commented Jun 24, 2022

billdenney commented Jun 24, 2022

Feature Request: join_many_to_one() and join_one_to_one() #427

Feature Request: join_many_to_one() and join_one_to_one() #427

Comments

billdenney commented Feb 5, 2021

billdenney commented Apr 15, 2022 • edited Loading

sfirke commented Apr 15, 2022

billdenney commented Apr 15, 2022

jzadra commented Apr 15, 2022

billdenney commented Apr 15, 2022

jzadra commented Apr 15, 2022

billdenney commented Apr 15, 2022

jzadra commented Apr 15, 2022

billdenney commented Apr 15, 2022

billdenney commented Apr 15, 2022

sda030 commented Jun 24, 2022

billdenney commented Jun 24, 2022

billdenney commented Jun 24, 2022

billdenney commented Apr 15, 2022 •

edited

Loading