Explicitly handling missingness in join columns #2499
Comments
Thank you for commenting on it. The API is that by default joins are performed using However, I understand this request and we can add new functionality in which you could opt-in to handle |
I like that joins are fast. I want to add a dispatch for joining columns of type
That's a great point that merits more thinking. If
I'm glad the stability of a 1.0 release is taken seriously. That said, the primary goal of this issue is to intentionally break unsafe code by raising an error -- which would ideally happen before 1.0. If more leniency is desired after that release, it's easy to switch to a more permissive model without breaking code. |
The problem is that such columns might not contain
I understand this. In general we have a tension between two completely opposing points of view:
E.g. for In general - while I perfectly understand your reasoning for the above reasons this is not a simple decision (if we wanted to go this way we should consult the community first and get the consensus). The additional point is that the current behavior does not lead to irreversible data corruption - you just get more results than you would get with a strict approach, and they are easy to filter out later. Also - as commented - the most consistent approach with In general this request is kind of similar to asking Looking at it from another angle the question is: how often would the behavior we currently have lead to significant problems? (leaving aside the logical purity of your proposal which I appreciate and agree with) @nalimilan - can you please comment on what you think here. I am tagging it for a decision before the 1.0 release. |
This is true in the case of groupby: a
Perhaps there's a way for users to specify how strict they want to be: using DataFrames.Strict vs using DataFrames.Permissive, and
ctx = DataFrames.strict()
c = innerjoin(ctx, a,b) |
By the way @bkamins , I want to thank you for considering this suggestion. Missingness handling is a big decision, and regardless of how DataFrames.jl ultimately decides to proceed, I'm appreciative that you're taking the time to think seriously about the varied needs of the JuliaData community. |
indeed it drops it, but there is no way to "properly fix it" without filling the missings (in your approach you would throw an error in this case but then the user gets nothing - actually this is something I do in #2494 by default, but there it is less debatable).
We avoid keeping global state in DataFrames.jl. Probably if we add it then it will be via kwarg. The decision to make is what we want to be the default (as if we keep the current default this PR is not urgent, but if we decide that the default is what you propose we need to implement it now). |
Thanks for spotting this. Indeed the presence of missing values in the join keys seems to be more dangerous than e.g. in I suspect that in practice people don't perform joins on columns with missing values very frequently, and that when they do so it's often due to a mistake. We could deprecate this in 0.22 and see whether we get complaints. A keyword argument would have to be added to disable the deprecation warning. But there are several possible behaviors to choose: throw an error (default); treat |
If we went for this I would do
I will ask for opinions on slack |
Do any languages support joining where I would be most comfortable with always throwing an error and not allowing an option to do otherwise. |
FWIW dplyr supports this and does the same as we currently do by default:
(Looks like they anticipated having to support more than two options but so far didn't add them. :-) We should also consider |
I just don't see how the example in the OP can do anything other than what it currently does. You are labelling it to be a false negative only because you constructed the example. Suppose instead that I hand you those two datasets, now how certain are you that those observations correspond to each other? Suppose you observe another John who worked at a different place (or the same place!). Then what? Real data is messy like this. I think the typical way (at least, what I've encountered and what I usually do) to merge datasets when you have missing keys is a multiple step merge. You first merge based on what you know gives the best match, then from the remaining unmatched subsets you might match on different -- less good -- criteria. But you might also not do this second step. It requires assumptions and outside knowledge. This isn't something the package should do for you. If anything like this is implemented I would want it to be off by default (put me in the strict camp I guess). |
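The multiple-step merge described above can be sketched in plain Julia. This is only an illustration under stated assumptions: the records, the `match_on` helper, and the choice of `name` as the weaker fallback key are all hypothetical, and none of this is DataFrames.jl API.

```julia
# Hypothetical records; `missing` marks an unknown employer.
ages = [
    (name = "John", employer = "Acme", age = 41),
    (name = "Mary", employer = missing, age = 35),
]
salaries = [
    (name = "John", employer = "Acme", salary = 50_000),
    (name = "Mary", employer = "Initech", salary = 62_000),
]

# Return index pairs (i, j) whose key columns are all present and equal.
# Rows with a missing key value are never matched here.
function match_on(left, right, cols)
    matches = Tuple{Int,Int}[]
    for (i, l) in enumerate(left), (j, r) in enumerate(right)
        lk = map(c -> getproperty(l, c), cols)
        rk = map(c -> getproperty(r, c), cols)
        if !any(ismissing, lk) && !any(ismissing, rk) && lk == rk
            push!(matches, (i, j))
        end
    end
    return matches
end

# Step 1: strict match on the best available key.
strict = match_on(ages, salaries, (:name, :employer))

# Step 2: among the rows left unmatched, fall back to a weaker key.
# Whether this is acceptable depends on outside knowledge about the data.
left_rest  = setdiff(eachindex(ages), first.(strict))
right_rest = setdiff(eachindex(salaries), last.(strict))
fallback = [(left_rest[i], right_rest[j])
            for (i, j) in match_on(ages[left_rest], salaries[right_rest], (:name,))]
```

The second step is deliberately separate and optional: the analyst, not the join, decides whether a name-only match is good enough.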
I would be unfavor of doing Since turning errors into non-errors is not breaking. The deprecation is fiddly. |
I understand you wanted to say "in favor". Right? This is what pandas does (essentially what we do now):
In summary we essentially have three options then:
So we would add a kwarg that would most likely take |
As mentioned above, I don't think we should allow other options besides erroring at all. We can always add it later. |
But why? If we make error the default, then I think it is better to allow users to get the current behavior by using a kwarg (so that they can easily update their code). What downsides of this approach do you see? |
Only that exposing a larger API means more to maintain. I also find it hard to motivate real-world examples where matching |
Given how simple it is to support the current behavior using a keyword argument, it would be quite extreme to completely stop supporting it. |
|
Okay, then I think we should support the keyword arguments. |
@adkabo - in summary we will follow your recommendation then most probably (let us wait a few days for the feedback). Thank you for raising the issue. |
I guess maybe I misunderstood exactly what was on the table here? I do support the proposal if it is essentially just about the matching-ness of Can someone clarify what would happen, though, in OPs example? Either of the non-erroring options still returns empty, yes? |
Do any languages implement the "distinct" solution? Otherwise it's probably not worth supporting that. (I couldn't find what Stata does.)
Yes. |
|
This is what I meant at the top of this discussion. We will not implement it now, but as it is the most logically consistent approach I would not rule it out in the design (just like R did). |
Cross joins are a very special kind of join (I almost wouldn't consider it as a join). They don't need to compare values so I don't think |
Another data point: in SQL inner joins, NULL is treated as distinct from all other values, which is considered a trap (see Wikipedia). However NULLs are preserved by left/right/outer/cross joins. |
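For comparison with SQL's NULL handling, here is how Base Julia's own equality functions treat `missing`. The `keys_match_distinct` helper is a hypothetical name, written only to illustrate a "distinct" rule in which a missing key never matches anything.

```julia
# Base Julia only; no packages needed.

# `==` follows three-valued logic, much like SQL's `=` on NULL:
# the result is `missing` ("unknown"), not `true` or `false`.
@show missing == missing          # missing
@show missing == "Acme"           # missing

# `isequal` always returns a Bool and treats `missing` as equal to itself.
@show isequal(missing, missing)   # true
@show isequal(missing, "Acme")    # false

# Hypothetical "distinct" rule for scalar keys:
# a key that is missing on either side never matches.
keys_match_distinct(a, b) = !ismissing(a) && !ismissing(b) && a == b

@show keys_match_distinct(missing, missing)  # false
@show keys_match_distinct("Acme", "Acme")    # true
```

Which of these predicates a join uses on its key columns is exactly the choice being debated in this thread.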
I think in a typical relational database, columns that you use as primary keys or foreign keys should be unique, distinct, and not empty. I think we are complicating the processing because that assumption does not hold. By making the API flexible, we are allowing a mistake in data encoding to be addressed by the dataframe. Some basic assumptions in relational data should be met, and if they are not met, the API should not be changed because it will just make the design messy. |
The data should be fixed, not the algorithm for joining tables. Perhaps the join operation should raise an error if the index column has missing values, because the table violates the requirement of joining on indexed keys, or skip the rows with missing values, because the algorithm should not guess. Keys with missing values lack the information needed to relate one table to another. |
I support throwing an error but then allow an option to allow matching. |
I favor following SQL behavior. |
Julia has a great philosophy of taking missingness seriously. For example, unlike in Pandas or Postgres, sum([1,2,missing]) gives missing. However, this philosophy hasn't yet been applied to all of the functions in the JuliaData ecosystem. I'll give an example to illustrate.

My goal is to find the relationship between age and salary. To find out, I will combine observations from two datasets.
One with age,
and one with salary.
In a complete-data environment, if these observations correspond to the same person, I want one row in the joined dataframe; on the other hand, if they correspond to different people, I want zero rows in the join.
Given the missingness in the "employer" column, we don't know if that's the same person or not. So when I join them on ("name", "employer"), we cannot know the right answer. Yet the join makes a decision implicitly, returning an empty dataframe. If the observations correspond to the same person, this result -- failing to match the two observations -- is a false negative.
To avoid drawing mistaken conclusions from analysis, I would like to extend the practice of enforcing explicit handling of missing values. So I would like to get an error message in this case by default, and to actively tell innerjoin() how it should handle missing values.

The passmissing() and skipmissing() patterns used elsewhere in JuliaData are a great reassurance that Julia is looking out for missing data problems. When applied to joins, I would like to consider:

I'm not sure if something like passmissing(innerjoin)(a,b) would work, or if it should be more like innerjoin(a,b, missingrule=:drop) or something else. But I do want to start the conversation about it.

#2243 has some discussion of missingness and joins, mainly focused on the a.fillna(b) use case.
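To make the proposal concrete, here is a minimal sketch of an inner join that refuses to guess unless told how to treat missing keys. It is a toy over vectors of NamedTuples in Base Julia, not DataFrames.jl code: `toy_innerjoin`, the `missingrule` keyword, its values `:error`, `:drop` and `:equal`, and the sample rows are all assumptions made for illustration.

```julia
# Toy inner join over vectors of NamedTuples (Base Julia only).
# `missingrule` and its values are hypothetical names used for illustration.
function toy_innerjoin(left, right; on, missingrule = :error)
    key(row) = map(c -> getproperty(row, c), on)
    hasmissing(row) = any(ismissing, key(row))

    if missingrule == :error
        if any(hasmissing, left) || any(hasmissing, right)
            throw(ArgumentError("missing values in join columns; choose a missingrule"))
        end
    elseif missingrule == :drop
        left  = filter(!hasmissing, left)
        right = filter(!hasmissing, right)
    elseif missingrule != :equal
        throw(ArgumentError("unknown missingrule: $missingrule"))
    end

    # `isequal` treats `missing` as equal to itself, which is what :equal needs;
    # under :error and :drop no missing keys reach this point.
    return [merge(l, r) for l in left for r in right if isequal(key(l), key(r))]
end

a = [(name = "John", employer = missing, age = 41)]       # hypothetical data
b = [(name = "John", employer = "Acme", salary = 50_000)]

toy_innerjoin(a, b; on = (:name, :employer), missingrule = :drop)   # 0 rows
toy_innerjoin(a, b; on = (:name, :employer), missingrule = :equal)  # 0 rows
# toy_innerjoin(a, b; on = (:name, :employer))  # throws ArgumentError
```

Both non-erroring rules return no rows for this input. The difference under the default is that the caller is forced to notice the missing key and pick a rule, instead of silently receiving an empty result.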