updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718
Comments
So I've still got no clue where things actually go wrong, but it's worth noting that you can't select it by, for example, either by
What does work to get around the issue is droplevels()
Maybe that gives some extra insight into what happens.
I'm not sure if having
It avoids creating an NA level.
Hi jangorecki,

The goal here was to illustrate a problem that I'm having in my real dataset, wherein I am merging multiple data sets a column at a time after converting them to factors (the idea being that this reduces the human error component). If I have, for example, a column sex in two different data tables, and in one it's {0, 1} and in the other it's {1, 2}, then it is safer to turn them both into factors before the join than to convert the numbers manually.

What happens, however, when I have NA values (not as factor levels, but properly NA) in both factors before the merge is that they somehow merge into two different NAs, as is demonstrated by my first code chunk. Before the join, A$bar and B$bar both contained one NA value. (Not a factor level.) After the join, B$bar contains one NA value and one NA factor level. That's not the intended behavior, and I'm not sure why it happens. Instead there should be, as you say, two values of NA in B$bar. Does that make sense?

Edit: To respond to your suggestion, the problem is that I would rather have the join and assign-by-reference work as it should than have to recreate the factor levels after the fact. I CAN of course do that, but it adds another layer to something that ought to work intrinsically. That is, the expected behaviour from joining the two data frames:
But not only does it become a factor level, it's fundamentally impossible to access with any sort of boolean logic referring to the value it takes on. I can't call it to change it like so:
And I can't even access it by referencing what levels() indicates is its factor level. (I don't want it to be a factor level, of course. I want it to have the value NA. I just thought this might provide more insight.)
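As a rough base R sketch of the distinction being described, and not a reproduction of the data.table join itself: the factors below are made up, and addNA() is used only to build a factor whose levels include NA, on the assumption that this resembles what the column looks like after the update. Re-encoding with factor() is shown as one possible cleanup; it is not necessarily what was suggested elsewhere in this thread.

```r
# Hypothetical data: a factor with one true missing value.
f_ok <- factor(c("a", "b", NA))

# addNA() turns the missing value into an explicit NA *level*.
f_lvl <- addNA(f_ok)

is.na(f_ok)            # the third element is reported as missing
is.na(f_lvl)           # no element is reported as missing any more
anyNA(levels(f_lvl))   # NA now lives in the levels instead

# An equality test against the string "NA" does not select that element.
which(f_lvl == "NA")

# Looking each element's level up by its integer code does find it.
is_na_level <- is.na(levels(f_lvl)[f_lvl])

# Re-encoding with factor() (default exclude = NA) drops the NA level
# and turns the element back into a true missing value.
f_fixed <- factor(f_lvl)
```

The lookup via levels(f_lvl)[f_lvl] is the piece that makes the NA level addressable; is.na() alone only inspects the integer codes.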
I hope that helps with understanding what I'm trying to talk about. The error is that it becomes a factor level after the join and update by reference. It shouldn't. It should stay with the value NA.
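For readers unfamiliar with the recoding idea mentioned in this comment (converting differently coded columns to factors with shared labels before joining), a minimal base R sketch could look like this. The vectors and the "male"/"female" mapping are assumptions for illustration only; the real data is not shown in the thread.

```r
# Hypothetical codings: one table stores sex as 0/1, the other as 1/2.
sex_a <- c(0, 1, NA, 1)
sex_b <- c(1, 2, NA, 2)

# Assumed mapping of codes to labels, shared by both tables.
sex_labels <- c("male", "female")

fa <- factor(sex_a, levels = c(0, 1), labels = sex_labels)
fb <- factor(sex_b, levels = c(1, 2), labels = sex_labels)

# Both factors now share the same levels, so they can be compared or joined.
identical(levels(fa), levels(fb))

# The missing entries stay true NA values; NA is not among the levels.
sum(is.na(fa))
anyNA(levels(fa))
```

This is the state described as "properly NA" before the join: one missing value per factor and no NA entry in levels().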
This particular case happens because levels doesn't contain as.integer(B$bar)
# [1] 1 2 3 1 2 NA after the join+update operation. The 3rd value needs to be
I'm not totally certain whether this is a data.table-specific issue, since data.table is the only package in R I know of that works this way, but it's possible it's a consequence of the way factors are handled in general. Here's my minimal working example:
As you can see, if you try to update by reference, you end up with (the original) DT A's NA being a proper NA value, but DT B's NA is now another factor level. If only one of the columns contains NA, there's no problem.