Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

Closed
daniellemccool opened this issue May 24, 2016 · 4 comments
Assignees
Labels
Milestone

Comments

@daniellemccool
Copy link

I'm not totally certain if this is a data.table specific issue since data.table is the only function in R that works this way that I know of, but it's possible it's a function of the way factors are handled in general. Here's my minimal working example:

library(data.table)
A <- data.table(foo = c(1, 2, 3), bar = c(4, 5, 6))
B <- data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6))
setkey(A, foo)
setkey(B, foo)
A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
B[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
B[A, bar := i.bar]
B[, .N, bar]
B[is.na(bar)]

As you can see, if you try to update by reference, you end up with (the original) DT A's NA being a proper NA value, but DT B's NA is now another factor level. If only one of the columns contains NA, there's no problem.

@daniellemccool
Copy link
Author

daniellemccool commented May 26, 2016

So I've still got no clue where things actually go wrong, but it's worth noting that you can't select it by, for example, either by

B[bar == "NA"] # Expected

B[bar == levels(bar)[3]] # Kind of figured that would work at least, but it clearly identifies the level as NA instead of NA as a factor level

What does work to get around the issue is droplevels()

B[, bar := droplevels(bar)]

Maybe that gives some extra insight into what happens.

@jangorecki
Copy link
Member

jangorecki commented May 26, 2016

I'm not sure if having NA values as a factor level is a correct usage of factor. I would expect to have NA factor value, not NA factor level. Now your B$bar has one NA value and another not NA value referring to NA level.
Try the following:

levels(B$bar)
levels(factor(c(letters[1:3], NA)))

It avoids creating NA level.

@daniellemccool
Copy link
Author

daniellemccool commented May 26, 2016

Hi jangorecki,

The goal here was to illustrate a problem that I'm having in my real dataset, wherein I am merging multiple data sets a column at a time after converting them to factors (the idea being that this reduces the human error component. If I have, for example, a column sex in two different data tables, and in one, it's {0, 1} and in the other, it's {1, 2}, then I can better first turn them both into factors before the join than I can manually convert the numbers.

What happens, however, when I have NA values (not as factor levels, but properly NA) in both factors before the merge, they merge into two different NAs somehow, as is demonstrated by my first code chunk. Before the join, A$bar and B$bar both contained one NA value. (Not a factor level.) After the join, B$bar contains one NA value and one NA factor. That's not the intended behavior, and I'm not sure why it happens. Instead there should be, as you say, two values of NA in B$bar.

Does that make sense?

Edit: To respond to your suggestion, the problem is that I would rather have the join and assign-by-reference work as it should rather than having to recreate the factor levels again after the fact. I CAN of course do that, but it adds another layer to something that ought to work intrinsically that is -- the expected behaviour from joining the two data frames:

library(data.table)
A <- data.table(foo = c(1, 2, 3), bar = c(4, 5, 6))
B <- data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6))
setkey(A, foo)
setkey(B, foo)
A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
# A$bar now looks like c("Boop", "Beep", NA) where NA is a value, not a level. This is correct.
is.na(A$bar) # FALSE, FALSE, TRUE

B[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
# B$bar looks like c(NA, NA, NA, "Boop", "Beep", NA), where NA is a value, not a factor level.  Also correct.
is.na(B$bar)  # TRUE, TRUE, TRUE, FALSE, FALSE, TRUE


B[A, bar := i.bar] # This is where things go wrong.  
# The NA from A becomes a factor LEVEL and not a value.  I can't imagine why it does so.

is.na(B$bar) # FALSE, FALSE, FALSE, FALSE, FALSE, TRUE

But not only does it become a factor level, it's fundamentally impossible to access with any sort of boolean logic referring to the value it takes on. I can't call it to change it like so:

B[bar == "NA", bar := NA]

And I can't even access it by referencing what levels() indicates is its factor level. (I don't want it to be a factor level, of course. I want it to have the value NA. I just thought this might provide more insight.)

B[bar == levels(bar)[3], bar := NA]

I hope that helps with understanding what I'm trying to talk about. The error is that it becomes a factor level after the join and update by reference. It shouldn't. It should stay with the value NA.

@arunsrinivasan
Copy link
Member

This particular case happens because levels doesn't contain NA in both data.tables. And the internal logic for generating levels in an attempt to add missing levels handles this case wrongly. The root of the problem can be seen by doing:

as.integer(B$bar)
# [1]  1  2  3  1  2 NA

after the join+update operation. The 3rd value needs to be NA. Will fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants