updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

daniellemccool · 2016-05-24T16:12:10Z

I'm not totally certain if this is a data.table specific issue since data.table is the only function in R that works this way that I know of, but it's possible it's a function of the way factors are handled in general. Here's my minimal working example:

library(data.table)
A <- data.table(foo = c(1, 2, 3), bar = c(4, 5, 6))
B <- data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6))
setkey(A, foo)
setkey(B, foo)
A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
B[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
B[A, bar := i.bar]
B[, .N, bar]
B[is.na(bar)]

As you can see, if you try to update by reference, you end up with (the original) DT A's NA being a proper NA value, but DT B's NA is now another factor level. If only one of the columns contains NA, there's no problem.

The text was updated successfully, but these errors were encountered:

daniellemccool · 2016-05-26T12:56:50Z

So I've still got no clue where things actually go wrong, but it's worth noting that you can't select it by, for example, either by

B[bar == "NA"] # Expected

B[bar == levels(bar)[3]] # Kind of figured that would work at least, but it clearly identifies the level as NA instead of NA as a factor level

What does work to get around the issue is droplevels()

B[, bar := droplevels(bar)]

Maybe that gives some extra insight into what happens.

jangorecki · 2016-05-26T13:56:34Z

I'm not sure if having NA values as a factor level is a correct usage of factor. I would expect to have NA factor value, not NA factor level. Now your B$bar has one NA value and another not NA value referring to NA level.
Try the following:

levels(B$bar)
levels(factor(c(letters[1:3], NA)))

It avoids creating NA level.

daniellemccool · 2016-05-26T14:36:57Z

Hi jangorecki,

The goal here was to illustrate a problem that I'm having in my real dataset, wherein I am merging multiple data sets a column at a time after converting them to factors (the idea being that this reduces the human error component. If I have, for example, a column sex in two different data tables, and in one, it's {0, 1} and in the other, it's {1, 2}, then I can better first turn them both into factors before the join than I can manually convert the numbers.

What happens, however, when I have NA values (not as factor levels, but properly NA) in both factors before the merge, they merge into two different NAs somehow, as is demonstrated by my first code chunk. Before the join, A$bar and B$bar both contained one NA value. (Not a factor level.) After the join, B$bar contains one NA value and one NA factor. That's not the intended behavior, and I'm not sure why it happens. Instead there should be, as you say, two values of NA in B$bar.

Does that make sense?

Edit: To respond to your suggestion, the problem is that I would rather have the join and assign-by-reference work as it should rather than having to recreate the factor levels again after the fact. I CAN of course do that, but it adds another layer to something that ought to work intrinsically that is -- the expected behaviour from joining the two data frames:

library(data.table)
A <- data.table(foo = c(1, 2, 3), bar = c(4, 5, 6))
B <- data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6))
setkey(A, foo)
setkey(B, foo)
A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
# A$bar now looks like c("Boop", "Beep", NA) where NA is a value, not a level. This is correct.
is.na(A$bar) # FALSE, FALSE, TRUE

B[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
# B$bar looks like c(NA, NA, NA, "Boop", "Beep", NA), where NA is a value, not a factor level.  Also correct.
is.na(B$bar)  # TRUE, TRUE, TRUE, FALSE, FALSE, TRUE


B[A, bar := i.bar] # This is where things go wrong.  
# The NA from A becomes a factor LEVEL and not a value.  I can't imagine why it does so.

is.na(B$bar) # FALSE, FALSE, FALSE, FALSE, FALSE, TRUE

But not only does it become a factor level, it's fundamentally impossible to access with any sort of boolean logic referring to the value it takes on. I can't call it to change it like so:

B[bar == "NA", bar := NA]

And I can't even access it by referencing what levels() indicates is its factor level. (I don't want it to be a factor level, of course. I want it to have the value NA. I just thought this might provide more insight.)

B[bar == levels(bar)[3], bar := NA]

I hope that helps with understanding what I'm trying to talk about. The error is that it becomes a factor level after the join and update by reference. It shouldn't. It should stay with the value NA.

arunsrinivasan · 2016-05-26T14:51:44Z

This particular case happens because levels doesn't contain NA in both data.tables. And the internal logic for generating levels in an attempt to add missing levels handles this case wrongly. The root of the problem can be seen by doing:

as.integer(B$bar)
# [1]  1  2  3  1  2 NA

after the join+update operation. The 3rd value needs to be NA. Will fix.

…evels. #1718

arunsrinivasan added the bug label May 26, 2016

arunsrinivasan closed this as completed in 9be2b13 May 26, 2016

arunsrinivasan added this to the v1.9.8 milestone May 26, 2016

arunsrinivasan self-assigned this May 26, 2016

mattdowle added a commit that referenced this issue Jun 29, 2017

Test 1675.1 updated to cope with a change in R-devel related to and l…

58b4f2d

…evels. #1718

mattdowle added a commit that referenced this issue Jun 29, 2017

Minor news item update. #1718

ea1e50e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

daniellemccool commented May 24, 2016

daniellemccool commented May 26, 2016 •

edited

Loading

jangorecki commented May 26, 2016 •

edited

Loading

daniellemccool commented May 26, 2016 •

edited

Loading

arunsrinivasan commented May 26, 2016

updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

Comments

daniellemccool commented May 24, 2016

daniellemccool commented May 26, 2016 • edited Loading

jangorecki commented May 26, 2016 • edited Loading

daniellemccool commented May 26, 2016 • edited Loading

arunsrinivasan commented May 26, 2016

daniellemccool commented May 26, 2016 •

edited

Loading

jangorecki commented May 26, 2016 •

edited

Loading

daniellemccool commented May 26, 2016 •

edited

Loading