tabyl not counting NAs when variable is a factor #111

Closed
emilelatour opened this issue Apr 17, 2017 · 11 comments
Comments

emilelatour commented Apr 17, 2017

I thought that tabyl used to count the number of NAs when the variable was a factor, but for some reason it doesn't seem to be doing it this morning. Here is an example to illustrate the issue:

Create a data set as an example

> my_cars <- rbind(mtcars, rep(NA, 11))
> tail(my_cars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
33               NA  NA    NA  NA   NA    NA   NA NA NA   NA   NA

Here my_cars$cyl is numeric (it could also be a character and it would still work correctly):

> tabyl(my_cars$cyl)
  my_cars$cyl  n    percent valid_percent
1           4 11 0.33333333       0.34375
2           6  7 0.21212121       0.21875
3           8 14 0.42424242       0.43750
4          NA  1 0.03030303            NA

Now if I change my_cars$cyl to a factor, tabyl does not count NAs

> my_cars$cyl <- factor(my_cars$cyl)
> tabyl(my_cars$cyl)
  my_cars$cyl  n percent valid_percent
1           4 11 0.34375       0.34375
2           6  7 0.21875       0.21875
3           8 14 0.43750       0.43750
4        <NA>  0 0.00000            NA

This seems to be a new phenomenon, as I've been using the janitor package with this data set for a few months now and never had this come up.

Owner

sfirke commented Apr 17, 2017

Hi and thanks for filing the bug report with an example! I cannot reproduce this behavior, either with janitor 0.2.1 on CRAN or janitor 0.2.1.9000 from GitHub (the current development version). I get:

library(janitor)
my_cars <- rbind(mtcars, rep(NA, 11))

tabyl(my_cars$cyl)
#>   my_cars$cyl  n    percent valid_percent
#> 1           4 11 0.33333333       0.34375
#> 2           6  7 0.21212121       0.21875
#> 3           8 14 0.42424242       0.43750
#> 4          NA  1 0.03030303            NA

my_cars$cyl <- factor(my_cars$cyl)
tabyl(my_cars$cyl)
#>   my_cars$cyl  n    percent valid_percent
#> 1           4 11 0.33333333       0.34375
#> 2           6  7 0.21212121       0.21875
#> 3           8 14 0.42424242       0.43750
#> 4        <NA>  1 0.03030303            NA

Can you run sessionInfo() and let me know what version of janitor you're running?
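
If the full sessionInfo() output is too long to paste, a shorter way to report just the relevant versions might look like the sketch below. The list of packages (janitor, dplyr, tidyr) is an assumption based on the packages mentioned in this thread; adjust it as needed.

```r
# Packages assumed to be relevant to this issue (taken from this thread).
pkgs <- c("janitor", "dplyr", "tidyr")

# Keep only the packages that are actually installed, so the lookup cannot fail.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Look up the installed version of each package as a character string.
versions <- vapply(pkgs[is_installed],
                   function(p) as.character(utils::packageVersion(p)),
                   character(1))
print(versions)
```

This only reports package versions; sessionInfo() additionally shows the R version and platform, which can also matter.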

Author

emilelatour commented Apr 17, 2017 via email

Owner

sfirke commented Apr 17, 2017

Hm, I wonder if it's the new dplyr (which will require some updates on my end anyway when it hits CRAN). I will try installing your versions of dplyr + tidyr and see what's going on.

Author

emilelatour commented Apr 17, 2017 via email

Author

emilelatour commented Apr 17, 2017 via email

Owner

sfirke commented Apr 17, 2017

Odd. What's your sessionInfo() now that it's working?

Author

emilelatour commented Apr 17, 2017 via email

Owner

sfirke commented Apr 17, 2017

I upgraded to the soon-to-be-released dplyr 0.5.0.9002 and now get the error you report above. Will look into it.

Owner

sfirke commented Apr 17, 2017

The error is that dplyr joins now don't match on NA.

Note to self: the following works under dplyr >0.5.0.9002, but it requires the na_matches argument in left_join, so it won't work on dplyr 0.5.0, which probably makes it unsuitable for janitor anytime soon.

    if (is.factor(vec)) {
      expanded <- tidyr::expand(result, vec)
      result <- dplyr::left_join(expanded,
                                 result,
                                 by = "vec",
                                 na_matches = "na") # let the NA key match in the join
      if (sort) {
        result <- dplyr::arrange(result, dplyr::desc(n)) # undo reorder caused by expand() + left_join()
      }
    }
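
The join behavior described above can be seen in isolation with a small sketch. The two data frames here are hypothetical stand-ins, not janitor's actual internals, and the example assumes a dplyr version whose left_join() accepts the na_matches argument.

```r
library(dplyr)

# Hypothetical stand-ins: every value of the variable (including NA),
# and the observed counts per value.
all_levels <- data.frame(vec = c("4", "6", "8", NA), stringsAsFactors = FALSE)
counts <- data.frame(vec = c("4", "6", "8", NA),
                     n = c(11L, 7L, 14L, 1L),
                     stringsAsFactors = FALSE)

# NA keys are not treated as equal, so the NA row finds no match and its n is NA.
no_match <- left_join(all_levels, counts, by = "vec", na_matches = "never")

# NA keys are treated as equal, so the NA row keeps its count of 1.
with_match <- left_join(all_levels, counts, by = "vec", na_matches = "na")

print(no_match)
print(with_match)
```

If a later step fills missing counts with 0 (an assumption about tabyl's internals), the unmatched NA row would explain the zero count in the original report.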

Owner

sfirke commented Apr 17, 2017

Thanks for reporting this! Now I feel good about the integrity of janitor under the impending launch of dplyr 0.6.0. I had test coverage for this, so would have seen the failing tests, but only after dplyr 0.6.0 launched - it feels good to be proactive.
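
For anyone who wants to check their own installation, a minimal sketch of such a regression check is below. It is not the actual test from janitor's test suite, just the example from this thread turned into assertions with base R's stopifnot().

```r
library(janitor)

# Rebuild the example from this thread: mtcars plus one all-NA row.
my_cars <- rbind(mtcars, rep(NA, 11))
cyl_factor <- factor(my_cars$cyl)

res <- tabyl(cyl_factor)

# Pick out the row whose value is NA (the first column holds the tabulated values).
na_row <- res[is.na(res[[1]]), ]

# The single NA should be counted once, out of 33 rows in total.
stopifnot(nrow(na_row) == 1)
stopifnot(na_row$n == 1)
stopifnot(isTRUE(all.equal(na_row$percent, 1 / 33)))
```

If any of these assertions fails, the installed combination of janitor and dplyr is showing the behavior reported in this issue.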

Author

emilelatour commented Apr 17, 2017 via email
