Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing column names after crosstab? #77

Closed
randomgambit opened this issue Oct 26, 2016 · 16 comments
Closed

missing column names after crosstab? #77

randomgambit opened this issue Oct 26, 2016 · 16 comments

Comments

@randomgambit
Copy link

randomgambit commented Oct 26, 2016

Hello there,

First of all, congrats for this package. Coming from stata and pandas, this is EXACLY what dplyr needed. A robust, easy to use package that excels at computing tabulations and cross-tabulations!

I have a quick question. Consider this simple example


var1<- c(5,10,2,2,NA)
var2<- c(NA,NA,NA,NA,NA)
var3<- c(5,10,2,2,1)
var4<- c('A','B','C','C','D')
var5<- c('G','B','C','F','D')
df = df=data.frame(var1,var2,var3,var4,var5)


> df
  var1 var2 var3 var4 var5
1    5   NA    5    A    G
2   10   NA   10    B    B
3    2   NA    2    C    C
4    2   NA    2    C    F
5   NA   NA    1    D    D

Now I compute a simple cross-tabulation

> df %>% crosstab(var1, var3,  percent = "row")
  var1 1 2 5 10
1    2 0 1 0  0
2    5 0 0 1  0
3   10 0 0 0  1
4   NA 1 0 0  0

I dont get why I only see the variable name var1 above and not var1 and var3. Given the table above, how could you possibly know that this is a cross-tabulation between var1 and var3?
Am I missing something?

Thanks!

@randomgambit
Copy link
Author

randomgambit commented Oct 26, 2016

by looking at the documentation, it seems this is the expected behavior of crosstab.

In my opinion, there is room for improvement here. In essence, by not showing the name of the second variable, we are losing information and our two-way frequency tables are not self-contained anymore. For a package specifically designed at cross-tabulations, I think this is a pretty serious issue.

What do you think?
In any case, your package is really great and I hope my comments will help you improve it!

Example output in Stata where all the information is kept.

. tab make foreign if  _n <5

                   |  Car type
    Make and Model |  Domestic |     Total
-------------------+-----------+----------
       AMC Concord |         1 |         1 
         AMC Pacer |         1 |         1 
        AMC Spirit |         1 |         1 
     Buick Century |         1 |         1 
-------------------+-----------+----------
             Total |         4 |         4 

@sfirke
Copy link
Owner

sfirke commented Oct 28, 2016

I agree that the ideal result would contain information about both variables used in the crosstab and have actually put considerable thought into this. I'm open to other ideas, but I haven't found a good way to do that without compromising the data.frame structure of the result.

For interactive use, something like that Stata result is nice, to have the full info. But then the result can't be written to .csv, passed to adorn_crosstab(), joined to another data.frame, etc.

gmodels::CrossTable does something very similar to that Stata output: http://stackoverflow.com/questions/13043817/mimic-tabulate-command-from-stata-in-r Perhaps that is your best bet. Or from this SO answer about xtabs:

t1 <- rep(c("A","B","C"),5)
t2 <- rpois(15,4)
df <- data.frame(ques=t1,resp=t2)
z <- xtabs(~ques+resp,data=df)

> z
    resp
ques 1 3 4 5 6 8 9
   A 0 3 0 1 0 0 1
   B 1 1 2 0 0 1 0
   C 0 0 2 1 1 1 0

But I don't like that z is of class "table". And the var2 name is lost when printing with knitr::kable or writing as a .csv.

write.csv(z, "dummy.csv")
knitr::kable(z)

and as.data.frame(z) gives an undesirable long result.

I don't see where the var2 name could go in a data.frame. I tried printing it for interactive use like this:

crosstab2 <- function(...){
  dots <- as.list(substitute(list(...)))[-1L]
  print(paste0("          ", deparse(dots[[2]])))
  crosstab(...)
}
> crosstab2(mtcars$cyl, mtcars$carb)
[1] "          mtcars$carb"
  mtcars$cyl 1 2 3 4 6 8
1          4 5 6 0 0 0 0
2          6 2 0 0 4 1 0
3          8 0 4 3 6 0 1

But in interactive use, you already have your command right above it the console output, so it's not as much of an issue - and then if the printed statement doesn't stay with the resulting data.frame, it doesn't solve the underlying problem.

Here's an Excel screenshot from trying to conceptualize this:

image

I don't think the var2 name can be included without creating a blank column.

@randomgambit
Copy link
Author

thanks for your detailed and nice answers.
One possibility: what about concatenating in some form or another the two column names, so that the output is

> df %>% crosstab(var1, var3,  percent = "row")
  var1/var3 1 2 5 10
      1    2 0 1 0  0
      2    5 0 0 1  0
      3   10 0 0 0  1
      4   NA 1 0 0  0

@sfirke
Copy link
Owner

sfirke commented May 5, 2017

I see the upside of this, but am stuck on two problems:

  1. var1/var3 doesn't describe the contents of column 1; that's var1. This breaks an R convention and is confusing if say you want to %>% filter(var1 == 3).
  2. It's an illegal variable name with that special character, so subsequent function calls would have to handle that.

My solutions are (a) for interactive use, you have the syntax immediately before the result so can look there; (b) for scripted use you can add a title in Markdown or tweak the first column name with kable().

@sfirke sfirke closed this as completed May 5, 2017
@randomgambit
Copy link
Author

@sfirke !!! I wrote this in October!!!!! :D
but, seriously, why dont you just add an option that says rowcol = True that, instead of var1/var3 writes the first 3 character of each variable separated by a - ? say, if the variables are samuel and firke it would show sam-fir?

@sfirke
Copy link
Owner

sfirke commented May 5, 2017

I'm hoping to get faster at responding :) Cleaning up some issues now before a minor CRAN submission.

That approach sounds like a good compromise. I'm going to redo the whole crosstab() function later this year, and I'll look at that as a possible option for crosstab() to take.

@sfirke sfirke reopened this May 5, 2017
@sfirke
Copy link
Owner

sfirke commented Oct 11, 2017

A year later and I'm struck by an idea.

To recap, this looks pretty nice with "cyl" on top:

> library(magrittr)
> mtcars %$% table(am, cyl)
   cyl
am   4  6  8
  0  3  4 12
  1  8  3  2

But it's not a data.frame.

This is a data.frame:

result <- mtcars %>%
  tabyl(am, cyl)

top <- result[1, ]
top[1, ] <- names(top)

bind_rows(
  top, result %>%
    mutate_if(is.numeric, as.character)
) %>%
  setNames(c("", "cyl", "", ""))

     cyl     
1 am   4 6  8
2  0   3 4 12
3  1   8 3  2

It's no longer any good for manipulating, since the column names are put in the 1st row of the data.frame - feels so wrong. BUT, it looks good for printing in a report. I think I could implement this as an adorn_, like adorn_col_title. I'd need to store the variable names as tabyl attributes, not hard, then it could do this.

I think it's at least worth a shot. Anyone want to implement this? The attribute stuff is not hard if you look at as_tabyl for guidance.

I wonder if it would throw off any bolding of column names that might occur in RMarkdown or another output format like a .docx template; would it bold only that single name in the row by itself?

@sfirke
Copy link
Owner

sfirke commented Oct 18, 2017

Note to self: allow it to take optional text argument to put on top. Default is to get it from the tabyl attributes, but allow for override. Use cases are overriding on a tabyl, or using with a non-tabyl, like some other tidyverse calculation ending with spread.

@randomgambit
Copy link
Author

Hi @sfirke sorry for my late andwer. Yes, I think this is a very good idea, and actually the only possible one when you think about it
but why not sticking to

> mtcars %$% table(am, cyl)
   cyl
am   4  6  8
  0  3  4 12
  1  8  3  2

at the end of the day? for printing, nobody cares whether this is a dataframe or a duck

@sfirke
Copy link
Owner

sfirke commented Oct 18, 2017

I agree that table is satisfactory in your example. But what if I want to print something like this:

> mtcars %>%
+   tabyl(am, cyl) %>%
+   adorn_percentages() %>%
+   adorn_pct_formatting()
  am     4     6     8
1  0 15.8% 21.1% 63.2%
2  1 61.5% 23.1% 15.4%

table can't do the % stuff, but I still might want that cyl at the top.

I'm tinkering with this and it should be doable.

@sfirke
Copy link
Owner

sfirke commented Nov 15, 2017

Might consider kableExtra::add_header_above. That could be inspiration, though it seems targeted at more heavy-weight LaTeX and HTML tables while this is still a simple data.frame.

@sfirke
Copy link
Owner

sfirke commented Jan 19, 2018

This feature is now live on the master branch! Try out adorn_title(). I did some data analysis today and found myself reflexively using it, which is a good sign. It's a weird function so would love feedback from anyone.

@sfirke sfirke closed this as completed Jan 19, 2018
@randomgambit
Copy link
Author

thanks that is great! sorry for getting back to you so late

@vishalugle
Copy link

Hi,

Thanks for adding this feature, I am using version 0.3.1 of janitor package and don't see adorn_title() in it.
How can I get the latest version of the package ?

Thanks

@sfirke
Copy link
Owner

sfirke commented Mar 8, 2018

@vishalugle I plan to submit version 1.0 to CRAN tomorrow at which point this will be a moot question 😄 But in the meantime and for installing other package development versions from GitHub, try:

install.package(devtools)
devtools::install_github("sfirke/janitor")

See if that gives you that function?

@vishalugle
Copy link

vishalugle commented Mar 8, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants