missing column names after crosstab? #77

randomgambit · 2016-10-26T16:56:04Z

Hello there,

First of all, congrats for this package. Coming from stata and pandas, this is EXACLY what dplyr needed. A robust, easy to use package that excels at computing tabulations and cross-tabulations!

I have a quick question. Consider this simple example


var1<- c(5,10,2,2,NA)
var2<- c(NA,NA,NA,NA,NA)
var3<- c(5,10,2,2,1)
var4<- c('A','B','C','C','D')
var5<- c('G','B','C','F','D')
df = df=data.frame(var1,var2,var3,var4,var5)


> df
  var1 var2 var3 var4 var5
1    5   NA    5    A    G
2   10   NA   10    B    B
3    2   NA    2    C    C
4    2   NA    2    C    F
5   NA   NA    1    D    D

Now I compute a simple cross-tabulation

> df %>% crosstab(var1, var3,  percent = "row")
  var1 1 2 5 10
1    2 0 1 0  0
2    5 0 0 1  0
3   10 0 0 0  1
4   NA 1 0 0  0

I dont get why I only see the variable name var1 above and not var1 and var3. Given the table above, how could you possibly know that this is a cross-tabulation between var1 and var3?
Am I missing something?

Thanks!

The text was updated successfully, but these errors were encountered:

randomgambit · 2016-10-26T18:37:10Z

by looking at the documentation, it seems this is the expected behavior of crosstab.

In my opinion, there is room for improvement here. In essence, by not showing the name of the second variable, we are losing information and our two-way frequency tables are not self-contained anymore. For a package specifically designed at cross-tabulations, I think this is a pretty serious issue.

What do you think?
In any case, your package is really great and I hope my comments will help you improve it!

Example output in Stata where all the information is kept.

. tab make foreign if  _n <5

                   |  Car type
    Make and Model |  Domestic |     Total
-------------------+-----------+----------
       AMC Concord |         1 |         1 
         AMC Pacer |         1 |         1 
        AMC Spirit |         1 |         1 
     Buick Century |         1 |         1 
-------------------+-----------+----------
             Total |         4 |         4

sfirke · 2016-10-28T19:31:51Z

I agree that the ideal result would contain information about both variables used in the crosstab and have actually put considerable thought into this. I'm open to other ideas, but I haven't found a good way to do that without compromising the data.frame structure of the result.

For interactive use, something like that Stata result is nice, to have the full info. But then the result can't be written to .csv, passed to adorn_crosstab(), joined to another data.frame, etc.

gmodels::CrossTable does something very similar to that Stata output: http://stackoverflow.com/questions/13043817/mimic-tabulate-command-from-stata-in-r Perhaps that is your best bet. Or from this SO answer about xtabs:

t1 <- rep(c("A","B","C"),5)
t2 <- rpois(15,4)
df <- data.frame(ques=t1,resp=t2)
z <- xtabs(~ques+resp,data=df)

> z
    resp
ques 1 3 4 5 6 8 9
   A 0 3 0 1 0 0 1
   B 1 1 2 0 0 1 0
   C 0 0 2 1 1 1 0

But I don't like that z is of class "table". And the var2 name is lost when printing with knitr::kable or writing as a .csv.

write.csv(z, "dummy.csv")
knitr::kable(z)

and as.data.frame(z) gives an undesirable long result.

I don't see where the var2 name could go in a data.frame. I tried printing it for interactive use like this:

crosstab2 <- function(...){
  dots <- as.list(substitute(list(...)))[-1L]
  print(paste0("          ", deparse(dots[[2]])))
  crosstab(...)
}
> crosstab2(mtcars$cyl, mtcars$carb)
[1] "          mtcars$carb"
  mtcars$cyl 1 2 3 4 6 8
1          4 5 6 0 0 0 0
2          6 2 0 0 4 1 0
3          8 0 4 3 6 0 1

But in interactive use, you already have your command right above it the console output, so it's not as much of an issue - and then if the printed statement doesn't stay with the resulting data.frame, it doesn't solve the underlying problem.

Here's an Excel screenshot from trying to conceptualize this:

I don't think the var2 name can be included without creating a blank column.

randomgambit · 2016-10-28T23:11:38Z

thanks for your detailed and nice answers.
One possibility: what about concatenating in some form or another the two column names, so that the output is

> df %>% crosstab(var1, var3,  percent = "row")
  var1/var3 1 2 5 10
      1    2 0 1 0  0
      2    5 0 0 1  0
      3   10 0 0 0  1
      4   NA 1 0 0  0

sfirke · 2017-05-05T00:11:43Z

I see the upside of this, but am stuck on two problems:

var1/var3 doesn't describe the contents of column 1; that's var1. This breaks an R convention and is confusing if say you want to %>% filter(var1 == 3).
It's an illegal variable name with that special character, so subsequent function calls would have to handle that.

My solutions are (a) for interactive use, you have the syntax immediately before the result so can look there; (b) for scripted use you can add a title in Markdown or tweak the first column name with kable().

randomgambit · 2017-05-05T00:50:41Z

@sfirke !!! I wrote this in October!!!!! :D
but, seriously, why dont you just add an option that says rowcol = True that, instead of var1/var3 writes the first 3 character of each variable separated by a - ? say, if the variables are samuel and firke it would show sam-fir?

sfirke · 2017-05-05T02:37:28Z

I'm hoping to get faster at responding :) Cleaning up some issues now before a minor CRAN submission.

That approach sounds like a good compromise. I'm going to redo the whole crosstab() function later this year, and I'll look at that as a possible option for crosstab() to take.

sfirke · 2017-10-11T02:25:34Z

A year later and I'm struck by an idea.

To recap, this looks pretty nice with "cyl" on top:

> library(magrittr)
> mtcars %$% table(am, cyl)
   cyl
am   4  6  8
  0  3  4 12
  1  8  3  2

But it's not a data.frame.

This is a data.frame:

result <- mtcars %>%
  tabyl(am, cyl)

top <- result[1, ]
top[1, ] <- names(top)

bind_rows(
  top, result %>%
    mutate_if(is.numeric, as.character)
) %>%
  setNames(c("", "cyl", "", ""))

     cyl     
1 am   4 6  8
2  0   3 4 12
3  1   8 3  2

It's no longer any good for manipulating, since the column names are put in the 1st row of the data.frame - feels so wrong. BUT, it looks good for printing in a report. I think I could implement this as an adorn_, like adorn_col_title. I'd need to store the variable names as tabyl attributes, not hard, then it could do this.

I think it's at least worth a shot. Anyone want to implement this? The attribute stuff is not hard if you look at as_tabyl for guidance.

I wonder if it would throw off any bolding of column names that might occur in RMarkdown or another output format like a .docx template; would it bold only that single name in the row by itself?

sfirke · 2017-10-18T02:47:18Z

Note to self: allow it to take optional text argument to put on top. Default is to get it from the tabyl attributes, but allow for override. Use cases are overriding on a tabyl, or using with a non-tabyl, like some other tidyverse calculation ending with spread.

randomgambit · 2017-10-18T12:01:26Z

Hi @sfirke sorry for my late andwer. Yes, I think this is a very good idea, and actually the only possible one when you think about it
but why not sticking to

> mtcars %$% table(am, cyl)
   cyl
am   4  6  8
  0  3  4 12
  1  8  3  2

at the end of the day? for printing, nobody cares whether this is a dataframe or a duck

sfirke · 2017-10-18T13:07:04Z

I agree that table is satisfactory in your example. But what if I want to print something like this:

> mtcars %>%
+   tabyl(am, cyl) %>%
+   adorn_percentages() %>%
+   adorn_pct_formatting()
  am     4     6     8
1  0 15.8% 21.1% 63.2%
2  1 61.5% 23.1% 15.4%

table can't do the % stuff, but I still might want that cyl at the top.

I'm tinkering with this and it should be doable.

sfirke · 2017-11-15T13:40:18Z

Might consider kableExtra::add_header_above. That could be inspiration, though it seems targeted at more heavy-weight LaTeX and HTML tables while this is still a simple data.frame.

sfirke · 2018-01-19T02:50:24Z

This feature is now live on the master branch! Try out adorn_title(). I did some data analysis today and found myself reflexively using it, which is a good sign. It's a weird function so would love feedback from anyone.

randomgambit · 2018-02-05T14:43:02Z

thanks that is great! sorry for getting back to you so late

vishalugle · 2018-03-08T16:50:36Z

Hi,

Thanks for adding this feature, I am using version 0.3.1 of janitor package and don't see adorn_title() in it.
How can I get the latest version of the package ?

Thanks

sfirke · 2018-03-08T17:19:11Z

@vishalugle I plan to submit version 1.0 to CRAN tomorrow at which point this will be a moot question 😄 But in the meantime and for installing other package development versions from GitHub, try:

install.package(devtools)
devtools::install_github("sfirke/janitor")

See if that gives you that function?

vishalugle · 2018-03-08T17:40:34Z

Thanks for your quick response , I got the latest version using devtools.

…

On Thu, Mar 8, 2018 at 12:19 PM, Sam Firke ***@***.***> wrote: @vishalugle <https://github.com/vishalugle> I plan to submit version 1.0 to CRAN tomorrow at which point this will be a moot question 😄 But in the meantime and for installing other package development versions from GitHub, try: install.package(devtools) devtools::install_github("sfirke/janitor") See if that gives you that function? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#77 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AOZTR7i5yf7Qf7nUKNmL-3fY9aMgWeBsks5tcWgQgaJpZM4KhajE> .

sfirke closed this as completed May 5, 2017

sfirke reopened this May 5, 2017

sfirke added the hacktoberfest label Oct 11, 2017

sfirke mentioned this issue Oct 13, 2017

Submit version 1.0 to CRAN #140

Closed

20 tasks

sfirke mentioned this issue Jan 18, 2018

implement adorn_col_name() and underlying attribute support #145

Merged

sfirke closed this as completed Jan 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

missing column names after crosstab? #77

missing column names after crosstab? #77

randomgambit commented Oct 26, 2016 •

edited

Loading

randomgambit commented Oct 26, 2016 •

edited

Loading

sfirke commented Oct 28, 2016

randomgambit commented Oct 28, 2016

sfirke commented May 5, 2017

randomgambit commented May 5, 2017

sfirke commented May 5, 2017

sfirke commented Oct 11, 2017 •

edited

Loading

sfirke commented Oct 18, 2017

randomgambit commented Oct 18, 2017

sfirke commented Oct 18, 2017

sfirke commented Nov 15, 2017

sfirke commented Jan 19, 2018

randomgambit commented Feb 5, 2018

vishalugle commented Mar 8, 2018

sfirke commented Mar 8, 2018

vishalugle commented Mar 8, 2018 via email

missing column names after crosstab? #77

missing column names after crosstab? #77

Comments

randomgambit commented Oct 26, 2016 • edited Loading

randomgambit commented Oct 26, 2016 • edited Loading

sfirke commented Oct 28, 2016

randomgambit commented Oct 28, 2016

sfirke commented May 5, 2017

randomgambit commented May 5, 2017

sfirke commented May 5, 2017

sfirke commented Oct 11, 2017 • edited Loading

sfirke commented Oct 18, 2017

randomgambit commented Oct 18, 2017

sfirke commented Oct 18, 2017

sfirke commented Nov 15, 2017

sfirke commented Jan 19, 2018

randomgambit commented Feb 5, 2018

vishalugle commented Mar 8, 2018

sfirke commented Mar 8, 2018

vishalugle commented Mar 8, 2018 via email

randomgambit commented Oct 26, 2016 •

edited

Loading

randomgambit commented Oct 26, 2016 •

edited

Loading

sfirke commented Oct 11, 2017 •

edited

Loading