Wrong percentage calculated when categorical column includes missing values. #114

rherman9 · 2021-02-27T17:12:30Z

When your categorical variables include missing values, a wrong percentage gets calculated.

See Parameter1:

Currently, percentages for categorical variables are calculated with the total being the number of non-missing values for that variable.

In my opinion, the percentage in cell E3, for example, should be 17.1, because you're interested in how many times Parameter1 category 1.0 occurred in the validation_set: 104/609.
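To make the two denominators concrete, here is a minimal sketch in plain Python. The toy column is made up for illustration and is not the data from the screenshot; only the 104/609 figure comes from the report above.

```python
from collections import Counter

# Toy column (hypothetical data): four observed values and two missing ones.
values = ["1", "2", "3", "4", None, None]

observed = [v for v in values if v is not None]
counts = Counter(observed)

# Denominator = non-missing values only (the behaviour described above).
pct_non_missing = {k: round(100 * n / len(observed), 1) for k, n in counts.items()}

# Denominator = all rows, missing included (what this report argues for).
pct_all_rows = {k: round(100 * n / len(values), 1) for k, n in counts.items()}

# The figure quoted in the report: 104 occurrences out of 609 rows.
reported_example = round(100 * 104 / 609, 1)

print(pct_non_missing)   # {'1': 25.0, '2': 25.0, '3': 25.0, '4': 25.0}
print(pct_all_rows)      # {'1': 16.7, '2': 16.7, '3': 16.7, '4': 16.7}
print(reported_example)  # 17.1
```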

A quick workaround is to replace all NaN values in the categorical variables with some sentinel number and then drop the rows for that number from the resulting table:

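# Replace NaN with a sentinel so missing rows count towards the denominator.
df[categorical] = df[categorical].replace(np.nan, 199848)
# (Build mytable from df here, then drop the sentinel level from the result.)
mytable = mytable.tableone[mytable.tableone.index.get_level_values(1) != "199848.0"]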
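The same idea can be sketched with pandas alone, without tableone. The data below is hypothetical and `value_counts` stands in for the table; it illustrates the sentinel trick and is not how the library computes its percentages.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical column with two missing entries.
df = pd.DataFrame({"Parameter1": [1.0, 1.0, 2.0, np.nan, np.nan]})

SENTINEL = 199848  # placeholder from the workaround above; any unused value works

# Step 1: replace NaN so the missing rows count towards the denominator.
filled = df["Parameter1"].replace(np.nan, SENTINEL)

# Step 2: percentages over all rows, sentinel level included.
pct = (filled.value_counts(normalize=True) * 100).round(1)

# Step 3: hide the sentinel level from the reported result.
pct = pct.drop(index=SENTINEL)
print(pct)
```

Each remaining level is reported as a share of all five rows, so the percentages no longer sum to 100; the gap is the share of missing values.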

Great library though!

tompollard · 2021-02-27T17:31:23Z

@rherman9 thanks for picking this up. We'll take a look!

tompollard · 2022-08-18T00:15:35Z

To reproduce this issue:

import pandas as pd
from tableone import tableone

df = pd.DataFrame(
    {'cats': ["1", "2", "3", "4", None, None],
    'set': ["train","train", "val", "val", "val", "val"]}
    )

t = tableone(df, groupby = "set")
print(t.tabulate(headers=None, tablefmt="github"))

Output:

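|             |    | Missing | Overall  | train    | val      |
|-------------|----|---------|----------|----------|----------|
| n           |    |         | 6        | 2        | 4        |
| cats, n (%) | 1  | 2       | 1 (25.0) | 1 (50.0) |          |
|             | 2  |         | 1 (25.0) | 1 (50.0) |          |
|             | 3  |         | 1 (25.0) |          | 1 (50.0) |
|             | 4  |         | 1 (25.0) |          | 1 (50.0) |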

Expected output:

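|             |    | Missing | Overall  | train    | val      |
|-------------|----|---------|----------|----------|----------|
| n           |    |         | 6        | 2        | 4        |
| cats, n (%) | 1  | 2       | 1 (25.0) | 1 (50.0) |          |
|             | 2  |         | 1 (25.0) | 1 (50.0) |          |
|             | 3  |         | 1 (25.0) |          | 1 (25.0) |
|             | 4  |         | 1 (25.0) |          | 1 (25.0) |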

The best fix for this might be to treat NaN/None etc as a category? @lbulgarelli any thoughts?
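As a rough illustration of what "missing as a category" would give for this example, here is a pandas-only sketch. It uses `pd.crosstab` rather than tableone, and the "Missing" label is an arbitrary choice for this sketch, not something the library defines.

```python
import pandas as pd

df = pd.DataFrame(
    {"cats": ["1", "2", "3", "4", None, None],
     "set": ["train", "train", "val", "val", "val", "val"]}
)

# Label missing values explicitly so they become a level of their own.
cats = df["cats"].fillna("Missing")

# Column-wise percentages: the levels within each group sum to 100.
pct = (pd.crosstab(cats, df["set"], normalize="columns") * 100).round(1)
print(pct)
```

With the missing level included, categories 3 and 4 each account for 25.0% of the val group instead of 50.0%, and the missing level itself accounts for the remaining 50.0%.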

lbulgarelli · 2022-08-25T06:57:51Z

It is a good idea to add missing as a category in its own right, especially because it will make it easy to compare missing values between groups.

That said, the number of missing alone is not very informative for non-categorical variables, so I'd also probably hide that information by default, with the option to display it.

tompollard · 2022-08-25T17:50:32Z

> That said, the number of missing alone is not very informative for non-categorical variables, so I'd also probably hide that information by default, with the option to display it.

I feel like it's pretty important to know how many data points are missing, even for continuous variables. If you're reporting a summary statistic and it is based on a small proportion of your overall data, it feels like it would be good to know.
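A small pandas sketch of that concern, using made-up data and column names: the mean for the val group looks like any other summary statistic until the missing count is shown next to it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a continuous variable that is mostly missing in one group.
df = pd.DataFrame(
    {"age": [50.0, 60.0, 70.0, np.nan, np.nan, np.nan],
     "set": ["train", "train", "val", "val", "val", "val"]}
)

summary = df.groupby("set")["age"].agg(
    mean="mean",                         # NaN values are skipped by default
    n_observed="count",                  # non-missing values only
    n_missing=lambda s: s.isna().sum(),  # missing values in the group
)
print(summary)
```

Here the val mean of 70.0 rests on a single observation out of four rows, which only the missing count reveals.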

@jraffa any thoughts on this conversation? (how to handle missing values for categorical and continuous variables).

tompollard · 2024-06-14T13:48:57Z

Should be fixed in #175, which adds an include_null argument. When include_null=True (the default), missing values are treated as a level of the categorical variable.

! pip install git+https://github.com/tompollard/tableone.git@main

import pandas as pd
from tableone import tableone

df = pd.DataFrame(
    {'cats': ["1", "2", "3", "4", None, None],
    'set': ["train","train", "val", "val", "val", "val"]}
    )

t = tableone(df, groupby = "set")
print(t.tabulate(headers=None, tablefmt="github"))

Outputs:

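|             |      | Overall  | train    | val      |
|-------------|------|----------|----------|----------|
| n           |      | 6        | 2        | 4        |
| cats, n (%) | 1    | 1 (16.7) | 1 (50.0) |          |
|             | 2    | 1 (16.7) | 1 (50.0) |          |
|             | 3    | 1 (16.7) |          | 1 (25.0) |
|             | 4    | 1 (16.7) |          | 1 (25.0) |
|             | None | 2 (33.3) |          | 2 (50.0) |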

lbulgarelli added this to tableone improvements Aug 11, 2022

lbulgarelli moved this to Todo in tableone improvements Aug 16, 2022

tompollard added a commit that referenced this issue Jun 14, 2024

Add auto_fill_nulls argument. Ref #114

44ef61f

Missing values are now treated as a category for categorical values.

tompollard added a commit that referenced this issue Jun 14, 2024

Rename auto_fill_nulls to include_null. Handle include_null=False and…

02a7326

… include_null=True. Ref #114.

tompollard mentioned this issue Jun 14, 2024

Add include_null argument to handle nulls for categorical values. Ref #114. #175

Merged

tompollard added a commit that referenced this issue Jun 14, 2024

Merge pull request #175 from tompollard/tp/auto_fill_nulls

6fc6a30

Add include_null argument to handle nulls for categorical values. Ref #114.

tompollard closed this as completed Jun 14, 2024

github-project-automation bot moved this from Todo to Done in tableone improvements Jun 14, 2024
