The order of categorical variables #93

Open
epimedplotly opened this issue Dec 5, 2019 · 6 comments

Comments

@epimedplotly

Hello.

I’d like to suggest allowing categorical variables to be ordered in TableOne.

For example:

Suppose I have a variable that can take the values "<10", "10-20", ">20".

I’d like to see it in TableOne in exactly the order above.

Instead, it seems to use alphabetical order: "10-20", "<10", ">20".

It would be useful to see the correct order.
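To illustrate the behaviour described above: plain string sorting places "10-20" first because the character "1" sorts before "<" and ">". The following is a minimal sketch, assuming pandas is available, of how the intended order could be declared with an ordered categorical. The `age_group` name and the sample values are made up for illustration, and whether TableOne honours this order depends on the installed version (see the maintainer's reply below).

```python
import pandas as pd

# Hypothetical sample values; not taken from the original report.
values = ["10-20", "<10", ">20", "<10"]

# Plain string sorting compares character codes, so "10-20" comes first.
alphabetical = sorted(set(values))

# An ordered categorical records the intended order explicitly.
age_group = pd.Series(
    pd.Categorical(values, categories=["<10", "10-20", ">20"], ordered=True),
    name="age_group",
)
intended = age_group.cat.categories.tolist()

# Sorting the Series now follows the declared category order.
sorted_values = age_group.sort_values().tolist()
```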

Also, if a categorical variable has no defined order, it should be ordered by the percentage of each category, don't you agree?

Thanks for your attention.

Best regards,
Lunna

@tompollard
Owner

tompollard commented May 7, 2020

Thanks for picking this up. Version 0.7.5 now respects the order of categorical variables. For example:

import pandas as pd
from tableone import TableOne

day_cat = pd.Categorical(["mon", "wed", "tue", "thu"],
                         categories=["wed", "thu", "mon", "tue"], ordered=True)

alph_cat = pd.Categorical(["a", "b", "c", "a"],
                         categories=["b", "c", "d", "a"], ordered=False)

mon_cat = pd.Categorical(["jan", "feb", "mar", "apr"],
                         categories=["feb", "jan", "mar", "apr"], ordered=True)

data = pd.DataFrame({"A": ["a", "b", "c", "a"]})
data["day"] = day_cat
data["alph"] = alph_cat
data["month"] = mon_cat
data

Input DataFrame.

Note that the order specified in the DataFrame for day is ["wed", "thu", "mon", "tue"] and the order for month is: ["feb", "jan", "mar", "apr"].

[Screenshot: input DataFrame]
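Since the screenshot does not survive in text form, the category order can also be checked directly. This is a small sketch that rebuilds only the day column from the example above and reads its order back through the pandas `.cat` accessor:

```python
import pandas as pd

day_cat = pd.Categorical(["mon", "wed", "tue", "thu"],
                         categories=["wed", "thu", "mon", "tue"], ordered=True)
data = pd.DataFrame({"day": day_cat})

# The categories attribute holds the declared order, independent of row order.
day_order = data["day"].cat.categories.tolist()
is_ordered = data["day"].cat.ordered
```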

# the categorical order reflects the order in the DataFrame
t1 = TableOne(data, label_suffix=False)
t1

Table 1 uses the order specified in the DataFrame.

The order of day and month is retained in Table 1:

[Screenshot: resulting Table 1]

The order argument overrides the natural order of the DataFrame

The order in the DataFrame may not be what we want. We can either modify the order in the DataFrame directly or use the order argument to fix it. If the order argument is provided, it overrides the order in the DataFrame.

new_order = {"month": ["jan"], "day": ["mon", "tue", "wed"]}

t2 = TableOne(data, order=new_order, label_suffix=False)
t2

[Screenshot: Table 1 with the order argument applied]
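The other option mentioned above, modifying the order in the DataFrame directly, might look like the following sketch. It uses the pandas `cat.reorder_categories` accessor on the month column from the example; the target order is chosen here purely for illustration.

```python
import pandas as pd

mon_cat = pd.Categorical(["jan", "feb", "mar", "apr"],
                         categories=["feb", "jan", "mar", "apr"], ordered=True)
data = pd.DataFrame({"month": mon_cat})

# reorder_categories expects the same set of categories, given in the new order.
data["month"] = data["month"].cat.reorder_categories(
    ["jan", "feb", "mar", "apr"], ordered=True
)

month_order = data["month"].cat.categories.tolist()
month_values = data["month"].tolist()  # the row values themselves are unchanged
```

With the categories reordered this way, no order argument should be needed for that column, assuming a TableOne version that respects the categorical order as described above.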

@tompollard
Owner

@epimedplotly please test the sorting if you have the opportunity (the latest version can be pip/conda installed) and let us know if it works as expected.

> Also, if a categorical variable has no defined order, it should be ordered by the percentage of each category, don't you agree?

Sounds reasonable to me. We haven't implemented this yet, but can look into it.

@epimedplotly
Author

Hello, @tompollard!

I finally had the opportunity to test the latest version of tableone.

It works just as I expected, thank you so much!

I reiterate that if a categorical variable has no defined order, it would be awesome if it could be ordered by the percentage of each category, but the order argument is already making my life easier.

Thanks!

@tompollard
Owner

thanks @epimedplotly, glad to hear this helps :)

> I reiterate that if a categorical variable has no defined order, it would be awesome if it could be ordered by the percentage of each category, but the order argument is already making my life easier.

Point taken, and let's keep this issue open for now.

If you come up with new bugs, suggestions, etc, please feel free to raise more issues.

@vsocrates

> I reiterate that if a categorical variable has no defined order, it would be awesome if it could be ordered by the percentage of each category, but the order argument is already making my life easier.

I'd also be interested in this functionality. It looks like it's already implemented in the limit parameter? Would adding a new parameter be the way to do this? I'm willing to take a stab if someone can provide me with some design direction!

tableone/tableone/tableone.py

Lines 1501 to 1516 in bfd6fba

        # re-order the variables by frequency
        count = data[k].value_counts().sort_values(ascending=False)
        new_idx = [(k, '{}'.format(i)) for i in count.index]
    else:
        # apply order
        all_var = table.loc[k].index.unique(level='value')
        new_idx = [(k, '{}'.format(v)) for v in self._order[k]]
        new_idx += [(k, '{}'.format(v)) for v in all_var
                    if v not in self._order[k]]
    # restructure to match the original idx
    new_idx_array = np.empty((len(new_idx),), dtype=object)
    new_idx_array[:] = [tuple(i) for i in new_idx]
    orig_idx = table.index.values.copy()
    orig_idx[table.index.get_loc(k)] = new_idx_array
    table = table.reindex(orig_idx)
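The "apply order" branch of the excerpt above puts the requested values first and then appends any remaining values in their existing order, which is why a partial list such as {"month": ["jan"]} still works. Here is a standalone sketch of that logic in plain Python; `apply_partial_order` is a hypothetical helper written for illustration and is not part of tableone.

```python
def apply_partial_order(all_values, requested):
    """Return requested values first, then the remaining values in their existing order."""
    ordered = [str(v) for v in requested]
    ordered += [str(v) for v in all_values if v not in requested]
    return ordered


# Category levels as declared in the earlier example.
day_levels = ["wed", "thu", "mon", "tue"]
month_levels = ["feb", "jan", "mar", "apr"]

day_result = apply_partial_order(day_levels, ["mon", "tue", "wed"])
month_result = apply_partial_order(month_levels, ["jan"])
```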

@vsocrates

For anyone who wants an ad hoc fix:

from tableone import TableOne

# Return the values of a column, most frequent first
# (value_counts sorts by descending count by default)
def sort_by_frequency(series):
    return series.value_counts().index.tolist()

# Apply the function to each column of interest;
# `df` and `columns` are assumed to be defined already
sorted_values_by_column = {col: sort_by_frequency(df[col]) for col in columns}

mytable = TableOne(df, ..., order=sorted_values_by_column)
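For reference, here is a self-contained version of the same idea that runs with pandas alone. The DataFrame, the column names, and the `order_by_frequency` name are made up for illustration; the resulting dict has the shape expected by the order argument used above.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "smoker": ["no", "yes", "no", "no", "former", "yes"],
    "site": ["B", "A", "B", "B", "B", "A"],
})
columns = ["smoker", "site"]

# value_counts sorts by descending count, so the index is already
# in most-frequent-first order.
order_by_frequency = {
    col: df[col].value_counts().index.tolist() for col in columns
}

# normalize=True returns proportions instead of counts, if those are needed.
proportions = df["smoker"].value_counts(normalize=True)
```

Note that value_counts drops missing values by default, so a column with NaN entries will not get a missing-value level in the generated order.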

3 participants