Description
Brief Description
There are a few `adorn_*` functions from R's janitor that have not yet been ported over to pyjanitor. Janitor docs here.
I'm specifically looking at:
- `adorn_totals`: adds a "total" row, a "total" column, or both.
- `adorn_percentages`: converts the cell values into percentages, calculated along either axis or over the entire dataframe. In the R formulation, these are floats between 0 and 1, not 0-100 percentages.
- `adorn_pct_formatting`: formats the 0 to 1 values into 0 to 100 percentage values, with rounding/formatting options.
- `adorn_ns`: adds the raw counts back into the cell values (meant to be run after `adorn_percentages`), so each cell has both percentage and count info, for example "56 (24.3%)".
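For reference, here is a rough sketch of what the last three steps amount to in plain pandas. It is not how janitor implements them; the `counts` table and all variable names are made up for illustration, and the "count (percentage)" layout simply follows the example above.

```python
import pandas as pd

# Hypothetical table of raw counts (illustrative data only).
counts = pd.DataFrame(
    {"yes": [30, 10], "no": [10, 50]},
    index=["group_a", "group_b"],
)

# adorn_percentages-like step: row-wise fractions between 0 and 1.
fractions = counts.div(counts.sum(axis=1), axis=0)

# adorn_pct_formatting-like step: 0-1 fractions become "xx.x%" strings.
formatted = fractions.apply(lambda col: col.map(lambda v: f"{v * 100:.1f}%"))

# adorn_ns-like step: put the raw counts back next to the percentages.
with_counts = counts.astype(str) + " (" + formatted + ")"

print(with_counts)
```

Note that the last step needs the original `counts` table, which is exactly the information that is no longer available once the percentages have replaced the cell values.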
I imagine these might be particularly useful for those doing data reporting.
These should go into the `functions` module.
Example API
In pyjanitor, I don't think having four separate functions works (how would we enforce that `adorn_ns` comes after `adorn_percentages`? And where would we get the counts required for `adorn_ns`? etc.).
Perhaps we could just have an `adorn_totals` and an `adorn_percentages` (which encapsulates the behaviour of `adorn_pct_formatting` and `adorn_ns` as well, controlled via function parameters).
adorn_totals
This function should mirror the R function almost 1-1.
>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_totals(
...     subset=None,     # or a list of index/column names; ideally also ranges like `slice("col_a", "col_d")`, since `.loc` supports them
...     axis="col",      # "index"/0/"row", "column"/1/"col", or "both"
...     fill_value="-",  # str
...     name="Total",    # str
... )
         a  b
0      6.0  x
1      NaN  y
2      2.5  z
Total  8.5  -
A few points where I disagree(?) with the R implementation:
- I'm thinking that NaN values should be treated as 0 by default, so totals won't be affected by the presence of NaN, e.g. sum(1, NaN, 2.5) = 3.5. The R janitor function has an `na.rm` parameter for this, but I somehow feel this isn't necessary.
- The `where` parameter, as defined by the R implementation, dictates whether to add a Totals "row" or "col", as opposed to whether to do the summation over "row"/"col". In the latter case, `where="row"` would add a new column containing the totals across the rows (which to me is more natural). I'm calling this parameter `axis` here, by the way.
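To make the proposed semantics concrete, here is a minimal sketch in plain pandas. The function name `adorn_totals_sketch` is hypothetical, `subset` is assumed to be a plain list of column names (no `slice` support), and only numeric columns are summed. NaN is skipped, which is what pandas' `sum` does by default.

```python
import numpy as np
import pandas as pd


def adorn_totals_sketch(df, subset=None, axis="col", fill_value="-", name="Total"):
    """Hypothetical sketch: `axis` names the direction that is summed over."""
    if axis not in ("row", "col", "both"):
        raise ValueError("axis must be 'row', 'col' or 'both'")

    # Only numeric columns (optionally restricted to `subset`) are totalled.
    numeric = [
        c
        for c in df.select_dtypes(include="number").columns
        if subset is None or c in subset
    ]
    out = df.copy()

    if axis in ("col", "both"):
        # Summing down each column yields one new row at the bottom.
        total_row = pd.DataFrame(
            {c: [out[c].sum()] if c in numeric else [fill_value] for c in out.columns},
            index=[name],
        )
        out = pd.concat([out, total_row])

    if axis in ("row", "both"):
        # Summing across each row yields one new column on the right.
        out[name] = out[numeric].sum(axis=1)

    return out


df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")})
result = adorn_totals_sketch(df, axis="col")
print(result)
```

With `axis="both"`, the totals row is added first, so the bottom-right cell of the new column ends up holding the grand total.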
adorn_percentages
TBD. Let me have a little think about this over the weekend; I decided against my own implementation idea while writing out the example API. ><
Original idea
>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_percentages(
...     subset=None,             # similar to `adorn_totals`
...     axis='col',              # similar to `adorn_totals`
...     adorn_count=True,
...     count_position='front',  # ignored if adorn_count=False
...     count_format=0,          # ignored if adorn_count=False
...     percentage_format=2,
... )
            a  b
0  6 (70.59%)  x
1         nan  y
2  3 (29.41%)  z
Parameters:
- `count_position`: where to put the count, front == "56 (23.4%)" or back == "23.4% (56)".
- `count_format` / `percentage_format`: if an int, the number of decimal places to round to; otherwise a string format specification like ':,.2f' or whatever.
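A rough sketch of how these parameters could fit together, assuming plain pandas and Python's built-in `format`. The names `adorn_percentages_sketch` and `_fmt` are hypothetical, and `subset` is left out for brevity. The usage example passes `count_format=1` because Python rounds halves to even, so 2.5 would render as "2" with zero decimal places.

```python
import numpy as np
import pandas as pd


def _fmt(value, spec):
    # An int means "number of decimal places"; a str is a format spec such as ':,.2f'.
    if isinstance(spec, int):
        spec = f".{spec}f"
    return format(value, spec.lstrip(":"))


def adorn_percentages_sketch(
    df,
    axis="col",
    adorn_count=True,
    count_position="front",
    count_format=0,
    percentage_format=2,
):
    """Hypothetical sketch: percentages, formatting and counts in one call."""
    numeric = df.select_dtypes(include="number").columns
    values = df[numeric]
    if axis == "col":
        fractions = values.div(values.sum(axis=0), axis=1)
    elif axis == "row":
        fractions = values.div(values.sum(axis=1), axis=0)
    else:
        raise ValueError("axis must be 'row' or 'col'")

    out = df.copy()
    for col in numeric:
        cells = []
        for count, frac in zip(values[col], fractions[col]):
            if pd.isna(count):
                cells.append("nan")  # missing cells are left as a plain marker
                continue
            pct = _fmt(frac * 100, percentage_format) + "%"
            if not adorn_count:
                cells.append(pct)
            elif count_position == "front":
                cells.append(f"{_fmt(count, count_format)} ({pct})")
            else:
                cells.append(f"{pct} ({_fmt(count, count_format)})")
        out[col] = cells
    return out


df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")})
result = adorn_percentages_sketch(df, count_format=1)
print(result)
```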
I'm not that sold on this API yet. It doesn't look too clean / friendly to use. After all, it is an amalgamation of 3 different behaviours in 1 function 😅. Would be happy to hear comments / suggestions to improve, if any.