[ENH] adorn functions #993

Open
@thatlittleboy

Brief Description

There are a few adorn_* functions from R's janitor that are not yet ported over to pyjanitor. Janitor docs here.

I'm specifically looking at:

  • adorn_totals: adds a totals row, a totals column, or both
  • adorn_percentages: converts the cell values into percentages, calculated along either axis or over the entire dataframe. In the R formulation, these are floats between 0 and 1, not the 0-100 percentages.
  • adorn_pct_formatting: formats the 0 to 1 values into the 0 to 100 percentage values, with rounding/formatting options
  • adorn_ns: adds the raw counts back into the cell values (meant to be run after adorn_percentages), so each cell has both percentage & count info, like "56 (24.3%)" for example.
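
To make the four steps above concrete, here is a rough sketch of what each one does, written in plain pandas. The `counts` table and all variable names are made up for illustration; none of this is existing pyjanitor API, and the R functions have more options than shown here.

```python
import pandas as pd

# Hypothetical counts table, invented for this illustration.
counts = pd.DataFrame(
    {"yes": [30, 10], "no": [20, 40]},
    index=["group_a", "group_b"],
)

# adorn_totals-like step: append a row holding the column sums.
with_totals = pd.concat([counts, counts.sum(axis=0).to_frame("Total").T])

# adorn_percentages-like step: fractions in [0, 1], computed within each row.
fractions = counts.div(counts.sum(axis=1), axis=0)

# adorn_pct_formatting-like step: render the fractions as 0-100 strings.
formatted = fractions.apply(lambda col: col.map("{:.1%}".format))

# adorn_ns-like step: put the raw counts back next to the percentages.
with_ns = counts.astype(str) + " (" + formatted + ")"
print(with_ns)
```

Note that the last step needs the original `counts` again, which is exactly the dependency that makes four independent chained methods awkward.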

I imagine these might be particularly useful for those doing data reporting.
These should go into the functions module.

Example API

In pyjanitor, I don't think having four separate functions works (how would we enforce that adorn_ns comes after adorn_percentages? And where would we get the counts required for adorn_ns? etc.).

Perhaps we could just do an adorn_totals and an adorn_percentages (which encapsulates the behaviour of adorn_pct_formatting and adorn_ns as well, controlled via function parameters).

adorn_totals

This function should mirror the R function almost 1-1.

>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_totals(
...     subset=None,  # or list of index/col names; preferably can take in ranges like `slice("col_a","col_d")` also since `.loc` supports it
...     axis="col",  # index/0/row or column/1/col or both
...     fill_value="-",  # str
...     name="Total",  # str
... )
         a  b
0      6.0  x
1      NaN  y
2      2.5  z
Total  8.5  -

A few points where I disagree(?) with the R implementation:

  • I'm thinking NaN values should be treated as 0 by default, so totals aren't affected by the presence of NaN, e.g. sum(1, NaN, 2.5) = 3.5. The R janitor function has an na.rm parameter for this, but I somehow feel it isn't necessary.
  • The where parameter, as defined in the R implementation, dictates whether to add a Totals "row" or "col", as opposed to whether the summation runs over "row" or "col". In the latter case, where="row" would add a new column containing the totals across each row (which to me is more natural). I'm calling this parameter axis here btw.
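
As a rough illustration of the two points above, the behaviour could be sketched like this. `adorn_totals_sketch` is a hypothetical name and a minimal stand-in for the proposal, not a finished implementation: it only handles axis="col" and axis="row" (not "both"), and it assumes `subset` is a list of column labels or a label slice.

```python
import numpy as np
import pandas as pd


def adorn_totals_sketch(df, subset=None, axis="col", fill_value="-", name="Total"):
    """Hypothetical sketch of the proposed behaviour; not pyjanitor code."""
    # Restrict to `subset` first (a list or a label slice both work with .loc),
    # then keep only the numeric columns, since only those can be summed.
    candidates = df if subset is None else df.loc[:, subset]
    numeric = candidates.select_dtypes("number")
    if axis == "col":
        # Sum down each column, skipping NaN, and append the sums as a new row.
        totals = numeric.sum(axis=0, skipna=True)
        row = {c: totals[c] if c in totals.index else fill_value for c in df.columns}
        return pd.concat([df, pd.DataFrame([row], index=[name])])
    if axis == "row":
        # Sum across each row, skipping NaN, and append the sums as a new column.
        return df.assign(**{name: numeric.sum(axis=1, skipna=True)})
    raise ValueError("this sketch only handles axis='col' or axis='row'")


df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")})
col_totals = adorn_totals_sketch(df)              # adds a "Total" row
row_totals = adorn_totals_sketch(df, axis="row")  # adds a "Total" column
```

With `skipna=True`, a row or column that is entirely NaN sums to 0.0, which matches the "treat NaN as 0" default suggested above.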

adorn_percentages

TBD. Let me have a little think about this over the weekend, I decided against my own implementation idea while writing out the example API.. ><

Original idea
>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_percentages(
...     subset=None,  # similar to `adorn_totals`
...     axis='col',  # similar to `adorn_totals`
...     adorn_count=True,
...     count_position='front',  # ignored if adorn_count=False
...     count_format=0,  # ignored if adorn_count=False
...     percentage_format=2,
... )
            a  b
0  6 (70.59%)  x
1         nan  y
2  3 (29.41%)  z

Parameters:

  • count_position: where the count goes: "front" gives "56 (23.4%)", "back" gives "23.4% (56)"
  • count_format / percentage_format: if an int, the number of decimal places to round to; otherwise a string format specification like ',.2f'.

I'm not that sold on this API yet. It doesn't look too clean / friendly to use; after all, it is an amalgamation of 3 different behaviours in 1 function 😅. Would be happy to hear comments / suggestions to improve, if any.
