
Improve docs on what the axis= kwarg does in individual functions/methods #29203


Closed
dlukes opened this issue Oct 24, 2019 · 10 comments · Fixed by #37029

@dlukes

dlukes commented Oct 24, 2019

axis=0 or axis=1, which is it?

I've always found it hard to remember which axis (0/"index" vs. 1/"columns") does what for various operations. I suppose some people find it intuitive, while others (like me) find it confusing and inconsistent.

Case in point, DataFrame.sum vs. DataFrame.drop: if I want column sums, I need axis=0...

>>> import pandas as pd
>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.sum(axis=0)
a     6
b    15
dtype: int64

... but if I want to drop a column, I need axis=1:

>>> df.drop("a", axis=1)
   b
0  4
1  5
2  6

There's an analogous discrepancy in numpy (which is probably where pandas inherited it from?):

>>> import numpy as np
>>> a = df.to_numpy()
>>> a
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.sum(a, axis=0)  # sum columns
array([ 6, 15])
>>> np.delete(a, 0, axis=1)  # delete first column
array([[4],
       [5],
       [6]])

I just intuitively conceptualize these operations as working along the same axis, so it's hard for me to internalize that the value of the axis parameter is different in each case. Apparently, I'm not the only person to find this confusing (quoting from the article: "For example, in the np.sum() function, the axis parameter behaves in a way that many people think is counter intuitive").
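One way to reconcile the two calls is to look at what happens to the array's shape. This is only an illustrative sketch using the same values as above: in both cases the axis argument names the dimension that gets shorter.

```python
import numpy as np

a = np.array([[1, 4],
              [2, 5],
              [3, 6]])  # shape (3, 2): 3 rows, 2 columns

# A reduction removes the axis it is given: axis 0 (length 3) disappears,
# leaving one value per column.
col_sums = np.sum(a, axis=0)
print(a.shape, "->", col_sums.shape)  # (3, 2) -> (2,)

# np.delete removes an entry along the axis it is given: axis 1 shrinks
# from length 2 to length 1, i.e. a column is gone.
without_first_col = np.delete(a, 0, axis=1)
print(a.shape, "->", without_first_col.shape)  # (3, 2) -> (3, 1)
```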

At the same time, I can imagine that some people find this behavior completely natural (at the very least those who designed the API). And I understand that changing this in pandas while keeping the status quo in numpy would introduce a (probably) worse inconsistency, so I'm not suggesting that.

Suggestion for improvement

What I am suggesting is reviewing the documentation of functions/methods using the axis= keyword argument and (where applicable) improving the description of what it controls in each case. Pandas is typically used interactively, so documentation is easily accessible. If it contains useful hints on what each axis value does (and possibly why), it's not such a big problem if this behavior goes against some people's expectations.

Examples

For example, based on the current master docs, the description of the axis parameter for drop does a good job at this:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

This makes it reasonably clear to me that if I specify 0, I'll be removing rows, whereas 1 will result in removing columns.
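As a quick check of that reading, here is a small sketch using the string aliases and the index=/columns= keywords that drop also accepts, which avoid the axis argument altogether:

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# The axis names the set of labels in which the first argument is looked up.
no_col_a = df.drop("a", axis="columns")  # same as axis=1
no_row_0 = df.drop(0, axis="index")      # same as axis=0

# Equivalent spellings that do not need the axis argument.
assert no_col_a.equals(df.drop(columns="a"))
assert no_row_0.equals(df.drop(index=0))

print(list(no_col_a.columns))  # ['b']
print(list(no_row_0.index))    # [1, 2]
```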

By contrast, the description of the axis parameter for sum is somewhat too generic:

axis : {index (0), columns (1)}
Axis for the function to be applied on.

Based on this, I could conclude (and have repeatedly done so) that if I want column sums, I need to "apply the function on columns", hence axis=1 (which is wrong, cf. above).

A revised description could look something like the following:

axis : {index (0), columns (1)}
Whether to collapse the index (0 or ‘index’), resulting in column sums, or the columns (1 or ‘columns’), resulting in row sums.
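A short sketch of what that wording describes, using the df from the top of this post (the string aliases are interchangeable with 0 and 1):

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# axis=0 / "index": the index is collapsed, one result per column.
col_sums = df.sum(axis="index")
# axis=1 / "columns": the columns are collapsed, one result per row.
row_sums = df.sum(axis="columns")

print(col_sums.tolist())  # [6, 15]
print(row_sums.tolist())  # [5, 7, 9]
```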

@WillAyd
Member

WillAyd commented Oct 24, 2019

If you can break the request up into clear actionable items, we generally accept PRs to improve docs so sure.

> Case in point, DataFrame.sum vs. DataFrame.drop: if I want column sums, I need axis=0

Maybe helpful for you to think of it as the "sum of rows by column"

@WillAyd added the Docs and Needs Info (Clarification about behavior needed to assess issue) labels Oct 24, 2019
@dlukes
Author

dlukes commented Oct 25, 2019

> If you can break the request up into clear actionable items

Sure, I just wanted to provide a broader rationale first and see if changes along these lines feel acceptable at all :) I'll comb through the source code for places where the descriptions could be improved. Should I:

  • just add concrete suggestions as replies to this thread?
  • create a separate issue for each candidate I find, referencing this issue for context?
  • create a separate pull request for each candidate I find?
  • create one pull request with everything once I've gone through it all?

> Maybe helpful for you to think of it as the "sum of rows by column"

I hope I'll remember now, it's one of the reasons I decided to type up the issue -- to etch this into my brain once and for all, hopefully :)

@smcinerney

smcinerney commented Dec 10, 2019

  1. Conceptually it prevents confusion to think in terms of 'by-row'/axis=0 or 'by-column'/axis=1
     • e.g. to get column sums, we actually need to sum by-row
     • cf. the R functions colSums/rowSums, which use 'by-row'/'by-column' terminology more consistently
  2. Further confusing is that the doc for DataFrame.sum (unlike, say, DataFrame.drop) doesn't mention 'row' anywhere. It merely says "axis : {index (0), columns (1)}\n Axis for the function to be applied on.", whereas DataFrame.drop is a better example for docs.
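For comparison, a rough sketch of thin wrappers in the spirit of R's colSums/rowSums. The names col_sums and row_sums are hypothetical and are not part of pandas; the point is only that naming the result hides the axis choice.

```python
import pandas as pd


def col_sums(df: pd.DataFrame) -> pd.Series:
    """One sum per column (hypothetical helper, not a pandas function)."""
    return df.sum(axis="index")  # walk down the rows, i.e. sum by-row


def row_sums(df: pd.DataFrame) -> pd.Series:
    """One sum per row (hypothetical helper, not a pandas function)."""
    return df.sum(axis="columns")  # walk across the columns, i.e. sum by-column


df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
print(col_sums(df).tolist())  # [6, 15]
print(row_sums(df).tolist())  # [5, 7, 9]
```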

@simonjayhawkins added the good first issue label and removed the Needs Info (Clarification about behavior needed to assess issue) label Apr 1, 2020
@simonjayhawkins added this to the Contributions Welcome milestone Apr 1, 2020
@marielledado
Contributor

@dlukes You're definitely not the only one, I would love to see these changes in the docs. Have you created PRs on this issue yet? If there are still any open issues I'm happy to work on them!

@dlukes
Author

dlukes commented May 18, 2020

@marielledado I wasn't quite sure how to go forward with it (cf. the bullet points with questions in my previous post), and then life got in the way, so no, unfortunately, there's no active PR... If you have time to pick this up though, it would be great!

@marielledado
Contributor

@dlukes life always gets in the way 😌 but yes happy to pick this up and make suggestions on this issue page based on those bullet points.

For good measure: Take!

@marielledado
Contributor

Hi @simonjayhawkins, first-time contributor here and relatively new to programming and open-source contribution. I'd like to work on this issue, but it's not clear to me from the contributing guidelines where I should edit documentation. Should I edit the docstring of the function (df.drop) directly?

@simonjayhawkins
Member

> Should I edit the docstring of the function (df.drop) directly?

the answer is yes, but I think the docstring for df.drop is clear on the usage. maybe there are other methods where it is less clear.

> it's not clear to me from the contributing guidelines where I should edit documentation

some other methods either use templates, inherit docstrings or inherit templates.

@aidanmontare-edu
Contributor

Taking a look at the code, functions like .sum() and .min() are defined in https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py, but they aren't defined separately: there's one function that defines a whole bunch of these. I suppose that's the work of the inherited templates @simonjayhawkins was talking about (these things are new to me).

I guess the way forward is to define a way to put different text in the inherited templates for different methods? That way the description could be written for each method based on what makes sense.
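To make the idea concrete, here is a minimal sketch of a shared docstring template with a per-method axis description. Everything in it (AXIS_DOC_TEMPLATE, with_axis_doc, total) is a hypothetical name for illustration, and it is not how pandas actually builds these docstrings:

```python
# Hypothetical sketch: these names are illustrative, not pandas internals.
AXIS_DOC_TEMPLATE = """{summary}

Parameters
----------
axis : {{0 or 'index', 1 or 'columns'}}
    {axis_descr}
"""


def with_axis_doc(summary, axis_descr):
    """Fill the shared template with method-specific text and attach it."""
    def decorator(func):
        func.__doc__ = AXIS_DOC_TEMPLATE.format(
            summary=summary, axis_descr=axis_descr
        )
        return func
    return decorator


@with_axis_doc(
    summary="Return the sum of the values over the requested axis.",
    axis_descr=(
        "Whether to collapse the index (0), resulting in column sums, "
        "or the columns (1), resulting in row sums."
    ),
)
def total(rows, axis=0):
    # Toy implementation on a list of rows, just so the example runs.
    if axis == 0:
        return [sum(col) for col in zip(*rows)]
    return [sum(row) for row in rows]


print(total.__doc__)
print(total([[1, 4], [2, 5], [3, 6]], axis=0))  # [6, 15]
```

Each method sharing the template would pass its own axis_descr, so the generic "Axis for the function to be applied on" text would only be a fallback.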

@hongshaoyang
Contributor

take

hongshaoyang added a commit to hongshaoyang/pandas that referenced this issue Oct 10, 2020
@jreback modified the milestones: Contributions Welcome, 1.2 Oct 11, 2020
8 participants