
DOC: Adding guide for the pandas documentation sprint #19704


Merged: 9 commits merged into pandas-dev:master on Mar 12, 2018

Conversation

datapythonista (Member):

This PR is to make it easier to review the proposal guide for the pandas documentation sprint, as discussed in pandas-dev.

numpydoc recommends avoiding "obvious" imports and importing them with
aliases, so for example `import numpy as np`. While this is now a standard
in the Python data ecosystem, it doesn't seem a good practice, for the
following reasons:
Contributor:

Fully acknowledging that I'm bike-shedding... but I really disagree with this. I personally think aliasing is a great compromise: it avoids import * (like R, etc.) while still keeping enough brevity for interactive workflows. The recommendation also conflicts with the majority of pandas/numpy code in the wild.

I also agree with the numpydoc suggestion to avoid obvious imports - I suspect the first most common use of docstrings is inside a repl/notebook/etc - showing the imports adds noise in that context.

Member:

About the aliasing (import pandas as pd, import numpy as np), I agree with @chris-b1. It's something new users will have to learn indeed, but it's something they will need to learn anyhow, as almost any code you will see online uses those aliases.

Also conflicts with majority of pandas/numpy code in the wild.

That's not my perception, but maybe I am a bit biased by my environment :)

About whether we want to show the imports or not, here I am more open to being convinced otherwise, although I also find that having those two imports everywhere adds noise.

@chris-b1 (Contributor):

Apart from my one comment, I think this is great and appreciate what you've done to pull it together!

@TomAugspurger (Contributor) left a comment:

Thanks for putting this together.


The short summary must start with a verb infinitive, end with a dot, and fit in
a single line. It needs to express what the function does without providing
details.
Contributor:

I think we want a length limit here, so the lines in http://pandas-docs.github.io/pandas-docs-travis/api.html don't wrap. Though it looks like we can't set a hard limit, since the width available depends on the section...

Contributor:

I also don't really care about the "verb infinitive" part.

Member:

This doesn't solve the wrapping issue you mention, but I wonder if, either here or in the General section, we should note that comments still need to wrap at 79 characters for PEP-8 compliance.

Member Author:

I checked the line length, and it seems that for the rendering in autosummaries something around 60 to 75 characters is the maximum length to avoid wrapping. The document currently states to fit in a single line, which with the PEP-8 line length means 76 characters for functions and 72 for methods. Unless you have a better idea, I'll leave it as it is, as I think it's much easier for people to write a single line than to count the characters.

Member Author:

Regarding the infinitive verb, it's a standard used in Python projects in investment banking. I think it helps people write concise summaries. All functions/methods do things, so starting with a verb should always make sense, and it avoids openings like "This function [...]" or "Method to [...]". Requiring the infinitive is just to standardize "Generates [...]", "Generating [...]" and "Generate [...]" into a single form. Not sure if there was any other reason besides the infinitive being shorter, but unless you really want to get rid of this rule, I'll keep it as it's used in investment banks.
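To make the summary rule concrete, here is a minimal, invented sketch of the convention being discussed (infinitive verb, one line, ending with a dot); it is not text taken from the guide itself:

# Preferred: the short summary starts with a verb in infinitive form,
# fits on one line and ends with a dot.
def head(df, n=5):
    """
    Return the first n rows of the object.
    """

# Phrasings the rule tries to avoid: "Returns the first n rows." or
# "This method returns the first n rows of the object."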

- tuple of (str, int, int)
- set of {str}

In case there are just a set of values allowed, list them in curly brackets
Contributor:

Specify that the default value, if any, is listed first.

Member:

Do we want to require the curly braces? At the moment we don't really use it that much I think.

I also like a more explicit "default value" than just relying on the fact it is listed first

Member Author:

Adding the default value, and that it goes first in a list of options.

Regarding curly brackets, didn't realize they're not being used. But that's part of the numpy convention "When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first". Leaving that, unless you want to change that.

Member:

Regarding the default, I think the rule that the default is the first is rather obscure to readers. I am also not sure if the first place is necessarily the best. Eg for the hypothetical example below of {0, 10, 25}, I would rather list them in numerical order even if 0 is not the default.

Member:

I agree with @jorisvandenbossche : again, we document the default value quite consistently already
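For illustration only, a hypothetical parameter description combining the braces notation with an explicit default, loosely modeled on pandas.to_numeric but not copied from the guide:

def to_numeric_example(arg, errors='raise'):
    """
    Convert the argument to a numeric type.

    Parameters
    ----------
    errors : {'raise', 'coerce', 'ignore'}, default 'raise'
        How invalid parsing is handled.
    """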

- pandas.DataFrame

If more than one type is accepted, separate them by commas, except the
last two types, which need to be separated by the word 'or':
Contributor:

Ohh do we get to bikeshed about using serial commas? :)

Section 5: See also
~~~~~~~~~~~~~~~~~~~

This is an optional section, used to let users know about pandas functionality
Contributor:

Not necessarily just pandas. We do "See Also" to other packages.

Specify that if you're referring to a method in another package you need the package name, like numpy.where, not np.where.

Member:

It's strictly optional, but I would maybe add "optional but strongly recommended section"
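As a rough sketch of the two points above (fully qualified names for other packages, and the section being recommended rather than strictly mandatory), with entries invented for illustration:

def head_example(df, n=5):
    """
    Return the first n rows of the object.

    See Also
    --------
    DataFrame.tail : Return the last n rows.
    numpy.where : Select elements from arrays depending on a condition.
    """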


The way to present examples is as follows:

1. Import required libraries
Contributor:

Don't need to import, since we import them in our doctest setup.

Member:

Normally we assume the imports

import numpy as np
import pandas as pd

other imports should be done explicitly
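A minimal sketch of how an Examples section might look under that assumption (numpy and pandas pre-imported as np and pd, anything else imported explicitly); the function and values are made up:

def example():
    """
    Examples
    --------
    >>> s = pd.Series([1, 2, 3])
    >>> s.sum()
    6

    Anything beyond numpy/pandas is imported explicitly:

    >>> import datetime
    >>> pd.Timestamp(datetime.datetime(2018, 3, 12))
    Timestamp('2018-03-12 00:00:00')
    """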


3. Show a very basic example that gives an idea of the most common use case

4. Add commented examples that illustrate how the parameters can be used for
Contributor:

Maybe a different word than "commented", since people may interpret that as lines starting with #

example in the head method, where it requires to be higher than 5, to show
the example with the default values.

Avoid using data without interpretation, like a matrix of random numbers
Contributor:

There's an issue about using common, meaningful datasets for these. Maybe we can make a decision on that before the sprint (will try to find link later).

Member:

I also remember such a discussion, but didn't find an open issue, only a discussion in a PR (with a mention of gitter), so I opened a new issue: #19710
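In the meantime, a small invented example of the kind of interpretable data the guide has in mind, as opposed to a matrix of random numbers (the values are made up, not taken from any agreed dataset):

import pandas as pd

# Data a reader can interpret at a glance: animals and their maximum speeds,
# rather than np.random output.
df = pd.DataFrame({'animal': ['falcon', 'parrot', 'lion'],
                   'max_speed': [389.0, 24.0, 80.5]})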

@jorisvandenbossche (Member) left a comment:

We should also make a kind of "summary checklist"

documents that explain this convention:

- `Guide to NumPy/SciPy documentation <https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt>`_
- `numpydoc docstring guide <http://numpydoc.readthedocs.io/en/latest/format.html>`_
Member:

I think it is enough to only link to this last one, normally it should contain everything that is in the first (they only recently made that doc page)

Member Author:

Even if the last one should contain the same as the first, with a better presentation, I think it's good that people are aware of the first document. People are not really expected to read or follow them; they're presented here just for reference. I'll leave both unless you really feel we should get rid of the first.

@jorisvandenbossche (Member), Feb 25, 2018:

I personally think it will just confuse people to give two links. If there are two, I expect that it is somehow useful to look at both of them, but I am then left wondering what the difference is, because the content is almost exactly the same.

Member Author:

That's a good point. I left just the numpy doc one in the list, and kept the other just in a comment. Let me know if you think it's still worthless to have it this way.


If the type is a pandas type, also specify pandas:

- pandas.Series
- pandas.DataFrame
Member:

I would say that for those two just "Series" and "DataFrame" is enough? (otherwise it can become quite lengthy)


If the type is in a package, the module must be also specified:

- numpy.ndarray
Member:

In practice, we now often say something like "array-like", when both lists and arrays are allowed.

Member:

I am not sure we actually have many cases in the user facing functions where we require a numpy array (I mean, where we don't accept a list as well, and thus we would not use 'array' or 'array-like' in general)

Member:

I agree, although we could maybe have some place in the docs where we clearly define what an "array-like" is (e.g. tuples aren't)... and maybe even refer to it with a footnote?

- str or list of str

If None is one of the accepted values, it always needs to be the last in
the list.
Member:

Maybe we should discuss how we want to express the "notion" of optional:

  • int or float, optional
  • int, float or None
  • int or float, default None

Member:

I vote option 3. I'd give a second place nod to option 1, but optional and default None are essentially the same thing. It seems like an unnatural rule to enforce that default 'foo' is the rule when a keyword has a non-None default argument but optional is the route for None

Member:

It seems like an unnatural rule to enforce that default 'foo' is the rule when a keyword has a non-None default argument but optional is the route for None

Having a different way to describe it can also signal that it actually is something different in practice. In (many, but not all) cases, a value of None really means that it is not specified and is optional (like method=None in fillna, because the default is to fill with a fixed value and not to use a forward or backward filling method), in contrast with other keywords that have a default value (like skipna=True; for users it 'feels' optional because you typically don't need to specify it, but it is not optional).
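To make that distinction concrete, a hypothetical Parameters block contrasting the two cases (keyword names borrowed from fillna-style methods, descriptions invented):

def fill_example(value=None, method=None, skipna=True):
    """
    Fill missing values (illustrative only).

    Parameters
    ----------
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Filling method to use. None means no method is used and a fixed
        value is filled instead, so the keyword is genuinely optional.
    skipna : bool, default True
        Whether to exclude NA values. It always has an effect, so it is
        not really "optional" even though it has a default.
    """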


Examples in docstrings are also unit tests, and besides illustrating the
usage of the function or method, they need to be valid Python code that
deterministically returns the presented output.
Member:

I would maybe not explicitly say "unit tests" (strictly speaking they also aren't at the moment, as we don't run doctests), but we want it to be correct Python syntax because people can copy-paste it to interact with the example themselves or to reproduce it?



Examples
--------
>>> import pandas
>>> s = pandas.Series(['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
Member:

pandas -> pd


in the Python data ecosystem, it doesn't seem a good practice, for the
following reasons:

* The code is not executable anymore (as doctests for example)
Member:

When running doctests with pytest, numpy and pandas will always be imported automatically

@jorisvandenbossche (Member):

BTW, we should certainly keep this as a separate document, the contributing.rst is already long enough (we should rather start splitting more of our long doc pages in separate pieces IMO)

so programmers can understand what it does without having to read the details
of the implementation.

Also, it is a commonn practice to generate online (html) documentation
Member:

common

@datapythonista (Member Author):

All comments should be addressed in this new version, except in the cases where I added a comment to the review. Also still pending is the part on the standard datasets (#19710).

@jorisvandenbossche (Member) left a comment:

Some further comments.

Another question, with your docstring_validation script, would it be easy to print out all current type descriptions? (to have a quick overview of what we currently have, and to know which ones are useful to include here in the docs or which ones we need to discuss to have a consistent usage)
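Not the actual docstring_validation script, but a rough sketch of how such a listing could be produced, assuming docstrings follow the numpydoc layout (a regex approximation, so it will miss or mis-parse some entries):

import inspect
import re

import pandas as pd

# A new numpydoc section header at column 0, e.g. "Returns" or "See Also".
SECTION = re.compile(r"^[A-Z][A-Za-z ]+$")
# A parameter line at column 0, "name : type description".
PARAM = re.compile(r"^\w[\w*, ]* : (.+)$")


def iter_param_types(obj):
    """Yield the type description of each parameter in a numpydoc docstring."""
    doc = inspect.getdoc(obj) or ""
    in_params = False
    for line in doc.splitlines():
        if line.strip() == "Parameters":
            in_params = True
            continue
        if in_params:
            if SECTION.match(line):
                break  # reached the next section
            match = PARAM.match(line)
            if match:
                yield match.group(1).strip()


types = set()
for name in dir(pd.DataFrame):
    if name.startswith("_"):
        continue
    attr = getattr(pd.DataFrame, name, None)
    if callable(attr):
        types.update(iter_param_types(attr))

for type_description in sorted(types):
    print(type_description)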

description in this case would be "Description of the arg (default is X).". In
some cases it may be useful to explain what the default argument means, which
can be added after a comma "Description of the arg (default is -1, which means
all cpus).".
Member:

Currently, a very frequently occurring pattern is to list the default in the type description, like color : str, default 'blue' or copy : boolean, default True (I think we even do it relatively consistently).

I personally think we would like to keep this. Often the description can be quite long. Having it at the end of the type description gives it a prominent and consistent place to find it.
See eg the how keyword in https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html#pandas.DataFrame.join where the description is a list of the different possible values.

Member Author:

Having the default after the description (and not the type), and having the default first if it's a set like {0, 10, 25}, is in the numpy docstring convention. I agree with you in both cases, it can be clearer having the default after the type, and the options in a consistent order. Just pointing out where these come from.

Member:

I also like the current way, and our users might be used to it (I certainly am)
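For reference, a hedged sketch of that currently common pattern (default stated at the end of the type line, with the longer explanation underneath); the keywords are illustrative, not taken from a real pandas signature:

def join_example(data, copy=True, how='left'):
    """
    Illustrate the "default in the type line" pattern.

    Parameters
    ----------
    copy : bool, default True
        Whether to copy the input data. Set to False to avoid a copy when
        the input is already in the desired form.
    how : {'left', 'right', 'outer', 'inner'}, default 'left'
        How to handle the operation, with the possible values explained
        in a list in the description when needed.
    """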


- int
- float
- str
Member:

let's add dict, list and tuple as other often occurring types

Member:

and boolean (or bool), whichever of the two we converge on

Member Author:

Adding bool. dict, list and tuple are already documented in the next block.



For complex types, define the subtypes:

- list of [int]
Member:

I think this is also something we are currently not doing?
I am not sure if I find list of [int] better than list of int

(for the dict and tuple it can be more illustrative)

Member Author:

Just trying to be consistent for list. I agree that for dict and tuple it adds value.

Probably not something that would ever happen, but it could be clearer as:
list of [dict of {int: str}]

than
list of dict of {int: str}

But happy with list of int too.


think about what can be useful for the users reading the documentation,
especially the less experienced ones.

When relating to other methods (mainly `numpy`), use the name of the module
Member:

"When relating to other methods" -> was this intended to be "other libraries" or "other modules"?


Return
------
pandas.Series
Member:

pandas.Series -> Series
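A small sketch of how the Returns section could then look (the section title is "Returns" in numpydoc; the function and description are invented):

def value_counts_example(values):
    """
    Count the occurrences of each unique value.

    Returns
    -------
    Series
        Counts of unique values, sorted in descending order.
    """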

@jreback (Contributor) left a comment:

is this linked anywhere from the contributing docs?

0
"""
return num1 + num2

Contributor:

can you add a See Also section

yield random.random()


Section 5: See also
Contributor:

I would add refs to all of these sections, also capitalize as you would in the doc-string

... columns=('a', 'b', 'c'))
"""
pass

Contributor:

maybe put the sections in the same order as we want in the doc-string

Member Author:

Not sure I understand this, sorry... There is just the Examples section in this docstring. Do you mean adding the parameters, see also...?

@datapythonista (Member Author):

@jorisvandenbossche, here you have the list of all parameters currently used in docstrings:

 {'any', 'all'}, default 'any'
'all', list-like of dtypes or None (default), optional
'fixed(f)|table(t)', default is 'fixed'
'infer', bool-ndarray, 'NaT', default 'raise'
'raise', 'coerce', default 'raise'
(float,float), optional
1d array-like
1d ndarray or Series
2-length sequence (tuple, list, ...)
A tuple (width, height) in inches
CategoricalDtype
DataFrame
DataFrame or Panel
DataFrame or Series
DataFrame or Series/dict-like object, or list of these
DataFrame, Series
DataFrame, Series with name field set, or list of DataFrame
DataFrame, or object coercible into a DataFrame
DateOffset object, or string
DateOffset, timedelta, or offset alias string, optional
DateOffset, timedelta, or time rule string, default None
DateOffset, timedelta, or time rule string, optional
DatetimeIndex or TimedeltaIndex
Drop groups that do not pass the filter. True by default;
Grouped DataFrame
How to join individual DataFrames
Index
Index or array-like
Index or list/tuple of indices
Index, optional
Index-like
Index-like (unique), optional
IndexSlice
Keyword Arguments
Matplotlib axes object, optional
Matplotlib axis object, default None
Matplotlib axis object, optional
MultiIndex or list of tuples
NDFrame, default None
Name of the column containing class names
None
None or float value, default None
None or float value, default None (NaN)
None or str, optional
None, integer or string axis name, optional
NumPy array or integer, optional
NumPy dtype (default: float64)
NumPy dtype (default: int64)
NumPy dtype (default: object)
NumPy dtype (default: uint64)
Number of points to plot in each curve
Object
Panel or list of Panels
Panel, or object coercible to Panel
Period frequency
Period or compat.string_types, default None
Python write mode, default 'w'
SQLAlchemy connectable (engine/connection) or database string URI
SQLAlchemy engine or DBAPI2 connection (legacy mode)
Series
Series or DataFrame
Series or list/tuple of Series
Series or scalar value
Series, DataFrame
Series, DataFrame, or constant
Series, DataFrame, or ndarray, optional
Setting this to True will show the grid
StringIO-like, optional
The number of decimal places to use when encoding
The object to check.
Timedelta, timedelta, np.timedelta64, string, or integer
Type name or dict of column -> type, default None
a list of columns that if not None, will limit the return
a sequence or mapping of Series, DataFrame, or Panel objects
a valid JSON string or file-like, default: None
alignment axis if needed, default None
alignment level if needed, default None
allowed axis of the other object, default None
an iterable
array or boolean, default None
array-like
array-like (1-dimensional)
array-like (1-dimensional), optional
array-like or Categorical, (1-dimensional)
array-like or Index (1d)
array-like or callable, default None
array-like, Series, or DataFrame
array-like, Series, or list of arrays/Series
array-like, default None
array-like, dict, or scalar value
array-like, integers
array-like, optional
array-like, optional (should be specified using keywords)
array_like
axes to direct sorting
axis to shift, default 0
bool
bool (default True)
bool (default: True)
bool or None, default True
bool or list of bool, default True
bool or same types as ``to_replace``, default False
bool, default False
bool, default NaN
bool, default None
bool, default True
bool, default True.
bool, defaults to False
bool, optional
boolean
boolean (default: False)
boolean NDFrame, array-like, or callable
boolean array-like with the same length as self
boolean or dict, default True
boolean or list of ints or names or list of lists or dict, default False
boolean or list of string, default True
boolean or string, default False
boolean or string, default True
boolean whether to append to an existing msgpack
boolean,
boolean, (default False)
boolean, True by default
boolean, default False
boolean, default False, do not write an ALL nan row to
boolean, default None
boolean, default True
boolean, default True if ax is None else False
boolean, default True, append the input data to the
boolean, default ``True``
boolean, default is True,
boolean, default to False
boolean, defaults to False
boolean, defaults to True
boolean, if True, return an iterator to the unpacker
boolean, optional
boolean, return an iterator, default False
boolean, should automatically close the store when
boolean, {'all', 'index', 'columns'}, or {0,1}, default False
boolean/string, default None
callable
callable or tuple of (callable, string)
callable(1d-array) -> 1d-array<boolean>, default None
callable, default None
callable, optional
callable, string, dictionary, or list of string/callables
category or list of categories
category or list-like of category
character, default ","
class, default dict
closed end of interval; 'left' or 'right'
column label or list of column labels / arrays
column label or sequence of labels, optional
column name or list of names, or vector
column to aggregate, optional
column, Grouper, array, or list of the previous
data type, or dict of column name -> data type
date or array of dates
date, string, int
datetime
datetime-like, str, int, float
datetime.time or string
datetime.time, str
default NaN, fill value for missing values.
default None, provide an encoding for strings
deprecated, use `expand`
dict
dict (python 3), str or None (python 2)
dict of column name to SQL type, default None
dict of columns that specify minimum string sizes
dict or list of dicts
dict, default None
dict, default is None
dict, optional
dict-like or function, optional
dtype or None, default None
dtype, default None
dtype, default np.uint8
end time, datetime-like, optional
end time, timedelta-like, optional
end value, period-like, optional
expected TOTAL row size of this table
float
float or array-like, default 0.5 (50% quantile)
float or array_like
float or array_like, default None
float, default NaN
float, default None
float, defaults to NaN (missing)
float, optional
force encoded string to be ASCII, default True.
freq string/object
frequency string
function
function, default None
function, dict, or Series
function, list of functions, dict, default numpy.mean
function, optional
hint to the hashtable sizer
identifier of index column, defaults to None
ignored
index to direct sorting
index, columns to direct sorting
index-like
int
int (can only be zero)
int (default: 0)
int (default: 0), or other RangeIndex instance.
int (default: 1)
int or None
int or array
int or axis name
int or basestring
int or csv.QUOTE_* instance, default 0
int or level name or list of ints or list of level names
int or level name, default None
int or list of ints
int or list of ints, default 'infer'
int or list, default None
int or name
int or numpy.random.RandomState, optional
int or sequence or False, default None
int or str
int or str, default 0
int or str, optional
int or str, optional, default None
int or string
int or string axis name
int or string axis name, optional
int or string, default 0
int or string, optional
int, Series, or array-like
int, array, or Series, default None
int, array-like
int, default -1
int, default -1 (all)
int, default 0
int, default 0 (no flags)
int, default 1
int, default 5
int, default None
int, default None.
int, defaults None
int, dict, Series
int, float, Interval
int, level name, or sequence of int/level names (default None)
int, level name, or sequence of such, default None
int, list of ints, default 0
int, list of ints, default None
int, optional
int, optional, > 0
int, optional, default 0
int, or offset
int, sequence of scalars, or IntervalIndex
int, str or None
int, str, default None
int, str, tuple, or list, default None
int, string (can be mixed)
int, string, or list of these, default -1 (last level)
int, string, or list of these, default last level
int/level name or list thereof
integer (defaults to None), row number to start selection
integer (defaults to None), row number to stop selection
integer or array of quantiles
integer or sequence, default 10
integer, default 1
integer, default None
integer, float, string, datetime, list, tuple, 1-d array, Series
integer, optional
interval boundary to use for labeling; 'left' or 'right'
item label (panel item)
iterable, Series, DataFrame or dictionary
iterable, optional
keyword arguments to pass on to the constructor
keyword arguments to pass on to the interpolating function.
keyword, value pairs
keywords
label
label or list
label or list, or array-like
label or position, optional
label or tuple of labels (one for each level)
label rotation angle
label, default None
list
list / sequence of array-likes
list / sequence of iterables
list / sequence of strings or None
list / sequence of tuple-likes
list of Index objects
list of Term (or convertible) objects, optional
list of columns to create as data columns, or True to
list of int or list of str
list of int representing new level order.
list of ints. optional
list of pairs (int, int) or 'infer'. optional
list of paths (string or list of strings), default None
list of sequences, default None
list or None
list or dict of one-parameter functions, optional
list or dict, default: None
list, default None
list, default: None
list, tuple or dict, optional, default: None
list, tuple, 1-d array, or Series
list-like
list-like of Categorical, CategoricalIndex,
list-like of dtypes or None (default), optional,
list-like of numbers, optional
list-like or None, default None
list-like or integer or callable, default None
list-like, default None
list-like, dict-like or callable
list-like, int or str, default 0
list-like, or list of list-likes
mapping, function, label, or list of labels
mapping, optional
matplotlib axes object, default None
matplotlib axis object
name, tuple/list of names, or array-like
name/number, defaults to None
ndarray
ndarray (1-d)
ndarray (items x major x minor), or dict of DataFrames
ndarray (structured dtype), list of tuples, dict, or DataFrame
ndarray or object value
nrows to include in iteration, return an iterator
number/name of the axis, defaults to 0
numeric or datetime-like, default None
numeric, optional
numeric, string, or DateOffset, default None
numpy dtype or pandas type
numpy ndarray (structured or homogeneous), dict, or DataFrame
numpy.dtype or None
object
object to be converted
object, default ''
object, default None
object, defaults to first n levels (n=1 or len(key))
object, optional
one-parameter function, optional
optional
optional int
optional sequence of objects
optional, 'infer' or None, defaults to None
optional, array-like
optional, defaults False.
optional, defaults to tab
other plotting keyword arguments
path (string), buffer or path object (pathlib.Path or
pytz.timezone or dateutil.tz.tzfile
raise on invalid input
replace NaN with this value if the unstack produces
scalar
scalar or array_like, optional
scalar or list-like
scalar value
scalar, NDFrame, or callable
scalar, default 'value'
scalar, default None
scalar, default is 'unix'
scalar, default np.NaN
scalar, dict, Series, or DataFrame
scalar, dict, list, str, regex, default None
scalar, hashable sequence, dict-like or function, optional
scalar, list-like, dict-like or function, optional
scalar, list-like, optional
scalar, or array-like
scalar, str, list-like, or dict, default None
scipy.sparse.coo_matrix
sequence
sequence of (key, value) pairs
sequence of arrays
sequence or list of sequence
sequence, default None
sequence, optional
set or list-like
single label or list-like
size to chunk the writing
sort by the remaining levels after level.
starting value, datetime-like, optional
starting value, period-like, optional
starting value, timedelta-like, optional
str
str (length 1), default None
str (length 1), optional
str or None
str or PeriodDtype, default None
str or buffer
str or csv.Dialect instance, default None
str or file-like
str or int, optional
str or list
str or list of str
str or list-like
str or matplotlib colormap object, default None
str or ndarray-like, optional
str or sequence
str or unicode
str {'E', 'S'}
str {'dict', 'list', 'series', 'split', 'records', 'index'}
str, default ""
str, default ','
str, default '.'
str, default '\\d+'
str, default '\s+'.
str, default 'pad'
str, default None
str, default \t (tab-stop)
str, default ``'	' + ' '``
str, default ``None``
str, default is 'utf-8'
str, method of resampling ('ffill', 'bfill')
str, optional
str, optional (python 2)
str, pathlib.Path, py._path.local.LocalPath or any \
str, regex, list, dict, Series, numeric, or None
str, tuple, datetime.timedelta, DateOffset or None
str, {'raise', 'ignore'}, default 'raise'
string
string (regular expression)
string / frequency object, defaults to None
string File path, BytesIO like or string
string File path, buffer-like, or None
string SQL query or SQLAlchemy Selectable (select or text object)
string file path or file handle / StringIO
string file path, or file-like object
string or DateOffset, default 'B' (business daily)
string or DateOffset, default 'D' (calendar daily)
string or DateOffset, optional
string or ExcelWriter object
string or None
string or None, default None
string or SQLAlchemy Selectable (select or text object)
string or callable
string or compiled regex
string or datetime-like, default None
string or file handle, default None
string or file-like object
string or int, optional
string or list of strings, default None
string or list of strings, optional, default: None
string or object
string or object, optional
string or pandas offset object, optional
string or period object, optional
string or period-like, default None
string or pytz.timezone object
string or sequence
string or sequence, default None
string or timedelta-like, default None
string to use as string nan represenation
string {'xport', 'sas7bdat'} or None
string,
string, DateOffset, dateutil.relativedelta
string, None or encoding
string, default
string, default "Pandas"
string, default "|"
string, default ''
string, default '.'
string, default 'All'
string, default 'Sheet1'
string, default '_'
string, default 'inf'
string, default 'ms' (milliseconds)
string, default 'ns'
string, default None
string, default frequency of PeriodIndex
string, default is None
string, default whitespace
string, defaults to None
string, int, mixed list of strings/ints, or None, default 0
string, list of fields, array-like
string, list of strings, or dict of strings, default None
string, number, or hashable object
string, optional
string, optional, default: None
string, optional, {'pad', 'ffill', 'bfill'}
string, path object (pathlib.Path or py._path.local.LocalPath),
string, pytz.timezone, dateutil.tz.tzfile or None
string, timedelta, list, tuple, 1-d array, or Series
string, valid regular expression
string, {'ns', 'us', 'ms', 's', 'm', 'h', 'D'}, optional
the axis to convert
the axis to localize
the pandas object holding the data
the path (string) or HDFStore object
the path or buffer to write the result string
three positional arguments: each one of
timedelta
tuple
tuple (optional)
tuple and dict
tuple of integer (length 2), default None
tuple, default None
tuple, list, or ndarray, optional
tuple, optional
tuple/list
type of compressor (zlib or blosc), default to None (no
type of object to recover (series or frame), default 'frame'
unit of the arg (D,h,m,s,ms,us,ns) denote the unit, which is an
value
where to reorder levels
writable buffer, defaults to sys.stdout
{'NFC', 'NFKC', 'NFD', 'NFKD'}
{'all', 'any'}, default 'any'
{'any', 'all'}
{'auto', 'pyarrow', 'fastparquet'}, default 'auto'
{'average', 'min', 'max', 'first', 'dense'}
{'average', 'min', 'max', 'first', 'dense'}, efault 'average'
{'backfill', 'bfill', 'pad', 'ffill', None}, default None
{'backfill'/'bfill', 'pad'/'ffill'}, default None
{'block', 'integer'}
{'c', 'python'}, optional
{'columns', 'index'}, default 'columns'
{'fail', 'replace', 'append'}, default 'fail'
{'first', 'last', False}, default 'first'
{'first', 'last'}, default 'first'
{'first', 'last'}, default 'last'
{'forward', 'backward', 'both'}, default 'forward'
{'hist', 'kde'}
{'ignore', 'raise', 'coerce'}, default 'raise'
{'ignore', 'raise'}, default 'raise'
{'infer', 'gzip', 'bz2', 'xz', 'zip', None}, default 'infer'
{'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
{'inner', 'outer'}, default 'outer'
{'inside', 'outside'}, default None
{'integer', 'signed', 'unsigned', 'float'} , default None
{'items', 'major', 'minor'}
{'items', 'major', 'minor'} or {0, 1, 2}
{'items', 'major', 'minor}, default 1/'major'
{'items', 'minor', 'major'}, or {0, 1, 2}, or a tuple with two
{'items', 'minor'}, default 'items'
{'ix', 'loc', 'getitem'}
{'ix', 'loc', 'getitem'} or None
{'keep', 'top', 'bottom'}
{'left', 'right', 'both', 'neither'}, default 'right'
{'left', 'right', 'both'}, default 'left'
{'left', 'right', 'inner', 'outer'}
{'left', 'right', 'outer', 'inner'}
{'left', 'right', 'outer', 'inner'}, default 'inner'
{'left', 'right', 'outer', 'inner'}, default: 'left'
{'left', 'right'}
{'left', 'zero',' mid'}, default 'left'
{'left'}, default 'left'
{'linear', 'lower', 'higher', 'midpoint', 'nearest'}
{'linear', 'time', 'index', 'values', 'nearest', 'zero',
{'major', 'minor', 'items'}, default 'major'
{'mergesort', 'quicksort', 'heapsort'}, default 'quicksort'
{'outer', 'inner', 'left', 'right'}, default 'outer'
{'pearson', 'kendall', 'spearman'}
{'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
{'raise', 'ignore'}, default 'raise'
{'raise', 'ignore'}, default 'raise'.
{'right', 'left'}
{'s', 'e', 'start', 'end'}
{'snappy', 'gzip', 'brotli', None}, default 'snappy'
{'start', 'end', 'e', 's'}
{'start', 'end', 's', 'e'}
{'start', 'end'}, default end
{0 or 'index', 1 or 'columns'}
{0 or 'index', 1 or 'columns'}, default 0
{0 or 'index', 1 or 'columns'}, default None
{0 or 'index', 1 or 'columns'}, or tuple/list thereof
{0, 'index'}
{0, 'index'}, default 0
{0, 'index'}, default None
{0, 1, 'index', 'columns'}
{0, 1, 'index', 'columns'} (default 0)
{0, 1}, default 0
{0/'index', 1/'columns'}, default 0
{None, 'axes', 'dict', 'both'}, default None
{None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, optional
{None, 'epoch', 'iso'}
{None, 'gzip', 'bz2', 'xz'}
{None, 'ignore'}
{None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional
{None, True, False}, optional
{Series, DataFrame, Panel}
{default 'raise', 'drop'}, optional
{index (0), columns (1)}
{index (0)}
{items (0), major_axis (1), minor_axis (2)}
{items, major_axis, minor_axis}

~~~~~~~~~~~~~

Docstrings must be defined with three double-quotes. No blank lines should be
left before or after the docstring. The text starts immediately after the
Member:

This means the vast majority of current pandas docstrings do it wrong (and as a result... are more readable, I think). Everybody all right with this?

Member:

Yes, I personally also like how we mostly start on the following line (which also gives you 3 characters more for the summary line ... :-))

PEP257 also says both are ok.

Member Author:

I think I got that from the numpy convention. I'm more used to starting on the same line, but as long as we always use the same one, I don't think it'll make a difference for anyone.
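For illustration, the two placements being compared (PEP 257 allows both; the docstring text is invented):

# Text immediately after the opening quotes, as the quoted guide text seems to require.
def same_line_style(arg):
    """Return the argument unchanged."""

# Text starting on the following line, as most current pandas docstrings do.
def next_line_style(arg):
    """
    Return the argument unchanged.
    """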

can have multiple lines. The description must start with a capital letter, and
finish with a dot.

Keyword arguments with a default value, the default will be listed in brackets
Member:

For keyword arguments with a default value

- pandas.SparseArray

If the exact type is not relevant, but must be compatible with a numpy
array, array-like can be specified. If Any type that can be iterated is
@toobaz (Member), Mar 2, 2018:

any (lower case)

accepted, iterable can be used:

- array-like
- iterable
@toobaz (Member), Mar 2, 2018:

This might be subtle. For instance, pd.Series(i for i in range(3)) works, but it is undocumented. However, I don't think we want to replace "array-like" with "iterable": probably have both, although they are theoretically redundant.

@TomAugspurger (Contributor):

Where's this at? Are we going through another round of reviews or can we merge this and iterate if needed?

@datapythonista (Member Author):

@TomAugspurger I made some changes based on the points discussed in pandas-dev in the python-sprints version:
python-sprints/python-sprints.github.io@0dc3c18

I need to address a couple of comments that @jorisvandenbossche pointed out (the default is incorrectly defined, the fillna is not good...) and then it should be a good first version. I'll make the changes later today and update this PR with them.

@jorisvandenbossche (Member):

I copied the latest version of the sprints repo and pushed that as a commit, so people can already give this another round of review here if needed.

@datapythonista (Member Author):

Updated the documentation with the last changes (mainly the points discussed in pandas-dev), and two new sections, one at the beginning about when to use backticks in the docstrings, and another at the end on how to add plots to the documentation.

Any feedback welcome. If you prefer to read it in html, the exact same version as this PR is available here: https://python-sprints.github.io/pandas/guide/pandas_docstring.html

@jorisvandenbossche (Member):

@datapythonista thanks a lot for the updates. Really looking great.

To the others: if you could do a final read of it before the sprint, that would be very welcome.


def add_values(arr):
"""
Add the values in `arr`.
Contributor:

Did we settle on single or double backticks here?

Contributor:

For reference, typing

`arr`

Uses sphinx's default role. That's currently None (no role) in our conf.py, but it could be whatever. Does sphinx have a "parameter role"? I'm not finding one.

Member:

Single backticks is what numpydoc spec says to do. But I don't know how useful it is to make the distinction between single backticks for parameters but double backticks for code (other function names, parameter name combined with a value, ..).

Member:

Numpy uses 'autolink' as their default role, which means they also use single backticks for other functions, and those automatically become links to the docstring page, which is also nice.

Look e.g. at the keepdims explanation in the parameter section: https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
All of keepdims, ndarray and sum use single backticks. The first is rendered as italic, while the others are links to their docstring pages.

Member:

But probably a bit late to change now before the sprint, without really trying it out. I would propose to keep it as is?

Using it would however make the docs a bit more pleasant to read (or write) in plain text.

Contributor:

FWIW, I think that'd be the best behavior. I'm not a huge fan of double backticks in docstrings, because they make the text version too noisy. For parameters, we get italics in the HTML (code might be better, but we at least have some formatting), or a link to the object without all the :ref: noise.

Contributor:

Sorry didn't see your last post before posting.

Agreed it's too late to change for the sprint. But let's leave the recommendation as is (use single backtick for parameters), since I think it's what we'll want in the future.
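If the default role were ever switched as floated above, a minimal conf.py sketch might look like this; 'autolink' is the role numpy reportedly uses, and whether the pandas Sphinx setup provides it out of the box is an assumption here, not something verified in this thread:

# doc/source/conf.py (sketch, not the actual pandas configuration)
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "numpydoc",
]

# With a default role set, `arr` in a docstring goes through that role:
# known objects can become links, plain names fall back to italics.
default_role = "autolink"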

- numpy.ndarray
- scipy.sparse.coo_matrix

If the type is a pandas type, also specify pandas except for Series and
Contributor:

Why the exception here? IMO, it'd be clearer to follow the rules that numpydoc / sphinx uses for discovery (so anything in the top-level pandas namespace should be found). That way we have consistency with the See Also section.

Member:

Do you mean everything with 'pandas', or everything without it?
I suppose without? I am fine with that, I think it was added mainly for being explicit for objects that maybe not everybody knows are coming from pandas.

Section 5: See Also
~~~~~~~~~~~~~~~~~~~

This section is used to let users know about pandas functionality
Contributor:

strike pandas, since we'll often link to numpy / python / other libraries as well.

related to the one being documented. In rare cases, if no related methods
or functions can be found at all, this section can be skipped.

An obvious example would be the `head()` and `tail()` methods. As `tail()` does
Contributor:

Change these to be links to the methods? So people can click it and see the rendered docstring?

followed by a space, a colon, another space, and a short description that
illustrates what this method or function does, why it is relevant in this
context, and what the key differences are between the documented function and
the one referencing it. The description must also finish with a dot.
Contributor:

I don't think we need to require a description, do we? I think if you're writing read_csv and want to link to DataFrame.to_csv, just the link should be sufficient.

@TomAugspurger (Contributor):

TomAugspurger commented Mar 9, 2018 via email

@TomAugspurger (Contributor) left a comment:

+1

I think we should merge this, as we already have sprint PRs coming in. We can revise as needed in follow-up PRs.

@jorisvandenbossche (Member):

An up-to-date version is hosted on the sprint website, so merging is not that urgent. But also no problem to merge.

@TomAugspurger (Contributor):

TomAugspurger commented Mar 9, 2018 via email

@jorisvandenbossche (Member):

@TomAugspurger yeah, if you would have time for that, that would be welcome

@jorisvandenbossche merged commit 7169830 into pandas-dev:master on Mar 12, 2018
@jorisvandenbossche (Member):

OK, I updated it with the latest version from the sprint website and merged this one. I will create another issue to discuss further needed clarifications.

@jorisvandenbossche (Member):

Opened issue here: #20309
