-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
DOC: Update Working with text data for 3.0 #62581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
LGTM cc @jorisvandenbossche |
###################### | ||
Working with text data | ||
====================== | ||
###################### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important at all for this PR, but was just wondering since you changed it: it seems we are not very consistent with the type of char we use for the top-level header, but I haven't actually seen #
in other files (seems we mostly use *
or =
). We could maybe open an issue to standardize this between all files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been going off of this: https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funnily enough, they refer to cpython https://devguide.python.org/documentation/markup/#sections for this choice, but then the rst page that mentions this does not actually follow that ..
I think part of the ambiguous interpretation is whether that first ###
double line is actually used as the first header on every page, or whether this is only used on index.rst pages with a "higher" (like would be the case for "User guide" on user_guide/index.rst).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this!
doc/source/user_guide/text.rst
Outdated
1. ``object`` dtype NumPy array. | ||
2. :class:`StringDtype` extension type. | ||
1. :class:`StringDtype` extension type. | ||
2. ``object`` dtype NumPy array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. ``object`` dtype NumPy array. | |
2. NumPy ``object`` dtype. |
?
Wondering if we want to say "dtype" instead of array, since the line above is also using type and not array
s2 | ||
type(s2[0]) | ||
However there are four distinct :class:`StringDtype` variants that may be utilized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think I would move this lower down the file, so it is not one of the first things a user reads when looking at how to use strings in pandas (not before string methods are shown for example).
Because the file is meant as a user guide, although it probably reads more as a reference guide ..
Now, that's a bigger comment in general on the structure of the file, so let's see this comment as a possible future improvement. The PR is already a good step forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and it was easy to move down to the bottom. I left a cross-reference in this section.
doc/source/user_guide/text.rst
Outdated
This is the same as ``dtype='str'`` *when PyArrow is installed*. | ||
|
||
The implementation uses a PyArrow array, however NA values in this array | ||
are stored using ``np.nan``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are stored using ``np.nan``. | |
are represented as ``np.nan``. |
I don't know if this change helps clarity, but strictly speaking the missing values are not stored as such, we only use np.nan as the sentinel when the user accesses the data, or in things like __repr__
etc ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, this is a good point. I added in and behave as
.
doc/source/user_guide/text.rst
Outdated
1. Like ``dtype="object"``, :ref:`string accessor methods<api.series.str>` | ||
that return **integer** output will return a NumPy array that is | ||
either dtype int or float depending on the presence of NA values. | ||
Methods returning **boolean** output will return a NumPy array this is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Methods returning **boolean** output will return a NumPy array this is | |
Methods returning **boolean** output will return a NumPy array that is |
These are places where the behavior of ``StringDtype`` objects differ from | ||
``object`` dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are also still behaviour differences compared to object dtype. For example one of the items mentioned below about string methods returning bool with NA becoming False, that is different compared to object dtype.
Now that is maybe more something that is relevant for the migration guide? (assuming that readers of this page generally should be using str
by default)
(and I also realize that this change in behaviour is not yet mentioned in https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about replacing this entire section with a link to the migration guide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the entire section should live in the migration guide. Because this now mostly explains differences between the NaN variant and NA variant. While the current migration guide focuses on going from object dtype to the NaN variant.
I would maybe put back the last version of the text you had, but move it lower in the page together with the The four :class:`StringDtype` variants
you moved?
At the same time I will add something about the predicate methods difference between object dtype and str
to the migration guide.
Thanks @jorisvandenbossche - I think this is ready for another look. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.One of the more subtle changes is replacing
StringArray
withStringDtype
throughout. I did this primarily because it seems to me this should focus on what users specify thedtype=
as, but also because whether you get aStringArray
depends on Python vs PyArrow storage.