Skip to content

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Oct 4, 2025

One of the more subtle changes is replacing StringArray with StringDtype throughout. I did this primarily because it seems to me this should focus on what users specify the dtype= as, but also because whether you get a StringArray depends on Python vs PyArrow storage.

@rhshadrach rhshadrach added Docs Strings String extension data type and string data labels Oct 4, 2025
@rhshadrach rhshadrach marked this pull request as ready for review October 5, 2025 13:27
@jbrockmendel
Copy link
Member

LGTM cc @jorisvandenbossche

######################
Working with text data
======================
######################
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important at all for this PR, but was just wondering since you changed it: it seems we are not very consistent with the type of char we use for the top-level header, but I haven't actually seen # in other files (seems we mostly use * or =). We could maybe open an issue to standardize this between all files.

Copy link
Member Author

@rhshadrach rhshadrach Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funnily enough, they refer to cpython https://devguide.python.org/documentation/markup/#sections for this choice, but then the rst page that mentions this does not actually follow that ..
I think part of the ambiguous interpretation is whether that first ### double line is actually used as the first header on every page, or whether this is only used on index.rst pages with a "higher" (like would be the case for "User guide" on user_guide/index.rst).

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this!

1. ``object`` dtype NumPy array.
2. :class:`StringDtype` extension type.
1. :class:`StringDtype` extension type.
2. ``object`` dtype NumPy array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. ``object`` dtype NumPy array.
2. NumPy ``object`` dtype.

?

Wondering if we want to say "dtype" instead of array, since the line above is also using type and not array

s2
type(s2[0])
However there are four distinct :class:`StringDtype` variants that may be utilized.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think I would move this lower down the file, so it is not one of the first things a user reads when looking at how to use strings in pandas (not before string methods are shown for example).
Because the file is meant as a user guide, although it probably reads more as a reference guide ..

Now, that's a bigger comment in general on the structure of the file, so let's see this comment as a possible future improvement. The PR is already a good step forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and it was easy to move down to the bottom. I left a cross-reference in this section.

This is the same as ``dtype='str'`` *when PyArrow is installed*.

The implementation uses a PyArrow array, however NA values in this array
are stored using ``np.nan``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
are stored using ``np.nan``.
are represented as ``np.nan``.

I don't know if this change helps clarity, but strictly speaking the missing values are not stored as such, we only use np.nan as the sentinel when the user accesses the data, or in things like __repr__ etc ...

Copy link
Member Author

@rhshadrach rhshadrach Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is a good point. I added in and behave as.

1. Like ``dtype="object"``, :ref:`string accessor methods<api.series.str>`
that return **integer** output will return a NumPy array that is
either dtype int or float depending on the presence of NA values.
Methods returning **boolean** output will return a NumPy array this is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Methods returning **boolean** output will return a NumPy array this is
Methods returning **boolean** output will return a NumPy array that is

Comment on lines -90 to -91
These are places where the behavior of ``StringDtype`` objects differ from
``object`` dtype:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also still behaviour differences compared to object dtype. For example one of the items mentioned below about string methods returning bool with NA becoming False, that is different compared to object dtype.

Now that is maybe more something that is relevant for the migration guide? (assuming that readers of this page generally should be using str by default)

(and I also realize that this change in behaviour is not yet mentioned in https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about replacing this entire section with a link to the migration guide?

Copy link
Member

@jorisvandenbossche jorisvandenbossche Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the entire section should live in the migration guide. Because this now mostly explains differences between the NaN variant and NA variant. While the current migration guide focuses on going from object dtype to the NaN variant.

I would maybe put back the last version of the text you had, but move it lower in the page together with the The four :class:`StringDtype` variants you moved?

At the same time I will add something about the predicate methods difference between object dtype and str to the migration guide.

@rhshadrach
Copy link
Member Author

Thanks @jorisvandenbossche - I think this is ready for another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docs Strings String extension data type and string data

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DOC (string dtype): update user guide page "Working with text data"

3 participants