DOC: Update Working with text data for 3.0 #62581

rhshadrach · 2025-10-04T19:12:38Z

closes DOC (string dtype): update user guide page "Working with text data" #60348
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

One of the more subtle changes is replacing StringArray with StringDtype throughout. I did this primarily because it seems to me this should focus on what users specify the dtype= as, but also because whether you get a StringArray depends on Python vs PyArrow storage.
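As a rough illustration of the storage point (a sketch, not text from the PR), the snippet below assumes a pandas version in which `pd.StringDtype` accepts a `storage` argument. It requests Python storage explicitly so that PyArrow is not needed. The user only names a dtype; the array class behind the Series follows from the storage.

```python
import pandas as pd

# Request Python storage explicitly so the example runs without PyArrow.
dtype = pd.StringDtype(storage="python")
s = pd.Series(["a", "b", None], dtype=dtype)

# The dtype is what the user specified ...
is_string_dtype = isinstance(s.dtype, pd.StringDtype)

# ... while the backing array class depends on the storage. With Python
# storage it is a StringArray; with PyArrow storage it would be a
# different array class.
is_string_array = isinstance(s.array, pd.arrays.StringArray)

print(is_string_dtype, is_string_array, s.dtype.storage)
```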

jbrockmendel · 2025-10-16T17:43:05Z

LGTM cc @jorisvandenbossche

jorisvandenbossche · 2025-10-16T21:03:35Z

doc/source/user_guide/text.rst

+######################
 Working with text data
-======================
+######################


Not important at all for this PR, but was just wondering since you changed it: it seems we are not very consistent with the type of char we use for the top-level header, but I haven't actually seen # in other files (seems we mostly use * or =). We could maybe open an issue to standardize this between all files.

I've been going off of this: https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections

Funnily enough, they refer to cpython https://devguide.python.org/documentation/markup/#sections for this choice, but the rst page that mentions this does not actually follow it.
I think part of the ambiguity is whether that first ### double line is meant to be the first header on every page, or whether it is only used on index.rst pages at a "higher" level (as would be the case for "User guide" on user_guide/index.rst).
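To get a feel for how inconsistent the top-level header characters are before opening such an issue, a small survey script could help. The following is only a sketch using the standard library. The function names `first_heading_char` and `survey` are hypothetical, and the heading detection is a simplification of the reStructuredText rules (it looks for the first non-empty title line followed by an adornment line of at least the same length).

```python
from collections import Counter
from pathlib import Path

# Punctuation commonly used for rst section adornments (not exhaustive).
ADORNMENT_CHARS = set("=-~^*#+`:._'\"")


def adornment_char(line):
    """Return the repeated character if the line is an adornment, else None."""
    stripped = line.rstrip()
    if (
        len(stripped) >= 3
        and stripped[0] in ADORNMENT_CHARS
        and stripped == stripped[0] * len(stripped)
    ):
        return stripped[0]
    return None


def first_heading_char(text):
    """Return the underline character of the first section title, or None."""
    lines = text.splitlines()
    for i in range(1, len(lines)):
        title = lines[i - 1].strip()
        char = adornment_char(lines[i])
        # A title must be non-empty text, not itself an adornment line,
        # and the underline must be at least as long as the title.
        if (
            char
            and title
            and adornment_char(title) is None
            and len(lines[i].rstrip()) >= len(title)
        ):
            return char
    return None


def survey(root):
    """Count the first-heading character over all .rst files under root."""
    counts = Counter()
    for path in Path(root).rglob("*.rst"):
        counts[first_heading_char(path.read_text(encoding="utf-8"))] += 1
    return counts
```

Calling `survey("doc/source")` from a checkout would then give a count per character, with `None` collecting files where no heading was detected.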

jorisvandenbossche

Thanks a lot for working on this!

jorisvandenbossche · 2025-10-16T21:04:58Z

doc/source/user_guide/text.rst

-1. ``object`` dtype NumPy array.
-2. :class:`StringDtype` extension type.
+1. :class:`StringDtype` extension type.
+2. ``object`` dtype NumPy array.


Suggested change

-2. ``object`` dtype NumPy array.
+2. NumPy ``object`` dtype.

?

Wondering if we want to say "dtype" instead of array, since the line above is also using type and not array

jorisvandenbossche · 2025-10-16T21:08:47Z

doc/source/user_guide/text.rst

   s2
   type(s2[0])

+However there are four distinct :class:`StringDtype` variants that may be utilized.


Personally, I think I would move this lower down the file, so it is not one of the first things a user reads when looking at how to use strings in pandas (not before string methods are shown, for example). The file is meant as a user guide, although it probably reads more like a reference guide.

Now, that's a bigger comment in general on the structure of the file, so let's see this comment as a possible future improvement. The PR is already a good step forward.

Agreed, and it was easy to move down to the bottom. I left a cross-reference in this section.

jorisvandenbossche · 2025-10-16T21:11:12Z

doc/source/user_guide/text.rst

+   This is the same as ``dtype='str'`` *when PyArrow is installed*.
+
+The implementation uses a PyArrow array, however NA values in this array
+are stored using ``np.nan``.


Suggested change

-are stored using ``np.nan``.
+are represented as ``np.nan``.

I don't know if this change helps clarity, but strictly speaking the missing values are not stored as such. We only use np.nan as the sentinel when the user accesses the data, or in things like __repr__.

Yea, this is a good point. I added in "and behave as".

jorisvandenbossche · 2025-10-16T21:12:31Z

doc/source/user_guide/text.rst

+1. Like ``dtype="object"``, :ref:`string accessor methods<api.series.str>`
+   that return **integer** output will return a NumPy array that is
+   either dtype int or float depending on the presence of NA values.
+   Methods returning **boolean** output will return a NumPy array this is


Suggested change

-Methods returning **boolean** output will return a NumPy array this is
+Methods returning **boolean** output will return a NumPy array that is
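For readers following the hunk above, the integer-output rule can be checked with a short sketch. It uses `dtype="object"`, which the quoted text says behaves the same way, and contrasts it with the nullable `"string"` dtype (the NA variant). The helper name `str_len_dtype` is made up for this illustration.

```python
import pandas as pd


def str_len_dtype(values, dtype):
    """Return the dtype of ``Series.str.len()`` for the given input dtype."""
    return pd.Series(values, dtype=dtype).str.len().dtype


no_na = ["a", "bb", "ccc"]
with_na = ["a", "bb", None]

# object dtype: NumPy integer result when nothing is missing ...
print(str_len_dtype(no_na, "object"))
# ... and NumPy float result when a missing value forces NaN into the output.
print(str_len_dtype(with_na, "object"))
# NA variant: nullable integer result that keeps the missing entry as <NA>.
print(str_len_dtype(with_na, "string"))
```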

jorisvandenbossche · 2025-10-16T21:17:08Z

doc/source/user_guide/text.rst

-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype:


There are also still behaviour differences compared to object dtype. For example, one of the items mentioned below says that string methods returning bool turn NA into False, which differs from object dtype.

Now that is maybe more something that is relevant for the migration guide? (assuming that readers of this page generally should be using str by default)

(and I also realize that this change in behaviour is not yet mentioned in https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html)

What do you think about replacing this entire section with a link to the migration guide?

I don't think the entire section should live in the migration guide, because it now mostly explains differences between the NaN variant and the NA variant, while the current migration guide focuses on going from object dtype to the NaN variant.

I would maybe put back the last version of the text you had, but move it lower in the page together with the "The four :class:`StringDtype` variants" section you moved?

At the same time I will add something about the predicate methods difference between object dtype and str to the migration guide.

I restored this section and moved it down, just above the four StringDtype variants section.
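The object-dtype side of the predicate difference discussed in this thread can be sketched as follows. This only shows object dtype, where a missing entry stays missing unless `na=` is passed. Per the comment above, the `str` dtype is expected to give False for that entry by default, which this snippet does not exercise.

```python
import pandas as pd

s = pd.Series(["apple", "banana", None], dtype="object")

# Default for object dtype: the missing entry stays missing, so the
# result cannot be a plain boolean array and comes back as object dtype.
propagated = s.str.startswith("a")

# Passing na=False fills the missing entry, giving a plain bool result.
filled = s.str.startswith("a", na=False)

print(propagated.dtype, filled.dtype)
```

Passing `na=False` explicitly is one way to get the same boolean mask regardless of which dtype the Series ends up with.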

rhshadrach · 2025-10-17T22:29:31Z

Thanks @jorisvandenbossche - I think this is ready for another look.

DOC: Update Working with text data for 3.0

5cdfe9f

rhshadrach added the Docs and Strings (String extension data type and string data) labels Oct 4, 2025

Refinements

5b2afe7

rhshadrach marked this pull request as ready for review October 5, 2025 13:27

rhshadrach requested a review from jorisvandenbossche October 5, 2025 14:22

jorisvandenbossche reviewed Oct 16, 2025

jorisvandenbossche approved these changes Oct 16, 2025

Refinements

2938ec7

rhshadrach added 2 commits October 23, 2025 16:52

Merge branch 'main' of https://github.com/pandas-dev/pandas into doc_…

8e8b1ce

…text_data_strings

Restore changes section

6db4415
