ENH: str.extractall for several matches #11386

Closed
1 change: 1 addition & 0 deletions doc/source/api.rst
@@ -526,6 +526,7 @@ strings and apply several methods to it. These can be accessed like
Series.str.encode
Series.str.endswith
Series.str.extract
Series.str.extractall
Series.str.find
Series.str.findall
Series.str.get
144 changes: 127 additions & 17 deletions doc/source/text.rst
@@ -168,28 +168,37 @@ Extracting Substrings

.. _text.extract:

The method ``extract`` (introduced in version 0.13) accepts `regular expressions
<https://docs.python.org/2/library/re.html>`__ with match groups. Extracting a
regular expression with one group returns a Series of strings.
Extract first match in each subject (extract)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. ipython:: python
.. versionadded:: 0.13.0

.. warning::

In version 0.18.0, ``extract`` gained the ``expand`` argument. When
``expand=False`` it returns a ``Series``, ``Index``, or
``DataFrame``, depending on the subject and regular expression
pattern (same behavior as pre-0.18.0). When ``expand=True`` it
always returns a ``DataFrame``, which is more consistent and less
confusing from the perspective of a user.

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
The ``extract`` method accepts a `regular expression
<https://docs.python.org/2/library/re.html>`__ with at least one
capture group.

Elements that do not match return ``NaN``. Extracting a regular expression
with more than one group returns a DataFrame with one column per group.
Extracting a regular expression with more than one group returns a
DataFrame with one column per group.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')

Elements that do not match return a row filled with ``NaN``.
Thus, a Series of messy strings can be "converted" into a
like-indexed Series or DataFrame of cleaned-up or more useful strings,
without necessitating ``get()`` to access tuples or ``re.match`` objects.

The results dtype always is object, even if no match is found and the result
only contains ``NaN``.
Elements that do not match return a row filled with ``NaN``. Thus, a
Series of messy strings can be "converted" into a like-indexed Series
or DataFrame of cleaned-up or more useful strings, without
necessitating ``get()`` to access tuples or ``re.match`` objects. The
result's dtype is always object, even if no match is found and the
result contains only ``NaN``.
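
A minimal sketch of the non-matching case described above, assuming a pandas version in which ``extract`` accepts ``expand=True``. The raw-string pattern is an editorial choice to avoid escape warnings and is not part of the diff.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# Two capture groups give two columns. "c3" does not match [ab],
# so its row is filled with NaN while the index still lines up with s.
result = s.str.extract(r"([ab])(\d)", expand=True)
print(result)
```

The result keeps one row per input element, which is what makes it "like-indexed" with the original Series.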

Named groups like

@@ -201,9 +210,109 @@ and optional groups like

.. ipython:: python

pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
Review comment (Contributor):

make extract and extractall sub-sections (I think you might have to use ^^^^) as the sub-headings

Reply (Contributor Author):

OK


can also be used. Note that any capture group names in the regular
expression will be used for column names; otherwise capture group
numbers will be used.

Extracting a regular expression with one group returns a ``DataFrame``
with one column if ``expand=True``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)

It returns a Series if ``expand=False``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)

Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``,

.. ipython:: python

s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
s
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

It returns an ``Index`` if ``expand=False``.

.. ipython:: python

s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)

Calling on an ``Index`` with a regex with more than one capture group
returns a ``DataFrame`` if ``expand=True``.

.. ipython:: python

s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)

It raises ``ValueError`` if ``expand=False``.

.. code-block:: python

>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index

The table below summarizes the behavior of ``extract(expand=False)``
(input subject in first column, number of groups in regex in
first row)

+--------+---------+------------+
|        | 1 group | >1 group   |
+--------+---------+------------+
| Index  | Index   | ValueError |
+--------+---------+------------+
| Series | Series  | DataFrame  |
+--------+---------+------------+

Review comment (Contributor), on the ``Index`` row:

we are deferring expand=True on an Index -> MultiIndex for later, yes?

Reply (Contributor Author):

yes
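
The ``expand=False`` summary table can be checked with a short sketch like the one below. The helper ``kind`` and the ``digit`` group name are hypothetical, introduced only for illustration, and the behavior shown assumes a pandas version that still raises ``ValueError`` for a multi-group pattern on an ``Index``.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"])
one_group = r"(?P<letter>[a-zA-Z])"
two_groups = r"(?P<letter>[a-zA-Z])(?P<digit>[0-9]+)"


def kind(subject, pat):
    """Return the result's type name, or the exception name if extract raises."""
    try:
        return type(subject.str.extract(pat, expand=False)).__name__
    except ValueError as err:
        return type(err).__name__


# One entry per cell of the summary table.
summary = {
    ("Index", "1 group"): kind(s.index, one_group),
    ("Index", ">1 group"): kind(s.index, two_groups),
    ("Series", "1 group"): kind(s, one_group),
    ("Series", ">1 group"): kind(s, two_groups),
}
print(summary)
```

With ``expand=True`` the same four calls would all be expected to return a ``DataFrame``.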

Extract all matches in each subject (extractall)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _text.extractall:

Unlike ``extract`` (which returns only the first match),

.. ipython:: python
Review comment (Contributor):

There is also a Method Summary section, pls add .extractall there as well

Reply (Contributor Author):
OK


s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
s
s.str.extract("[ab](?P<digit>\d)")

.. versionadded:: 0.18.0

the ``extractall`` method returns every match. The result of
``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
rows. The last level of the ``MultiIndex`` is named ``match`` and
indicates the order in the subject.

.. ipython:: python

s.str.extractall("[ab](?P<digit>\d)")
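
As a rough illustration of the ``MultiIndex`` described above, the sketch below inspects the row index of an ``extractall`` result. The variable names are illustrative only, and the raw-string pattern is an editorial choice.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
all_matches = s.str.extractall(r"[ab](?P<digit>\d)")

# The first level comes from the subject's index; the last level is named
# "match" and numbers the matches within each subject, starting at 0.
index_names = list(all_matches.index.names)
rows = list(all_matches.index)
digits = all_matches["digit"].tolist()
print(all_matches)
```

Subject ``"C"`` has no match for ``[ab]``, so it contributes no rows at all, unlike ``extract``, which would return a ``NaN`` row for it.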

When each subject string in the Series has exactly one match,

.. ipython:: python

s = pd.Series(['a3', 'b3', 'c2'])
s
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

then ``extractall(pat).xs(0, level='match')`` gives the same result as
``extract(pat)``.

.. ipython:: python

extract_result = s.str.extract(two_groups)
extract_result
extractall_result = s.str.extractall(two_groups)
extractall_result
extractall_result.xs(0, level="match")
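
The equivalence stated above can be checked programmatically with a sketch such as the following. It assumes every subject has exactly one match and passes ``expand=True`` explicitly so that ``extract`` returns a ``DataFrame``; the comparison by values and column labels is a deliberately loose check chosen for illustration.

```python
import pandas as pd

s = pd.Series(["a3", "b3", "c2"])
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

first = s.str.extract(two_groups, expand=True)
# Selecting match number 0 drops the "match" level, leaving the original index.
via_all = s.str.extractall(two_groups).xs(0, level="match")

same_columns = list(first.columns) == list(via_all.columns)
same_values = first.values.tolist() == via_all.values.tolist()
print(same_columns, same_values)
```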

can also be used.

Testing for Strings that Match or Contain a Pattern
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +397,8 @@ Method Summary
:meth:`~Series.str.endswith`,Equivalent to ``str.endswith(pat)`` for each element
:meth:`~Series.str.findall`,Compute list of all occurrences of pattern/regex for each string
:meth:`~Series.str.match`,"Call ``re.match`` on each element, returning matched groups as list"
:meth:`~Series.str.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
:meth:`~Series.str.extract`,"Call ``re.search`` on each element, returning DataFrame with one row for each element and one column for each regex capture group"
:meth:`~Series.str.extractall`,"Call ``re.findall`` on each element, returning DataFrame with one row for each match and one column for each regex capture group"
:meth:`~Series.str.len`,Compute string lengths
:meth:`~Series.str.strip`,Equivalent to ``str.strip``
:meth:`~Series.str.rstrip`,Equivalent to ``str.rstrip``
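
The updated summary row for ``extract`` refers to ``re.search`` rather than ``re.match``. A small sketch of what that implies, assuming ``extract`` takes the first match found anywhere in each string (the sample strings here are made up for illustration):

```python
import re

import pandas as pd

pat = r"([ab])(\d)"
s = pd.Series(["xa1", "b2b3", "c3"])

extracted = s.str.extract(pat, expand=True)

# Reference result built directly with re.search: first match anywhere
# in the string, or a row of None when nothing matches.
expected = []
for text in s:
    m = re.search(pat, text)
    expected.append(list(m.groups()) if m else [None, None])
print(extracted)
```

``"xa1"`` would not match under ``re.match`` because the pattern does not match at position 0, but it does under ``re.search``; ``"b2b3"`` yields only its first match.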
86 changes: 86 additions & 0 deletions doc/source/whatsnew/v0.18.0.txt
@@ -136,6 +136,92 @@ New Behavior:
s.index
s.index.nbytes

.. _whatsnew_0180.enhancements.extract:

Changes to str.extract
^^^^^^^^^^^^^^^^^^^^^^

The :ref:`.str.extract <text.extract>` method takes a regular
expression with capture groups, finds the first match in each subject
string, and returns the contents of the capture groups
(:issue:`11386`). In v0.18.0, the ``expand`` argument was added to
``extract``. When ``expand=False`` it returns a ``Series``, ``Index``,
or ``DataFrame``, depending on the subject and regular expression
pattern (same behavior as pre-0.18.0). When ``expand=True`` it always
returns a ``DataFrame``, which is more consistent and less confusing
from the perspective of a user. Currently the default is
``expand=None`` which gives a ``FutureWarning`` and uses
``expand=False``. To avoid this warning, please explicitly specify
``expand``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')

Extracting a regular expression with one group returns a ``DataFrame``
with one column if ``expand=True``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)

It returns a Series if ``expand=False``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)

Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``,

.. ipython:: python

s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
s
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

It returns an ``Index`` if ``expand=False``.

.. ipython:: python

s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)

Calling on an ``Index`` with a regex with more than one capture group
returns a ``DataFrame`` if ``expand=True``.

.. ipython:: python

s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)

It raises ``ValueError`` if ``expand=False``.

.. code-block:: python

>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index

In summary, ``extract(expand=True)`` always returns a ``DataFrame``
with a row for every subject string, and a column for every capture
group.

.. _whatsnew_0180.enhancements.extractall:

The :ref:`.str.extractall <text.extractall>` method was added
(:issue:`11386`). Unlike ``extract`` (which returns only the first
match),

.. ipython:: python

s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
s
s.str.extract("(?P<letter>[ab])(?P<digit>\d)")

the ``extractall`` method returns all matches.

.. ipython:: python

s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")

.. _whatsnew_0180.enhancements.rounding:

Datetimelike rounding