From 2019d1b16615e4fa97369c7abf2cb34435771dfb Mon Sep 17 00:00:00 2001
From: Toby Dylan Hocking
Date: Wed, 10 Feb 2016 08:45:09 -0500
Subject: [PATCH] DOC: extract/extractall clarifications

---
 doc/source/text.rst             | 22 ++++++++++-----------
 doc/source/whatsnew/v0.18.0.txt | 34 ++++++++++++++++-----------------
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/doc/source/text.rst b/doc/source/text.rst
index c8a878747a9b7..a0cc32ecea531 100644
--- a/doc/source/text.rst
+++ b/doc/source/text.rst
@@ -196,9 +196,9 @@ DataFrame with one column per group.
 Elements that do not match return a row filled with ``NaN``.
 Thus, a Series of messy strings can be "converted" into a
 like-indexed Series or DataFrame of cleaned-up or more useful strings, without
-necessitating ``get()`` to access tuples or ``re.match`` objects. The
-results dtype always is object, even if no match is found and the
-result only contains ``NaN``.
+necessitating ``get()`` to access tuples or ``re.match`` objects. The
+dtype of the result is always object, even if no match is found and
+the result only contains ``NaN``.
 
 Named groups like
 
@@ -275,15 +275,16 @@ Extract all matches in each subject (extractall)
 
 .. _text.extractall:
 
+.. versionadded:: 0.18.0
+
 Unlike ``extract`` (which returns only the first match),
 
 .. ipython:: python
 
    s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
    s
-   s.str.extract("[ab](?P<digit>\d)", expand=False)
-
-.. versionadded:: 0.18.0
+   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
+   s.str.extract(two_groups, expand=True)
 
 the ``extractall`` method returns every match. The result of
 ``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
@@ -292,7 +293,7 @@ indicates the order in the subject.
 
 .. ipython:: python
 
-   s.str.extractall("[ab](?P<digit>\d)")
+   s.str.extractall(two_groups)
 
 When each subject string in the Series has exactly one match,
 
@@ -300,14 +301,13 @@ When each subject string in the Series has exactly one match,
 
    s = pd.Series(['a3', 'b3', 'c2'])
    s
-   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
 
 then ``extractall(pat).xs(0, level='match')`` gives the same result as
 ``extract(pat)``.
 
 .. ipython:: python
 
-   extract_result = s.str.extract(two_groups, expand=False)
+   extract_result = s.str.extract(two_groups, expand=True)
    extract_result
    extractall_result = s.str.extractall(two_groups)
    extractall_result
@@ -315,7 +315,7 @@ then ``extractall(pat).xs(0, level='match')`` gives the same result as
 
 
 Testing for Strings that Match or Contain a Pattern
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------------------------------
 
 You can check whether elements contain a pattern:
 
@@ -355,7 +355,7 @@ Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
    s4.str.contains('A', na=False)
 
 Creating Indicator Variables
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+----------------------------
 
 You can extract dummy variables from string columns.
 For example if they are separated by a ``'|'``:
diff --git a/doc/source/whatsnew/v0.18.0.txt b/doc/source/whatsnew/v0.18.0.txt
index ec002fae3b4b9..554647bb015e1 100644
--- a/doc/source/whatsnew/v0.18.0.txt
+++ b/doc/source/whatsnew/v0.18.0.txt
@@ -157,50 +157,50 @@ Currently the default is ``expand=None`` which gives a ``FutureWarning`` and use
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=None)
 
-Extracting a regular expression with one group returns a ``DataFrame``
-with one column if ``expand=True``.
+Extracting a regular expression with one group returns a Series if
+``expand=False``.
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
 
-It returns a Series if ``expand=False``.
+It returns a ``DataFrame`` with one column if ``expand=True``.
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
 
 Calling on an ``Index`` with a regex with exactly one capture group
-returns a ``DataFrame`` with one column if ``expand=True``,
+returns an ``Index`` if ``expand=False``.
 
 .. ipython:: python
 
    s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
    s
-   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
-
-It returns an ``Index`` if ``expand=False``.
-
-.. ipython:: python
-
    s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
 
-Calling on an ``Index`` with a regex with more than one capture group
-returns a ``DataFrame`` if ``expand=True``.
+It returns a ``DataFrame`` with one column if ``expand=True``.
 
 .. ipython:: python
 
-   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
 
-It raises ``ValueError`` if ``expand=False``.
+Calling on an ``Index`` with a regex with more than one capture group
+raises ``ValueError`` if ``expand=False``.
 
 .. code-block:: python
 
     >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
     ValueError: only one regex group is supported with Index
 
+It returns a ``DataFrame`` if ``expand=True``.
+
+.. ipython:: python
+
+   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+
 In summary, ``extract(expand=True)`` always returns a ``DataFrame``
 with a row for every subject string, and a column for every capture
 group.
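
The behaviour the patch documents can be checked with a short standalone script. This is a minimal sketch and not part of the patch: it assumes a pandas release that provides ``extractall`` (0.18.0 or later), and the group names ``letter`` and ``digit`` are illustrative choices.

```python
import pandas as pd

# Subject strings; "a1a2" contains two matches, the others one each.
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

# Illustrative pattern with two named capture groups (names are assumptions).
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

# expand=True: one row per subject string, one column per capture group.
first_match = s.str.extract(two_groups, expand=True)

# extractall: one row per match, with a MultiIndex whose last level is "match".
all_matches = s.str.extractall(two_groups)

# Selecting match 0 recovers the first match for each subject string.
first_from_all = all_matches.xs(0, level="match")

# expand=False with a single capture group gives a Series, not a DataFrame.
digits = s.str.extract(r"[ab](\d)", expand=False)

print(first_match)
print(all_matches)
print(first_from_all.equals(first_match))
print(type(digits).__name__)
```

Passing ``expand`` explicitly, as the patch does throughout, keeps the return type independent of the default.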