Refactor string methods for StringArray + return IntegerArray for numeric results #29640
@@ -63,7 +63,7 @@ Previously, strings were typically stored in object-dtype NumPy arrays.

   ``StringDtype`` is currently considered experimental. The implementation
   and parts of the API may change without warning.

-The text extension type solves several issues with object-dtype NumPy arrays:
+The ``'string'`` extension type solves several issues with object-dtype NumPy arrays:
Review comments on this line:

- does this need any updating and/or reference to your new section that you added in text.rst?
- @TomAugspurger you can probably update the sentence "The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype." 20 lines below this line, to include that it can also return IntegerDtype in certain cases.

1. You can accidentally store a *mixture* of strings and non-strings in an
   ``object`` dtype array. A ``StringArray`` can only store strings.
@@ -88,9 +88,17 @@ You can use the alias ``"string"`` as well.

The usual string accessor methods work. Where appropriate, the return type
of the Series or columns of a DataFrame will also have string dtype.

.. ipython:: python

    s.str.upper()
    s.str.split('b', expand=True).dtypes

+String accessor methods returning integers will return a value with :class:`Int64Dtype`
+
+.. ipython:: python
+
+    s.str.count("a")
+
We recommend explicitly using the ``string`` data type when working with strings.
See :ref:`text.types` for more.
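For context on the documented behaviour, here is a minimal, hedged sketch of what the new section describes (assuming pandas 1.0+ with the experimental ``string`` dtype):

```python
import pandas as pd

# Series backed by the experimental StringDtype (alias "string")
s = pd.Series(["a", "b", None], dtype="string")

# String-returning accessor methods keep the string dtype ...
s.str.upper().dtype    # string

# ... while integer-returning methods now give nullable Int64 instead of
# object/float, so missing values stay <NA> rather than becoming NaN.
s.str.count("a").dtype  # Int64
```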
@@ -2,7 +2,7 @@
from functools import wraps
import re
import textwrap
-from typing import Dict, List
+from typing import TYPE_CHECKING, Any, Callable, Dict, List
import warnings

import numpy as np
@@ -15,10 +15,14 @@
    ensure_object,
    is_bool_dtype,
    is_categorical_dtype,
    is_extension_array_dtype,
    is_integer,
    is_integer_dtype,
    is_list_like,
    is_object_dtype,
    is_re,
    is_scalar,
    is_string_dtype,
)
from pandas.core.dtypes.generic import (
    ABCDataFrame,
@@ -28,9 +32,14 @@
)
from pandas.core.dtypes.missing import isna

from pandas._typing import ArrayLike, Dtype
from pandas.core.algorithms import take_1d
from pandas.core.base import NoNewAttributesMixin
import pandas.core.common as com
from pandas.core.construction import extract_array

if TYPE_CHECKING:
    from pandas.arrays import StringArray

_cpython_optimized_encoders = (
    "utf-8",
@@ -109,10 +118,79 @@ def cat_safe(list_of_columns: List, sep: str):

def _na_map(f, arr, na_result=np.nan, dtype=object):
-    # should really _check_ for NA
-    return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)
+    if is_extension_array_dtype(arr.dtype):
Review discussion on this block:

- can you sync the signatures up with _map, also rename _map -> _map_arraylike (or similar)
- Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it, I guess.
- I am really -1 on 2 different branches here. If they have exactly the same signature, a little less negative. Again, I would rename _map to be more informative here.
- What do you mean with "-1 on 2 different branches here"?
- That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this. As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs.
- ok, at the very least these signatures for what you call _map and _stringarray_map should be exactly the same. I think this is crucial for not adding technical debt.
- @jreback is your suggestion to add
- What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs.
- And there is also a good reason for that, as
- can you create an issue to clean this up (_map and _na_map), and/or refactor of this, post this PR.
+        # just StringDtype
+        arr = extract_array(arr)
+        return _map_stringarray(f, arr, na_value=na_result, dtype=dtype)
+    return _map_object(f, arr, na_mask=True, na_value=na_result, dtype=dtype)
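To illustrate the point raised in the review thread above (for object-dtype input the result dtype depends on whether NAs are present, while for ``StringArray`` it is known up front), a small, hedged example against pandas 1.0+:

```python
import numpy as np
import pandas as pd

obj = pd.Series(["a", "aa", np.nan])               # plain object dtype
st = pd.Series(["a", "aa", None], dtype="string")  # StringDtype

# Object-dtype path: the NaN forces an upcast, so the count comes back as float64.
obj.str.count("a").dtype  # float64

# StringArray path: the result dtype is fixed regardless of missing values.
st.str.count("a").dtype   # Int64
```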


def _map_stringarray(
    func: Callable[[str], Any], arr: "StringArray", na_value: Any, dtype: Dtype
Review comments on this line:

- na_value restricted to scalar?
- No, I don't think so.
) -> ArrayLike:
    """
    Map a callable over valid elements of a StringArray.

    Parameters
    ----------
    func : Callable[[str], Any]
        Apply to each valid element.
    arr : StringArray
    na_value : Any
        The value to use for missing values. By default, this is
        the original value (NA).
    dtype : Dtype
Review comments on this line:

- Elsewhere we say "np.dtype or ExtensionDtype". Is this the new policy?
- Not sure if we have a policy. If it matters, this is an internal docstring, so I'm OK to use things from
        The result dtype to use. Specifying this avoids an intermediate
        object-dtype allocation.

    Returns
    -------
    ArrayLike
        An ExtensionArray for integer or string dtypes, otherwise
        an ndarray.

    """
    from pandas.arrays import IntegerArray, StringArray

    mask = isna(arr)

    assert isinstance(arr, StringArray)
    arr = np.asarray(arr)

    if is_integer_dtype(dtype):

        na_value_is_na = isna(na_value)
        if na_value_is_na:
            na_value = 1
        result = lib.map_infer_mask(
            arr,
            func,
            mask.view("uint8"),
            convert=False,
            na_value=na_value,
            dtype=np.dtype("int64"),
        )

        if not na_value_is_na:
            mask[:] = False

        return IntegerArray(result, mask)
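The branch above masks out the missing positions before mapping, fills them with a placeholder (``1``), and hands the raw ``int64`` values plus the mask to ``IntegerArray``. A short sketch of that construction using only public API (values and mask chosen purely for illustration):

```python
import numpy as np
import pandas as pd

# int64 values with a placeholder where the original string was missing,
# plus a boolean mask marking those positions as NA.
values = np.array([1, 1, 2], dtype="int64")
mask = np.array([False, True, False])

arr = pd.arrays.IntegerArray(values, mask)
arr  # <IntegerArray> [1, <NA>, 2], dtype: Int64
```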

    elif is_string_dtype(dtype) and not is_object_dtype(dtype):
        # i.e. StringDtype
        result = lib.map_infer_mask(
            arr, func, mask.view("uint8"), convert=False, na_value=na_value
        )
        return StringArray(result)
    # TODO: BooleanArray
    else:
        # This is when the result type is object. We reach this when
        # -> We know the result type is truly object (e.g. .encode returns bytes
        #    or .findall returns a list).
        # -> We don't know the result type. E.g. `.get` can return anything.
        return lib.map_infer_mask(arr, func, mask.view("uint8"))

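The object fallback matters for accessor methods whose results are genuinely not strings or integers. A hedged example of the cases named in the comment (exact dtypes may differ slightly between pandas versions):

```python
import pandas as pd

s = pd.Series(["a", "aab", None], dtype="string")

s.str.encode("utf-8").dtype  # object -- elements are bytes
s.str.findall("a").dtype     # object -- elements are lists
s.str.get(0).dtype           # depends on the elements, so no dtype is assumed
```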
-def _map(f, arr, na_mask=False, na_value=np.nan, dtype=object):
+def _map_object(f, arr, na_mask=False, na_value=np.nan, dtype=object):
    if not len(arr):
        return np.ndarray(0, dtype=dtype)

@@ -143,7 +221,7 @@ def g(x):
                except (TypeError, AttributeError):
                    return na_value

-            return _map(g, arr, dtype=dtype)
+            return _map_object(g, arr, dtype=dtype)
        if na_value is not np.nan:
            np.putmask(result, mask, na_value)
            if result.dtype == object:
@@ -634,7 +712,7 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
            raise ValueError("Cannot use a callable replacement when regex=False")
        f = lambda x: x.replace(pat, repl, n)

-    return _na_map(f, arr)
+    return _na_map(f, arr, dtype=str)


def str_repeat(arr, repeats):
@@ -685,7 +763,7 @@ def scalar_rep(x):
            except TypeError:
                return str.__mul__(x, repeats)

-        return _na_map(scalar_rep, arr)
+        return _na_map(scalar_rep, arr, dtype=str)
    else:

        def rep(x, r):
@@ -1150,7 +1228,7 @@ def str_join(arr, sep):
    4           NaN
    dtype: object
    """
-    return _na_map(sep.join, arr)
+    return _na_map(sep.join, arr, dtype=str)


def str_findall(arr, pat, flags=0):
@@ -1381,7 +1459,7 @@ def str_pad(arr, width, side="left", fillchar=" "):
    else:  # pragma: no cover
        raise ValueError("Invalid side")

-    return _na_map(f, arr)
+    return _na_map(f, arr, dtype=str)


def str_split(arr, pat=None, n=None):
@@ -1487,7 +1565,7 @@ def str_slice(arr, start=None, stop=None, step=None):
    """
    obj = slice(start, stop, step)
    f = lambda x: x[obj]
-    return _na_map(f, arr)
+    return _na_map(f, arr, dtype=str)


def str_slice_replace(arr, start=None, stop=None, repl=None):
@@ -1578,7 +1656,7 @@ def f(x):
            y += x[local_stop:]
        return y

-    return _na_map(f, arr)
+    return _na_map(f, arr, dtype=str)


def str_strip(arr, to_strip=None, side="both"):
@@ -1603,7 +1681,7 @@ def str_strip(arr, to_strip=None, side="both"):
        f = lambda x: x.rstrip(to_strip)
    else:  # pragma: no cover
        raise ValueError("Invalid side")
-    return _na_map(f, arr)
+    return _na_map(f, arr, dtype=str)


def str_wrap(arr, width, **kwargs):
@@ -1667,7 +1745,7 @@ def str_wrap(arr, width, **kwargs):

    tw = textwrap.TextWrapper(**kwargs)

-    return _na_map(lambda s: "\n".join(tw.wrap(s)), arr)
+    return _na_map(lambda s: "\n".join(tw.wrap(s)), arr, dtype=str)


def str_translate(arr, table):
@@ -1687,7 +1765,7 @@ def str_translate(arr, table):
    -------
    Series or Index
    """
-    return _na_map(lambda x: x.translate(table), arr)
+    return _na_map(lambda x: x.translate(table), arr, dtype=str)


def str_get(arr, i):
@@ -3025,7 +3103,7 @@ def normalize(self, form):
        import unicodedata

        f = lambda x: unicodedata.normalize(form, x)
-        result = _na_map(f, self._parent)
+        result = _na_map(f, self._parent, dtype=str)
        return self._wrap_result(result)

    _shared_docs[
@@ -3223,31 +3301,37 @@ def rindex(self, sub, start=0, end=None):
        lambda x: x.lower(),
        name="lower",
        docstring=_shared_docs["casemethods"] % _doc_args["lower"],
+        dtype=str,
    )
    upper = _noarg_wrapper(
        lambda x: x.upper(),
        name="upper",
        docstring=_shared_docs["casemethods"] % _doc_args["upper"],
+        dtype=str,
    )
    title = _noarg_wrapper(
        lambda x: x.title(),
        name="title",
        docstring=_shared_docs["casemethods"] % _doc_args["title"],
+        dtype=str,
    )
    capitalize = _noarg_wrapper(
        lambda x: x.capitalize(),
        name="capitalize",
        docstring=_shared_docs["casemethods"] % _doc_args["capitalize"],
+        dtype=str,
    )
    swapcase = _noarg_wrapper(
        lambda x: x.swapcase(),
        name="swapcase",
        docstring=_shared_docs["casemethods"] % _doc_args["swapcase"],
+        dtype=str,
    )
    casefold = _noarg_wrapper(
        lambda x: x.casefold(),
        name="casefold",
        docstring=_shared_docs["casemethods"] % _doc_args["casefold"],
+        dtype=str,
    )

    _shared_docs[
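Passing ``dtype=str`` through these wrappers means the case methods keep the nullable string dtype end to end. A quick, hedged check of the user-facing effect (pandas 1.0+):

```python
import pandas as pd

s = pd.Series(["abc", "ABC", None], dtype="string")

s.str.lower().dtype     # string
s.str.swapcase().dtype  # string
s.str.casefold().dtype  # string
```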