Refactor string methods for StringArray + return IntegerArray for numeric results #29640
Maybe a copy-paste error, but the example doesn't show much of a performance boost :) (less than a percent)
This looks good to me! I suppose this is simpler than your other PR
Is this time-sensitive, or can I wait and look at it on Monday?
No rush on this. I did fail on the copy-paste. The speedup is about 15% IIRC.
def _na_map(f, arr, na_result=np.nan, dtype=object):
    # should really _check_ for NA
    if is_extension_array_dtype(arr.dtype):
can you sync the signatures up with _map, also rename _map -> _map_arraylike (or similar)
Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it func rather than f.
I guess na_map calls it na_result while _map calls it na_value.
I am really -1 on 2 different branches here. If they have exactly the same signature a little less negative. again I would rename _map to be more informative here
What do you mean with "-1 on 2 different branches here" ?
The whole purpose of this PR is to add a separate branch in case of StringArray (because we can be more efficient and want to be more specific in the result dtype)
That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this.
As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs. Additionally,
- We're still using map_infer_dtype for both, so the core implementation is the same.
- We'll eventually deprecate .str on object-dtype, so we will end up with just this implementation.
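To make the divergence described above concrete, here is a toy, pure-Python sketch. It is not the pandas implementation: `map_object_sketch` and `map_stringarray_sketch` are hypothetical names, plain lists stand in for arrays, and `None` stands in for NA. It only illustrates why one path has to inspect the data to pick a result type while the other does not.

```python
import math
from typing import Callable, List, Optional, Tuple


def map_object_sketch(
    func: Callable[[str], int], values: List[Optional[str]]
) -> Tuple[list, str]:
    """Toy model of the object-dtype path: the result kind depends on the data."""
    has_na = any(v is None for v in values)
    if has_na:
        # A plain integer result cannot hold a missing value, so everything
        # is upcast to float and NaN marks the missing entries.
        return [math.nan if v is None else float(func(v)) for v in values], "float"
    return [func(v) for v in values], "int"


def map_stringarray_sketch(
    func: Callable[[str], int], values: List[Optional[str]]
) -> Tuple[List[int], List[bool], str]:
    """Toy model of the StringArray path: data plus mask, result kind fixed."""
    mask = [v is None for v in values]
    # Masked slots get a placeholder; the mask, not the data, records the NA.
    data = [0 if m else func(v) for v, m in zip(values, mask)]
    return data, mask, "Int64"


print(map_object_sketch(len, ["a", "bcd"])[1])
print(map_object_sketch(len, ["a", None])[1])
print(map_stringarray_sketch(len, ["a", None]))
```

In the second function the returned kind is the same whether or not any entry is missing, which is the property the separate branch relies on.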
ok, at the very least these signatures for what you call _map and _stringarray_map should be exactly the same.
and _map -> _map_object and _stringarray_map -> _map_stringarray.
I think this is crucial for not adding technical debt.
@jreback is your suggestion to add na_mask to the _stringarray_map signature and have it just not be used? I think this relates to the "this should be a StringArray method" discussion
_map and _na_map already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.
What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs.
_map and _na_map already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.
And there is also a good reason for that, as _map has an additional argument na_mask that is used internally in _map (for a recursive call).
I think refactoring _map is outside of the scope of this PR.
can you create an issue to clean this up (_map and _na_map), and/or refactor of this, post this PR.
Just the clipboard failure: #29676
CI failure is from #29514
def _stringarray_map(
    func: Callable[[str], Any], arr: "StringArray", na_value: Any, dtype: Dtype
na_value restricted to scalar?
No, I don't think so.
na_value : Any
    The value to use for missing values. By default, this is
    the original value (NA).
dtype : Dtype
Elsewhere we say "np.dtype or ExtensionDtype". Is this the new policy?
Not sure if we have a policy. If it matters, this is an internal docstring, so I'm OK to use things from pandas._typing that we wouldn't have in a public docstring yet.
Just the clipboard CI failure again.
@jreback thoughts on the conversation around the signatures? I think this is slightly blocking the pd.NA PR now.
… On Nov 19, 2019, at 06:01, Joris Van den Bossche ***@***.***> wrote:
@jorisvandenbossche approved this pull request.
very minor comments, rebase and ping on green.
  and parts of the API may change without warning.

- The text extension type solves several issues with object-dtype NumPy arrays:
+ The ``'string'`` extension type solves several issues with object-dtype NumPy arrays:
does this need any updating and/or a reference to the new section that you added in text.rst?
@TomAugspurger you can probably update the sentence "The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype." 20 lines below this line, to include that it can also return IntegerDtype in certain cases.
pandas/core/strings.py
Outdated
    if is_extension_array_dtype(arr.dtype):
        # just StringDtype
        arr = extract_array(arr)
        return _stringarray_map(f, arr, na_value=na_result, dtype=dtype)
I would rename these as indicated above.
I would rename these as indicated above.
Can you be explicit in what exact names you want to propose? (I am getting a bit lost in all the comments)
Tom already renamed _ea_map to _stringarray_map
right we should change _map -> _object_map (though I actually prefer the opposite, _map_object and _map_stringarray), but either is ok.
Thanks, makes sense. I renamed to _map_object and _map_stringarray
Merged master, and writing up an issue for the followup.
#29710 for followup.
Thanks all!
Intended as an alternative to #29637.
This is a much smaller change. It only changes the codepath for StringDtype.
I think this is OK since someday (well down the road) we'll want to deprecate the .str accessor on object-dtype Series. When we enforce that, we can just delete the entire old implementation.
The API change is limited to always returning Int64Dtype for numeric outputs, rather than int if there's no NAs and float if there are any.
When BoolArray is done, we'll change that for the boolean-returning ones.
As a side benefit, we get a nice perf boost: since we have deterministic output dtypes, we can skip an object-dtype allocation.
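The API change described above can be checked with a short snippet. This is a minimal sketch, assuming a pandas version that ships the "string" extension dtype (1.0 or later); the variable names are only illustrative.

```python
import pandas as pd

# Object-dtype input: the result dtype depends on whether NAs are present
# (integer without NAs, float once a missing value appears).
obj_no_na = pd.Series(["a", "bcd"], dtype=object)
obj_with_na = pd.Series(["a", None], dtype=object)
print(obj_no_na.str.len().dtype)
print(obj_with_na.str.len().dtype)

# "string" dtype input: the numeric result is the nullable Int64 dtype
# in both cases, so the dtype no longer depends on the data.
str_no_na = pd.Series(["a", "bcd"], dtype="string")
str_with_na = pd.Series(["a", None], dtype="string")
print(str_no_na.str.len().dtype)
print(str_with_na.str.len().dtype)
```

The missing entry in the last result is kept as NA rather than forcing an upcast to float.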