REF/API: String methods refactor #29637
* Refactor string ops to use masks, rather than applying the op to non-strings.
* Disable unnecessary inference.
* Always use boolean dtype for empty methods.

(cherry picked from commit 454576a6c2b22177567c1ba6ec2b6c3fd2452b14)
    if result_dtype == "int":
        result_dtype = pandas_dtype("Int64")
        # TODO: Avoid this object allocation
I think this optimization would be something like

    result = np.zeros(len(arr), dtype="int64")
    result[notna] = lib.map_infer(np.asarray(arr[notna]), f)
    return IntegerArray(result, mask)

which skips the object-dtype ndarray allocation.
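For readers who want to try the idea outside the pandas internals, here is a minimal sketch using only public API. It is not the code from this PR: `masked_int_map` is a hypothetical helper, and a plain list comprehension stands in for the private `lib.map_infer`.

```python
import numpy as np
import pandas as pd


def masked_int_map(values, f):
    """Apply f to the non-missing elements only (hypothetical helper)."""
    arr = np.asarray(values, dtype=object)
    mask = pd.isna(arr)                        # True where the value is missing
    result = np.zeros(len(arr), dtype="int64")  # int64 buffer, no object result array
    # Stand-in for lib.map_infer: call f only on the valid elements.
    result[~mask] = [f(x) for x in arr[~mask]]
    # IntegerArray takes the raw int64 values plus a boolean mask (True = NA).
    return pd.arrays.IntegerArray(result, mask)


lengths = masked_int_map(["a", "bb", None, "cccc"], len)
```

The zeros under the mask are never exposed, so `f` never sees a missing value and the result is already a nullable integer array.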
            return x.get(i)
        elif len(x) > i >= -len(x):
            return x[i]
        try:
Previously, this was handled inside _map by

- calling f on each element inside a big try/except
- if the callable failed, defining a new callable that wrapped f inside a try/except, and mapping the new callable over the values.

That means something like Series(['a'] * 1000 + [None]).str.get(0) would call this f 2,002 times! The first 1,000 calls were good, then we failed on call 1,001. So we define a new wrapped callable and call that 1,001 more times.

Better to do the wrapping here.
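The call counts above can be reproduced with a small pure-Python sketch. This is an illustration of the two strategies, not the actual _map code: `map_with_retry`, `map_wrapped_once`, and `counting` are hypothetical names, and plain lists stand in for the Series.

```python
def get_first(x):
    # Raises TypeError for None, like indexing a missing value would.
    return x[0]


def counting(f):
    """Wrap f so that every call is counted."""
    calls = {"n": 0}

    def wrapper(x):
        calls["n"] += 1
        return f(x)

    return wrapper, calls


def make_safe(f):
    """Return a version of f that maps failures to None."""
    def safe(x):
        try:
            return f(x)
        except (TypeError, AttributeError):
            return None

    return safe


def map_with_retry(f, values):
    """Old strategy: try f on everything, start over with a wrapper on failure."""
    try:
        return [f(x) for x in values]
    except (TypeError, AttributeError):
        return [make_safe(f)(x) for x in values]


def map_wrapped_once(f, values):
    """New strategy: wrap f up front and make a single pass."""
    safe = make_safe(f)
    return [safe(x) for x in values]


values = ["a"] * 1000 + [None]

f_old, old_calls = counting(get_first)
old_result = map_with_retry(f_old, values)

f_new, new_calls = counting(get_first)
new_result = map_wrapped_once(f_new, values)
```

Both strategies return the same values; only the number of calls to the underlying function differs (2,002 versus 1,001 for this input).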
Cool!
Did only a very quick skim, some questions:
- did you check performance? More or less the same as before?
- long term, we might want something like lib.map_infer but where we can pass a mask? (so we don't need to filter first, and assign to a subset)
    elif method in {"get", "join"}:
        # we want anything that supports __getitem__ or .get
        non_strings = ~np.asarray(
            [isinstance(x, (Iterable, Sequence, Mapping)) for x in arr], dtype=bool
Do we need to do this check? (Just trying to think whether users could have had objects that work with get but would not pass this check.)
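To make the question concrete, here is a small sketch of one kind of object that supports indexing but would not pass the isinstance check. `OnlyGetitem` and `passes_check` are hypothetical names for illustration; the check itself mirrors the tuple of ABCs used in the diff.

```python
from collections.abc import Iterable, Mapping, Sequence


class OnlyGetitem:
    """Supports len() and indexing, but defines no __iter__ and is not
    registered as a Sequence or Mapping (hypothetical user class)."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[i]


def passes_check(x):
    # Same ABC tuple as the check in the diff.
    return isinstance(x, (Iterable, Sequence, Mapping))


obj = OnlyGetitem("abc")
first = obj[0]                # indexing works fine
accepted = passes_check(obj)  # but the ABC check rejects the object
```

With a check like this, such an object would be treated as a non-string and masked out, even though `x[i]` works on it.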
    result_dtype = pandas_dtype("Int64")
    # TODO: Avoid this object allocation
    # Should be able to just use zero, and pass through
    # the mask to IntegerArray.
I have been thinking about this as well recently in the context of BooleanArray, but we might want to add a mask keyword to the pd.array(..) function?
I think that would be useful in general, and would also give us a way here to avoid going through object dtype.
There already is a lib.map_infer_mask, FYI.
Which is what was being used before actually. Tom, what's the reason for no longer using it? Being able to pass the 'na' value might be useful (now it takes the original value from the array, which is not always what you want I suppose)
checking now. Slowdown in things like Will look into re-using
I wonder if it makes sense to apply this change only to StringDtype? That'd keep the diff a lot smaller at least, and is probably less likely to introduce new bugs.
Closing in favor of #29640
edit: probably closing this: See #29640 for a simpler alternative.
In preparation for #29597, this changes the string methods implementation to
This works quite well for the new StringDtype, since we know the values are always strings. The object-dtype behavior is a bit more complex. I'll note that inline below.
Additionally, it fixes #29624
edit: Oh, forgot another change. String methods returning numeric values will return a nullable integer type. So Series[string].str.count('a') returns an Int64Dtype, rather than a maybe int or maybe float (depending on the presence of NAs). Will document that.
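A short sketch of what that looks like from the user side, assuming a pandas version where StringDtype is available (1.0 or later); the sample data is made up for illustration.

```python
import pandas as pd

# A Series with the nullable string dtype, including one missing value.
s = pd.Series(["banana", None, "cherry"], dtype="string")

# Count occurrences of "a" in each element.
counts = s.str.count("a")
```

Here `counts` has the nullable Int64 dtype, with the missing entry kept as NA instead of forcing the whole result to float.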