Skip to content

REF/API: String methods refactor #29637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Nov 15, 2019

edit: probably closing this: See #29640 for a simpler alternative.

In preparation for #29597, this changes the string methods implemetation to

  • Use masks to apply the function just to valid values.
  • Disables all unnecessary inference on the result.

This works quite well for the new StringDtype, since we know the values are always string. The object-dtype behavior is a bit more complex. I'll note that inline below.

Additionally, it fixes #29624

edit: Oh, forgot another change. String methods returning numeric values will return a nullable integer type. So Series[string].str.count('a') returns an Int64Dtype, rather than a maybe int or maybe float (depending on the presence of NAs). Will document that.

* Refactor string ops to use masks, rather than applying
  the op to non-strings.
* Disable unnecessary inference.
* Always use boolean dtype for empty methods.

(cherry picked from commit 454576a6c2b22177567c1ba6ec2b6c3fd2452b14)
@TomAugspurger
Copy link
Contributor Author

cc @jorisvandenbossche

@TomAugspurger TomAugspurger added Refactor Internal refactoring of code Strings String extension data type and string data labels Nov 15, 2019

if result_dtype == "int":
result_dtype = pandas_dtype("Int64")
# TODO: Avoid this object allocation
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this optimization would be something like

result = np.zeros(dtype="int64")
result[notna] = lib.map_infer(np.asarray(arr[notna]), f)
return IntegerArray(result, mask)

which skips the object-dtype ndarray allocation.

return x.get(i)
elif len(x) > i >= -len(x):
return x[i]
try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, this was handled inside _map by

  1. calling the f on each element inside a big try/except
  2. If the callable failed, we'd define a new callable that wrapped f inside a try / except, and mapped the new callable over the values.

That means something like Series(['a'] * 1000 + [None]).str.get(0) would call this f 2002 times! The first 1,000 were good, then we failed on 1,001. So we define a new wrapped callable and call that 1,001 more times.

Better to do the wrapping here.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Did only a very quick skim, some questions:

  • did you check performance? More or less the same as before?
  • long term, we might want something like lib.map_infer but where we can pass a mask? (so we don't need to filter first, and assign to a subset)

elif method in {"get", "join"}:
# we want anything that supports __getitem__ or .get
non_strings = ~np.asarray(
[isinstance(x, (Iterable, Sequence, Mapping)) for x in arr], dtype=bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this check? (just trying to think if users could have had objects that would work with get but would not be following this check)

result_dtype = pandas_dtype("Int64")
# TODO: Avoid this object allocation
# Should be able to just use zero, and pass through
# the mask to IntegerArray.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking about this as well recently in context of BooleanArray, but we might want to add a mask keyword to the pd.array(..) function?
I think that would be useful in general, and would also here give a way to not go the object way.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2019

long term, we might want something like lib.map_infer but where we can pass a mask? (so we don't need to filter first, and assign to a subset)

there already is a lib.map_infer_mask FYI

@jorisvandenbossche
Copy link
Member

there already is a lib.map_infer_mask FYI

Which is what was being used before actually. Tom, what's the reason for no longer using it? Being able to pass the 'na' value might be useful (now it takes the original value from the array, which is not always what you want I suppose)

@TomAugspurger
Copy link
Contributor Author

did you check performance? More or less the same as before?

checking now. Slowdown in things like str.get and .str.decode in a few cases.

Will look into re-using map_infer_mask. If I can pass through an na_result rather than the original, that may be enough.

@TomAugspurger
Copy link
Contributor Author

I wonder if it makes sense to just apply this change to just StringDtype? That'd keep the diff a lot smaller at least, and is probably less like to introduce new bugs.

@TomAugspurger
Copy link
Contributor Author

Closing in favor of #29640

@TomAugspurger TomAugspurger deleted the masked-string branch November 16, 2019 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Internal refactoring of code Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Boolean string methods on empty Series return object dtype
3 participants