REF/API: String methods refactor #29637

TomAugspurger · 2019-11-15T14:53:32Z

edit: probably closing this: See #29640 for a simpler alternative.

In preparation for #29597, this changes the string methods implemetation to

Use masks to apply the function just to valid values.
Disables all unnecessary inference on the result.

This works quite well for the new StringDtype, since we know the values are always string. The object-dtype behavior is a bit more complex. I'll note that inline below.

Additionally, it fixes #29624

edit: Oh, forgot another change. String methods returning numeric values will return a nullable integer type. So Series[string].str.count('a') returns an Int64Dtype, rather than a maybe int or maybe float (depending on the presence of NAs). Will document that.

* Refactor string ops to use masks, rather than applying the op to non-strings. * Disable unnecessary inference. * Always use boolean dtype for empty methods. (cherry picked from commit 454576a6c2b22177567c1ba6ec2b6c3fd2452b14)

TomAugspurger · 2019-11-15T14:53:40Z

cc @jorisvandenbossche

TomAugspurger · 2019-11-15T15:13:03Z

pandas/core/strings.py

+
+        if result_dtype == "int":
+            result_dtype = pandas_dtype("Int64")
+            # TODO: Avoid this object allocation


I think this optimization would be something like

result = np.zeros(dtype="int64") result[notna] = lib.map_infer(np.asarray(arr[notna]), f) return IntegerArray(result, mask)

which skips the object-dtype ndarray allocation.

TomAugspurger · 2019-11-15T15:16:26Z

pandas/core/strings.py

-            return x.get(i)
-        elif len(x) > i >= -len(x):
-            return x[i]
+        try:


Previously, this was handled inside _map by

calling the f on each element inside a big try/except

If the callable failed, we'd define a new callable that wrapped f inside a try / except, and mapped the new callable over the values.

That means something like Series(['a'] * 1000 + [None]).str.get(0) would call this f 2002 times! The first 1,000 were good, then we failed on 1,001. So we define a new wrapped callable and call that 1,001 more times.

Better to do the wrapping here.

jorisvandenbossche

Cool!

Did only a very quick skim, some questions:

did you check performance? More or less the same as before?
long term, we might want something like lib.map_infer but where we can pass a mask? (so we don't need to filter first, and assign to a subset)

jorisvandenbossche · 2019-11-15T15:17:11Z

pandas/core/strings.py

+        elif method in {"get", "join"}:
+            # we want anything that supports __getitem__ or .get
+            non_strings = ~np.asarray(
+                [isinstance(x, (Iterable, Sequence, Mapping)) for x in arr], dtype=bool


Do we need to do this check? (just trying to think if users could have had objects that would work with get but would not be following this check)

jorisvandenbossche · 2019-11-15T15:19:04Z

pandas/core/strings.py

+            result_dtype = pandas_dtype("Int64")
+            # TODO: Avoid this object allocation
+            #   Should be able to just use zero, and pass through
+            #   the mask to IntegerArray.


I have been thinking about this as well recently in context of BooleanArray, but we might want to add a mask keyword to the pd.array(..) function?
I think that would be useful in general, and would also here give a way to not go the object way.

jreback · 2019-11-15T15:21:59Z

long term, we might want something like lib.map_infer but where we can pass a mask? (so we don't need to filter first, and assign to a subset)

there already is a lib.map_infer_mask FYI

jorisvandenbossche · 2019-11-15T15:30:15Z

there already is a lib.map_infer_mask FYI

Which is what was being used before actually. Tom, what's the reason for no longer using it? Being able to pass the 'na' value might be useful (now it takes the original value from the array, which is not always what you want I suppose)

TomAugspurger · 2019-11-15T15:33:55Z

did you check performance? More or less the same as before?

checking now. Slowdown in things like str.get and .str.decode in a few cases.

Will look into re-using map_infer_mask. If I can pass through an na_result rather than the original, that may be enough.

TomAugspurger · 2019-11-15T16:05:20Z

I wonder if it makes sense to just apply this change to just StringDtype? That'd keep the diff a lot smaller at least, and is probably less like to introduce new bugs.

TomAugspurger · 2019-11-16T22:33:01Z

Closing in favor of #29640

REF/API: String methods refactor

25cb919

* Refactor string ops to use masks, rather than applying the op to non-strings. * Disable unnecessary inference. * Always use boolean dtype for empty methods. (cherry picked from commit 454576a6c2b22177567c1ba6ec2b6c3fd2452b14)

TomAugspurger added Refactor Internal refactoring of code Strings String extension data type and string data labels Nov 15, 2019

TomAugspurger commented Nov 15, 2019

View reviewed changes

jorisvandenbossche reviewed Nov 15, 2019

View reviewed changes

TomAugspurger mentioned this pull request Nov 15, 2019

Refactor string methods for StringArray + return IntegerArray for numeric results #29640

Merged

TomAugspurger closed this Nov 16, 2019

TomAugspurger deleted the masked-string branch November 16, 2019 22:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

REF/API: String methods refactor #29637

REF/API: String methods refactor #29637

Uh oh!

TomAugspurger commented Nov 15, 2019 •

edited

Loading

Uh oh!

TomAugspurger commented Nov 15, 2019

Uh oh!

TomAugspurger Nov 15, 2019

Uh oh!

TomAugspurger Nov 15, 2019

Uh oh!

jorisvandenbossche left a comment

Uh oh!

jorisvandenbossche Nov 15, 2019

Uh oh!

jorisvandenbossche Nov 15, 2019

Uh oh!

jreback commented Nov 15, 2019 •

edited

Loading

Uh oh!

jorisvandenbossche commented Nov 15, 2019

Uh oh!

TomAugspurger commented Nov 15, 2019

Uh oh!

TomAugspurger commented Nov 15, 2019

Uh oh!

TomAugspurger commented Nov 16, 2019

Uh oh!

Uh oh!

Uh oh!

REF/API: String methods refactor #29637

REF/API: String methods refactor #29637

Uh oh!

Conversation

TomAugspurger commented Nov 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomAugspurger commented Nov 15, 2019

Uh oh!

TomAugspurger Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

TomAugspurger Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Nov 15, 2019

Choose a reason for hiding this comment

Uh oh!

jreback commented Nov 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jorisvandenbossche commented Nov 15, 2019

Uh oh!

TomAugspurger commented Nov 15, 2019

Uh oh!

TomAugspurger commented Nov 15, 2019

Uh oh!

TomAugspurger commented Nov 16, 2019

Uh oh!

Uh oh!

TomAugspurger commented Nov 15, 2019 •

edited

Loading

jreback commented Nov 15, 2019 •

edited

Loading