
REF: cython cleanups and optimizations #23382


Closed
wants to merge 13 commits

Conversation

jbrockmendel
Member

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@codecov

codecov bot commented Oct 27, 2018

Codecov Report

Merging #23382 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23382   +/-   ##
=======================================
  Coverage   92.22%   92.22%           
=======================================
  Files         161      161           
  Lines       51187    51187           
=======================================
  Hits        47209    47209           
  Misses       3978     3978
Flag Coverage Δ
#multiple 90.61% <ø> (ø) ⬆️
#single 42.26% <ø> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f71755...31a1319. Read the comment docs.

@jbrockmendel
Member Author

Mostly changing ndarray to memoryviews where possible, plus a few notation cleanups around pointer types. Also adds some extra type annotations, using py3 syntax where feasible.

@gfyoung gfyoung added Internals Related to non-user accessible pandas implementation Clean Performance Memory or execution speed performance labels Oct 27, 2018
@gfyoung
Member

gfyoung commented Oct 27, 2018

@jbrockmendel : How is performance impacted, if at all, after these changes?

@jbrockmendel
Member Author

How is performance impacted, if at all, after these changes?

Small but positive. For the changes from ndarray[dtype] to dtype[:], the impact is small enough that I can't measure it, but I trust the cython folks when they say that memoryview usage is more performant.

Adding @cython.wraparound(False) and @cython.boundscheck(False) is a clear improvement. Easiest way to see this is in the output of cython -a. A decent chunk of python-land calls are avoided.
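To make the wraparound point concrete, here is a minimal plain-Python illustration (not the pandas code) of the negative-index translation that @cython.wraparound(False) turns off:

```python
# Python-style indexing wraps negative indices around to the end of the
# sequence; with @cython.wraparound(False), Cython skips emitting this
# translation (and @cython.boundscheck(False) skips the bounds check),
# so indexing can compile down to a bare C array access.
arr = [10, 20, 30]
assert arr[-1] == arr[len(arr) - 1] == 30
```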

Changing ndarray[algos_t] to ndarray[algos_t, ndim=1] is another one where I can't measure it, but it should be an improvement according to cython folks.

Changing out.fill(1-flag_val) to out[:] = 1 - flag_val (prompted by change to memoryview) replaces a python call with a C call.
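As a rough NumPy-level sketch (not the actual pandas code): both spellings set every element, but .fill is an ndarray method, while slice assignment also works on a typed memoryview and compiles to a C-level loop:

```python
import numpy as np

flag_val = 0
out_a = np.empty(5, dtype=np.uint8)
out_a.fill(1 - flag_val)   # ndarray-only method: a Python-level call from Cython
out_b = np.empty(5, dtype=np.uint8)
out_b[:] = 1 - flag_val    # slice assignment: valid for ndarray and memoryview
assert (out_a == out_b).all()
```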

Removal of group_cummax_float64 etc doesn't make a difference. in core.groupby there is a lookup for getattr(libgroupby, "group_cummax") that now succeeds, so it never gets around to doing the dtype-specific lookup.

As for the type annotations added to non-cdef functions in _libs.lib, I'm not sure whether cython actually uses those, so they might just be for funsies.

Replacing util.get_value_at(arr, i) with arr[i] in _libs.lib should be a clear-but-tiny win, since all the former does is some not-necessary-in-these-cases validation before returning arr[i].
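A rough sketch of the point being made (the validation shown is a hypothetical stand-in, not util.get_value_at's actual body):

```python
import numpy as np

def get_value_at(arr, i):
    # hypothetical stand-in: validate the index, then return arr[i]
    if not -len(arr) <= i < len(arr):
        raise IndexError(i)
    return arr[i]

arr = np.array([10, 20, 30])
# When i is already known to be in bounds, the validation is redundant
# and plain indexing does the same work with one less Python call.
assert get_value_at(arr, 1) == arr[1] == 20
```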

So tiny-but-positive all around.

@gfyoung
Member

gfyoung commented Oct 27, 2018

@jbrockmendel : Cool, just wanted to make sure we had the rationale documented 👍

@jreback jreback added this to the 0.24.0 milestone Oct 28, 2018
return modes[:j + 1]
# Note: For reasons unknown, slicing modes.base works but modes[:j+1].base
# returns an object with an incorrect length
return modes.base[:j + 1] # `.base` to access underlying np.ndarray
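The "incorrect length" behavior has a plain-NumPy analogue worth noting: the .base of a sliced array is the whole original buffer, not the slice:

```python
import numpy as np

modes = np.arange(10)
j = 3
sliced = modes[:j + 1]
assert len(sliced) == 4
# .base points at the full underlying buffer, ignoring the slice bounds:
assert len(sliced.base) == 10
# np.asarray preserves the view's length:
assert len(np.asarray(sliced)) == 4
```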
Contributor

you need np.asarray i think

Contributor

same comment

@@ -284,7 +284,7 @@ def dicts_to_array(list dicts, list columns):
else:
result[i, j] = onan

return result
return result.base # `.base` to access underlying np.ndarray
Contributor

np.asarray

for i in range(n):
idx = indexer[i]
if idx != -1:
rev_indexer[idx] = i

return rev_indexer
return rev_indexer.base # `.base` to access underlying np.ndarray
Contributor

same

Contributor

@jreback jreback left a comment

see comments

@@ -525,11 +534,12 @@ def astype_unicode(arr: ndarray,

result[i] = arr_i

return result
return result.base # `.base` to access underlying np.ndarray
Contributor

same

Seen seen = Seen()
object val
float64_t fval, fnan

if objects is None:
# Without explicitly raising, groupby.ops _aggregate_series_pure_python
# can pass None and incorrectly raise an AttributeError when trying
Contributor

what fails for this? it is very odd you needed to change this

Member Author

At the moment the call to _aggregate_series_pure_python raises because it expects an ndarray and gets None. But if we make it expect a memoryview then None is technically allowed. So this reinstates the raising explicitly.
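A minimal sketch of that situation (function name and body are hypothetical): as described above, once the argument type changed, None could slip through, so the rejection is reinstated explicitly:

```python
import numpy as np

def aggregate(objects):
    # Reinstating the explicit rejection of None described above,
    # rather than relying on the argument's declared type to raise:
    if objects is None:
        raise TypeError("object array expected, got None")
    return np.asarray(objects).sum()

try:
    aggregate(None)
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
assert aggregate(np.array([1, 2, 3])) == 6
```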

@@ -2036,7 +2053,7 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
if seen.datetimetz_:
if len({getattr(val, 'tzinfo', None) for val in objects}) == 1:
from pandas import DatetimeIndex
return DatetimeIndex(objects)
return DatetimeIndex(objects.base)
Contributor

why accessing base here?

Member Author

Because DatetimeIndex.__new__ doesn't know how to handle a cython memoryview object

Contributor

then use np.asarray, that's the standard

Contributor

@jreback jreback left a comment

let's avoid using .base anywhere. i believe we had this discussion once before. np.asarray is much more idiomatic and readable; its not a perf issue as these are the last call in a routine.
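The idiom under discussion, sketched with a plain Python memoryview standing in for Cython's typed memoryview (names are illustrative):

```python
import numpy as np

def compute(arr):
    view = memoryview(arr)   # stand-in for the Cython typed memoryview
    return np.asarray(view)  # re-wraps the same buffer as an ndarray

a = np.arange(3)
res = compute(a)
assert isinstance(res, np.ndarray)
assert res.tolist() == [0, 1, 2]
assert np.shares_memory(res, a)  # no copy, so no perf cost at the return
```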

@jbrockmendel
Member Author

Reverted the change of foo.base to np.asarray(foo): it risks incorrectly returning np.array(None) instead of raising.
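The hazard is easy to demonstrate: np.asarray silently wraps None in a 0-d object array rather than raising:

```python
import numpy as np

wrapped = np.asarray(None)
assert wrapped.dtype == object
assert wrapped.shape == ()
assert wrapped.item() is None  # the None slips through instead of raising
```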

Contributor

@jreback jreback left a comment

same comments as before

int64_t idx, curr_fill_idx=-1, filled_vals=0

N = len(out)

# Make sure all arrays are the same size
assert N == len(labels) == len(mask)

sorted_labels = np.argsort(labels, kind='mergesort').astype(
np.int64, copy=False)
sorted_labels = np.argsort(labels, kind='mergesort').astype(np.int64,
Contributor

i prefer the former

Member Author

will revert.

@@ -837,8 +823,3 @@ def group_cummax(ndarray[groupby_t, ndim=2] out,
if val > mval:
accum[lab, j] = mval = val
out[i, j] = mval


group_cummax_float64 = group_cummax["float64_t"]
Contributor

are these just relics?

Member Author

groupby.ops (I think) looks these up by doing something like:

funcname = "group_cummax"
func = getattr(libgroupby, funcname, None)
if func is None:
    func = getattr(libgroupby, funcname+"_"+dtype)

Before we used the fused types, it would succeed on the second getattr. Now it succeeds on the first.
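That lookup pattern can be sketched with a dummy namespace (the class below is a hypothetical stand-in for the libgroupby module):

```python
class FakeLibgroupby:
    # After the fused-type rewrite, the generic name exists directly.
    @staticmethod
    def group_cummax(*args):
        return "fused"

funcname = "group_cummax"
func = getattr(FakeLibgroupby, funcname, None)
if func is None:  # dtype-specific fallback, no longer reached
    func = getattr(FakeLibgroupby, funcname + "_float64")
assert func() == "fused"
```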

return modes[:j + 1]
# Note: For reasons unknown, slicing modes.base works but modes[:j+1].base
# returns an object with an incorrect length
return modes.base[:j + 1] # `.base` to access underlying np.ndarray
Contributor

same comment

@jbrockmendel
Member Author

updated to use np.asarray(foo) instead of foo.base and to explicitly reject None in all appropriate places. I think we may also need to do a pass to add const to memoryview-accepting py-exposed functions. Between these two, the added verbiage is making this look less appealing.

@jreback
Contributor

jreback commented Nov 2, 2018

needs a rebase

@jbrockmendel
Member Author

Rebased, but pls hold off until I make another pass.

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Nov 2, 2018
@jbrockmendel
Member Author

The worthwhile parts of this are in #23464. Closing.

@jbrockmendel jbrockmendel deleted the libmore branch November 3, 2018 01:08
jreback pushed a commit that referenced this pull request Nov 3, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019