REF: cython cleanups and optimizations #23382
Codecov Report
@@ Coverage Diff @@
## master #23382 +/- ##
=======================================
Coverage 92.22% 92.22%
=======================================
Files 161 161
Lines 51187 51187
=======================================
Hits 47209 47209
Misses 3978 3978
Continue to review full report at Codecov.
Mostly changing ndarray to memoryviews where possible. A few notation cleanups around pointer types. Adds some extra type annotations, using py3-syntax where feasible.
@jbrockmendel : How is performance impacted, if at all, after these changes?
Small but positive. For the changes of ndarray[dtype] to dtype[:] the impact is small enough that I can't measure it, but I trust the cython folks when they say that the memoryview usage is more performant. Adding Changing Changing Removal of Adding type annotations in _libs.lib for non-cdef functions I'm not sure if cython actually uses those, so that might just be for funsies. Replacing So tiny-but-positive all around.
@jbrockmendel : Cool, just wanted to make sure we had the rationale documented 👍
return modes[:j + 1]
# Note: For reasons unknown, slicing modes.base works but modes[:j+1].base
# returns an object with an incorrect length
return modes.base[:j + 1]  # `.base` to access underlying np.ndarray
you need np.asarray i think
same comment
pandas/_libs/lib.pyx
Outdated
@@ -284,7 +284,7 @@ def dicts_to_array(list dicts, list columns):
    else:
        result[i, j] = onan

return result
return result.base  # `.base` to access underlying np.ndarray
np.asarray
pandas/_libs/lib.pyx
Outdated
for i in range(n):
    idx = indexer[i]
    if idx != -1:
        rev_indexer[idx] = i

return rev_indexer
return rev_indexer.base  # `.base` to access underlying np.ndarray
same
see comments
pandas/_libs/lib.pyx
Outdated
@@ -525,11 +534,12 @@ def astype_unicode(arr: ndarray,

    result[i] = arr_i

return result
return result.base  # `.base` to access underlying np.ndarray
same
pandas/_libs/lib.pyx
Outdated
    Seen seen = Seen()
    object val
    float64_t fval, fnan

if objects is None:
    # Without explicitly raising, groupby.ops _aggregate_series_pure_python
    # can pass None and incorrectly raise an AttributeError when trying
what fails for this? it is very odd you needed to change this
At the moment the call to _aggregate_series_pure_python raises because it expects an ndarray and gets None. But if we make it expect a memoryview then None is technically allowed. So this reinstates the raising explicitly.
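The point about None can be sketched in plain Python. This is only an illustration, not the code from the PR: `unguarded` and `guarded` are hypothetical names, the built-in `memoryview` stands in for a Cython typed memoryview, and raising `TypeError` is an assumption chosen to mirror what a typed ndarray argument does when it receives None.

```python
def unguarded(objects):
    # Stand-in for a memoryview-typed argument: None is accepted at the
    # call boundary and only fails later, with a less helpful error.
    return objects.shape[0]


def guarded(objects):
    # Explicit check that restores the early failure (assumed TypeError).
    if objects is None:
        raise TypeError("Argument 'objects' must not be None")
    return objects.shape[0]


buf = memoryview(b"abc")  # any buffer-like object with a .shape works here
n_items = guarded(buf)
```

With the guard in place, the caller sees the failure at the function boundary instead of an AttributeError from deeper inside the routine.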
pandas/_libs/lib.pyx
Outdated
@@ -2036,7 +2053,7 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
if seen.datetimetz_:
    if len({getattr(val, 'tzinfo', None) for val in objects}) == 1:
        from pandas import DatetimeIndex
        return DatetimeIndex(objects)
        return DatetimeIndex(objects.base)
why accessing base here?
Because DatetimeIndex.__new__ doesn't know how to handle a cython memoryview object
then use np.asarray, that's the standard
let's avoid using .base anywhere. i believe we had this discussion once before. np.asarray is much more idiomatic and readable; it's not a perf issue as these are the last call in a routine.
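A rough Python analogy for the `.base` versus `np.asarray` discussion, assuming NumPy is available. Here a NumPy view and the built-in `memoryview` stand in for a Cython typed memoryview, and `modes` and `j` are borrowed from the hunk above purely for illustration. It shows one plausible reason a sliced object's `.base` can have an unexpected length: `.base` points at the whole parent buffer, while `np.asarray` respects the slice.

```python
import numpy as np

modes = np.arange(10, dtype=np.int64)  # stand-in for the full result buffer
j = 3

# A sliced NumPy view keeps a reference to the entire parent array in
# `.base`, so the slice bounds are lost when going through `.base`.
sliced = modes[:j + 1]
sliced_base_len = len(sliced.base)

# np.asarray consumes the buffer exactly as exposed, so the slice survives.
view = memoryview(modes)[:j + 1]
result = np.asarray(view)
```

In this sketch `sliced.base` is the full ten-element array, whereas `result` has only the `j + 1` elements that were sliced.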
Reverted change of |
same comments as before
pandas/_libs/groupby.pyx
Outdated
int64_t idx, curr_fill_idx=-1, filled_vals=0

N = len(out)

# Make sure all arrays are the same size
assert N == len(labels) == len(mask)

sorted_labels = np.argsort(labels, kind='mergesort').astype(
    np.int64, copy=False)
sorted_labels = np.argsort(labels, kind='mergesort').astype(np.int64,
i prefer the former
will revert.
@@ -837,8 +823,3 @@ def group_cummax(ndarray[groupby_t, ndim=2] out,
        if val > mval:
            accum[lab, j] = mval = val
        out[i, j] = mval


group_cummax_float64 = group_cummax["float64_t"]
are these just relics?
groupby.ops (I think) looks these up by doing something like:
funcname = "group_cummax"
func = getattr(libgroupby, funcname, None)
if func is None:
    func = getattr(libgroupby, funcname + "_" + dtype)
Before we used the fused types, it would succeed on the second getattr. Now it succeeds on the first.
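The lookup described above can be tried out in isolation. This is a minimal sketch, not the actual groupby.ops code: `resolve`, `old_lib`, and `new_lib` are hypothetical names, and `SimpleNamespace` objects stand in for the compiled `libgroupby` module before and after the switch to fused types.

```python
from types import SimpleNamespace


def resolve(lib, funcname, dtype):
    # Try the generic (fused-type) name first, then the dtype-suffixed one.
    func = getattr(lib, funcname, None)
    if func is None:
        func = getattr(lib, funcname + "_" + dtype)
    return func


# Before fused types: only the dtype-specific alias exists.
old_lib = SimpleNamespace(group_cummax_float64=lambda: "dtype-specific")
# After fused types: the generic name exists, so the alias is never consulted.
new_lib = SimpleNamespace(group_cummax=lambda: "fused")

old_result = resolve(old_lib, "group_cummax", "float64")()
new_result = resolve(new_lib, "group_cummax", "float64")()
```

Under these assumptions the dtype-suffixed alias is only reached when the generic name is missing, which is why it looks like a relic once the fused function is in place.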
return modes[:j + 1]
# Note: For reasons unknown, slicing modes.base works but modes[:j+1].base
# returns an object with an incorrect length
return modes.base[:j + 1]  # `.base` to access underlying np.ndarray
same comment
updated to use |
needs a rebase
Rebased, but pls hold off until I make another pass.
The worthwhile parts of this are in #23464. Closing.
* Easy bits of pandas-dev#23382 * Easy parts of pandas-dev#23368
git diff upstream/master -u -- "*.py" | flake8 --diff