PERF: optimize Block.getitem_block #34978

jbrockmendel · 2020-06-24T22:20:54Z

Performance comparison is based on the asv benchmark groupby.Apply.time_scalar_function_single_col, which is the one where disabling the libreduction path has the biggest impact.

import pandas as pd
import numpy as np

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = pd.DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

In [4]: %prun -s cumtime df.groupby("key").apply(lambda x: 1)
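Outside IPython, the `%prun -s cumtime` call can be approximated with the standard library's `cProfile` and `pstats`. This is a minimal sketch; the helper name `profile_by_cumtime` and its `limit` parameter are made up for illustration and are not part of pandas or of this PR.

```python
import cProfile
import io
import pstats


def profile_by_cumtime(func, limit=20):
    """Run func() under cProfile and return a report sorted by cumulative time.

    Hypothetical helper, roughly equivalent to IPython's `%prun -s cumtime`.
    """
    profiler = cProfile.Profile()
    profiler.runcall(func)

    # Write the report into a string instead of printing to stdout
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


# Usage with the DataFrame built above (requires pandas):
# print(profile_by_cumtime(lambda: df.groupby("key").apply(lambda x: 1)))
```

The labels are drawn without a fixed seed, so the number of groups (and therefore `ncalls`) can differ slightly between runs.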

master-but-with-fast_apply-disabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.081    0.081 groupby.py:822(apply)
        1    0.000    0.000    0.081    0.081 groupby.py:871(_python_apply_general)
        1    0.006    0.006    0.080    0.080 ops.py:157(apply)
     1989    0.002    0.000    0.069    0.000 ops.py:933(__iter__)
     1988    0.003    0.000    0.066    0.000 ops.py:966(_chop)
     1988    0.004    0.000    0.059    0.000 managers.py:724(get_slice)
     1988    0.002    0.000    0.035    0.000 managers.py:730(<listcomp>)
     5964    0.009    0.000    0.033    0.000 blocks.py:283(getitem_block)
     5971    0.005    0.000    0.021    0.000 blocks.py:247(make_block_same_class)
     5974    0.007    0.000    0.013    0.000 blocks.py:115(__init__)
     1988    0.003    0.000    0.011    0.000 base.py:4064(__getitem__)
     1990    0.003    0.000    0.008    0.000 managers.py:120(__init__)
     1988    0.002    0.000    0.007    0.000 numeric.py:105(_shallow_copy)
     1992    0.002    0.000    0.007    0.000 blocks.py:2379(__init__)
     1988    0.002    0.000    0.005    0.000 base.py:485(_shallow_copy)
     1990    0.002    0.000    0.004    0.000 frame.py:432(__init__)
     1992    0.002    0.000    0.003    0.000 base.py:450(_simple_new)

PR-but-with-fast_apply-disabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.061    0.061 groupby.py:822(apply)
        1    0.000    0.000    0.061    0.061 groupby.py:871(_python_apply_general)
        1    0.005    0.005    0.058    0.058 ops.py:157(apply)
     1991    0.002    0.000    0.048    0.000 ops.py:933(__iter__)
     1990    0.003    0.000    0.046    0.000 ops.py:966(_chop)
     1990    0.004    0.000    0.039    0.000 managers.py:724(get_slice)
     1990    0.002    0.000    0.017    0.000 managers.py:730(<listcomp>)
     5970    0.009    0.000    0.015    0.000 blocks.py:297(getitem_block)
     1990    0.002    0.000    0.011    0.000 base.py:4064(__getitem__)
     1992    0.003    0.000    0.007    0.000 managers.py:120(__init__)
     1990    0.002    0.000    0.007    0.000 numeric.py:105(_shallow_copy)
     1990    0.002    0.000    0.005    0.000 base.py:485(_shallow_copy)
     1992    0.002    0.000    0.004    0.000 frame.py:432(__init__)
     5970    0.002    0.000    0.003    0.000 blocks.py:116(_simple_new)
     1994    0.002    0.000    0.003    0.000 base.py:450(_simple_new)
     1992    0.001    0.000    0.002    0.000 managers.py:126(<listcomp>)
    22438    0.002    0.000    0.002    0.000 {built-in method builtins.isinstance}
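In the PR profile, `getitem_block` no longer goes through `make_block_same_class` and `Block.__init__`; a much cheaper `_simple_new` shows up instead. The general pattern is a constructor that skips validation when the inputs come from an object that is already known to be valid. The sketch below illustrates it with a toy class: `ToyBlock`, its attributes, and its checks are assumptions for illustration, not the actual pandas implementation.

```python
import numpy as np


class ToyBlock:
    """Hypothetical stand-in for a pandas Block; not the real class."""

    def __init__(self, values, placement, ndim):
        # Validating constructor: useful for untrusted input,
        # but the checks add overhead when called thousands of times.
        values = np.asarray(values)
        if values.ndim != ndim:
            raise ValueError("values.ndim does not match ndim")
        if len(placement) != values.shape[0]:
            raise ValueError("placement length does not match values")
        self.values = values
        self.placement = placement
        self.ndim = ndim

    @classmethod
    def _simple_new(cls, values, placement, ndim):
        # Fast path: bypass __init__ and its checks entirely.
        obj = object.__new__(cls)
        obj.values = values
        obj.placement = placement
        obj.ndim = ndim
        return obj

    def getitem_block(self, slicer):
        # Slicing the last axis of an already-valid block keeps ndim and
        # placement valid, so re-validating would be wasted work.
        new_values = self.values[..., slicer]
        return type(self)._simple_new(new_values, self.placement, self.ndim)


block = ToyBlock(np.arange(12).reshape(3, 4), range(3), ndim=2)
chunk = block.getitem_block(slice(1, 3))
```

The trade-off is that `_simple_new` trusts its caller, so it should only be used where the inputs are guaranteed to be consistent.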

master:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.009    0.009 groupby.py:822(apply)
        1    0.000    0.000    0.009    0.009 groupby.py:871(_python_apply_general)
        1    0.000    0.000    0.008    0.008 ops.py:157(apply)
        1    0.000    0.000    0.005    0.005 ops.py:961(fast_apply)
        1    0.003    0.003    0.005    0.005 {pandas._libs.reduction.apply_frame_axis0}
     1994    0.001    0.000    0.002    0.000 base.py:4064(__getitem__)
        1    0.000    0.000    0.001    0.001 ops.py:135(_get_splitter)
        1    0.000    0.000    0.001    0.001 ops.py:268(group_info)
        1    0.000    0.000    0.001    0.001 generic.py:1206(_wrap_applied_output)
        1    0.000    0.000    0.001    0.001 ops.py:285(_get_compressed_codes)
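To put the three tables side by side, the following sketch computes ratios from the `cumtime` values copied from the profiles above. The dictionary and function names are illustrative only, and since each table is a single `%prun` run, the numbers should be read as rough indications rather than careful measurements.

```python
# cumtime values (seconds) copied from the %prun tables above
master_disabled = {"apply": 0.081, "get_slice": 0.059, "getitem_block": 0.033}
pr_disabled = {"apply": 0.061, "get_slice": 0.039, "getitem_block": 0.015}
master_fast_apply = 0.009  # groupby.py apply on master with fast_apply enabled


def speedup(before, after):
    """Return how many times faster `after` is compared to `before`."""
    return before / after


for name, before in master_disabled.items():
    after = pr_disabled[name]
    print(f"{name}: {before:.3f}s -> {after:.3f}s ({speedup(before, after):.2f}x)")

# How far the non-fast_apply path on this PR still is from fast_apply on master
remaining_gap = speedup(pr_disabled["apply"], master_fast_apply)
print(f"PR without fast_apply vs master with fast_apply: {remaining_gap:.1f}x slower")
```

By these numbers, `getitem_block` itself is roughly 2.2x faster and the overall `apply` call about 1.3x faster with fast_apply disabled, while the fast_apply path on master remains several times faster than either.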

pandas/core/internals/blocks.py

jreback · 2020-06-25T14:41:31Z

pandas/_libs/internals.pyx

@@ -16,6 +16,7 @@ cnp.import_array()
 from pandas._libs.algos import ensure_int64


+@cython.final


Do these actually make a difference?

jbrockmendel · Jun 25, 2020

I haven't measured this independently, so I'm not sure how much it matters in this case (I tried it for some Timestamp methods and got a decent boost). It allows Cython to do some inlining, at the cost of disallowing subclassing.

PERF: optimize Block.getitem_block

5667e09

jreback added the Internals (related to non-user accessible pandas implementation) and Performance (memory or execution speed performance) labels Jun 24, 2020

jreback requested changes Jun 24, 2020

pandas/core/internals/blocks.py (resolved)

jreback added this to the 1.1 milestone Jun 25, 2020

jreback reviewed Jun 25, 2020

jreback approved these changes Jun 25, 2020

jreback merged commit 4ffd1f1 into pandas-dev:master Jun 25, 2020

jbrockmendel deleted the perf-getitem_block branch June 25, 2020 15:25

fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jun 27, 2020

PERF: optimize Block.getitem_block (pandas-dev#34978)

d953fe4
