bpo-44850: Improve the performance of methodcaller. #27782

anntzer · 2021-08-16T15:59:24Z

See discussion at https://bugs.python.org/issue44850.

This is currently split into two separate commits, so that the separate speedups from _PyObject_GetMethod and vectorcall can be evaluated separately (plus a third doc-only commit). Edit: Now a single commit, per #27782 (comment), so only consider the first and last benchmarks.

The benchmark script tests builtin and python methods, with and without kwargs:

(
    export PYTHON=./python
    echo 'No kws.'
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort')" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort()" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort')" -s "class A: sort = lambda self: None" -s "arr=A()" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort()" -s "class A: sort = lambda self: None" -s "arr=A()" "call(arr)"
    echo 'With kws.'
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort', reverse=True)" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort(reverse=True)" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort', reverse=True)" -s "class A: sort = lambda self, reverse=False: None" -s "arr=A()" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort(reverse=True)" -s "class A: sort = lambda self, reverse=False: None" -s "arr=A()" "call(arr)"
)

Before:

No kws.
.....................
Mean +- std dev: 81.5 ns +- 0.6 ns
.....................
Mean +- std dev: 74.4 ns +- 1.0 ns
.....................
Mean +- std dev: 112 ns +- 1 ns
.....................
Mean +- std dev: 101 ns +- 1 ns
With kws.
.....................
Mean +- std dev: 138 ns +- 1 ns
.....................
Mean +- std dev: 97.5 ns +- 1.2 ns
.....................
Mean +- std dev: 157 ns +- 1 ns
.....................
Mean +- std dev: 112 ns +- 1 ns

Using _PyObject_GetMethod:

No kws.
.....................
Mean +- std dev: 72.5 ns +- 1.6 ns
.....................
Mean +- std dev: 74.4 ns +- 1.6 ns
.....................
Mean +- std dev: 106 ns +- 1 ns
.....................
Mean +- std dev: 101 ns +- 1 ns
With kws.
.....................
Mean +- std dev: 128 ns +- 1 ns
.....................
Mean +- std dev: 97.2 ns +- 0.8 ns
.....................
Mean +- std dev: 150 ns +- 1 ns
.....................
Mean +- std dev: 112 ns +- 1 ns

Using _PyObject_GetMethod and vectorcall:

No kws.
.....................
Mean +- std dev: 54.6 ns +- 0.8 ns
.....................
Mean +- std dev: 74.8 ns +- 1.0 ns
.....................
Mean +- std dev: 87.8 ns +- 1.2 ns
.....................
Mean +- std dev: 101 ns +- 2 ns
With kws.
.....................
Mean +- std dev: 71.0 ns +- 0.6 ns
.....................
Mean +- std dev: 97.5 ns +- 0.7 ns
.....................
Mean +- std dev: 93.3 ns +- 0.7 ns
.....................
Mean +- std dev: 113 ns +- 3 ns

https://bugs.python.org/issue44850

sweeneyde · 2021-08-16T21:58:12Z

Modules/_operator.c

+            mc->vectorcall_args[1 + nargs + i] = value;
+            ++i;
+        }
+    } else {


PEP 7 says to put else on the next line

sweeneyde · 2021-08-16T22:00:05Z

Modules/_operator.c

        return NULL;
-    result = PyObject_Call(method, mc->args, mc->kwds);
+    } else if (meth_found) {


else if (...) { on its own line

sweeneyde · 2021-08-16T22:00:23Z

Modules/_operator.c

+        mc->vectorcall_args[0] = obj;
+        result = PyObject_Vectorcall(
+                method, mc->vectorcall_args, PyTuple_GET_SIZE(mc->args) + 1, mc->vectorcall_kwnames);
+    } else {


else { on its own line

sweeneyde · 2021-08-16T22:20:27Z

Modules/_operator.c

+        result = PyObject_Vectorcall(
+                method, mc->vectorcall_args, PyTuple_GET_SIZE(mc->args) + 1, mc->vectorcall_kwnames);
+    } else {
+        result = PyObject_Call(method, mc->args, mc->kwds);


I'm just brainstorming, but couldn't PyObject_VectorcallMethod be used to replace a bunch of this logic and remove this branch? Or is there a reason not to default to vectorcall in this case? Maybe mc->args and mc->kwds could even be removed entirely?

It might also be nice to add more tests to make sure this case is covered for various C- and Python-implemented objects.

Actually that's a great idea, and makes everything much simpler (Hopefully the removal of the branch also means I can get away with no additional tests...). I squashed the commits together as a result, as the distinction didn't make much sense anymore. Performance doesn't change.

For now I kept args and kwds as 1) they avoid having to fiddle with the refcounts of the individual args, and 2) I didn't want to rewrite all of methodcaller_repr and methodcaller_reduce.

sweeneyde · 2021-08-16T22:23:50Z

Misc/NEWS.d/next/Library/2021-08-16-17-52-26.bpo-44850.r8jx5u.rst

@@ -0,0 +1,3 @@
+The performance of `operator.methodcaller` has been improved by using


I think this needs ``double backticks``.

sweeneyde · 2021-08-16T22:58:52Z

Modules/_operator.c

        return NULL;
-    result = PyObject_Call(method, mc->args, mc->kwds);
+    } else if (meth_found) {
+        mc->vectorcall_args[0] = obj;


Does this become a problem if mc() calls itself? I don't think it's a problem, but I just wanted to make sure it was thought through.

I think yes, also the very existence of https://docs.python.org/3/c-api/call.html#c.PY_VECTORCALL_ARGUMENTS_OFFSET suggests that this pattern can work in "normal" conditions.

It seems tricky to come up with a meaningful test, but

from operator import methodcaller barcall = methodcaller("bar") class A: def bar(self): print(self) return barcall(self) a = A() a.bar()

correctly results in an infinite recursion.

sweeneyde · 2021-08-17T18:35:09Z

A thought: there could be another performance boost in adding a tp_vectorcall_offset to the methodcaller type, to avoid building the args tuple.

anntzer · 2021-08-18T00:14:43Z

This is not possible because methodcaller is a heap type, for which vectorcall is not recommended (https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_vectorcall_offset).

sweeneyde · 2021-08-18T02:11:09Z

Is that restriction still applicable since Py_TPFLAGS_IMMUTABLETYPE was added?

neonene · 2021-08-18T02:22:45Z

I think #27001 have made this possible.

anntzer · 2021-08-18T08:33:12Z

OK, I gave it a try, but I wasn't able to make it work :( If you want to give it a try, the patch is below; the printf() shows that methodcaller_vectorcall is never reached.

diff --git i/Modules/_operator.c w/Modules/_operator.c
index de4613a6a8..cee6df24fe 100644
--- i/Modules/_operator.c
+++ w/Modules/_operator.c
@@ -1,5 +1,6 @@
 #include "Python.h"
 #include "pycore_moduleobject.h"  // _PyModule_GetState()
+#include "structmember.h"
 #include "clinic/_operator.c.h"
 
 typedef struct {
@@ -1480,8 +1481,44 @@ typedef struct {
     PyObject *kwds;
     PyObject **vectorcall_args;  /* Borrowed references */
     PyObject *vectorcall_kwnames;
+    vectorcallfunc vectorcall;
 } methodcallerobject;
 
+static PyObject *
+methodcaller_vectorcall(
+        methodcallerobject *mc, PyObject *const *args, size_t nargsf, PyObject* kwnames)
+{
+    printf("vectorcall\n");
+    if (!_PyArg_NoKwnames("methodcaller", kwnames))
+        return NULL;
+    if (PyVectorcall_NARGS(nargsf) != 1) {
+        PyErr_Format(
+                PyExc_TypeError,
+                "methodcaller expected 1 argument, got %zd",
+                PyVectorcall_NARGS(nargsf));
+        return NULL;
+    }
+    mc->vectorcall_args[0] = args[0];
+    return PyObject_VectorcallMethod(
+            mc->name, mc->vectorcall_args,
+            (1 + PyTuple_GET_SIZE(mc->args)) | PY_VECTORCALL_ARGUMENTS_OFFSET,
+            mc->vectorcall_kwnames);
+}
+
+static PyObject *
+methodcaller_call(methodcallerobject *mc, PyObject *args, PyObject *kw)
+{
+    if (!_PyArg_NoKeywords("methodcaller", kw))
+        return NULL;
+    if (!_PyArg_CheckPositional("methodcaller", PyTuple_GET_SIZE(args), 1, 1))
+        return NULL;
+    mc->vectorcall_args[0] = PyTuple_GET_ITEM(args, 0);
+    return PyObject_VectorcallMethod(
+            mc->name, mc->vectorcall_args,
+            (1 + PyTuple_GET_SIZE(mc->args)) | PY_VECTORCALL_ARGUMENTS_OFFSET,
+            mc->vectorcall_kwnames);
+}
+
 /* AC 3.5: variable number of arguments, not currently support by AC */
 static PyObject *
 methodcaller_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
@@ -1548,6 +1585,7 @@ methodcaller_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
     else {
         mc->vectorcall_kwnames = NULL;
     }
+    mc->vectorcall = (vectorcallfunc)methodcaller_vectorcall;
 
     PyObject_GC_Track(mc);
     return (PyObject *)mc;
@@ -1584,20 +1622,6 @@ methodcaller_traverse(methodcallerobject *mc, visitproc visit, void *arg)
     return 0;
 }
 
-static PyObject *
-methodcaller_call(methodcallerobject *mc, PyObject *args, PyObject *kw)
-{
-    if (!_PyArg_NoKeywords("methodcaller", kw))
-        return NULL;
-    if (!_PyArg_CheckPositional("methodcaller", PyTuple_GET_SIZE(args), 1, 1))
-        return NULL;
-    mc->vectorcall_args[0] = PyTuple_GET_ITEM(args, 0);
-    return PyObject_VectorcallMethod(
-            mc->name, mc->vectorcall_args,
-            (1 + PyTuple_GET_SIZE(mc->args)) | PY_VECTORCALL_ARGUMENTS_OFFSET,
-            mc->vectorcall_kwnames);
-}
-
 static PyObject *
 methodcaller_repr(methodcallerobject *mc)
 {
@@ -1722,6 +1746,12 @@ static PyMethodDef methodcaller_methods[] = {
      reduce_doc},
     {NULL}
 };
+
+static PyMemberDef methodcaller_members[] = {
+    {"__vectorcalloffset__", T_PYSSIZET, offsetof(methodcallerobject, vectorcall), READONLY},
+    {NULL}
+};
+
 PyDoc_STRVAR(methodcaller_doc,
 "methodcaller(name, ...) --> methodcaller object\n\
 \n\
@@ -1737,6 +1767,7 @@ static PyType_Slot methodcaller_type_slots[] = {
     {Py_tp_traverse, methodcaller_traverse},
     {Py_tp_clear, methodcaller_clear},
     {Py_tp_methods, methodcaller_methods},
+    {Py_tp_members, methodcaller_members},
     {Py_tp_new, methodcaller_new},
     {Py_tp_getattro, PyObject_GenericGetAttr},
     {Py_tp_repr, methodcaller_repr},

neonene · 2021-08-18T15:54:48Z

The patch should work adding Py_TPFLAGS_HAVE_VECTORCALL in methodcaller_type_spec.

anntzer · 2021-08-18T16:33:02Z

Ah yes, thanks, it now works, and this provided another ~25-33% speedup on the same benchmarks.

markshannon · 2021-08-19T10:04:01Z

Modules/_operator.c

@@ -1696,6 +1745,12 @@ static PyMethodDef methodcaller_methods[] = {
     reduce_doc},
    {NULL}
 };
+
+static PyMemberDef methodcaller_members[] = {


Why is this being added?

It seems the logic has been introduced by #20026 instead of introducing a type slot for tp_vectorcall_offset. (Documentation is here)

markshannon · 2021-08-19T10:05:17Z

Without benchmarks showing methodcaller being used in a larger program, you can't claim any sort of performance improvements.

This PR also makes methodcaller objects larger and slower to create. The microbenchmarks don't account for that.

anntzer · 2021-08-19T17:37:57Z

Here's a still toyish example, but actually representative of the original motivation (which was in fact to speed up numpy.loadtxt):

from operator import methodcaller
import secrets
from timeit import Timer

# Construct many rows of pseudo-random data...
randdata = [secrets.token_hex(4) for _ in range(1_000_000)]
# ... on which we want to apply some simple operation.  We'll also sum() these
# (or perform some other kind of reduction).
strcount0 = methodcaller("count", "0")

tglobals = {"randdata": randdata, "strcount0": strcount0}
# With a genexp.
timer = Timer("sum(s.count('0') for s in randdata)",
              globals=tglobals)
n, t = timer.autorange()
print(t/n)
# With map + methodcaller.
timer = Timer("sum(map(strcount0, randdata))",
              globals=tglobals)
n, t = timer.autorange()
print(t/n)

# Check that we got our implementation right.
assert sum(s.count('0') for s in randdata) == sum(map(strcount0, randdata))

Before this patch, the timings where

0.1718709634999982
0.18424206999998205

After, they go to

0.17785336750000624
0.12897691050000049

i.e. speeding up methodcaller significantly improved the performance.

Note that the docs of the itertools module expressly suggest using the "high-speed functions in the operator module", which (admittedly implicitly) hints that the per-iteration performance of operators is perhaps more worthy of consideration that the construction time.

anntzer · 2021-08-19T21:36:03Z

Edit: I missed the comment about lambdas at https://bugs.python.org/issue44850#msg399903, so here's the same benchmark as above including the lambda approach (unsuprisingly the slowest):

from operator import methodcaller
import secrets
from timeit import Timer

# Construct many rows of pseudo-random data...
randdata = [secrets.token_hex(4) for _ in range(1_000_000)]
# ... on which we want to apply some simple operation.  We'll also sum() these
# (or perform some other kind of reduction).
func = lambda x: x.count("0")
mc = methodcaller("count", "0")

tglobals = {"randdata": randdata, "func": func, "mc": mc}
# With a genexp.
timer = Timer("sum(s.count('0') for s in randdata)", globals=tglobals)
n, t = timer.autorange()
print("genexp  ", t/n)
# With map + lambda.
timer = Timer("sum(map(func, randdata))", globals=tglobals)
n, t = timer.autorange()
print("lambda  ", t/n)
# With map + methodcaller.
timer = Timer("sum(map(mc, randdata))", globals=tglobals)
n, t = timer.autorange()
print("methcall", t/n)

# Check that we got our implementation right.
assert sum(s.count('0') for s in randdata) == sum(map(func, randdata)) == sum(map(mc, randdata))

Results:

genexp   0.17928955050001605
lambda   0.22078816400016876
methcall 0.1899308030000384  (before this patch)
...
methcall 0.13393379699994057 (after this patch)

neonene · 2021-08-20T23:06:33Z

Modules/_operator.c

-static PyObject *
-methodcaller_call(methodcallerobject *mc, PyObject *args, PyObject *kw)
-{
-    PyObject *method, *obj, *result;


methodcaller_call() will be almost unused ?
If so, I prefer to keep the original unchanged for maintenability.

It's not changed at all, just moved next to methodcaller_vectorcall as I think that makes things easier to track (and then both have to come before methodcaller_new, or else we'd need a forward declaration; I can switch to that if you prefer).

Would it be possible for methodcaller_call to rollback to the previous version implemented with PyObject_Call ? Any position seems fine.
It is over 5 years old (well tested), and sufficient for a fallback function.

Ah, I see what you mean, sorry I forgot the change. I just put the whole thing back where it was and with the old version, to decrease the diff.

neonene · 2021-08-20T23:08:14Z

Misc/NEWS.d/next/Library/2021-08-16-17-52-26.bpo-44850.r8jx5u.rst

@@ -0,0 +1,3 @@
+The performance of ``operator.methodcaller`` has been improved by using
+``_PyObject_GetMethod`` (avoiding the creation of a bound method in many


The NEWS seems outdated (_PyObject_GetMethod is not in the diff).
And it might be better to rewrite this PR's introduction (which now looks historically complicated), including benchmark results.

thanks, edited.

github-actions · 2021-09-22T00:05:16Z

This PR is stale because it has been open for 30 days with no activity.

anntzer · 2021-09-22T07:10:30Z

This is still ready from my side, AFAICT.

rhettinger · 2022-05-10T18:17:47Z

This all looks reasonable to me.

eendebakpt · 2023-07-19T09:32:37Z

@anntzer The PR has merge conflicts, but they are trivial to resolve (feel free to merge https://github.com/eendebakpt/cpython/tree/fastmethodcaller where I merged with current main). Compared to current main this PR is still faster. With:

python -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort')" -s "arr = []" "call(arr)"

I get

Mean +- std dev: [main] 131 ns +- 1 ns -> [p] 53.0 ns +- 2.1 ns: 2.47x faster

anntzer · 2023-07-19T11:38:55Z

@eendebakpt If you are interested, do you mind taking over this patch? I've kind of moved on since 2y ago.

eendebakpt · 2023-09-22T17:32:26Z

This pr can be closed, since the alternate has been merged.

the-knights-who-say-ni added the CLA signed label Aug 16, 2021

bedevere-bot added the awaiting review label Aug 16, 2021

anntzer force-pushed the fastmethodcaller branch 2 times, most recently from 0429366 to c074967 Compare August 16, 2021 18:54

sweeneyde reviewed Aug 16, 2021

View reviewed changes

anntzer force-pushed the fastmethodcaller branch from c074967 to 563988d Compare August 17, 2021 08:55

anntzer force-pushed the fastmethodcaller branch from 563988d to a4a04bf Compare August 18, 2021 16:32

sweeneyde mentioned this pull request Aug 19, 2021

bpo-44953: Add vectorcall for itemgetter and attrgetter #27828

Merged

markshannon reviewed Aug 19, 2021

View reviewed changes

neonene reviewed Aug 20, 2021

View reviewed changes

anntzer force-pushed the fastmethodcaller branch 2 times, most recently from 22b89cf to 5897c07 Compare August 22, 2021 09:58

bpo-44850: Speedup methodcaller via vectorcall.

12a80bd

anntzer force-pushed the fastmethodcaller branch from 5897c07 to 12a80bd Compare August 22, 2021 10:06

github-actions bot added the stale Stale PR or inactive for long period of time. label Sep 22, 2021

github-actions bot removed the stale Stale PR or inactive for long period of time. label Sep 23, 2021

ezio-melotti removed the CLA signed label Jul 13, 2022

eendebakpt mentioned this pull request Aug 31, 2022

Triage the open "performance" issues on cpython repo faster-cpython/ideas#440

Closed

iritkatriel added the performance Performance or resource usage label Aug 31, 2022

eendebakpt mentioned this pull request Jul 21, 2023

gh-89013: Improve the performance of methodcaller #106960

Closed

anntzer mannequin mentioned this pull request Sep 7, 2022

Could operator.methodcaller be optimized using LOAD_METHOD? #89013

Closed

eendebakpt mentioned this pull request Jul 27, 2023

gh-89013: Improve the performance of methodcaller (lazy version) #107201

Merged

AA-Turner closed this Sep 22, 2023

anntzer deleted the fastmethodcaller branch September 22, 2023 17:58

		@@ -0,0 +1,3 @@
		The performance of `operator.methodcaller` has been improved by using

		@@ -0,0 +1,3 @@
		The performance of ``operator.methodcaller`` has been improved by using
		``_PyObject_GetMethod`` (avoiding the creation of a bound method in many

Uh oh!

bpo-44850: Improve the performance of methodcaller. #27782

bpo-44850: Improve the performance of methodcaller. #27782

Uh oh!

Conversation

anntzer commented Aug 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anntzer Aug 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sweeneyde commented Aug 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anntzer commented Aug 18, 2021

Uh oh!

sweeneyde commented Aug 18, 2021

Uh oh!

neonene commented Aug 18, 2021

Uh oh!

anntzer commented Aug 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

neonene commented Aug 18, 2021

Uh oh!

anntzer commented Aug 18, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

markshannon commented Aug 19, 2021

Uh oh!

anntzer commented Aug 19, 2021

Uh oh!

anntzer commented Aug 19, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

neonene Aug 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Sep 22, 2021

Uh oh!

anntzer commented Sep 22, 2021

Uh oh!

rhettinger commented May 10, 2022

Uh oh!

anntzer commented Aug 16, 2021 •

edited

Loading

anntzer Aug 17, 2021 •

edited

Loading

sweeneyde commented Aug 17, 2021 •

edited

Loading

anntzer commented Aug 18, 2021 •

edited

Loading

neonene Aug 20, 2021 •

edited

Loading