Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-44850: Improve the performance of methodcaller. #27782

Closed
wants to merge 1 commit into from

Conversation

anntzer
Copy link
Contributor

@anntzer anntzer commented Aug 16, 2021

See discussion at https://bugs.python.org/issue44850.

This is currently split into two separate commits, so that the separate speedups from _PyObject_GetMethod and vectorcall can be evaluated separately (plus a third doc-only commit). Edit: Now a single commit, per #27782 (comment), so only consider the first and last benchmarks.

The benchmark script tests builtin and python methods, with and without kwargs:

(
    export PYTHON=./python
    echo 'No kws.'
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort')" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort()" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort')" -s "class A: sort = lambda self: None" -s "arr=A()" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort()" -s "class A: sort = lambda self: None" -s "arr=A()" "call(arr)"
    echo 'With kws.'
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort', reverse=True)" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort(reverse=True)" -s "arr = []" "call(arr)"
    $PYTHON -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort', reverse=True)" -s "class A: sort = lambda self, reverse=False: None" -s "arr=A()" "call(arr)"
    $PYTHON -m pyperf timeit -s "call = lambda x: x.sort(reverse=True)" -s "class A: sort = lambda self, reverse=False: None" -s "arr=A()" "call(arr)"
)

Before:

No kws.
.....................
Mean +- std dev: 81.5 ns +- 0.6 ns
.....................
Mean +- std dev: 74.4 ns +- 1.0 ns
.....................
Mean +- std dev: 112 ns +- 1 ns
.....................
Mean +- std dev: 101 ns +- 1 ns
With kws.
.....................
Mean +- std dev: 138 ns +- 1 ns
.....................
Mean +- std dev: 97.5 ns +- 1.2 ns
.....................
Mean +- std dev: 157 ns +- 1 ns
.....................
Mean +- std dev: 112 ns +- 1 ns

Using _PyObject_GetMethod:

No kws.
.....................
Mean +- std dev: 72.5 ns +- 1.6 ns
.....................
Mean +- std dev: 74.4 ns +- 1.6 ns
.....................
Mean +- std dev: 106 ns +- 1 ns
.....................
Mean +- std dev: 101 ns +- 1 ns
With kws.
.....................
Mean +- std dev: 128 ns +- 1 ns
.....................
Mean +- std dev: 97.2 ns +- 0.8 ns
.....................
Mean +- std dev: 150 ns +- 1 ns
.....................
Mean +- std dev: 112 ns +- 1 ns

Using _PyObject_GetMethod and vectorcall:

No kws.
.....................
Mean +- std dev: 54.6 ns +- 0.8 ns
.....................
Mean +- std dev: 74.8 ns +- 1.0 ns
.....................
Mean +- std dev: 87.8 ns +- 1.2 ns
.....................
Mean +- std dev: 101 ns +- 2 ns
With kws.
.....................
Mean +- std dev: 71.0 ns +- 0.6 ns
.....................
Mean +- std dev: 97.5 ns +- 0.7 ns
.....................
Mean +- std dev: 93.3 ns +- 0.7 ns
.....................
Mean +- std dev: 113 ns +- 3 ns

https://bugs.python.org/issue44850

mc->vectorcall_args[1 + nargs + i] = value;
++i;
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP 7 says to put else on the next line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

return NULL;
result = PyObject_Call(method, mc->args, mc->kwds);
} else if (meth_found) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else if (...) { on its own line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

mc->vectorcall_args[0] = obj;
result = PyObject_Vectorcall(
method, mc->vectorcall_args, PyTuple_GET_SIZE(mc->args) + 1, mc->vectorcall_kwnames);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else { on its own line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

result = PyObject_Vectorcall(
method, mc->vectorcall_args, PyTuple_GET_SIZE(mc->args) + 1, mc->vectorcall_kwnames);
} else {
result = PyObject_Call(method, mc->args, mc->kwds);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just brainstorming, but couldn't PyObject_VectorcallMethod be used to replace a bunch of this logic and remove this branch? Or is there a reason not to default to vectorcall in this case? Maybe mc->args and mc->kwds could even be removed entirely?

It might also be nice to add more tests to make sure this case is covered for various C- and Python-implemented objects.

Copy link
Contributor Author

@anntzer anntzer Aug 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that's a great idea, and makes everything much simpler (Hopefully the removal of the branch also means I can get away with no additional tests...). I squashed the commits together as a result, as the distinction didn't make much sense anymore. Performance doesn't change.

For now I kept args and kwds as 1) they avoid having to fiddle with the refcounts of the individual args, and 2) I didn't want to rewrite all of methodcaller_repr and methodcaller_reduce.

@@ -0,0 +1,3 @@
The performance of `operator.methodcaller` has been improved by using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs ``double backticks``.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

return NULL;
result = PyObject_Call(method, mc->args, mc->kwds);
} else if (meth_found) {
mc->vectorcall_args[0] = obj;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this become a problem if mc() calls itself? I don't think it's a problem, but I just wanted to make sure it was thought through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes, also the very existence of https://docs.python.org/3/c-api/call.html#c.PY_VECTORCALL_ARGUMENTS_OFFSET suggests that this pattern can work in "normal" conditions.

It seems tricky to come up with a meaningful test, but

from operator import methodcaller

barcall = methodcaller("bar")

class A:
    def bar(self):
        print(self)
        return barcall(self)

a = A()
a.bar()

correctly results in an infinite recursion.

@sweeneyde
Copy link
Member

sweeneyde commented Aug 17, 2021

A thought: there could be another performance boost in adding a tp_vectorcall_offset to the methodcaller type, to avoid building the args tuple.

@anntzer
Copy link
Contributor Author

anntzer commented Aug 18, 2021

This is not possible because methodcaller is a heap type, for which vectorcall is not recommended (https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_vectorcall_offset).

@sweeneyde
Copy link
Member

Is that restriction still applicable since Py_TPFLAGS_IMMUTABLETYPE was added?

@neonene
Copy link
Contributor

neonene commented Aug 18, 2021

I think #27001 have made this possible.

@anntzer
Copy link
Contributor Author

anntzer commented Aug 18, 2021

OK, I gave it a try, but I wasn't able to make it work :( If you want to give it a try, the patch is below; the printf() shows that methodcaller_vectorcall is never reached.

diff --git i/Modules/_operator.c w/Modules/_operator.c
index de4613a6a8..cee6df24fe 100644
--- i/Modules/_operator.c
+++ w/Modules/_operator.c
@@ -1,5 +1,6 @@
 #include "Python.h"
 #include "pycore_moduleobject.h"  // _PyModule_GetState()
+#include "structmember.h"
 #include "clinic/_operator.c.h"
 
 typedef struct {
@@ -1480,8 +1481,44 @@ typedef struct {
     PyObject *kwds;
     PyObject **vectorcall_args;  /* Borrowed references */
     PyObject *vectorcall_kwnames;
+    vectorcallfunc vectorcall;
 } methodcallerobject;
 
+static PyObject *
+methodcaller_vectorcall(
+        methodcallerobject *mc, PyObject *const *args, size_t nargsf, PyObject* kwnames)
+{
+    printf("vectorcall\n");
+    if (!_PyArg_NoKwnames("methodcaller", kwnames))
+        return NULL;
+    if (PyVectorcall_NARGS(nargsf) != 1) {
+        PyErr_Format(
+                PyExc_TypeError,
+                "methodcaller expected 1 argument, got %zd",
+                PyVectorcall_NARGS(nargsf));
+        return NULL;
+    }
+    mc->vectorcall_args[0] = args[0];
+    return PyObject_VectorcallMethod(
+            mc->name, mc->vectorcall_args,
+            (1 + PyTuple_GET_SIZE(mc->args)) | PY_VECTORCALL_ARGUMENTS_OFFSET,
+            mc->vectorcall_kwnames);
+}
+
+static PyObject *
+methodcaller_call(methodcallerobject *mc, PyObject *args, PyObject *kw)
+{
+    if (!_PyArg_NoKeywords("methodcaller", kw))
+        return NULL;
+    if (!_PyArg_CheckPositional("methodcaller", PyTuple_GET_SIZE(args), 1, 1))
+        return NULL;
+    mc->vectorcall_args[0] = PyTuple_GET_ITEM(args, 0);
+    return PyObject_VectorcallMethod(
+            mc->name, mc->vectorcall_args,
+            (1 + PyTuple_GET_SIZE(mc->args)) | PY_VECTORCALL_ARGUMENTS_OFFSET,
+            mc->vectorcall_kwnames);
+}
+
 /* AC 3.5: variable number of arguments, not currently support by AC */
 static PyObject *
 methodcaller_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
@@ -1548,6 +1585,7 @@ methodcaller_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
     else {
         mc->vectorcall_kwnames = NULL;
     }
+    mc->vectorcall = (vectorcallfunc)methodcaller_vectorcall;
 
     PyObject_GC_Track(mc);
     return (PyObject *)mc;
@@ -1584,20 +1622,6 @@ methodcaller_traverse(methodcallerobject *mc, visitproc visit, void *arg)
     return 0;
 }
 
-static PyObject *
-methodcaller_call(methodcallerobject *mc, PyObject *args, PyObject *kw)
-{
-    if (!_PyArg_NoKeywords("methodcaller", kw))
-        return NULL;
-    if (!_PyArg_CheckPositional("methodcaller", PyTuple_GET_SIZE(args), 1, 1))
-        return NULL;
-    mc->vectorcall_args[0] = PyTuple_GET_ITEM(args, 0);
-    return PyObject_VectorcallMethod(
-            mc->name, mc->vectorcall_args,
-            (1 + PyTuple_GET_SIZE(mc->args)) | PY_VECTORCALL_ARGUMENTS_OFFSET,
-            mc->vectorcall_kwnames);
-}
-
 static PyObject *
 methodcaller_repr(methodcallerobject *mc)
 {
@@ -1722,6 +1746,12 @@ static PyMethodDef methodcaller_methods[] = {
      reduce_doc},
     {NULL}
 };
+
+static PyMemberDef methodcaller_members[] = {
+    {"__vectorcalloffset__", T_PYSSIZET, offsetof(methodcallerobject, vectorcall), READONLY},
+    {NULL}
+};
+
 PyDoc_STRVAR(methodcaller_doc,
 "methodcaller(name, ...) --> methodcaller object\n\
 \n\
@@ -1737,6 +1767,7 @@ static PyType_Slot methodcaller_type_slots[] = {
     {Py_tp_traverse, methodcaller_traverse},
     {Py_tp_clear, methodcaller_clear},
     {Py_tp_methods, methodcaller_methods},
+    {Py_tp_members, methodcaller_members},
     {Py_tp_new, methodcaller_new},
     {Py_tp_getattro, PyObject_GenericGetAttr},
     {Py_tp_repr, methodcaller_repr},

@neonene
Copy link
Contributor

neonene commented Aug 18, 2021

The patch should work adding Py_TPFLAGS_HAVE_VECTORCALL in methodcaller_type_spec.

@anntzer
Copy link
Contributor Author

anntzer commented Aug 18, 2021

Ah yes, thanks, it now works, and this provided another ~25-33% speedup on the same benchmarks.

@@ -1696,6 +1745,12 @@ static PyMethodDef methodcaller_methods[] = {
reduce_doc},
{NULL}
};

static PyMemberDef methodcaller_members[] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the logic has been introduced by #20026 instead of introducing a type slot for tp_vectorcall_offset. (Documentation is here)

@markshannon
Copy link
Member

Without benchmarks showing methodcaller being used in a larger program, you can't claim any sort of performance improvements.

This PR also makes methodcaller objects larger and slower to create. The microbenchmarks don't account for that.

@anntzer
Copy link
Contributor Author

anntzer commented Aug 19, 2021

Here's a still toyish example, but actually representative of the original motivation (which was in fact to speed up numpy.loadtxt):

from operator import methodcaller
import secrets
from timeit import Timer

# Construct many rows of pseudo-random data...
randdata = [secrets.token_hex(4) for _ in range(1_000_000)]
# ... on which we want to apply some simple operation.  We'll also sum() these
# (or perform some other kind of reduction).
strcount0 = methodcaller("count", "0")

tglobals = {"randdata": randdata, "strcount0": strcount0}
# With a genexp.
timer = Timer("sum(s.count('0') for s in randdata)",
              globals=tglobals)
n, t = timer.autorange()
print(t/n)
# With map + methodcaller.
timer = Timer("sum(map(strcount0, randdata))",
              globals=tglobals)
n, t = timer.autorange()
print(t/n)

# Check that we got our implementation right.
assert sum(s.count('0') for s in randdata) == sum(map(strcount0, randdata))

Before this patch, the timings where

0.1718709634999982
0.18424206999998205

After, they go to

0.17785336750000624
0.12897691050000049

i.e. speeding up methodcaller significantly improved the performance.

Note that the docs of the itertools module expressly suggest using the "high-speed functions in the operator module", which (admittedly implicitly) hints that the per-iteration performance of operators is perhaps more worthy of consideration that the construction time.

@anntzer
Copy link
Contributor Author

anntzer commented Aug 19, 2021

Edit: I missed the comment about lambdas at https://bugs.python.org/issue44850#msg399903, so here's the same benchmark as above including the lambda approach (unsuprisingly the slowest):

from operator import methodcaller
import secrets
from timeit import Timer

# Construct many rows of pseudo-random data...
randdata = [secrets.token_hex(4) for _ in range(1_000_000)]
# ... on which we want to apply some simple operation.  We'll also sum() these
# (or perform some other kind of reduction).
func = lambda x: x.count("0")
mc = methodcaller("count", "0")

tglobals = {"randdata": randdata, "func": func, "mc": mc}
# With a genexp.
timer = Timer("sum(s.count('0') for s in randdata)", globals=tglobals)
n, t = timer.autorange()
print("genexp  ", t/n)
# With map + lambda.
timer = Timer("sum(map(func, randdata))", globals=tglobals)
n, t = timer.autorange()
print("lambda  ", t/n)
# With map + methodcaller.
timer = Timer("sum(map(mc, randdata))", globals=tglobals)
n, t = timer.autorange()
print("methcall", t/n)

# Check that we got our implementation right.
assert sum(s.count('0') for s in randdata) == sum(map(func, randdata)) == sum(map(mc, randdata))

Results:

genexp   0.17928955050001605
lambda   0.22078816400016876
methcall 0.1899308030000384  (before this patch)
...
methcall 0.13393379699994057 (after this patch)

static PyObject *
methodcaller_call(methodcallerobject *mc, PyObject *args, PyObject *kw)
{
PyObject *method, *obj, *result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

methodcaller_call() will be almost unused ?
If so, I prefer to keep the original unchanged for maintenability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not changed at all, just moved next to methodcaller_vectorcall as I think that makes things easier to track (and then both have to come before methodcaller_new, or else we'd need a forward declaration; I can switch to that if you prefer).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible for methodcaller_call to rollback to the previous version implemented with PyObject_Call ? Any position seems fine.
It is over 5 years old (well tested), and sufficient for a fallback function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean, sorry I forgot the change. I just put the whole thing back where it was and with the old version, to decrease the diff.

@@ -0,0 +1,3 @@
The performance of ``operator.methodcaller`` has been improved by using
``_PyObject_GetMethod`` (avoiding the creation of a bound method in many
Copy link
Contributor

@neonene neonene Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NEWS seems outdated (_PyObject_GetMethod is not in the diff).
And it might be better to rewrite this PR's introduction (which now looks historically complicated), including benchmark results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, edited.

@anntzer anntzer force-pushed the fastmethodcaller branch 2 times, most recently from 22b89cf to 5897c07 Compare August 22, 2021 09:58
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Sep 22, 2021
@anntzer
Copy link
Contributor Author

anntzer commented Sep 22, 2021

This is still ready from my side, AFAICT.

@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Sep 23, 2021
@rhettinger
Copy link
Contributor

This all looks reasonable to me.

@eendebakpt
Copy link
Contributor

@anntzer The PR has merge conflicts, but they are trivial to resolve (feel free to merge https://github.com/eendebakpt/cpython/tree/fastmethodcaller where I merged with current main). Compared to current main this PR is still faster. With:

python -m pyperf timeit -s "from operator import methodcaller as mc" -s "call = mc('sort')" -s "arr = []" "call(arr)"

I get

Mean +- std dev: [main] 131 ns +- 1 ns -> [p] 53.0 ns +- 2.1 ns: 2.47x faster

@anntzer
Copy link
Contributor Author

anntzer commented Jul 19, 2023

@eendebakpt If you are interested, do you mind taking over this patch? I've kind of moved on since 2y ago.

@eendebakpt
Copy link
Contributor

This pr can be closed, since the alternate has been merged.

@AA-Turner AA-Turner closed this Sep 22, 2023
@anntzer anntzer deleted the fastmethodcaller branch September 22, 2023 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.