bpo-43684: Add ADD_INT opcode #25090
Conversation
Hi Guido, I ran some preliminary benchmarks with the main branch against the current implementation (some tests errored out with an error message). On average, pyperformance sees no change (please take these benchmarks with a grain of salt; I'm quite unhappy with how much variation there is for benchmarks which shouldn't even be affected :( ): pyperformance compare output
The micro benchmarks show some noticeable speedups:
Thanks @Fidget-Spinner!
That's disturbing, I will look into that.
Yeah, this is what I've experienced running pyperformance as well. It's really hard to move the needle, and there's so much noise that for improvements expected to give less than 5% in the general case, the best we can hope for from the benchmark is confidence that we haven't accidentally pessimized things. Your timeit-based microbenchmarks look like what I expected. Maybe we'll be able to move the needle by adding a few other similar changes to this same PR (that saves bumps in the pyc magic number as well :-).
(FWIW the reason I didn't finish generating bytecode using RETURN_CONST and RETURN_NONE is that this causes a large number of failures in test_dis.py. I need to strategize on what to do about those.)
Python/compile.c
@@ -6910,14 +6910,14 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
                }
            }
            break;
-# if 0
+# if 1
I get a scary error in test_sys_settrace.py:
PS C:\Users\gvanrossum\cpython> .\PCbuild\amd64\python.exe -m test test_sys_settrace
0:00:00 Run tests sequentially
0:00:00 [1/1] test_sys_settrace
Fatal Python error: PyFrame_BlockPop: block stack underflow
Python runtime state: initialized
Current thread 0x00005704 (most recent call first):
File "C:\Users\gvanrossum\cpython\lib\test\test_sys_settrace.py", line 1550 in test_jump_over_return_in_try_finally_block
File "C:\Users\gvanrossum\cpython\lib\test\test_sys_settrace.py", line 1179 in run_test
File "C:\Users\gvanrossum\cpython\lib\test\test_sys_settrace.py", line 1207 in test
File "C:\Users\gvanrossum\cpython\lib\unittest\case.py", line 549 in _callTestMethod
File "C:\Users\gvanrossum\cpython\lib\unittest\case.py", line 592 in run
File "C:\Users\gvanrossum\cpython\lib\unittest\case.py", line 652 in __call__
File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 122 in run
File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 84 in __call__
File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 122 in run
File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 84 in __call__
File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 122 in run
File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 84 in __call__
File "C:\Users\gvanrossum\cpython\lib\test\support\testresult.py", line 169 in run
File "C:\Users\gvanrossum\cpython\lib\test\support\__init__.py", line 959 in _run_suite
File "C:\Users\gvanrossum\cpython\lib\test\support\__init__.py", line 1082 in run_unittest
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 210 in _test_module
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 246 in _runtest_inner2
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 282 in _runtest_inner
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 154 in _runtest
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 194 in runtest
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 423 in run_tests_sequential
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 521 in run_tests
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 694 in _main
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 641 in main
File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 719 in main
File "C:\Users\gvanrossum\cpython\lib\test\__main__.py", line 2 in <module>
File "C:\Users\gvanrossum\cpython\lib\runpy.py", line 86 in _run_code
File "C:\Users\gvanrossum\cpython\lib\runpy.py", line 196 in _run_module_as_main
Extension modules: _testcapi, _overlapped (total: 2)
@markshannon You changed that file last. Should I just give up on RETURN_CONST/RETURN_NONE for now?
In this PR, yes. See my https://github.com/python/cpython/pull/25090/files#r604752931 for a fix.
Note: _overlapped is listed as a 3rd party C extension, whereas it's part of the stdlib. I forgot this one and I created #25122 to add it to sys.stdlib_module_names :-)
I don't think adding a bunch of new opcodes to one PR is a good idea. It makes it difficult to review and get merged.
Python/compile.c
cnt = PyList_GET_ITEM(consts, oparg);
inst->i_opcode = RETURN_CONST;
// cnt == Py_None ? RETURN_NONE : RETURN_CONST;
bb->b_instr[i+1].i_opcode = NOP;
This makes the last instruction of the BB a non-terminator. Either put the NOP first or shorten the BB by one.
You'll need to add RETURN_CONST as a BB terminator to all the code that cares about such things.
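For illustration, a minimal sketch of the first option ("put the NOP first"), assuming the variables from the quoted snippet (inst, bb, i, oparg); this is not the PR's actual code:
/* Sketch: keep the block's terminator last by NOP-ing the LOAD_CONST
 * and rewriting the following RETURN_VALUE in place instead. */
inst->i_opcode = NOP;                      /* was LOAD_CONST */
bb->b_instr[i+1].i_opcode = RETURN_CONST;  /* was RETURN_VALUE */
bb->b_instr[i+1].i_oparg = oparg;          /* index of the constant to return */
The RETURN_CONST instruction then remains the last instruction of the basic block, but as noted above it still has to be taught to everything that treats the last instruction as a terminator.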
I was hoping to get a bunch of opcodes in so that together they would show an improvement in the benchmark, but I think I've bitten off more than I could chew in one PR, so I'll remove the new opcodes other than ADD_INT later today.
How many potential new opcodes are in the pipeline? My register VM work squished the stack opcodes down towards zero, with my new opcodes taking their place. Obviously, if you get much past 128 opcodes I will have some extra thinking to do. :-)
This builds on top of python#25090
I've got 6 in the works so far, but potentially dozens more. However, for 3.10 I don't expect to be doing more than 4-6.
Hm. GitHub shows that I pushed e8e979f, but that's in a different branch (trying to save some of the work I rolled back here). Looking at the File Changes tab, none of the changes in e8e979f are included in this PR. It seems to be a GitHub bug that it shows up on the Conversation page here. Please ignore it.
Include/longobject.h
@@ -212,6 +212,7 @@ PyAPI_FUNC(PyObject *) _PyLong_GCD(PyObject *, PyObject *);
#ifndef Py_LIMITED_API
PyAPI_FUNC(PyObject *) _PyLong_Rshift(PyObject *, size_t);
PyAPI_FUNC(PyObject *) _PyLong_Lshift(PyObject *, size_t);
+PyAPI_FUNC(PyObject *) _PyLong_AddInt(PyLongObject *, int);
Should this API be public? If not, please move it to Include/internal. If yes, how about using Py_ssize_t instead of int?
It should not be public. I will move it. I am still not used to what goes where now that we've got several different subdirectories of header files, plus "LIMITED_API" defines.
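For context, here is a rough sketch of what such a helper could look like. The actual PR manipulates the PyLong internals directly in longobject.c; the name long_add_small_int and its body below are illustrative only, using public APIs, and are not the PR's implementation:
#include "Python.h"

/* Hypothetical sketch: add a small C int to a PyLong, with a fast path
 * for values that fit in a C long and a generic fallback for big ints. */
static PyObject *
long_add_small_int(PyLongObject *left, int right)
{
    int overflow = 0;
    long v = PyLong_AsLongAndOverflow((PyObject *)left, &overflow);
    if (!overflow) {
        /* Both operands are small; long long arithmetic cannot overflow here. */
        return PyLong_FromLongLong((long long)v + right);
    }
    /* Fall back to the generic protocol for arbitrary-precision values. */
    PyObject *r = PyLong_FromLong(right);
    if (r == NULL) {
        return NULL;
    }
    PyObject *result = PyNumber_Add((PyObject *)left, r);
    Py_DECREF(r);
    return result;
}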
Guido (and anyone else on this thread), do you think emitting ADD_INT for the LOAD_CONST(small int) + LOAD_NAME + BINARY_ADD pattern is worth doing as well? I'm a little wary of this though, because it combines across instructions (though not lines!) and slightly changes the instruction order. Here's the patch:
diff --git a/Python/compile.c b/Python/compile.c
index b6fbb500d2..b7f44654ae 100644
--- a/Python/compile.c
+++ b/Python/compile.c
@@ -6835,6 +6835,7 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
struct instr *inst = &bb->b_instr[i];
int oparg = inst->i_oparg;
int nextop = i+1 < bb->b_iused ? bb->b_instr[i+1].i_opcode : 0;
+ int nextnextop = i+2 < bb->b_iused ? bb->b_instr[i+2].i_opcode : 0;
if (is_jump(inst)) {
/* Skip over empty basic blocks. */
while (inst->i_target->b_iused == 0) {
@@ -6848,6 +6849,7 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
switch (inst->i_opcode) {
/* Remove LOAD_CONST const; conditional jump */
/* Also optimize LOAD_CONST(small_int) + BINARY_ADD */
+ /* Also optimize LOAD_CONST(small_int) + LOAD_NAME + BINARY_ADD*/
case LOAD_CONST:
{
PyObject* cnt;
@@ -6904,6 +6906,23 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
}
}
break;
+ case LOAD_NAME:
+ if (nextnextop != BINARY_ADD) {
+ break;
+ }
+ cnt = PyList_GET_ITEM(consts, oparg);
+ if (PyLong_CheckExact(cnt) &&
+ inst->i_lineno == bb->b_instr[i+2].i_lineno) {
+ int ovf = 0;
+ long val = PyLong_AsLongAndOverflow(cnt, &ovf);
+ if (ovf == 0 && val >= 0 && val < 256) {
+ bb->b_instr[i+2].i_opcode = ADD_INT;
+ bb->b_instr[i+2].i_oparg = val;
+ inst->i_opcode = NOP;
+ break;
+ }
+ }
+ break;
}
break;
}
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Idea: interpret the oparg as a signed 8-bit integer:
PyObject *left = TOP();
PyObject *result, *right;
int signed_oparg = (int8_t)oparg;
if (PyLong_CheckExact(left)) {
result = _PyLong_AddInt((PyLongObject *)left, signed_oparg);
Py_DECREF(left);
SET_TOP(result);
if (result == NULL) {
goto error;
}
DISPATCH();
}
if (signed_oparg >= 0) {
right = PyLong_FromLongLong(signed_oparg);
if (right == NULL) {
goto error;
}
result = PyNumber_Add(left, right);
}
else {
right = PyLong_FromLongLong(-signed_oparg);
if (right == NULL) {
goto error;
}
result = PyNumber_Subtract(left, right);
}
Py_DECREF(left);
Py_DECREF(right);
SET_TOP(result);
if (result == NULL) {
goto error;
}
DISPATCH();
Then you could compile x - 1 the same way, with a negative oparg. But I also imagine that being able to optimize things like that would let ADD_INT apply in more places.
Clever, I will look at whether this catches more occurrences.
Well, I think I'm going to close this PR. I haven't found any example programs that execute a significant number of ADD_INT operations (measured at runtime this time). The typical percentage seems to be around 0.5%. Thanks to everyone who thought of refinements.
I have some similar ideas for other opcode combinations.
Sorry to hear that. So my comment remains relevant :-) A good approach to optimizing int+int is to use Cython or anything else to specialize a function for int+int operations, and implement the addition in C rather than using Python bytecode. The cost of the bytecode evaluation loop is too high compared to the cost of doing int+int. PyPy is able to specialize the code using C number types and handles integer overflow (the tricky part). Another idea to experiment with is to add a cache for the "virtual table lookup". Something like https://bugs.python.org/issue14757 "INCA: Inline Caching meets Quickening in Python 3.3". One part of the PyNumber_Add() cost comes from binary_op1(), which needs to look at the type of both objects to decide which function should be called. The ceval.c "OPCACHE" (per code object opcode cache) might be used, but I don't know if it can be modified for this specific optimization.
I don't understand what you're saying. Are you saying that if you have a lot of additions you're better off using Cython etc.? I can't argue with that (and numpy comes to mind :-).
It wasn't the eval loop cost. In a micro-benchmark I saw a considerable speedup for ADD_INT compared to LOAD_CONST + BINARY_ADD. It's just that in most code, x+1 doesn't occur frequently enough to make a difference. I'm aware of the Quickening approach and we'll probably do something along these lines eventually.
Yes, it's an existing solution until CPython is optimized. Sadly, using Cython requires changing the code.
This PR is stale because it has been open for 30 days with no activity. |
This PR was abandoned but not closed. Hopefully @brandtbucher can sort out what we should do -- close it, merge it, or do something else. |
Yeah, this is superseded by more recent changes. |
ADD_INT combines LOAD_CONST with BINARY_ADD, but only when the constant is an integer in range(256). It adds a new internal function to longobject.c that is called from ceval.c to deal with the PyLong internals. This optimizes relatively common code expressions like x + 1 (a rough sketch of the rewrite follows below).
I'm creating this PR in draft mode because I haven't validated the speedup yet (there definitely is some in a micro-benchmark) and because I have ideas for more opcodes that could be added this way. In particular, the following seem promising:
- LOAD_CONST(None) + IS_OP + POP_JUMP_IF_{TRUE,FALSE}
- (LOAD_CONST or LOAD_FAST) + RETURN_VALUE
- (CALL_METHOD or CALL_FUNCTION) + (POP_TOP or RETURN_VALUE)
- GET_ITER + FOR_ITER
All these occur somewhat commonly as opcode pairs in an app I ran with DXPAIRS enabled, and have the following desirable properties:
- At most one of the opcodes has an argument. (For this purpose, we treat IS_OP(0) and IS_OP(1) as different opcodes.)
- Neither opcode seems a good candidate for type specialization (e.g. inline caching).
We also need to take care that we don't combine instructions on different line numbers.
TODO:
- items[i] = items[i+1]  # IndexError: list index out of range
- Decide which other opcodes to do
https://bugs.python.org/issue43684
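To make the rewrite concrete, here is a minimal sketch of the LOAD_CONST + BINARY_ADD case as it might appear inside optimize_basic_block() in Python/compile.c, patterned after the LOAD_NAME variant quoted earlier in the thread; the exact placement and checks in the real diff may differ:
/* Sketch: inside the `case LOAD_CONST:` arm, switching on the next opcode.
 * Rewrites LOAD_CONST(small int) + BINARY_ADD into NOP + ADD_INT. */
case BINARY_ADD:
{
    PyObject *cnt = PyList_GET_ITEM(consts, oparg);
    /* Only fold exact small ints, and never combine across source lines. */
    if (PyLong_CheckExact(cnt) &&
        inst->i_lineno == bb->b_instr[i+1].i_lineno)
    {
        int ovf = 0;
        long val = PyLong_AsLongAndOverflow(cnt, &ovf);
        if (ovf == 0 && val >= 0 && val < 256) {   /* fits in an oparg byte */
            inst->i_opcode = NOP;                  /* drop the LOAD_CONST */
            bb->b_instr[i+1].i_opcode = ADD_INT;
            bb->b_instr[i+1].i_oparg = (int)val;
        }
    }
    break;
}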