
bpo-43684: Add ADD_INT opcode #25090

Closed (wants to merge 18 commits)

Conversation

@gvanrossum (Member) commented Mar 30, 2021:

ADD_INT combines LOAD_CONST with BINARY_ADD, but only when the constant is an integer in range(256). It adds a new internal function to longobject.c that is called from ceval.c to deal with the PyLong internals. This optimizes relatively common code expressions like x + 1.
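
For illustration, here is a sketch of what the new opcode's ceval.c case could look like, based on the helper's signature and the shape of neighboring opcode cases (the actual code in this PR may differ):

case TARGET(ADD_INT): {
    PyObject *left = TOP();
    PyObject *result;
    if (PyLong_CheckExact(left)) {
        /* Fast path: add the small constant via the new
           longobject.c helper, skipping PyNumber_Add(). */
        result = _PyLong_AddInt((PyLongObject *)left, oparg);
        Py_DECREF(left);
        SET_TOP(result);
        if (result == NULL) {
            goto error;
        }
        DISPATCH();
    }
    /* Fallback: behave exactly like LOAD_CONST + BINARY_ADD. */
    PyObject *right = PyLong_FromLong(oparg);
    if (right == NULL) {
        goto error;
    }
    result = PyNumber_Add(left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(result);
    if (result == NULL) {
        goto error;
    }
    DISPATCH();
}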

I'm creating this PR in draft mode because I haven't validated the speedup yet (there definitely is some in a micro-benchmark) and because I have ideas for more opcodes that could be added this way. In particular, the following seem promising:

  • LOAD_CONST(None) + IS_OP + POP_JUMP_IF_{TRUE,FALSE}
  • (LOAD_CONST or LOAD_FAST) + RETURN_VALUE
  • (CALL_METHOD or CALL_FUNCTION) + (POP_TOP or RETURN_VALUE)
  • GET_ITER + FOR_ITER

All these occur somewhat commonly as opcode pairs in an app I ran with DXPAIRS enabled, and have the following desirable properties:

  • At most one of the opcodes has an argument. (For this purpose, we treat IS_OP(0) and IS_OP(1) as different opcodes.)
  • Neither opcode seems a good candidate for type specialization (e.g. inline caching).

We also need to take care that we don't combine instructions on different line numbers.

TODO

  • Get benchmark numbers
  • Fix failure in test_buffer (items[i] = items[i+1] # IndexError: list index out of range)
  • Rename INT_ADD to ADD_INT
  • Adopt Mark's suggestion in longobject.c
  • Fix other test failures (e.g. test_dis)
  • Create bpo issue and add here
  • Add news blurb
  • Decide which other opcodes to do

https://bugs.python.org/issue43684

@Fidget-Spinner (Member) commented Mar 30, 2021:

Hi Guido, I ran some preliminary benchmarks comparing the main branch against this PR's implementation. (Some tests errored out with an error message that looks like self[args[i]] = args[i + 1] # IndexError: list index out of range; this looks like the same error raised in test_buffer in the test suite.)

pyperformance on average sees no change (please take these benchmarks with a grain of salt; I'm quite unhappy with how much variation there is for benchmarks which shouldn't even be affected :( ):

pyperformance compare output

### 2to3 ###
Mean +- std dev: 974 ms +- 9 ms -> 968 ms +- 9 ms: 1.01x faster
Not significant

### chameleon ###
Mean +- std dev: 28.3 ms +- 0.5 ms -> 28.2 ms +- 0.4 ms: 1.00x faster
Not significant

### chaos ###
Mean +- std dev: 330 ms +- 3 ms -> 323 ms +- 3 ms: 1.02x faster
Significant (t=11.94)

### crypto_pyaes ###
Mean +- std dev: 349 ms +- 4 ms -> 344 ms +- 3 ms: 1.01x faster
Not significant

### deltablue ###
Mean +- std dev: 24.6 ms +- 0.4 ms -> 24.6 ms +- 0.5 ms: 1.00x slower
Not significant

### dulwich_log ###
Mean +- std dev: 199 ms +- 6 ms -> 196 ms +- 2 ms: 1.01x faster
Not significant

### fannkuch ###
Mean +- std dev: 1.43 sec +- 0.01 sec -> 1.42 sec +- 0.01 sec: 1.01x faster
Not significant

### float ###
Mean +- std dev: 333 ms +- 6 ms -> 330 ms +- 4 ms: 1.01x faster
Not significant

### genshi_text ###
Mean +- std dev: 84.8 ms +- 1.4 ms -> 85.3 ms +- 1.2 ms: 1.01x slower
Not significant

### genshi_xml ###
Mean +- std dev: 188 ms +- 3 ms -> 184 ms +- 3 ms: 1.02x faster
Not significant

### go ###
Mean +- std dev: 705 ms +- 11 ms -> 704 ms +- 7 ms: 1.00x faster
Not significant

### hexiom ###
Mean +- std dev: 29.8 ms +- 0.4 ms -> 29.6 ms +- 0.4 ms: 1.01x faster
Not significant

### json_dumps ###
Mean +- std dev: 39.3 ms +- 0.7 ms -> 40.2 ms +- 1.8 ms: 1.02x slower
Significant (t=-3.34)

### json_loads ###
Mean +- std dev: 75.9 us +- 0.9 us -> 76.8 us +- 2.0 us: 1.01x slower
Not significant

### logging_format ###
Mean +- std dev: 30.8 us +- 0.5 us -> 30.5 us +- 0.4 us: 1.01x faster
Not significant

### logging_silent ###
Mean +- std dev: 582 ns +- 16 ns -> 573 ns +- 16 ns: 1.02x faster
Not significant

### logging_simple ###
Mean +- std dev: 28.0 us +- 0.3 us -> 27.6 us +- 0.4 us: 1.01x faster
Not significant

### mako ###
Mean +- std dev: 48.8 ms +- 0.7 ms -> 48.7 ms +- 0.5 ms: 1.00x faster
Not significant

### meteor_contest ###
Mean +- std dev: 318 ms +- 3 ms -> 311 ms +- 5 ms: 1.02x faster
Significant (t=8.49)

### nbody ###
Mean +- std dev: 413 ms +- 6 ms -> 420 ms +- 7 ms: 1.02x slower
Not significant

### nqueens ###
Mean +- std dev: 307 ms +- 3 ms -> 305 ms +- 4 ms: 1.00x faster
Not significant

### pathlib ###
Mean +- std dev: 57.8 ms +- 1.0 ms -> 58.1 ms +- 1.5 ms: 1.01x slower
Not significant

### pickle ###
Mean +- std dev: 31.7 us +- 0.6 us -> 31.4 us +- 0.7 us: 1.01x faster
Not significant

### pickle_dict ###
Mean +- std dev: 74.1 us +- 0.7 us -> 73.6 us +- 0.7 us: 1.01x faster
Not significant

### pickle_list ###
Mean +- std dev: 11.3 us +- 1.0 us -> 10.9 us +- 0.2 us: 1.03x faster
Significant (t=2.82)

### pickle_pure_python ###
Mean +- std dev: 1.41 ms +- 0.02 ms -> 1.41 ms +- 0.02 ms: 1.00x faster
Not significant

### pidigits ###
Mean +- std dev: 477 ms +- 4 ms -> 476 ms +- 3 ms: 1.00x faster
Not significant

### pyflate ###
Mean +- std dev: 2.05 sec +- 0.02 sec -> 2.01 sec +- 0.02 sec: 1.02x faster
Not significant

### python_startup ###
Mean +- std dev: 23.3 ms +- 0.2 ms -> 23.3 ms +- 0.2 ms: 1.00x slower
Not significant

### python_startup_no_site ###
Mean +- std dev: 15.5 ms +- 0.3 ms -> 15.6 ms +- 0.2 ms: 1.00x slower
Not significant

### raytrace ###
Mean +- std dev: 1.59 sec +- 0.03 sec -> 1.60 sec +- 0.02 sec: 1.00x slower
Not significant

### regex_compile ###
Mean +- std dev: 520 ms +- 6 ms -> 509 ms +- 8 ms: 1.02x faster
Significant (t=7.76)

### regex_dna ###
Mean +- std dev: 500 ms +- 5 ms -> 505 ms +- 4 ms: 1.01x slower
Not significant

### regex_effbot ###
Mean +- std dev: 8.42 ms +- 0.10 ms -> 8.57 ms +- 0.20 ms: 1.02x slower
Not significant

### regex_v8 ###
Mean +- std dev: 68.5 ms +- 1.0 ms -> 67.6 ms +- 1.0 ms: 1.01x faster
Not significant

### richards ###
Mean +- std dev: 244 ms +- 2 ms -> 241 ms +- 3 ms: 1.01x faster
Not significant

### scimark_fft ###
Mean +- std dev: 1.27 sec +- 0.01 sec -> 1.26 sec +- 0.01 sec: 1.00x faster
Not significant

### scimark_lu ###
Mean +- std dev: 539 ms +- 10 ms -> 551 ms +- 35 ms: 1.02x slower
Significant (t=-2.78)

### scimark_monte_carlo ###
Mean +- std dev: 340 ms +- 4 ms -> 335 ms +- 6 ms: 1.02x faster
Not significant

### scimark_sor ###
Mean +- std dev: 642 ms +- 15 ms -> 628 ms +- 11 ms: 1.02x faster
Significant (t=5.63)

### scimark_sparse_mat_mult ###
Mean +- std dev: 18.0 ms +- 0.2 ms -> 18.1 ms +- 0.2 ms: 1.01x slower
Not significant

### spectral_norm ###
Mean +- std dev: 447 ms +- 3 ms -> 429 ms +- 4 ms: 1.04x faster
Significant (t=26.99)

### sqlalchemy_declarative ###
Mean +- std dev: 459 ms +- 10 ms -> 459 ms +- 9 ms: 1.00x slower
Not significant

### sqlalchemy_imperative ###
Mean +- std dev: 85.9 ms +- 2.0 ms -> 84.1 ms +- 1.8 ms: 1.02x faster
Significant (t=5.21)

### sqlite_synth ###
Mean +- std dev: 8.17 us +- 0.20 us -> 8.46 us +- 0.25 us: 1.04x slower
Significant (t=-7.08)

### telco ###
Mean +- std dev: 21.0 ms +- 0.8 ms -> 21.6 ms +- 0.9 ms: 1.03x slower
Significant (t=-3.59)

### tornado_http ###
Mean +- std dev: 439 ms +- 6 ms -> 439 ms +- 7 ms: 1.00x faster
Not significant

### unpack_sequence ###
Mean +- std dev: 148 ns +- 3 ns -> 147 ns +- 3 ns: 1.01x faster
Not significant

### unpickle ###
Mean +- std dev: 42.1 us +- 1.3 us -> 41.6 us +- 0.7 us: 1.01x faster
Not significant

### unpickle_list ###
Mean +- std dev: 12.8 us +- 0.1 us -> 13.0 us +- 0.2 us: 1.02x slower
Not significant

### unpickle_pure_python ###
Mean +- std dev: 1.00 ms +- 0.02 ms -> 1.01 ms +- 0.02 ms: 1.01x slower
Not significant

### xml_etree_generate ###
Mean +- std dev: 295 ms +- 4 ms -> 293 ms +- 4 ms: 1.01x faster
Not significant

### xml_etree_iterparse ###
Mean +- std dev: 314 ms +- 4 ms -> 311 ms +- 4 ms: 1.01x faster
Not significant

### xml_etree_parse ###
Mean +- std dev: 428 ms +- 5 ms -> 432 ms +- 6 ms: 1.01x slower
Not significant

### xml_etree_process ###
Mean +- std dev: 237 ms +- 4 ms -> 236 ms +- 4 ms: 1.00x faster
Not significant

Skipped 5 benchmarks only in master_pyperformance.json: django_template, sympy_expand, sympy_integrate, sympy_str, sympy_sum

Microbenchmarks show some noticeable speedups:

# pyperf timeit -s "x = 1" "x + 1"
Mean +- std dev: [master_x+1] 46.4 ns +- 1.0 ns -> [addint_x+1] 35.2 ns +- 1.2 ns: 1.32x faster

# pyperf timeit "for x in range(255): x + 1"
Mean +- std dev: [master_range255] 13.8 us +- 0.3 us -> [addint_range255] 11.8 us +- 0.3 us: 1.16x faster

@gvanrossum (Member, Author) replied:

> Hi Guido, I ran some preliminary benchmarks comparing the main branch against this PR's implementation

Thanks @Fidget-Spinner!

> (Some tests errored out with an error message that looks like self[args[i]] = args[i + 1] # IndexError: list index out of range; this looks like the same error raised in test_buffer in the test suite.)

That's disturbing, I will look into that.

> pyperformance on average sees no change (please take these benchmarks with a grain of salt; I'm quite unhappy with how much variation there is for benchmarks which shouldn't even be affected :( ):

Yeah, this is what I've experienced running pyperformance as well. It's really hard to move the needle, and there's so much noise that, for changes expected to give less than a 5% improvement in the general case, the best we can hope for from the benchmarks is confidence that we haven't accidentally pessimized things. Your timeit-based microbenchmarks look like what I expected.

Maybe we'll be able to move the needle by adding a few other similar changes to this same PR (that saves bumps in the pyc magic number as well :-).

@gvanrossum changed the title from "Add specialized INT_ADD opcode" to "Add specialized opcodes" on Mar 30, 2021
@gvanrossum (Member, Author) commented:

(FWIW the reason I didn't finish generating bytecode using RETURN_CONST and RETURN_NONE is that this causes a large number of failures in test_dis.py. I need to strategize on what to do about those.)

Python/compile.c (outdated):

@@ -6910,14 +6910,14 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
             }
         }
         break;
-#if 0
+#if 1
@gvanrossum (Member, Author) commented on the diff:

I get a scary error in test_sys_settrace.py:

PS C:\Users\gvanrossum\cpython> .\PCbuild\amd64\python.exe -m test test_sys_settrace
0:00:00 Run tests sequentially
0:00:00 [1/1] test_sys_settrace
Fatal Python error: PyFrame_BlockPop: block stack underflow
Python runtime state: initialized

Current thread 0x00005704 (most recent call first):
  File "C:\Users\gvanrossum\cpython\lib\test\test_sys_settrace.py", line 1550 in test_jump_over_return_in_try_finally_block
  File "C:\Users\gvanrossum\cpython\lib\test\test_sys_settrace.py", line 1179 in run_test
  File "C:\Users\gvanrossum\cpython\lib\test\test_sys_settrace.py", line 1207 in test
  File "C:\Users\gvanrossum\cpython\lib\unittest\case.py", line 549 in _callTestMethod
  File "C:\Users\gvanrossum\cpython\lib\unittest\case.py", line 592 in run
  File "C:\Users\gvanrossum\cpython\lib\unittest\case.py", line 652 in __call__
  File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 122 in run
  File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 84 in __call__
  File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 122 in run
  File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 84 in __call__
  File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 122 in run
  File "C:\Users\gvanrossum\cpython\lib\unittest\suite.py", line 84 in __call__
  File "C:\Users\gvanrossum\cpython\lib\test\support\testresult.py", line 169 in run
  File "C:\Users\gvanrossum\cpython\lib\test\support\__init__.py", line 959 in _run_suite
  File "C:\Users\gvanrossum\cpython\lib\test\support\__init__.py", line 1082 in run_unittest
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 210 in _test_module
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 246 in _runtest_inner2
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 282 in _runtest_inner
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 154 in _runtest
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\runtest.py", line 194 in runtest
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 423 in run_tests_sequential
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 521 in run_tests
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 694 in _main
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 641 in main
  File "C:\Users\gvanrossum\cpython\lib\test\libregrtest\main.py", line 719 in main
  File "C:\Users\gvanrossum\cpython\lib\test\__main__.py", line 2 in <module>
  File "C:\Users\gvanrossum\cpython\lib\runpy.py", line 86 in _run_code
  File "C:\Users\gvanrossum\cpython\lib\runpy.py", line 196 in _run_module_as_main

Extension modules: _testcapi, _overlapped (total: 2)

@markshannon You changed that file last. Should I just give up on RETURN_CONST/RETURN_NONE for now?

A member commented:

Note: _overlapped is listed as a 3rd party C extension, whereas it's part of the stdlib. I forgot this one and I created #25122 to add it to sys.stdlib_module_names :-)

@markshannon (Member) commented:

I don't think adding a bunch of new opcodes to one PR is a good idea. It makes it difficult to review and get merged.

Python/compile.c (outdated):

    cnt = PyList_GET_ITEM(consts, oparg);
    inst->i_opcode = RETURN_CONST;
    // cnt == Py_None ? RETURN_NONE : RETURN_CONST;
    bb->b_instr[i+1].i_opcode = NOP;
A member commented:

This makes the last instruction of the BB a non-terminator. Either put the NOP first or shorten the BB by one.
You'll need to add RETURN_CONST as a BB terminator to all the code that cares about such things.
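
Concretely, the suggested reordering might look like this (a sketch only; the variable names follow the surrounding peephole code):

/* Put the NOP first so the basic block still ends in a return. */
inst->i_opcode = NOP;                       /* was LOAD_CONST */
bb->b_instr[i+1].i_opcode = RETURN_CONST;   /* was RETURN_VALUE */
bb->b_instr[i+1].i_oparg = oparg;           /* index of the constant */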

@gvanrossum (Member, Author) commented:

I was hoping to get a bunch of opcodes in so that together they would show an improvement in the benchmark, but I think I've bitten off more than I could chew in one PR, so I'll remove the new opcodes other than ADD_INT later today.

@smontanaro (Contributor) commented:

How many potential new opcodes are in the pipeline? My register VM squished the stack opcodes down towards zero, with my new stuff above that. Obviously, if you get much past 128 opcodes I will have some extra thinking to do. :-)

gvanrossum added a commit to faster-cpython/cpython that referenced this pull request Mar 31, 2021
@gvanrossum (Member, Author) replied:

> How many potential new opcodes are in the pipeline? My register VM squished the stack opcodes down towards zero, with my new stuff above that. Obviously, if you get much past 128 opcodes I will have some extra thinking to do. :-)

I've got 6 in the works so far, but potentially dozens more. However, for 3.10 I don't expect to be doing more than 4-6.

@gvanrossum (Member, Author) commented:

Hm. GitHub shows that I pushed e8e979f, but that's in a different branch (trying to save some of the work I rolled back here). Looking at the File Changes tab, none of the changes in e8e979f are included in this PR. It seems to be a GitHub bug that it shows up on the Conversation page here. Please ignore it.

Include/longobject.h:

@@ -212,6 +212,7 @@ PyAPI_FUNC(PyObject *) _PyLong_GCD(PyObject *, PyObject *);
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(PyObject *) _PyLong_Rshift(PyObject *, size_t);
 PyAPI_FUNC(PyObject *) _PyLong_Lshift(PyObject *, size_t);
+PyAPI_FUNC(PyObject *) _PyLong_AddInt(PyLongObject *, int);
A member commented:
Should this API be public?
If not, please move it to Include/internal.
If yes, how about using Py_ssize_t instead of int?

@gvanrossum (Member, Author) replied:

It should not be public. I will move it. I am still not used to what goes where now that we've got several different subdirectories of header files, plus "LIMITED_API" defines.
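
For reference, a simplified sketch of the helper using only public APIs; the actual version in this PR works on the PyLong digit internals directly, and the int argument is assumed to be in range(256) per the opcode:

PyObject *
_PyLong_AddInt(PyLongObject *a, int b)
{
    int overflow = 0;
    long v = PyLong_AsLongAndOverflow((PyObject *)a, &overflow);
    if (!overflow && v < LONG_MAX - 256) {
        /* Common case: fits in a C long and v + b cannot overflow. */
        return PyLong_FromLong(v + b);
    }
    /* Rare case: a huge int; fall back to the general path. */
    PyObject *pyb = PyLong_FromLong((long)b);
    if (pyb == NULL) {
        return NULL;
    }
    PyObject *res = PyNumber_Add((PyObject *)a, pyb);
    Py_DECREF(pyb);
    return res;
}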

@Fidget-Spinner (Member) commented:

Guido (and anyone else on this thread), do you think emitting ADD_INT for not just x + 1 but also 1 + x sounds reasonable? It may help reach more code, and also it seems a little funny that x + 1 is now magically way faster than 1 + x 😉 . I made a small patch to add that in.

I'm a little wary of this though, because it combines across instructions (though not lines!) and slightly changes the instruction order. With the patch, 1 + x and x + 1 produce exactly the same bytecode.

Running the count_opcodes script against ./Lib brings the number from 529 to only 531. I'm honestly surprised at how little that number increased :(. I don't think this will increase the performance of long addition chains by much either; the current implementation already does a good job of capturing those.

patch
diff --git a/Python/compile.c b/Python/compile.c
index b6fbb500d2..b7f44654ae 100644
--- a/Python/compile.c
+++ b/Python/compile.c
@@ -6835,6 +6835,7 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
         struct instr *inst = &bb->b_instr[i];
         int oparg = inst->i_oparg;
         int nextop = i+1 < bb->b_iused ? bb->b_instr[i+1].i_opcode : 0;
+        int nextnextop = i+2 < bb->b_iused ? bb->b_instr[i+2].i_opcode : 0;
         if (is_jump(inst)) {
             /* Skip over empty basic blocks. */
             while (inst->i_target->b_iused == 0) {
@@ -6848,6 +6849,7 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
         switch (inst->i_opcode) {
             /* Remove LOAD_CONST const; conditional jump */
             /* Also optimize LOAD_CONST(small_int) + BINARY_ADD */
+            /* Also optimize LOAD_CONST(small_int) + LOAD_NAME + BINARY_ADD*/
             case LOAD_CONST:
             {
                 PyObject* cnt;
@@ -6904,6 +6906,23 @@ optimize_basic_block(basicblock *bb, PyObject *consts)
                             }
                         }
                         break;
+                    case LOAD_NAME:
+                        if (nextnextop != BINARY_ADD) {
+                            break;
+                        }
+                        cnt = PyList_GET_ITEM(consts, oparg);
+                        if (PyLong_CheckExact(cnt) &&
+                            inst->i_lineno == bb->b_instr[i+2].i_lineno) {
+                            int ovf = 0;
+                            long val = PyLong_AsLongAndOverflow(cnt, &ovf);
+                            if (ovf == 0 && val >= 0 && val < 256) {
+                                bb->b_instr[i+2].i_opcode = ADD_INT;
+                                bb->b_instr[i+2].i_oparg = val;
+                                inst->i_opcode = NOP;
+                                break;
+                            }
+                        }
+                        break;
                 }
                 break;
             }
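
For concreteness, the rewrite this patch performs on 1 + x (hypothetical disassembly, assuming a later pass drops the NOP):

    before:  LOAD_CONST 1;  LOAD_NAME x;  BINARY_ADD
    after:   NOP;           LOAD_NAME x;  ADD_INT 1

which, modulo the NOP, is the same LOAD_NAME x; ADD_INT 1 sequence that x + 1 already compiles to.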

gvanrossum and others added 5 commits April 3, 2021 13:40
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
@sweeneyde (Member) commented Apr 4, 2021:

Idea: interpret the ADD_INT oparg as a signed int in [-128..127] to get a two-behaviors-for-one-opcode deal, with an equally-fast fast-path:

PyObject *left = TOP();
PyObject *result, *right;
int signed_oparg = (int8_t)oparg;
if (PyLong_CheckExact(left)) {
    result = _PyLong_AddInt((PyLongObject *)left, signed_oparg);
    Py_DECREF(left);
    SET_TOP(result);
    if (result == NULL) {
        goto error;
    }
    DISPATCH();
}
if (signed_oparg >= 0) {
    right = PyLong_FromLongLong(signed_oparg);
    if (right == NULL) {
        goto error;
    }
    result = PyNumber_Add(left, right);
}
else {
    right = PyLong_FromLongLong(-signed_oparg);
    if (right == NULL) {
        goto error;
    }
    result = PyNumber_Subtract(left, right);
}
Py_DECREF(left);
Py_DECREF(right);
SET_TOP(result);
if (result == NULL) {
    goto error;
}
DISPATCH();

Then you could compile x - 10 as ADD_INT -10. Technically you couldn't optimize x - 0 (there's no distinct encoding for subtracting zero, and the non-int fallback must call PyNumber_Subtract rather than PyNumber_Add), but I would imagine that is uncommon.

But I also imagine that being able to optimize things like while i <= n - 1: could be nice.

@gvanrossum (Member, Author) replied:

Clever; I will look into whether this catches more occurrences.

@gvanrossum (Member, Author) commented:

Well, I think I'm going to close this PR. I haven't found any example programs that execute a significant number of ADD_INT operations (measured at runtime this time). The typical percentage seems around 0.5%. Thanks to everyone who thought of refinements:

  • (@Fidget-Spinner) Also do this for "1 + x" -- that would not be correct if x isn't an int (or float), since the order in which __add__ and __radd__ are tried matters. For a crude example, the error messages for ""+1 and 1+"" are different.
  • (@sweeneyde) Not a bad idea (and you handle the fallback case correctly), but I doubt it would pull this over the line.

I have some similar ideas for if x is None and if x is not None but I'll do some more research on how many of these I can expect to execute first...
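
(Roughly, such a fused opcode could look like the hypothetical sketch below, replacing LOAD_CONST(None) + IS_OP + POP_JUMP_IF_TRUE; the name and details are made up:)

case TARGET(POP_JUMP_IF_NONE): {
    PyObject *value = POP();
    int is_none = (value == Py_None);
    Py_DECREF(value);
    if (is_none) {
        JUMPTO(oparg);   /* taken branch: x is None */
    }
    DISPATCH();
}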

@vstinner (Member) commented Apr 6, 2021:

> Well, I think I'm going to close this PR. I haven't found any example programs that execute a significant number of ADD_INT operations (measured at runtime this time). The typical percentage seems around 0.5%.

Sorry for you. So my comment remains relevant :-)

A good approach to optimize int+int is to use Cython or anything else to specialize a function for int+int operations, and implement the addition in C rather than using Python bytecode. The bytecode evaluation loop cost is too high compared to the cost of doing int+int. PyPy is able to specialize the code using C number types and handles integer overflow (the tricky part).

Another idea to experiment with is adding a cache for the "virtual table lookup", something like https://bugs.python.org/issue14757 "INCA: Inline Caching meets Quickening in Python 3.3". One part of the PyNumber_Add() cost comes from binary_op1(), which needs to look at the types of both objects to decide which function should be called. The ceval.c "OPCACHE" (per-code-object opcode cache) might be used, but I don't know if it can be adapted for this specific optimization.
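
As a rough illustration of the caching idea (hypothetical code, not from any existing patch): remember the last pair of operand types and the resolved nb_add slot, and skip the generic lookup on a hit. A real cache would have to mirror all of binary_op1()'s rules (subclass priority, __radd__, etc.).

typedef struct {
    PyTypeObject *left_type;
    PyTypeObject *right_type;
    binaryfunc add;                 /* cached nb_add slot */
} add_cache;

static PyObject *
cached_add(add_cache *cache, PyObject *left, PyObject *right)
{
    if (cache->add != NULL &&
        Py_TYPE(left) == cache->left_type &&
        Py_TYPE(right) == cache->right_type)
    {
        PyObject *res = cache->add(left, right);
        if (res != Py_NotImplemented) {
            return res;             /* cache hit (or error: res == NULL) */
        }
        Py_DECREF(res);             /* rare: retry via the generic path */
    }
    /* Miss: do the generic lookup and remember the slot for next time. */
    cache->left_type = Py_TYPE(left);
    cache->right_type = Py_TYPE(right);
    cache->add = Py_TYPE(left)->tp_as_number
                     ? Py_TYPE(left)->tp_as_number->nb_add : NULL;
    return PyNumber_Add(left, right);
}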

@gvanrossum (Member, Author) replied:

> A good approach to optimize int+int is to use Cython or anything else to specialize a function for int+int operations, and implement the addition in C rather than using Python bytecode.

I don't understand what you're saying. Are you saying that if you have a lot of additions you're better off using Cython etc.? I can't argue with that (and numpy comes to mind :-).

> The bytecode evaluation loop cost is too high compared to the cost of doing int+int. PyPy is able to specialize the code using C number types and handles integer overflow (the tricky part).

It wasn't the eval loop cost. In a micro-benchmark I saw a considerable speedup for ADD_INT compared to LOAD_CONST + BINARY_ADD. It's just that in most code x+1 just doesn't occur frequently enough to make a difference.

I'm aware of the Quickening approach and we'll probably do something along these lines eventually.

@vstinner (Member) commented Apr 6, 2021:

> Are you saying that if you have a lot of additions you're better off using Cython etc.?

Yes, it's an existing solution until CPython itself is optimized. Sadly, using Cython requires changing the code.

@github-actions (bot) commented Jun 3, 2021:

This PR is stale because it has been open for 30 days with no activity.

@github-actions bot added the "stale" label (Stale PR or inactive for long period of time) on Jun 3, 2021
@gvanrossum (Member, Author) commented:

This PR was abandoned but not closed. Hopefully @brandtbucher can sort out what we should do -- close it, merge it, or do something else.

@kumaraditya303 (Contributor) commented:

ceval.c has specialized bytecode for int operations now, so this isn't needed anymore. Can we close it?

@kumaraditya303 added the "pending" label (The issue will be closed if no feedback is provided) and removed the "stale" label on Aug 1, 2022
@gvanrossum (Member, Author) replied:

Yeah, this is superseded by more recent changes.

@gvanrossum closed this Aug 1, 2022
@gvanrossum deleted the int-add branch Aug 1, 2022 20:37
Labels: awaiting core review, pending