Py311 is 10-20% slower in loops of empty and variant assignments than Py38, Py39 and Py310 #420
Can you show the code you used to measure this? |
env: see the uploaded video. result: Python 3.10.5, Python 3.9.13. code: each variant was timed with t0 = time.perf_counter_ns() before and after the loop. |
Can you try again with the whole thing inside a function (which you call just once)? Your current code writes globals (writes to globals are slower than writes to function locals). |
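As a minimal sketch of the suggestion above (the function name and constants are mine, not from the thread), moving the timing loop into a function makes `i` a fast local instead of a module-level global:

```python
import time

def bench(n=10_000_000):
    # Inside a function, `i` uses fast local-variable slots;
    # at module level every iteration would store into the globals dict.
    t0 = time.perf_counter_ns()
    for i in range(n):
        pass
    return (time.perf_counter_ns() - t0) / n

print(f"{bench():.2f} ns per iteration")
```

A single run like this is still noisy; it only removes the global-write overhead from the measurement.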
They're all faster, but Python 3.11 is still 10-20% slower than Python 3.10 for empty loops. (Per-case timings for the empty loop, the a = i assignment, and the sin call followed.) |
please use pyperformance or similar tool to call function hundreds of times. |
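pyperf itself isn't shown in this capture; as a stdlib-only approximation of "call the function hundreds of times" (a sketch, without pyperf's process isolation and calibration), `timeit.repeat` can drive the benchmark function:

```python
import timeit

def empty_loop(n=100_000):
    for i in range(n):
        pass

# Five runs of 100 calls each; taking the minimum of the runs
# reduces the influence of background noise.
runs = timeit.repeat(empty_loop, number=100, repeat=5)
print(f"best: {min(runs) / 100 * 1e3:.3f} ms per call")
```

Comparing this figure across interpreter versions is more stable than one-shot `perf_counter_ns` deltas, though still weaker than pyperf.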
I think it would still be good to know if Python 3.11 has performance regressions for code that doesn't call the same function multiple times (such as in simple scripts), so I tried recreating the OP's test cases in a way that others can hopefully reproduce more easily. Test environment: I have not set up anything special in terms of performance control; I am just running the tests several times to see whether there is a consistent and large enough difference to draw any conclusions (though my PC is well ventilated, so there is no thermal throttling). Test execution format: (the commands and result tables for Test 1, Test 2, and Test 3 were posted inline.) Not sure this information helps anyone, but there it is. |
The math.sin() test is pretty uninteresting. |
It always seems to be very hard to get stable Windows benchmarks. I got these results:
|
AMD Ryzen 9 5900HX with Radeon Graphics, 3.30 GHz. I got "WARNING: unable to increase process priority" while measuring with py311.
|
Does installing psutil for 3.11 make a difference? |
I can repro the results from @wangyi041228 on my Windows laptop (although I wasn't very rigorous). I'm not sure why we see such different results where @sweeneyde saw basically no difference (at least, the difference of the means was a fraction of the std dev) -- it'd be surprising if the background noise always affected 3.11 more than 3.10. We talked about this in the Faster CPython perf team and did not come up with any good explanation (although we expect the issue doesn't affect real code, since there's always some code in the loop). (@markshannon @brandtbucher) I'm sure my Python 3.10.5 and 3.11.0b3 binaries all came from python.org, and I have to assume they were built using the same toolchain. I reproduced (at least some of) the difference on my Mac as well, so I don't think it's the MSVC compiler. I looked at the disassembly, and the heart of the loop is equivalent in 3.10 and 3.11: 3.10:
3.11:
Interestingly, getting rid of parts of the loop didn't account for it. Here's a thought: could it be the memory allocator?
|
Agreed, though I think it helps that my machine has a high count of homogeneous CPU cores, there is no thermal throttling, and I have all virus scanning turned off. I created two virtual environments with pyperf and psutil (and both my Python installs are from python.org). I ran pyperf several times, interleaving the runs between Python 3.10 and 3.11, and I never saw the mean change more than 2% for any of the runs: Python 3.10
Python 3.11
Though on my machine the only case I can reproduce is the empty loop being slower on 3.11; as soon as the loop has code in it, I don't see any performance regression. |
To see what's actually being run on 3.11+, pass adaptive=True to dis.dis(). |
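For instance (a sketch; the `adaptive` keyword exists only on 3.11+), disassembling after a warm-up call shows any specialized instructions the adaptive interpreter has installed:

```python
import dis
import sys

def f():
    for i in range(10):
        pass

f()  # warm up so the adaptive interpreter has a chance to specialize

if sys.version_info >= (3, 11):
    # adaptive=True shows the quickened/specialized opcodes actually executed
    dis.dis(f, adaptive=True)
else:
    dis.dis(f)  # older versions: only the static bytecode is available
```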
For those who can reproduce: how does the slowdown vary with the number of iterations? Is it constant, or does it grow larger/smaller for different range sizes? |
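A rough way to probe that question (a sketch; single runs, so the numbers are noisy and would need pyperf-style repetition to trust):

```python
import time

def per_iter_ns(n):
    t0 = time.perf_counter_ns()
    for i in range(n):
        pass
    return (time.perf_counter_ns() - t0) / n

# If the regression is a fixed per-iteration cost, the per-iteration time
# should stay roughly flat as n grows; a constant setup cost would
# instead fade away at larger n.
for n in (10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: {per_iter_ns(n):6.2f} ns/iter")
```

Running this under both interpreters and comparing the columns at each n would show whether the gap is per-iteration or fixed.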
The single-digit fast-path for int allocation and deallocation is one suspect. Do we see a similar slowdown when dropping the loop into C with:
This version simply exhausts the iterator with as little overhead as possible. It might give us an idea of whether this slowdown is happening in the VM or not. |
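The snippet referenced above didn't survive this capture; a later comment mentions a `deque` variant, and a common way to exhaust an iterator with minimal Python-level overhead is `collections.deque` with `maxlen=0` (a sketch, not necessarily the exact code that was posted):

```python
import time
from collections import deque

def exhaust_ns(n=10_000_000):
    # deque(..., maxlen=0) pulls every item inside the C implementation
    # and discards it, so no bytecode runs per iteration.
    it = iter(range(n))
    t0 = time.perf_counter_ns()
    deque(it, maxlen=0)
    return (time.perf_counter_ns() - t0) / n

print(f"{exhaust_ns():.2f} ns per item")
```

If this variant shows the same slowdown as the bytecode loop, the eval loop itself is unlikely to be the culprit.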
I tried it on Windows with 10k iterations and with 10M iterations, and the % difference between 3.10 and 3.11 is the same for each (9% or so). The variant using deque also shows a comparable slowdown. |
Hm, so that would seem to point to a slowdown in |
Wouldn't it have had to be a lot slower to show a 10% slowdown on those three bytecode instructions? Is the allocator at all under suspicion at this point? |
Depends what you consider “a lot slower”, I guess. All this code is doing is creating and destroying tons of single-digit integers. If that specific code got 10-15% slower, it could possibly make this code 5-10% slower without really affecting anything else.
Maybe. I’m not sure I know enough about the object allocator to have any ideas there, but I would assume that a degradation like that would affect pretty much everything. Unless this is some edge case where we’re repeatedly allocating and freeing pools of memory or something. |
Okay, I came back to this for a bit and I too can reproduce it on a Windows machine. For the record, I only tried the python.org installers. I ran more experiments (using the
My best theory is still that single-digit int creation and destruction got slower. |
Anyways, I'll stop here, because I'm not really sure how to proceed further. Also, I'm not really convinced that this is severe enough to spend more time investigating. 🙃 |
If you want to isolate the cost of iterating over a range object, measure that on its own. There have been changes to the default int digit size between releases. |
So does this mean that DIGIT is only 15 bits on Windows? That might explain a lot. (Though note that I also saw a 6% slowdown on my Mac -- I wasn't very rigorous there though.) |
Can we use |
This might be a relevant change in 3.11: python/cpython#89732 (from 15 bits sometimes to always 30 bits by default) |
Can someone with access to the Windows tooling verify that there's a perf drop for this example around the commit that changed the digit size default? |
Um, sorry, scratch that. @brandtbucher is looking into definitive proof of where the slowdown comes from. |
Just checked, and both builds I was using to get these numbers have the same digit size:
PS C:\Users\brandtbucher> py -3.11
Python 3.11.0b3 (main, Jun 1 2022, 13:29:14) [MSC v.1932 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
>>> exit()
PS C:\Users\brandtbucher> py -3.10
Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
>>> exit() |
Oh, whoops. Crossed posts. Yeah, I'm going to try making the new |
As for Windows, the v143 tools in VS2022 cause this issue: the PGO build only gets the same performance as a plain release build. It seems to me that the official binaries have been built with them since 3.11b1. |
@zooba Can you confirm that? |
I tried exercising the PGO training profile. Local x64 PGO builds, pyperformance, 3.11: (not significant)
Additional exercise:
Benchmark results:
This is off-topic, but the |
Can confirm that 3.11 is using the latest tools, yeah, and I think 3.10 still tries to use the one it originally released with. It shows up in the sys.version string - There are a number of other possibilities though. It may be that we need to adjust the profile to account for the changed interpreter loop. I assume it's not going to have the same proportion of "normal" vs. "unlikely" paths as it used to, which could affect training decisions. But that requires macro-benchmarks, not microbenchmarks, so it's less interesting to discuss ;-) |
It would be easy enough to rule out the profile as the reason for the slowdown: Compile 3.11 with v142 and v143 tools and compare the result. IIUC that's exactly what @neonene did, and the "v143 trained" vs. "v142 trained" columns seem to indicate that code compiled with v143 is always slower than v142, both for 3.10 and for 3.11. IIUC this is with the same profile. Also, comparing the (untrained) "v142" and "v143" columns, we see the same thing -- for both 3.10 and 3.11, code compiled with v143 is slower than v142. So that rather clearly points to v143 as the culprit. What am I missing that sheds doubt on this conclusion? Since we both work for Microsoft at this point it would make sense to invite some internal compiler backend expert to this discussion, right? And it's not unreasonable that there might be some regression here, given that the VS2022 compiler is still relatively young. |
Good point, you're right. We've had such a terrible time getting any traction from the compiler team on this. I suspect they're overwhelmed with actual errors that a regression without a micro-reproduction can't justify the attention. Our best bet is probably to convince the team to include a CPython build and perf suite in their test runs. I'm away from work for 2 weeks now, but Graham and/or Mike should be able to make the connections to start that discussion. |
Related commits on main (backported before Benchmark results:
|
Possibly naive question: could we make python.org 3.11 releases with the v142 compiler? Does that commit us to using that compiler for the life of the release, or are v142 and v143 ABI-compatible? The reason I'm asking is to understand the urgency. Is this something that needs to be resolved before 3.11rc1? I imagine no, both because (a) using an older compiler is fine, and (b) this benchmark isn't terribly representative of real-world code, so we'd be OK with this specific regression (even if it is a bona fide regression in the C compiler). |
Yes, otherwise wheels on Windows would not work.
This is possible, although it would require changing the build files; basically, a revert of python/cpython#29577 would suffice. |
How hard would it be to create a standalone C program to reproduce the regression? That would probably be the most helpful for the MSVC folks. If we suspect the allocator, maybe just something that simulates what the Python memory pools are doing in this instance? |
I suppose whether a simple static inline function with one or more branches, used heavily in a C file, gets inlined or not is a good indication of MSVC PGO quality.
You can use libc malloc instead of pymalloc with PYTHONMALLOC=malloc. |
Workaround to reduce the cost of the optimization:
Alternative:
Maybe there are more effective ways to help the optimizer. EDIT: It seems that this issue is about whether or not arena_map_get() gets inlined. |
@neonene: What type of CPU have you been measuring on? Our internal compiler contact is seeing something different on AMD vs. Intel (and maybe even the exact SKU may matter). |
A MS engineer suggested:
I'm not sure what the pure-C equivalent would be. |
@mdboom Sorry, I keep details private, but I'm an Intel user. |
I'm not sure my patch is correct, but I can build 3.11 with:
The performance did not change for me with |
Can you clarify -- did the 64-byte alignment not change performance, or did un-inlining |
My 3.11 got faster when I marked arena_map_get() as __forceinline. As for v142, I can see the slowdown by marking it __declspec(noinline). |
Hi, I'm the "compiler contact" :) Tried now with 3.12a0, and I confirm what @neonene noticed: __forceinlining arena_map_get with v143 shows a 15% speedup. There is some CSE of loads (store-to-load value propagation) that is enabled by the inlining. Looking at the number of instructions in PyObject_Free also shows the benefit of the inlining: without inlining, PyObject_Free + arena_map_get = 82 + 48 = 130 instructions. From what I know, there is right now no estimate of how inlining a function helps a later optimization like CSE; with PGO, inlining relies mostly on the PGO counters plus a few other simple heuristics. There is a more recent compiler flag that is also worth trying, /Ob3 – it increases the budget of allowed inlining and it could help in other spots. For arena_map_get, __forceinline is for now still the best option to make sure it always gets inlined. Thanks, |
Thanks @gratianlup! @neonene could you prepare a patch that adds the force inline to arena_map_get? Then we can put this to rest. |
PR has been posted: gh-94842 |
This is now fixed in main and in 3.11 (will be in 3.11b5). The backport to 3.10 was a little problematic, and won't be necessary unless we also switch compiler versions there. |
EDITED:
Py311 is 10-20% slower in big loops of empty or global-variable assignments than Py38, Py39 and Py310, measured with time or timeit.
Even though it's faster in other functions and libraries.