Generator finalization is slower in 3.11 vs 3.10 #100762
I can confirm this issue (comparing 3.10.8 to 3.12.0a3+). A shorter minimal example is:

```python
import timeit
import platform

setup = """
def go():
    item = [1, 2, 3]
    for ii in range(3000):
        all(item[k] == 2 for k in range(3))
"""
cmd = 'go()'
timeit_output = timeit.repeat(
    cmd, setup=setup, repeat=500, number=10,
)
average_time = sum(timeit_output) / len(timeit_output)
best_time = min(timeit_output)
tps_output = 1 / average_time
print(f"Python version = {platform.python_version()}")
print(f"Average time = {average_time}")
print(f"Best time = {best_time}")
print(f"Average transactions per second = {tps_output}")
```

The expression inside the all() call is a generator expression, so the regression may be in generator handling rather than in all() itself.
With git bisect I tried to identify the commit where the regression was introduced, but I could not pinpoint it (there seem to be multiple regressions, and not all commits build on my system). |
What makes you blame …? Do you see a difference with …? @eendebakpt What times did you get? |
@pochmann I executed the following code:
Results 3.10.9:
Results 3.12.0a3+:
I am not sure what to conclude from the numbers, but the last statement tested … |
There doesn't seem to be any difference in the implementation code of … |
Yes, indeed signs point to generators (or maybe generator expressions specifically) more than … |
EDIT: I wasn't measuring what I thought I was here. See below for corrected results.

I tried to see if the new overhead is in the setup/starting of the generator or the actual iteration (is the overhead constant per generator used, or does it scale with the number of items iterated over), and it seems it's maybe both...? I modified this script so that the number of keys in the dictionary is configurable (using only 3 here seems like a hard case if the regression is in generator creation), and then ran it for powers of 2 from 2 to 2048. It definitely gets better as the number of keys goes up, but it never gets on par with 3.10. Not sure what any of this means, but maybe some of the other Faster CPython folks have insight.

Code example with a variable number of keys (truncated in this copy):

```python
import timeit
from functools import partial
import sys

def find_by_keys(  # signature and body truncated
    ...

def main():
    ...

if __name__ == "__main__":
    ...
```
|
Some additional information -- here are the differences in the bytecode.
3.10 bytecode
3.11 bytecode
|
I think @mdboom's example is still dominated by generator creation, since the first 2999 loops will only advance the generator once, and the 3000th will advance it n times, then return. Perhaps put … |
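To illustrate Brandt's point with made-up data (not from the benchmark itself): `all()` short-circuits on the first falsy value, so for a non-matching item the generator expression is advanced only once and is then abandoned while still suspended.

```python
# Hypothetical values; the point is the short-circuit, not the data.
item = {"a": 1, "b": 2, "c": 3}
match_data = {"a": 9, "b": 2, "c": 3}   # first key already differs
keys = ["a", "b", "c"]

gen = (item[k] == match_data[k] for k in keys)
print(all(gen))    # False -- all() stops after the first comparison
print(list(gen))   # [True, True] -- the generator was left suspended, not exhausted
```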
Thanks, @brandtbucher. Indeed I was not measuring what I thought I was measuring. By putting …, I get different results: as the number of iterations in the generator goes up, it trends toward "at par" between 3.10 and 3.11. So I think it's fair to say that whatever regression exists is due to generator startup, not the time spent iterating over it. From the bytecode, it definitely is doing "more work" to start the generator, but I don't know if I could say what room there is for improvement. Code to test with a variable number of keys
|
This was bugging me, so I dug into it a bit more by running some nano-benchmarks to measure the performance of different parts of the generator life-cycle. Here are the times to...
Create a new generator:
Yield from a generator:
Return from a generator:
Destroy a completed generator:
Destroy a suspended generator:
Code here:

```python
import pyperf

LOOPS = 1 << 18

def g():
    yield

def bench_gen_create(loops: int) -> float:
    it = range(loops)
    start = pyperf.perf_counter()
    gens = [g() for _ in it]  # timed: creating the generators (kept alive in a list)
    return pyperf.perf_counter() - start

def bench_gen_yield(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    start = pyperf.perf_counter()
    for gen in gens:          # timed: advance each generator to its yield
        for _ in gen:
            break
    return pyperf.perf_counter() - start

def bench_gen_return(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    for gen in gens:          # untimed: advance each generator to its yield
        for _ in gen:
            break
    start = pyperf.perf_counter()
    for gen in gens:          # timed: resume past the yield so each generator returns
        for _ in gen:
            break
    return pyperf.perf_counter() - start

def bench_gen_destroy_completed(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    for gen in gens:
        for _ in gen:
            break
    for gen in gens:          # second pass exhausts the generators
        for _ in gen:
            break
    start = pyperf.perf_counter()
    del gens                  # timed: destroying already-completed generators
    return pyperf.perf_counter() - start

def bench_gen_destroy_suspended(loops: int) -> float:
    it = range(loops)
    gens = [g() for _ in it]
    for gen in gens:          # leave each generator suspended at its yield
        for _ in gen:
            break
    start = pyperf.perf_counter()
    del gens                  # timed: destroying suspended generators (finalization)
    return pyperf.perf_counter() - start

if __name__ == "__main__":
    runner = pyperf.Runner(loops=LOOPS)
    runner.bench_time_func("gen_create", bench_gen_create)
    runner.bench_time_func("gen_yield", bench_gen_yield)
    runner.bench_time_func("gen_return", bench_gen_return)
    runner.bench_time_func("gen_destroy_completed", bench_gen_destroy_completed)
    runner.bench_time_func("gen_destroy_suspended", bench_gen_destroy_suspended)
```

The conclusion seems pretty clear: everything about generators has gotten significantly faster, except for finalizing suspended generators (or, in other words, gen.close()). While it's certainly a relief that generators in general haven't gotten slower, I still think the finalization issue deserves a closer look (especially since the situation seems much worse in 3.12). Off the top of my head, I would guess it's due to our changes to exception handling (and/or frames) - the cost of raising exceptions has gotten higher in recent versions. Generators finalize themselves by throwing GeneratorExit into their suspended frames. |
@brandtbucher Thanks for the analysis.
So there must be something else going on here. |
In the above example, the code is creating generator expressions, not plain generators. |
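For clarity (a tiny check, not from the thread): a generator expression compiles to an implicit generator function, and both forms produce ordinary generator objects, so the life-cycle costs measured above apply to both.

```python
def g():
    for _ in ():
        yield _

print(type(g()))             # <class 'generator'>
print(type(_ for _ in ()))   # <class 'generator'>
```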
Looking back on the "create" benchmark, I think that the results are heavily skewed by the time taken to build a list of hundreds of thousands of new generators (benchmarking the creation, but not destruction, of generators is actually pretty tricky), as well as the time taken to look up the global name g. Here are some new numbers that avoid these issues:
Create and destroy:
Create and destroy (expression):
Create, exhaust, and destroy:
Create, exhaust, and destroy (expression):
Code:

```python
import pyperf

def g():
    """Equivalent to: (_ for _ in ())"""
    for _ in (): yield _

def bench_gen_create_destroy(loops: int) -> float:
    it = range(loops)
    g_fast = g  # avoid a global name lookup in the timed loop
    start = pyperf.perf_counter()
    for _ in it:
        g_fast()                  # created and immediately destroyed, never exhausted
    return pyperf.perf_counter() - start

def bench_gen_expr_create_destroy(loops: int) -> float:
    it = range(loops)
    start = pyperf.perf_counter()
    for _ in it:
        (_ for _ in ())           # same, but as a generator expression
    return pyperf.perf_counter() - start

def bench_gen_create_exhaust_destroy(loops: int) -> float:
    it = range(loops)
    g_fast = g
    start = pyperf.perf_counter()
    for _ in it:
        for _ in g_fast():        # iterated to completion before being destroyed
            pass
    return pyperf.perf_counter() - start

def bench_gen_expr_create_exhaust_destroy(loops: int) -> float:
    it = range(loops)
    start = pyperf.perf_counter()
    for _ in it:
        for _ in (_ for _ in ()):
            pass
    return pyperf.perf_counter() - start
```

You can clearly see that the time taken to create and destroy an unexhausted generator got much worse in 3.11, in a way that isn't reflected in the time taken to create, exhaust, and destroy the same generator. Also, the results are consistent across both "normal" generators and generator expressions... so I think that generator expressions themselves are a red herring here, and the issue is indeed finalization. (The 3.12 improvements are likely irrelevant, and due to our specialization of the benchmarks' …) |
Profiling with valgrind shows that the difference between … is in …. If we somehow avoid this expensive part for the common case, that would be great. While analyzing the code I noticed that in the check … at the end of … |
If this issue does get resolved via some patches to the main branch, is there any chance they can be applied to the publicly released Python 3.11 branch as well? Hoping to not wait another 12 months min to upgrade from Python 3.10 |
There's definitely a chance. It just depends on how complicated the fix is. We don't have a solution yet, so I can't comment further. |
The proposed fix in #101013 changes the bytecode, but only because it's necessary to handle other changes we made to how generators work in 3.12. The basic idea (don't call gen.throw() in gen.close() unless necessary) … |
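The commit message quoted later in the thread says the fix stores the exception stack depth in YIELD_VALUE's oparg so that gen.close() can skip the expensive gen.throw() where possible. As a rough behavioral illustration of why the throw is only sometimes needed (my own sketch, not the actual C implementation): a generator suspended outside any `try` cannot react to `GeneratorExit`, while one suspended inside `try`/`finally` must actually have the exception delivered so its cleanup runs.

```python
def plain():
    yield 1
    yield 2

def with_cleanup():
    try:
        yield 1
        yield 2
    finally:
        print("cleanup ran")

g1 = plain()
next(g1)
g1.close()   # nothing can catch GeneratorExit here; the frame can simply be dropped

g2 = with_cleanup()
next(g2)
g2.close()   # prints "cleanup ran": GeneratorExit really has to be thrown
```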
...that is, if we consider this a bad enough performance regression to call it a "bug". I'm on the fence, myself, since this is just a consequence of all exceptions becoming more expensive to raise in Python code, not anything specific to generators. But, it may impact async codebases in a real way... I've not done enough async programming to know how common it is for apps to abandon "unfinished" coroutines, for instance. |
Abandoning an unfinished coroutine is likely a bug -- most coroutines are wrapped in tasks and we usually complain about abandoned tasks, so users are used to seeing abandonment as an anti-pattern. Throwing exceptions into coroutines must definitely work (e.g. cancellation) but I don't think it's a major speed concern. I'm curious if the OP (@Onlooker2186) has found this while debugging a real perf regression (in production code)? Esp. since the scope of the regression is much narrower than the original comment implied -- it's not that |
@gvanrossum Yeah, it was production code for a websocket client implementation that keeps track of a large number of order books. The code in the opening post is used to identify key-value pairs after an "update" or "delete" event is received from the websocket server, and it is the main cause of the regression in total performance for the overall implementation. |
Got it. That must have been frustrating. If you want a backport, could you help? |
For a backport, I probably won't be of much use, since I don't know any C and building Python etc. presumably assumes that knowledge. Purely just a user of Python so far :) |
…y. (GH-101013) * Store exception stack depth in YIELD_VALUE's oparg and use it to avoid expensive gen.throw() in gen.close() where possible.
I believe this is fixed. |
Thanks for the update @markshannon. Should I be using 3.11.3 to test this? Testing the all() function, the performance regression seems to still be there on 3.11.3. I've also put the timings for 3.12.0a7 below.
|
This isn't fixed in 3.11... the PR (#101316) is failing for some reason. It should be fixed in 3.12, though, since that's where the change landed. |
(I think it landed in a5, so the comparison to test would be between 3.12.0a4 and 3.12.0a5.) But I can see from your comment that things appear to have gotten worse in 3.12. Are these all release builds with the same build options (PGO, etc.)? |
I can confirm that there still appears to be a regression:
|
What are those times for? |
I just used the script from the very first comment on the issue. |
There are a few things going on here. First of all, the workaround is to inline the generator code:

```python
def find_by_keys2(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        for k in keys:
            if item[k] != match_data[k]:
                break
        else:
            return item
    return None
```

The overhead of destroying the generator is much reduced in 3.12, but the cost of switching between C and Python code is increased in 3.11 compared to 3.10, and seems to be even worse in 3.12. This will be much improved in future, as we optimize larger regions and convert … |
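For comparison, the generator-based version being inlined above presumably looks something like the following (reconstructed from the discussion, not quoted from the original report), shown here with a small call on made-up data; `find_by_keys2` refers to the workaround just above.

```python
def find_by_keys(
    keys: list[str],
    table: list[dict[str, str | int]],
    match_data: dict[str, str | int],
) -> dict[str, str | int] | None:
    for item in table:
        # all() plus a generator expression: a generator object is created for
        # every row and usually abandoned mid-iteration, triggering finalization.
        if all(item[k] == match_data[k] for k in keys):
            return item
    return None

# Hypothetical data just to show both versions agree:
table = [
    {"id": 1, "side": "buy", "price": 100},
    {"id": 2, "side": "sell", "price": 101},
]
match = {"id": 2, "side": "sell", "price": 0}
print(find_by_keys(["id", "side"], table, match))
print(find_by_keys2(["id", "side"], table, match))
# both print {'id': 2, 'side': 'sell', 'price': 101}
```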
(cherry picked from commit 0db2517, co-authored by Irit Katriel)
"…GH-111069) (python#115818)" -- this reverts commit eb4774d.
I found that the Python 3.11.1 implementation of all() is 30% slower compared to Python 3.10.9.
any() also seems to be around 4% slower on my device
Environment
Code to test all():
Results for all()
Console output using Python 3.10.9:
Console output using Python 3.11.1:
Code to test any():
Results for any()
Console output using Python 3.10.9:
Console output using Python 3.11.1:
Linked PRs
Don't call gen.throw() in gen.close(), unless necessary. #101013