Optimizing BOLT flags #128514

Open
Tracked by #101525
zanieb opened this issue Jan 5, 2025 · 18 comments
Labels: build (the build process and cross-build), performance (performance or resource usage), type-feature (a feature request or enhancement)

@zanieb
Contributor

zanieb commented Jan 5, 2025

Feature or enhancement

This is a tracking issue for discussion on determining the optimal flags for BOLT to improve performance.

Tuning the flags is mentioned in #101525, but doesn't feel like a blocker for stabilization.

Linked PRs

@zanieb added the type-feature, performance, and build labels Jan 5, 2025
@zanieb
Contributor Author

zanieb commented Jan 5, 2025

There was a talk in March 2024 at the LLVM Performance Workshop; I can't find a copy of the talk online but the slides are available at https://llvm.org/devmtg/2024-03/slides/practical-use-of-bolt.pdf

It includes the following suggestions:

  • Function splitting: -split-functions, -split-strategy=cdsplit
  • Function reordering: -reorder-functions=cdsort
  • Block reordering: -reorder-blocks=ext-tsp
  • Use THP pages for hot text: -hugify
  • PLT optimization: -plt
  • More aggressive ICF: -icf
  • Indirect Call Promotion: -indirect-call-promotion

We're currently using:

cpython/configure.ac

Lines 2199 to 2212 in b60044b

-reorder-blocks=ext-tsp
-reorder-functions=cdsort
-split-functions
-icf=1
-inline-all
-split-eh
-reorder-functions-use-hot-size
-peepholes=none
-jump-tables=aggressive
-inline-ap
-indirect-call-promotion=all
-dyno-stats
-use-gnu-stack
-frame-opt=hot

This suggests we should explore -split-strategy=cdsplit, -hugify, and -plt.
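
As a rough sketch of how that shortlist falls out, the snippet below compares the flags from the slides with the flags quoted from configure.ac, matching on the option name only (ignoring any `=value`). The helper names `flag_name` and `missing_flags` are made up for illustration; the flag lists are copied from this comment.

```python
# Flags suggested in the slides, as listed in this comment.
SUGGESTED = [
    "-split-functions",
    "-split-strategy=cdsplit",
    "-reorder-functions=cdsort",
    "-reorder-blocks=ext-tsp",
    "-hugify",
    "-plt",
    "-icf",
    "-indirect-call-promotion",
]

# Flags currently used, as quoted from configure.ac in this comment.
CURRENT = [
    "-reorder-blocks=ext-tsp",
    "-reorder-functions=cdsort",
    "-split-functions",
    "-icf=1",
    "-inline-all",
    "-split-eh",
    "-reorder-functions-use-hot-size",
    "-peepholes=none",
    "-jump-tables=aggressive",
    "-inline-ap",
    "-indirect-call-promotion=all",
    "-dyno-stats",
    "-use-gnu-stack",
    "-frame-opt=hot",
]


def flag_name(flag: str) -> str:
    """Return the option name without leading dashes or an '=value' suffix."""
    return flag.lstrip("-").split("=", 1)[0]


def missing_flags(suggested, current):
    """Return suggested flags whose option name does not appear in `current`."""
    current_names = {flag_name(f) for f in current}
    return [f for f in suggested if flag_name(f) not in current_names]


if __name__ == "__main__":
    print(missing_flags(SUGGESTED, CURRENT))
```

Matching on the name alone means `-icf` and `-icf=1` count as the same option, so differences in values are not reported.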

@zanieb
Contributor Author

zanieb commented Jan 5, 2025

There's some commentary in #124948 (comment)

python-build-standalone recently added -hugify and -split-strategy=cdsplit (astral-sh/python-build-standalone#462), though the performance benefits were not validated.

My intent is to do some benchmarking for each flag.

@liusy58

liusy58 commented Jan 7, 2025

Ask me anything if you need help. I'm currently working on BOLT, and I'd also like to contribute to Python!

@liusy58

liusy58 commented Jan 7, 2025

By the way, how do you collect profiles? By instrumentation or with perf?

@liusy58

liusy58 commented Jan 7, 2025

I strongly recommend adding --split-all-cold.

@zanieb
Contributor Author

zanieb commented Jan 7, 2025

I was going to collect benchmarks with https://github.com/python/pyperformance (i.e., not instrumentation) on my Linux machine.

I think I can also post branches and ask the faster-cpython team to run benchmarks: https://github.com/faster-cpython/benchmarking-public

I have a few commits ready
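
A minimal sketch of what such a benchmarking run might look like, assuming pyperformance and pyperf are installed. The interpreter paths and JSON file names are hypothetical placeholders, and the helpers only build the command lines rather than run them. The `-G` option of `pyperf compare_to` groups results by speed.

```python
import shlex


def pyperformance_run_cmd(python_path: str, output_json: str) -> list[str]:
    """Build a `pyperformance run` command for one interpreter build."""
    return ["pyperformance", "run", f"--python={python_path}", "-o", output_json]


def compare_cmd(baseline_json: str, candidate_json: str) -> list[str]:
    """Build a `pyperf compare_to` command, grouping results by speed."""
    return ["python", "-m", "pyperf", "compare_to", baseline_json, candidate_json, "-G"]


if __name__ == "__main__":
    # Hypothetical paths: one build with the current flags, one with an extra flag.
    commands = [
        pyperformance_run_cmd("./baseline/python", "baseline.json"),
        pyperformance_run_cmd("./hugify/python", "hugify.json"),
        compare_cmd("baseline.json", "hugify.json"),
    ]
    for cmd in commands:
        # Print shell-quoted commands; pass a list to subprocess.run to execute one.
        print(shlex.join(cmd))
```

Building one interpreter per flag and comparing each against the same baseline keeps the flag as the only variable.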

@liusy58

liusy58 commented Jan 8, 2025

Profiles are key to BOLT. You are on x86, right? I remember cdsplit is not supported on AArch64.

@zanieb
Contributor Author

zanieb commented Jan 8, 2025

I have machines with both architectures.

Are you suggesting an alternative approach to measuring the effect?

@liusy58

liusy58 commented Jan 8, 2025

Yeah, maybe aarch64 can get more performance.

@zanieb
Contributor Author

zanieb commented Jan 10, 2025

As an update, I set up an x86-64 bare metal machine with LLVM 19 and am running benchmarks for the flags I described above. I'm not including LTO in the baseline, should I?

@zanieb
Contributor Author

zanieb commented Jan 11, 2025

@corona10
Member

> By the way, I wonder how you collect profiles? By instrumentation or perf?

FYI, the actual build produces the BOLTed binary through instrumentation, not perf.

@corona10
Member

> As an update, I set up an x86-64 bare metal machine with LLVM 19 and am running benchmarks for the flags I described above. I'm not including LTO in the baseline, should I?

I believe we don't have to, as long as the only difference from the baseline is the flag :)

@corona10
Member

(I am adding myself as assignee to catch up)

@zanieb
Contributor Author

zanieb commented Jan 16, 2025

A second round of benchmarks with more samples comes out a little different: https://gist.github.com/zanieb/8614bcb40b0db24dd678f2983146fb43

The effect depends on the workload.

@zanieb
Contributor Author

zanieb commented Feb 9, 2025

Following up with more analysis here

hugify

All benchmarks:
===============

Slower (23):
- sqlglot_normalize: 102 ms +- 1 ms -> 279 ms +- 3 ms: 2.74x slower
- python_startup_no_site: 6.52 ms +- 0.03 ms -> 8.53 ms +- 0.04 ms: 1.31x slower
- python_startup: 9.30 ms +- 0.03 ms -> 11.3 ms +- 0.0 ms: 1.21x slower
- spectral_norm: 93.0 ms +- 0.8 ms -> 98.5 ms +- 1.0 ms: 1.06x slower
- richards_super: 49.7 ms +- 0.4 ms -> 51.2 ms +- 0.4 ms: 1.03x slower
- generators: 25.3 ms +- 0.2 ms -> 26.1 ms +- 0.3 ms: 1.03x slower
- richards: 43.6 ms +- 0.4 ms -> 44.8 ms +- 0.4 ms: 1.03x slower
- coroutines: 20.4 ms +- 0.3 ms -> 21.0 ms +- 0.3 ms: 1.03x slower
- scimark_sor: 114 ms +- 1 ms -> 116 ms +- 1 ms: 1.02x slower
- mako: 10.2 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.02x slower
- regex_dna: 147 ms +- 3 ms -> 150 ms +- 4 ms: 1.02x slower
- genshi_text: 19.6 ms +- 0.2 ms -> 19.9 ms +- 0.2 ms: 1.02x slower
- comprehensions: 15.3 us +- 0.1 us -> 15.5 us +- 0.2 us: 1.01x slower
- deltablue: 2.91 ms +- 0.03 ms -> 2.95 ms +- 0.03 ms: 1.01x slower
- xml_etree_iterparse: 78.7 ms +- 0.6 ms -> 79.7 ms +- 0.7 ms: 1.01x slower
- xml_etree_parse: 114 ms +- 1 ms -> 115 ms +- 1 ms: 1.01x slower
- raytrace: 256 ms +- 3 ms -> 258 ms +- 3 ms: 1.01x slower
- logging_simple: 5.25 us +- 0.07 us -> 5.29 us +- 0.10 us: 1.01x slower
- asyncio_tcp: 429 ms +- 4 ms -> 432 ms +- 4 ms: 1.01x slower
- pyflate: 380 ms +- 3 ms -> 382 ms +- 3 ms: 1.01x slower
- genshi_xml: 46.1 ms +- 0.6 ms -> 46.4 ms +- 0.6 ms: 1.01x slower
- go: 110 ms +- 1 ms -> 110 ms +- 1 ms: 1.00x slower
- asyncio_tcp_ssl: 1.48 sec +- 0.01 sec -> 1.48 sec +- 0.00 sec: 1.00x slower

Faster (39):
- logging_silent: 107 ns +- 1 ns -> 100 ns +- 1 ns: 1.07x faster
- regex_effbot: 2.47 ms +- 0.03 ms -> 2.33 ms +- 0.06 ms: 1.06x faster
- scimark_sparse_mat_mult: 4.21 ms +- 0.11 ms -> 3.98 ms +- 0.09 ms: 1.06x faster
- json_loads: 22.9 us +- 0.2 us -> 21.9 us +- 0.3 us: 1.05x faster
- json_dumps: 10.1 ms +- 0.2 ms -> 9.65 ms +- 0.16 ms: 1.04x faster
- regex_v8: 22.1 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.04x faster
- telco: 6.60 ms +- 0.10 ms -> 6.38 ms +- 0.08 ms: 1.03x faster
- unpack_sequence: 37.4 ns +- 1.9 ns -> 36.2 ns +- 0.8 ns: 1.03x faster
- deepcopy: 250 us +- 2 us -> 242 us +- 2 us: 1.03x faster
- deepcopy_reduce: 2.60 us +- 0.03 us -> 2.52 us +- 0.04 us: 1.03x faster
- scimark_monte_carlo: 59.3 ms +- 0.4 ms -> 57.6 ms +- 0.6 ms: 1.03x faster
- deepcopy_memo: 29.6 us +- 0.2 us -> 28.7 us +- 0.3 us: 1.03x faster
- typing_runtime_protocols: 153 us +- 3 us -> 148 us +- 3 us: 1.03x faster
- scimark_fft: 313 ms +- 3 ms -> 305 ms +- 6 ms: 1.03x faster
- nqueens: 74.4 ms +- 0.7 ms -> 72.6 ms +- 0.5 ms: 1.02x faster
- chaos: 58.0 ms +- 0.8 ms -> 56.7 ms +- 0.7 ms: 1.02x faster
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.35 sec +- 0.02 sec: 1.02x faster
- scimark_lu: 112 ms +- 1 ms -> 110 ms +- 1 ms: 1.02x faster
- tomli_loads: 1.93 sec +- 0.02 sec -> 1.89 sec +- 0.03 sec: 1.02x faster
- pickle: 11.1 us +- 0.4 us -> 10.9 us +- 0.2 us: 1.02x faster
- pathlib: 19.2 ms +- 0.1 ms -> 18.8 ms +- 0.1 ms: 1.02x faster
- unpickle: 12.1 us +- 0.2 us -> 11.9 us +- 0.2 us: 1.02x faster
- logging_format: 5.82 us +- 0.13 us -> 5.74 us +- 0.12 us: 1.01x faster
- pprint_safe_repr: 667 ms +- 6 ms -> 657 ms +- 6 ms: 1.01x faster
- xml_etree_process: 54.6 ms +- 0.4 ms -> 53.8 ms +- 0.4 ms: 1.01x faster
- xml_etree_generate: 78.3 ms +- 0.5 ms -> 77.3 ms +- 0.6 ms: 1.01x faster
- pidigits: 160 ms +- 1 ms -> 158 ms +- 1 ms: 1.01x faster
- regex_compile: 118 ms +- 1 ms -> 117 ms +- 1 ms: 1.01x faster
- pickle_list: 3.84 us +- 0.07 us -> 3.80 us +- 0.16 us: 1.01x faster
- mdp: 2.33 sec +- 0.02 sec -> 2.30 sec +- 0.02 sec: 1.01x faster
- pickle_pure_python: 298 us +- 3 us -> 295 us +- 3 us: 1.01x faster
- meteor_contest: 89.6 ms +- 0.4 ms -> 89.0 ms +- 0.5 ms: 1.01x faster
- gc_traversal: 2.78 ms +- 0.06 ms -> 2.76 ms +- 0.05 ms: 1.01x faster
- html5lib: 60.5 ms +- 0.6 ms -> 60.1 ms +- 0.6 ms: 1.01x faster
- create_gc_cycles: 954 us +- 5 us -> 948 us +- 5 us: 1.01x faster
- dulwich_log: 60.0 ms +- 0.3 ms -> 59.7 ms +- 0.3 ms: 1.01x faster
- fannkuch: 372 ms +- 3 ms -> 370 ms +- 3 ms: 1.01x faster
- float: 66.4 ms +- 1.1 ms -> 66.0 ms +- 1.1 ms: 1.00x faster
- hexiom: 5.63 ms +- 0.06 ms -> 5.61 ms +- 0.05 ms: 1.00x faster

Benchmark hidden because not significant (11): 2to3, async_generators, asyncio_websockets, bench_mp_pool, bench_thread_pool, crypto_pyaes, docutils, nbody, pickle_dict, unpickle_list, unpickle_pure_python

Geometric mean: 1.01x slower

cdsplit

All benchmarks:
===============

Slower (41):
- sqlglot_normalize: 102 ms +- 1 ms -> 283 ms +- 2 ms: 2.78x slower
- deltablue: 2.91 ms +- 0.03 ms -> 3.20 ms +- 0.05 ms: 1.10x slower
- pyflate: 380 ms +- 3 ms -> 411 ms +- 4 ms: 1.08x slower
- scimark_fft: 313 ms +- 3 ms -> 329 ms +- 4 ms: 1.05x slower
- generators: 25.3 ms +- 0.2 ms -> 26.4 ms +- 0.2 ms: 1.04x slower
- html5lib: 60.5 ms +- 0.6 ms -> 62.7 ms +- 0.5 ms: 1.04x slower
- mako: 10.2 ms +- 0.1 ms -> 10.6 ms +- 0.1 ms: 1.04x slower
- go: 110 ms +- 1 ms -> 114 ms +- 1 ms: 1.03x slower
- logging_format: 5.82 us +- 0.13 us -> 6.00 us +- 0.13 us: 1.03x slower
- logging_simple: 5.25 us +- 0.07 us -> 5.41 us +- 0.12 us: 1.03x slower
- comprehensions: 15.3 us +- 0.1 us -> 15.8 us +- 0.1 us: 1.03x slower
- async_generators: 376 ms +- 4 ms -> 387 ms +- 4 ms: 1.03x slower
- genshi_xml: 46.1 ms +- 0.6 ms -> 47.3 ms +- 0.5 ms: 1.03x slower
- scimark_sparse_mat_mult: 4.21 ms +- 0.11 ms -> 4.32 ms +- 0.07 ms: 1.03x slower
- chaos: 58.0 ms +- 0.8 ms -> 59.4 ms +- 0.8 ms: 1.02x slower
- spectral_norm: 93.0 ms +- 0.8 ms -> 95.1 ms +- 0.9 ms: 1.02x slower
- scimark_sor: 114 ms +- 1 ms -> 116 ms +- 1 ms: 1.02x slower
- coroutines: 20.4 ms +- 0.3 ms -> 20.8 ms +- 0.3 ms: 1.02x slower
- richards: 43.6 ms +- 0.4 ms -> 44.4 ms +- 0.4 ms: 1.02x slower
- fannkuch: 372 ms +- 3 ms -> 378 ms +- 3 ms: 1.02x slower
- richards_super: 49.7 ms +- 0.4 ms -> 50.6 ms +- 0.5 ms: 1.02x slower
- json_dumps: 10.1 ms +- 0.2 ms -> 10.2 ms +- 0.3 ms: 1.02x slower
- regex_dna: 147 ms +- 3 ms -> 149 ms +- 3 ms: 1.02x slower
- genshi_text: 19.6 ms +- 0.2 ms -> 19.9 ms +- 0.2 ms: 1.02x slower
- asyncio_tcp: 429 ms +- 4 ms -> 436 ms +- 5 ms: 1.01x slower
- meteor_contest: 89.6 ms +- 0.4 ms -> 90.7 ms +- 0.5 ms: 1.01x slower
- float: 66.4 ms +- 1.1 ms -> 67.2 ms +- 1.1 ms: 1.01x slower
- typing_runtime_protocols: 153 us +- 3 us -> 154 us +- 3 us: 1.01x slower
- pprint_safe_repr: 667 ms +- 6 ms -> 674 ms +- 7 ms: 1.01x slower
- tomli_loads: 1.93 sec +- 0.02 sec -> 1.95 sec +- 0.02 sec: 1.01x slower
- hexiom: 5.63 ms +- 0.06 ms -> 5.68 ms +- 0.04 ms: 1.01x slower
- xml_etree_process: 54.6 ms +- 0.4 ms -> 55.0 ms +- 0.4 ms: 1.01x slower
- 2to3: 230 ms +- 1 ms -> 232 ms +- 1 ms: 1.01x slower
- regex_compile: 118 ms +- 1 ms -> 119 ms +- 1 ms: 1.01x slower
- xml_etree_iterparse: 78.7 ms +- 0.6 ms -> 79.2 ms +- 0.7 ms: 1.01x slower
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.39 sec +- 0.01 sec: 1.00x slower
- mdp: 2.33 sec +- 0.02 sec -> 2.34 sec +- 0.01 sec: 1.00x slower
- raytrace: 256 ms +- 3 ms -> 257 ms +- 3 ms: 1.00x slower
- asyncio_tcp_ssl: 1.48 sec +- 0.01 sec -> 1.48 sec +- 0.01 sec: 1.00x slower
- docutils: 2.22 sec +- 0.01 sec -> 2.23 sec +- 0.01 sec: 1.00x slower
- scimark_monte_carlo: 59.3 ms +- 0.4 ms -> 59.5 ms +- 0.5 ms: 1.00x slower

Faster (22):
- regex_effbot: 2.47 ms +- 0.03 ms -> 2.31 ms +- 0.05 ms: 1.07x faster
- scimark_lu: 112 ms +- 1 ms -> 105 ms +- 1 ms: 1.06x faster
- regex_v8: 22.1 ms +- 0.2 ms -> 20.8 ms +- 0.2 ms: 1.06x faster
- json_loads: 22.9 us +- 0.2 us -> 21.7 us +- 0.2 us: 1.06x faster
- logging_silent: 107 ns +- 1 ns -> 102 ns +- 1 ns: 1.05x faster
- unpack_sequence: 37.4 ns +- 1.9 ns -> 36.4 ns +- 0.8 ns: 1.03x faster
- deepcopy_reduce: 2.60 us +- 0.03 us -> 2.55 us +- 0.03 us: 1.02x faster
- unpickle: 12.1 us +- 0.2 us -> 11.9 us +- 0.1 us: 1.02x faster
- deepcopy_memo: 29.6 us +- 0.2 us -> 29.1 us +- 0.3 us: 1.02x faster
- pidigits: 160 ms +- 1 ms -> 158 ms +- 1 ms: 1.01x faster
- gc_traversal: 2.78 ms +- 0.06 ms -> 2.74 ms +- 0.02 ms: 1.01x faster
- create_gc_cycles: 954 us +- 5 us -> 944 us +- 6 us: 1.01x faster
- pathlib: 19.2 ms +- 0.1 ms -> 19.0 ms +- 0.1 ms: 1.01x faster
- bench_thread_pool: 936 us +- 39 us -> 926 us +- 35 us: 1.01x faster
- deepcopy: 250 us +- 2 us -> 248 us +- 2 us: 1.01x faster
- pickle_pure_python: 298 us +- 3 us -> 295 us +- 2 us: 1.01x faster
- unpickle_pure_python: 210 us +- 1 us -> 209 us +- 2 us: 1.01x faster
- unpickle_list: 4.36 us +- 0.07 us -> 4.33 us +- 0.09 us: 1.01x faster
- nbody: 88.3 ms +- 0.6 ms -> 87.8 ms +- 0.6 ms: 1.01x faster
- python_startup: 9.30 ms +- 0.03 ms -> 9.27 ms +- 0.03 ms: 1.00x faster
- python_startup_no_site: 6.52 ms +- 0.03 ms -> 6.49 ms +- 0.03 ms: 1.00x faster
- nqueens: 74.4 ms +- 0.7 ms -> 74.2 ms +- 0.6 ms: 1.00x faster

Benchmark hidden because not significant (10): asyncio_websockets, bench_mp_pool, crypto_pyaes, dulwich_log, pickle, pickle_dict, pickle_list, telco, xml_etree_parse, xml_etree_generate

Geometric mean: 1.02x slower

split-all-cold

All benchmarks:
===============

Slower (30):
- sqlglot_normalize: 102 ms +- 1 ms -> 283 ms +- 4 ms: 2.78x slower
- bench_mp_pool: 26.1 ms +- 0.4 ms -> 31.0 ms +- 8.7 ms: 1.19x slower
- mako: 10.2 ms +- 0.1 ms -> 11.0 ms +- 0.2 ms: 1.08x slower
- pyflate: 380 ms +- 3 ms -> 407 ms +- 3 ms: 1.07x slower
- create_gc_cycles: 954 us +- 5 us -> 1.01 ms +- 0.01 ms: 1.06x slower
- unpack_sequence: 37.4 ns +- 1.9 ns -> 39.5 ns +- 2.6 ns: 1.06x slower
- regex_dna: 147 ms +- 3 ms -> 152 ms +- 3 ms: 1.04x slower
- scimark_sparse_mat_mult: 4.21 ms +- 0.11 ms -> 4.36 ms +- 0.08 ms: 1.04x slower
- unpickle_list: 4.36 us +- 0.07 us -> 4.50 us +- 0.10 us: 1.03x slower
- telco: 6.60 ms +- 0.10 ms -> 6.81 ms +- 0.07 ms: 1.03x slower
- scimark_sor: 114 ms +- 1 ms -> 117 ms +- 1 ms: 1.03x slower
- generators: 25.3 ms +- 0.2 ms -> 26.1 ms +- 0.3 ms: 1.03x slower
- gc_traversal: 2.78 ms +- 0.06 ms -> 2.85 ms +- 0.06 ms: 1.03x slower
- raytrace: 256 ms +- 3 ms -> 262 ms +- 2 ms: 1.02x slower
- deltablue: 2.91 ms +- 0.03 ms -> 2.98 ms +- 0.02 ms: 1.02x slower
- xml_etree_iterparse: 78.7 ms +- 0.6 ms -> 80.3 ms +- 0.5 ms: 1.02x slower
- scimark_lu: 112 ms +- 1 ms -> 114 ms +- 2 ms: 1.02x slower
- genshi_text: 19.6 ms +- 0.2 ms -> 20.0 ms +- 0.2 ms: 1.02x slower
- async_generators: 376 ms +- 4 ms -> 383 ms +- 4 ms: 1.02x slower
- comprehensions: 15.3 us +- 0.1 us -> 15.6 us +- 0.1 us: 1.02x slower
- richards_super: 49.7 ms +- 0.4 ms -> 50.4 ms +- 0.5 ms: 1.01x slower
- xml_etree_parse: 114 ms +- 1 ms -> 116 ms +- 1 ms: 1.01x slower
- scimark_fft: 313 ms +- 3 ms -> 317 ms +- 3 ms: 1.01x slower
- xml_etree_generate: 78.3 ms +- 0.5 ms -> 79.2 ms +- 0.6 ms: 1.01x slower
- tomli_loads: 1.93 sec +- 0.02 sec -> 1.94 sec +- 0.02 sec: 1.01x slower
- go: 110 ms +- 1 ms -> 111 ms +- 1 ms: 1.01x slower
- xml_etree_process: 54.6 ms +- 0.4 ms -> 55.0 ms +- 0.5 ms: 1.01x slower
- html5lib: 60.5 ms +- 0.6 ms -> 60.8 ms +- 0.8 ms: 1.01x slower
- richards: 43.6 ms +- 0.4 ms -> 43.8 ms +- 0.4 ms: 1.00x slower
- 2to3: 230 ms +- 1 ms -> 230 ms +- 1 ms: 1.00x slower

Faster (29):
- regex_v8: 22.1 ms +- 0.2 ms -> 20.7 ms +- 0.2 ms: 1.06x faster
- logging_silent: 107 ns +- 1 ns -> 101 ns +- 1 ns: 1.06x faster
- spectral_norm: 93.0 ms +- 0.8 ms -> 88.6 ms +- 1.7 ms: 1.05x faster
- json_loads: 22.9 us +- 0.2 us -> 21.8 us +- 0.2 us: 1.05x faster
- coroutines: 20.4 ms +- 0.3 ms -> 19.5 ms +- 0.4 ms: 1.05x faster
- regex_effbot: 2.47 ms +- 0.03 ms -> 2.36 ms +- 0.05 ms: 1.05x faster
- hexiom: 5.63 ms +- 0.06 ms -> 5.46 ms +- 0.04 ms: 1.03x faster
- deepcopy_memo: 29.6 us +- 0.2 us -> 28.8 us +- 0.3 us: 1.03x faster
- deepcopy_reduce: 2.60 us +- 0.03 us -> 2.54 us +- 0.03 us: 1.02x faster
- pickle: 11.1 us +- 0.4 us -> 10.8 us +- 0.1 us: 1.02x faster
- unpickle: 12.1 us +- 0.2 us -> 11.8 us +- 0.2 us: 1.02x faster
- logging_format: 5.82 us +- 0.13 us -> 5.71 us +- 0.08 us: 1.02x faster
- nbody: 88.3 ms +- 0.6 ms -> 86.6 ms +- 0.6 ms: 1.02x faster
- pathlib: 19.2 ms +- 0.1 ms -> 18.8 ms +- 0.1 ms: 1.02x faster
- bench_thread_pool: 936 us +- 39 us -> 921 us +- 34 us: 1.02x faster
- deepcopy: 250 us +- 2 us -> 247 us +- 3 us: 1.01x faster
- nqueens: 74.4 ms +- 0.7 ms -> 73.4 ms +- 0.5 ms: 1.01x faster
- meteor_contest: 89.6 ms +- 0.4 ms -> 88.5 ms +- 0.5 ms: 1.01x faster
- dulwich_log: 60.0 ms +- 0.3 ms -> 59.2 ms +- 0.3 ms: 1.01x faster
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.37 sec +- 0.01 sec: 1.01x faster
- fannkuch: 372 ms +- 3 ms -> 368 ms +- 4 ms: 1.01x faster
- chaos: 58.0 ms +- 0.8 ms -> 57.5 ms +- 0.5 ms: 1.01x faster
- pprint_safe_repr: 667 ms +- 6 ms -> 661 ms +- 5 ms: 1.01x faster
- regex_compile: 118 ms +- 1 ms -> 117 ms +- 1 ms: 1.01x faster
- pickle_pure_python: 298 us +- 3 us -> 296 us +- 3 us: 1.00x faster
- crypto_pyaes: 63.6 ms +- 0.6 ms -> 63.3 ms +- 0.4 ms: 1.00x faster
- mdp: 2.33 sec +- 0.02 sec -> 2.32 sec +- 0.01 sec: 1.00x faster
- pidigits: 160 ms +- 1 ms -> 160 ms +- 1 ms: 1.00x faster
- python_startup: 9.30 ms +- 0.03 ms -> 9.29 ms +- 0.03 ms: 1.00x faster

Benchmark hidden because not significant (14): asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, docutils, float, genshi_xml, json_dumps, logging_simple, pickle_dict, pickle_list, python_startup_no_site, scimark_monte_carlo, typing_runtime_protocols, unpickle_pure_python

Geometric mean: 1.02x slower

@zanieb
Contributor Author

zanieb commented Feb 9, 2025

I'm not sure what's up with sqlglot_normalize
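
One way to see whether that slowdown is tied to a particular flag is to check which benchmarks regress past some threshold in every run. The sketch below does that with the three largest slowdowns from each result set in this thread. The 1.5x threshold and the helper name `slow_in_every_run` are arbitrary choices for illustration, and the outcome is an observation, not a diagnosis.

```python
# Three largest slowdowns per run, copied from the results in this thread.
# Values are the reported "x slower" factors relative to the baseline.
RUNS = {
    "hugify": {
        "sqlglot_normalize": 2.74,
        "python_startup_no_site": 1.31,
        "python_startup": 1.21,
    },
    "cdsplit": {
        "sqlglot_normalize": 2.78,
        "deltablue": 1.10,
        "pyflate": 1.08,
    },
    "split-all-cold": {
        "sqlglot_normalize": 2.78,
        "bench_mp_pool": 1.19,
        "mako": 1.08,
    },
    "all three flags": {
        "sqlglot_normalize": 2.70,
        "python_startup_no_site": 1.31,
        "python_startup": 1.21,
    },
}


def slow_in_every_run(runs, threshold):
    """Return the benchmarks whose slowdown exceeds `threshold` in all runs."""
    common = None
    for results in runs.values():
        slow = {name for name, factor in results.items() if factor > threshold}
        common = slow if common is None else common & slow
    return sorted(common or [])


if __name__ == "__main__":
    print(slow_in_every_run(RUNS, threshold=1.5))
```

A benchmark that regresses by a similar factor in every configuration is worth checking against the baseline run itself before attributing it to any one flag.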

@zanieb
Contributor Author

zanieb commented Feb 10, 2025

I did another benchmark with all three of the aforementioned flags added. Together, they appear to give a more consistent speedup.

https://gist.github.com/zanieb/2e043b5f1f415a0b1e98fcf72465c0fa

All benchmarks:
===============

Slower (10):
- sqlglot_normalize: 102 ms +- 1 ms -> 275 ms +- 3 ms: 2.70x slower
- python_startup_no_site: 6.52 ms +- 0.03 ms -> 8.53 ms +- 0.04 ms: 1.31x slower
- python_startup: 9.30 ms +- 0.03 ms -> 11.2 ms +- 0.0 ms: 1.21x slower
- bench_mp_pool: 26.1 ms +- 0.4 ms -> 28.3 ms +- 0.6 ms: 1.08x slower
- create_gc_cycles: 954 us +- 5 us -> 1.00 ms +- 0.01 ms: 1.05x slower
- regex_dna: 147 ms +- 3 ms -> 153 ms +- 1 ms: 1.04x slower
- gc_traversal: 2.78 ms +- 0.06 ms -> 2.84 ms +- 0.03 ms: 1.02x slower
- json_loads: 22.9 us +- 0.2 us -> 23.1 us +- 0.2 us: 1.01x slower
- xml_etree_iterparse: 78.7 ms +- 0.6 ms -> 79.0 ms +- 0.9 ms: 1.00x slower
- xml_etree_parse: 114 ms +- 1 ms -> 114 ms +- 1 ms: 1.00x slower

Faster (58):
- unpack_sequence: 37.4 ns +- 1.9 ns -> 33.1 ns +- 0.5 ns: 1.13x faster
- deltablue: 2.91 ms +- 0.03 ms -> 2.63 ms +- 0.02 ms: 1.11x faster
- nbody: 88.3 ms +- 0.6 ms -> 80.1 ms +- 0.8 ms: 1.10x faster
- logging_silent: 107 ns +- 1 ns -> 97.3 ns +- 1.8 ns: 1.10x faster
- hexiom: 5.63 ms +- 0.06 ms -> 5.13 ms +- 0.05 ms: 1.10x faster
- async_generators: 376 ms +- 4 ms -> 345 ms +- 3 ms: 1.09x faster
- go: 110 ms +- 1 ms -> 101 ms +- 1 ms: 1.09x faster
- chaos: 58.0 ms +- 0.8 ms -> 53.5 ms +- 0.6 ms: 1.08x faster
- regex_compile: 118 ms +- 1 ms -> 110 ms +- 1 ms: 1.08x faster
- unpickle_pure_python: 210 us +- 1 us -> 195 us +- 1 us: 1.08x faster
- pyflate: 380 ms +- 3 ms -> 354 ms +- 4 ms: 1.07x faster
- genshi_text: 19.6 ms +- 0.2 ms -> 18.5 ms +- 0.2 ms: 1.06x faster
- float: 66.4 ms +- 1.1 ms -> 62.5 ms +- 0.6 ms: 1.06x faster
- comprehensions: 15.3 us +- 0.1 us -> 14.4 us +- 0.3 us: 1.06x faster
- deepcopy: 250 us +- 2 us -> 236 us +- 2 us: 1.06x faster
- regex_effbot: 2.47 ms +- 0.03 ms -> 2.33 ms +- 0.04 ms: 1.06x faster
- scimark_fft: 313 ms +- 3 ms -> 297 ms +- 9 ms: 1.06x faster
- scimark_monte_carlo: 59.3 ms +- 0.4 ms -> 56.3 ms +- 0.9 ms: 1.05x faster
- unpickle_list: 4.36 us +- 0.07 us -> 4.14 us +- 0.10 us: 1.05x faster
- pickle_pure_python: 298 us +- 3 us -> 283 us +- 3 us: 1.05x faster
- richards: 43.6 ms +- 0.4 ms -> 41.5 ms +- 0.6 ms: 1.05x faster
- deepcopy_reduce: 2.60 us +- 0.03 us -> 2.48 us +- 0.03 us: 1.05x faster
- telco: 6.60 ms +- 0.10 ms -> 6.30 ms +- 0.11 ms: 1.05x faster
- json_dumps: 10.1 ms +- 0.2 ms -> 9.60 ms +- 0.10 ms: 1.05x faster
- logging_simple: 5.25 us +- 0.07 us -> 5.02 us +- 0.06 us: 1.05x faster
- nqueens: 74.4 ms +- 0.7 ms -> 71.1 ms +- 0.8 ms: 1.05x faster
- genshi_xml: 46.1 ms +- 0.6 ms -> 44.1 ms +- 0.4 ms: 1.05x faster
- meteor_contest: 89.6 ms +- 0.4 ms -> 85.8 ms +- 0.4 ms: 1.04x faster
- spectral_norm: 93.0 ms +- 0.8 ms -> 89.1 ms +- 0.7 ms: 1.04x faster
- xml_etree_process: 54.6 ms +- 0.4 ms -> 52.3 ms +- 0.4 ms: 1.04x faster
- mdp: 2.33 sec +- 0.02 sec -> 2.23 sec +- 0.03 sec: 1.04x faster
- dulwich_log: 60.0 ms +- 0.3 ms -> 57.6 ms +- 0.4 ms: 1.04x faster
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.33 sec +- 0.01 sec: 1.04x faster
- scimark_lu: 112 ms +- 1 ms -> 107 ms +- 1 ms: 1.04x faster
- coroutines: 20.4 ms +- 0.3 ms -> 19.6 ms +- 0.4 ms: 1.04x faster
- scimark_sparse_mat_mult: 4.21 ms +- 0.11 ms -> 4.05 ms +- 0.12 ms: 1.04x faster
- tomli_loads: 1.93 sec +- 0.02 sec -> 1.85 sec +- 0.02 sec: 1.04x faster
- scimark_sor: 114 ms +- 1 ms -> 110 ms +- 1 ms: 1.04x faster
- xml_etree_generate: 78.3 ms +- 0.5 ms -> 75.7 ms +- 0.6 ms: 1.04x faster
- pprint_safe_repr: 667 ms +- 6 ms -> 646 ms +- 7 ms: 1.03x faster
- 2to3: 230 ms +- 1 ms -> 223 ms +- 1 ms: 1.03x faster
- regex_v8: 22.1 ms +- 0.2 ms -> 21.4 ms +- 0.1 ms: 1.03x faster
- richards_super: 49.7 ms +- 0.4 ms -> 48.3 ms +- 0.5 ms: 1.03x faster
- mako: 10.2 ms +- 0.1 ms -> 9.92 ms +- 0.07 ms: 1.03x faster
- deepcopy_memo: 29.6 us +- 0.2 us -> 28.8 us +- 0.3 us: 1.03x faster
- bench_thread_pool: 936 us +- 39 us -> 913 us +- 38 us: 1.02x faster
- generators: 25.3 ms +- 0.2 ms -> 24.7 ms +- 0.2 ms: 1.02x faster
- raytrace: 256 ms +- 3 ms -> 250 ms +- 2 ms: 1.02x faster
- asyncio_websockets: 508 ms +- 24 ms -> 496 ms +- 2 ms: 1.02x faster
- logging_format: 5.82 us +- 0.13 us -> 5.69 us +- 0.11 us: 1.02x faster
- pidigits: 160 ms +- 1 ms -> 158 ms +- 1 ms: 1.01x faster
- docutils: 2.22 sec +- 0.01 sec -> 2.19 sec +- 0.02 sec: 1.01x faster
- pickle: 11.1 us +- 0.4 us -> 10.9 us +- 0.1 us: 1.01x faster
- typing_runtime_protocols: 153 us +- 3 us -> 151 us +- 6 us: 1.01x faster
- pathlib: 19.2 ms +- 0.1 ms -> 19.0 ms +- 0.1 ms: 1.01x faster
- pickle_list: 3.84 us +- 0.07 us -> 3.82 us +- 0.06 us: 1.00x faster
- asyncio_tcp: 429 ms +- 4 ms -> 428 ms +- 3 ms: 1.00x faster
- asyncio_tcp_ssl: 1.48 sec +- 0.01 sec -> 1.47 sec +- 0.01 sec: 1.00x faster

Benchmark hidden because not significant (5): crypto_pyaes, fannkuch, html5lib, pickle_dict, unpickle

Geometric mean: 1.01x faster
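
For readers who want to post-process output like the above, here is a minimal sketch that parses result lines of this shape and combines the reported ratios into a geometric mean. The regular expression is an assumption based only on the lines shown in this thread, not a documented pyperf format, and the helper names are made up. The sample uses three lines, so its result is not expected to match the geometric mean reported for the full run.

```python
import math
import re

# Matches lines such as:
# "- deltablue: 2.91 ms +- 0.03 ms -> 2.63 ms +- 0.02 ms: 1.11x faster"
LINE_RE = re.compile(
    r"^- (?P<name>\S+): .*? -> .*?: (?P<ratio>[\d.]+)x (?P<direction>slower|faster)$"
)


def parse_ratio(line: str):
    """Return (benchmark name, new/old time ratio) for one result line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    ratio = float(match["ratio"])
    # "1.11x faster" means the new time is 1/1.11 of the old time.
    if match["direction"] == "faster":
        ratio = 1.0 / ratio
    return match["name"], ratio


def geometric_mean(ratios):
    """Geometric mean of time ratios; a value above 1.0 means slower overall."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    sample = [
        "- sqlglot_normalize: 102 ms +- 1 ms -> 275 ms +- 3 ms: 2.70x slower",
        "- unpack_sequence: 37.4 ns +- 1.9 ns -> 33.1 ns +- 0.5 ns: 1.13x faster",
        "- deltablue: 2.91 ms +- 0.03 ms -> 2.63 ms +- 0.02 ms: 1.11x faster",
    ]
    ratios = [parse_ratio(line)[1] for line in sample]
    print(f"geometric mean of time ratios: {geometric_mean(ratios):.2f}")
```

Using the geometric mean rather than the arithmetic mean keeps a 2x slowdown and a 2x speedup symmetric, which is why a single large outlier still moves it noticeably.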
