Yesterday I read a post about needletail performance. It gave me the idea to try optimizing the library with Profile-Guided Optimization (PGO), as I have already done for many other applications (all the results are available here). I ran some tests and want to share the results.
Test environment
Fedora 39
Linux kernel 6.7.3
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.76
needletail version: the latest master branch at the time of writing, commit 25e9b931af87d5aed79ecf7a3ff32245b91ce9dc
Disabled Turbo boost (for more stable results across benchmark runs)
Benchmark
The built-in benchmarks are run with `cargo bench`. The PGO instrumentation phase on the benchmarks is done with `cargo pgo bench`, and the PGO optimization phase with `cargo pgo optimize bench`.
All PGO steps are done with the cargo-pgo tool.
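The phases above can be sketched as a small shell script. This is only a minimal sketch: it assumes cargo-pgo is installed with `cargo install cargo-pgo` and that the `llvm-tools-preview` rustup component is available. The `run` helper, the `pgo_bench_workflow` function, and the `DRY_RUN` variable are hypothetical names used for illustration. By default the script only prints the commands; set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
set -eu

# Print each command; execute it only when DRY_RUN=0 (hypothetical switch,
# print-only by default so the sketch is safe to run anywhere).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

pgo_bench_workflow() {
    # Prerequisites for cargo-pgo (assumed not yet installed).
    run rustup component add llvm-tools-preview
    run cargo install cargo-pgo

    run cargo bench               # baseline numbers without PGO
    run cargo pgo bench           # instrumentation phase: collect profiles
    run cargo pgo optimize bench  # optimization phase: benchmark the PGO build
}

pgo_bench_workflow
```

Run it from the needletail checkout as `DRY_RUN=0 sh pgo-bench.sh` (the file name is arbitrary) to reproduce the three benchmark runs.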
The only caveat I found is that rustc hits an internal bug when LTO and PGO are enabled at the same time (see here for more details). However, this should not make the benchmarks less useful: in practice PGO can still bring performance improvements on top of LTO. I hope the bug will be fixed one day so that LTO and PGO can be used for needletail simultaneously.
Results
At least in the benchmarks provided by the project, I see measurable performance improvements in many cases. The one interesting exception is a regression in the "FASTA parsing/SeqIO" case. It should be investigated further, but my guess is that it comes from the nature of PGO: optimizing for one hot path sometimes pessimizes other cases. In practice, users can usually handle this by building multiple PGO-optimized binaries, one per workload, each with its own PGO profile.
Possible further steps
I can suggest the following things to consider:
Perform more PGO benchmarks in other scenarios. If they show improvements, add a note to the documentation about the possible performance gains for needletail with PGO (I guess somewhere in the README file will be enough).
I will be happy to answer all your questions about PGO.
Thanks for the issue! No worries for the SeqIO, it's benchmarking another library.
How does it work in practice? Can I have the profile in this repo and have it used automatically by people downloading from crates.io? Does it need to be run on all different archs?
> Can I have the profile in this repo and have it used automatically by people downloading from crates.io?

Probably not via crates.io: Rust users normally compile the library from source together with their applications, so there is no prebuilt artifact on crates.io that could be pre-optimized with PGO. However, if you ship prebuilt binaries or build the library separately (like a Python package), it is possible. See the pydantic-core example of PGO for a wheel: pydantic/pydantic-core#741
> Does it need to be run on all different archs?

Generated PGO profiles (.profraw and .profdata files) can be architecture-dependent. However, if you describe your PGO training routine as a set of scripts (the PGO training scenarios), the routine itself is completely architecture-independent and can be run on any arch to produce a profile there.
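As a rough illustration of such a training script, here is a minimal sketch. It assumes the benchmarks serve as the training workload and that cargo-pgo is already installed. The `step` helper, the `train_and_build` function, and the `DRY_RUN` variable are hypothetical names, and the script only prints the commands unless `DRY_RUN=0` is set. Nothing architecture-specific is stored in the repository: the profile is regenerated on whatever host runs the script.

```shell
#!/bin/sh
set -eu

# Print each command; execute it only when DRY_RUN=0 (hypothetical switch).
step() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

train_and_build() {
    # The profile is collected on this machine, so it matches this arch.
    echo "# training on host architecture: $(uname -m)"
    step cargo pgo bench     # training run: instrumented build writes profiles
    step cargo pgo optimize  # rebuild using the profiles gathered above
}

train_and_build
```

A CI job per target architecture could call the same script, so only the script is versioned, not the .profraw or .profdata files.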