Description
Motivation
At present (b7dfe5c), pybind11 proper only benchmarks compile time and artifact size for one given test setup (which tests arguments and simple inheritance, but that's about it, I think); the results can be seen here:
https://pybind11.readthedocs.io/en/stable/benchmark.html
https://github.com/pybind/pybind11/blob/v2.6.1/docs/benchmark.rst
However, it may be difficult to objectively and concretely judge the performance impact of a PR and weigh that against the value of the feature / issue resolution. Generally, benchmarking is done on an ad-hoc basis (totes works, but may make it difficult for less creative people like myself ;)
Primary motivating issues / PRs:
- Detect and fail if using mismatched holders #2644 (comment)
- [FEAT] Rework casting #2646
- [BUG] Problem when creating derived Python objects from C++ (inheritance slicing) #1333
Secondary:
- Would you consider merging a patch that optionally splits pybind11 into declarations vs definitions in a backwards compatible opt-in manner? #2322
- why pybind11 is slower than cython #1227 / why pybind11 is slower than Python and boost python; the benchmark does not include performance test? #1825
- Python multiple inheritance #693 (comment)
- Iterators efficiency and instance tracking #376
Fuzzy Scoping + Steps
- Establish a baseline benchmarking setup that touches on the core "hotpath" features for performance. Only track metrics for a given version of pybind11 (out of scope: other binding approaches)
  - Initialization time (e.g. dlopen-ish stuff, pybind11 internals upstart, binding registration, ...)
  - Run time (function calls, type conversions, casting, etc.) (a rough measurement sketch follows this list)
- Comparison "axes": later CPython versions, pybind11 versions / PR branches / forks
- OS: For now, just Linux ('cause that's all I use ;)
- Add additional metrics (e.g. memory leaks / usage, do a redux on compile-time + size)
- Ideally, provide guidance on what pybind11 finds the most important (how to weigh compile-time, size, speed, memory, etc.)
- (Stretch) Possibly compare against other approaches (Boost.Python, Cython, SWIG, cppyy / Julia / LLVM-esque stuff, etc.)
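As a rough sketch of what I mean by the initialization-time bullet (run time is sketched under the tooling section below): time a cold import of a compiled extension in a fresh interpreter, so dlopen, pybind11 internals startup, and binding registration are all included. `example_module` is just a placeholder for whatever benchmark extension we end up building; nothing here is settled.

```python
# Rough sketch, not a settled design: measure "initialization time" via a cold
# import in a fresh interpreter process. `example_module` is a placeholder name
# for a hypothetical benchmark extension.
import statistics
import subprocess
import sys

COLD_IMPORT_SNIPPET = """
import time
t0 = time.perf_counter()
import example_module  # placeholder for the benchmark extension
print(time.perf_counter() - t0)
"""


def cold_import_time(repeats=10):
    samples = []
    for _ in range(repeats):
        result = subprocess.run(
            [sys.executable, "-c", COLD_IMPORT_SNIPPET],
            capture_output=True, text=True, check=True,
        )
        samples.append(float(result.stdout))
    return statistics.median(samples)


if __name__ == "__main__":
    print(f"median cold import: {cold_import_time() * 1e3:.3f} ms")
```

Running each sample in a fresh subprocess sidesteps Python's import cache, so every repeat pays the full dlopen + registration cost.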
Given that performance benchmarks can be a P.I.T.A. (e.g. how to deal with OS noise + interrupts, hardware capacity / abstractions, blah blah), decisions should ideally be made about relative performance on the same machine. We should also publish some metrics for a given config to give people a "feel" for the performance, as was done for compile time.
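To make those "same machine" numbers publishable, each run could record its environment next to the results; a rough sketch (the field selection is just a guess at what we'd want to report):

```python
# Rough sketch: record the environment a benchmark run was made on, so any
# published numbers can be tied back to a concrete config. Fields here are a
# guess; pybind11 / compiler versions etc. would presumably be added.
import json
import platform
import sys


def environment_fingerprint():
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }


if __name__ == "__main__":
    print(json.dumps(environment_fingerprint(), indent=2))
```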
Suggested Solution Artifacts
- Identify code repository
  - Perhaps github.com/pybind/pybind-benchmarks?
  - Alt.: In-tree (which may make it hard to compare across versions...)
- Identify good tooling for performance benchmarking
  - @henryiii suggested pytest-benchmark
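For reference, a minimal pytest-benchmark sketch of the run-time side; example_module and its add(i, j) binding are hypothetical stand-ins for whatever the benchmark extension actually exposes:

```python
# Minimal pytest-benchmark sketch for call / conversion overhead.
# `example_module` (with an `add(i, j)` binding) is hypothetical; substitute
# whatever bound functions the benchmark suite ends up defining.
import pytest

example_module = pytest.importorskip("example_module")


def test_call_overhead(benchmark):
    # Times the full round trip of a trivial bound call:
    # argument conversion + dispatch + return-value conversion.
    result = benchmark(example_module.add, 1, 2)
    assert result == 3
```

IIRC pytest-benchmark can also save and compare runs, which might cover the "pybind11 versions / PR branches / forks" axis above.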
@wjakob @rwgk @rhaschke @YannickJadoul @bstaletic @henryiii @ax3l
Can I ask what y'all think? Is this redundant w.r.t. what we already have?