Reduce function calling overhead by ~25% in ReferencePool::update_counts: #1608

alex · 2021-05-15T14:13:26Z

1) Place both increfs and decrefs behind a single mutex, rather than two. Even uncontended mutexes aren't free to acquire.
2) Keep a dirty-tracking bool to avoid acquiring any mutexes at all in the common case of no modifications (because everything happened with the GIL held).
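
For readers who want the shape of the two changes without opening the diff, here is a minimal sketch. It is not the code from this PR: `Pool`, `ObjId`, `take_pending` and the field names are hypothetical, and a plain `usize` stands in for the Python object pointer so the example runs without Python.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a Python object pointer.
pub type ObjId = usize;

/// Simplified pool: both queues sit behind ONE mutex, and `dirty`
/// records whether anything was queued since the last drain.
#[derive(Default)]
pub struct Pool {
    /// (pending increfs, pending decrefs)
    ops: Mutex<(Vec<ObjId>, Vec<ObjId>)>,
    dirty: AtomicBool,
}

impl Pool {
    /// Queue an incref requested while the GIL was not held.
    pub fn register_incref(&self, obj: ObjId) {
        self.ops.lock().unwrap().0.push(obj);
        self.dirty.store(true, Ordering::Release);
    }

    /// Queue a decref requested while the GIL was not held.
    pub fn register_decref(&self, obj: ObjId) {
        self.ops.lock().unwrap().1.push(obj);
        self.dirty.store(true, Ordering::Release);
    }

    /// Take everything queued so far, or `None` if nothing was queued.
    pub fn take_pending(&self) -> Option<(Vec<ObjId>, Vec<ObjId>)> {
        // Fast path: a single atomic swap, no mutex acquisition at all.
        if !self.dirty.swap(false, Ordering::AcqRel) {
            return None;
        }
        // Slow path: one lock covers both queues; leave empty vectors behind.
        let mut ops = self.ops.lock().unwrap();
        Some(std::mem::take(&mut *ops))
    }
}
```

In this sketch the flag is set after the push, so a drain that races with a registration either picks the item up immediately or sees the flag on the next call; nothing queued is lost.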

Benchmarks

Before

(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 63.6 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 63.7 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 64.1 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 63.6 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 63.5 nsec per loop

With single lock

(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 51.9 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 52 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 52.9 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 51.8 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 54.7 nsec per loop

With both single lock and dirty flag

(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 47.2 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 46.7 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 46.6 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 46.6 nsec per loop
(.venv) alex@media-01:~/p/pyo3$ python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()"
5000000 loops, best of 5: 47.1 nsec per loop
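
To relate the three sets of timings above to the "~25%" in the title, the runs can be averaged and compared. The sketch below only does that arithmetic; the helper names `mean` and `reduction_pct` are made up for illustration, and the constants are copied from the `timeit` output above.

```rust
/// Timings in nanoseconds per loop, copied from the runs above.
pub const BEFORE: [f64; 5] = [63.6, 63.7, 64.1, 63.6, 63.5];
pub const SINGLE_LOCK: [f64; 5] = [51.9, 52.0, 52.9, 51.8, 54.7];
pub const LOCK_AND_FLAG: [f64; 5] = [47.2, 46.7, 46.6, 46.6, 47.1];

/// Arithmetic mean of a non-empty slice of timings.
pub fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Percentage by which `after` is lower than `before`.
pub fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```

With these numbers the means come out to about 63.7 ns, 52.7 ns and 46.8 ns, so the single lock alone saves roughly 17% and both changes together roughly 26% of the total `no_args()` call time.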

davidhewitt

Awesome! These are some really great optimizations which many people will be happy with. I think there's a potential deadlock though (which we can avoid), see comment below.

davidhewitt · 2021-05-15T16:20:46Z

src/gil.rs

            unsafe { ffi::Py_INCREF(ptr.as_ptr()) };
        }

-        for ptr in swap_vec_with_lock!(self.pointers_to_decref) {
+        for ptr in swap_vec!(ops.1) {


We need to be careful here. Decrementing a reference count can run arbitrary Python destructor code, which can release the GIL; because we still hold the lock here, that may eventually deadlock. (I think this mutex is also non-reentrant, so it may deadlock even on a single thread.)

After swapping ops.1 we should release the lock before decreasing any references.

Good call, should be good now.
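
The pattern asked for in this thread (swap the vector out, release the lock, only then touch reference counts) can be sketched as follows. This is an illustration under assumptions, not the merged code: `Pool`, `drain_decrefs` and the `decref` callback are hypothetical, and the callback stands in for `Py_DECREF` running arbitrary destructor code.

```rust
use std::sync::Mutex;

/// Hypothetical simplified pool holding only pending decrefs.
#[derive(Default)]
pub struct Pool {
    pending_decrefs: Mutex<Vec<u32>>,
}

impl Pool {
    /// Queue a decref for later.
    pub fn register_decref(&self, id: u32) {
        self.pending_decrefs.lock().unwrap().push(id);
    }

    /// Apply all queued decrefs by calling `decref` on each.
    pub fn drain_decrefs(&self, mut decref: impl FnMut(u32)) {
        // Swap the queue out while holding the lock...
        let drained = {
            let mut guard = self.pending_decrefs.lock().unwrap();
            std::mem::take(&mut *guard)
        }; // ...and release the lock here, before any callback runs.

        // The callback may re-enter the pool without deadlocking,
        // because the mutex is no longer held.
        for id in drained {
            decref(id);
        }
    }
}
```

If the loop ran while the guard was still alive, a callback that called `register_decref` would try to lock the same `std::sync::Mutex` again on the same thread, which the standard library documents as either a deadlock or a panic.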

davidhewitt

Looks great! Thanks very much 😌

kngwyu · 2021-05-16T06:11:25Z

Looks like I was late for the party... Very interesting to see that this optimization works, thanks.

alex force-pushed the opt-no-args branch from 60c89e5 to 13ddb72 Compare May 15, 2021 14:26

alex mentioned this pull request May 15, 2021

PyO3 performance analysis: function overheads #1607

Closed

davidhewitt requested changes May 15, 2021

alex force-pushed the opt-no-args branch from 13ddb72 to 56973d7 Compare May 15, 2021 17:19

davidhewitt approved these changes May 15, 2021

davidhewitt merged commit c48e5b0 into PyO3:main May 15, 2021

alex deleted the opt-no-args branch May 15, 2021 23:53

alex mentioned this pull request Jun 17, 2023

Two minor pool optimizations #3250

Merged

alex mentioned this pull request May 11, 2024

add flag to skip reference pool mutex if the program doesn't use the pool #4174

Closed
