x86_64: Use `lock or` instead of mfence #156

taiki-e · 2024-03-21T16:46:17Z

UPDATE: this implementation has since been replaced with a faster one; see #156 (comment).

Based on the code LLVM generates for a 64-bit atomic SeqCst store on x86_32 using SSE. https://godbolt.org/z/9sKEr8YWc

This is equivalent to mfence, but 10-35% faster, at least in simple cases on Coffee Lake.
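For illustration, the idea can be sketched as a standalone fence function. This is not the code from this PR: the function name `fence_seqcst_lock_or`, the `dword` operand width, and the choice of `[rsp]` as the target are assumptions made for the example. A locked read-modify-write is a full barrier on x86, and OR-ing 0 leaves the memory it touches unchanged.

```rust
// Sketch only: the name, the operand width, and the use of [rsp] are
// assumptions for illustration, not the exact code in this PR.
#[cfg(target_arch = "x86_64")]
pub fn fence_seqcst_lock_or() {
    // SAFETY: OR-ing 0 into the word at the top of the stack does not change
    // its value. No `nomem`/`preserves_flags` options are given, so the
    // compiler treats the asm as touching memory and clobbering flags.
    unsafe {
        core::arch::asm!("lock or dword ptr [rsp], 0");
    }
}

// Fallback so the sketch still compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn fence_seqcst_lock_or() {
    core::sync::atomic::fence(core::sync::atomic::Ordering::SeqCst);
}
```

A store that needs SeqCst semantics would issue its plain store first and then call this function in place of mfence.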

Below are the results of the microbenchmark on an Intel Core i7-9750H (Coffee Lake) with the ORDERING constant set to SeqCst.

bench_portable_atomic_arch/u128_store
                        time:   [11.610 ns 11.670 ns 11.738 ns]
                        change: [-36.119% -35.236% -34.348%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [202.30 µs 203.54 µs 205.24 µs]
                        change: [-32.313% -31.167% -29.845%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [395.74 µs 397.37 µs 398.98 µs]
                        change: [-18.517% -17.560% -16.582%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [791.21 µs 793.43 µs 795.69 µs]
                        change: [-10.682% -10.197% -9.6789%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe



taiki-e · 2024-07-19T19:56:58Z

I created another version of this based on my guess described in smol-rs/event-listener#71 (comment), and ran a microbenchmark in the same way about 3 months ago.
The results were generally as expected, and the main branch now uses xchg (the lock prefix is implied when xchg has a memory operand).
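A minimal sketch of the xchg variant, following the `xchg qword ptr {local uninit mem}, {uninit reg}` form benchmarked here. The function name `fence_seqcst_xchg` and the exact operand and option choices are assumptions for the example and may differ from what the main branch does.

```rust
// Sketch only: the name and the operand/option choices are assumptions.
#[cfg(target_arch = "x86_64")]
pub fn fence_seqcst_xchg() {
    // Uninitialized local slot: its contents are never used, it only gives
    // xchg a private memory operand.
    let mut slot = core::mem::MaybeUninit::<u64>::uninit();
    // SAFETY: the pointer refers to a live, properly aligned local. The value
    // exchanged into the scratch register is discarded. xchg does not modify
    // flags, and the asm does not push to the stack.
    unsafe {
        core::arch::asm!(
            "xchg qword ptr [{p}], {tmp}",
            p = in(reg) slot.as_mut_ptr(),
            tmp = out(reg) _,
            options(nostack, preserves_flags),
        );
    }
}

// Fallback so the sketch still compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn fence_seqcst_xchg() {
    core::sync::atomic::fence(core::sync::atomic::Ordering::SeqCst);
}
```

Unlike `lock or` on the stack pointer, this form targets a fresh local rather than whatever currently sits at the top of the stack, which matches the `{local uninit mem}` rows in the results.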

Intel Core i7-9750H, Coffee Lake

u128_store
- xchg qword ptr {local uninit mem}, {uninit reg}  5.7455 ns
- xchg dword ptr {local uninit mem}, {uninit reg}  5.7468 ns
- lock not qword ptr {local uninit mem}     5.9586 ns
- lock not dword ptr {local uninit mem}     5.9857 ns
- lock or qword ptr {local uninit mem}, 0   5.9016 ns
- lock or dword ptr {local uninit mem}, 0   5.9458 ns
- lock or qword ptr {sp}, 0                 11.535 ns
- lock or dword ptr {sp}, 0                 11.555 ns
- mfence                                    17.836 ns

u128_concurrent_load_store
- xchg qword ptr {local uninit mem}, {uninit reg}  179.06 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  180.54 µs
- lock not qword ptr {local uninit mem}     216.10 µs
- lock not dword ptr {local uninit mem}     181.44 µs
- lock or qword ptr {local uninit mem}, 0   185.32 µs
- lock or dword ptr {local uninit mem}, 0   204.28 µs
- lock or qword ptr {sp}, 0                 216.74 µs
- lock or dword ptr {sp}, 0                 214.76 µs
- mfence                                    299.72 µs

u128_concurrent_store
- xchg qword ptr {local uninit mem}, {uninit reg}  272.36 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  272.42 µs
- lock not qword ptr {local uninit mem}     386.58 µs
- lock not dword ptr {local uninit mem}     278.82 µs
- lock or qword ptr {local uninit mem}, 0   288.11 µs
- lock or dword ptr {local uninit mem}, 0   334.91 µs
- lock or qword ptr {sp}, 0                 354.49 µs
- lock or dword ptr {sp}, 0                 395.22 µs
- mfence                                    466.75 µs

u128_concurrent_store_swap (flaky)
- xchg qword ptr {local uninit mem}, {uninit reg}  706.18 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  712.04 µs
- lock not qword ptr {local uninit mem}     832.51 µs
- lock not dword ptr {local uninit mem}     745.58 µs
- lock or qword ptr {local uninit mem}, 0   742.42 µs
- lock or dword ptr {local uninit mem}, 0   772.04 µs
- lock or qword ptr {sp}, 0                 782.62 µs
- lock or dword ptr {sp}, 0                 842.42 µs
- mfence                                    876.23 µs

Intel Core i7-13800H, Raptor Lake-H

u128_store
- xchg qword ptr {local uninit mem}, {uninit reg}  4.6275 ns
- xchg dword ptr {local uninit mem}, {uninit reg}  4.6272 ns
- lock not qword ptr {local uninit mem}     4.8203 ns
- lock not dword ptr {local uninit mem}     5.0133 ns
- lock or qword ptr {local uninit mem}, 0   4.8201 ns
- lock or dword ptr {local uninit mem}, 0   5.0131 ns
- lock or qword ptr {sp}, 0                 9.0605 ns
- lock or dword ptr {sp}, 0                 9.0607 ns
- mfence                                    11.312 ns

u128_concurrent_load_store
- xchg qword ptr {local uninit mem}, {uninit reg}  126.10 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  127.39 µs
- lock not qword ptr {local uninit mem}     136.86 µs
- lock not dword ptr {local uninit mem}     139.22 µs
- lock or qword ptr {local uninit mem}, 0   131.83 µs
- lock or dword ptr {local uninit mem}, 0   132.02 µs
- lock or qword ptr {sp}, 0                 180.00 µs
- lock or dword ptr {sp}, 0                 183.95 µs
- mfence                                    193.67 µs

u128_concurrent_store
- xchg qword ptr {local uninit mem}, {uninit reg}  204.95 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  204.43 µs
- lock not qword ptr {local uninit mem}     222.19 µs
- lock not dword ptr {local uninit mem}     223.60 µs
- lock or qword ptr {local uninit mem}, 0   213.67 µs
- lock or dword ptr {local uninit mem}, 0   224.38 µs
- lock or qword ptr {sp}, 0                 308.61 µs
- lock or dword ptr {sp}, 0                 312.73 µs
- mfence                                    339.74 µs

u128_concurrent_store_swap (flaky)
- xchg qword ptr {local uninit mem}, {uninit reg}  328.78 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  328.51 µs
- lock not qword ptr {local uninit mem}     418.78 µs
- lock not dword ptr {local uninit mem}     339.05 µs
- lock or qword ptr {local uninit mem}, 0   338.19 µs
- lock or dword ptr {local uninit mem}, 0   337.84 µs
- lock or qword ptr {sp}, 0                 398.34 µs
- lock or dword ptr {sp}, 0                 412.58 µs
- mfence                                    403.29 µs
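To put the u128_store numbers above in relative terms, the small helper below computes the speedup and time reduction of xchg over mfence. The helper names and constants are introduced only for this example; the timings are copied from the lists above.

```rust
// Timings (ns) copied from the u128_store results above; the names are
// illustrative only.
pub const COFFEE_LAKE_MFENCE_NS: f64 = 17.836;
pub const COFFEE_LAKE_XCHG_NS: f64 = 5.7455;
pub const RAPTOR_LAKE_MFENCE_NS: f64 = 11.312;
pub const RAPTOR_LAKE_XCHG_NS: f64 = 4.6275;

/// How many times faster the `fast` variant is than the `slow` one.
pub fn speedup(slow_ns: f64, fast_ns: f64) -> f64 {
    slow_ns / fast_ns
}

/// Percentage of time saved by switching from `slow` to `fast`.
pub fn time_reduction_percent(slow_ns: f64, fast_ns: f64) -> f64 {
    (1.0 - fast_ns / slow_ns) * 100.0
}
```

With these inputs, xchg comes out roughly 3.1x faster than mfence on Coffee Lake and roughly 2.4x faster on Raptor Lake-H for the single-threaded store case; the concurrent cases show smaller gaps.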

taiki-e added the O-x86 Target: x86/x64 processors label Mar 21, 2024

taiki-e force-pushed the x86_64-mfence branch 2 times, most recently from a0d7f3f to 817f51e Compare March 21, 2024 16:49

taiki-e force-pushed the x86_64-mfence branch from 817f51e to 51524fc Compare March 21, 2024 16:54

taiki-e merged commit 6267661 into main Mar 21, 2024
98 checks passed

taiki-e deleted the x86_64-mfence branch March 21, 2024 19:10

taiki-e added a commit that referenced this pull request Jul 19, 2024

x86_64: Use xchg instead of lock or

0483042

Follow-up to #156.
