x86_64: Use lock or instead of mfence #156

Merged: taiki-e merged 1 commit into main from x86_64-mfence on Mar 21, 2024

taiki-e
Owner

@taiki-e taiki-e commented Mar 21, 2024

UPDATE: this implementation has since been replaced with a faster one; see #156 (comment).


Based on the code LLVM generates for a 64-bit atomic SeqCst store on x86_32 with SSE. https://godbolt.org/z/9sKEr8YWc

Equivalent to mfence, but 10-35% faster, at least in simple cases on Coffee Lake.
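A minimal sketch of the idea, not the code from this PR: the barrier is a locked read-modify-write that ORs 0 into the word at the top of the stack, so the memory contents do not change. The names `fence_lock_or` and `store_then_fence` are hypothetical, and the non-x86_64 fallback exists only so the snippet builds elsewhere.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Full barrier via a locked RMW on the stack (hypothetical helper name).
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn fence_lock_or() {
    // SAFETY: OR-ing 0 into the qword at [rsp] leaves its value unchanged.
    // No `nomem`/`preserves_flags` options are given, so the compiler assumes
    // the block touches memory and clobbers flags (`or` does write flags).
    unsafe {
        core::arch::asm!("lock or qword ptr [rsp], 0");
    }
}

/// Fallback for other targets so the sketch stays buildable (assumption).
#[cfg(not(target_arch = "x86_64"))]
#[inline]
pub fn fence_lock_or() {
    std::sync::atomic::fence(Ordering::SeqCst);
}

/// Mirrors the "plain store followed by a barrier" shape described above.
pub fn store_then_fence(dst: &AtomicU64, val: u64) {
    dst.store(val, Ordering::Relaxed);
    fence_lock_or();
}
```

The sketch uses a 64-bit `AtomicU64` for brevity; the benchmarks in this PR exercise 128-bit stores.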

Below are the results of the microbenchmark on an Intel Core i7-9750H (Coffee Lake) with the ORDERING constant set to SeqCst.

bench_portable_atomic_arch/u128_store
                        time:   [11.610 ns 11.670 ns 11.738 ns]
                        change: [-36.119% -35.236% -34.348%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [202.30 µs 203.54 µs 205.24 µs]
                        change: [-32.313% -31.167% -29.845%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [395.74 µs 397.37 µs 398.98 µs]
                        change: [-18.517% -17.560% -16.582%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [791.21 µs 793.43 µs 795.69 µs]
                        change: [-10.682% -10.197% -9.6789%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

@taiki-e taiki-e added the O-x86 Target: x86/x64 processors label Mar 21, 2024
@taiki-e taiki-e force-pushed the x86_64-mfence branch 2 times, most recently from a0d7f3f to 817f51e, on March 21, 2024 16:49
@taiki-e taiki-e merged commit 6267661 into main Mar 21, 2024
98 checks passed
@taiki-e taiki-e deleted the x86_64-mfence branch March 21, 2024 19:10
taiki-e added a commit that referenced this pull request Jul 19, 2024
@taiki-e
Owner Author

taiki-e commented Jul 19, 2024

About 3 months ago, I created another version of this based on the guess I described in smol-rs/event-listener#71 (comment), and ran the same microbenchmark on it.
The results were generally as expected, and the main branch now uses xchg (the lock prefix is implied).
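A rough sketch of what an xchg-based barrier could look like, assuming the "local uninit mem" and "uninit reg" labels in the results below mean an uninitialized stack slot and a scratch register. The name `fence_xchg` is hypothetical and this is not necessarily the code on the main branch.

```rust
use core::mem::MaybeUninit;

/// Full barrier via `xchg` with a memory operand (hypothetical helper name).
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn fence_xchg() {
    // Scratch stack slot; its contents are never read from Rust code.
    let mut slot = MaybeUninit::<u64>::uninit();
    // SAFETY: `slot` is valid, aligned, writable memory owned by this frame.
    // `xchg` with a memory operand is implicitly locked and does not modify
    // flags, so `preserves_flags` holds. `out(reg) _` lets the compiler pick
    // a scratch register that is distinct from the input pointer register.
    unsafe {
        core::arch::asm!(
            "xchg qword ptr [{p}], {tmp}",
            p = in(reg) slot.as_mut_ptr(),
            tmp = out(reg) _,
            options(nostack, preserves_flags),
        );
    }
}

/// Fallback for other targets so the sketch stays buildable (assumption).
#[cfg(not(target_arch = "x86_64"))]
#[inline]
pub fn fence_xchg() {
    let _slot = MaybeUninit::<u64>::uninit();
    core::sync::atomic::fence(core::sync::atomic::Ordering::SeqCst);
}
```

Using a slot private to the current frame, rather than `[rsp]`, means the locked operation targets memory that nothing else in the function depends on.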

Intel Core i7-9750H, Coffee Lake

u128_store
- xchg qword ptr {local uninit mem}, {uninit reg}  5.7455 ns
- xchg dword ptr {local uninit mem}, {uninit reg}  5.7468 ns
- lock not qword ptr {local uninit mem}     5.9586 ns
- lock not dword ptr {local uninit mem}     5.9857 ns
- lock or qword ptr {local uninit mem}, 0   5.9016 ns
- lock or dword ptr {local uninit mem}, 0   5.9458 ns
- lock or qword ptr {sp}, 0                 11.535 ns
- lock or dword ptr {sp}, 0                 11.555 ns
- mfence                                    17.836 ns

u128_concurrent_load_store
- xchg qword ptr {local uninit mem}, {uninit reg}  179.06 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  180.54 µs
- lock not qword ptr {local uninit mem}     216.10 µs
- lock not dword ptr {local uninit mem}     181.44 µs
- lock or qword ptr {local uninit mem}, 0   185.32 µs
- lock or dword ptr {local uninit mem}, 0   204.28 µs
- lock or qword ptr {sp}, 0                 216.74 µs
- lock or dword ptr {sp}, 0                 214.76 µs
- mfence                                    299.72 µs

u128_concurrent_store
- xchg qword ptr {local uninit mem}, {uninit reg}  272.36 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  272.42 µs
- lock not qword ptr {local uninit mem}     386.58 µs
- lock not dword ptr {local uninit mem}     278.82 µs
- lock or qword ptr {local uninit mem}, 0   288.11 µs
- lock or dword ptr {local uninit mem}, 0   334.91 µs
- lock or qword ptr {sp}, 0                 354.49 µs
- lock or dword ptr {sp}, 0                 395.22 µs
- mfence                                    466.75 µs

u128_concurrent_store_swap (flaky)
- xchg qword ptr {local uninit mem}, {uninit reg}  706.18 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  712.04 µs
- lock not qword ptr {local uninit mem}     832.51 µs
- lock not dword ptr {local uninit mem}     745.58 µs
- lock or qword ptr {local uninit mem}, 0   742.42 µs
- lock or dword ptr {local uninit mem}, 0   772.04 µs
- lock or qword ptr {sp}, 0                 782.62 µs
- lock or dword ptr {sp}, 0                 842.42 µs
- mfence                                    876.23 µs

Intel Core i7-13800H, Raptor Lake-H

u128_store
- xchg qword ptr {local uninit mem}, {uninit reg}  4.6275 ns
- xchg dword ptr {local uninit mem}, {uninit reg}  4.6272 ns
- lock not qword ptr {local uninit mem}     4.8203 ns
- lock not dword ptr {local uninit mem}     5.0133 ns
- lock or qword ptr {local uninit mem}, 0   4.8201 ns
- lock or dword ptr {local uninit mem}, 0   5.0131 ns
- lock or qword ptr {sp}, 0                 9.0605 ns
- lock or dword ptr {sp}, 0                 9.0607 ns
- mfence                                    11.312 ns

u128_concurrent_load_store
- xchg qword ptr {local uninit mem}, {uninit reg}  126.10 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  127.39 µs
- lock not qword ptr {local uninit mem}     136.86 µs
- lock not dword ptr {local uninit mem}     139.22 µs
- lock or qword ptr {local uninit mem}, 0   131.83 µs
- lock or dword ptr {local uninit mem}, 0   132.02 µs
- lock or qword ptr {sp}, 0                 180.00 µs
- lock or dword ptr {sp}, 0                 183.95 µs
- mfence                                    193.67 µs

u128_concurrent_store
- xchg qword ptr {local uninit mem}, {uninit reg}  204.95 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  204.43 µs
- lock not qword ptr {local uninit mem}     222.19 µs
- lock not dword ptr {local uninit mem}     223.60 µs
- lock or qword ptr {local uninit mem}, 0   213.67 µs
- lock or dword ptr {local uninit mem}, 0   224.38 µs
- lock or qword ptr {sp}, 0                 308.61 µs
- lock or dword ptr {sp}, 0                 312.73 µs
- mfence                                    339.74 µs

u128_concurrent_store_swap (flaky)
- xchg qword ptr {local uninit mem}, {uninit reg}  328.78 µs
- xchg dword ptr {local uninit mem}, {uninit reg}  328.51 µs
- lock not qword ptr {local uninit mem}     418.78 µs
- lock not dword ptr {local uninit mem}     339.05 µs
- lock or qword ptr {local uninit mem}, 0   338.19 µs
- lock or dword ptr {local uninit mem}, 0   337.84 µs
- lock or qword ptr {sp}, 0                 398.34 µs
- lock or dword ptr {sp}, 0                 412.58 µs
- mfence                                    403.29 µs
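To put the u128_store rows in relative terms, the following sketch computes speedups over mfence from the qword timings listed above. The type and function names are hypothetical; only the numbers come from the results.

```rust
/// u128_store timings in nanoseconds, copied from the qword rows above.
pub struct StoreTimings {
    pub cpu: &'static str,
    pub xchg_ns: f64,
    pub lock_or_sp_ns: f64,
    pub mfence_ns: f64,
}

pub const U128_STORE: [StoreTimings; 2] = [
    StoreTimings { cpu: "Coffee Lake", xchg_ns: 5.7455, lock_or_sp_ns: 11.535, mfence_ns: 17.836 },
    StoreTimings { cpu: "Raptor Lake-H", xchg_ns: 4.6275, lock_or_sp_ns: 9.0605, mfence_ns: 11.312 },
];

/// How many times faster `candidate_ns` is than `baseline_ns`.
pub fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

/// One-line summary of both alternatives relative to mfence.
pub fn summarize(t: &StoreTimings) -> String {
    format!(
        "{}: xchg {:.2}x, lock or [sp] {:.2}x faster than mfence",
        t.cpu,
        speedup(t.mfence_ns, t.xchg_ns),
        speedup(t.mfence_ns, t.lock_or_sp_ns),
    )
}
```

On these numbers, xchg comes out at roughly 3.1x faster than mfence on Coffee Lake and roughly 2.4x on Raptor Lake-H for the single-threaded store case; the concurrent benchmarks show smaller gaps.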
