Optimize x86 atomic_fence #328
Merged: alexey-katranov merged 3 commits into uxlfoundation:master from Lastique:optimize_x86_fence on Dec 22, 2021
Conversation
Lastique force-pushed the optimize_x86_fence branch from 0ab68f2 to 06a56b3 on November 25, 2021, 13:45
Lastique force-pushed the optimize_x86_fence branch from 06a56b3 to 1d22948 on November 25, 2021, 17:23
Added optimized x86 atomic_fence for gcc-compatible compilers.

On x86 (32- and 64-bit), any lock-prefixed instruction provides sequential consistency guarantees for atomic operations and is more efficient than mfence. We are choosing a "lock not" on a dummy byte on the stack for the following reasons:

- The "not" instruction does not affect flags or clobber any registers. The memory operand is presumably accessible through esp/rsp.
- The dummy byte variable is at the top of the stack, which is likely hot in cache.
- The dummy variable does not alias any other data on the stack, which means the "lock not" instruction won't introduce any false data dependencies with prior or subsequent instructions.

To avoid complaints from various sanitizers and valgrind, we have to initialize the dummy variable to zero prior to the operation.

Additionally, for memory orders weaker than seq_cst there is no need for any special instructions; a compiler fence is enough. For the relaxed memory order, we don't even need that.

This optimization is only enabled for gcc before version 11, since gcc 11 implements a similar optimization for std::atomic_thread_fence itself. Compilers compatible with gcc (namely, clang up to 13 and icc up to 2021.3.0, inclusive) identify themselves as gcc < 11 and also benefit from this optimization, as they otherwise generate mfence for std::atomic_thread_fence(std::memory_order_seq_cst).

Signed-off-by: Andrey Semashev <andrey.semashev@gmail.com>
Removed explicit mfence in atomic_fence on Windows.

The instructions required by the memory order argument should already be generated by std::atomic_thread_fence.

Signed-off-by: Andrey Semashev <andrey.semashev@gmail.com>
Removed memory order argument from atomic_fence.

The code uses memory_order_seq_cst at all call sites of atomic_fence, so remove the argument and simplify the implementation a bit. Also renamed the function to make the memory order it implements apparent.

Signed-off-by: Andrey Semashev <andrey.semashev@gmail.com>
Lastique force-pushed the optimize_x86_fence branch from 1d22948 to 8feefce on November 25, 2021, 22:30
alexey-katranov approved these changes on Nov 26, 2021
kboyarinov pushed a commit that referenced this pull request on Dec 27, 2021
* Added optimized x86 atomic_fence for gcc-compatible compilers.
* Removed explicit mfence in atomic_fence on Windows.
* Removed memory order argument from atomic_fence.

Signed-off-by: Andrey Semashev <andrey.semashev@gmail.com>
The first commit provides an optimized atomic_fence implementation for x86 on gcc-compatible compilers. On x86 (32- and 64-bit), any lock-prefixed instruction provides sequential consistency guarantees for atomic operations and is more efficient than mfence. You can see some tests in this article.

We are choosing a "lock not" on a dummy byte on the stack for the following reasons:

- The "not" instruction does not affect flags or clobber any registers. The memory operand is presumably accessible through esp/rsp.
- The dummy byte variable is at the top of the stack, which is likely hot in cache.
- The dummy variable does not alias any other data on the stack, which means the "lock not" instruction won't introduce any false data dependencies with prior or subsequent instructions.
To avoid complaints from various sanitizers and valgrind, we have to initialize the dummy variable to zero prior to the operation.

Additionally, for memory orders weaker than seq_cst there is no need for any special instructions; a compiler fence is enough. For the relaxed memory order, we don't even need that.
The second commit removes the explicit mfence on Windows. The existing std::atomic_thread_fence already generates the instructions needed to enforce the memory order given by its argument.