-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arm64: Memory barrier improvements #62895
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue Details
|
I haven't reviewed this PR in detail, but I would just advise caution when it comes to memory barriers on ARM64: Basically, some of the instructions don't quite have the semantics you might expect. |
I see approx. 0.58% and 0.28% improvement in RPS in mvc and json benchmarks respectively. I believe these are in error range. |
Thanks @alexrp . This PR just extends the optimal memory barriers for |
@dotnet/jit-contrib |
LGTM, @VSadov could you please take a quick look if you have time
used to emit a full memory barrier, now it emits a store-only one. Also, for most cases stores/loads for variables marked as "volatile" now actually don't emit memory barriers at all and use e.g. |
src/coreclr/jit/codegenarm64.cpp
Outdated
@@ -3280,7 +3280,7 @@ void CodeGen::genCodeForStoreInd(GenTreeStoreInd* tree) | |||
else | |||
{ | |||
// issue a full memory barrier before a volatile StInd | |||
instGen_MemoryBarrier(); | |||
instGen_MemoryBarrier(BARRIER_STORE_ONLY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the typical scenario when this branch is taken? When value is a struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand the purpose of this barrier is to have release semantics for storing into volatile variable that does not fit into a single register (otherwise stlr could be used).
I think this needs a full barrier, since unlike stlr
, dmb ishst
only waits for stores in progress and has no effect on loads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, ldar
can be replaced with ldr; dmb ishld
, but there is no such equivalency between stlr
and dmb ishst; str
because ishst is too weak.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From https://developer.arm.com/documentation/100941/0100/Barriers, trying to understand what you stated.
Basically, for ishst
, loads can still be reordered around barrier and hence it is weaker than the ishld
where loads/stores need to wait till the barrier is complete.
If we change from full barrier to ishst
, we might have a load that should have been completed but got reordered and might end up reading the wrong value (pre-updated value). Is my understanding correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, loads that appear after ishst, in program order, may speculatively happen ahead of the store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static volatile int x;
static volatile int y;
static int xx;
static int yy;
---- one thread:
x = 42;
yy = y;
--- another thread:
y = 42;
xx = x;
Can both xx
and yy
end up 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I will revert the change related to shst
then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Improvement on ubuntu-arm64: dotnet/perf-autofiling-issues#2981 |
volatile
variable which makes the speed ofvolatile
declared variables as observed in [arm64] Volatile.Read/Write is 2x faster than "volatile" loads/stores #60232. Thanks @EgorBo for suggesting the solution as well.