Batched write barrier for byrefs #99096
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Testing on CI, for now - x64 non-AOT only. Presumably, the asm version can be better than just a loop over R8 (length), e.g. when dest is not in the heap, the remaining iterations can skip that check and run faster. Example:

struct MyStruct
{
    object a1;
    object a2;
    object a3;
    string a4;
    string a5;
    string a6;
}

static MyStruct Test(MyStruct ms) => ms; // simple copy of a struct with gc fields

Current codegen:

; Assembly listing for method Prog:Test(MyStruct):MyStruct
G_M32391_IG01:
push rdi
push rsi
push rbx
mov rbx, rcx
G_M32391_IG02:
mov rdi, rbx
mov rsi, rdx
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
mov rax, rbx
G_M32391_IG03:
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 49

New codegen:

; Assembly listing for method Prog:Test(MyStruct):MyStruct
G_M32391_IG01:
push rdi
push rsi
push rbx
mov rbx, rcx
G_M32391_IG02:
mov rdi, rbx
mov rsi, rdx
mov r8d, 6
call CORINFO_HELP_ASSIGN_BYREF_BATCH
mov rax, rbx
G_M32391_IG03:
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 30
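For intuition, here is a rough managed sketch of what the batched helper amounts to (illustrative only - the real CORINFO_HELP_ASSIGN_BYREF_BATCH is hand-written assembly that also checks whether the destination is in the GC heap and marks card-table entries; the C# names below are made up):

using System.Runtime.CompilerServices;

static class ByRefBarrierSketch
{
    // Conceptual model: copy 'count' object references from src to dst,
    // advancing one slot per iteration. Each iteration corresponds roughly
    // to what a single CORINFO_HELP_ASSIGN_BYREF call does today.
    public static void AssignByRefBatch(ref object dst, ref object src, int count)
    {
        for (int i = 0; i < count; i++)
        {
            Unsafe.Add(ref dst, i) = Unsafe.Add(ref src, i);
        }
    }
}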
add     rdi, 8h
add     rsi, 8h
dec     r8d
jne     NextByref
This loop can hang GC thread suspension.
Can you please elaborate? Is it because of shadow gc?
We won't be able to suspend the thread for GC while this helper is running. If this helper is called repeatedly for large blocks and there is not much managed code running between subsequent invocations of the helper, GC thread suspension can be very slow or hang.

The same problem existed in BulkMoveWithWriteBarrier. It was fixed in dotnet/coreclr#27776. You should be able to modify the repro from that PR to use InlineArray to hit the hang with this helper.
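A rough sketch of the kind of repro that could hit this (hypothetical - it assumes .NET 8+ InlineArray support and the names are made up; the actual repro in dotnet/coreclr#27776 is different):

using System.Runtime.CompilerServices;

// Copying this struct by value has to move 10,000 contiguous object-reference
// slots through the write barrier in a single helper invocation.
[InlineArray(10_000)]
struct BigRefStruct
{
    private object _element0;
}

class SuspensionRepro
{
    // Same shape as the PR example: the copy into the return buffer goes
    // through the byref write-barrier helper.
    static BigRefStruct Copy(BigRefStruct value) => value;

    static void Main()
    {
        var src = new BigRefStruct();
        for (int i = 0; i < 10_000; i++)
            src[i] = new object(); // non-null references so the barrier does real work

        // Tight loop: very little interruptible managed code runs between the
        // long-running helper invocations, which is what stalls GC suspension.
        while (true)
            src = Copy(src);
    }
}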
I can limit the max batch size, e.g. 1024 gc slots will be handled as 16 CORINFO_HELP_ASSIGN_BYREF_BATCH calls.
I would measure the worst-case scenario, e.g. call MyStruct Test(MyStruct ms) => ms;
in a loop on one thread and measure average GC pause time on a second thread.
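A minimal sketch of that measurement (illustrative; it reuses the MyStruct from the example above and is not the microbenchmark from the gist linked below):

using System;
using System.Diagnostics;
using System.Threading;

struct MyStruct { public object a1, a2, a3; public string a4, a5, a6; }

class GcPauseMeasurement
{
    static MyStruct s_value;

    static MyStruct Test(MyStruct ms) => ms;

    static void Main()
    {
        // Thread 1: copy the struct as fast as possible, so the thread spends
        // most of its time in the write-barrier helper.
        new Thread(() => { while (true) s_value = Test(s_value); }) { IsBackground = true }.Start();

        // Thread 2 (main): force GCs; GC.Collect cannot finish until the copy
        // thread is suspended, so its duration approximates the GC pause.
        while (true)
        {
            var sw = Stopwatch.StartNew();
            GC.Collect();
            Console.WriteLine($"GC.Collect: {sw.ElapsedMilliseconds} ms");
            Thread.Sleep(100);
        }
    }
}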
From what I see from the diffs, we're mostly dealing with small values like 2 (most popular) and less than 10
Ok, then the math does not work - there is something else going on. (For some reason, I thought that the batch size is 16 in your current change.)
If I change the microbenchmark so that the write barrier has to do actual work, I am seeing 15sec - 30sec pause times with your current change: https://gist.github.com/jkotas/d9b48e48935bcc5a6f3dc87db086196d
Don't know why, but this benchmark shows better numbers in my branch than in Main (with 16 slots max)
Maybe because the method is just too big in Main and that made the gcinfo slower to process? (; Total bytes of code 5012)
Observation: The helper is a functional equivalent of BulkMoveWithWriteBarrier.

Thanks! Presumably it's easier to just call BulkMoveWithWriteBarrier then? One thing that motivated me is that we've seen write barriers on hot paths in several real-world benchmarks (e.g. @kunalspathak did recently). I am not sure this specific one was the culprit (they're generally less advanced on arm64), it just looked like low-hanging fruit to me.
BulkMoveWithWriteBarrier has higher fixed overhead than the hand-written assembly helper. I expect that it would be profitable to call it only for blocks above a certain size.

Is this addressing the hot paths that we have seen?
Not sure yet, but jit-diffs imply they're not rare (jit-diffs are usually way smaller than SPMI because we only run them for the framework libs - no tests, benchmarks, aspnet, etc.).
We saw it more on arm64 and we do not know yet if too many samples of the write barriers are attributed to this particular pattern.
It's just for the draft, I planned to support x64 (windows, unix) and arm64 (windows, unix) for both CoreCLR and NativeAOT.
Here is the current diff between the non-batched (left) and batched (right) version: https://www.diffchecker.com/clp44lCx/
@jkotas Does it look good now? I changed to 16 slots max and implemented Windows-x64 + Unix-x64 on both CoreCLR and NativeAOT. I'd like to handle ARM64 in a separate PR if you don't mind. To simplify code review of the newly added helpers, I made diffs for each against its non-batched version.
/azp list
/azp run runtime-coreclr outerloop, runtime-coreclr jitstressregs, runtime-coreclr gcstress0x3-gcstress0xc, runtime-nativeaot-outerloop

Azure Pipelines successfully started running 4 pipeline(s).
@VSadov What do you think about the impact of this change on GC thread suspension? Have you seen bad cases where the current write barriers are stalling GC thread suspension? I am wondering where we should set the bar for the GC thread suspension impact on changes like this one. There are multiple ways to make it more friendly to GC thread suspension - explicitly poll in the helper, explicit GC polls in the caller, enable hijacking of the return address from these helpers, etc.
My first thought about this is that the barriers are the kind of platform- and architecture-specific pieces of assembly that we would usually avoid for portability reasons. With CoreCLR/NativeAOT + 6 architectures + Win/Unix - how many combinations will need a copy of the new barrier? Do we get enough savings for that? It looks like it saves a call. It also hoists the "not in heap" check, although that will help only if the destination really is not in the heap and we short-circuit; otherwise that check is cheap. The rest of the cost - the other checks, the lock or, the actual assignment - is still there. What happens to the benchmark if the write is to a bunch of array elements instead of a single stack location?
I kind of suspect the original issue expected more "batching" - like hoisting other checks, setting cards in bulk, etc. But that will turn the barrier into BulkMoveWithWriteBarrier, which we already have.
;; r8: number of byrefs
;;
;; On exit:
;;    rdi, rsi are incremented by 8,
Shouldn't they be incremented by the total number of bytes processed? The same applies to the comments in the other asm code.
Not much indeed. I don't have a strong opinion here; since it raises concerns/doubts, I am fine with closing it before I invest my time into the arm side of it 🤷‍♂️ The case where I noticed this issue was tuples, e.g.: (string, string) StructCopy((string, string) s) => s;
Closing, will try to benchmark/compare arm64 write barriers (not only the byref one) against x64 to see if I can contribute there instead.
Closes #8627

Testing on CI, for now - win-x64 coreclr only. Presumably, the asm version can be better than just a copy of JIT_ByRefWriteBarrier with a loop over R8 (length).

- WRITE_BARRIER_CHECK? We can use the non-batched version for DEBUG/CHECKED.
- REPRET looks like some legacy we can replace with just a normal ret.
- jit-diffs (seems like a few hits in Roslyn). Size regressions come from the cases when batchSize is 2: two plain calls are smaller than the batched one plus mov r8d, <size>. But the perf should still be better with the batched one.

Example (current vs. new codegen): see the listings above.

Benchmark: when the destination is on the heap, the perf is also improved.