Batched write barrier for byrefs #99096

Closed · wants to merge 15 commits

Conversation

EgorBo (Member) commented Feb 29, 2024

Closes #8627

Testing on CI; for now, win-x64 CoreCLR only. Presumably the asm version can do better than just being a copy of JIT_ByRefWriteBarrier with a loop over R8 (the length):

  • If the first iteration finds that the destination doesn't belong to any heap, we can skip those checks for the rest of the batch
  • TODO: what else
  • Do we need the debug-only WRITE_BARRIER_CHECK? We can use the non-batched version for DEBUG/CHECKED
  • REPRET looks like legacy we can replace with just a normal ret

jit-diffs (there seem to be a few hits in Roslyn). Size regressions come from cases where batchSize is 2: two helper calls are smaller than the batched call plus mov r8d, <size>. Perf should still be better with the batched version.

Example:

struct MyStruct
{
    object a1;
    object a2;
    object a3;
    string a4;
    string a5;
    string a6;
}

static MyStruct Test(MyStruct ms) => ms; // simple copy of a struct with gc fields

Current codegen:

; Assembly listing for method Prog:Test(MyStruct):MyStruct
G_M32391_IG01:
       push     rdi
       push     rsi
       push     rbx
       mov      rbx, rcx
G_M32391_IG02:
       mov      rdi, rbx
       mov      rsi, rdx
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       mov      rax, rbx
G_M32391_IG03:
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 49

New codegen:

; Assembly listing for method Prog:Test(MyStruct):MyStruct
G_M32391_IG01:
       push     rdi
       push     rsi
       push     rbx
       mov      rbx, rcx
G_M32391_IG02:
       mov      rdi, rbx
       mov      rsi, rdx
       mov      r8d, 6
       call     CORINFO_HELP_ASSIGN_BYREF_BATCH
       mov      rax, rbx
G_M32391_IG03:
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 30

Benchmark:

struct MyStruct
{
    object a1;
    object a2;
    object a3;
    object a4;
}

[Benchmark]
public void StructCopy()
{
    MyStruct ms = default;
    Test(ms);

    [MethodImpl(MethodImplOptions.NoInlining)]
    MyStruct Test(MyStruct ms) => ms;
}
Method Toolchain Mean Error StdDev Ratio
StructCopy Core_Root_Main\corerun.exe 4.655 ns 0.0097 ns 0.0090 ns 2.05
StructCopy Core_Root_PR\corerun.exe 2.269 ns 0.0097 ns 0.0086 ns 1.00

When the destination is on the heap, perf is also improved.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 29, 2024
@ghost ghost assigned EgorBo Feb 29, 2024

ghost commented Feb 29, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Testing on CI; for now, x64 non-AOT only. Presumably, the asm version can be better than just a loop over R8 (length), e.g. when the destination is not in the heap, the other iterations can be faster.

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr
Milestone: -
add rdi, 8h      ; advance destination pointer
add rsi, 8h      ; advance source pointer
dec r8d          ; one byref fewer left in the batch
jne NextByref    ; loop until the whole batch is done
jkotas (Member) commented Feb 29, 2024

This loop can hang GC thread suspension.

EgorBo (Member, Author):

Can you please elaborate? Is it because of the shadow GC?

jkotas (Member) commented Feb 29, 2024

We won't be able to suspend the thread for GC while this helper is running. If this helper is called for large blocks repeatedly and there is not much managed code running between subsequent invocations of the helper, GC thread suspension can be very slow or hang.

The same problem existed in BulkMoveWithWriteBarrier. It was fixed in dotnet/coreclr#27776 . You should be able to modify the repro from that PR to use InlineArray to hit the hang with this helper.

EgorBo (Member, Author):

I can limit the max batch size, e.g. 1024 GC slots would be handled as multiple call CORINFO_HELP_ASSIGN_BYREF_BATCH calls of 16 slots each.
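The chunking idea could be sketched roughly as follows (a minimal illustration with invented names, not code from the PR; MaxBatchSize and the emit callback are assumptions):

```csharp
using System;

static class ByrefBatching
{
    const int MaxBatchSize = 16; // hypothetical cap per helper call

    // Splits a GC-slot copy into bounded helper-call batches so that no
    // single CORINFO_HELP_ASSIGN_BYREF_BATCH call runs unboundedly long.
    public static void EmitBatchedCopy(int gcSlotCount, Action<int> emitHelperCall)
    {
        while (gcSlotCount > 0)
        {
            int batch = Math.Min(gcSlotCount, MaxBatchSize);
            emitHelperCall(batch); // emits: mov r8d, <batch>; call CORINFO_HELP_ASSIGN_BYREF_BATCH
            gcSlotCount -= batch;
        }
    }
}
// e.g. 1024 slots -> 64 helper calls of 16 slots each
```

Bounding each call keeps the non-interruptible region in the helper short, which is what the suspension concern above is about.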

jkotas (Member):

I would measure the worst-case scenario, e.g. call MyStruct Test(MyStruct ms) => ms; in a loop on one thread and measure average GC pause time on a second thread.
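A rough sketch of that worst-case measurement (all names hypothetical; an illustration of the idea, not code from the PR): one thread copies a GC-pointer-heavy struct in a tight loop while another thread times how long it takes to get a GC through.

```csharp
// Sketch: stress the byref-copy path on one thread and observe how long
// GC (and thus thread suspension) takes on another thread.
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

var copier = new Thread(() =>
{
    BigStruct s = default;
    while (true) s = Copy(s); // hammers the byref write-barrier helper
});
copier.IsBackground = true;
copier.Start();

for (int i = 0; i < 10; i++)
{
    var sw = Stopwatch.StartNew();
    GC.Collect(); // requires suspending the copier thread first
    Console.WriteLine($"GC.Collect took {sw.ElapsedMilliseconds} ms");
    Thread.Sleep(100);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static BigStruct Copy(BigStruct s) => s;

struct BigStruct { public object a1, a2, a3, a4, a5, a6, a7, a8; }
```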

EgorBo (Member, Author):

From what I see in the diffs, we're mostly dealing with small values like 2 (the most popular) and under 10.

jkotas (Member):

Ok, then the math does not work - there is something else going on. (For some reason, I thought that the batch size is 16 in your current change.)

jkotas (Member):

If I change the microbenchmark so that the write barrier has to do actual work, I am seeing 15sec - 30sec pause times with your current change: https://gist.github.com/jkotas/d9b48e48935bcc5a6f3dc87db086196d

EgorBo (Member, Author) commented Mar 1, 2024

Don't know why, but this benchmark shows better numbers in my branch than in Main (with 16 slots max)

EgorBo (Member, Author) commented Mar 1, 2024

Maybe because the method is just too big in Main, and that made the GC info slower to process? (; Total bytes of code 5012)


jkotas commented Feb 29, 2024

Observation: The helper is functionally equivalent to BulkMoveWithWriteBarrier, with a different perf trade-off. Compared to BulkMoveWithWriteBarrier, the helper as currently implemented in the PR tries to be more precise about setting the cards, so it is slower but leaves less work to be done at GC time.


EgorBo commented Feb 29, 2024

> Observation: The helper is functionally equivalent to BulkMoveWithWriteBarrier, with a different perf trade-off. Compared to BulkMoveWithWriteBarrier, the helper as currently implemented in the PR tries to be more precise about setting the cards, so it is slower but leaves less work to be done at GC time.

Thanks! Presumably it's easier to just call BulkMoveWithWriteBarrier, so that we don't have to duplicate these helpers for all platforms. Do you have an opinion on which way is generally better?

One thing that motivated me is that we've seen write barriers on hot paths in several real-world benchmarks (e.g. @kunalspathak saw this recently). I am not sure this specific barrier was the culprit (they're generally less advanced on arm64); it just looked like low-hanging fruit to me.


jkotas commented Feb 29, 2024

> Presumably it's easier to just call BulkMoveWithWriteBarrier

BulkMoveWithWriteBarrier has a higher fixed overhead than the hand-written assembly helper. I expect that it would be profitable to call it only for blocks above a certain size.
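That crossover could be sketched like this (all names and the threshold are invented for illustration; the real crossover point would need measurement):

```csharp
using System;

static class BarrierDispatch
{
    const int BulkThreshold = 64; // hypothetical crossover; would need measurement

    // Small copies use the cheap batched byref helper; large ones amortize
    // the higher fixed cost of a bulk move with coarser card marking.
    public static void CopyGcSlots(int slotCount,
                                   Action<int> emitAssignByrefBatch,
                                   Action<int> emitBulkMove)
    {
        if (slotCount <= BulkThreshold)
            emitAssignByrefBatch(slotCount); // low fixed overhead, precise cards
        else
            emitBulkMove(slotCount);         // higher fixed overhead, coarser cards
    }
}
```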

> we've seen write barriers on hot paths in several real world benchmarks

Is this addressing the hot paths that we have seen?


EgorBo commented Feb 29, 2024

> Is this addressing the hot paths that we have seen?

Not sure yet, but jit-diffs imply they're not rare (jit-diffs are usually way smaller than SPMI because we only run them for the framework libs - no tests, benchmarks, aspnet, etc.).

kunalspathak (Member) commented:

> on hot paths in several real world benchmarks (e.g. @kunalspathak recently did)

We saw it more on arm64, and we do not know yet whether the many JIT_WriteBarrier samples on the call stack were because the arm64 version is slower or because of the lack of batching. I see that this one is just for amd64, so it won't help the arm64 case that I was seeing.


EgorBo commented Feb 29, 2024

> I see that this one is just for amd64, so won't help the arm64 case that I was seeing.

It's just for the draft; I plan to support x64 (Windows, Unix) and arm64 (Windows, Unix) for both CoreCLR and NativeAOT.


EgorBo commented Feb 29, 2024

Here is the current diff between non-batched (left) and batched (right) version: https://www.diffchecker.com/clp44lCx/


EgorBo commented Feb 29, 2024

@MihuBot


EgorBo commented Mar 1, 2024

@jkotas Does it look good now? I changed to 16 slots max and implemented Windows-x64+Unix-x64 on both CoreCLR and NAOT. I'd like to handle ARM64 in a separate PR if you don't mind.

To simplify code review of the newly added helpers, I made diffs of each against its non-batched version:

  1. src/coreclr/nativeaot/Runtime/amd64/WriteBarriers.S: diff
  2. src/coreclr/nativeaot/Runtime/amd64/WriteBarriers.asm: diff
  3. src/coreclr/vm/amd64/JitHelpers_Fast.asm: diff <--- the least verbose diff
  4. src/coreclr/vm/amd64/jithelpers_fast.S: diff

@EgorBo EgorBo marked this pull request as ready for review March 1, 2024 00:54

EgorBo commented Mar 1, 2024

/azp list



EgorBo commented Mar 1, 2024

/azp run runtime-coreclr outerloop, runtime-coreclr jitstressregs, runtime-coreclr gcstress0x3-gcstress0xc, runtime-nativeaot-outerloop


Azure Pipelines successfully started running 4 pipeline(s).


jkotas commented Mar 1, 2024

@VSadov What do you think about the impact of this change on GC thread suspension? Have you seen bad cases where the current write barriers stall GC thread suspension?

I am wondering where we should set the bar for GC thread suspension impact on changes like this one. There are multiple ways to make it friendlier to GC thread suspension: an explicit poll in the helper, explicit GC polls in the caller, enabling hijacking of the return address from these helpers, and so on.


VSadov commented Mar 1, 2024

My first thought is that barriers are the kind of platform- and architecture-specific pieces of assembly that we would usually avoid for portability reasons. With CoreCLR/NativeAOT + 6 architectures + Win/Unix, how many combinations will need a copy of the new barrier? Do we get enough savings for that?

It looks like it saves a call. It also hoists the "not in heap" check, although that helps only when the destination really is not in the heap and the helper short-circuits; otherwise that check is cheap. The rest of the cost - the other checks, the lock or, the actual assignment - is still there.
Considering that the barrier cost is typically in the 1-5% range and that this will be hit in fairly rare cases in the code ("a few hits in Roslyn"?), I wonder how much of an improvement is expected here in general?

What happens to the benchmark if the write is to a bunch of array elements instead of a single stack location?
(ideally on a Server GC)
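The suggested variant might look something like this (a hypothetical benchmark assuming BenchmarkDotNet; not part of the PR):

```csharp
// Sketch: copy a GC-pointer struct into array elements so the destination
// is on the heap and the barrier must do real card-marking work.
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public struct MyStruct { public object a1, a2, a3, a4; }

public class ArrayCopyBench
{
    private readonly MyStruct[] _arr = new MyStruct[1024];

    [Benchmark]
    public void CopyToArrayElements()
    {
        for (int i = 0; i < _arr.Length; i++)
            _arr[i] = Test(_arr[i]); // heap destination -> full barrier path

        [MethodImpl(MethodImplOptions.NoInlining)]
        static MyStruct Test(MyStruct ms) => ms;
    }
}
```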


VSadov commented Mar 1, 2024

I kind of suspect the original issue expected more “batching” - like hoisting other checks, setting cards in bulk, etc… But that will turn the barrier into BulkMoveWithWriteBarrier, which we already have.

;; r8: number of byrefs
;;
;; On exit:
;; rdi, rsi are incremented by 8,
A reviewer (Member) commented:

Shouldn't they be incremented by the number of bytes processed? The same applies to the comments in the other asm code.


EgorBo commented Mar 1, 2024

> What happens to the benchmark if the write is to a bunch of array elements instead of a single stack location?
> (ideally on a Server GC)

Not much, indeed. I don't have a strong opinion here; since it raises concerns/doubts, I am fine with closing it before I invest my time into the arm64 side of it 🤷‍♂️. The case where I noticed this issue was tuples, e.g.:

(string, string) StructCopy((string, string) s) => s;


EgorBo commented Mar 1, 2024

Closing; I will try to benchmark/compare the arm64 write barriers (not only the byref one) against x64 to see if I can contribute there instead.

@EgorBo EgorBo closed this Mar 1, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 1, 2024
Successfully merging this pull request may close this issue: Consider creating a ranged version of CORINFO_HELP_ASSIGN_BYREF
5 participants