Batched write barrier for byrefs #99096

Closed · wants to merge 15 commits

Conversation

EgorBo (Member) commented Feb 29, 2024

Closes #8627

Testing on CI; for now, win-x64 CoreCLR only. Presumably the asm version can do better than just being a copy of JIT_ByRefWriteBarrier with a loop over R8 (the length):

  • If the first iteration finds that the destination doesn't belong to any heap, we can skip those checks for the rest of the batch
  • TODO: what else
  • Do we need the debug-only WRITE_BARRIER_CHECK? We can use the non-batched version for DEBUG/CHECKED
  • REPRET looks like legacy we can replace with just a normal ret

jit-diffs (there seem to be a few hits in Roslyn). Size regressions come from cases where batchSize is 2: two helper calls are smaller than the batched call plus mov r8d, <size>. Perf should still be better with the batched version.

Example:

struct MyStruct
{
    object a1;
    object a2;
    object a3;
    string a4;
    string a5;
    string a6;
}

static MyStruct Test(MyStruct ms) => ms; // simple copy of a struct with gc fields

Current codegen:

; Assembly listing for method Prog:Test(MyStruct):MyStruct
G_M32391_IG01:
       push     rdi
       push     rsi
       push     rbx
       mov      rbx, rcx
G_M32391_IG02:
       mov      rdi, rbx
       mov      rsi, rdx
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       mov      rax, rbx
G_M32391_IG03:
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 49

New codegen:

; Assembly listing for method Prog:Test(MyStruct):MyStruct
G_M32391_IG01:
       push     rdi
       push     rsi
       push     rbx
       mov      rbx, rcx
G_M32391_IG02:
       mov      rdi, rbx
       mov      rsi, rdx
       mov      r8d, 6
       call     CORINFO_HELP_ASSIGN_BYREF_BATCH
       mov      rax, rbx
G_M32391_IG03:
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 30

Benchmark:

struct MyStruct
{
    object a1;
    object a2;
    object a3;
    object a4;
}

[Benchmark]
public void StructCopy()
{
    MyStruct ms = default;
    Test(ms);

    [MethodImpl(MethodImplOptions.NoInlining)]
    MyStruct Test(MyStruct ms) => ms;
}
Method Toolchain Mean Error StdDev Ratio
StructCopy Core_Root_Main\corerun.exe 4.655 ns 0.0097 ns 0.0090 ns 2.05
StructCopy Core_Root_PR\corerun.exe 2.269 ns 0.0097 ns 0.0086 ns 1.00

When the destination is on the heap, perf is also improved.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 29, 2024
@ghost ghost assigned EgorBo Feb 29, 2024

ghost commented Feb 29, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Testing on CI; for now, x64 non-AOT only. Presumably, the asm version can be better than just a loop over R8 (length), e.g. when the destination is not in the heap, the other iterations can be faster.

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr
Milestone: -
add rdi, 8h      ; advance destination pointer
add rsi, 8h      ; advance source pointer
dec r8d          ; one byref fewer left in the batch
jne NextByref    ; loop until the whole batch is done
jkotas (Member) commented Feb 29, 2024

This loop can hang GC thread suspension.

EgorBo (Member, Author):

Can you please elaborate? Is it because of the shadow GC?

jkotas (Member) commented Feb 29, 2024

We won't be able to suspend the thread for GC while this helper is running. If this helper is called for large blocks repeatedly and there is not much managed code running between subsequent invocations of the helper, GC thread suspension can be very slow or hang.

The same problem existed in BulkMoveWithWriteBarrier. It was fixed in dotnet/coreclr#27776 . You should be able to modify the repro from that PR to use InlineArray to hit the hang with this helper.

EgorBo (Member, Author):

I can limit the max batch size, e.g. 1024 GC slots would be handled as multiple call CORINFO_HELP_ASSIGN_BYREF_BATCH calls of 16 slots each.
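The chunking idea could be sketched roughly as follows (a minimal illustration with invented names, not code from the PR; MaxBatchSize and the emit callback are assumptions):

```csharp
using System;

static class ByrefBatching
{
    const int MaxBatchSize = 16; // hypothetical cap per helper call

    // Splits a GC-slot copy into bounded helper-call batches so that no
    // single CORINFO_HELP_ASSIGN_BYREF_BATCH call runs unboundedly long.
    public static void EmitBatchedCopy(int gcSlotCount, Action<int> emitHelperCall)
    {
        while (gcSlotCount > 0)
        {
            int batch = Math.Min(gcSlotCount, MaxBatchSize);
            emitHelperCall(batch); // emits: mov r8d, <batch>; call CORINFO_HELP_ASSIGN_BYREF_BATCH
            gcSlotCount -= batch;
        }
    }
}
// e.g. 1024 slots -> 64 helper calls of 16 slots each
```

Bounding each call keeps the non-interruptible region in the helper short, which is what the suspension concern above is about.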

jkotas (Member):

I would measure the worst-case scenario, e.g. call MyStruct Test(MyStruct ms) => ms; in a loop on one thread and measure average GC pause time on a second thread.
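A rough sketch of that worst-case measurement (all names hypothetical; an illustration of the idea, not code from the PR): one thread copies a GC-pointer-heavy struct in a tight loop while another thread times how long it takes to get a GC through.

```csharp
// Sketch: stress the byref-copy path on one thread and observe how long
// GC (and thus thread suspension) takes on another thread.
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

var copier = new Thread(() =>
{
    BigStruct s = default;
    while (true) s = Copy(s); // hammers the byref write-barrier helper
});
copier.IsBackground = true;
copier.Start();

for (int i = 0; i < 10; i++)
{
    var sw = Stopwatch.StartNew();
    GC.Collect(); // requires suspending the copier thread first
    Console.WriteLine($"GC.Collect took {sw.ElapsedMilliseconds} ms");
    Thread.Sleep(100);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static BigStruct Copy(BigStruct s) => s;

struct BigStruct { public object a1, a2, a3, a4, a5, a6, a7, a8; }
```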

EgorBo (Member, Author):

From what I see in the diffs, we're mostly dealing with small values like 2 (the most popular) and under 10.

jkotas (Member):

Ok, then the math does not work - there is something else going on. (For some reason, I thought that the batch size is 16 in your current change.)

jkotas (Member):

If I change the microbenchmark so that the write barrier has to do actual work, I am seeing 15sec - 30sec pause times with your current change: https://gist.github.com/jkotas/d9b48e48935bcc5a6f3dc87db086196d

EgorBo (Member, Author) commented Mar 1, 2024

Don't know why, but this benchmark shows better numbers in my branch than in Main (with 16 slots max)

EgorBo (Member, Author) commented Mar 1, 2024

Maybe because the method is just too big in Main, and that made the GC info slower to process? (; Total bytes of code 5012)


jkotas commented Feb 29, 2024

Observation: The helper is functionally equivalent to BulkMoveWithWriteBarrier, with a different perf trade-off. Compared to BulkMoveWithWriteBarrier, the helper as currently implemented in the PR tries to be more precise about setting the cards, so it is slower but leaves less work to be done at GC time.


EgorBo commented Feb 29, 2024

> Observation: The helper is functionally equivalent to BulkMoveWithWriteBarrier, with a different perf trade-off. Compared to BulkMoveWithWriteBarrier, the helper as currently implemented in the PR tries to be more precise about setting the cards, so it is slower but leaves less work to be done at GC time.

Thanks! Presumably it's easier to just call BulkMoveWithWriteBarrier, so that we don't have to duplicate these helpers for all platforms. Do you have an opinion on which way is generally better?

One thing that motivated me is that we've seen write barriers on hot paths in several real-world benchmarks (e.g. @kunalspathak saw this recently). I am not sure this specific barrier was the culprit (they're generally less advanced on arm64); it just looked like low-hanging fruit to me.


jkotas commented Feb 29, 2024

> Presumably it's easier to just call BulkMoveWithWriteBarrier

BulkMoveWithWriteBarrier has a higher fixed overhead than the hand-written assembly helper. I expect that it would be profitable to call it only for blocks above a certain size.
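That crossover could be sketched like this (all names and the threshold are invented for illustration; the real crossover point would need measurement):

```csharp
using System;

static class BarrierDispatch
{
    const int BulkThreshold = 64; // hypothetical crossover; would need measurement

    // Small copies use the cheap batched byref helper; large ones amortize
    // the higher fixed cost of a bulk move with coarser card marking.
    public static void CopyGcSlots(int slotCount,
                                   Action<int> emitAssignByrefBatch,
                                   Action<int> emitBulkMove)
    {
        if (slotCount <= BulkThreshold)
            emitAssignByrefBatch(slotCount); // low fixed overhead, precise cards
        else
            emitBulkMove(slotCount);         // higher fixed overhead, coarser cards
    }
}
```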

> we've seen write barriers on hot paths in several real world benchmarks

Is this addressing the hot paths that we have seen?


EgorBo commented Feb 29, 2024

> Is this addressing the hot paths that we have seen?

Not sure yet, but jit-diffs imply they're not rare (jit-diffs are usually way smaller than SPMI because we only run them for the framework libs - no tests, benchmarks, aspnet, etc.).

kunalspathak (Member) commented:

> on hot paths in several real world benchmarks (e.g. @kunalspathak recently did)

We saw it more on arm64, and we do not know yet whether the many JIT_WriteBarrier samples on the call stack were because the arm64 version is slower or because of the lack of batching. I see that this one is just for amd64, so it won't help the arm64 case that I was seeing.


EgorBo commented Feb 29, 2024

> I see that this one is just for amd64, so won't help the arm64 case that I was seeing.

It's just for the draft; I plan to support x64 (Windows, Unix) and arm64 (Windows, Unix) for both CoreCLR and NativeAOT.


EgorBo commented Feb 29, 2024

Here is the current diff between non-batched (left) and batched (right) version: https://www.diffchecker.com/clp44lCx/


EgorBo commented Feb 29, 2024

@MihuBot


EgorBo commented Mar 1, 2024

@jkotas Does it look good now? I changed to 16 slots max and implemented Windows-x64+Unix-x64 on both CoreCLR and NAOT. I'd like to handle ARM64 in a separate PR if you don't mind.

To simplify code review of the newly added helpers, I made diffs of each against its non-batched version:

  1. src/coreclr/nativeaot/Runtime/amd64/WriteBarriers.S: diff
  2. src/coreclr/nativeaot/Runtime/amd64/WriteBarriers.asm: diff
  3. src/coreclr/vm/amd64/JitHelpers_Fast.asm: diff <--- the least verbose diff
  4. src/coreclr/vm/amd64/jithelpers_fast.S: diff

@EgorBo EgorBo marked this pull request as ready for review March 1, 2024 00:54

EgorBo commented Mar 1, 2024

/azp list



EgorBo commented Mar 1, 2024

/azp run runtime-coreclr outerloop, runtime-coreclr jitstressregs, runtime-coreclr gcstress0x3-gcstress0xc, runtime-nativeaot-outerloop


Azure Pipelines successfully started running 4 pipeline(s).


jkotas commented Mar 1, 2024

@VSadov What do you think about the impact of this change on GC thread suspension? Have you seen bad cases where the current write barriers stall GC thread suspension?

I am wondering where we should set the bar for GC thread suspension impact on changes like this one. There are multiple ways to make it friendlier to GC thread suspension: an explicit poll in the helper, explicit GC polls in the caller, enabling hijacking of the return address from these helpers, and so on.


VSadov commented Mar 1, 2024

My first thought is that barriers are the kind of platform- and architecture-specific pieces of assembly that we would usually avoid for portability reasons. With CoreCLR/NativeAOT + 6 architectures + Win/Unix, how many combinations will need a copy of the new barrier? Do we get enough savings for that?

It looks like it saves a call. It also hoists the "not in heap" check, although that helps only when the destination really is not in the heap and the helper short-circuits; otherwise that check is cheap. The rest of the cost - the other checks, the lock or, the actual assignment - is still there.
Considering that the barrier cost is typically in the 1-5% range and that this will be hit in fairly rare cases in the code ("a few hits in Roslyn"?), I wonder how much of an improvement is expected here in general?

What happens to the benchmark if the write is to a bunch of array elements instead of a single stack location?
(ideally on a Server GC)
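The suggested variant might look something like this (a hypothetical benchmark assuming BenchmarkDotNet; not part of the PR):

```csharp
// Sketch: copy a GC-pointer struct into array elements so the destination
// is on the heap and the barrier must do real card-marking work.
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public struct MyStruct { public object a1, a2, a3, a4; }

public class ArrayCopyBench
{
    private readonly MyStruct[] _arr = new MyStruct[1024];

    [Benchmark]
    public void CopyToArrayElements()
    {
        for (int i = 0; i < _arr.Length; i++)
            _arr[i] = Test(_arr[i]); // heap destination -> full barrier path

        [MethodImpl(MethodImplOptions.NoInlining)]
        static MyStruct Test(MyStruct ms) => ms;
    }
}
```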


VSadov commented Mar 1, 2024

I kind of suspect the original issue expected more “batching” - like hoisting other checks, setting cards in bulk, etc… But that will turn the barrier into BulkMoveWithWriteBarrier, which we already have.

;; r8: number of byrefs
;;
;; On exit:
;; rdi, rsi are incremented by 8,
A reviewer (Member) commented:

Shouldn't they be incremented by the number of bytes processed? The same applies to the comments in the other asm code.


EgorBo commented Mar 1, 2024

> What happens to the benchmark if the write is to a bunch of array elements instead of a single stack location?
> (ideally on a Server GC)

Not much, indeed. I don't have a strong opinion here; since it raises concerns/doubts, I am fine with closing it before I invest my time into the arm64 side of it 🤷‍♂️. The case where I noticed this issue was tuples, e.g.:

(string, string) StructCopy((string, string) s) => s;


EgorBo commented Mar 1, 2024

Closing; I will try to benchmark/compare the arm64 write barriers (not only the byref one) against x64 to see if I can contribute there instead.

@EgorBo EgorBo closed this Mar 1, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 1, 2024
Successfully merging this pull request may close this issue: Consider creating a ranged version of CORINFO_HELP_ASSIGN_BYREF
5 participants