
Implement volatile barrier APIs #107843

Merged (17 commits) on Oct 17, 2024
Conversation

hamarb123 (Contributor) commented Sep 15, 2024

Fixes #98837

This implements the proposed Read-ReadWrite and ReadWrite-Write barriers. Note: I haven't implemented any tests yet.
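For context, a minimal sketch of the API surface being added, using the names from the proposal in #98837; the exact attributes and implementation in the PR may differ:

```csharp
namespace System.Threading
{
    public static partial class Volatile
    {
        // Read-ReadWrite barrier: preceding reads complete before any subsequent read or write.
        public static void ReadBarrier() { }   // body elided; implemented as a JIT intrinsic in the real change

        // ReadWrite-Write barrier: preceding reads and writes complete before any subsequent write.
        public static void WriteBarrier() { }  // body elided; implemented as a JIT intrinsic in the real change
    }
}
```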

/cc @jkotas @VSadov @kouvel

The dotnet-issue-labeler bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and new-api-needs-documentation labels on Sep 15, 2024.
Note regarding the new-api-needs-documentation label:

This serves as a reminder: when your PR modifies a ref *.cs file and adds or changes public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments so the PR reviewers can sign off on that change.
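For illustration only (not the PR's actual source), a triple-slash comment on one of the new public APIs might look roughly like this; the remarks reflect the platform behavior discussed later in this thread and are an assumption here:

```csharp
namespace System.Threading
{
    public static partial class Volatile
    {
        /// <summary>
        /// Ensures that preceding reads and writes complete before any write
        /// that follows this call (a ReadWrite-Write barrier).
        /// </summary>
        /// <remarks>
        /// Expected to be a compiler-only fence on x86/x64 and a hardware fence on arm64.
        /// </remarks>
        public static void WriteBarrier() { } // body elided in this sketch
    }
}
```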


The dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) on Sep 15, 2024.
VSadov (Member) commented Sep 16, 2024

> my understanding of the x86 memory model is that it allows reads to be reordered after writes (when the addresses differ), which is not supposed to be allowed across a Volatile.WriteBarrier with the ReadWrite-Write model, if I'm understanding correctly

Reads are not reordered after writes on x86.
Reads can happen earlier than preceding writes (i.e. prefetch), and to prevent that you'd indeed need a full fence, but that is not something that WriteBarrier needs to guarantee.

In short - WriteBarrier needs to wait for reads/writes in progress to complete before allowing more writes.

  • on x86/x64 Volatile.WriteBarrier is just a compiler fence, similar to Volatile.ReadBarrier.
  • on arm it is a full fence (dmb[ish]). Sadly, dmb.st does not wait for reads.
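As an editorial illustration of the guarantee described above (hypothetical fields and method names, not code from this PR): a publish/consume pattern where WriteBarrier must let every prior read and write complete before the flag write that follows it, and ReadBarrier keeps the flag read before the payload read.

```csharp
using System.Threading;

static class PublishExample
{
    static int s_data;
    static bool s_ready;

    static void Publish(int value)
    {
        s_data = value;           // plain write of the payload
        Volatile.WriteBarrier();  // ReadWrite-Write: the write above (and any prior reads)
                                  // must complete before the write below
        s_ready = true;           // flag write; on x86/x64 this barrier is only a compiler fence
    }

    static bool TryConsume(out int value)
    {
        if (s_ready)
        {
            Volatile.ReadBarrier(); // Read-ReadWrite: the flag read above must complete
                                    // before the payload read below
            value = s_data;
            return true;
        }
        value = 0;
        return false;
    }
}
```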

hamarb123 (Contributor, author) commented Sep 16, 2024

Hmm, I still don't understand. What's the difference between:

```
...x
Volatile.ReadBarrier(); //emits nothing
Volatile.WriteBarrier(); //emits nothing
...y
```

And

```
...x
Interlocked.MemoryBarrier(); //emits lock ...
...y
```

@VSadov can you give me an example of some x & y where the behaviour is allowed to be different so I can see an example of what I'm not understanding?

Edit: I think I understand now (leaving this here for my future reference)

```
read a
write b
Volatile.ReadBarrier(); Volatile.WriteBarrier(); //or swapped
read c
write d
```

could be reordered to: read a, read c, write b, write d, whereas Interlocked.MemoryBarrier() would obviously prevent this reordering.
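A hedged sketch of that reordering as a store-load litmus test (hypothetical fields x and y, not from the PR): with only the two half barriers, each thread's write may still be reordered after its read, so both threads can return 0; replacing the pair with Interlocked.MemoryBarrier() rules that outcome out.

```csharp
using System.Threading;

static class StoreLoadLitmus
{
    static int x, y;

    static int Thread1()
    {
        x = 1;                    // write b
        Volatile.WriteBarrier();  // orders the write above before later *writes* (there are none here)
        Volatile.ReadBarrier();   // orders earlier *reads* before the read below (there are none earlier)
        return y;                 // read c: neither barrier orders the write to x before this read
    }

    static int Thread2()          // symmetric
    {
        y = 1;
        Volatile.WriteBarrier();
        Volatile.ReadBarrier();
        return x;
    }

    // With only the half barriers, (Thread1() == 0 && Thread2() == 0) is a permitted outcome.
    // Substituting Interlocked.MemoryBarrier() for the pair forbids it (full Store-Load ordering).
}
```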

hamarb123 (Contributor, author) commented Sep 16, 2024

> on arm it is a full fence (dmb[ish]). Sadly, dmb.st does not wait for reads.

Would dmb ishst + dmb ishld be enough? (idk if this is actually faster anyway, but maybe it is?)
It would give Load-Load, Load-Store, and Store-Store guarantees according to this.

Using my example from earlier: read a, write b, barrier/s, read c, write d (where these represent arbitrary quantities of reads & writes in any order):

  • Volatile.WriteBarrier requires: a,b before d
  • dmb ishst gives b before d
  • dmb ishld gives a before c,d

So it would seem to me as though dmb ishst + dmb ishld (in either order) should theoretically be enough. Whether it's faster than just a dmb ish is another question obviously (one that is only really relevant if it's a valid approach anyway). If there's something wrong with my analysis, please let me know :)

VSadov (Member) commented Sep 16, 2024

> Would dmb ishst + dmb ishld be enough? (idk if this is actually faster anyway, but maybe it is?)

At first glance it seems that the combination is as good as a full barrier.
If the cost of a full barrier could be reduced by doing two half barriers instead, I'd think that is how hardware would do it, so likely it is not faster.

hamarb123 (Contributor, author) commented Sep 16, 2024

> At first glance it seems that the combination is as good as a full barrier. If the cost of a full barrier could be reduced by doing two half barriers instead, I'd think that is how hardware would do it, so likely it is not faster.

It's actually not as strong as a full barrier, since it doesn't give b before c, which I think is the same ordering that x86 doesn't give by default, based on what you were saying.

VSadov (Member) commented Sep 16, 2024

> At first glance it seems that the combination is as good as a full barrier. If the cost of a full barrier could be reduced by doing two half barriers instead, I'd think that is how hardware would do it, so likely it is not faster.

> It's actually not as strong as a full barrier, since it doesn't give b before c, which I think is the same ordering that x86 doesn't give by default, based on what you were saying.

Ah, right, it still does not order Store-Load. It could be cheaper then, since it guarantees less.

hamarb123 (Contributor, author) commented Sep 19, 2024

> Ah, right, it still does not order Store-Load. It could be cheaper then, since it guarantees less.

I did some testing, based on the code I gave in the use-case section of my API proposal issue (converted to C++), on an M-series MacBook, and got about a 1.4% regression overall (don't interpret that as the pair being exactly 1.4% slower than just dmb ish, since there is obviously other code around the dmb instructions; it's just most likely not faster, I think). Testing ran for about 25 minutes total. So there's probably no point pursuing this idea further. Not overly surprised, but anyway.

JulieLeeMSFT (Member)

@VSadov, could you please review the last comment from the author?

VSadov (Member) commented Oct 14, 2024

@JulieLeeMSFT I will take a look

VSadov (Member) commented Oct 15, 2024

The failures in superpmi are because the change is compat-breaking for the old corelib. Thus the new JIT is not a drop-in replacement for the old one.

We need to update JITEEVersionIdentifier guid in jiteeversionguid.h to signal that.

jakobbotsch (Member)

> We need to update JITEEVersionIdentifier guid in jiteeversionguid.h to signal that.

FWIW, you can do this simply by running the ThunkGenerator (https://github.com/dotnet/runtime/blob/main/src/coreclr/tools/Common/JitInterface/ThunkGenerator/gen.sh or https://github.com/dotnet/runtime/blob/main/src/coreclr/tools/Common/JitInterface/ThunkGenerator/gen.bat)

VSadov (Member) left a review comment:

LGTM! Thanks!!

hamarb123 (Contributor, author) commented Oct 15, 2024

🎉 btw, I haven't added tests for the methods - is this a problem? If so, please give me an idea of what I should write for such a test 😆

VSadov (Member) commented Oct 15, 2024

> 🎉 btw, I haven't added tests for the methods - is this a problem? If so, please give me an idea of what I should write for such a test 😆

  • The read barrier will have good coverage via the cast cache. The cache gets a lot of traffic, so it will work better than any targeted test.
  • The write barrier is not emitted on x64 and emits a full fence on arm64. I think some test can be written for this, but having something that would reliably catch regressions would be difficult.
    I think we can rely on the fact that a self-referential intrinsic must expand to something, and the chances that it regresses and starts doing nothing are low, so probably just taking a look at the native codegen (to be sure the fences are there) would be sufficient.
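As an editorial sketch of the kind of smoke test discussed above (xunit-style, hypothetical test names; it exercises the intended usage but, as noted, cannot by itself prove the fences are emitted):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Xunit;

public class VolatileBarrierSmokeTests
{
    [Fact]
    public void BarriersAreCallable()
    {
        // Sanity check that the new intrinsics expand to something executable.
        Volatile.ReadBarrier();
        Volatile.WriteBarrier();
    }

    [Fact]
    public async Task PublishConsumePattern()
    {
        int data = 0;
        bool ready = false;

        var consumer = Task.Run(() =>
        {
            // Spin on an acquiring read; the ReadBarrier afterwards is redundant here,
            // but exercises the new API on the consumer side.
            while (!Volatile.Read(ref ready)) { }
            Volatile.ReadBarrier();
            Assert.Equal(42, data);
        });

        data = 42;
        Volatile.WriteBarrier();        // order the data write before the flag write
        Volatile.Write(ref ready, true);

        await consumer;
    }
}
```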

EgorBo (Member) commented Oct 15, 2024

@VSadov btw, I was profiling an app recently (OrchardCMS) on linux-arm64, and the various "Barrier speculatively executed" PMU counters were mostly pointing to CastCache as the main offender 🙂 E.g.:

[Screenshot (2024-10-16): profiler output for the dmb_spec event]

hamarb123 (Contributor, author) commented Oct 16, 2024

What else needs to be done here for merging (other than fixing up the jit-ee version guid just before merging)? Do we need other reviews? Thanks.

VSadov (Member) commented Oct 16, 2024

@dotnet/jit-contrib any concerns/comments on this change? I think this is ready for merging.

We need a new guid though as there was a conflicting change.

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), community-contribution (indicates that the PR has been added by a community member), new-api-needs-documentation
Projects: none yet
Successfully merging this pull request may close: [API Proposal]: Volatile barrier APIs
5 participants