
Fleshing out Execution and Memory models #94

Merged
merged 10 commits into microsoft:main
Apr 9, 2024

Conversation

llvm-beanz
Collaborator

This update moves Intro.Defs earlier in the introduction and puts more detail into the SPMD programming model and memory model.

The PDF draft of this change is available here.

@llvm-beanz llvm-beanz requested a review from bogner September 7, 2023 19:32

@jeremyong jeremyong left a comment


Thanks for working on this doc! I left some wording suggestions and comments that are hopefully helpful.

This update moves Intro.Defs earlier in the introduction and puts more detail
into the SPMD programming model and memory model.
@llvm-beanz llvm-beanz force-pushed the cbieneman/terminology branch from f39311b to 3e3fc3e on December 1, 2023 15:44
@llvm-beanz llvm-beanz force-pushed the cbieneman/terminology branch from 5965e2f to bbaa286 on December 1, 2023 16:14
We don't really have direct host-memory manipulation today in HLSL, so
let's keep this simple.
@llvm-beanz
Collaborator Author

I've put in a bunch of wordsmithing and updates.

I think this is mostly in a sane state now.

Updated PDF

@llvm-beanz
Collaborator Author

@Keenuts, the last change I added here is in response to the discussion we had in email this week. Do you think that wording correctly addresses the concerns?

@Keenuts
Collaborator

Keenuts commented Feb 22, 2024

@Keenuts, the last change I added here is in response to the discussion we had in email this week. Do you think that wording correctly addresses the concerns?

Sure, this block as-is makes sense: no optimization should change the program behavior (except optimizations like fast-math? But maybe that's considered to be a codegen switch, not an optimization switch?)

What is IMO the important missing bit (but it's a whole subject) is defining what is expected for actual execution: the spec mentions that lock-step execution is not mandatory, hence the spec as-is allows the same CFG/shader to yield two different results. Example: running on the Maxwell and Volta architectures, with the latter allowing threads not to reconverge directly after a divergence.

@llvm-beanz
Collaborator Author

Sure, this block as-is makes sense: no optimization should change the program behavior (except optimizations like fast-math? But maybe that's considered to be a codegen switch, not an optimization switch?)

Within the spec you can't change the program behavior outside what is specified. HLSL has fast-math by default, so that will be the specified behavior for floating point math operations.

What is IMO the important missing bit (but it's a whole subject) is defining what is expected for actual execution: the spec mentions that lock-step execution is not mandatory, hence the spec as-is allows the same CFG/shader to yield two different results. Example: running on the Maxwell and Volta architectures, with the latter allowing threads not to reconverge directly after a divergence.

Lockstep isn't strictly required by the spec for anything. Even wave operations can be emulated without lockstep execution if you have appropriate synchronization primitives.
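To illustrate that point, here is a minimal sketch (not from the spec, and not how any driver actually implements it) of emulating a wave-wide sum without lockstep execution: each "lane" runs as an independent thread, and a barrier provides the synchronization the wave operation requires. The function name and lane model here are hypothetical.

```python
import threading

def wave_active_sum(values, num_lanes):
    """Emulate an all-lanes wave sum without lockstep execution."""
    partial = [0] * num_lanes   # per-lane contributions
    results = [0] * num_lanes   # per-lane view of the reduction
    barrier = threading.Barrier(num_lanes)

    def lane(i):
        partial[i] = values[i]   # each lane contributes its value
        barrier.wait()           # sync: all contributions are now visible
        results[i] = sum(partial)  # every lane observes the same total

    threads = [threading.Thread(target=lane, args=(i,)) for i in range(num_lanes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The lanes are free to be scheduled in any order; only the barrier point is ordered, which is the sense in which the wave operation, not lockstep, carries the synchronization requirement.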

For Volta, the expectation is that wave operations will effectively require warp synchronization. I think that's a driver problem for Nvidia more than anything else. Maybe that's the missing explicit wording.

I can add something like:

Wave, Quad, and Thread Group operations require execution synchronization of applicable active and helper Lanes as defined by the individual operation.

@Keenuts
Collaborator

Keenuts commented Feb 22, 2024

For Volta, the expectation is that wave operations will effectively require warp synchronization. I think that's a driver problem for Nvidia more than anything else. Maybe that's the missing explicit wording.

I can add something like:

Wave, Quad, and Thread Group operations require execution synchronization of applicable active and helper Lanes as defined by the individual operation.

Yes, and where those barriers are placed should probably be described in the control-flow structures (here using CUDA's syntax mixed with HLSL):

switch (value) {
    case 0:
        WaveActiveSum(1);
    default:
        __syncwarp(MASK_FOR_ALL_THREADS);
        WaveActiveSum(10);
}

@llvm-beanz
Collaborator Author

switch (value) {
    case 0:
        WaveActiveSum(1);
    default:
        __syncwarp(MASK_FOR_ALL_THREADS);
        WaveActiveSum(10);
}

I don't think this is correct. Every Wave* intrinsic needs to have a __syncwarp associated with it. In your example it may have only been required for the default: case, but in a more complicated example they're effectively needed for all Wave and Quad operations with a mask that represents all the participating lanes.
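To make that concrete, a hedged sketch of the placement being argued for, in the same CUDA-mixed-HLSL pseudocode as the example above (the mask names are illustrative placeholders, not real identifiers):

```
switch (value) {
    case 0:
        __syncwarp(MASK_FOR_CASE_0_LANES);  // sync the lanes participating in this wave op
        WaveActiveSum(1);
        // falls through
    default:
        __syncwarp(MASK_FOR_ALL_LANES);     // resync with the full participating set
        WaveActiveSum(10);
}
```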

@Keenuts
Collaborator

Keenuts commented Feb 26, 2024

switch (value) {
    case 0:
        WaveActiveSum(1);
    default:
        __syncwarp(MASK_FOR_ALL_THREADS);
        WaveActiveSum(10);
}

I don't think this is correct. Every Wave* intrinsic needs to have a __syncwarp associated with it. In your example it may have only been required for the default: case, but in a more complicated example they're effectively needed for all Wave and Quad operations with a mask that represents all the participating lanes.

Yes, of course. If we add a case 1 which doesn't branch, then the mask has to be different. What I meant is that the control-flow structure itself needs to define where synchronization is expected, not just the intrinsic:

  • we both agree that we must synchronize as soon as we can when we fall through, but this has to be made explicit IMO, and it's related to the control-flow construct.
    (Same for loop breaks, etc.)

But it seems this goes beyond this initial PR, so it's probably not the best place to dive into this.

@llvm-beanz
Collaborator Author

Yea, I think some of this will come down to the wording as we write the behavior of switch statements and case labels, but at the same time I don't think we need to define synchronization for control flow structures.

It isn't illegal to reshape control flow during optimization if it doesn't change program behavior, so synchronization of execution isn't really a property of the control flow structures. Even within the definitions of C, it is illegal to merge and hoist control flow if it changes the resulting execution. C doesn't have the SIMT/SPMD considerations that HLSL does, but we can lean on similar rules for HLSL and get the expected behavior.

The C spec has defined case statements from ANSI C through C17 (the wording is slightly different in the latest draft of C23) with the following statement:

Labels in themselves do not alter the flow of control, which continues unimpeded across them.

If that's the basis for how case labels are defined, and Wave* operations are defined to have wave sync behavior, I think that defines the expected behavior completely.

I don't want our spec to make statements about expectations for control flow structure that aren't required because we do want conforming compilers to be allowed to optimize in cases where they can.

@llvm-beanz
Collaborator Author

I've marked the outstanding conversations as resolved. I think that I've addressed all the feedback. I'm happy to revisit anything in subsequent PRs. I'm sure this will all need some more work as we get further along.

@llvm-beanz llvm-beanz merged commit 2a8e8da into microsoft:main Apr 9, 2024
3 checks passed
llvm-beanz added a commit to llvm-beanz/hlsl-specs that referenced this pull request Feb 9, 2025
This update moves Intro.Defs earlier in the introduction and puts more detail
into the SPMD programming model and memory model.