Shader Execution Reordering (SER) proposal #277
Looks ready to take into the repo for further review. Before completing the PR, please pick the next free proposal number (at time of writing this comment, that would be 0021) and rename the file and the reference to it on line 4 appropriately.
- The UAV writer must issue a `DeviceMemoryBarrier` between the write and the
  reorder point.

## Separation of ReorderThread and HitObject::Invoke
Actually, we would like to have additional functions (also see my comment on TraceRay); one example would be:

```C++
template<payload_t>
static HitObject HitObject::TraceRayReorderInvoke()
```

that would be equivalent to:

```C++
HitObject hit = [PreferNoReorder]HitObject::TraceRay( ..., payload );
ReorderThread( hit );
HitObject::Invoke( hit, payload );
```

It would return to the spirit of the original DXR 1.0 TraceRay, except this time we would know that, at least on implementations that support reordering, it should bring reordering here as well.

Motivations:
- Some vendors can proceed to executing a shader without getting back to the calling shader. This has advantages, e.g. the whole wave doesn't wait for all rays to finish, which is getting increasingly important as traversal steps start to vary more and more in complex scenes.
- We'd like to encourage writing this as a single instruction instead of relying on the compiler to fuse three separate operations together. For example, in this case the app decided to put the `.Sample()` between `TraceRay` and `ReorderThread` to hide the latency of the `TraceRay` call:

```C++
HitObject hit = [PreferNoReorder]HitObject::TraceRay( ..., payload );
color = tex.Sample(sampler, UV);
ReorderThread(hit);
HitObject::Invoke( hit, payload );
```

Our compiler will have a hard time deciding if it's OK to move the `.Sample()` after the reorder or not. It's much simpler if this code is written as:

```C++
HitObject::TraceRayReorderInvoke(payload);
color = tex.Sample(sampler, UV);
```
TraceRayReorderInvoke is similar in spirit to the attributes.
The first point is not clear to us. It is already the case that the whole wave doesn't have to wait for all rays to finish at any reorder point.
We are not quite sure why the compiler is not able to move the Sample call after ReorderThread. There are no side effects in ReorderThread and TraceRay was already a reorder point, meaning that restrictions on memory visibility apply already. A sampled texture is also read-only and is not affected by side-effects.
Our stance is that it should be easy for an implementation to merge 2-3 calls into one if it is deemed beneficial, without adding further API surface. Any shader code in between the calls is typically not affected by any side effects that may occur in TraceRay or Invoke. The only exception would be RW UAVs that are globally coherent and are accessed in between the calls. However, it is always possible to move code across ReorderThread since it does not cause or observe side effects.
In last week's meeting we agreed to no monolithic API, but would make adjustments to the spec for clarity.
Options going forward:
- In the spec, we could say that calling the three functions in sequence is semantically the same as calling TraceRay.
- In the spec, explicitly call out that the compiler and driver might merge these functions, and that putting anything in between the calls might prevent that.
- Have example code as a helper function: the example code calls TraceRay, ReorderThread, and Invoke, the implementation does those three in a row, and it's just sample code for the case where the app wants them fused.
- Add a helper library to the compiler and ship some helper functions as an HLSL header.

@rdrabins given the above options, which would Intel prefer?
From the options above, the last one seems most preferable - a function we can reliably pattern match and which is "officially sanctioned".
---

#### HitObject::Invoke
As in the comment on TraceRay, we are interested in adding more communication about reordering.
We'd consider a set of values:
- `PreferNoReorder`: (same as for `TraceRay`) the default value, with behavior according to the current spirit. The application doesn't need the reordering to happen, but the call is still a reorder point and the implementation can decide to reorder. Note: this could also simply be spelled out in the spec to be the default if Reorder calls are present.
- `Reorder`: with this value, the `[Reorder]HitObject::Invoke( hit, payload )` call would be equivalent to:

  ```C++
  ReorderThread( hit );
  HitObject::Invoke( hit, payload );
  ```

  The value should be limited to shader stages supporting `ReorderThread(...)`.
- `Reorder(hint,bits)`: calling `[Reorder(hint,bits)]HitObject::Invoke( hit, payload )` would be equivalent to:

  ```C++
  ReorderThread( hit, hint, bits );
  [PreferNoReorder]HitObject::Invoke( hit, payload );
  ```

  Like above, the value should be limited to shader stages supporting `ReorderThread(...)`.
We don't propose `NoReorder` here. In the case of TraceRay, implementations already have to have inline raytracing in place, so they can probably use it for implementing that.
Depending on the implementation, the shader invocation may necessarily involve reordering as a byproduct of how the function call is implemented.
Such an implementation has to deal with difficult choices in compilation, e.g. given:

```C++
ReorderThread(hit, hint, bits);
color = tex.Sample(sampler, UV);
HitObject::Invoke( hit, payload );
```

- Does the sampling benefit from the reordering before it or not?
- Should it reorder twice (at `ReorderThread` and at `Invoke`), or make `ReorderThread` a no-op and reorder according to the hint while invoking the shader?

When the separation of `ReorderThread` isn't an important part of the algorithm, it's easier if the application writes it like this:

```C++
color = tex.Sample(sampler, UV);
[Reorder(hint, bits)]HitObject::Invoke( hit, payload );
```
Again, we do not see a real need for this. If attributes are present at these calls, there is a risk that the user would simply specify [Reorder] on both TraceRay and Invoke, which does not actually mean what we want.
For questions like whether Sample benefits from a particular location w.r.t. reordering, the assumption should be that it has an impact and that the application performs the access in the coherence scope where it makes the most sense. However, it does not necessarily hinder fusion. If it is without dependencies, it can still be moved before or after the fused call.
Same as above, generally agreed no attributes.
I'll disclose that I'm not a DXR expert. A lot of my feedback is copy editing, trying to make the document easier to read. I feel a bit stronger about my thoughts on testability and DXIL op feedback.
State value getters return scalar, vector or matrix values dependent on the provided opcode.
Scalar getters use the `hitobject_StateScalar` dxil intrinsic and the return type is the state value type.
Vector getters use the `hitobject_StateVector` dxil intrinsic and the return type is the vector element type.
Matrix getters use the `hitobject_StateMatrix` dxil intrinsic and the return type is the matrix element type.
I think it might be clearer to leave these out of the table above and give them their own special section. It already looks like a new topic because of the table. I think dropping them to the bottom instead of in the middle of the list of intrinsics would be less confusing.
I moved them last, but I think I disagree about leaving them out of the initial opcode table. That table is supposed to list all opcodes and their values. The later table doesn't have the values.
### Memory coherence and visibility

Due to the existence of non-coherent caches on most modern GPUs, an
application must take special care when communicating information across
reorder points through memory (UAVs).

Specifically, if UAV stores or atomics are performed on one side of a reorder
point, and on the other side the data is read via non-atomic UAV reads, the
following is required:
- The UAV must be declared `[globallycoherent]`.
- The UAV writer must issue a `DeviceMemoryBarrier` between the write and the
  reorder point.
Requiring global memory coherence can be a large hit to performance, especially since a driver can decide not to sort if the workload is coherent (or sort in a more local manner).
Using `globallycoherent` and `DeviceMemoryBarrier` makes it impossible for a driver to omit the synchronization work when it's not needed.
I propose that we add a new UAV attribute and barrier scope that relates the synchronization to reordering. So:
- The UAV must be declared `[reordercoherent]`.
- The UAV writer must issue a `Barrier(UAV_MEMORY, THREAD_REORDER_SCOPE)` between the write and the reorder point.

That way, when a driver decides not to reorder, the barrier can be a no-op.
This seems useful at first glance, but expanding the memory model may overly broaden the scope of this proposal. It might be better suited as a separate proposal.
It is also unclear what use case [reordercoherent] is addressing. A UAV marked as [reordercoherent] can only communicate within the same DispatchRaysIndex, across any reordering. This essentially makes it a local scratch pad, which could be more efficiently handled using live state in the shader program (or as payload values if communicated with other shaders).
> This seems useful at first glance, but expanding the memory model may overly broaden the scope of this proposal. It might be better suited as a separate proposal.

It needs one more enum value in `SEMANTIC_FLAG` and (optionally? see below) a `[reordercoherent]` marker on UAVs. Vulkan also has a separate scope, `ShaderCallKHR`, for this.
Requiring global coherence is surely fine for some implementations but may incur a performance hit in other implementations.

> It is also unclear what use case [reordercoherent] is addressing. A UAV marked as [reordercoherent] can only communicate within the same DispatchRaysIndex, across any reordering.

The marker was meant to connect the barrier to certain UAVs. This allows UAVs that are not explicitly marked to skip synchronization.
I suppose an alternative is to not require marking UAVs but expect all bound UAVs to be synchronized with a reorder-scope barrier, although this may be expensive if reordering happens globally. Then this would be equivalent to marking all UAVs as `[globallycoherent]`.
> The marker was meant to connect the barrier to certain UAVs. This allows UAVs that are not explicitly marked to skip synchronization.
This part I understand. What I would like to understand is the use case for a reorder-scope barrier. Since it can only be used to synchronize within the same DispatchRaysIndex, the data that is written/read to/from the reorder-scope coherent UAV can instead be stored as live state or in payload during intermediate steps and then be written once at the end (if needed). This seems more efficient overall.
Yes. "Memory coherence and visibility" has been updated with the proposal.
Yes, thanks for that.
The current text reads fine. I think it would be good to add that the new `REORDER_SCOPE` value is added to the `SEMANTIC_FLAG` enum, and that `[reordercoherent]` is a new storage class that can be annotated on buffers, just to make clear to readers that these two are added in this proposal.
It looks like there's a large number of unresolved conversations on this PR - these either need to be moved into issues or addressed in the spec before we can merge it.
Co-authored-by: Greg Roth <grroth@microsoft.com>
I have picked proposal number 0025. All pending changes have been made. If anything is left to discuss or change I suggest we open separate issues after the merge.
> I have picked proposal number 0025. All pending changes have been made. If anything is left to discuss or change I suggest we open separate issues after the merge.
This seems like a good approach to me, it's getting quite hard to keep track of everything in this PR! Thank you to everyone for all the detailed comments and responses!
I'm going to mark this as approved, but I think we'll still need to get all the conversations marked as resolved before the PR can be merged, and this repo is set to require two approvals as well.
For resolving the conversations we should follow up with the original authors and/or file issues (and link them in the conversation before resolving it) if necessary.
The `HitObject` has value semantics, so modifying one `HitObject` will not
impact any other `HitObject` in the shader. A shader can have any number of
Is there a reason for HitObject objects having value semantics while RayQuery objects have reference semantics?
HitObject is lighter weight than RayQuery (no traversal state, only hit information). RayQuery mutates during traversal. There's also a lifetime problem with references. For example, if you allocate a RayQuery within a loop but assign it to a variable outside, it will appear to go out of scope. Because of the value-semantics of HitObject, this will just work.
similarity of subsequent work being performed by threads. The resolution of
the hint is implementation-specific. If an implementation cannot resolve all
values of `CoherenceHint`, it is free to ignore an arbitrary number of least
significant bits. The thread ordering resulting from this call may be
We'd like to have a clearer description of how the hint maps to coherence.
Could we add a sentence stating that higher bits of the hint have more impact on coherence than lower ones?
E.g., consider 3 hints and choose the most coherent pair:

```
A = 0b100000
B = 0b111111
C = 0b011111
```

For example, three methods for extracting coherence by defining the opposite divergence metric:
- If a higher bit carries more coherence than all lower bits after it, the `A`, `B` pair is most coherent. Divergence is more or less `xor(...)`.
- If it's arithmetic difference, then `A` and `C`. Divergence is more or less `abs(subtract(...))`.
- It could also be `divergence = bitcount(xor)`, which tries to find the most similar bits; then `B` and `C` are most coherent.

Method 3 doesn't cope well with "it is free to ignore an arbitrary number of least significant bits", but 1 and 2 fit well into the definition.
I'm mentioning this because we discuss this feature only as reordering and forget that, based on the ordering, the system needs to extract waves to execute.
Without understanding what the actual "distance" between the hint codes is, that's hard. It would be good to have that understanding on both sides of the GPU.
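The three candidate metrics can be compared on the example hints; a quick sketch (treating the hints as plain integers, with helper names invented for illustration):

```python
# Three candidate divergence metrics for coherence hints
# (smaller divergence = more coherent).
A, B, C = 0b100000, 0b111111, 0b011111

def div_xor(x, y):
    # Method 1: numerical value of the XOR (higher bits dominate).
    return x ^ y

def div_arith(x, y):
    # Method 2: arithmetic difference.
    return abs(x - y)

def div_popcount(x, y):
    # Method 3: number of differing bits.
    return bin(x ^ y).count("1")

def most_coherent(div):
    pairs = {"A,B": (A, B), "A,C": (A, C), "B,C": (B, C)}
    return min(pairs, key=lambda k: div(*pairs[k]))

print(most_coherent(div_xor))       # A,B
print(most_coherent(div_arith))     # A,C
print(most_coherent(div_popcount))  # B,C
```

Each metric picks a different "most coherent" pair on the same three hints, which is why the spec should state which interpretation is intended.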
I believe arithmetic difference is the most flexible and intuitive definition for users. This approach allows a hint to incorporate both hierarchical aspects (with more important elements in MSBs) and ranges of bits that represent closeness along some axis. I agree it should be spelled out.
Agreed that it makes sense to add some details about what we mean by a hierarchical hint, and to give an example.
However, I think the numerical value of the XOR makes more sense than arithmetic difference. We could also mention (numerical) sorting, even if the implementation doesn't explicitly sort. It coincides with lexicographical sorting by individual bits, which nicely matches the hierarchical interpretation.
Consider:

```
hint1 = 0b0000
hint2 = 0b0111
hint3 = 0b1000
```

If the bits in the hint are intended to be used hierarchically (as per my understanding), then `hint2` is closer to `hint1` than to `hint3`. That is reflected in the numerical value of the XOR: `hint2 ^ hint1 = 0b111`, while `hint2 ^ hint3 = 0b1111`. However, using arithmetic difference, `hint2` would be closer to `hint3`.
Edit: Fixed the example.
XOR prioritizes the strictly hierarchical use case. The problem with XOR is when a bit range represents similarity along some linear axis and we want to group based on closeness.
Take for example the following X,Y hint pairs along some linear axis of similarity:

| X | Y | Arith | XOR |
|---|---|---|---|
| 5 | 0 | 5 | 5 |
| 5 | 1 | 4 | 4 |
| 5 | 2 | 3 | 7 |
| 5 | 3 | 2 | 6 |
| 5 | 4 | 1 | 1 |
| 5 | 5 | 0 | 0 |
| 5 | 6 | 1 | 3 |
| 5 | 7 | 2 | 2 |
| 5 | 8 | 3 | 13 |
| 5 | 9 | 4 | 12 |

In this use case, arithmetic difference produces a desirable metric but not XOR.
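The table can be reproduced mechanically; a small sketch:

```python
# Arithmetic vs. XOR divergence for hints along a linear axis of similarity.
X = 5
rows = [(X, Y, abs(X - Y), X ^ Y) for Y in range(10)]
for x, y, arith, xor in rows:
    print(f"{x} {y} {arith:5d} {xor:3d}")

# Arithmetic distance grows smoothly away from Y == X, while XOR jumps at
# power-of-two boundaries: 5 and 8 differ by 3 arithmetically but by 13 in XOR.
```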
I was interpreting the hints to have hierarchical semantics, but checking the spec again I couldn't find any explicit mention of that, so fine with me.
Again, good point that we should explain this.
I would prefer if the expectation was that we'd group A&B in your second example, based on the XOR distance. Given the grouping is allowed to throw away lower-order bits, it already implies higher-order bits are more important, and it will likely be used by ISVs to set "feature bits" anyway, because "arithmetic distance" between shaders is not really something you can calculate. But you can say "those two shaders are going to do X", and we want to group by similarity, so the more similar bits are set, the more likely they are similar, with higher bits used to indicate "more expensive" things.
Undefined means it'll be defined by implementations, after all.
One thing to note here is that the XOR codes can be converted to arithmetic codes while preserving distance orderings at the cost of using a few more bits. One such mapping is shown below.
In the second example above we would have:
Label | XOR code | Distance preserving arithmetic code |
---|---|---|
B | 63 | 364 |
A | 32 | 243 |
C | 31 | 121 |
With this transform we can use an arithmetic distance, yet B&A remain closer than A&C.
I'm not convinced that it's easier for applications to produce an arithmetic distance given applications will know features, not distances. I.e. a bit corresponds to a particular property, it's hard to imagine how an application would calculate an arithmetic distance from a set of feature bits. To me we got this backwards - now you'd have to take a "feature based metric" from the app, convert it to arithmetic distance using the reverse encoding, then unpack it to XOR again for sorting - what did we win? Would you mind posting the formula for both directions in this case?
For clarity, in the above the application isn't computing an arithmetic distance. It's converting an xor code into an arithmetic code, such that distance orderings are preserved.
The code conversion above reinterprets the xor code bits in trinary. So where a bit would normally be interpreted to have value 2^i the conversion interprets it as 3^i. So xor code 0b0101 (aka 5) becomes 3^2 + 3^0 = 10. This inserts a gap between each new option bit sufficient to preserve distance orderings.
In application code this could be a change from `xor_code |= option_bit;` to `arith_code += option_value;`. Of course we could imagine more complex scenarios than simply accumulating option constants.
I don't think it is possible to preserve arithmetic distance orderings using an xor code. Maybe it can be done but it's not obvious to me how to do it. So unfortunately I can't post the formula for the opposite direction.
> take a "feature based metric" ... what did we win?
Since there appear to be valid arguments for both arithmetic and xor distances what we won is having a single consistent code space (arithmetic) which the application can use to encode either arithmetic or xor distances. The implementations can always group by arithmetic distance and the application has the freedom to construct its codes to reflect either distance ordering.
The most important issue with arithmetic codes, for me, is that they are not even good for arithmetic-based HW if it drops bits: the implementation has to do rounding when dropping.
E.g., original codes given by the app:

```
A 101000000
B 100111111
C 101111111
```

If the implementation drops 6 bits, it sees:

```
A 101______
B 100______
C 101______
```

For arithmetic, the closest pair changes from A and B to A and C when bits are dropped; for XOR it stays A and C.
Simply put, arithmetic is not stable under dropping of bits.
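The instability can be checked numerically; a sketch where `drop` models an implementation that ignores the 6 least significant bits of the hint:

```python
# Hint codes from the example above.
codes = {"A": 0b101000000, "B": 0b100111111, "C": 0b101111111}

def drop(c, bits=6):
    # Model an implementation that ignores the least significant bits.
    return c >> bits

def closest_pair(div):
    names = sorted(codes)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: div(codes[p[0]], codes[p[1]]))

arith = lambda x, y: abs(x - y)
xor = lambda x, y: x ^ y

print(closest_pair(arith))                                 # ('A', 'B')
print(closest_pair(lambda x, y: arith(drop(x), drop(y))))  # ('A', 'C')
print(closest_pair(xor))                                   # ('A', 'C')
print(closest_pair(lambda x, y: xor(drop(x), drop(y))))    # ('A', 'C')
```

Only the arithmetic metric changes its answer when the low bits are dropped.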
I wonder if we could pass a literal with a mask that at least describes the structure of the hint. For instance, if the hint given by the app has the structure:

```C++
struct HintStructA {
    uchar featureA : 1;
    uchar featureB : 1;
    uchar LittleEnum : 2;
    uchar featureD : 1;
    uchar LittleArithCode : 3;
};
```

then ReorderThread, instead of having an important-bit-count argument, could have a mask (a compile-time constant!) that marks on which bit each field starts. For the above `HintStructA` it would be `0b11101100`.
When dropping bits, the implementation may drop whole fields. The app can express its wish of how the hint should be understood, including the interpretations we are discussing above, e.g.:

```
arithmetic 8bit 0b10000000
xor 8bit        0b11111111
```
```C++
// Primary ray wants perfect shadows and does not need to reorder as it is
// coherent enough.
```
We'd appreciate using language that doesn't suggest that there is no reorder at trace ray. The expectation should be that the implementation only minimizes it, for instance:

```C++
// Primary ray wants perfect shadows and does not need
// to reorder more than required and best performant for given implementation.
// The domain of shadow rays is expected to be coherent enough.
```
The latest commit includes clarifications like this.
Co-authored-by: Tex Riddell <texr@microsoft.com>
- The UAV must be declared `[reordercoherent]`.
- The UAV writer must issue a `Barrier(UAV_MEMORY, REORDER_SCOPE)` between
  the write and the reorder point.

When communicating both between threads and across reorder points, global coherency
can be utilized:
- The UAV must be declared `[globallycoherent]`.
- The UAV writer must issue a `DeviceMemoryBarrier` between the write and the
  reorder point.
Should we also require a `DeviceMemoryBarrier` between a reorder point (including shader entry) and the read?
When not requiring it, the compiler would have to conservatively assume an implicit barrier after every reorder point that is followed by a `reordercoherent`/`globallycoherent` UAV read, even in cases where the app does not require one. Whether this implicit barrier may cost performance depends on the implementation.
I think this is related to the general discussion on the DXR memory model and how `globallycoherent` should be correctly used for inter-thread communication even in non-reorder cases.
From my point of view, the `DeviceMemoryBarrier` call makes sure that all writes have completed before the reorder point gets to move the work to a different thread. When reading after the reorder point, there's no ordering problem. The only risk is that stale data is read from a local cache. That is mitigated by the `globallycoherent` attribute, causing the local cache to be bypassed.
> The only risk is that stale data is read from a local cache. That is mitigated by the `globallycoherent` attribute, causing the local cache to be bypassed.

That depends on how the implementation implements `globallycoherent`. For instance, I think it is possible to implement `globallycoherent` by allowing such loads and stores to be served by local caches, and only invalidating/flushing caches at explicit memory barriers. Such an implementation would be unnecessarily forced to invalidate the local cache on shader entry in cases where such a UAV is accessed, but the shader doesn't actually require the synchronization.
Also, from a formal perspective, I find it a bit problematic to define abstract language semantics in terms of "local caches" of the implementation. It would be better to say that the barrier between the write and the reorder point serves as a release operation, and the suggested corresponding barrier before the read as an acquire operation. HLSL doesn't (yet?) support release/acquire, but it would be closer to it and could more easily be transformed to use release/acquire semantics in the future. With a single barrier, that's not possible.
I agree in principle, but we are not trying to extend the memory model beyond compute here (besides the new barrier type). The reorder point is simply seen as a change of thread. In a compute shader, it would surely be enough with a barrier on the write side to ensure write ordering?
D3D11_3_FunctionalSpec, section 7.14.4 indeed mentions local caches. Also see "Producer - consumer dataflow through UAVs" in Work Graphs.
Let's separate the `reordercoherent` and `globallycoherent` discussions. I think `globallycoherent` semantics in DXR are a bit problematic because of the synchronization limitations, but that's not the focus of this change.
For `reordercoherent`, are you saying that you agree that not requiring a barrier between the reorder point and the load does penalize some implementations, but we accept that in favour of a simpler API that is closer to what e.g. WG require?
What I'm trying to say is that we are not trying to add acquire and release semantics to HLSL with this spec proposal.
What you are suggesting is calling `Barrier` both before and after the reorder point. First of all, note that you would have to have a barrier after each and every reorder point that wants to read the data after a single write, since each reorder point will be a new thread. So a single write can result in many required barriers, which may be difficult for the user. Second, there is no distinction between acquire and release with `Barrier`, which means that all possible implementations are harmed (e.g., ordering + cache flush at both sides). Finally, it is not clear that there is any case where a high-frequency cache flush would be better than adopting the `globallycoherent` semantics of bypassing local caches, which all implementations already support.
> What I'm trying to say is that we are not trying to add acquire and release semantics to HLSL with this spec proposal.

Agreed.

> What you are suggesting is calling `Barrier` both before and after the reorder point. First of all, note that you would have to have a barrier after each and every reorder point that wants to read the data after a single write since each reorder point will be a new thread.

Nit: Let's not mix the thread terminology (as in the `globallycoherent` discussion) with the reorder terminology. We can say that at reorder points, `reordercoherent` synchronization is necessary. If we want to give it a name, maybe "execution locus" is a better term than "thread"?

> So a single write can result in many required barriers, which may be difficult for the user.

Agreed.

> Second, there is no distinction between acquire and release with `Barrier`, which means that all possible implementations are harmed (e.g., ordering + cache flush on both sides).

In the worst case, putting a `reordercoherent` barrier immediately after every reorder point (CallShader/TraceRay/ReportHit/HitObject ops/shader entry) should be equivalent to the currently proposed behavior without the barrier on the reading side. Because it immediately follows a reorder point, it is clear that only reads can be affected, and if the implementation uses the bypass strategy nothing needs to be done. An implementation that doesn't use cache bypass would have to add exactly all these barriers, and the explicit one just makes that explicit.

> Finally, it is not clear that there is any case where a high-frequency cache flush would be better than adopting the `globallycoherent` semantics of bypassing local caches, which all implementations already support.

Agreed that this is totally unclear in general, but I think when designing a specification that is supposed to be used for an extended lifetime, including on not yet existing hardware, one should be cautious not to unnecessarily restrict the freedom of possible implementations.

To be clear, I am not demanding to change it to require the barrier on the reading side; I'm rather pointing out a potential issue. If there is consensus that the single-barrier solution is preferable, I'm totally fine with that.
We re-discussed internally and are fine with the proposed solution of only requiring a barrier on the write side. It may penalize some implementations, but it's really a very special corner case where it makes a difference, and it's questionable whether it's worth the effort optimizing for that.
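As a hedged illustration of the agreed write-side-only rule (the buffer name, stored value, and the exact `HitObject::TraceRay`/`ReorderThread` call shapes are illustrative, not normative):

```hlsl
// Sketch: publish data across a reorder point with a barrier on the
// write side only; the read side needs no matching barrier.
reordercoherent RWStructuredBuffer<uint> Scratch : register(u0);

[shader("raygeneration")]
void RayGen()
{
    uint idx = DispatchRaysIndex().x;
    Scratch[idx] = 42u;          // UAV store before the reorder point
    DeviceMemoryBarrier();       // required barrier on the write side

    HitObject hit = HitObject::TraceRay(/* acceleration structure, flags,
                                           ray, payload, ... */);
    ReorderThread(hit);          // reorder point

    uint value = Scratch[idx];   // non-atomic UAV read; no barrier required
}
```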
> This proposal includes a new coherence scope for reorder points. The scope
> is limited to communication within the same dispatch index. Specifically,
> if UAV stores or atomics are performed on one side of a reorder point, and
> on the other side the data is read via non-atomic UAV reads, the following
In general, there are two cases of UAV accesses across reorder points:

- within the same shader (e.g. RGS) across e.g. a `TraceRay`
- across shaders, e.g. a store in RGS and a load in CHS

We should explicitly mention both cases and the required behavior.

My current understanding is that we require `reordercoherent` in both cases, which seems to be different from Vulkan, which AFAIK doesn't require explicit synchronization within a shader.
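A hedged sketch of the two cases (the acceleration structure, payload struct, buffer names, and ray setup are all hypothetical):

```hlsl
RaytracingAccelerationStructure Scene : register(t0);
reordercoherent RWStructuredBuffer<uint> Data : register(u0);

struct MyPayload { uint value; };

// Case 1: within the same shader (RGS), across a TraceRay reorder point.
[shader("raygeneration")]
void RayGen()
{
    uint idx = DispatchRaysIndex().x;
    Data[idx] = 1u;            // UAV store before the reorder point
    DeviceMemoryBarrier();     // write-side barrier

    RayDesc ray;
    ray.Origin    = float3(0, 0, 0);
    ray.Direction = float3(0, 0, 1);
    ray.TMin      = 0.0;
    ray.TMax      = 1.0e6;

    MyPayload payload = { 0 };
    TraceRay(Scene, RAY_FLAG_NONE, 0xFF, 0, 1, 0, ray, payload);

    uint v = Data[idx];        // read after the reorder point
}

// Case 2: across shaders -- the store above in RGS, a load here in CHS
// (the shader transition itself is a reorder point).
[shader("closesthit")]
void ClosestHit(inout MyPayload payload,
                BuiltInTriangleIntersectionAttributes attrs)
{
    payload.value = Data[DispatchRaysIndex().x];
}
```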
It is mentioned that a shader transition is another reorder point so the same rules apply. Does that cover it? Do you have a relevant reference for Vulkan?
I agree that the current wording implicitly means that `reordercoherent` is required in both cases, but I think we should explicitly spell that out, particularly if it differs in a subtle way from what Vulkan does.

Check the `ShaderCallKHR` scope in https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#shaders-scope-shadercall. I interpret it as requiring explicit synchronization across shader call boundaries, but not across reorder points within a shader.
> The canonical use of these features involve changing a `TraceRay` call to the
> following sequence:
We've spoken last time about this being functionally equivalent. The current sentence "almost" says that. Could we state this more explicitly?
Suggested change:

-  The canonical use of these features involve changing a `TraceRay` call to the following sequence:
+  The canonical use of these features involve changing a `TraceRay` call to the following sequence that is functionally equivalent:
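For reference, the decomposition under discussion replaces a single `TraceRay` call with the following functionally equivalent sequence (the `TraceRay` arguments, ray, and payload are illustrative):

```hlsl
// DXR 1.0 style: trace and shade in one call.
// TraceRay(Scene, RAY_FLAG_NONE, 0xFF, 0, 1, 0, ray, payload);

// SER style: split into trace, reorder hint, and shading invocation.
HitObject hit = HitObject::TraceRay(Scene, RAY_FLAG_NONE, 0xFF, 0, 1, 0,
                                    ray, payload);
ReorderThread(hit);               // reorder point: sort threads by hit for coherence
HitObject::Invoke(hit, payload);  // run the closest-hit or miss shader
```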
Add Shader Execution Reordering (SER) proposal for consideration.