The Return of Cast Operations #503
Comments
On 2020-06-26 2:10 p.m., billhuffman wrote:
We've decided to require (the appearance of) SLEN=VLEN. Early in the
discussion of the issue, we considered having cast operations that
would rearrange for different element size because it was important in
a small number of codes. We may still want to have those cast
operations, though now only for performance on wide machines. The cast
instruction will perform better on a wide, in-order implementation
than the auto-inserted micro-op. As before, the cast is a nop on
narrow machines.
Is a vector move to itself (vmv.v.v v#,v#) sufficient for most cases?
It could thus be defined as the cast hint to the current SEW, since
there is no longer a need for the cast to specify a to/from Element
Width (EW).
(I didn't see any specific proposals with arguments/format et al.
Although I was on the look out for them).
I can envision a preemptive situation, in which a register write is
prefixed (-or- a following cast instruction fused to it) with its
expected target EW, to provide that register write a "preferred"
structure. In the case of multiple reads of that register the micro-ops
are avoided.
And this could be done in advance of the vsetvli to the new SEW (or EEW
in the case of a narrowing op).
However, is this a frequent use case? Sufficient to provide a specific
cast instruction?
If it is sufficiently significant, I would rather propose a prefix cast
hint.
(the mv with or without the prefix will suffice as a cast op).
… I think all the issues about fragmentation are gone here. With and
without both work on all implementations. But for performance
optimization, they will want to be used on wide, in-order
implementations and possibly also on wide, OoO implementations.
|
That might work. My hesitation would be that a move would likely take one cycle and a cast would likely take more. So, the dispatch computation would need to have taken account of the current EW and the move instruction before dispatch. I assume the compiler would know it was changing widths and could leave additional cycles in its expectation of when the result would be available.
So the extra dispatch recurrence time is what would worry me. Worth thinking about. It at least separates the rearrangement operation from the following use at different width.
I wonder if a "cast to width" instruction would be better. It would be assumed to take longer to execute and could be dispatched without comparing widths and doing different things for change-of-EW and no-change-of-EW scenarios. So, move would assume no-change-of-EW and cast would assume change-of-EW. Maybe there's an encoding similar to vmv.v.v that could do this.
Bill
|
Ummm.... I didn't fully take in what you were saying before.
You were saying, in my terminology, that dispatch would assume a vmv.v.v with identical source and destination takes the additional cycles. In any case, vmv.v.v with source=destination would be assumed to represent a cast sort of operation to SEW.
I take back my hesitation. I think that might work well. I would like to see at least non-normative text suggesting that so that it works the same way across all implementations that need the hint.
Bill
|
I'd say let's put a non-normative comment that vmv.v.v with dest=source is expected to be used as a hint, for implementations that need it, that element width is changing from an old element width to the current SEW. If that's added, we can close this issue. |
The beauty is that it (vmv.v.v with source=destination) does the right thing as a hint or as a physical move when internal EW is tracked. Although it only supports internal reformatting to "SEW friendly" internal state, I agree, however, that v1.0 could benefit from the inclusion of vmv.v.v with dest=source described as a hint. It helps by
Do you have any proposed wording? |
How about:
The vmv.v.v instruction with source = destination is a functional nop. It is used as a "hint" to indicate the element width of the next use when the element width of the previous use likely was different. Implementations may execute the nop move, drop the instruction entirely, or rearrange the bytes of the vector register as needed for best performance assuming the element width of the next use will be the current SEW.
Bill
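A minimal sketch of the three permitted behaviours, in Python. Everything here is an assumption made for illustration: the round-robin lane layout, the names `VReg`, `to_internal`, `to_architectural` and `vmv_self_hint`, the 16-byte register, and the fact that `vl` and tail handling are ignored. It only shows that the architectural (memory-order) contents stay the same whichever option an implementation picks.

```python
from dataclasses import dataclass

LANES = 4  # hypothetical number of internal lanes in a wide machine


def to_internal(arch_bytes, ew_bytes):
    """Distribute elements of width ew_bytes round-robin across lanes (assumed layout)."""
    lanes = [bytearray() for _ in range(LANES)]
    for i in range(0, len(arch_bytes), ew_bytes):
        lanes[(i // ew_bytes) % LANES] += arch_bytes[i:i + ew_bytes]
    return [bytes(lane) for lane in lanes]


def to_architectural(lanes, ew_bytes):
    """Inverse of to_internal: recover the stride-1 (memory-order) byte image."""
    n_elems = sum(len(lane) for lane in lanes) // ew_bytes
    out = bytearray()
    for e in range(n_elems):
        lane = lanes[e % LANES]
        slot = e // LANES
        out += lane[slot * ew_bytes:(slot + 1) * ew_bytes]
    return bytes(out)


@dataclass
class VReg:
    lanes: list
    ew: int  # element width in bytes that the internal layout is arranged for

    def arch(self):
        return to_architectural(self.lanes, self.ew)


def vmv_self_hint(reg, sew_bytes, policy="rearrange"):
    """Model of vmv.v.v vN, vN: architecturally a nop under every policy."""
    if policy in ("drop", "move") or reg.ew == sew_bytes:
        return reg  # dropped, or executed as a plain move onto itself
    # Assume the next use is at the current SEW and rearrange the bytes for it.
    return VReg(to_internal(reg.arch(), sew_bytes), sew_bytes)
```

For example, a register last written at 8-bit width and hinted with SEW=32 keeps the same `arch()` bytes under all three policies; only the `"rearrange"` policy changes the internal `lanes`.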
|
A minor tweak in the last sentence:
or *assume that element width of the next use will be the current SEW
*and rearrange the bytes of the vector register as needed for best
performance.
What also needs to be added is a section like RVI and RVC on hints.
I will open an issue on this topic, as I believe the above hint
description would benefit from that context.
On 2020-06-26 5:40 p.m., billhuffman wrote:
How about:
The vmv.v.v instruction with source = destination is a functional nop.
It is used as a "hint" to indicate the element width of the next use
when the element width of the previous use likely was different.
Implementations may execute the nop move, drop the instruction entirely,
or *assume that element width of the next use will be the current SEW
*and rearrange the bytes of the vector register as needed for best
performance.
|
On 6/26/20 3:58 PM, David-Horner wrote:
A minor tweak in the last sentence:
or *assume that element width of the next use will be the current SEW
*and rearrange the bytes of the vector register as needed for best
performance.
Yes. Better.
Bill
|
This looks like a good resolution. |
Might want both vmv.v.v form for hint that "vl" elements need to be rearranged, and vmv?r.v hints for whole vector register groups. |
I disagree with multiple formulations of whole register (WR) loads to provide hint to microarch of requested internal structure, because (detailed below)
1.. these hints only potentially benefit machines with in-register not in-mem order
I believe it is premature to provide these whole register (WR) load hints. Further, I recommend we emphasize the in-memory format of the model. Specifically, that software can consider unit strided elements as equivalent to unit strided elements of narrower widths (and wider elements if alignment requirements are met) with an appropriately adjusted count.
from RVI:
As with JALR, refinement of conventions and recommended hardware response can evolve.
|
David,
We have agreed on a model where registers always behave as they would if they were stride-1 memory. The hints significantly help wide machines, don't take enough encoding space to matter, and are never required to be any particular value.
Specific comments interleaved.
On 7/8/20 6:40 AM, David-Horner wrote:
I disagree with multiple formulations of whole register (WR) loads to provide hint to microarch of requested internal structure, because (detailed below)
1.. these hints only potentially benefit machines with in-register not in-mem order
We make simple additions and choices for various classes of machine.
2.. better mechanisms should be possible to optimize behaviour for such machines
I think the "better" mechanisms you suggest below are significantly more complex and I'm not clear for most of them that they work.
3.. the designation has diminished value for multiple register load with a single width hint.
A compiler that cares about this issue can avoid it most of the time. There will sometimes be a tradeoff.
4. The typical WR use case is for autonomous/independent spill and fill that are unaware of subsequent use
The performance sensitive cases are for spill/fill in the same function and the compiler knows how the register is being used. In complex functions, spill/fill performance matters in a number of codes that are currently heavily used for us. I expect a factor greater than 2x loss for some functions without this. Cases like full context switch where the software has no idea hardly matter.
5.. it complicates both the hardware and software causing potential confusion to each
(software needs to use a size specific load in a context where such info is not immediately available if at all)
Hardware either knows how the bits are to be used or ignores them. Most hardware (and all simple hardware) will ignore the bits.
Broadly used software should know how to set the bits correctly most of the time. Where they're not correct, it's a performance loss and those that care about the loss can upstream improvements. Code that's used only on machines that don't care can not care.
6. hints can lead to lower performance for machines not needing them (a consequence of 5).
As far as I can tell, there's absolutely no loss of performance. Ever. Can you suggest a case where there's a loss?
Bill
I believe it is premature to provide these whole register (WR) load hints.
I propose we only use “e1024 encoding” (mop-width all ones) for WR loads and for WR stores.
The implementation will decide what internal format to use, software can assume current sew is used.
Conventions can be used to optimize the re-establishment of internal width information.
Further, I recommend we emphasize the in-memory format of the model. Specifically, that software can consider unit strided elements as equivalent to unit strided elements of narrower widths (and wider elements if alignment requirements are met) with an appropriately adjusted count.
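To illustrate the "adjusted count" equivalence, here is a small Python sketch using the standard `struct` module. It assumes little-endian memory and a 16-byte image; the variable names and element values are made up for the example.

```python
import struct

# Four 32-bit elements stored unit-stride, little-endian.
words = [0x03020100, 0x07060504, 0x0B0A0908, 0x0F0E0D0C]
image = struct.pack("<4I", *words)

# The same bytes viewed as narrower elements: the count scales by the width ratio.
halves = struct.unpack("<8H", image)    # 8 elements of 16 bits
octets = struct.unpack("<16B", image)   # 16 elements of 8 bits

# The same bytes viewed as wider elements (alignment permitting): half the count.
dwords = struct.unpack("<2Q", image)    # 2 elements of 64 bits
```

The byte image is never rewritten; only the element width and count used to interpret it change.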
from RVI:
We considered but did not include static branch hints in the instruction encoding. These
can reduce the pressure on dynamic predictors, but require more instruction encoding space and
software profiling for best results, and can result in poor performance if production runs do not
match profiling runs.
1. Only physical implementations that have differing in-register order formats will potentially benefit from these hints. It thus makes sense to me that such hints be proposed as an extension rather than in the base. It would then get review for the matters considered below.
2. better mechanisms should be possible to optimize behaviour for such machines
a) implementations may stack and pop internal width information (just as JALR uses a convention to stack and pop return addresses). This would be based on a convention for pairing stores and loads.
A possible convention is to stack the width on a whole register store whose address is based on x2, the stack pointer.
b) Another approach implementations could use is store width with address of store, so that load from that address will be associated with the corresponding internal width format.
c) an implementation could always load with the internal format that is least disruptive for its ELEN range. Or dynamically change the default given the recently executed whole register stores.
d) a similar tracking that a microarch performs over a loop: either
it decodes ahead and determines the future usage width for the register and stores according to that format or
profiles the use during the last use in the loop and applies the appropriate internal format from that.
This latter is an optimization that potentially benefits multi in-register order machines. It is questionable whether this tracking will provide value, as subsequent use of a register with a different EEW than written is relatively rare (but significantly frequent and, as we decided, impactful for an explicit SLEN<VLEN).
As with JALR, refinement of conventions and recommended hardware response can evolve.
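Options (a) and (b) above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the class names, the default width, the table capacity and the oldest-first eviction are hypothetical and not part of any proposal in this thread.

```python
class WidthStackPredictor:
    """Option (a): push the internal width on a whole register store, pop on the paired load."""

    def __init__(self, default_ew=8):
        self.stack = []
        self.default_ew = default_ew

    def on_whole_reg_store(self, ew):
        self.stack.append(ew)

    def on_whole_reg_load(self):
        # Fall back to a default width when no paired store was seen.
        return self.stack.pop() if self.stack else self.default_ew


class AddressWidthTable:
    """Option (b): remember the internal width keyed by the store address."""

    def __init__(self, default_ew=8, capacity=4):
        self.table = {}
        self.default_ew = default_ew
        self.capacity = capacity

    def on_whole_reg_store(self, addr, ew):
        if addr not in self.table and len(self.table) >= self.capacity:
            # Evict the oldest entry (dicts keep insertion order).
            self.table.pop(next(iter(self.table)))
        self.table[addr] = ew

    def on_whole_reg_load(self, addr):
        return self.table.get(addr, self.default_ew)
```

In both sketches a miss simply returns the default width, which matches the idea that the result is a performance guess and never affects architectural state.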
3. the designation has diminished value for multiple register load with a single width hint.
When a group of registers is loaded, the hint can only provide the width for one of them. Forcing all registers into a single width may optimize the subsequent use of one register at the expense of the (up to 7) other registers.
4. The typical WR use case is for autonomous/independent spill and fill that are unaware of subsequent use.
In the envisioned typical use, called routines or interrupt routines will spill and fill registers as needed to perform the requested function. The internal format information is lost over such a process, potentially causing subsequent performance degradation. Whereas there is the potential for improved code performance, these width hints may actually yield poorer performance in dynamic use situations.
5. As a hint, hardware need only mask out all width designations and perform the same action regardless of width hint. However, providing the hint itself will be a cause of confusion, as even minimal vector implementations will need to understand the SLEN<VLEN issues to ascertain that they have no need to consider the hint. Similarly, software exception handlers and even vector routines will need substantial analysis to determine an optimal hint for a given target class (that may not be relevant to specific machines in that class as described above). Exception handlers especially will be challenged on the optimal use as no mechanism exists to access the current internal format or even last written EEW for any set of registers.
6. hints can lead to lower performance for machines not needing them (a consequence of 5).
A machine design may (erroneously) use the designated hint width for a WR load potentially increasing memory side activity (for example performing byte transfers when word or cache line would be appropriate and more performant).
I am sure there will be other examples.
|
Your comments introduced renumbering to the expanded points.
From item 6 detail: hints can lead to lower performance for machines not needing them (a consequence of 5). Your compelling argument is
This is immediate application with significant anticipated benefit. As for "better" alternatives, tracking branch behaviour was once "too complex" and did not give a significant return, such that branch hints were the preferred method. Times change; the RV architecture in particular is designed to be "forever". I expect the method to apply these hints will evolve as calling conventions and register allocation conventions "improve". So, although I would now agree with providing the hints, I believe at least one should be reserved for when the compiler cannot make an informed choice, such as in interrupt routines. |
David,
Comments interleaved.
On 7/8/20 10:22 AM, David-Horner wrote:
Your comments introduced renumbering to the expanded points.
As far as I can tell, there's absolutely no loss of performance. Ever. Can you suggest a case where there's a loss?
From item 6 detail:
hints can lead to lower performance for machines not needing them (a consequence of 5).
A machine design may (erroneously) use the designated hint width for a WR load potentially increasing memory side activity (for example performing byte transfers when word or cache line would be appropriate and more performant).
I did read that. Of course some hints, such as branch predictions, can cost performance. I don't understand any mechanism by which this one can. Since this is stride-1, memory activity is precisely the same regardless of element width. For implementations that don't do any "interesting" byte arrangements, all loads are the same, with or without the hint. For implementations that do "interesting" byte arrangements, they do loads of different widths already. This hint tells them which of those loads to do. All should take the same time with or without a hint.
So maybe you can be specific about what performance loss you had in mind.
Your compelling argument is
The performance sensitive cases are for spill/fill in the same function and the compiler knows how the register is being used. In complex functions, spill/fill performance matters in a number of codes that are currently heavily used for us. I expect a factor greater than 2x loss for some functions without this.
This is immediate application with significant anticipated benefit.
As for "better" alternatives, tracking branch behaviour was once "too complex" and did not give a significant return, such that branch hints were the preferred method. Times change; the RV architecture in particular is designed to be "forever".
I expect the method to apply these hints will evolve as calling conventions and register allocation conventions "improve".
So, although I would now agree with providing the hints, I believe at least one should be reserved for when the compiler cannot make an informed choice, such as in interrupt routines.
My choice would be as previously suggested, e1024.
I agree. I would like to see one used when the compiler doesn't know. And e1024 seems reasonable. In most codes it won't ever happen.
I would add a code to the store because I'd like to differentiate the store where the compiler knows how to set the load element size from the store where it is not expected to know. Same reason we use a jump rather than a branch on equal of x0 and x0.
Bill
|
I think this is resolved in favor of retaining the EEW hint.
I just added #529, where implementations would be simpler if EEW
worked as in other load/stores and implied an alignment constraint,
making use of e1024 for "unknown" problematic.
Krste
|
We've decided to require (the appearance of) SLEN=VLEN. Early in the discussion of the issue, we considered having cast operations that would rearrange for different element size because it was important in a small number of codes. We may still want to have those cast operations, though now only for performance on wide machines. The cast instruction will perform better on a wide, in-order implementation than the auto-inserted micro-op. As before, the cast is a nop on narrow machines.
I think all the issues about fragmentation are gone here. With and without both work on all implementations. But for performance optimization, they will want to be used on wide, in-order implementations and possibly also on wide, OoO implementations.