Allow widening and narrowing instructions to execute in LMUL=8 #397
Comments
Can we also relax the related constraint ( Edit: As @David-Horner points out below, my proposal is tangential to his, adjacent only in that it (arguably) violates the same microarchitectural simplification that @aswaterman describes below. |
It's a microarchitectural simplification that at most 8 registers are written by an instruction. This matters more for the widening ops (where the 8 registers are in an aligned block) than for segment loads/stores, but it's not free to relax either constraint.
|
I appreciate your revised comment.
The minimum implementation can play tricks, and the loads can be interleaved with the two vwmaccu.vv instructions that replace the one. The interleaved version may even perform better if the loads overlap with vwmaccu.vv. I take no side on Nick's (@knightsifive) comment, other than to say it appears to be tangential to my proposal. It has no direct benefit to groups of 16 in the proposed cases. |
Essentially, you are proposing to allow LMUL==16. This would not improve throughput - throughput will not change if max LMUL is increased from 8 to 16. The only thing that would change is the number of vector instructions/loop iterations required to process long vector(s). Energy/power savings would be minimal. |
It's already complex and expensive to implement segmented loads/stores with |
I don't intend to debate which machines could improve performance with this limited LMUL=16 support. Rather, my concern is that your response implies performance is not improved by increasing LMUL. |
You need to understand how a typical, reasonable vector implementation works (of course, the reasonable design space is very large and the unreasonable one is even larger). For the sake of simplicity and brevity I'll describe a simple example:
For simplicity, let's assume VLEN==128. With LMUL==1, each vector instruction takes 1 cycle/beat (pipelines can of course have multiple stages, but I am assuming perfect pipelining) and processes 128 bits of data. With LMUL==8, each vector instruction takes 8 cycles/beats and processes 8*128 bits of data, which means the processor has to fetch and issue 8 times fewer instructions to process the same amount of data. It also helps if the processor can issue only 1 instruction (vector or scalar) per cycle: with LMUL==8 both vector pipelines can easily be filled with useful work and scalar computations can be done in the background. LMUL==16 would not improve anything significantly over LMUL==8. Of course, in practice VLEN can be larger, e.g. VLEN==256 or VLEN==512, in which case LMUL==16 makes even less sense. |
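The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration of the simplified model described there (one beat per VLEN bits, perfect pipelining); the function name `issue_cost` and the 16 KiB-bit workload are assumptions made for the example, not anything from the spec or from an actual implementation.

```python
def issue_cost(total_bits, vlen=128, lmul=1):
    """Return (instructions, beats) for a workload under the simplified model.

    Assumptions (illustrative only): each instruction covers vlen * lmul bits,
    the datapath retires vlen bits per beat, and pipelining is perfect.
    """
    bits_per_instruction = vlen * lmul
    # Ceiling division: a partial final group still needs one instruction.
    instructions = -(-total_bits // bits_per_instruction)
    # Beats depend only on the datapath width, not on LMUL.
    beats = -(-total_bits // vlen)
    return instructions, beats


for lmul in (1, 8, 16):
    instructions, beats = issue_cost(16 * 1024, lmul=lmul)
    print(f"LMUL={lmul}: {instructions} instructions, {beats} beats")
```

Under these assumptions the beat count is the same for every LMUL; only the number of instructions fetched and issued changes, which is the point being made about LMUL==16 versus LMUL==8.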
This effectively requires supporting LMUL=16, which has limited perf advantage as noted above. It also breaks the mask register layout when wanting to operate on the LMUL=16 results in a portable way wrt SLEN. |
With SLEN removed, and V0.9 register groups formats identical, perhaps this too can be revisited. #527 To me the only compelling argument for dropping limited LMUL=16 was SLEN complexities which are now removed. I agree that burdening all implementations with limited LMUL=16 support is undesirable. To have such usage supported by compilers and vtype-tracking assemblers, with compile time tags and warnings respectively, helps to provide the broad platform support that was intended/envisioned. |
Currently widening and narrowing operations fail in LMUL=8.
In 11.2. Widening Vector Arithmetic Instructions:
I propose we relax the LMUL constraint.
These widening and narrowing instructions would write to one of 2 register groups at 0 and 16, each using 16 base-arch registers.
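To make the register footprint concrete, here is a minimal Python sketch, assuming the existing rule that a register group's base must be a multiple of its group size is simply extended to a group size of 16. The helper names `valid_bases` and `register_group` are hypothetical, and a group size of 16 is the proposal under discussion, not part of the current spec.

```python
NUM_VREGS = 32  # v0..v31


def valid_bases(emul):
    """Base register numbers at which an aligned group of size emul can start."""
    return list(range(0, NUM_VREGS, emul))


def register_group(base, emul):
    """Registers written by a destination group of size emul starting at base.

    emul == 16 is hypothetical: it models the relaxed constraint proposed here.
    """
    if emul not in (1, 2, 4, 8, 16):
        raise ValueError(f"unsupported group size: {emul}")
    if base % emul != 0 or base + emul > NUM_VREGS:
        raise ValueError(f"v{base} is not a valid base for a group of {emul}")
    return list(range(base, base + emul))


print(valid_bases(8))   # bases available to an 8-register group
print(valid_bases(16))  # bases available to the proposed 16-register group
print(register_group(16, 16))
```

With a group size of 16 only two destination groups exist, and each widening instruction would write 16 registers rather than the current maximum of 8.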
Minimal systems could benefit most from this change, which would effectively double throughput on what are typically implementations that lack the resources to maximize efficiency through extensive buffering and the complexity of chaining.
A candidate example: