This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Allow widening and narrowing instructions to execute in LMUL=8 #397

Closed
David-Horner opened this issue Mar 19, 2020 · 9 comments

Comments

@David-Horner
Contributor

Currently, widening and narrowing operations fail at LMUL=8.

In 11.2. Widening Vector Arithmetic Instructions:

For all widening instructions, the destination element width must be a supported element width and the destination LMUL value must also be a supported LMUL value (≤8, i.e., current LMUL must be ≤4), otherwise an illegal instruction exception is raised.

I propose we relax the LMUL constraint.

These widening and narrowing instructions would write to one of two register groups, starting at v0 or v16, each using 16 base-architecture registers.

Minimal systems could benefit most from this change, which effectively doubles the throughput of implementations that typically lack the resources for extensive buffering and the complexity of chaining.

A candidate example:

   # x = sum( vector1[i] * vector2[i] )
   # If the whole vector fits in one register group, no loop is required.
   # Using vwmaccu.vv vd, vs1, vs2, vm  # vd[i] = (vs1[i] * vs2[i]) + vd[i]
   # v16..v31 are assumed zeroed before the loop.
    # the loop could start here:
vsetvli x7, x0, e16,m8     # SEW=16, LMUL=8
vle16.v v8, (a1)           # load vector1
vle16.v v0, (a2)           # load vector2
vwmaccu.vv v16, v0, v8     # widening accumulate into v16..v31 (effective LMUL=16)
    # loop housekeeping would go here
    #  outside the loop (if used):
vsetvli x0, x0, e32,m8     # SEW=32, LMUL=8
vredsum.vs v2, v16, v2     # reduce lower half (v2[0] pre-initialized to 0)
vredsum.vs v3, v24, v3     # reduce upper half (v3[0] pre-initialized to 0)
   # The sum of the scalars v2[0] and v3[0] is the answer.

@nick-knight
Contributor

nick-knight commented Mar 19, 2020

Can we also relax the related constraint (NF * LMUL <= 8) on segmented loads/stores?

Edit: As @David-Horner points out below, my proposal is tangential to his, adjacent only in that it (arguably) violates the same microarchitectural simplification that @aswaterman describes below.
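The constraint under discussion can be enumerated directly. A minimal sketch (Python used purely for the arithmetic; integer LMUL values only, fractional LMUL ignored) listing the (NF, LMUL) pairs that NF * LMUL <= 8 permits:

```python
# Enumerate (NF, LMUL) pairs legal under the NF * LMUL <= 8 constraint
# on segmented loads/stores. Sketch: integer LMUL only, NF in 1..8.
legal = [(nf, lmul)
         for nf in range(1, 9)
         for lmul in (1, 2, 4, 8)
         if nf * lmul <= 8]
print(legal)
```

Relaxing the constraint would admit pairs such as (2, 8), which today are excluded along with everything else whose product exceeds 8.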

@aswaterman
Collaborator

aswaterman commented Mar 19, 2020 via email

@David-Horner
Contributor Author

I appreciate your revised comment.

It's a microarchitectural simplification that at most 8 registers are written by an instruction. This matters more for the widening ops (where the 8 registers are in an aligned block).

The minimum implementation can play tricks, and the loads can be interleaved with two vwmaccu.vv instructions that replace the one. The interleaved version may even perform better if the loads overlap with the vwmaccu.vv.
So the benefit appears to outweigh the cost of handling groups of 16 for most designs.

I take no side on @nick-knight's comment, other than to say it appears to be tangential to my proposal. It has no direct benefit for groups of 16 in the proposed cases.

@solomatnikov

Essentially, you are proposing to allow LMUL==16.

This would not improve throughput - throughput will not change if max LMUL is increased from 8 to 16.

The only thing that would change is the number of vector instructions/loop iterations required to process long vector(s). Energy/power savings would be minimal.

@solomatnikov

Can we also relax the related constraint (NF * LMUL <= 8) on segmented loads/stores?

It's already complex and expensive to implement segmented loads/stores with NF * LMUL <= 8.

@David-Horner
Contributor Author

@solomatnikov

Essentially, you are proposing to allow LMUL==16.

This would not improve throughput - throughput will not change if max LMUL is increased from 8 to 16.

The only thing that would change is the number of vector instructions/loop iterations required to process long vector(s). Energy/power savings would be minimal.

I don't intend to debate which machines could improve performance with this limited LMUL=16 support.

Rather, my point is that your response implies performance is not improved by increasing LMUL.
Would this also hold if LMUL were limited to 4?
If so, what are the tradeoffs?
Do we have any empirical evidence that LMUL=8 is optimal, or that it varies by micro-arch?
If the sweet spot is less than LMUL=8, why are we advocating it?
Inertia?

@solomatnikov

I don't intend to debate which machines could improve performance with this limited LMUL=16 support.

Rather, my point is that your response implies performance is not improved by increasing LMUL.
Would this also hold if LMUL were limited to 4?
If so, what are the tradeoffs?
Do we have any empirical evidence that LMUL=8 is optimal, or that it varies by micro-arch?
If the sweet spot is less than LMUL=8, why are we advocating it?
Inertia?

You need to understand how a typical, reasonable vector implementation works (of course, the reasonable design space is very large, and the unreasonable one is even larger). For the sake of simplicity and brevity, I'll describe a simple example:

  • 2 vector pipelines: arithmetic and memory (scalar part might have its own independent pipeline)
  • each pipeline has 4 lanes
  • each lane is 32 bits wide
  • total datapath width is 128 bits

For simplicity let's assume VLEN==128.

With LMUL==1 each vector instruction will take 1 cycle/beat (of course, pipelines can have multiple stages but for simplicity I am assuming perfect pipelining) and process 128 bits of data.

With LMUL==8 each vector instruction will take 8 cycles/beats and process 8*128 bits of data, which means the processor has to fetch and issue 8 times fewer instructions to process the same amount of data. It also helps if the processor can issue only 1 instruction (vector or scalar) per cycle: with LMUL==8 both vector pipelines can be easily filled with useful work and scalar computations can be done in the background.

LMUL==16 would not improve anything significantly over LMUL==8.

Of course, in practice VLEN can be larger, e.g. VLEN==256 or VLEN==512, then LMUL==16 makes even less sense.
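The fetch-count argument above can be sketched numerically. A minimal sketch, assuming the VLEN=128 figure from the example machine; the 512 Kib working-set size is an arbitrary illustration, not from the thread:

```python
# Sketch of the instruction-fetch arithmetic above.
# Assumptions: VLEN = 128 (from the example machine); the 512 Kib
# working set is an arbitrary illustration.
VLEN = 128

def instructions_needed(total_bits, lmul):
    """Vector instructions fetched to process total_bits of data."""
    bits_per_instruction = VLEN * lmul
    return -(-total_bits // bits_per_instruction)  # ceiling division

for lmul in (1, 8, 16):
    print(f"LMUL={lmul:2}: {instructions_needed(512 * 1024, lmul)} instructions")
```

Going from LMUL=1 to LMUL=8 cuts the fetch count by 8x (4096 to 512 here), while going from 8 to 16 saves only a further 2x of an already small number, which is the diminishing return being described.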

@kasanovic
Collaborator

This effectively requires supporting LMUL=16, which has limited perf advantage as noted above. It also breaks the mask register layout when wanting to operate on the LMUL=16 results in a portable way wrt SLEN.

@David-Horner
Contributor Author

David-Horner commented Jul 11, 2020

With SLEN removed, and the v0.9 register group formats identical, perhaps this too can be revisited. #527

To me, the only compelling argument for dropping limited LMUL=16 support was the SLEN complexity, which is now removed.

I agree that burdening all implementations with limited LMUL=16 support is undesirable.
However, we now appear to have a base "V" that targets unix/application use.
Specific configurations for it (such as VLEN>=128) are irrelevant for other domains that may have a minimal Vminus implementation.
Just as VLEN=256/512 have diminishing returns for LMUL=16, VLEN=64 and VLEN=32 have increasing value for it.
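As a rough illustration of that scaling claim, a minimal sketch: SEW=16 and the standard RVV element-count relation elements = VLEN * LMUL / SEW are assumptions used for illustration, not figures from the thread.

```python
# Sketch: elements covered per widening-source instruction at SEW=16.
# Assumption: elements = VLEN * LMUL // SEW (standard RVV relation).
SEW = 16

def elements_per_op(vlen, lmul):
    return vlen * lmul // SEW

for vlen in (32, 64, 128, 512):
    print(f"VLEN={vlen:3}: LMUL=8 -> {elements_per_op(vlen, 8):3} elems, "
          f"LMUL=16 -> {elements_per_op(vlen, 16):3} elems")
```

At VLEN=32, LMUL=16 is what brings a single instruction up to 32 elements, halving stripmine iterations on exactly the machines with the least issue bandwidth; at VLEN=512, LMUL=8 already covers 256 elements per instruction, so the extra factor buys little.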

Having such usage supported by compilers and vtype-tracking assemblers, with compile-time tags and warnings respectively, helps provide the broad platform support that was intended/envisioned.
