This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Allow widening and narrowing instructions to execute in LMUL=8 #397

Closed
David-Horner opened this issue Mar 19, 2020 · 9 comments

Comments

@David-Horner
Contributor

Currently, widening and narrowing operations fail at LMUL=8.

In 11.2. Widening Vector Arithmetic Instructions:

For all widening instructions, the destination element width must be a supported element width and the destination LMUL value must also be a supported LMUL value (≤8, i.e., current LMUL must be ≤4), otherwise an illegal instruction exception is raised.

I propose we relax the LMUL constraint.

These widening and narrowing instructions would write to one of two register groups, starting at v0 or v16, each using 16 base-architecture registers.

Minimal systems could benefit most from this change, which effectively doubles the throughput of implementations that typically lack the resources for extensive buffering and the complexity of chaining.

A candidate example:

   # x = sum( vector1[i] * vector2[i] )
   # If the whole vector fits in one register group, no loop is required.
   # Using vwmaccu.vv vd, vs1, vs2, vm  # vd[i] = (vs1[i] * vs2[i]) + vd[i]
   # v16..v31 are assumed zeroed before the loop.
    # the loop could start here:
vsetvli x7, x0, e16,m8     # SEW=16, LMUL=8
vle16.v v8, (a1)           # load vector1
vle16.v v0, (a2)           # load vector2
vwmaccu.vv v16, v0, v8     # widening accumulate into v16..v31 (effective LMUL=16)
    # loop housekeeping would go here
    #  outside the loop (if used):
vsetvli x0, x0, e32,m8     # SEW=32, LMUL=8
vredsum.vs v2, v16, v2     # reduce lower half (v2[0] pre-initialized to 0)
vredsum.vs v3, v24, v3     # reduce upper half (v3[0] pre-initialized to 0)
   # The sum of the scalars v2[0] and v3[0] is the answer.

@nick-knight
Contributor

nick-knight commented Mar 19, 2020

Can we also relax the related constraint (NF * LMUL <= 8) on segmented loads/stores?

Edit: As @David-Horner points out below, my proposal is tangential to his, adjacent only in that it (arguably) violates the same microarchitectural simplification that @aswaterman describes below.
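The constraint under discussion can be enumerated directly. A minimal sketch (Python used purely for the arithmetic; integer LMUL values only, fractional LMUL ignored) listing the (NF, LMUL) pairs that NF * LMUL <= 8 permits:

```python
# Enumerate (NF, LMUL) pairs legal under the NF * LMUL <= 8 constraint
# on segmented loads/stores. Sketch: integer LMUL only, NF in 1..8.
legal = [(nf, lmul)
         for nf in range(1, 9)
         for lmul in (1, 2, 4, 8)
         if nf * lmul <= 8]
print(legal)
```

Relaxing the constraint would admit pairs such as (2, 8), which today are excluded along with everything else whose product exceeds 8.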

@aswaterman
Collaborator

aswaterman commented Mar 19, 2020 via email

@David-Horner
Contributor Author

I appreciate your revised comment.

It's a microarchitectural simplification that at most 8 registers are written by an instruction. This matters more for the widening ops (where the 8 registers are in an aligned block).

The minimum implementation can play tricks, and the loads can be interleaved with two vwmaccu.vv instructions that replace the one. The interleaved version may even perform better if the loads overlap with the vwmaccu.vv.
So the benefit appears to outweigh the cost of handling groups of 16 for most designs.

I take no side on @nick-knight's comment, other than to say it appears to be tangential to my proposal. It has no direct benefit for groups of 16 in the proposed cases.

@solomatnikov

Essentially, you are proposing to allow LMUL==16.

This would not improve throughput - throughput will not change if max LMUL is increased from 8 to 16.

The only thing that would change is the number of vector instructions/loop iterations required to process long vector(s). Energy/power savings would be minimal.

@solomatnikov

Can we also relax the related constraint (NF * LMUL <= 8) on segmented loads/stores?

It's already complex and expensive to implement segmented loads/stores with NF * LMUL <= 8.

@David-Horner
Contributor Author

@solomatnikov

Essentially, you are proposing to allow LMUL==16.

This would not improve throughput - throughput will not change if max LMUL is increased from 8 to 16.

The only thing that would change is the number of vector instructions/loop iterations required to process long vector(s). Energy/power savings would be minimal.

I don't intend to debate which machines could improve performance with this limited LMUL=16 support.

Rather, my point is that your response implies performance is not improved by increasing LMUL.
Would this also hold if LMUL were limited to 4?
If so, what are the tradeoffs?
Do we have any empirical evidence that LMUL=8 is optimal, or that it varies by micro-arch?
If the sweet spot is less than LMUL=8, why are we advocating it?
Inertia?

@solomatnikov

I don't intend to debate which machines could improve performance with this limited LMUL=16 support.

Rather, my point is that your response implies performance is not improved by increasing LMUL.
Would this also hold if LMUL were limited to 4?
If so, what are the tradeoffs?
Do we have any empirical evidence that LMUL=8 is optimal, or that it varies by micro-arch?
If the sweet spot is less than LMUL=8, why are we advocating it?
Inertia?

You need to understand how a typical, reasonable vector implementation works (of course, the reasonable design space is very large, and the unreasonable one is even larger). For the sake of simplicity and brevity, I'll describe a simple example:

  • 2 vector pipelines: arithmetic and memory (scalar part might have its own independent pipeline)
  • each pipeline has 4 lanes
  • each lane is 32 bits wide
  • total datapath width is 128 bits

For simplicity let's assume VLEN==128.

With LMUL==1 each vector instruction will take 1 cycle/beat (of course, pipelines can have multiple stages but for simplicity I am assuming perfect pipelining) and process 128 bits of data.

With LMUL==8 each vector instruction will take 8 cycles/beats and process 8*128 bits of data, which means the processor has to fetch and issue 8 times fewer instructions to process the same amount of data. It also helps if the processor can issue only 1 instruction (vector or scalar) per cycle: with LMUL==8 both vector pipelines can be easily filled with useful work and scalar computations can be done in the background.

LMUL==16 would not improve anything significantly over LMUL==8.

Of course, in practice VLEN can be larger, e.g. VLEN==256 or VLEN==512, then LMUL==16 makes even less sense.
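The fetch-count argument above can be sketched numerically. A minimal sketch, assuming the VLEN=128 figure from the example machine; the 512 Kib working-set size is an arbitrary illustration, not from the thread:

```python
# Sketch of the instruction-fetch arithmetic above.
# Assumptions: VLEN = 128 (from the example machine); the 512 Kib
# working set is an arbitrary illustration.
VLEN = 128

def instructions_needed(total_bits, lmul):
    """Vector instructions fetched to process total_bits of data."""
    bits_per_instruction = VLEN * lmul
    return -(-total_bits // bits_per_instruction)  # ceiling division

for lmul in (1, 8, 16):
    print(f"LMUL={lmul:2}: {instructions_needed(512 * 1024, lmul)} instructions")
```

Going from LMUL=1 to LMUL=8 cuts the fetch count by 8x (4096 to 512 here), while going from 8 to 16 saves only a further 2x of an already small number, which is the diminishing return being described.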

@kasanovic
Collaborator

This effectively requires supporting LMUL=16, which has limited perf advantage as noted above. It also breaks the mask register layout when wanting to operate on the LMUL=16 results in a portable way wrt SLEN.

@David-Horner
Contributor Author

David-Horner commented Jul 11, 2020

With SLEN removed, and the v0.9 register group formats identical, perhaps this too can be revisited. #527

To me, the only compelling argument for dropping limited LMUL=16 support was the SLEN complexity, which is now removed.

I agree that burdening all implementations with limited LMUL=16 support is undesirable.
However, we now appear to have a base "V" that targets unix/application use.
Specific configurations for it (such as VLEN>=128) are irrelevant for other domains that may have a minimal Vminus implementation.
Just as VLEN=256/512 have diminishing returns for LMUL=16, VLEN=64 and VLEN=32 have increasing value for it.
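As a rough illustration of that scaling claim, a minimal sketch: SEW=16 and the standard RVV element-count relation elements = VLEN * LMUL / SEW are assumptions used for illustration, not figures from the thread.

```python
# Sketch: elements covered per widening-source instruction at SEW=16.
# Assumption: elements = VLEN * LMUL // SEW (standard RVV relation).
SEW = 16

def elements_per_op(vlen, lmul):
    return vlen * lmul // SEW

for vlen in (32, 64, 128, 512):
    print(f"VLEN={vlen:3}: LMUL=8 -> {elements_per_op(vlen, 8):3} elems, "
          f"LMUL=16 -> {elements_per_op(vlen, 16):3} elems")
```

At VLEN=32, LMUL=16 is what brings a single instruction up to 32 elements, halving stripmine iterations on exactly the machines with the least issue bandwidth; at VLEN=512, LMUL=8 already covers 256 elements per instruction, so the extra factor buys little.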

Having such usage supported by compilers and vtype-tracking assemblers, with compile-time tags and warnings respectively, helps provide the broad platform support that was intended/envisioned.
