Warning
|
The EDIV extension is currently not planned to be part of the base "V" extension. It will change substantially from the current sketch and could be dropped completely. Do not use this specification for any design work. |
Note
|
This section has not been updated to account for the new mask format in v0.9. |
The divided element extension allows each element to be treated as a packed sub-vector of narrower elements. This provides efficient support for some forms of narrow-width and mixed-width arithmetic, and also allows outer-loop vectorization of short vector and matrix operations. In addition to modifying the behavior of some existing instructions, a few new instructions are provided to operate on vectors when EDIV > 1.
The divided element extension adds a two-bit field, vediv[1:0], to the vtype register. The vediv field encodes the number of ways, EDIV, into which each SEW-bit element is subdivided into equal sub-elements. A vector register group is now considered to hold a vector of sub-vectors.
vediv[1] | vediv[0] | EDIV | Division |
---|---|---|---|
0 | 0 | 1 | (undivided, as in base) |
0 | 1 | 2 | two equal sub-elements |
1 | 0 | 4 | four equal sub-elements |
1 | 1 | 8 | eight equal sub-elements |
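The encoding in the table is a power of two, so EDIV = 2^vediv. The following Python sketch shows the mapping in both directions. The function names are hypothetical and chosen for illustration only; they are not part of the specification.

```python
def ediv_from_vediv(vediv):
    """Return EDIV for a two-bit vediv field value (0..3)."""
    if not 0 <= vediv <= 3:
        raise ValueError("vediv is a two-bit field")
    return 1 << vediv


def vediv_from_ediv(ediv):
    """Return the vediv encoding for EDIV in {1, 2, 4, 8}."""
    if ediv not in (1, 2, 4, 8):
        raise ValueError("EDIV must be 1, 2, 4, or 8")
    return ediv.bit_length() - 1


def sub_element_width(sew, vediv):
    """Width in bits of each sub-element: SEW / EDIV."""
    return sew // ediv_from_vediv(vediv)
```

For example, `sub_element_width(32, 0b10)` gives 8, matching a 32-bit element divided four ways.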
The assembly syntax for vsetvli has additional options added to encode the EDIV options.
    d1   # EDIV 1, assumed if d setting absent
    d2   # EDIV 2
    d4   # EDIV 4
    d8   # EDIV 8

    vsetvli t0, a0, e32, m2, d4   # SEW=32, LMUL=2, EDIV=4
SEW | EDIV | Sub-element | Integer accumulator (sum) | Integer accumulator (dot) | FP sum/dot accumulator (FLEN=32) | FP sum/dot accumulator (FLEN=64) | FP sum/dot accumulator (FLEN=128) |
---|---|---|---|---|---|---|---|
8b | 2 | 4b | 8b | 8b | - | - | - |
8b | 4 | 2b | 8b | 8b | - | - | - |
8b | 8 | 1b | 8b | 8b | - | - | - |
16b | 2 | 8b | 16b | 16b | - | - | - |
16b | 4 | 4b | 8b | 16b | - | - | - |
16b | 8 | 2b | 8b | 8b | - | - | - |
32b | 2 | 16b | 32b | 32b | 32b | 32b | 32b |
32b | 4 | 8b | 16b | 32b | - | - | - |
32b | 8 | 4b | 8b | 16b | - | - | - |
64b | 2 | 32b | 64b | 64b | 32b | 64b | 64b |
64b | 4 | 16b | 32b | 64b | 32b | 32b | 32b |
64b | 8 | 8b | 16b | 32b | - | - | - |
128b | 2 | 64b | 128b | 128b | 32b | 64b | 128b |
128b | 4 | 32b | 64b | 128b | 32b | 64b | 64b |
128b | 8 | 16b | 32b | 64b | 32b | 32b | 32b |
256b | 2 | 128b | 256b | 256b | 32b | 64b | 128b |
256b | 4 | 64b | 128b | 256b | 32b | 64b | 128b |
256b | 8 | 32b | 64b | 128b | 32b | 64b | 64b |
Each implementation defines a minimum size for a sub-element, SELEN, which must be at most 8 bits.
Note
|
While SELEN is a fourth implementation-specific parameter, values smaller than 8 would be considered an additional extension. |
The vector start register vstart and exception reporting continue to work as before.
The vector length vl control continues to operate at the element level.
Vector masking also continues to operate at the element level, so sub-elements cannot be individually masked.
Note
|
SEW can be changed dynamically to enable per-element masking for sub-elements of 8 bits and greater. |
Vector load/store and AMO instructions are unaffected by EDIV, and continue to move whole elements.
Vector mask logical operations are unchanged by the EDIV setting, and continue to operate on vector registers containing element masks.
Vector mask population count (vpopc), find-first and related instructions (vfirst, vmsbf, vmsif, vmsof), iota (viota), and element index (vid) instructions are unaffected by EDIV.
Vector integer bit insert/extract, and integer and floating-point scalar move instructions are unaffected by EDIV.
Vector slide-up/slide-down are unaffected by EDIV.
Vector compress instructions are unaffected by EDIV.
Most vector arithmetic operations are modified to operate on the individual sub-elements, so effective SEW is SEW/EDIV and effective vector length is vl * EDIV. For example, a vector add of 32-bit elements with a vl of 5 and EDIV of 4 operates identically to a vector add of 8-bit elements with a vector length of 20.
    vsetvli t0, a0, e32, m1, d4   # Vectors of 32-bit elements, divided into byte sub-elements
    vadd.vv v1, v2, v3            # Performs a vector of 4*vl 8-bit additions.
    vsll.vx v1, v2, x1            # Performs a vector of 4*vl 8-bit shifts.
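As an illustration of the sub-element behavior described above, the following Python sketch models vadd.vv on a list of SEW-bit integers. It is a behavioral model only, and all function names are hypothetical. It assumes sub-element 0 occupies the least-significant bits of the element and that each addition wraps at the sub-element width, so no carry crosses a sub-element boundary.

```python
def split(element, sew, ediv):
    """Split one SEW-bit element into EDIV sub-elements, least-significant first."""
    width = sew // ediv
    mask = (1 << width) - 1
    return [(element >> (k * width)) & mask for k in range(ediv)]


def join(subs, sew, ediv):
    """Pack sub-elements (least-significant first) back into one SEW-bit element."""
    width = sew // ediv
    mask = (1 << width) - 1
    element = 0
    for k, sub in enumerate(subs):
        element |= (sub & mask) << (k * width)
    return element


def vadd_vv(vs2, vs1, sew=32, ediv=4):
    """Model of vadd.vv: add corresponding sub-elements, wrapping at SEW/EDIV bits."""
    result = []
    for a, b in zip(vs2, vs1):
        sums = [x + y for x, y in zip(split(a, sew, ediv), split(b, sew, ediv))]
        result.append(join(sums, sew, ediv))  # join() masks each sum to the sub-element width
    return result
```

With EDIV=4, `vadd_vv([0x01FF02FE], [0x01010101])` returns `[0x020003FF]`: the byte 0xFF wraps to 0x00 without carrying into its neighbor. With `ediv=1` the same inputs give `[0x030003FF]`, the ordinary 32-bit sum.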
For EDIV > 1, vadc, vmadc, vsbc, vmsbc are reserved.
Vector single-width integer sum reduction instructions are reserved under EDIV>1. Other vector single-width reductions and vector widening integer sum reduction instructions now operate independently on all elements in a vector, reducing sub-element values within an element to an element-wide result.
The scalar input is taken from the least-significant bits of the second operand, with the number of bits equal to the number of significant result bits (i.e., for sum and dot reductions, the number of bits is given in the table above; for non-sum and non-dot reductions, it is equal to the element size).
    # Sum each sub-vector of four bytes into a 16-bit result.
    vsetvli t0, a0, e32, d4  # Vectors of 32-bit elements, divided into byte sub-elements
    vwredsum.vs v1, v2, v3   # v1[i][15:0] = v2[i][31:24] + v2[i][23:16]
                             #               + v2[i][15:8] + v2[i][7:0] + v3[i][15:0]

    # Find maximum among sub-elements
    vredmax.vs v5, v6, v7    # v5[i][7:0] = max(v6[i][31:24], v6[i][23:16],
                             #                  v6[i][15:8], v6[i][7:0], v7[i][7:0])
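The two reductions in the listing above can be modeled per element as follows. This is a sketch under stated assumptions, not a reference implementation: the function names are hypothetical, both operations are treated as signed, the scalar input is read from the low result-width bits of the second operand as in the listing, and the narrow result is sign-extended to SEW.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def sub_elements(element, sew, ediv):
    """Signed sub-elements of one SEW-bit element, least-significant first."""
    width = sew // ediv
    return [to_signed(element >> (k * width), width) for k in range(ediv)]


def vwredsum_vs(vs2, vs1, sew=32, ediv=4):
    """Per-element widening sum of sub-elements plus the scalar input from vs1."""
    acc_w = max(8, min(sew, 2 * sew // ediv))  # assumed result width (16 for SEW=32, EDIV=4)
    out = []
    for x, s in zip(vs2, vs1):
        total = to_signed(s, acc_w) + sum(sub_elements(x, sew, ediv))
        # Wrap to acc_w bits, then sign-extend to the full SEW-bit pattern.
        out.append(to_signed(total, acc_w) & ((1 << sew) - 1))
    return out


def vredmax_vs(vs2, vs1, sew=32, ediv=4):
    """Per-element signed maximum over the sub-elements and the scalar input."""
    res_w = max(8, sew // ediv)  # assumed result width (8 for SEW=32, EDIV=4)
    out = []
    for x, s in zip(vs2, vs1):
        best = max([to_signed(s, res_w)] + sub_elements(x, sew, ediv))
        out.append(best & ((1 << sew) - 1))
    return out
```

For example, `vwredsum_vs([0x01020304], [0x000A])` sums 1 + 2 + 3 + 4 + 10 and returns `[20]`.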
Integer sub-element non-sum reductions produce a final result that is max(8,SEW/EDIV) bits wide, sign- or zero-extended to full SEW if necessary.
Integer sub-element widening sum reductions produce a final result that is max(8,min(SEW,2*SEW/EDIV)) bits wide, sign- or zero-extended to full SEW if necessary.
Single-width floating-point reductions produce a final result that is SEW/EDIV bits wide.
Widening floating-point sum reductions produce a final result that is min(2*SEW/EDIV,FLEN) bits wide, NaN-boxed to the full SEW width if necessary.
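The four width rules above can be written directly as small functions. The Python sketch below uses hypothetical names and only evaluates the formulas; it does not encode which SEW/EDIV combinations actually have a floating-point format (the "-" entries in the accumulator table).

```python
def int_nonsum_width(sew, ediv):
    """Result width of integer sub-element non-sum reductions."""
    return max(8, sew // ediv)


def int_widening_sum_width(sew, ediv):
    """Result width of integer sub-element widening sum reductions."""
    return max(8, min(sew, 2 * sew // ediv))


def fp_single_width(sew, ediv):
    """Result width of single-width floating-point reductions."""
    return sew // ediv


def fp_widening_sum_width(sew, ediv, flen):
    """Result width of widening floating-point sum reductions."""
    return min(2 * sew // ediv, flen)
```

These reproduce the table entries: for SEW=32 and EDIV=4 the integer sum accumulator is 16 bits, and for SEW=128, EDIV=4 the floating-point accumulator is 32, 64 and 64 bits for FLEN of 32, 64 and 128 respectively.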
Vector register gather instructions under non-zero vediv (EDIV > 1) only gather sub-elements within the element. The source and index values are interpreted as relative to the enclosing element only. Index values ≥ EDIV write a zero value into the result sub-element.
    SEW = 32b, EDIV=4

    7 6 5 4 3 2 1 0   bytes
    d e a d b e e f   v1
    0 1 9 2 0 2 3 2   v2

    vrgather.vv v3, v1, v2
    d a 0 e f e b e   v3

    vrgather.vi v4, v1, 1
    a a a a e e e e   v4
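The example above can be reproduced with a small Python model. This is a sketch with hypothetical function names; it assumes sub-element 0 is the least-significant part of each element and stores each single hex digit of the figure as one byte value.

```python
def vrgather_vv(vs2, vs1, sew=32, ediv=4):
    """Gather sub-elements within each element; indices come from vs1."""
    width = sew // ediv
    mask = (1 << width) - 1
    out = []
    for src, idx in zip(vs2, vs1):
        result = 0
        for k in range(ediv):
            index = (idx >> (k * width)) & mask
            # Indices >= EDIV write zero into the result sub-element.
            value = (src >> (index * width)) & mask if index < ediv else 0
            result |= value << (k * width)
        out.append(result)
    return out


def vrgather_vi(vs2, imm, sew=32, ediv=4):
    """Splat sub-element `imm` of each element across that element."""
    width = sew // ediv
    index_word = sum(imm << (k * width) for k in range(ediv))
    return vrgather_vv(vs2, [index_word] * len(vs2), sew, ediv)


# Element 0 holds bytes 3..0 of the figure, element 1 holds bytes 7..4.
v1 = [0x0B0E0E0F, 0x0D0E0A0D]
v2 = [0x00020302, 0x00010902]
v3 = vrgather_vv(v1, v2)  # [0x0F0E0B0E, 0x0D0A000E]
v4 = vrgather_vi(v1, 1)   # [0x0E0E0E0E, 0x0A0A0A0A]
```

Reading the results from the most-significant byte down gives "d a 0 e f e b e" for v3 and "a a a a e e e e" for v4, as in the figure. The index 9 in byte 5 is out of range and produces the zero.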
Note
|
Vector register gathers with scalar or immediate arguments can "splat" values across sub-elements within an element. |
Note
|
Implementations can provide fast implementations of register gathers constrained within a single element width. |
The integer dot-product reduction vdot.vv performs an element-wise multiplication between the source sub-elements, then accumulates the results into the destination vector element. Note the assembler syntax uses a .vv suffix since both inputs are vectors of elements.
Sub-element integer dot reductions produce a final result that is max(8,min(SEW,4*SEW/EDIV)) bits wide, sign- or zero-extended to full SEW if necessary.
    # Unsigned dot-product
    vdotu.vv vd, vs2, vs1, vm   # Vector-vector

    # Signed dot-product
    vdot.vv vd, vs2, vs1, vm    # Vector-vector
    # Dot product, SEW=32, EDIV=1
    vdot.vv vd, vs2, vs1, vm  # vd[i][31:0] += vs2[i][31:0] * vs1[i][31:0]

    # Dot product, SEW=32, EDIV=2
    vdot.vv vd, vs2, vs1, vm  # vd[i][31:0] += vs2[i][31:16] * vs1[i][31:16]
                              #              + vs2[i][15:0] * vs1[i][15:0]

    # Dot product, SEW=32, EDIV=4
    vdot.vv vd, vs2, vs1, vm  # vd[i][31:0] += vs2[i][31:24] * vs1[i][31:24]
                              #              + vs2[i][23:16] * vs1[i][23:16]
                              #              + vs2[i][15:8] * vs1[i][15:8]
                              #              + vs2[i][7:0] * vs1[i][7:0]
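The integer dot-product listings can be modeled per element in Python as below. This is a hedged sketch with hypothetical names: it assumes the accumulator width max(8,min(SEW,4*SEW/EDIV)), wrap-around on overflow, and sign extension (vdot.vv) or zero extension (vdotu.vv) of the result to SEW. Masking by vm is not modeled.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def vdot_vv(vd, vs2, vs1, sew=32, ediv=4, signed=True):
    """Model of vdot.vv (signed=True) or vdotu.vv (signed=False), one result per element."""
    sub_w = sew // ediv
    acc_w = max(8, min(sew, 4 * sew // ediv))

    def conv(value, width):
        return to_signed(value, width) if signed else value & ((1 << width) - 1)

    out = []
    for d, a, b in zip(vd, vs2, vs1):
        total = conv(d, acc_w)
        for k in range(ediv):
            total += conv(a >> (k * sub_w), sub_w) * conv(b >> (k * sub_w), sub_w)
        # Wrap to the accumulator width, then extend to the SEW-bit pattern.
        out.append(conv(total, acc_w) & ((1 << sew) - 1))
    return out
```

For example, `vdot_vv([10], [0x01020304], [0x05060708])` returns `[80]`, which is 10 + 1*5 + 2*6 + 3*7 + 4*8.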
The floating-point dot-product reduction vfdot.vv performs an element-wise multiplication between the source sub-elements, then accumulates the results into the destination vector element. Note the assembler syntax uses a .vv suffix since both inputs are vectors of elements.
    # Floating-point dot-product
    vfdot.vv vd, vs2, vs1, vm   # Vector-vector
    # Dot product, SEW=32, EDIV=2
    vfdot.vv vd, vs2, vs1, vm  # vd[i][31:0] += vs2[i][31:16] * vs1[i][31:16]
                               #              + vs2[i][15:0] * vs1[i][15:0]

    # Floating-point sub-vectors of two half-precision floats packed into 32-bit elements.
    vsetvli t0, a0, e32, m1, d2  # Vectors of 32-bit elements, divided into 16b sub-elements
    vfdot.vv v1, v2, v3  # v1[i][31:0] += v2[i][31:16]*v3[i][31:16] + v2[i][15:0]*v3[i][15:0]

    # Floating-point sub-vectors of four half-precision floats packed into 64-bit elements.
    vsetvli t0, a0, e64, m1, d4  # Vectors of 64-bit elements, divided into 16b sub-elements
    vfdot.vv v1, v2, v3  # v1[i][31:0] += v2[i][31:16]*v3[i][31:16] + v2[i][15:0]*v3[i][15:0] +
                         #                v2[i][63:48]*v3[i][63:48] + v2[i][47:32]*v3[i][47:32];
                         # v1[i][63:32] = ~0 (NaN boxing)
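The half-precision examples above can be approximated with the following Python model. It is a sketch under several assumptions that the text does not fix: the function names are hypothetical, only 16-bit sub-elements with a 32-bit accumulator are handled (as in both listings), products and sums are formed in double precision and rounded once to single precision, and overflow and exception flags are not modeled. The actual rounding and summation order would be defined by the specification.

```python
import struct


def half_to_float(bits):
    """Decode the low 16 bits as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def f32_bits_to_float(bits):
    """Decode the low 32 bits as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def float_to_f32_bits(value):
    """Round a Python float to single precision and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def vfdot_vv(vd, vs2, vs1, sew=32, ediv=2):
    """Model of vfdot.vv for half-precision sub-elements and a 32-bit accumulator."""
    if sew // ediv != 16:
        raise ValueError("this sketch only handles 16-bit sub-elements")
    nan_box = ((1 << sew) - 1) & ~0xFFFFFFFF  # all ones above bit 31, zero when SEW=32
    out = []
    for d, a, b in zip(vd, vs2, vs1):
        total = f32_bits_to_float(d)
        for k in range(ediv):
            total += half_to_float(a >> (16 * k)) * half_to_float(b >> (16 * k))
        out.append(nan_box | float_to_f32_bits(total))
    return out
```

With SEW=64 and EDIV=4, four products of 1.0 * 1.0 accumulate to 4.0, and the 32-bit result is NaN-boxed: `vfdot_vv([0xFFFFFFFF00000000], [0x3C003C003C003C00], [0x3C003C003C003C00], sew=64, ediv=4)` returns `[0xFFFFFFFF40800000]`.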