NeUnordered
Loadn: like Gather*, but for strides 2..4 use NEON ld2..ld4.
LoadnPair: Gather with optimizations, in particular for 2x64-bit, using 128-bit loads plus Combine. Also StorePair.
Lookup128 for 32x 32-bit and 16x 64-bit: permutex2var on AVX-512, else Gather. Also Lookup64 and Lookup32.
ReduceMin/MaxOrNaN
Document Reduce/Min NaN behavior
_mm512_getmant, _mm512_scalef, _mm512_getexp (f32/f64)
High-precision! Consider copying from SLEEF. See #1650.
cbrt, cosh, erf, fmod, ilogb, lgamma, logb, modf, nextafter, nexttoward, pow, scalbn, tan, tgamma
- Min/MaxValue
- IndexOfMin/Max
- AllOf / AnyOf / NoneOf
- Count(If) (https://en.algorithmica.org/hpc/simd/masking/)
- EqualSpan
- ReverseSpan
- ShuffleSpan
- IsSorted
- Reduce
Port https://github.com/richgel999/sserangecoding to Highway (~50 instructions).
Port https://github.com/SnellerInc/sneller/tree/master/ion/zion/iguana (Go+assembly) to Highway.
Equivalent to Not(FirstN()); would replace several instances. Maps to WHILEGE on SVE.
For crypto. Native on Icelake+.
- Use new mask<->vec cast instruction, possibly for OddEven, ExpandLoad
rgather_vx for broadcasting the redsum result?
- SVE2.1: TBLQ for TableLookupBytes
- SVE2: use XAR for RotateRight
- CombineShiftRightBytes: use TableLookupLanes instead?
- Shuffle*: use TableLookupLanes instead?
- Use SME once available: DUP predicate, REVD (rotate 128-bit elements by 64), SCLAMP/UCLAMP, 128-bit TRN/UZP/ZIP (also in F64MM)
#pragma unroll(1) in all loops to enable autovectorization.
Reuse the same wasm256 file, with #if for wasm-specific parts. Use the reserved avx slot.
For hash tables. Use VPCONFLICT on ZEN4.
Avoids having to add an offset on RVV. The table must come from LoadDup128.
For SVE (svld1sb_u32)+WASM? Compiler can probably already fuse.
- Signbit
- ConvertF64<->I32 (math-inl)
- Copysign (math)
- CopySignToAbs (math)
- Neg
- Compress
- Mask ops (math)
- RebindMask
- Not
- FP16 conversions
- Scatter
- Gather
- Pause
- Abs i64
- FirstN
- Compare i64
- AESRound
- CLMul (GCM)
- TableLookupBytesOr0 (AES)
- FindFirstTrue (strlen)
- NE
- Combine partial
- LoadMaskBits (FirstN)
- MaskedLoad
- Bf16 promote2
- ConcatOdd/Even
- SwapAdjacentBlocks
- OddEvenBlocks
- CompressBlendedStore
- RotateRight (Reverse2 i16)
- Compare128
- OrAnd
- IfNegativeThenElse
- MulFixedPoint15 (codec)
- Insert/ExtractLane
- IsNan
- IsFinite
- StoreInterleaved
- LoadInterleaved (codec)
- Or3/Xor3
- NotXor (sort)
- FindKnownFirstTrue (sort)
- CompressStore 8-bit
- ExpandLoad (hash)
- Zen4 target (sort)
- SSE2 target - by johnplatts
- AbsDiff int - by johnplatts
- Le integer - by johnplatts
- LeadingZeroCount - by johnplatts in #1276
- 8-bit Mul
- (Neg)MulAdd for integer
- AESRoundInv etc - by johnplatts in #1286
- OddEven for <64-bit lanes: use Set of wider constant 0_1
- Shl for 8-bit
- Shr for 8-bit
- Faster Reverse2 16-bit
- Add Reverse2 for 8-bit - by johnplatts in #1303
- TwoTablesLookupLanes: Add 8/16-bit - by johnplatts in #1303
- TableLookupLanes - by johnplatts in #1308
- FindLastTrue
- Vec2, Create/Get functions - by johnplatts in #1387
- PromoteTo for all types (#915)
- atan2
- Slide1Up/Down - by johnplatts in #1496
- MaxOfLanes, MinOfLanes returning scalar: Add - by johnplatts in #1431
- DupEven for 16-bit
- AVX3_SPR target
- MaskedGather returns zero for mask=false
- GatherIndexN/ScatterIndexN
- MaskedScatter
- float64 support for WASM
- LoadNOr
- PromoteEvenTo - by johnplatts
- Masked add/sub/div
- ReduceMin/Max like ReduceSum, in addition to Min/MaxOfLanes
- Reductions for 8-bit
- RVV: Fix remaining 8-bit table lookups for large vectors
- QuickSelect algo - by enum-class
- New tuple interface for segment load/store
- Div (integer division) and Mod - by johnplatts
- AddSub and MulAddSub - by johnplatts
- hypot - by johnplatts
- exp2 - by johnplatts