Kernel Parameters

Alex Brown edited this page Oct 20, 2022 · 6 revisions

Solution / Kernel Parameters

  • LoopDoWhile: True=DoWhile loop, False=While or For loop
  • LoopTail: Additional loop with LoopUnroll=1.
  • EdgeType: Branch, ShiftPtr or None
  • WorkGroup: [dim0, dim1, LocalSplitU]
  • ThreadTile: [dim0, dim1]
  • MatrixInstruction: Type of matrix instruction used for the calculation, and wave tiling parameters [InstructionM, InstructionN, InstructionK, InstructionB, BlocksInMDir, WaveTileM, WaveTileN, WaveGroupM, WaveGroupN]
  • GlobalSplitU: Split up summation among work-groups to create more concurrency. This option launches a kernel to handle the beta scaling, then a second kernel where the writes to global memory are atomic.
  • PrefetchGlobalRead: True means outer loop should prefetch global data one iteration ahead.
  • PrefetchLocalRead: True means inner loop should prefetch LDS data one iteration ahead.
  • WorkGroupMapping: In what order will work-groups compute C; affects caching.
  • LoopUnroll: How many iterations to unroll inner loop; helps loading coalesced memory.
  • MacroTile: Derived from WorkGroup*ThreadTile.
  • DepthU: Derived from LoopUnroll*SplitU.
  • NumLoadsCoalescedA,B: Number of loads from A (or B) in the coalesced dimension.
  • GlobalReadCoalesceGroupA,B: True means adjacent threads map to adjacent global read elements (but, if transposing data, then the write to LDS is scattered).
  • GlobalReadCoalesceVectorA,B: True means vector components map to adjacent global read elements (but, if transposing data, then the write to LDS is scattered).
  • VectorWidth: Thread tile elements are contiguous for faster memory accesses. For example VW=4 means a thread will read a float4 from memory rather than 4 non-contiguous floats.
  • KernelLanguage: Whether kernels should be written in source code (HIP, OpenCL) or assembly (gfx803, gfx900, ...).
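
The two derived parameters in the list above (MacroTile and DepthU) can be illustrated with a short sketch. It assumes that "SplitU" in the DepthU formula refers to LocalSplitU, the third WorkGroup entry; the function name and the returned dictionary are hypothetical and are not part of Tensile.

```python
def derive_tile_parameters(work_group, thread_tile, loop_unroll):
    """Compute MacroTile and DepthU from the parameters they derive from.

    work_group  -- [dim0, dim1, LocalSplitU]
    thread_tile -- [dim0, dim1]
    loop_unroll -- LoopUnroll
    """
    wg0, wg1, local_split_u = work_group
    tt0, tt1 = thread_tile
    # MacroTile = WorkGroup * ThreadTile, per dimension.
    macro_tile = (wg0 * tt0, wg1 * tt1)
    # Assumption: "SplitU" is taken to mean LocalSplitU.
    depth_u = loop_unroll * local_split_u
    return {"MacroTile": macro_tile, "DepthU": depth_u}


params = derive_tile_parameters([16, 16, 1], [4, 4], loop_unroll=8)
print(params)  # {'MacroTile': (64, 64), 'DepthU': 8}
```

With WorkGroup [16, 16, 1] and ThreadTile [4, 4], each work-group covers a 64x64 tile of C.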

The exhaustive list of solution parameters and their defaults is stored in

Kernel Parameters Affect Performance

The kernel parameters affect many aspects of performance. Changing a parameter may help address one performance bottleneck but worsen another. That is why searching through the parameter space is vital to discovering the fastest kernel for a given problem.
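
To give a feel for what searching the parameter space involves, the sketch below enumerates every combination of a few candidate values. The candidate values and the helper name are made up for illustration; this is not how Tensile itself drives its search, only a picture of how quickly the number of kernels to benchmark grows.

```python
from itertools import product

# Hypothetical candidate values; a real search covers many more parameters.
search_space = {
    "WorkGroup": [[8, 8, 1], [16, 16, 1]],
    "ThreadTile": [[2, 2], [4, 4]],
    "GlobalSplitU": [1, 4],
}


def enumerate_candidates(space):
    """Yield one dict per point in the Cartesian product of the candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(enumerate_candidates(search_space))
print(len(candidates))  # 2 * 2 * 2 = 8
```

Each candidate would then be compiled and timed on the target problem, and the fastest one kept.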

How N-Dimensional Tensor Contractions Are Mapped to Finite-Dimensional GPU Kernels

For a traditional GEMM, the 2-dimensional output, C[i,j], is mapped to a 2-dimensional grid of work-groups, each of which has a 2-dimensional grid of work-items; one dimension belongs to i and one dimension belongs to j. The 1-dimensional summation is represented by a single loop within the kernel body.
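
The mapping described above can be emulated on the CPU. The following is a minimal sketch, assuming one output element per work-item and a hypothetical 2x2 work-group; the nested loops stand in for the launch grid, and the innermost loop is the single summation loop of the kernel body.

```python
def gemm_grid(A, B, M, N, K, wg_size=(2, 2)):
    """Emulate C[i][j] = sum_l A[i][l] * B[l][j] with a 2-D launch grid."""
    C = [[0] * N for _ in range(M)]
    # Work-groups per dimension; ceiling division covers partial edge tiles.
    groups_i = -(-M // wg_size[0])
    groups_j = -(-N // wg_size[1])
    for gi in range(groups_i):              # work-group id, dimension of i
        for gj in range(groups_j):          # work-group id, dimension of j
            for li in range(wg_size[0]):    # work-item id, dimension of i
                for lj in range(wg_size[1]):  # work-item id, dimension of j
                    i = gi * wg_size[0] + li
                    j = gj * wg_size[1] + lj
                    if i >= M or j >= N:
                        continue  # work-item falls outside C
                    acc = 0
                    for l in range(K):      # the 1-dimensional summation loop
                        acc += A[i][l] * B[l][j]
                    C[i][j] = acc
    return C


A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
print(gemm_grid(A, B, M=3, N=3, K=2))
# [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```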

Special Dimensions: D0, D1 and DU

To handle arbitrary dimensionality, Tensile begins by determining 3 special dimensions: D0, D1 and DU.

D0 and D1 are the free indices of A and B (one belongs to A and one to B) that have the shortest strides. This lets the innermost loops read from A and B as fast as possible via coalescing. In a traditional GEMM, every matrix has a dimension with a stride of 1, but Tensile doesn't make that assumption. Of these two dimensions, D0 is the one with the shorter stride in tensor C, which allows fast writes.

DU is the summation index with the shortest combined stride (stride in A + stride in B); it becomes the innermost loop, which gets "U"nrolled. This assignment is also meant to ensure fast reads in the innermost summation loop. There can be multiple summation indices (i.e. nested loops), and DU is the one iterated over in the innermost loop.
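
The selection rules for D0, D1 and DU can be sketched as follows. This is an illustration of the rules as stated here, not Tensile's implementation: the function name, the dictionary representation of strides, and the example contraction (free indices i and j, batch index k, summation indices l and m) are all assumptions made for the example.

```python
def pick_special_dims(stride_a, stride_b, stride_c):
    """Return (D0, D1, DU) given per-index strides of A, B and C.

    Each argument maps an index name to its stride in that tensor.
    """
    idx_a, idx_b, idx_c = set(stride_a), set(stride_b), set(stride_c)
    free_a = (idx_a & idx_c) - idx_b      # free indices that belong to A
    free_b = (idx_b & idx_c) - idx_a      # free indices that belong to B
    summation = (idx_a & idx_b) - idx_c   # indices summed over (absent from C)
    # Indices present in all three tensors are batch indices and are skipped.

    # One free index from each of A and B: the one with the shortest stride.
    best_a = min(sorted(free_a), key=lambda i: stride_a[i])
    best_b = min(sorted(free_b), key=lambda i: stride_b[i])
    # D0 is whichever of the two has the shorter stride in C.
    d0, d1 = sorted((best_a, best_b), key=lambda i: stride_c[i])
    # DU is the summation index with the shortest combined stride.
    du = min(sorted(summation), key=lambda i: stride_a[i] + stride_b[i])
    return d0, d1, du


# Hypothetical contraction: C[i,j,k] = sum over l, m of A[l,i,m,k] * B[j,l,m,k]
stride_c = {"i": 1, "j": 4, "k": 20}
stride_a = {"l": 1, "i": 6, "m": 24, "k": 48}
stride_b = {"j": 1, "l": 5, "m": 30, "k": 60}
print(pick_special_dims(stride_a, stride_b, stride_c))  # ('i', 'j', 'l')
```

Here l wins over m as DU because its combined stride is 1 + 5 = 6, against 24 + 30 = 54 for m.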

GPU Kernel Dimension

OpenCL allows a 3-dimensional grid of work-groups, and each work-group can be a 3-dimensional grid of work-items. Tensile assigns D0 to dimension-0 of the work-group and work-item grids, and D1 to dimension-1 of the work-group and work-item grids. All other free or batch dimensions are flattened into the final dimension-2 of the work-group and work-item grids. Within the GPU kernel, dimension-2 is reconstituted back into whatever dimensions it represents.
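
Flattening several dimensions into dimension-2 and reconstituting them inside the kernel amounts to a linear-index encode and decode. The sketch below assumes the first flattened dimension varies fastest; that ordering and both function names are illustrative choices, not something this page specifies.

```python
def flatten_index(indices, sizes):
    """Fold a multi-index into one linear index (first index varies fastest)."""
    linear, scale = 0, 1
    for idx, size in zip(indices, sizes):
        linear += idx * scale
        scale *= size
    return linear


def unflatten_index(linear, sizes):
    """Recover the multi-index from a linear index produced by flatten_index."""
    indices = []
    for size in sizes:
        linear, idx = divmod(linear, size)
        indices.append(idx)
    return tuple(indices)


sizes = (3, 5)                       # e.g. two batch dimensions of size 3 and 5
flat = flatten_index((2, 4), sizes)
print(flat)                          # 14
print(unflatten_index(flat, sizes))  # (2, 4)
```

The launch grid then needs only 3 * 5 = 15 entries along dimension-2, and each work-item decodes its own batch coordinates with integer division and remainder.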

Kernel Names

Kernel names contain abbreviations of relevant parameters along with their value. A kernel name might look something like the following:


The first part (C***_A***_B***) indicates the type of operation the kernel performs. This example is a GEMM.

Next is the data type supported by the kernel. In the example, S indicates single-precision floating point numbers, and B indicates that the kernel can use beta values. The table below lists the supported data types and their corresponding codes:

Code   Type
S      Single-precision float
D      Double-precision float
C      Single-precision complex float
Z      Double-precision complex float
H      Half-precision float
4xi8   4 x 8-bit integer (deprecated, use I8)
I      32-bit integer
B      Bfloat16
I8     8-bit integer

MT stands for macro tile. In the example, the macro tile is 64x256. The third number listed with macro tile (16 in the example) is the unroll depth, specified by the DepthU parameter.
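
The leading segments of a kernel name can be decoded mechanically. The sketch below is a rough illustration only: the sample name is assembled from the description on this page (a C_A_B operation segment, the data type code S with B for beta, and a 64x256 macro tile with unroll depth 16), and the pattern assumes the data type segment is a single type code optionally followed by B. Real kernel names may use richer forms that this pattern does not cover.

```python
import re

DATA_TYPES = {
    "S": "Single-precision float",
    "D": "Double-precision float",
    "C": "Single-precision complex float",
    "Z": "Double-precision complex float",
    "H": "Half-precision float",
    "4xi8": "4 x 8-bit integer (deprecated, use I8)",
    "I": "32-bit integer",
    "B": "Bfloat16",
    "I8": "8-bit integer",
}

# Assumed layout: <operation>_<type code>[B]_MT<tile0>x<tile1>x<DepthU>[_<parameters>]
NAME_PATTERN = re.compile(
    r"^(?P<op>C[a-z]+_A[a-z]+_B[a-z]+)"
    r"_(?P<dtype>4xi8|I8|[SDCZHIB])(?P<beta>B?)"
    r"_MT(?P<mt0>\d+)x(?P<mt1>\d+)x(?P<depth_u>\d+)"
    r"(?:_(?P<rest>.*))?$"
)


def parse_kernel_name(name):
    """Split a kernel name into its standard leading segments."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized kernel name: {name!r}")
    return {
        "operation": match["op"],
        "data_type": DATA_TYPES[match["dtype"]],
        "uses_beta": match["beta"] == "B",
        "macro_tile": (int(match["mt0"]), int(match["mt1"])),
        "depth_u": int(match["depth_u"]),
        "parameters": match["rest"] or "",
    }


# Illustrative name built from the description above, not a verbatim kernel name.
info = parse_kernel_name("Cijk_Ailk_Bljk_SB_MT64x256x16")
print(info["macro_tile"], info["depth_u"])  # (64, 256) 16
```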

After these standard name segments comes an alphabetized list of abbreviations of relevant kernel parameters. The table below lists parameters, their kernel name abbreviations, and their default values to help interpret the meaning of a kernel name:

Code Parameter Default
1LDSB 1LDSBuffer 0
APM AggressivePerfMode 1
AAV AssertAlphaValue False
ABV AssertBetaValue False
ACED AssertCEqualsD False
AF0EM AssertFree0ElementMultiple 1
AF1EM AssertFree1ElementMultiple 1
AMAS AssertMinApproxSize -1
ASE AssertSizeEqual {}
ASGT AssertSizeGreaterThan {}
ASLT AssertSizeLessThan {}
ASM AssertSizeMultiple {}
ASAE AssertStrideAEqual {}
ASBE AssertStrideBEqual {}
ASCE AssertStrideCEqual {}
ASDE AssertStrideDEqual {}
ASEM AssertSummationElementMultiple 1
AAC AtomicAddC False
BL BufferLoad True
BS BufferStore True
CDO CheckDimOverflow 0
CTDA CheckTensorDimAsserts False
DU DepthU -1
DULD DepthULdsDivisor 1
DTL DirectToLds False
DTVA DirectToVgprA False
DTVB DirectToVgprB False
DAF DisableAtomicFail 0
DKP DisableKernelPieces 0
DVO DisableVgprOverlapping False
ET EdgeType Branch
EPS ExpandPointerSwap True
R Fp16AltImpl False
FL FractionalLoad 0
GR2A GlobalRead2A True
GR2B GlobalRead2B True
GRCGA GlobalReadCoalesceGroupA True
GRCGB GlobalReadCoalesceGroupB True
GRCVA GlobalReadCoalesceVectorA True
GRCVB GlobalReadCoalesceVectorB True
GRPM GlobalReadPerMfma 1
GRVW GlobalReadVectorWidth -1
GSU GlobalSplitU 1
GSUA GlobalSplitUAlgorithm SingleBuffer
GSUSARR GlobalSplitUSummationAssignmentRoundRobin True
GSUWGMRR GlobalSplitUWorkGroupMappingRoundRobin False
GLS GroupLoadStore False
IU InnerUnroll 1
IA InterleaveAlpha 0
KL KernelLanguage Source
LEL LdcEqualsLdd True
LBSPP LdsBlockSizePerPad -1
LPA LdsPadA 0
LPB LdsPadB 0
LDL LocalDotLayout 1
LRVW LocalReadVectorWidth -1
LWPM LocalWritePerMfma -1
LR2A LocalRead2A True
LR2B LocalRead2B True
LW2A LocalWrite2A True
LW2B LocalWrite2B True
LDW LoopDoWhile False
LT LoopTail True
MAD or FMA MACInstruction FMA
MT MacroTile
MTSM MacroTileShapeMax 64
MTSM MacroTileShapeMin 1
MDA MagicDivAlg 2
MI MatrixInstruction []
MO MaxOccupancy 40
MVN MaxVgprNumber 256
MIAV MIArchVgpr False
MVN MinVgprNumber 0
NTA NonTemporalA 0
NTB NonTemporalB 0
NTC NonTemporalC 0
NTD NonTemporalD 0
NR NoReject False
NEPBS NumElementsPerBatchStore 0
NLCA NumLoadsCoalescedA 1
NLCB NumLoadsCoalescedB 1
ONLL OptNoLoadLoop 1
OPLV OptPreLoopVmcnt True
PBD PackBatchDims 0
PFD PackFreeDims 1
PG PackGranularity 2
PSD PackSummationDims 0
PSL PerformanceSyncLocation -1
PWC PerformanceWaitCount -1
PWL PerformanceWaitLocation -1
PK PersistentKernel 0
PKAB PersistentKernelAlongBatch False
PAP PrefetchAcrossPersistent 0
PAPM PrefetchAcrossPersistentMode 0
PGR PrefetchGlobalRead True
PLR PrefetchLocalRead 1
RK ReplacementKernel False
SGR ScheduleGlobalRead 1
SIA ScheduleIterAlg 1
SLW ScheduleLocalWrite 1
SS SourceSwap False
SU StaggerU 32
SUM StaggerUMapping 0
SUS StaggerUStride 256
SCIU StoreCInUnroll False
SCIUE StoreCInUnrollExact False
SCIUI StoreCInUnrollInterval 1
SCIUP StoreCInUnrollPostLoop False
SPO StorePriorityOpt False
SRVW StoreRemapVectorWidth 0
SSO StoreSyncOpt 0
SVW StoreVectorWidth -1
SNLL SuppressNoLoadLoop False
TSGRA ThreadSeparateGlobalReadA 0
TSGRB ThreadSeparateGlobalReadB 0
TT ThreadTile [4, 4]
TLDS TransposeLDS 0
UIIDU UnrollIncIsDepthU 0
UMF UnrollMemFence False
U64SL Use64bShadowLimit 1
UIOFGRO UseInstOffsetForGRO 0
VAW VectorAtomicWidth -1
VS VectorStore True
VW VectorWidth -1
WSGRA WaveSeparateGlobalReadA 0
WSGRB WaveSeparateGlobalReadB 0
WS WavefrontSize 64
WG WorkGroup [16, 16, 1]
WGM WorkGroupMapping 8
WGMT WorkGroupMappingType B