@@ -360,7 +360,7 @@ Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
360360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
361361 - tgsplit flat
362362 - xnack scratch .. TODO::
363- - Packed
363+ - kernarg preload - Packed
364364 work-item Add product
365365 IDs names.
366366
@@ -381,21 +381,21 @@ Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
381381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
382382 - tgsplit flat
383383 - xnack scratch .. TODO::
384- - Packed
384+ - kernarg preload - Packed
385385 work-item Add product
386386 IDs names.
387387
388388 ``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
389389 - tgsplit flat
390390 - xnack scratch .. TODO::
391- - Packed
391+ - kernarg preload - Packed
392392 work-item Add product
393393 IDs names.
394394
395395 ``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
396396 - tgsplit flat
397397 - xnack scratch .. TODO::
398- - Packed
398+ - kernarg preload - Packed
399399 work-item Add product
400400 IDs names.
401401
@@ -4375,12 +4375,24 @@ The fields used by CP for code objects before V3 also match those specified in
43754375 dynamically sized stack.
43764376 This is only set in code
43774377 object v5 and later.
4378- 463:460 1 bit Reserved, must be 0.
4379- 464 1 bit RESERVED_464 Deprecated, must be 0.
4380- 467:465 3 bits Reserved, must be 0.
4381- 468 1 bit RESERVED_468 Deprecated, must be 0.
4382- 469:471 3 bits Reserved, must be 0.
4383- 511:472 5 bytes Reserved, must be 0.
4378+ 463:460 4 bits Reserved, must be 0.
4379+ 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9
4380+ - Reserved, must be 0.
4381+ GFX90A, GFX940
4382+ - The number of dwords from
4383+ the kernarg segment to preload
4384+ into User SGPRs before kernel
4385+ execution. (see
4386+ :ref:`amdgpu-amdhsa-kernarg-preload`).
4387+ 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9
4388+ - Reserved, must be 0.
4389+ GFX90A, GFX940
4390+ - An offset in dwords into the
4391+ kernarg segment to begin
4392+ preloading data into User
4393+ SGPRs. (see
4394+ :ref:`amdgpu-amdhsa-kernarg-preload`).
4395+ 511:480 4 bytes Reserved, must be 0.
43844396 512 **Total size 64 bytes.**
43854397 ======= ====================================================================
43864398
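As a reading aid for the two new kernel descriptor fields, the following is a minimal sketch of how a tool might extract them from a raw 64-byte descriptor. It assumes the descriptor is stored little-endian, so that bit 464 is the least significant bit of byte 58. The function and constant names are hypothetical and do not come from LLVM or the ROCm runtime.

```python
import struct

# Layout taken from the table above: bits 470:464 hold
# KERNARG_PRELOAD_SPEC_LENGTH and bits 479:471 hold
# KERNARG_PRELOAD_SPEC_OFFSET. Both live in the 16-bit word at byte 58.
KD_SIZE_BYTES = 64
PRELOAD_WORD_BYTE_OFFSET = 464 // 8  # byte 58

def decode_kernarg_preload(kd):
    """Return (length_in_dwords, offset_in_dwords) from a raw descriptor.

    Hypothetical helper; assumes little-endian storage of the descriptor.
    """
    if len(kd) != KD_SIZE_BYTES:
        raise ValueError("kernel descriptor must be exactly 64 bytes")
    (word,) = struct.unpack_from("<H", kd, PRELOAD_WORD_BYTE_OFFSET)
    length = word & 0x7F          # 7 bits: 470:464
    offset = (word >> 7) & 0x1FF  # 9 bits: 479:471
    return length, offset

# Usage: build a zeroed descriptor with length=5 and offset=3, then decode it.
kd = bytearray(KD_SIZE_BYTES)
struct.pack_into("<H", kd, PRELOAD_WORD_BYTE_OFFSET, (3 << 7) | 5)
length, offset = decode_kernarg_preload(bytes(kd))
```

On targets where these bits are reserved, both values are expected to decode to 0.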
@@ -5002,7 +5014,7 @@ for enabled registers are dense starting at SGPR0: the first enabled register is
50025014SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
50035015an SGPR number.
50045016
5005- The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
5017+ The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
50065018all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
50075019using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
50085020actually initialized. These are then immediately followed by the System SGPRs
@@ -5045,6 +5057,9 @@ SGPR register initial state is defined in
50455057 then Flat Scratch Init 2 See
50465058 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
50475059 _init)
5060+ then Preloaded Kernargs N/A See
5061+ (kernarg_preload_spec :ref:`amdgpu-amdhsa-kernarg-preload`.
5062+ _length)
50485063 then Private Segment Size 1 The 32-bit byte size of a
50495064 (enable_sgpr_private single work-item's memory
50505065 _segment_size) allocation. This is the
@@ -5177,6 +5192,31 @@ following properties:
51775192* MTYPE set to support memory coherence that matches the runtime (such as CC for
51785193 APU and NC for dGPU).
51795194
5195+ .. _amdgpu-amdhsa-kernarg-preload:
5196+
5197+ Preloaded Kernel Arguments
5198+ ++++++++++++++++++++++++++
5199+
5200+ On hardware that supports this feature, kernel arguments can be preloaded into
5201+ User SGPRs, up to the maximum number of User SGPRs available. The allocation of
5202+ Preload SGPRs occurs directly after the last enabled non-kernarg preload User
5203+ SGPR. (See :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.)
5204+
5205+ The preloaded data is copied from the kernarg segment. The amount of data, in
5206+ dwords, is given by the kernarg_preload_spec_length field of the kernel
5207+ descriptor. This data is loaded into consecutive User SGPRs, so the number of
5208+ SGPRs receiving preloaded kernarg data equals the value of
5209+ kernarg_preload_spec_length. Preloading starts at the dword offset within the
5210+ kernarg segment that is specified by the
5211+ kernarg_preload_spec_offset field.
5212+
5213+ If kernarg_preload_spec_length is non-zero, the CP firmware adds an
5214+ additional 256 bytes to the kernel_code_entry_byte_offset. This allows a
5215+ prologue to be placed at the kernel entry to handle cases where code designed
5216+ for kernarg preloading is executed on hardware equipped with incompatible
5217+ firmware. If the hardware has compatible firmware, the 256 bytes at the start
5218+ of the kernel entry are skipped.
5219+
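The rules in this section can be illustrated with a small sketch. It models which User SGPRs receive preloaded data, which bytes of the kernarg segment they come from, and how the entry offset is adjusted. All names are hypothetical. The limit of 16 User SGPRs is taken from the initial register state description, and clamping the preload count to the remaining User SGPRs is an assumption made for illustration, not behavior stated by this section.

```python
DWORD_BYTES = 4
PRELOAD_ENTRY_SKIP_BYTES = 256  # added by CP firmware when length != 0

def preload_layout(num_other_user_sgprs, preload_length, preload_offset,
                   max_user_sgprs=16):
    """Return (sgpr_indices, (byte_start, byte_end)) for preloaded kernargs.

    Preload SGPRs start directly after the last enabled non-preload User
    SGPR. Clamping to max_user_sgprs is an assumption for illustration.
    """
    first_sgpr = num_other_user_sgprs
    available = max(0, max_user_sgprs - first_sgpr)
    count = min(preload_length, available)
    sgprs = list(range(first_sgpr, first_sgpr + count))
    # Both descriptor fields are expressed in dwords.
    byte_start = preload_offset * DWORD_BYTES
    byte_end = byte_start + count * DWORD_BYTES
    return sgprs, (byte_start, byte_end)

def effective_entry_offset(kernel_code_entry_byte_offset, preload_length):
    """Entry offset used by firmware that supports kernarg preloading."""
    if preload_length != 0:
        return kernel_code_entry_byte_offset + PRELOAD_ENTRY_SKIP_BYTES
    return kernel_code_entry_byte_offset

# Usage: two other User SGPRs enabled, preload 4 dwords starting at dword 1.
sgprs, byte_range = preload_layout(2, 4, 1)
entry = effective_entry_offset(0x1000, 4)
```

Here the preloaded values would land in SGPR2 through SGPR5, sourced from bytes 4 to 20 of the kernarg segment, and compatible firmware would begin execution 256 bytes past the nominal entry point.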
51805220.. _amdgpu-amdhsa-kernel-prolog:
51815221
51825222Kernel Prolog
@@ -15352,6 +15392,10 @@ terminated by an ``.end_amdhsa_kernel`` directive.
1535215392 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
1535315393 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
1535415394 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15395+ ``.amdhsa_user_sgpr_kernarg_preload_length`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_LENGTH in
15396+ GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15397+ ``.amdhsa_user_sgpr_kernarg_preload_offset`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_OFFSET in
15398+ GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
1535515399 ======================================================== =================== ============ ===================
1535615400
1535715401.amdgpu_metadata