@@ -360,7 +360,7 @@ Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
- tgsplit flat
- xnack scratch .. TODO::
- - Packed
+ - kernarg preload - Packed
work-item Add product
IDs names.
@@ -381,21 +381,21 @@ Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
- tgsplit flat
- xnack scratch .. TODO::
- - Packed
+ - kernarg preload - Packed
work-item Add product
IDs names.

``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
- tgsplit flat
- xnack scratch .. TODO::
- - Packed
+ - kernarg preload - Packed
work-item Add product
IDs names.

``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
- tgsplit flat
- xnack scratch .. TODO::
- - Packed
+ - kernarg preload - Packed
work-item Add product
IDs names.
@@ -4375,12 +4375,24 @@ The fields used by CP for code objects before V3 also match those specified in
dynamically sized stack.
This is only set in code
object v5 and later.
- 463:460 1 bit Reserved, must be 0.
- 464 1 bit RESERVED_464 Deprecated, must be 0.
- 467:465 3 bits Reserved, must be 0.
- 468 1 bit RESERVED_468 Deprecated, must be 0.
- 469:471 3 bits Reserved, must be 0.
- 511:472 5 bytes Reserved, must be 0.
+ 463:460 4 bits Reserved, must be 0.
+ 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9
+ - Reserved, must be 0.
+ GFX90A, GFX940
+ - The number of dwords from
+ the kernarg segment to preload
+ into User SGPRs before kernel
+ execution. (see
+ :ref:`amdgpu-amdhsa-kernarg-preload`).
+ 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9
+ - Reserved, must be 0.
+ GFX90A, GFX940
+ - An offset in dwords into the
+ kernarg segment to begin
+ preloading data into User
+ SGPRs. (see
+ :ref:`amdgpu-amdhsa-kernarg-preload`).
+ 511:480 4 bytes Reserved, must be 0.
512 **Total size 64 bytes.**
======= ====================================================================
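
To illustrate the new layout (this note is not part of the patch): bits 470:464 and 479:471 together fill the 16-bit word that starts at byte 58 of the 64-byte kernel descriptor. The sketch below packs and unpacks the two fields. The helper names are hypothetical, and it assumes the descriptor is stored little-endian with bit 0 as the least significant bit of byte 0.

```python
import struct
from typing import Tuple

KD_SIZE = 64                  # total kernel descriptor size in bytes
PRELOAD_WORD_BYTE = 464 // 8  # byte 58: first byte holding the preload fields
LENGTH_BITS = 7               # KERNARG_PRELOAD_SPEC_LENGTH, bits 470:464
OFFSET_BITS = 9               # KERNARG_PRELOAD_SPEC_OFFSET, bits 479:471


def set_kernarg_preload(kd: bytearray, length: int, offset: int) -> None:
    """Write both preload fields into a 64-byte descriptor (hypothetical helper)."""
    if len(kd) != KD_SIZE:
        raise ValueError("kernel descriptor must be 64 bytes")
    if not 0 <= length < (1 << LENGTH_BITS):
        raise ValueError("length does not fit in 7 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 9 bits")
    # The length occupies the low 7 bits of the word, the offset the high 9.
    word = length | (offset << LENGTH_BITS)
    struct.pack_into("<H", kd, PRELOAD_WORD_BYTE, word)


def get_kernarg_preload(kd: bytes) -> Tuple[int, int]:
    """Return (length, offset), both in dwords."""
    (word,) = struct.unpack_from("<H", kd, PRELOAD_WORD_BYTE)
    length = word & ((1 << LENGTH_BITS) - 1)
    offset = (word >> LENGTH_BITS) & ((1 << OFFSET_BITS) - 1)
    return length, offset
```

Because the two fields fill the word exactly, writing them leaves the reserved bits 511:480 (bytes 60 to 63) untouched.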
@@ -5002,7 +5014,7 @@ for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.

- The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
+ The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
@@ -5045,6 +5057,9 @@ SGPR register initial state is defined in
then Flat Scratch Init 2 See
(enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
_init)
+ then Preloaded Kernargs N/A See
+ (kernarg_preload_spec :ref:`amdgpu-amdhsa-kernarg-preload`.
+ _length)
then Private Segment Size 1 The 32-bit byte size of a
(enable_sgpr_private single work-item's memory
_segment_size) allocation. This is the
@@ -5177,6 +5192,31 @@ following properties:
* MTYPE set to support memory coherence that matches the runtime (such as CC for
APU and NC for dGPU).

+ .. _amdgpu-amdhsa-kernarg-preload:
+
+ Preloaded Kernel Arguments
+ ++++++++++++++++++++++++++
+
+ On hardware that supports this feature, kernel arguments can be preloaded into
+ User SGPRs, up to the maximum number of User SGPRs available. Preload SGPRs are
+ allocated directly after the last enabled non-kernarg-preload User SGPR (see
+ :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
+
+ The preloaded data is copied from the kernarg segment into consecutive User
+ SGPRs. The number of dwords copied, and therefore the number of SGPRs that
+ receive preloaded kernarg data, is given by the kernarg_preload_spec_length
+ field of the kernel descriptor. Preloading starts at the dword offset within
+ the kernarg segment given by the kernarg_preload_spec_offset field.
+
+ If kernarg_preload_spec_length is non-zero, the CP firmware adds an additional
+ 256 bytes to the kernel_code_entry_byte_offset. This allows a prologue to be
+ placed at the kernel entry to handle cases where code designed for kernarg
+ preloading is executed on hardware equipped with incompatible firmware. If the
+ hardware has compatible firmware, the 256 bytes at the start of the kernel
+ entry are skipped.
+
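
The allocation and entry-point rules in the added section can be sketched as follows. This is an illustrative model only, not part of the patch and not a description of the actual CP firmware. All function and parameter names are hypothetical, and the cap of 16 User SGPRs is taken from the initial-state text elsewhere in this diff.

```python
MAX_USER_SGPRS = 16           # "up to 16 User SGPRs" per the initial-state text
PRELOAD_PROLOGUE_BYTES = 256  # bytes skipped when the firmware supports preload


def preload_sgpr_range(num_other_user_sgprs: int, preload_length: int) -> range:
    """SGPR numbers that receive preloaded kernarg dwords.

    Preload SGPRs start directly after the last enabled non-preload User SGPR.
    Assumption: the total is clamped to the number of User SGPRs available.
    """
    first = num_other_user_sgprs
    count = max(0, min(preload_length, MAX_USER_SGPRS - first))
    return range(first, first + count)


def preloaded_dword_offsets(preload_offset: int, sgprs: range) -> list:
    """Kernarg-segment dword offset loaded into each SGPR of `sgprs`, in order."""
    return [preload_offset + i for i in range(len(sgprs))]


def effective_entry_offset(kernel_code_entry_byte_offset: int,
                           preload_length: int,
                           firmware_supports_preload: bool) -> int:
    """Entry offset a wave starts at under the rule described above."""
    if preload_length != 0 and firmware_supports_preload:
        # Compatible firmware skips the 256-byte compatibility prologue.
        return kernel_code_entry_byte_offset + PRELOAD_PROLOGUE_BYTES
    # Otherwise execution begins at the prologue at the original entry.
    return kernel_code_entry_byte_offset
```

For example, with six other User SGPRs enabled and a preload length of 4, the model places the preloaded dwords in SGPR6 through SGPR9.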
.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
Kernel Prolog
@@ -15352,6 +15392,10 @@ terminated by an ``.end_amdhsa_kernel`` directive.
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
+ ``.amdhsa_user_sgpr_kernarg_preload_length`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_LENGTH in
+ GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
+ ``.amdhsa_user_sgpr_kernarg_preload_offset`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_OFFSET in
+ GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
======================================================== =================== ============ ===================

.amdgpu_metadata