[AMDGPU] Document & Finalize GFX12 Memory Model #98599
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes: Document the memory model implemented as of #98591

Patch is 160.80 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/98599.diff

1 Files Affected:
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 117fc2cf6bbbc..be8a2fed07b57 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -6094,6 +6094,7 @@ following sections:
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx942`
* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
+* :ref:`amdgpu-amdhsa-memory-model-gfx12`
.. _amdgpu-fence-as:
@@ -14074,6 +14075,2266 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
- system for OpenCL.*
============ ============ ============== ========== ================================
+
+.. _amdgpu-amdhsa-memory-model-gfx12:
+
+Memory Model GFX12
+++++++++++++++++++++++++
+
+For GFX12:
+
+* Each agent has multiple shader arrays (SA).
+* Each SA has multiple work-group processors (WGP).
+* Each WGP has multiple compute units (CU).
+* Each CU has multiple SIMDs that execute wavefronts.
+* The wavefronts for a single work-group are executed in the same
+ WGP.
+
+ * In CU wavefront execution mode the wavefronts may be executed by different SIMDs
+ in the same CU.
+ * In WGP wavefront execution mode the wavefronts may be executed by different SIMDs
+ in different CUs in the same WGP.
+
+* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
+ executing on it.
+* All LDS operations of a WGP are performed as wavefront wide operations in a
+ global order and involve no caching. Completion is reported to a wavefront in
+ execution order.
+* The LDS memory has multiple request queues shared by the SIMDs of a
+ WGP. Therefore, the LDS operations performed by different wavefronts of a
+ work-group can be reordered relative to each other, which can result in
+ reordering the visibility of vector memory operations with respect to LDS
+ operations of other wavefronts in the same work-group. A ``s_wait_dscnt 0x0``
+ is required to ensure synchronization between LDS operations and
+ vector memory operations between wavefronts of a work-group, but not between
+ operations performed by the same wavefront.
+* The vector memory operations are performed as wavefront wide operations.
+  Vector memory operations are divided into different types. Completion of a
+ vector memory operation is reported to a wavefront in-order within a type,
+ but may be out of order between types. The types of vector memory operations
+ (and their associated ``s_wait`` instructions) are:
+
+ * LDS: ``s_wait_dscnt``
+ * Load (global, scratch, flat, buffer and image): ``s_wait_loadcnt``
+ * Store (global, scratch, flat, buffer and image): ``s_wait_storecnt``
+ * Sample and Gather4: ``s_wait_samplecnt``
+ * BVH: ``s_wait_bvhcnt``
+
+* Vector and scalar memory instructions contain a ``SCOPE`` field with values
+ corresponding to each cache level. The ``SCOPE`` determines whether a cache
+ can complete an operation locally or whether it needs to forward the operation
+ to the next cache level. The ``SCOPE`` values are:
+
+ * ``SCOPE_CU``: Compute Unit (NOTE: not affected by CU/WGP mode)
+ * ``SCOPE_SE``: Shader Engine
+ * ``SCOPE_DEV``: Device/Agent
+ * ``SCOPE_SYS``: System
+
+* When a memory operation with a given ``SCOPE`` reaches a cache with a smaller
+ ``SCOPE`` value, it is forwarded to the next level of cache.
+* When a memory operation with a given ``SCOPE`` reaches a cache with a ``SCOPE``
+ value greater than or equal to its own, the operation can proceed:
+
+  * Reads can hit in the cache.
+ * Writes can happen in this cache and the transaction is acknowledged
+ from this level of cache.
+ * RMW operations can be done locally.
+
+* ``global_inv``, ``global_wb`` and ``global_wbinv`` instructions are used to
+ invalidate, write-back and write-back+invalidate caches. The affected
+ cache(s) are controlled by the ``SCOPE:`` of the instruction.
+* ``global_inv`` invalidates caches whose scope is strictly smaller than the
+ instruction's. The invalidation requests cannot be reordered with pending or
+ upcoming memory operations.
+* ``global_wb`` additionally ensures that previous memory operations done at
+  a lower scope level have reached the ``SCOPE:`` of the ``global_wb``.
+* The vector memory operations access a vector L0 cache. There is a single L0
+ cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
+ special action is required for coherence between the lanes of a single
+ wavefront. To achieve coherence between wavefronts executing in the same
+ work-group:
+
+ * In CU wavefront execution mode, no special action is required.
+ * In WGP wavefront execution mode, a ``global_inv scope:SCOPE_CU`` is required
+ as wavefronts may be executing on SIMDs of different CUs that access different L0s.
+
+* The scalar memory operations access a scalar L0 cache shared by all wavefronts
+ on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
+ operations are used in a restricted way so do not impact the memory model. See
+ :ref:`amdgpu-amdhsa-memory-spaces`.
+* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
+ the same SA. Therefore, no special action is required for coherence between
+ the wavefronts of a single work-group. However, a ``global_inv scope:SCOPE_DEV`` is
+ required for coherence between wavefronts executing in different work-groups
+ as they may be executing on different SAs that access different L1s.
+* The L1 caches have independent quadrants to service disjoint ranges of virtual
+ addresses.
+* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
+ vector and scalar memory operations performed by different wavefronts, whether
+ executing in the same or different work-groups (which may be executing on
+ different CUs accessing different L0s), can be reordered relative to each
+  other. Some or all of the wait instructions below are required to ensure
+  synchronization between vector memory operations of different wavefronts. They
+  ensure a previous vector memory operation has completed before executing a
+  subsequent vector memory or LDS operation and so can be used to meet the
+  requirements of acquire, release and sequential consistency.
+
+ * ``s_wait_loadcnt 0x0``
+ * ``s_wait_samplecnt 0x0``
+ * ``s_wait_bvhcnt 0x0``
+ * ``s_wait_storecnt 0x0``
+
+* The L1 caches use an L2 cache shared by all SAs on the same agent.
+* The L2 cache has independent channels to service disjoint ranges of virtual
+ addresses.
+* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
+ quadrant has a separate request queue per L2 channel. Therefore, the vector
+ and scalar memory operations performed by wavefronts executing in different
+ work-groups (which may be executing on different SAs) of an agent can be
+ reordered relative to each other. Some or all of the wait instructions below are
+ required to ensure synchronization between vector memory operations of
+  different SAs. They ensure a previous vector memory operation has completed
+  before executing a subsequent vector memory operation and so can be used to
+  meet the requirements of acquire, release and sequential consistency.
+
+ * ``s_wait_loadcnt 0x0``
+ * ``s_wait_samplecnt 0x0``
+ * ``s_wait_bvhcnt 0x0``
+ * ``s_wait_storecnt 0x0``
+
+* The L2 cache can be kept coherent with other agents, or ranges
+ of virtual addresses can be set up to bypass it to ensure system coherence.
+* A memory attached last level (MALL) cache exists for GPU memory.
+ The MALL cache is fully coherent with GPU memory and has no impact on system
+ coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
+
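+An illustrative, non-normative example of how the independent completion
+counters and instruction scopes interact: a wavefront that issues a
+device-scope load followed by a store can wait on the load alone, because
+loads and stores report completion through separate counters (operand
+choices below are arbitrary):
+
+.. code-block:: text
+
+  global_load_b32  v0, v[2:3], off scope:SCOPE_DEV  ; counted by loadcnt
+  global_store_b32 v[4:5], v1, off                  ; counted by storecnt
+  s_wait_loadcnt 0x0  ; waits for the load only; the store may still be
+                      ; outstanding until a s_wait_storecnt 0x0 executes
+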
+Scalar memory operations are only used to access memory that is proven to not
+change during the execution of the kernel dispatch. This includes constant
+address space and global address space for program scope ``const`` variables.
+Therefore, the kernel machine code does not have to maintain the scalar cache to
+ensure it is coherent with the vector caches. The scalar and vector caches are
+invalidated between kernel dispatches by CP since constant address space data
+may change between kernel dispatch executions. See
+:ref:`amdgpu-amdhsa-memory-spaces`.
+
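+As a sketch of this restriction, a pointer kernel argument can be fetched with
+a scalar load because kernarg memory is invariant for the duration of the
+dispatch (illustrative only; register assignments are arbitrary):
+
+.. code-block:: text
+
+  s_load_b64 s[0:1], s[4:5], 0x0  ; read a pointer from the kernarg segment
+  s_wait_kmcnt 0x0                ; wait for scalar memory completion
+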
+For kernarg backing memory:
+
+* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
+* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
+ needing to invalidate the L2 cache.
+* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
+ so the L2 cache will be coherent with the CPU and other agents.
+
+Scratch backing memory (which is used for the private address space) is accessed
+with MTYPE NC (non-coherent). Since the private address space is only accessed
+by a single thread, and is always write-before-read, there is never a need to
+invalidate these entries from the L0 or L1 caches.
+
+Wavefronts can be executed in WGP or CU wavefront execution mode:
+
+* In WGP wavefront execution mode the wavefronts of a work-group are executed
+ on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
+ CU L0 caches is required for work-group synchronization. Also accesses to L1
+ at work-group scope need to be explicitly ordered as the accesses from
+ different CUs are not ordered.
+* In CU wavefront execution mode the wavefronts of a work-group are executed on
+  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
+  the work-group access the same L0, which in turn ensures L1 accesses are
+ ordered and so do not require explicit management of the caches for
+ work-group synchronization.
+
+See ``WGP_MODE`` field in
+:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
+:ref:`amdgpu-target-features`.
+
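+As an illustration of the difference between the two modes, a workgroup-scope
+acquire load could be lowered as sketched here (this merely previews the
+normative sequences defined in the code sequences table):
+
+.. code-block:: text
+
+  ; CU wavefront execution mode: all wavefronts of the work-group share one
+  ; L0 cache, so the plain load suffices.
+  global_load_b32 v0, v[2:3], off
+
+  ; WGP wavefront execution mode: wavefronts may execute on different CUs
+  ; with different L0 caches.
+  global_load_b32 v0, v[2:3], off scope:SCOPE_SE
+  s_wait_bvhcnt 0x0
+  s_wait_samplecnt 0x0
+  s_wait_loadcnt 0x0
+  global_inv scope:SCOPE_SE
+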
+The code sequences used to implement the memory model for GFX12 are defined in
+table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
+
+ .. table:: AMDHSA Memory Model Code Sequences GFX12 - Instruction Scopes
+ :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table
+
+ =================== =================== ===================
+ LLVM syncscope CU wavefront WGP wavefront
+ execution execution
+ mode mode
+ =================== =================== ===================
+ *none* ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ system ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ agent ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
+ workgroup *none* ``scope:SCOPE_SE``
+ wavefront *none* *none*
+ singlethread *none* *none*
+ one-as ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ system-one-as ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ agent-one-as ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
+ workgroup-one-as *none* ``scope:SCOPE_SE``
+ wavefront-one-as *none* *none*
+ singlethread-one-as *none* *none*
+ =================== =================== ===================
+
+NOTE: The table above applies if and only if it is explicitly referenced by
+a code sequence in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
+
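+For instance, an agent-scope monotonic ``atomicrmw add`` maps to a
+``global_atomic`` instruction, and the table above supplies ``scope:SCOPE_DEV``
+in both execution modes (illustrative; operands are arbitrary):
+
+.. code-block:: text
+
+  global_atomic_add_u32 v[0:1], v2, off scope:SCOPE_DEV
+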
+ .. table:: AMDHSA Memory Model Code Sequences GFX12
+ :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-table
+
+ ============ ============ ============== ========== ================================
+ LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
+ Ordering Sync Scope Address GFX12
+ Space
+ ============ ============ ============== ========== ================================
+ **Non-Atomic**
+ ------------------------------------------------------------------------------------
+ load *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_load
+ - constant
+ - !volatile & nontemporal
+
+ 1. buffer/global/flat_load
+ ``th:TH_LOAD_NT``
+
+ - volatile
+
+ 1. buffer/global/flat_load
+ ``scope:SCOPE_SYS``
+
+ 2. ``s_wait_bvhcnt 0x0``
+ ``s_wait_samplecnt 0x0``
+ ``s_wait_loadcnt 0x0``
+
+ - Must happen before
+ any following volatile
+ global/generic
+ load/store.
+ - Ensures that
+ volatile
+ operations to
+ different
+ addresses will not
+ be reordered by
+ hardware.
+
+ load *none* *none* - local 1. ds_load
+ store *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_store
+ - constant
+ - !volatile & nontemporal
+
+ 1. buffer/global/flat_store
+ ``th:TH_STORE_NT``
+
+ - volatile
+
+ 1. buffer/global/flat_store
+ ``scope:SCOPE_SYS``
+
+ 2. ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ any following volatile
+ global/generic
+ load/store.
+ - Ensures that
+ volatile
+ operations to
+ different
+ addresses will not
+ be reordered by
+ hardware.
+
+ store *none* *none* - local 1. ds_store
+ **Unordered Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic unordered *any* *any* *Same as non-atomic*.
+ store atomic unordered *any* *any* *Same as non-atomic*.
+ atomicrmw unordered *any* *any* *Same as monotonic atomic*.
+ **Monotonic Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic monotonic - singlethread - global 1. buffer/global/flat_load
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - agent
+ - system
+ load atomic monotonic - singlethread - local 1. ds_load
+ - wavefront
+ - workgroup
+ store atomic monotonic - singlethread - global 1. buffer/global/flat_store
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - agent
+ - system
+ store atomic monotonic - singlethread - local 1. ds_store
+ - wavefront
+ - workgroup
+ atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - agent
+ - system
+ atomicrmw monotonic - singlethread - local 1. ds_atomic
+ - wavefront
+ - workgroup
+ **Acquire Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
+ - wavefront - local
+ - generic
+ load atomic acquire - workgroup - global 1. buffer/global_load ``scope:SCOPE_SE``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Must happen before
+ the following ``global_inv``
+ and before any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ load atomic acquire - workgroup - local 1. ds_load
+ 2. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Must happen before
+ the following ``global_inv``
+ and before any following
+ ...
[truncated]
LGTM (having previously reviewed this downstream)
I will land this at the end of the week if no comments come up.
>                       execution           execution
>                       mode                mode
>  =================== =================== ===================
>  *none*              ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
So if you don't specify a syncscope in IR, it acts like ``system``? Has that always been the case?
Yes, the default is always system scope: https://llvm.org/docs/AMDGPUUsage.html#memory-scopes
It has to be that way otherwise the code generated would not be conservatively correct in the absence of the hint.
@jayfoad Can you please review the changes I made for L1 as a buffer?
> The code sequences used to implement the memory model for GFX12 are defined in
> table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
>
>  .. table:: AMDHSA Memory Model Code Sequences GFX12 - Instruction Scopes
Do we really want the semantic meaning of syncscope to change according to the hardware mode? My understanding is that syncscope is a language level concept. It is the responsibility of the backend to generate the correct code according to what mode the hardware will be running in, as defined in the kernel descriptor.
It is not my intent to change the meaning of syncscope; this table just provides a mapping of LLVM IR syncscope to GFX12 ISA scope operands.
The idea is that you start with some instruction, say a ``flat_load_b32``, then when you see this table being referenced in the code sequences table below, you look at it and add the relevant scope. So if you want workgroup scope, you either add nothing for CU mode, or ``scope:SCOPE_SE`` for WGP mode.
I'm now realizing the doc may not be as self-evident as I thought, so I will add a short paragraph to explain this.
Mostly LGTM, but I do have some substantive comments :)
I'm confused that c2625c2 changes the text about what […]
Force-pushed 3a17dc5 to 6ccf48a: "Document the memory model implemented as of llvm#98591"
Good catch, and you're right. I think we could replace those SCOPE_DEV with SCOPE_SE, but I'm not really convinced it's the right decision because:
I will bring this up with @t-tye on our next meeting. My intuition is that we should leave SCOPE_DEV, and then add a new paragraph to explain how we approach global_inv/wb emission.
note: I plan to include the code changes in this diff as well and just make it a "finalize GFX12 memory model" patch. It makes more sense and will be easier to review like that IMO. I need a bit of time to finalize the code changes, they'll come later today or tomorrow.
I'm not entirely sure that's true, and in any case I think we shouldn't rely on such details. Let's just have the scopes follow the semantics we want. So like you said, an agent scope release should do a SCOPE_DEV writeback, and an agent scope acquire should do a SCOPE_DEV invalidate.
Yes, let's discuss this as there are several things here that seem questionable :-)
This makes more sense to me. Release is not about invalidating, it is about "writing back" and ensuring it has completed. The WB instructions do this. Even if the cache is write-through, we still have to confirm that the write-through has completed to the scope we want to release to. That is where the WB instruction comes in: it does more than just trigger a write-back, it also confirms the write is complete at a specified scope. We want the hardware instruction scopes to reflect the source language semantics. Unfortunately this is not always the case, and so we have to modify the scope according to the modality of the configuration in some cases. But where the scopes do reflect the language semantics, we can use them and not worry about which caches they manipulate, as the hardware will make sure it controls the appropriate caches for the source language semantic action in conjunction with the hardware modal configuration.
Thanks, LGTM
Documents the memory model implemented as of #98591, with some fixes/optimizations to the implementation.