.. SPDX-License-Identifier: GPL-2.0-only

.. include:: <isonum.txt>

=========
 AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <sonal.santan@amd.com>

Overview
========

AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of
machine learning workloads such as CNNs and LLMs. The NPU is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.


Hardware Description
====================

AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

The AMD XDNA Array is a 2D array of compute and memory tiles built with
`AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1
row of memory tiles. Each compute tile contains a VLIW processor with its own
dedicated program and data memory. The memory tile acts as L2 memory. The 2D
array can be partitioned at a column boundary, creating a spatially isolated
partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
the memory tile.

AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows
of compute tiles arranged into 5 columns. The AMD Strix Point client APU has
a 4x8 topology, i.e., 4 rows of compute tiles arranged into 8 columns.
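
The per-device geometry can be captured in a small table. The structure and
names below are hypothetical and purely illustrative; only the row and
column counts come from the text above::

  /* Hypothetical description of XDNA Array geometry (illustrative only) */
  struct xdna_array_topology {
          const char *device;
          unsigned int compute_rows;  /* rows of compute tiles per column */
          unsigned int mem_rows;      /* rows of memory tiles per column */
          unsigned int columns;       /* unit of spatial partitioning */
  };

  static const struct xdna_array_topology topologies[] = {
          { "Phoenix",     4, 1, 5 },
          { "Hawk Point",  4, 1, 5 },
          { "Strix Point", 4, 1, 8 },
  };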

Shared L2 Memory
----------------

The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and the memory
tiles. AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2
memory; the AMD Strix Point NPU has a total of 4096 KB. In both cases this
works out to 512 KB of L2 memory per column (2560 KB / 5 and 4096 KB / 8).

Microcontroller
---------------

A microcontroller runs the NPU Firmware, which is responsible for command
processing, XDNA Array partition setup, XDNA Array configuration, workload
context management and workload orchestration.

NPU Firmware uses a dedicated instance of an isolated non-privileged context
called ERT to service each workload context. ERT is also used to execute
user-provided ``ctrlcode`` associated with the workload context.

NPU Firmware uses a single isolated privileged context called MERT to service
management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and the amdxdna driver use a privileged channel for
management tasks like setting up contexts, telemetry, queries, error
handling, setting up user channels, etc. As mentioned before, privileged
channel requests are serviced by MERT. The privileged channel is bound to a
single mailbox.

The microcontroller and the amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.
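
A minimal sketch of how a head/tail mailbox channel of this kind typically
works is shown below. The register layout and names are assumptions for
illustration, not the actual amdxdna protocol::

  #include <linux/io.h>

  /* Hypothetical mailbox register block: the head, tail and ISR
   * registers live in the Mailbox BAR, the ring itself in the SRAM BAR.
   */
  struct mbox_regs {
          u32 head;   /* consumer offset into the ring */
          u32 tail;   /* producer offset into the ring */
          u32 isr;    /* interrupt status, write-1-to-clear */
  };

  /* Post one message: copy the payload into the ring at the tail,
   * then advance the tail so the firmware sees the new message.
   */
  static void mbox_post(struct mbox_regs __iomem *regs,
                        void __iomem *ring, u32 ring_size,
                        const u32 *msg, u32 words)
  {
          u32 tail = readl(&regs->tail);
          u32 i;

          for (i = 0; i < words; i++) {
                  writel(msg[i], ring + tail);
                  tail = (tail + 4) % ring_size;
          }
          writel(tail, &regs->tail);  /* publish to the firmware */
  }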

PCIe EP
-------

The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs
and some MSI-X interrupt vectors. The NPU uses a dedicated high-bandwidth
SoC-level fabric for reading from and writing to host memory. Each instance
of ERT gets its own dedicated MSI-X interrupt. MERT gets a single MSI-X
interrupt.

The number of PCIe BARs varies depending on the specific device. Based on
their functions, PCIe BARs can generally be categorized into the following
types:

* PSP BAR: Exposes the AMD PSP (Platform Security Processor) function
* SMU BAR: Exposes the AMD SMU (System Management Unit) function
* SRAM BAR: Exposes ring buffers for the mailbox
* Mailbox BAR: Exposes the mailbox control registers (head, tail and ISR
  registers, etc.)
* Public Register BAR: Exposes public registers

On a specific device, several of the above BAR types might be combined into
a single physical PCIe BAR, or a module might require two physical PCIe BARs
to be fully functional. For example:

* On the AMD Phoenix device, the PSP, SMU and Public Register BARs are on
  PCIe BAR index 0.
* On the AMD Strix Point device, the Mailbox and Public Register BARs are on
  PCIe BAR index 0, while the PSP has some registers in PCIe BAR index 0
  (Public Register BAR) and some in PCIe BAR index 4 (PSP BAR).
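
Because the physical BAR an interface lives in differs per device, a driver
typically keeps a per-device table mapping logical interfaces to physical
BAR indices. A minimal sketch with hypothetical names and illustrative index
values (only the Phoenix BAR 0 sharing comes from the text above);
``pci_iomap()`` is the standard kernel helper::

  #include <linux/pci.h>

  /* Logical BAR roles; the physical index they map to varies per device */
  enum npu_bar { NPU_BAR_PSP, NPU_BAR_SMU, NPU_BAR_SRAM,
                 NPU_BAR_MBOX, NPU_BAR_PUBLIC, NPU_BAR_MAX };

  /* On Phoenix, PSP/SMU/Public Register share physical BAR 0; the
   * SRAM and Mailbox indices below are illustrative placeholders.
   */
  static const int phoenix_bar_map[NPU_BAR_MAX] = {
          [NPU_BAR_PSP] = 0, [NPU_BAR_SMU] = 0, [NPU_BAR_SRAM] = 2,
          [NPU_BAR_MBOX] = 2, [NPU_BAR_PUBLIC] = 0,
  };

  static void __iomem *npu_map_bar(struct pci_dev *pdev, enum npu_bar bar)
  {
          return pci_iomap(pdev, phoenix_bar_map[bar], 0); /* 0 = whole BAR */
  }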

Process Isolation Hardware
--------------------------

As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller, which programs the column
isolation registers. Each spatial partition is associated with a PASID,
which is also programmed by the microcontroller. Hence multiple spatial
partitions in the NPU can access host memory concurrently, with each access
protected by its PASID.

The NPU FW itself relies on isolated contexts enforced by the
microcontroller's MMU for servicing user and privileged channel requests.
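
The PASID for a workload is typically obtained by binding the process
address space to the device through the kernel's IOMMU SVA interface. A
minimal sketch; whether amdxdna uses exactly this path is not covered by
this document, and the two-argument ``iommu_sva_bind_device()`` form is an
assumption about the kernel version::

  #include <linux/err.h>
  #include <linux/iommu.h>

  /* Bind the process address space to the device and return the PASID
   * that the microcontroller programs into the partition's isolation
   * registers. Sketch only; lifetime management elided.
   */
  static u32 npu_bind_ctx_pasid(struct device *dev, struct mm_struct *mm)
  {
          struct iommu_sva *handle = iommu_sva_bind_device(dev, mm);

          if (IS_ERR(handle))
                  return IOMMU_PASID_INVALID;

          return iommu_sva_get_pasid(handle);
  }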


Mixed Spatial and Temporal Scheduling
=====================================

The AMD XDNA architecture supports mixed spatial and temporal (time sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context while another
partition may be *temporally* bound to more than one workload context. The
microcontroller updates the PASID for a temporally shared partition to match
the context that is bound to the partition at any given moment.

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation
of the 2D array among various workloads. Every workload describes in its
metadata the number of columns required to run its NPU binary. The Resource
Solver uses hints passed by the workload and its own heuristics to decide
the 2D array (re)partition strategy and the mapping of workloads for spatial
and temporal sharing of columns. The FW enforces the context-to-column(s)
resource-binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPUs support 6 concurrent workload
contexts. AMD Strix Point supports 16 concurrent workload contexts.
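
At its core the Resource Solver packs column ranges. The first-fit sketch
below is a deliberate simplification with hypothetical names, not the
driver's actual heuristics::

  #define NPU_MAX_COLUMNS 8   /* e.g. Strix Point */

  /* First-fit search for a contiguous range of 'ncols' free columns.
   * 'busy' is a bitmask of exclusively bound columns. Returns the start
   * column, or -1 if the workload must fall back to temporal sharing
   * of an already-bound partition.
   */
  static int solver_alloc_columns(unsigned int busy, unsigned int ncols)
  {
          unsigned int start;

          for (start = 0; start + ncols <= NPU_MAX_COLUMNS; start++) {
                  unsigned int mask = ((1u << ncols) - 1) << start;

                  if (!(busy & mask))
                          return start;
          }
          return -1;  /* no free spatial partition */
  }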


Application Binaries
====================

An NPU application workload comprises two separate binaries, which are
generated by the NPU compiler.

1. The AMD XDNA Array overlay, which is used to configure an NPU spatial
   partition. The overlay contains instructions for setting up the stream
   switch configuration and ELF for the compute tiles. The overlay is loaded
   on the spatial partition bound to the workload by the associated ERT
   instance. Refer to the
   `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
   partition. ``ctrlcode`` is executed by the ERT running in protected mode
   on the microcontroller in the context of the workload. ``ctrlcode`` is
   made up of a sequence of opcodes named ``XAie_TxnOpcode`` (an illustrative
   subset is sketched after this list). Refer to the
   `AI Engine Run Time`_ for more details.
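
A few ``XAie_TxnOpcode`` values commonly seen in a ``ctrlcode`` stream are
listed below for orientation. This is an illustrative subset reconstructed
from the aie-rt project, not an authoritative enumeration; consult the
`AI Engine Run Time`_ source for the real definition and numeric values::

  /* Illustrative subset of XAie_TxnOpcode */
  enum XAie_TxnOpcode_subset {
          XAIE_IO_WRITE,      /* single register write */
          XAIE_IO_BLOCKWRITE, /* block write, e.g. DMA descriptor setup */
          XAIE_IO_MASKWRITE,  /* read-modify-write under a mask */
          XAIE_IO_MASKPOLL,   /* poll a register until a value matches */
  };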


Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host-resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The
``ctrlcode`` used by the workload is copied into this special memory. This
buffer is protected by PASID like all other input/output buffers used by
that workload. The instruction buffer is also mapped into the user address
space of the workload.
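
From userspace, copying the ``ctrlcode`` into the instruction buffer amounts
to writing through a mapping of that buffer. A minimal sketch; the mmap
offset below is hypothetical, as the real offset/handle comes from the
driver's buffer-object interface::

  #include <string.h>
  #include <sys/mman.h>
  #include <sys/types.h>

  #define INSTR_BUF_SIZE (64UL << 20)  /* 64 MB, per the text above */

  /* 'fd' is the opened device node; 'mmap_off' is a hypothetical offset
   * returned by the driver when the instruction buffer was created.
   */
  static void *load_ctrlcode(int fd, off_t mmap_off,
                             const void *ctrlcode, size_t len)
  {
          void *buf = mmap(NULL, INSTR_BUF_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, mmap_off);

          if (buf == MAP_FAILED)
                  return NULL;
          if (len > INSTR_BUF_SIZE) {
                  munmap(buf, INSTR_BUF_SIZE);
                  return NULL;
          }
          memcpy(buf, ctrlcode, len);  /* ERT reads it from here */
          return buf;
  }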

Global Privileged Buffer
------------------------

The driver also allocates a single global buffer for maintenance tasks like
recording errors from MERT. This buffer uses the global IOMMU domain and is
only accessible by MERT.


High-level Use Flow
===================

Here are the steps to run a workload on AMD NPU (a pseudo-C summary follows
the list):

1. Compile the workload into an overlay and a ``ctrlcode`` binary.
2. Userspace opens a context in the driver and provides the overlay.
3. The driver checks with the Resource Solver for provisioning a set of
   columns for the workload.
4. The driver then asks MERT to create a context on the device with the
   desired columns.
5. MERT then creates an instance of ERT. MERT also maps the Instruction
   Buffer into ERT memory.
6. Userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7. Userspace then creates a command buffer with pointers to the input,
   output, and instruction buffers; it submits the command buffer to the
   driver and goes to sleep waiting for completion.
8. The driver sends the command over the Mailbox to ERT.
9. ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR
    while the AMD XDNA Array is running.
11. When ERT reaches the end of the ``ctrlcode``, it raises an MSI-X
    interrupt to signal completion to the driver, which then wakes up the
    waiting workload.
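
The flow can be condensed into pseudo-C. Every function below is
hypothetical shorthand for the corresponding driver/UMD operation, not an
actual amdxdna or XRT entry point::

  #include <stddef.h>
  #include <string.h>

  struct cmd { void *in, *out, *instr; };

  /* Hypothetical stand-ins for UMD/driver operations */
  extern int   open_context(const void *overlay);      /* steps 2-5  */
  extern void *map_instruction_buffer(int ctx);        /* step 5     */
  extern void  submit(int ctx, const struct cmd *cmd); /* steps 7-8  */
  extern int   wait_for_completion(int ctx);           /* steps 9-11 */

  int run_workload(const void *overlay, const void *ctrlcode,
                   size_t len, void *in, void *out)
  {
          int ctx = open_context(overlay);
          void *ibuf = map_instruction_buffer(ctx);
          struct cmd cmd = { .in = in, .out = out, .instr = ibuf };

          memcpy(ibuf, ctrlcode, len);      /* step 6 */
          submit(ctx, &cmd);                /* sent over the mailbox */
          return wait_for_completion(ctx);  /* woken by MSI-X completion */
  }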


Boot Flow
=========

The amdxdna driver uses the PSP to securely load the signed NPU FW and kick
off the boot of the NPU microcontroller. The driver then waits for the alive
signal at a special location in BAR 0. The NPU is switched off during SoC
suspend and turned on after resume, at which point the NPU FW is reloaded
and the handshake is performed again.
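
Waiting for the alive signal is a plain register poll. A sketch assuming a
hypothetical register offset and magic value; ``readl_poll_timeout()`` is
the standard kernel helper for this pattern::

  #include <linux/iopoll.h>

  #define NPU_ALIVE_OFF   0x0  /* illustrative, not the real offset */
  #define NPU_ALIVE_MAGIC 0x1  /* illustrative, not the real value  */

  static int npu_wait_alive(void __iomem *bar0)
  {
          u32 val;

          /* poll every 100 us, give up after 1 s */
          return readl_poll_timeout(bar0 + NPU_ALIVE_OFF, val,
                                    val == NPU_ALIVE_MAGIC,
                                    100, 1000000);
  }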


Userspace components
====================

Compiler
--------

Peano is an LLVM-based open-source compiler for AMD XDNA Array compute
tiles, available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for
AMD NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for the NPU can be found at:
https://github.com/amd/xdna-driver


DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, it
triggers DMA operations between host DDR and L2 memory.
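
As an illustration, a ``XAIE_IO_BLOCKWRITE`` record can be pictured as an
opcode header followed by a destination address and payload words. This
layout is a guess for exposition only; the authoritative encoding lives in
the aie-rt sources::

  #include <linux/types.h>

  /* Purely illustrative picture of a BLOCKWRITE record */
  struct xaie_blockwrite_rec {
          u32 opcode;    /* XAIE_IO_BLOCKWRITE */
          u32 size;      /* record size in bytes, including payload */
          u64 reg_addr;  /* destination register/window address */
          u32 payload[]; /* values written, e.g. DMA descriptor setup */
  };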


Error Handling
==============

When MERT detects an error in the AMD XDNA Array, it pauses execution for
the affected workload context and sends an asynchronous message to the
driver over the privileged channel. The driver then sends a buffer pointer
to MERT to capture the register states of the partition bound to the
faulting workload context. The driver then decodes the error by reading the
contents of that buffer.
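
A sketch of the capture handshake from the driver's side; all names below
are illustrative stand-ins, not actual amdxdna symbols::

  #include <linux/dma-mapping.h>
  #include <linux/types.h>

  struct npu_dev;  /* hypothetical driver state */
  struct npu_err_event { u32 ctx_id; u32 err_code; };

  /* Hypothetical helpers for the privileged-channel plumbing */
  extern void *alloc_capture_buf(struct npu_dev *ndev, dma_addr_t *dma);
  extern void send_capture_request(struct npu_dev *ndev, u32 ctx,
                                   dma_addr_t dma);
  extern void wait_capture_done(struct npu_dev *ndev);
  extern void decode_error(struct npu_dev *ndev, const void *buf, u32 code);

  static void npu_on_async_error(struct npu_dev *ndev,
                                 const struct npu_err_event *ev)
  {
          dma_addr_t cap;
          void *buf = alloc_capture_buf(ndev, &cap);

          /* ask MERT to dump the partition's register state into 'buf' */
          send_capture_request(ndev, ev->ctx_id, cap);
          wait_capture_done(ndev);
          decode_error(ndev, buf, ev->err_code);
  }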


Telemetry
=========

MERT can report various kinds of telemetry information, such as the
following (a hypothetical record layout is sketched after the list):

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
* etc.
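
A hypothetical layout for such a counter report; the field names follow the
list above but this is not the driver's actual telemetry format::

  #include <linux/types.h>

  /* Hypothetical telemetry record mirroring the counters listed above */
  struct npu_telemetry {
          u64 l1_interrupts;       /* L1 interrupt counter */
          u64 dma_ops;             /* DMA counter */
          u64 deep_sleep_entries;  /* Deep Sleep counter */
  };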


References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_