
Commit 96c8470

Merge tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel
Pull drm updates from Dave Airlie:
 "There are two external interactions of note: the msm tree pulls in part
  of the opp tree; hopefully the opp tree arrives from the same git tree
  it normally does. There is also a new cgroup controller for device
  memory that is used by drm, so it is merging through my tree. This will
  hopefully help open up gpu cgroup usage a bit more and move us forward.

  There is a new accelerator driver for the AMD XDNA Ryzen AI NPUs. Then
  the usual xe/amdgpu/i915/msm leaders, and lots of changes and refactors
  across the board:

  core:
   - device memory cgroup controller added
   - Remove driver date from drm_driver
   - Add drm_printer based hex dumper
   - drm memory stats docs update
   - scheduler documentation improvements

  new driver:
   - amdxdna - Ryzen AI NPU support

  connector:
   - add a mutex to protect ELD
   - make connector setup two-step

  panels:
   - Introduce backlight quirks infrastructure
   - New panels: KDB KD116N2130B12, Tianma TM070JDHG34-00,
     Multi-Inno Technology MI1010Z1T-1CP11

  bridge:
   - ti-sn65dsi83: Add ti,lvds-vod-swing optional properties
   - Provide default implementation of atomic_check for HDMI bridges
   - it6505: HDCP improvements, MCCS Support

  xe:
   - make OA buffer size configurable
   - GuC capture fixes
   - add ufence and g2h flushes
   - restore system memory GGTT mappings
   - ioctl fixes
   - SRIOV PF scheduling priority
   - allow fault injection
   - lots of improvements/refactors
   - Enable GuC's WA_DUAL_QUEUE for newer platforms
   - IRQ related fixes and improvements

  i915:
   - More accurate engine busyness metrics with GuC submission
   - Ensure partial BO segment offset never exceeds allowed max
   - Flush GuC CT receive tasklet during reset preparation
   - Some DG2 refactor to fix DG2 bugs when operating with certain CPUs
   - Fix DG1 power gate sequence
   - Enabling uncompressed 128b/132b UHBR SST
   - Handle hdmi connector init failures, and no HDMI/DP cases
   - More robust engine resets on Haswell and older

  i915/xe display:
   - HDCP fixes for Xe3Lpd
   - New GSC FW ARL-H/ARL-U
   - support 3 VDSC engines 12 slices
   - MBUS joining sanitisation
   - reconcile i915/xe display power mgmt
   - Xe3Lpd fixes
   - UHBR rates for Thunderbolt

  amdgpu:
   - DRM panic support
   - track BO memory stats at runtime
   - Fix max surface handling in DC
   - Cleaner shader support for gfx10.3 dGPUs
   - fix drm buddy trim handling
   - SDMA engine reset updates
   - Fix doorbell ttm cleanup
   - RAS updates
   - ISP updates
   - SDMA queue reset support
   - Rework DPM powergating interfaces
   - Documentation updates and cleanups
   - DCN 3.5 updates
   - Use a pm notifier to more gracefully handle VRAM eviction on
     suspend or hibernate
   - Add debugfs interfaces for forcing scheduling to specific engine
     instances
   - GG 9.5 updates
   - IH 4.4 updates
   - Make missing optional firmware less noisy
   - PSP 13.x updates
   - SMU 13.x updates
   - VCN 5.x updates
   - JPEG 5.x updates
   - GC 12.x updates
   - DC FAMS updates

  amdkfd:
   - GG 9.5 updates
   - Logging improvements
   - Shader debugger fixes
   - Trap handler cleanup
   - Cleanup includes
   - Eviction fence wq fix

  msm:
   - MDSS:
      - properly described UBWC registers
      - added SM6150 (aka QCS615) support
   - DPU:
      - added SM6150 (aka QCS615) support
      - enabled wide planes if virtual planes are enabled (by using two
        SSPPs for a single plane)
      - added CWB hardware blocks support
   - DSI:
      - added SM6150 (aka QCS615) support
   - GPU:
      - Print GMU core fw version
      - GMU bandwidth voting for a740 and a750
      - Expose uche trap base via uapi
      - UAPI error reporting

  rcar-du:
   - Add r8a779h0 Support

  ivpu:
   - Fix qemu crash when using passthrough

  nouveau:
   - expose GSP-RM logging buffers via debugfs

  panfrost:
   - Add MT8188 Mali-G57 MC3 support

  rockchip:
   - Gamma LUT support

  hisilicon:
   - new HIBMC support

  virtio-gpu:
   - convert to helpers
   - add prime support for scanout buffers

  v3d:
   - Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL

  vc4:
   - Add support for BCM2712

  vkms:
   - line-per-line compositing algorithm to improve performance

  zynqmp:
   - Add DP audio support

  mediatek:
   - dp: Add sdp path reset
   - dp: Support flexible length of DP calibration data

  etnaviv:
   - add fdinfo memory support
   - add explicit reset handling"

* tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel: (1070 commits)
  drm/bridge: fix documentation for the hdmi_audio_prepare() callback
  doc/cgroup: Fix title underline length
  drm/doc: Include new drm-compute documentation
  cgroup/dmem: Fix parameters documentation
  cgroup/dmem: Select PAGE_COUNTER
  kernel/cgroup: Remove the unused variable climit
  drm/display: hdmi: Do not read EDID on disconnected connectors
  drm/tests: hdmi: Add connector disablement test
  drm/connector: hdmi: Do atomic check when necessary
  drm/amd/display: 3.2.316
  drm/amd/display: avoid reset DTBCLK at clock init
  drm/amd/display: improve dpia pre-train
  drm/amd/display: Apply DML21 Patches
  drm/amd/display: Use HW lock mgr for PSR1
  drm/amd/display: Revised for Replay Pseudo vblank control
  drm/amd/display: Add a new flag for replay low hz
  drm/amd/display: Remove unused read_ono_state function from Hwss module
  drm/amd/display: Do not elevate mem_type change to full update
  drm/amd/display: Do not wait for PSR disable on vbl enable
  drm/amd/display: Remove unnecessary eDP power down
  ...
2 parents c0e7590 + 951a6bf commit 96c8470

File tree

1,178 files changed (+50,401 / −14,781 lines)

Documentation/accel/amdxdna/amdnpu.rst

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
.. SPDX-License-Identifier: GPL-2.0-only

.. include:: <isonum.txt>

=========
AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <[email protected]>

Overview
========

AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of
Machine Learning applications like CNNs, LLMs, etc. It is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.

Hardware Description
====================

AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

The AMD XDNA Array comprises a 2D array of compute and memory tiles built
with `AMD AI Engine Technology`_. Each column has four rows of compute tiles
and one row of memory tiles. Each compute tile contains a VLIW processor with
its own dedicated program and data memory. The memory tile acts as L2 memory.
The 2D array can be partitioned at a column boundary, creating a spatially
isolated partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
the memory tile.

AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows
of compute tiles arranged into 5 columns. AMD Strix Point client APUs have a
4x8 topology, i.e., 4 rows of compute tiles arranged into 8 columns.

Shared L2 Memory
----------------

The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and the memory
tiles. AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2
memory, while AMD Strix Point NPUs have a total of 4096 KB, i.e., 512 KB per
column in both cases.
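
As a quick sanity check, the totals follow directly from the column counts.
A minimal sketch (illustrative only; the 512 KB-per-column figure is inferred
from the totals quoted above, not taken from the hardware manuals)::

  #include <stdio.h>

  /* Derive the L2 pool size from the column count, assuming one 512 KB
   * memory tile per column as inferred above.
   */
  #define L2_PER_COLUMN_KB 512

  static unsigned int total_l2_kb(unsigned int columns)
  {
          return columns * L2_PER_COLUMN_KB;
  }

  int main(void)
  {
          printf("Phoenix/Hawk Point (5 cols): %u KB\n", total_l2_kb(5)); /* 2560 */
          printf("Strix Point (8 cols):        %u KB\n", total_l2_kb(8)); /* 4096 */
          return 0;
  }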
Microcontroller
---------------

A microcontroller runs the NPU Firmware, which is responsible for command
processing, XDNA Array partition setup, XDNA Array configuration, workload
context management and workload orchestration.

NPU Firmware uses a dedicated instance of an isolated non-privileged context
called ERT to service each workload context. ERT is also used to execute
user-provided ``ctrlcode`` associated with the workload context.

NPU Firmware uses a single isolated privileged context called MERT to service
management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and the amdxdna driver use a privileged channel for
management tasks like setting up contexts, telemetry, queries, error
handling, setting up the user channel, etc. As mentioned before, privileged
channel requests are serviced by MERT. The privileged channel is bound to a
single mailbox.

The microcontroller and the amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.

PCIe EP
-------

The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs
and some MSI-X interrupt vectors. The NPU uses a dedicated high-bandwidth
SoC-level fabric for reading from or writing into host memory. Each instance
of ERT gets its own dedicated MSI-X interrupt. MERT gets a single MSI-X
interrupt.

The number of PCIe BARs varies depending on the specific device. Based on
their functions, PCIe BARs can generally be categorized into the following
types:

* PSP BAR: exposes the AMD PSP (Platform Security Processor) function
* SMU BAR: exposes the AMD SMU (System Management Unit) function
* SRAM BAR: exposes ring buffers for the mailboxes
* Mailbox BAR: exposes the mailbox control registers (head, tail and ISR
  registers, etc.)
* Public Register BAR: exposes public registers

On specific devices, the above-mentioned BAR types might be combined into a
single physical PCIe BAR, or a module might require two physical PCIe BARs to
be fully functional. For example:

* On AMD Phoenix devices, the PSP, SMU and Public Register BARs are on PCIe
  BAR index 0.
* On AMD Strix Point devices, the Mailbox and Public Register BARs are on
  PCIe BAR index 0. The PSP has some registers in PCIe BAR index 0 (Public
  Register BAR) and PCIe BAR index 4 (PSP BAR).
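
Putting the Mailboxes and PCIe EP descriptions together, the
channel-to-mailbox binding can be pictured as follows. This is a hypothetical
sketch — the struct and field names are invented for illustration, not the
amdxdna driver's actual types::

  #include <stdint.h>

  /* Hypothetical types illustrating the channel/mailbox binding; head,
   * tail and ISR registers live in the Mailbox BAR, while the ring
   * buffer is exposed through the SRAM BAR.
   */
  struct npu_mailbox {
          volatile uint32_t *head;
          volatile uint32_t *tail;
          volatile uint32_t *isr;
          void *ring;
  };

  struct npu_channel {
          struct npu_mailbox mbox;  /* each channel owns one mailbox */
          int privileged;           /* 1: serviced by MERT; 0: by an ERT */
  };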

Process Isolation Hardware
--------------------------

As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller programming the column isolation
registers. Each spatial partition is associated with a PASID, which is also
programmed by the microcontroller. Hence multiple spatial partitions in the
NPU can make concurrent host accesses, each protected by its PASID.

The NPU FW itself uses microcontroller-MMU-enforced isolated contexts for
servicing user and privileged channel requests.

Mixed Spatial and Temporal Scheduling
=====================================

The AMD XDNA architecture supports mixed spatial and temporal (time-sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context, while another
partition may be *temporally* bound to more than one workload context. The
microcontroller updates the PASID for a temporally shared partition to match
the context that is bound to the partition at any given moment.

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation of
the 2D array among various workloads. Every workload describes in its
metadata the number of columns required to run its NPU binary. The Resource
Solver uses hints passed by the workload and its own heuristics to decide the
2D array (re)partition strategy and the mapping of workloads for spatial and
temporal sharing of columns. The FW enforces the context-to-column(s)
resource binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPUs can support 6 concurrent workload
contexts. AMD Strix Point can support 16 concurrent workload contexts.
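The driver's actual heuristics are internal; purely to illustrate the kind of
decision the Resource Solver makes, a first-fit allocator over a column
bitmap might look like this (a sketch under stated assumptions — the function
and variable names are invented)::

  #include <stdint.h>

  /* First-fit search for `want` contiguous free columns in a bitmap of
   * `ncols` columns (5 on Phoenix/Hawk Point, 8 on Strix Point).
   */
  static int alloc_columns(uint32_t *busy, unsigned int ncols,
                           unsigned int want)
  {
          for (unsigned int start = 0; start + want <= ncols; start++) {
                  uint32_t mask = ((1u << want) - 1) << start;

                  if (!(*busy & mask)) {  /* contiguous free run found */
                          *busy |= mask;  /* bind columns to this context */
                          return (int)start;
                  }
          }
          return -1;  /* no spatial fit; fall back to temporal sharing */
  }
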
Application Binaries
====================

An NPU application workload comprises two separate binaries, both generated
by the NPU compiler:

1. The AMD XDNA Array overlay, which is used to configure an NPU spatial
   partition. The overlay contains instructions for setting up the stream
   switch configuration and ELFs for the compute tiles. The overlay is
   loaded on the spatial partition bound to the workload by the associated
   ERT instance. Refer to the
   `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more
   details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
   partition. ``ctrlcode`` is executed by the ERT running in protected mode
   on the microcontroller in the context of the workload. ``ctrlcode`` is
   made up of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
   `AI Engine Run Time`_ for more details.
Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host-resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The
``ctrlcode`` used by the workload is copied into this special memory. This
buffer is protected by PASID, like all other input/output buffers used by
that workload. The instruction buffer is also mapped into the user space of
the workload.

Global Privileged Buffer
------------------------

In addition, the driver allocates a single buffer for maintenance tasks like
recording errors from MERT. This global buffer uses the global IOMMU domain
and is only accessible by MERT.

High-level Use Flow
===================

Here are the steps to run a workload on the AMD NPU (an illustrative
userspace sketch follows the list):

1. Compile the workload into an overlay and a ``ctrlcode`` binary.
2. Userspace opens a context in the driver and provides the overlay.
3. The driver checks with the Resource Solver for provisioning a set of
   columns for the workload.
4. The driver then asks MERT to create a context on the device with the
   desired columns.
5. MERT then creates an instance of ERT. MERT also maps the Instruction
   Buffer into ERT memory.
6. Userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7. Userspace then creates a command buffer with pointers to the input,
   output and instruction buffers; it submits the command buffer to the
   driver and goes to sleep waiting for completion.
8. The driver sends the command over the Mailbox to ERT.
9. ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from host DDR while
    the AMD XDNA Array is running.
11. When ERT reaches the end of the ``ctrlcode``, it raises an MSI-X
    interrupt to send a completion signal to the driver, which then wakes up
    the waiting workload.
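In code, the flow above might be driven as follows. This is a schematic
sketch only: the ``npu_*`` helpers are hypothetical stand-ins, since real
applications go through the XRT runtime stack and the driver's DRM ioctls
rather than calls with these names::

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Hypothetical helpers standing in for XRT / amdxdna uAPI calls */
  extern uint32_t npu_create_context(int fd, const void *overlay);
  extern void *npu_map_instruction_buffer(int fd, uint32_t ctx);
  extern int npu_submit(int fd, uint32_t ctx, const void *in, void *out);
  extern int npu_wait(int fd, uint32_t ctx);

  int run_workload(int fd, const void *overlay,
                   const void *ctrlcode, size_t ctrlcode_size,
                   const void *in, void *out)
  {
          /* Steps 2-5: open a context; the driver and MERT provision
           * columns, spawn an ERT instance and map the instruction buffer.
           */
          uint32_t ctx = npu_create_context(fd, overlay);

          /* Step 6: copy ctrlcode into the per-context instruction buffer */
          void *instr = npu_map_instruction_buffer(fd, ctx);
          memcpy(instr, ctrlcode, ctrlcode_size);

          /* Steps 7-11: submit and sleep until ERT's completion MSI-X */
          npu_submit(fd, ctx, in, out);
          return npu_wait(fd, ctx);
  }
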
Boot Flow
=========

The amdxdna driver uses the PSP to securely load the signed NPU FW and kick
off the boot of the NPU microcontroller. The driver then waits for the alive
signal at a special location on BAR 0. The NPU is switched off during SoC
suspend and turned on after resume, at which point the NPU FW is reloaded
and the handshake is performed again.
Userspace components
====================

Compiler
--------

Peano is an LLVM-based open-source compiler for the AMD XDNA Array compute
tile, available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for
the AMD NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for the NPU can be found at:
https://github.com/amd/xdna-driver
DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
operations between host DDR and L2 memory are effected.
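
For orientation, ``XAIE_IO_BLOCKWRITE`` is one opcode among several in the
``XAie_TxnOpcode`` stream. A simplified subset (names as recalled from the
aie-rt project — treat as an assumption and check the aie-rt sources; the
real enum carries more members)::

  /* Simplified subset of the ctrlcode opcode space; an assumption — see
   * the aie-rt project for the authoritative XAie_TxnOpcode definition.
   */
  enum xaie_txn_opcode_subset {
          XAIE_IO_WRITE,       /* single register write */
          XAIE_IO_BLOCKWRITE,  /* block write; kicks off DMA transfers */
          XAIE_IO_MASKWRITE,   /* read-modify-write under a mask */
  };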

Error Handling
==============

When MERT detects an error in the AMD XDNA Array, it pauses execution for
that workload context and sends an asynchronous message to the driver over
the privileged channel. The driver then sends a buffer pointer to MERT to
capture the register states of the partition bound to the faulting workload
context. The driver then decodes the error by reading the contents of that
buffer.

Telemetry
=========

MERT can report various kinds of telemetry information, for example:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
* etc.

References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_

Documentation/accel/amdxdna/index.rst

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0-only

=====================================
accel/amdxdna NPU driver
=====================================

The accel/amdxdna driver supports the AMD NPU (Neural Processing Unit).

.. toctree::

   amdnpu

Documentation/accel/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Compute Accelerators
    :maxdepth: 1

    introduction
+   amdxdna/index
    qaic/index

 .. only:: subproject and html

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 51 additions & 7 deletions
@@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`
     5-6. Device
     5-7. RDMA
       5-7-1. RDMA Interface Files
-    5-8. HugeTLB
-      5.8-1. HugeTLB Interface Files
-    5-9. Misc
-      5.9-1 Miscellaneous cgroup Interface Files
-      5.9-2 Migration and Ownership
-    5-10. Others
-      5-10-1. perf_event
+    5-8. DMEM
+    5-9. HugeTLB
+      5.9-1. HugeTLB Interface Files
+    5-10. Misc
+      5.10-1 Miscellaneous cgroup Interface Files
+      5.10-2 Migration and Ownership
+    5-11. Others
+      5-11-1. perf_event
     5-N. Non-normative information
       5-N-1. CPU controller root cgroup process behaviour
       5-N-2. IO controller root cgroup process behaviour
@@ -2626,6 +2627,49 @@ RDMA Interface Files
 	  mlx4_0 hca_handle=1 hca_object=20
 	  ocrdma1 hca_handle=1 hca_object=23

+DMEM
+----
+
+The "dmem" controller regulates the distribution and accounting of device
+memory regions. Because each memory region may have its own page size, which
+does not have to be equal to the system page size, the units are always
+bytes.
+
+DMEM Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  dmem.max, dmem.min, dmem.low
+	A read-write nested-keyed file that exists for all cgroups except
+	the root and describes the currently configured resource limit for
+	a region.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 1073741824
+	  drm/0000:03:00.0/stolen max
+
+	The semantics are the same as for the memory cgroup controller,
+	and are calculated in the same way.
+
+  dmem.capacity
+	A read-only file that describes the maximum region capacity. It
+	only exists on the root cgroup. Not all memory can be allocated by
+	cgroups, as the kernel reserves some for internal use.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 8514437120
+	  drm/0000:03:00.0/stolen 67108864
+
+  dmem.current
+	A read-only file that describes current resource usage. It exists
+	for all cgroups except the root.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 12550144
+	  drm/0000:03:00.0/stolen 8650752
+
 HugeTLB
 -------
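
For concreteness, configuring a limit from userspace is just a write of a
region/value pair into ``dmem.max``. A minimal sketch, assuming a cgroup
named ``mygroup`` and the xe region names shown above::

  #include <stdio.h>

  /* Limit mygroup's VRAM on the xe device above to 1 GiB; the cgroup
   * path and region name are examples, not fixed names.
   */
  int main(void)
  {
          FILE *f = fopen("/sys/fs/cgroup/mygroup/dmem.max", "w");

          if (!f)
                  return 1;
          fprintf(f, "drm/0000:03:00.0/vram0 1073741824\n");
          return fclose(f) ? 1 : 0;
  }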

Documentation/core-api/cgroup.rst

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
==================
Cgroup Kernel APIs
==================

Device Memory Cgroup API (dmemcg)
=================================
.. kernel-doc:: kernel/cgroup/dmem.c
   :export:
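
A driver-side usage sketch of the dmemcg API documented above. The call
names and signatures below are assumptions recalled from the dmem patchset —
verify them against the generated kernel-doc for kernel/cgroup/dmem.c before
relying on them::

  #include <linux/cgroup_dmem.h>
  #include <linux/err.h>
  #include <linux/types.h>

  static struct dmem_cgroup_region *vram_region;

  /* Register a region whose name matches the "drm/<dev>/vram0" keys
   * shown in the cgroup-v2 examples (assumed printf-style API).
   */
  static int mydrv_register_vram(const char *pci_name, u64 vram_size)
  {
          vram_region = dmem_cgroup_register_region(vram_size,
                                                    "drm/%s/vram0",
                                                    pci_name);
          return IS_ERR(vram_region) ? PTR_ERR(vram_region) : 0;
  }

  /* Charge the calling task's cgroup before carving out VRAM */
  static int mydrv_charge_vram(u64 size,
                               struct dmem_cgroup_pool_state **pool)
  {
          struct dmem_cgroup_pool_state *limit_pool;

          return dmem_cgroup_try_charge(vram_region, size,
                                        pool, &limit_pool);
  }

  static void mydrv_uncharge_vram(struct dmem_cgroup_pool_state *pool,
                                  u64 size)
  {
          dmem_cgroup_uncharge(pool, size);
  }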

Documentation/core-api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -109,6 +109,7 @@ more memory-management documentation in Documentation/mm/index.rst.
    dma-isa-lpc
    swiotlb
    mm-api
+   cgroup
    genalloc
    pin_user_pages
    boot-time-mm
