@@ -2378,6 +2378,9 @@ are deprecated and should not be used.
       ======== ============================== ======================================
       "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
                                               binary format.
+      "AMDGPU" ``NT_AMDGPU_KFD_CORE_STATE``   Snapshot of the runtime, agent, and
+                                              queue state for use in a core dump.
+                                              See :ref:`amdgpu_corefile_note`.
       ======== ============================== ======================================

..
@@ -2390,6 +2393,7 @@ are deprecated and should not be used.
       ============================== =====
       *reserved*                     0-31
       ``NT_AMDGPU_METADATA``         32
+      ``NT_AMDGPU_KFD_CORE_STATE``   33
       ============================== =====

``NT_AMDGPU_METADATA``
@@ -15024,6 +15028,115 @@ instructions are handled as follows:
  trap handler installed.
  =============== =============== ===========================================

+ Core file format
+ ================
+
+ This section describes the format of core files supporting AMDGPU. Core dumps
+ for an AMDGPU program come in two flavors: split or unified core files.
+
+ The split layout consists of one host core file, containing the information
+ needed to rebuild the image of the host process, and one AMDGPU core file,
+ containing the information for the AMDGPU agents used in the process. The
+ AMDGPU core file consists of:
+
+ * A note describing the state of the AMDGPU agents, AMDGPU queues, and AMDGPU
+   runtime for the process (see :ref:`amdgpu_corefile_note`).
+ * A list of load segments containing an image of the AMDGPU agents' memory (see
+   :ref:`amdgpu_corefile_memory`).
+
+ The unified core file is the union of all the information contained in the two
+ files of the split layout (all notes and load segments). It contains all the
+ information required to reconstruct the image of the process across all the
+ agents.
+
+ Core file header
+ ----------------
+
+ An AMDGPU core file is an ``ELF64`` core file. The content of the header
+ differs between the unified layout and the split layout's AMDGPU core file.
+
+ Split files
+ ~~~~~~~~~~~
+
+ In the split files layout, the AMDGPU core file is an ``ELF64`` file with the
+ header configured as described in :ref:`amdgpu-corefile-headers-table`:
+
+   .. table:: AMDGPU corefile headers
+      :name: amdgpu-corefile-headers-table
+
+      ========================== ===================================
+      Field                      Value
+      ========================== ===================================
+      ``e_ident[EI_CLASS]``      ``ELFCLASS64`` (``0x2``)
+      ``e_ident[EI_DATA]``       ``ELFDATA2LSB`` (``0x1``)
+      ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA`` (``0x40``)
+      ``e_type``                 ``ET_CORE`` (``0x4``)
+      ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA_5``
+      ``e_machine``              ``EM_AMDGPU`` (``0xe0``)
+      ========================== ===================================
+
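As an illustration of the header table above, the following is a minimal sketch of how a tool might recognize the AMDGPU file of a split core dump. The function name ``is_split_amdgpu_core`` is hypothetical, and the ``e_ident`` indices and field offsets are taken from the generic ELF64 layout rather than from this document. ``EI_ABIVERSION`` is not checked because the table gives no numeric value for it.

```python
import struct

# Values from the header table; the e_ident indices are the generic ELF ones.
EI_CLASS, EI_DATA, EI_OSABI = 4, 5, 7
ELFCLASS64 = 0x2
ELFDATA2LSB = 0x1
ELFOSABI_AMDGPU_HSA = 0x40
ET_CORE = 0x4
EM_AMDGPU = 0xE0


def is_split_amdgpu_core(header: bytes) -> bool:
    """Return True if `header` matches the split-layout AMDGPU header table."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return False
    if header[EI_CLASS] != ELFCLASS64 or header[EI_DATA] != ELFDATA2LSB:
        return False
    if header[EI_OSABI] != ELFOSABI_AMDGPU_HSA:
        return False
    # e_type and e_machine are two little-endian 16-bit fields after e_ident.
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    return e_type == ET_CORE and e_machine == EM_AMDGPU
```

A unified core file would fail this check by design, since its header describes the host architecture.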
+ Unified file
+ ~~~~~~~~~~~~
+
+ In the unified core file layout, the ``ELF64`` headers are set to describe
+ the host architecture and process.
+
+ .. _amdgpu_corefile_note:
+
+ Core file notes
+ ---------------
+
+ An AMDGPU core file must contain one snapshot note in a ``PT_NOTE`` segment.
+ When using a split core file layout, this note is in the AMDGPU file.
+
+ The note record vendor field is "``AMDGPU``" and the record type is
+ "``NT_AMDGPU_KFD_CORE_STATE``" (see :ref:`amdgpu-note-records-v3-onwards`).
+
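To make the lookup concrete, here is a hedged sketch of scanning the bytes of a ``PT_NOTE`` segment for the snapshot note. It assumes the conventional ELF note record layout (three 32-bit words, then the name and descriptor, each padded to 4 bytes) and little-endian data; neither is stated in this section, and some producers pad notes differently. The helper name ``find_snapshot_note`` is hypothetical.

```python
import struct

NT_AMDGPU_KFD_CORE_STATE = 33  # value from the note type table in this document


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (assumed note padding)."""
    return (n + 3) & ~3


def find_snapshot_note(segment: bytes):
    """Return the descriptor bytes of the first AMDGPU snapshot note, or None."""
    offset = 0
    while offset + 12 <= len(segment):
        namesz, descsz, ntype = struct.unpack_from("<III", segment, offset)
        name_start = offset + 12
        desc_start = name_start + _align4(namesz)
        # The vendor name is stored NUL-terminated; strip the terminator.
        name = segment[name_start:name_start + namesz].rstrip(b"\0")
        if name == b"AMDGPU" and ntype == NT_AMDGPU_KFD_CORE_STATE:
            return segment[desc_start:desc_start + descsz]
        offset = desc_start + _align4(descsz)
    return None
```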
+ The content of the note is defined in
+ :ref:`amdgpu-core-snapshot-note-layout-table-v1`:
+
+   .. table:: AMDGPU snapshot note format V1
+      :name: amdgpu-core-snapshot-note-layout-table-v1
+
+      ================================ ======================================= ======================= ============== ===========================
+      Field                            Type                                    Size (bytes)            Byte alignment Comment
+      ================================ ======================================= ======================= ============== ===========================
+      ``version_major``                ``uint32``                              4                       4              ``KFD_IOCTL_MAJOR_VERSION``
+      ``version_minor``                ``uint32``                              4                       4              ``KFD_IOCTL_MINOR_VERSION``
+      ``runtime_info_size``            ``uint64``                              8                       8              Must be a multiple of 8
+      ``n_agents``                     ``uint32``                              4                       8
+      ``agent_info_entry_size``        ``uint32``                              4                       4              Must be a multiple of 8
+      ``n_queues``                     ``uint32``                              4                       8
+      ``queue_info_entry_size``        ``uint32``                              4                       4              Must be a multiple of 8
+      ``runtime_info``                 ``kfd_runtime_info``                    ``runtime_info_size``   8
+      ``agents_info``                  ``kfd_dbg_device_info_entry[n_agents]`` ``n_agents *            8
+                                                                               agent_info_entry_size``
+      ``queues_info``                  ``kfd_queue_snapshot_entry[n_queues]``  ``n_queues *            8
+                                                                               queue_info_entry_size``
+      ================================ ======================================= ======================= ============== ===========================
+
+ The definitions of all the ``kfd_*`` types come from the
+ ``include/uapi/linux/kfd_ioctl.h`` header file from the KFD repository. It is
+ usually installed in ``/usr/include/linux/kfd_ioctl.h``. The version of the
+ ``kfd_ioctl.h`` file used must define values for
+ ``KFD_IOCTL_MAJOR_VERSION`` and ``KFD_IOCTL_MINOR_VERSION`` matching
+ the values of ``version_major`` and ``version_minor`` from the
+ note.
+
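The V1 layout can be illustrated with a short sketch that splits a note descriptor into its parts. With the field order and alignments listed in the table, the seven fixed fields occupy 32 bytes with no padding, and because every size must be a multiple of 8, the three variable-length arrays follow back to back. The function name ``split_snapshot_note`` is hypothetical, little-endian data is assumed, and the ``kfd_*`` structures are left as opaque byte strings because their layout comes from ``kfd_ioctl.h``.

```python
import struct

# Fixed prefix: 2 x uint32, 1 x uint64, 4 x uint32 = 32 bytes, no padding.
_PREFIX = struct.Struct("<IIQIIII")


def split_snapshot_note(desc: bytes) -> dict:
    """Split a V1 snapshot note descriptor into its raw components."""
    (major, minor, runtime_size, n_agents, agent_size,
     n_queues, queue_size) = _PREFIX.unpack_from(desc, 0)
    for size in (runtime_size, agent_size, queue_size):
        if size % 8:
            raise ValueError("sizes in the note must be multiples of 8")
    end = (_PREFIX.size + runtime_size
           + n_agents * agent_size + n_queues * queue_size)
    if end > len(desc):
        raise ValueError("note descriptor is truncated")

    pos = _PREFIX.size
    runtime_info = desc[pos:pos + runtime_size]
    pos += runtime_size
    agents = [desc[pos + i * agent_size:pos + (i + 1) * agent_size]
              for i in range(n_agents)]
    pos += n_agents * agent_size
    queues = [desc[pos + i * queue_size:pos + (i + 1) * queue_size]
              for i in range(n_queues)]
    return {"version": (major, minor), "runtime_info": runtime_info,
            "agents_info": agents, "queues_info": queues}
```

A consumer would compare ``version`` against the ``KFD_IOCTL_MAJOR_VERSION`` and ``KFD_IOCTL_MINOR_VERSION`` of the header it was built with before interpreting the raw entries.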
+ .. _amdgpu_corefile_memory:
+
+ Memory segments
+ ---------------
+
+ An AMDGPU core file must contain an image of the AMDGPU agents' memory in load
+ segments (of type ``PT_LOAD``). Those segments must correspond to the memory
+ regions where the content of the agent memory is mapped into the host process
+ by the ROCr runtime (note that those memory mappings are usually not readable
+ by the process itself).
+
+ When using the split core file layout, those segments must be included in the
+ AMDGPU core file.
+
  Source Languages
  ================
