@@ -106,6 +106,16 @@ will occupy those chip-select rows.
This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.

+ * High Bandwidth Memory (HBM)
+
+   HBM is a new memory type with low power consumption and ultra-wide
+   communication lanes. It uses vertically stacked memory chips (DRAM dies)
+   interconnected by microscopic wires called "through-silicon vias", or
+   TSVs.
+
+   Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
+   interconnect called the "interposer". Therefore, HBM's characteristics
+   are nearly indistinguishable from on-chip integrated RAM.

Memory Controllers
------------------
@@ -176,3 +186,113 @@ nodes::
the L1 and L2 directories would be "edac_device_block's"

.. kernel-doc:: drivers/edac/edac_device.h
+
+
+ Heterogeneous system support
+ ----------------------------
+
+ An AMD heterogeneous system is built by connecting the data fabrics of
+ both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
+ GPU nodes can be accessed the same way as the data fabric on CPU nodes.
+
+ The MI200 accelerators are data center GPUs. Each has two data fabrics,
+ and each GPU data fabric contains four Unified Memory Controllers (UMCs).
+ Each UMC contains eight channels, and each UMC channel controls one 128-bit
+ HBM2e (2 GB) channel, so a UMC is equivalent to 8 x 2 GB ranks. With
+ 4 UMCs x 8 channels x 128 bits, this gives a total of 4096 bits of DRAM
+ data bus per data fabric.
+
+ Each UMC interfaces with a 16 GB HBM stack (8-high x 2 GB DRAM), while each
+ UMC channel interfaces with 2 GB of DRAM (represented as a rank).
+
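The bus-width and capacity figures above can be cross-checked with a short calculation. This is only an illustrative sketch: the constant and function names are hypothetical (they are not kernel identifiers), and the values are taken directly from the two paragraphs above.

```python
# Illustrative arithmetic only. All names here are hypothetical and do not
# correspond to kernel symbols; the values come from the text above.
UMCS_PER_DATA_FABRIC = 4    # four UMCs per GPU data fabric
CHANNELS_PER_UMC = 8        # eight channels per UMC
CHANNEL_WIDTH_BITS = 128    # one 128-bit HBM2e channel per UMC channel
CHANNEL_SIZE_GB = 2         # 2 GB of DRAM per UMC channel (one rank)


def bus_width_bits() -> int:
    """Total DRAM data bus width of one GPU data fabric, in bits."""
    return UMCS_PER_DATA_FABRIC * CHANNELS_PER_UMC * CHANNEL_WIDTH_BITS


def umc_capacity_gb() -> int:
    """Capacity behind one UMC (one HBM stack), in GB."""
    return CHANNELS_PER_UMC * CHANNEL_SIZE_GB


def fabric_capacity_gb() -> int:
    """Capacity behind all UMCs of one GPU data fabric, in GB."""
    return UMCS_PER_DATA_FABRIC * umc_capacity_gb()


def ranks_per_fabric() -> int:
    """Number of 2 GB ranks (UMC channels) in one GPU data fabric."""
    return UMCS_PER_DATA_FABRIC * CHANNELS_PER_UMC


if __name__ == "__main__":
    print(bus_width_bits())      # 4096
    print(umc_capacity_gb())     # 16
    print(fabric_capacity_gb())  # 64
    print(ranks_per_fabric())    # 32
```
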
+ Memory controllers on AMD GPU nodes can be represented in EDAC as follows::
+
+     GPU DF / GPU Node -> EDAC MC
+     GPU UMC           -> EDAC CSROW
+     GPU UMC channel   -> EDAC CHANNEL
+
+ For example, consider a heterogeneous system with one AMD CPU connected to
+ four MI200 (Aldebaran) GPUs using xGMI.
+
+ Some more heterogeneous hardware details:
+
+ - The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
+   Both have chip selects (csrows) and channels. However, the layouts differ
+   for performance, physical layout, or other reasons.
+ - CPU UMCs use 1 channel, so in this case UMC = EDAC channel. This follows the
+   marketing terminology: "the CPU has X memory channels", etc.
+ - CPU UMCs use up to 4 chip selects, so UMC chip select = EDAC CSROW.
+ - GPU UMCs use 1 chip select, so UMC = EDAC CSROW.
+ - GPU UMCs use 8 channels, so UMC channel = EDAC channel.
+
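The difference between the CPU and GPU layouts listed above can be sketched as two small mapping functions. This is a hedged illustration, not the driver's implementation: `EdacLocation`, `cpu_to_edac` and `gpu_to_edac` are hypothetical names, and the bounds (4 chip selects per CPU UMC, 4 UMCs and 8 channels per GPU node) are the ones stated in this section.

```python
from typing import NamedTuple


class EdacLocation(NamedTuple):
    """Hypothetical (csrow, channel) pair as EDAC would present it."""
    csrow: int
    channel: int


def cpu_to_edac(umc: int, chip_select: int) -> EdacLocation:
    """CPU layout: UMC = EDAC channel, UMC chip select = EDAC CSROW."""
    if umc < 0:
        raise ValueError("UMC index must be non-negative")
    if not 0 <= chip_select < 4:
        raise ValueError("CPU UMCs use up to 4 chip selects")
    return EdacLocation(csrow=chip_select, channel=umc)


def gpu_to_edac(umc: int, umc_channel: int) -> EdacLocation:
    """GPU layout: UMC = EDAC CSROW, UMC channel = EDAC channel."""
    if not 0 <= umc < 4:
        raise ValueError("each GPU node has 4 UMCs")
    if not 0 <= umc_channel < 8:
        raise ValueError("each GPU UMC has 8 channels")
    return EdacLocation(csrow=umc, channel=umc_channel)


if __name__ == "__main__":
    print(cpu_to_edac(umc=2, chip_select=1))   # csrow 1, channel 2
    print(gpu_to_edac(umc=3, umc_channel=7))   # csrow 3, channel 7
```

The same hardware index (the UMC number) therefore ends up on different EDAC axes depending on whether the node is a CPU or a GPU.
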
+ The EDAC subsystem provides a mechanism to handle AMD heterogeneous
+ systems by calling system-specific ops for both CPUs and GPUs.
+
+ AMD GPU nodes are enumerated in sequential order based on the PCI
+ hierarchy, and the first GPU node is assumed to have a Node ID value
+ following those of the CPU nodes after the latter are fully populated::
+
+     $ ls /sys/devices/system/edac/mc/
+         mc0   - CPU MC node 0
+         mc1  |
+         mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
+         mc3  |
+         mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
+         mc5  |
+         mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
+         mc7  |
+         mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)
+
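The numbering in the listing above can be reproduced with a small helper. This is a minimal sketch under the assumptions of that listing (one CPU node, two nodes per GPU card); `describe_mc` and its parameters are hypothetical names introduced for illustration and are not part of the EDAC API.

```python
def describe_mc(mc_index: int, cpu_nodes: int = 1, nodes_per_gpu: int = 2):
    """Map an mcN index to ("CPU", node) or ("GPU", card, node).

    Assumes CPU nodes are numbered first and GPU nodes follow sequentially,
    as in the listing above. Defaults match that example system.
    """
    if mc_index < 0:
        raise ValueError("mc index must be non-negative")
    if mc_index < cpu_nodes:
        return ("CPU", mc_index)
    # GPU nodes start right after the last CPU node.
    card, node = divmod(mc_index - cpu_nodes, nodes_per_gpu)
    return ("GPU", card, node)


if __name__ == "__main__":
    for mc in range(9):
        print(f"mc{mc}", describe_mc(mc))
```

For the example system, `mc0` maps to the CPU node and `mc8` maps to node 1 of GPU card 3.
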
+ For example, a heterogeneous system with one AMD CPU is connected to
+ four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
+ via the following sysfs entries::
+
+     /sys/devices/system/edac/mc/..
+
+     CPU                       # CPU node
+     ├── mc 0
+
+     GPU Nodes are enumerated sequentially after CPU nodes have been populated
+     GPU card 1                # Each MI200 GPU has 2 nodes/mcs
+     ├── mc 1                  # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
+     │   ├── csrow 0           # UMC 0
+     │   │   ├── channel 0     # Each UMC has 8 channels
+     │   │   ├── channel 1     # size of each channel is 2 GB, so each UMC has 16 GB
+     │   │   ├── channel 2
+     │   │   ├── channel 3
+     │   │   ├── channel 4
+     │   │   ├── channel 5
+     │   │   ├── channel 6
+     │   │   ├── channel 7
+     │   ├── csrow 1           # UMC 1
+     │   │   ├── channel 0
+     │   │   ├── ..
+     │   │   ├── channel 7
+     │   ├── ..                ..
+     │   ├── csrow 3           # UMC 3
+     │   │   ├── channel 0
+     │   │   ├── ..
+     │   │   ├── channel 7
+     │   ├── rank 0
+     │   ├── ..                ..
+     │   ├── rank 31           # total 32 ranks/dimms from 4 UMCs
+     ├
+     ├── mc 2                  # GPU node 1 == mc2
+     │   ├── ..                # each GPU has total 64 GB
+
+     GPU card 2
+     ├── mc 3
+     │   ├── ..
+     ├── mc 4
+     │   ├── ..
+
+     GPU card 3
+     ├── mc 5
+     │   ├── ..
+     ├── mc 6
+     │   ├── ..
+
+     GPU card 4
+     ├── mc 7
+     │   ├── ..
+     ├── mc 8
+     │   ├── ..