Commit 9f153b7

Merge branch 'for-6.16/cxl-features-ras' into cxl-for-next
Add CXL RAS Features support. Features include "patrol scrub control", "error check scrub", "perform maintenance", and "memory sparing". This support connects the RAS Features to EDAC.
2 parents 6eed708 + be9b359 commit 9f153b7

17 files changed, +2373 -14 lines changed

Documentation/edac/memory_repair.rst

Lines changed: 31 additions & 0 deletions
@@ -119,3 +119,34 @@ sysfs

 Sysfs files are documented in
 `Documentation/ABI/testing/sysfs-edac-memory-repair`.
+
+Examples
+--------
+
+The memory repair usage takes the form shown in this example:
+
+1. CXL memory sparing
+
+Memory sparing is defined as a repair function that replaces a portion of
+memory with a portion of functional memory at that same DPA. The subclass
+for this operation, cacheline/row/bank/rank sparing, vary in terms of the
+scope of the sparing being performed.
+
+Memory sparing maintenance operations may be supported by CXL devices that
+implement CXL.mem protocol. A sparing maintenance operation requests the
+CXL device to perform a repair operation on its media. For example, a CXL
+device with DRAM components that support memory sparing features may
+implement sparing maintenance operations.
+
+2. CXL memory Soft Post Package Repair (sPPR)
+
+Post Package Repair (PPR) maintenance operations may be supported by CXL
+devices that implement CXL.mem protocol. A PPR maintenance operation
+requests the CXL device to perform a repair operation on its media.
+For example, a CXL device with DRAM components that support PPR features
+may implement PPR Maintenance operations. Soft PPR (sPPR) is a temporary
+row repair. Soft PPR may be faster, but the repair is lost with a power
+cycle.
+
+Sysfs files for memory repair are documented in
+`Documentation/ABI/testing/sysfs-edac-memory-repair`
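
The added text says the sparing subclasses (cacheline/row/bank/rank) differ in the scope of the repair. One way to picture that is a minimal user-space sketch in which a narrower scope needs more address components before a request is complete. The enum names, the bitmask layout, and the exact field set per scope are assumptions made for illustration; they are not taken from this diff or from the kernel's EDAC identifiers.

```c
#include <stdbool.h>

/* Hypothetical scope names for this sketch only. */
enum sparing_scope {
	SPARE_CACHELINE,
	SPARE_ROW,
	SPARE_BANK,
	SPARE_RANK,
};

/* Address components a repair request might carry, as a bitmask. */
enum addr_field {
	F_CHANNEL    = 1u << 0,
	F_RANK       = 1u << 1,
	F_BANK_GROUP = 1u << 2,
	F_BANK       = 1u << 3,
	F_ROW        = 1u << 4,
	F_COLUMN     = 1u << 5,
};

/*
 * Assumed mapping: every scope needs channel and rank, and each
 * narrower scope adds the fields of the next wider one.
 */
static unsigned int sparing_required_fields(enum sparing_scope scope)
{
	unsigned int mask = F_CHANNEL | F_RANK;

	switch (scope) {
	case SPARE_CACHELINE:
		mask |= F_COLUMN;
		/* fall through */
	case SPARE_ROW:
		mask |= F_ROW;
		/* fall through */
	case SPARE_BANK:
		mask |= F_BANK_GROUP | F_BANK;
		/* fall through */
	case SPARE_RANK:
		break;
	}
	return mask;
}

/* True when the caller supplied every field the scope needs. */
static bool sparing_request_complete(enum sparing_scope scope,
				     unsigned int provided)
{
	unsigned int need = sparing_required_fields(scope);

	return (provided & need) == need;
}
```

Under these assumptions a rank-sparing request is complete with only channel and rank, while a cacheline-sparing request without a column is rejected.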

Documentation/edac/scrub.rst

Lines changed: 76 additions & 0 deletions
@@ -264,3 +264,79 @@ Sysfs files are documented in
 `Documentation/ABI/testing/sysfs-edac-scrub`

 `Documentation/ABI/testing/sysfs-edac-ecs`
+
+Examples
+--------
+
+The usage takes the form shown in these examples:
+
+1. CXL memory Patrol Scrub
+
+The following are the use cases identified why we might increase the scrub rate.
+
+- Scrubbing is needed at device granularity because a device is showing
+  unexpectedly high errors.
+
+- Scrubbing may apply to memory that isn't online at all yet. Likely this
+  is a system wide default setting on boot.
+
+- Scrubbing at a higher rate because the monitor software has determined that
+  more reliability is necessary for a particular data set. This is called
+  Differentiated Reliability.
+
+1.1. Device based scrubbing
+
+CXL memory is exposed to memory management subsystem and ultimately userspace
+via CXL devices. Device-based scrubbing is used for the first use case
+described in "Section 1 CXL Memory Patrol Scrub".
+
+When combining control via the device interfaces and region interfaces,
+"see Section 1.2 Region based scrubbing".
+
+Sysfs files for scrubbing are documented in
+`Documentation/ABI/testing/sysfs-edac-scrub`
+
+1.2. Region based scrubbing
+
+CXL memory is exposed to memory management subsystem and ultimately userspace
+via CXL regions. CXL Regions represent mapped memory capacity in system
+physical address space. These can incorporate one or more parts of multiple CXL
+memory devices with traffic interleaved across them. The user may want to control
+the scrub rate via this more abstract region instead of having to figure out the
+constituent devices and program them separately. The scrub rate for each device
+covers the whole device. Thus if multiple regions use parts of that device then
+requests for scrubbing of other regions may result in a higher scrub rate than
+requested for this specific region.
+
+Region-based scrubbing is used for the third use case described in
+"Section 1 CXL Memory Patrol Scrub".
+
+Userspace must follow below set of rules on how to set the scrub rates for any
+mixture of requirements.
+
+1. Taking each region in turn from lowest desired scrub rate to highest and set
+   their scrub rates. Later regions may override the scrub rate on individual
+   devices (and hence potentially whole regions).
+
+2. Take each device for which enhanced scrubbing is required (higher rate) and
+   set those scrub rates. This will override the scrub rates of individual devices,
+   setting them to the maximum rate required for any of the regions they help back,
+   unless a specific rate is already defined.
+
+Sysfs files for scrubbing are documented in
+`Documentation/ABI/testing/sysfs-edac-scrub`
+
+2. CXL memory Error Check Scrub (ECS)
+
+The Error Check Scrub (ECS) feature enables a memory device to perform error
+checking and correction (ECC) and count single-bit errors. The associated
+memory controller sets the ECS mode with a trigger sent to the memory
+device. CXL ECS control allows the host, thus the userspace, to change the
+attributes for error count mode, threshold number of errors per segment
+(indicating how many segments have at least that number of errors) for
+reporting errors, and reset the ECS counter. Thus the responsibility for
+initiating Error Check Scrub on a memory device may lie with the memory
+controller or platform when unexpectedly high error rates are detected.
+
+Sysfs files for scrubbing are documented in
+`Documentation/ABI/testing/sysfs-edac-ecs`
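
The two ordering rules in section 1.2 can be modelled in a few lines of user-space C. This is a sketch only: `struct region_req`, `apply_region_rates()` and `apply_device_rate()` are hypothetical names, and "rate" is an abstract number where larger means more frequent scrubbing, not the value of any sysfs attribute. The one fact it relies on is stated in the text: a scrub rate applies to a whole device, so the last write to a device wins.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_BACKING_DEVS 4

/* A region's desired rate and the indices of the devices backing it. */
struct region_req {
	unsigned int rate;	/* abstract units, larger = scrub more often */
	size_t ndevs;
	size_t devs[MAX_BACKING_DEVS];
};

/* qsort comparator: ascending by desired rate. */
static int cmp_rate(const void *a, const void *b)
{
	const struct region_req *ra = a, *rb = b;

	return (ra->rate > rb->rate) - (ra->rate < rb->rate);
}

/*
 * Rule 1: program regions from lowest to highest desired rate.
 * Each write covers the whole device, so a shared device ends up
 * with the highest rate requested by any region it backs.
 */
static void apply_region_rates(struct region_req *regions, size_t nregions,
			       unsigned int *dev_rate)
{
	qsort(regions, nregions, sizeof(*regions), cmp_rate);
	for (size_t i = 0; i < nregions; i++)
		for (size_t j = 0; j < regions[i].ndevs; j++)
			dev_rate[regions[i].devs[j]] = regions[i].rate;
}

/* Rule 2: a per-device request is applied last and overrides rule 1. */
static void apply_device_rate(unsigned int *dev_rate, size_t dev,
			      unsigned int rate)
{
	dev_rate[dev] = rate;
}
```

For example, with one region at rate 2 on devices 0 and 1 and another at rate 8 on devices 1 and 2, device 1 ends at rate 8, which is higher than the first region asked for. That is the side effect the documentation warns about.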

drivers/cxl/Kconfig

Lines changed: 71 additions & 0 deletions
@@ -114,6 +114,77 @@ config CXL_FEATURES

 	  If unsure say 'n'

+config CXL_EDAC_MEM_FEATURES
+	bool "CXL: EDAC Memory Features"
+	depends on EXPERT
+	depends on CXL_MEM
+	depends on CXL_FEATURES
+	depends on EDAC >= CXL_BUS
+	help
+	  The CXL EDAC memory feature is optional and allows host to
+	  control the EDAC memory features configurations of CXL memory
+	  expander devices.
+
+	  Say 'y' if you have an expert need to change default settings
+	  of a memory RAS feature established by the platform/device.
+	  Otherwise say 'n'.
+
+config CXL_EDAC_SCRUB
+	bool "Enable CXL Patrol Scrub Control (Patrol Read)"
+	depends on CXL_EDAC_MEM_FEATURES
+	depends on EDAC_SCRUB
+	help
+	  The CXL EDAC scrub control is optional and allows host to
+	  control the scrub feature configurations of CXL memory expander
+	  devices.
+
+	  When enabled 'cxl_mem' and 'cxl_region' EDAC devices are
+	  published with memory scrub control attributes as described by
+	  Documentation/ABI/testing/sysfs-edac-scrub.
+
+	  Say 'y' if you have an expert need to change default settings
+	  of a memory scrub feature established by the platform/device
+	  (e.g. scrub rates for the patrol scrub feature).
+	  Otherwise say 'n'.
+
+config CXL_EDAC_ECS
+	bool "Enable CXL Error Check Scrub (Repair)"
+	depends on CXL_EDAC_MEM_FEATURES
+	depends on EDAC_ECS
+	help
+	  The CXL EDAC ECS control is optional and allows host to
+	  control the ECS feature configurations of CXL memory expander
+	  devices.
+
+	  When enabled 'cxl_mem' EDAC devices are published with memory
+	  ECS control attributes as described by
+	  Documentation/ABI/testing/sysfs-edac-ecs.
+
+	  Say 'y' if you have an expert need to change default settings
+	  of a memory ECS feature established by the platform/device.
+	  Otherwise say 'n'.
+
+config CXL_EDAC_MEM_REPAIR
+	bool "Enable CXL Memory Repair"
+	depends on CXL_EDAC_MEM_FEATURES
+	depends on EDAC_MEM_REPAIR
+	help
+	  The CXL EDAC memory repair control is optional and allows host
+	  to control the memory repair features (e.g. sparing, PPR)
+	  configurations of CXL memory expander devices.
+
+	  When enabled, the memory repair feature requires an additional
+	  memory of approximately 43KB to store CXL DRAM and CXL general
+	  media event records.
+
+	  When enabled 'cxl_mem' EDAC devices are published with memory
+	  repair control attributes as described by
+	  Documentation/ABI/testing/sysfs-edac-memory-repair.
+
+	  Say 'y' if you have an expert need to change default settings
+	  of a memory repair feature established by the platform/device.
+	  Otherwise say 'n'.
+
 config CXL_PORT
 	default CXL_BUS
 	tristate
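
The dependency chain in this Kconfig hunk, in particular `depends on EDAC >= CXL_BUS`, may be easier to read as a small truth function. The following is a rough user-space model, not Kconfig itself: it assumes tristate values order as n < m < y, treats any non-'n' value as satisfying a plain `depends on`, and uses hypothetical names (`struct cxl_kconfig`, `mem_features_enabled()`, `child_enabled()`).

```c
#include <stdbool.h>

/* Kconfig tristate values, ordered so that n < m < y. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* Hypothetical snapshot of the symbols the new options depend on. */
struct cxl_kconfig {
	enum tristate expert;
	enum tristate cxl_mem;
	enum tristate cxl_features;
	enum tristate edac;
	enum tristate cxl_bus;
	bool want_mem_features;	/* user answered 'y' to CXL_EDAC_MEM_FEATURES */
};

/*
 * CXL_EDAC_MEM_FEATURES: all four "depends on" lines must hold.
 * "EDAC >= CXL_BUS" rejects EDAC=m with CXL_BUS=y, i.e. EDAC must be
 * at least as built-in as the CXL core that would call into it.
 */
static bool mem_features_enabled(const struct cxl_kconfig *c)
{
	return c->want_mem_features &&
	       c->expert != TRI_N &&
	       c->cxl_mem != TRI_N &&
	       c->cxl_features != TRI_N &&
	       c->edac >= c->cxl_bus;
}

/*
 * CXL_EDAC_SCRUB / CXL_EDAC_ECS / CXL_EDAC_MEM_REPAIR: each needs the
 * parent option plus its matching EDAC_* symbol (passed as edac_dep).
 */
static bool child_enabled(const struct cxl_kconfig *c, enum tristate edac_dep)
{
	return mem_features_enabled(c) && edac_dep != TRI_N;
}
```

With `EDAC=y` and `CXL_BUS=m` the parent option is available; with `EDAC=m` and `CXL_BUS=y` it is not, and none of the three child options can be selected.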

drivers/cxl/core/Makefile

Lines changed: 1 addition & 0 deletions
@@ -20,3 +20,4 @@ cxl_core-$(CONFIG_TRACING) += trace.o
 cxl_core-$(CONFIG_CXL_REGION) += region.o
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
+cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o

drivers/cxl/core/core.h

Lines changed: 2 additions & 0 deletions
@@ -124,6 +124,8 @@ int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
 					     int nid, resource_size_t *size);

 #ifdef CONFIG_CXL_FEATURES
+struct cxl_feat_entry *
+cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
 size_t cxl_get_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
 		       enum cxl_get_feat_selection selection,
 		       void *feat_out, size_t feat_out_size, u16 offset,
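
The new `cxl_feature_info()` prototype takes a feature state and a UUID and returns a feature entry. Its body is not part of this hunk, so the following is only a user-space analogue of that kind of lookup: a linear search of a table keyed by a 16-byte UUID. The `struct feat_entry` and `struct feat_table` types, the `payload_size` field, and the NULL-on-miss convention are all assumptions for illustration and are not the kernel's definitions.

```c
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical stand-in for a supported-feature entry. */
struct feat_entry {
	unsigned char uuid[UUID_LEN];
	unsigned short payload_size;	/* illustrative field only */
};

/* Hypothetical stand-in for the per-device feature table. */
struct feat_table {
	size_t count;
	const struct feat_entry *entries;
};

/*
 * Return the entry whose UUID matches, or NULL if the device does not
 * advertise that feature. (The kernel function may report a miss
 * differently; this sketch does not claim to match it.)
 */
static const struct feat_entry *
feature_lookup(const struct feat_table *tbl, const unsigned char *uuid)
{
	for (size_t i = 0; i < tbl->count; i++) {
		if (memcmp(tbl->entries[i].uuid, uuid, UUID_LEN) == 0)
			return &tbl->entries[i];
	}
	return NULL;
}
```

A caller in this model would look up a feature before touching it and skip the feature when the lookup misses.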
