
Commit be9b359

shijujose4 authored and davejiang committed
cxl/edac: Add CXL memory device soft PPR control feature
Post Package Repair (PPR) maintenance operations may be supported by CXL devices that implement the CXL.mem protocol. A PPR maintenance operation requests the CXL device to perform a repair operation on its media. For example, a CXL device with DRAM components that support PPR features may implement PPR maintenance operations. DRAM components may support two types of PPR: hard PPR (hPPR), for a permanent row repair, and soft PPR (sPPR), for a temporary row repair. Soft PPR is much faster than hPPR, but the repair is lost with a power cycle.

During the execution of a PPR maintenance operation, a CXL memory device:
- may or may not retain data
- may or may not be able to process CXL.mem requests correctly, including those that target the DPA involved in the repair.

These CXL memory device capabilities are specified by the Restriction Flags in the sPPR Feature and hPPR Feature. An sPPR maintenance operation may be executed at runtime, if data is retained and CXL.mem requests are processed correctly. For CXL devices with DRAM components, an hPPR maintenance operation may be executed only at boot, because data is typically not retained across an hPPR maintenance operation.

When a CXL device identifies an error on a memory component, the device may inform the host of the need for a PPR maintenance operation by using an Event Record with the Maintenance Needed flag set. The Event Record specifies the DPA that should be repaired. A CXL device may not keep track of the requests that have already been sent, and the information on which DPA should be repaired may be lost upon power cycle. The userspace tool requests a maintenance operation if the number of corrected errors reported on CXL.mem media exceeds the error threshold.

CXL spec 3.2 section 8.2.10.7.1.2 describes the device's sPPR (soft PPR) maintenance operation and section 8.2.10.7.1.3 describes the device's hPPR (hard PPR) maintenance operation feature. CXL spec 3.2 section 8.2.10.7.2.1 describes sPPR feature discovery and configuration, and section 8.2.10.7.2.2 describes hPPR feature discovery and configuration.

Add support for controlling the CXL memory device soft PPR (sPPR) feature. Register with the EDAC driver, which gets the memory repair attribute descriptors from the EDAC memory repair driver and exposes sysfs repair control attributes for PPR to userspace. For example, CXL PPR control for the CXL mem0 device is exposed in /sys/bus/edac/devices/cxl_mem0/mem_repairX/

Add checks to ensure the memory to be repaired is offline or, if online, originates from a CXL DRAM or CXL gen_media error record reported in the current boot, before requesting a PPR operation on the device.

Note: Tested with QEMU patch for the CXL PPR feature.
https://lore.kernel.org/linux-cxl/[email protected]/T/#m70b2b010f43f7f4a6f9acee5ec9008498bf292c3

Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Shiju Jose <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Acked-by: Dan Williams <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Dave Jiang <[email protected]>
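As a rough illustration of the sysfs flow described above, here is a minimal userspace sketch. It is not part of the patch: the mem_repair0 instance name, the DPA value, and the nibble mask are placeholders, and the attribute names are taken from Documentation/ABI/testing/sysfs-edac-memory-repair.

/*
 * Hypothetical userspace sketch: request a soft PPR via the EDAC memory
 * repair sysfs ABI. Values are illustrative only; a real tool would take
 * the DPA and nibble mask from a CXL event record.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_attr(const char *dir, const char *attr, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(val, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	const char *dir = "/sys/bus/edac/devices/cxl_mem0/mem_repair0";

	/* DPA of the row to repair, e.g. from a prior DRAM event record */
	if (write_attr(dir, "dpa", "0x18a80"))
		return EXIT_FAILURE;
	/* Optional nibble mask identifying the failing device nibbles */
	if (write_attr(dir, "nibble_mask", "0x2"))
		return EXIT_FAILURE;
	/* Writing EDAC_DO_MEM_REPAIR (1) triggers the do_repair callback */
	if (write_attr(dir, "repair", "1"))
		return EXIT_FAILURE;

	return EXIT_SUCCESS;
}

If the memory is online and no matching DRAM or gen_media error record from the current boot is found, the repair write is expected to fail with EINVAL, per the checks added below.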
1 parent 588ca94 commit be9b359

File tree

2 files changed: +337 -1 lines changed

Documentation/edac/memory_repair.rst

Lines changed: 10 additions & 0 deletions
@@ -138,5 +138,15 @@ CXL device to perform a repair operation on its media. For example, a CXL
 device with DRAM components that support memory sparing features may
 implement sparing maintenance operations.
 
+2. CXL memory Soft Post Package Repair (sPPR)
+
+Post Package Repair (PPR) maintenance operations may be supported by CXL
+devices that implement CXL.mem protocol. A PPR maintenance operation
+requests the CXL device to perform a repair operation on its media.
+For example, a CXL device with DRAM components that support PPR features
+may implement PPR Maintenance operations. Soft PPR (sPPR) is a temporary
+row repair. Soft PPR may be faster, but the repair is lost with a power
+cycle.
+
 Sysfs files for memory repair are documented in
 `Documentation/ABI/testing/sysfs-edac-memory-repair`
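Before attempting a repair, userspace can discover what the device supports through the same ABI. A small hedged sketch follows; the cxl_mem0/mem_repair0 instance naming is an assumption, and the attribute names follow the sysfs-edac-memory-repair ABI file cited above.

/*
 * Hypothetical discovery sketch: read the repair attributes that the CXL
 * sPPR driver publishes via EDAC (repair_type, persist_mode,
 * repair_safe_when_in_use, min_dpa, max_dpa). Instance naming assumed.
 */
#include <stdio.h>

static void show_attr(const char *dir, const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, attr);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("%s: %s", attr, buf);
	fclose(f);
}

int main(void)
{
	const char *dir = "/sys/bus/edac/devices/cxl_mem0/mem_repair0";
	const char *attrs[] = { "repair_type", "persist_mode",
				"repair_safe_when_in_use", "min_dpa",
				"max_dpa" };

	for (unsigned int i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
		show_attr(dir, attrs[i]);

	return 0;
}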

drivers/cxl/core/edac.c

Lines changed: 327 additions & 1 deletion
@@ -14,14 +14,15 @@
 #include <linux/cleanup.h>
 #include <linux/edac.h>
 #include <linux/limits.h>
+#include <linux/unaligned.h>
 #include <linux/xarray.h>
 #include <cxl/features.h>
 #include <cxl.h>
 #include <cxlmem.h>
 #include "core.h"
 #include "trace.h"
 
-#define CXL_NR_EDAC_DEV_FEATURES 6
+#define CXL_NR_EDAC_DEV_FEATURES 7
 
 #define CXL_SCRUB_NO_REGION -1
 

@@ -1665,6 +1666,321 @@ static int cxl_memdev_sparing_init(struct cxl_memdev *cxlmd,
 	return 0;
 }
 
+/*
+ * CXL memory soft PPR & hard PPR control
+ */
+struct cxl_ppr_context {
+	uuid_t repair_uuid;
+	u8 instance;
+	u16 get_feat_size;
+	u16 set_feat_size;
+	u8 get_version;
+	u8 set_version;
+	u16 effects;
+	u8 op_class;
+	u8 op_subclass;
+	bool cap_dpa;
+	bool cap_nib_mask;
+	bool media_accessible;
+	bool data_retained;
+	struct cxl_memdev *cxlmd;
+	enum edac_mem_repair_type repair_type;
+	bool persist_mode;
+	u64 dpa;
+	u32 nibble_mask;
+};
+
+/*
+ * See CXL rev 3.2 @8.2.10.7.2.1 Table 8-128 sPPR Feature Readable Attributes
+ *
+ * See CXL rev 3.2 @8.2.10.7.2.2 Table 8-131 hPPR Feature Readable Attributes
+ */
+
+#define CXL_PPR_OP_CAP_DEVICE_INITIATED	BIT(0)
+#define CXL_PPR_OP_MODE_DEV_INITIATED	BIT(0)
+
+#define CXL_PPR_FLAG_DPA_SUPPORT_MASK	BIT(0)
+#define CXL_PPR_FLAG_NIB_SUPPORT_MASK	BIT(1)
+#define CXL_PPR_FLAG_MEM_SPARING_EV_REC_SUPPORT_MASK	BIT(2)
+#define CXL_PPR_FLAG_DEV_INITED_PPR_AT_BOOT_CAP_MASK	BIT(3)
+
+#define CXL_PPR_RESTRICTION_FLAG_MEDIA_ACCESSIBLE_MASK	BIT(0)
+#define CXL_PPR_RESTRICTION_FLAG_DATA_RETAINED_MASK	BIT(2)
+
+#define CXL_PPR_SPARING_EV_REC_EN_MASK	BIT(0)
+#define CXL_PPR_DEV_INITED_PPR_AT_BOOT_EN_MASK	BIT(1)
+
+#define CXL_PPR_GET_CAP_DPA(flags) \
+	FIELD_GET(CXL_PPR_FLAG_DPA_SUPPORT_MASK, flags)
+#define CXL_PPR_GET_CAP_NIB_MASK(flags) \
+	FIELD_GET(CXL_PPR_FLAG_NIB_SUPPORT_MASK, flags)
+#define CXL_PPR_GET_MEDIA_ACCESSIBLE(restriction_flags) \
+	(FIELD_GET(CXL_PPR_RESTRICTION_FLAG_MEDIA_ACCESSIBLE_MASK, \
+		   restriction_flags) ^ 1)
+#define CXL_PPR_GET_DATA_RETAINED(restriction_flags) \
+	(FIELD_GET(CXL_PPR_RESTRICTION_FLAG_DATA_RETAINED_MASK, \
+		   restriction_flags) ^ 1)
+
+struct cxl_memdev_ppr_rd_attrbs {
+	struct cxl_memdev_repair_rd_attrbs_hdr hdr;
+	u8 ppr_flags;
+	__le16 restriction_flags;
+	u8 ppr_op_mode;
+} __packed;
+
+/*
+ * See CXL rev 3.2 @8.2.10.7.1.2 Table 8-118 sPPR Maintenance Input Payload
+ *
+ * See CXL rev 3.2 @8.2.10.7.1.3 Table 8-119 hPPR Maintenance Input Payload
+ */
+struct cxl_memdev_ppr_maintenance_attrbs {
+	u8 flags;
+	__le64 dpa;
+	u8 nibble_mask[3];
+} __packed;
+
+static int cxl_mem_ppr_get_attrbs(struct cxl_ppr_context *cxl_ppr_ctx)
+{
+	size_t rd_data_size = sizeof(struct cxl_memdev_ppr_rd_attrbs);
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_mailbox *cxl_mbox = &cxlmd->cxlds->cxl_mbox;
+	u16 restriction_flags;
+	size_t data_size;
+	u16 return_code;
+
+	struct cxl_memdev_ppr_rd_attrbs *rd_attrbs __free(kfree) =
+		kmalloc(rd_data_size, GFP_KERNEL);
+	if (!rd_attrbs)
+		return -ENOMEM;
+
+	data_size = cxl_get_feature(cxl_mbox, &cxl_ppr_ctx->repair_uuid,
+				    CXL_GET_FEAT_SEL_CURRENT_VALUE, rd_attrbs,
+				    rd_data_size, 0, &return_code);
+	if (!data_size)
+		return -EIO;
+
+	cxl_ppr_ctx->op_class = rd_attrbs->hdr.op_class;
+	cxl_ppr_ctx->op_subclass = rd_attrbs->hdr.op_subclass;
+	cxl_ppr_ctx->cap_dpa = CXL_PPR_GET_CAP_DPA(rd_attrbs->ppr_flags);
+	cxl_ppr_ctx->cap_nib_mask =
+		CXL_PPR_GET_CAP_NIB_MASK(rd_attrbs->ppr_flags);
+
+	restriction_flags = le16_to_cpu(rd_attrbs->restriction_flags);
+	cxl_ppr_ctx->media_accessible =
+		CXL_PPR_GET_MEDIA_ACCESSIBLE(restriction_flags);
+	cxl_ppr_ctx->data_retained =
+		CXL_PPR_GET_DATA_RETAINED(restriction_flags);
+
+	return 0;
+}
+
+static int cxl_mem_perform_ppr(struct cxl_ppr_context *cxl_ppr_ctx)
+{
+	struct cxl_memdev_ppr_maintenance_attrbs maintenance_attrbs;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_mem_repair_attrbs attrbs = { 0 };
+
+	struct rw_semaphore *region_lock __free(rwsem_read_release) =
+		rwsem_read_intr_acquire(&cxl_region_rwsem);
+	if (!region_lock)
+		return -EINTR;
+
+	struct rw_semaphore *dpa_lock __free(rwsem_read_release) =
+		rwsem_read_intr_acquire(&cxl_dpa_rwsem);
+	if (!dpa_lock)
+		return -EINTR;
+
+	if (!cxl_ppr_ctx->media_accessible || !cxl_ppr_ctx->data_retained) {
+		/* Memory to repair must be offline */
+		if (cxl_is_memdev_memory_online(cxlmd))
+			return -EBUSY;
+	} else {
+		if (cxl_is_memdev_memory_online(cxlmd)) {
+			/* Check memory to repair is from the current boot */
+			attrbs.repair_type = CXL_PPR;
+			attrbs.dpa = cxl_ppr_ctx->dpa;
+			attrbs.nibble_mask = cxl_ppr_ctx->nibble_mask;
+			if (!cxl_find_rec_dram(cxlmd, &attrbs) &&
+			    !cxl_find_rec_gen_media(cxlmd, &attrbs))
+				return -EINVAL;
+		}
+	}
+
+	memset(&maintenance_attrbs, 0, sizeof(maintenance_attrbs));
+	maintenance_attrbs.flags = 0;
+	maintenance_attrbs.dpa = cpu_to_le64(cxl_ppr_ctx->dpa);
+	put_unaligned_le24(cxl_ppr_ctx->nibble_mask,
+			   maintenance_attrbs.nibble_mask);
+
+	return cxl_perform_maintenance(&cxlmd->cxlds->cxl_mbox,
+				       cxl_ppr_ctx->op_class,
+				       cxl_ppr_ctx->op_subclass,
+				       &maintenance_attrbs,
+				       sizeof(maintenance_attrbs));
+}
+
+static int cxl_ppr_get_repair_type(struct device *dev, void *drv_data,
+				   const char **repair_type)
+{
+	*repair_type = edac_repair_type[EDAC_REPAIR_PPR];
+
+	return 0;
+}
+
+static int cxl_ppr_get_persist_mode(struct device *dev, void *drv_data,
+				    bool *persist_mode)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*persist_mode = cxl_ppr_ctx->persist_mode;
+
+	return 0;
+}
+
+static int cxl_get_ppr_safe_when_in_use(struct device *dev, void *drv_data,
+					bool *safe)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*safe = cxl_ppr_ctx->media_accessible & cxl_ppr_ctx->data_retained;
+
+	return 0;
+}
+
+static int cxl_ppr_get_min_dpa(struct device *dev, void *drv_data, u64 *min_dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	*min_dpa = cxlds->dpa_res.start;
+
+	return 0;
+}
+
+static int cxl_ppr_get_max_dpa(struct device *dev, void *drv_data, u64 *max_dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	*max_dpa = cxlds->dpa_res.end;
+
+	return 0;
+}
+
+static int cxl_ppr_get_dpa(struct device *dev, void *drv_data, u64 *dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*dpa = cxl_ppr_ctx->dpa;
+
+	return 0;
+}
+
+static int cxl_ppr_set_dpa(struct device *dev, void *drv_data, u64 dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	if (dpa < cxlds->dpa_res.start || dpa > cxlds->dpa_res.end)
+		return -EINVAL;
+
+	cxl_ppr_ctx->dpa = dpa;
+
+	return 0;
+}
+
+static int cxl_ppr_get_nibble_mask(struct device *dev, void *drv_data,
+				   u32 *nibble_mask)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*nibble_mask = cxl_ppr_ctx->nibble_mask;
+
+	return 0;
+}
+
+static int cxl_ppr_set_nibble_mask(struct device *dev, void *drv_data,
+				   u32 nibble_mask)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	cxl_ppr_ctx->nibble_mask = nibble_mask;
+
+	return 0;
+}
+
+static int cxl_do_ppr(struct device *dev, void *drv_data, u32 val)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	if (!cxl_ppr_ctx->dpa || val != EDAC_DO_MEM_REPAIR)
+		return -EINVAL;
+
+	return cxl_mem_perform_ppr(cxl_ppr_ctx);
+}
+
+static const struct edac_mem_repair_ops cxl_sppr_ops = {
+	.get_repair_type = cxl_ppr_get_repair_type,
+	.get_persist_mode = cxl_ppr_get_persist_mode,
+	.get_repair_safe_when_in_use = cxl_get_ppr_safe_when_in_use,
+	.get_min_dpa = cxl_ppr_get_min_dpa,
+	.get_max_dpa = cxl_ppr_get_max_dpa,
+	.get_dpa = cxl_ppr_get_dpa,
+	.set_dpa = cxl_ppr_set_dpa,
+	.get_nibble_mask = cxl_ppr_get_nibble_mask,
+	.set_nibble_mask = cxl_ppr_set_nibble_mask,
+	.do_repair = cxl_do_ppr,
+};
+
+static int cxl_memdev_soft_ppr_init(struct cxl_memdev *cxlmd,
+				    struct edac_dev_feature *ras_feature,
+				    u8 repair_inst)
+{
+	struct cxl_ppr_context *cxl_sppr_ctx;
+	struct cxl_feat_entry *feat_entry;
+	int ret;
+
+	feat_entry = cxl_feature_info(to_cxlfs(cxlmd->cxlds),
+				      &CXL_FEAT_SPPR_UUID);
+	if (IS_ERR(feat_entry))
+		return -EOPNOTSUPP;
+
+	if (!(le32_to_cpu(feat_entry->flags) & CXL_FEATURE_F_CHANGEABLE))
+		return -EOPNOTSUPP;
+
+	cxl_sppr_ctx =
+		devm_kzalloc(&cxlmd->dev, sizeof(*cxl_sppr_ctx), GFP_KERNEL);
+	if (!cxl_sppr_ctx)
+		return -ENOMEM;
+
+	*cxl_sppr_ctx = (struct cxl_ppr_context){
+		.get_feat_size = le16_to_cpu(feat_entry->get_feat_size),
+		.set_feat_size = le16_to_cpu(feat_entry->set_feat_size),
+		.get_version = feat_entry->get_feat_ver,
+		.set_version = feat_entry->set_feat_ver,
+		.effects = le16_to_cpu(feat_entry->effects),
+		.cxlmd = cxlmd,
+		.repair_type = EDAC_REPAIR_PPR,
+		.persist_mode = 0,
+		.instance = repair_inst,
+	};
+	uuid_copy(&cxl_sppr_ctx->repair_uuid, &CXL_FEAT_SPPR_UUID);
+
+	ret = cxl_mem_ppr_get_attrbs(cxl_sppr_ctx);
+	if (ret)
+		return ret;
+
+	ras_feature->ft_type = RAS_FEAT_MEM_REPAIR;
+	ras_feature->instance = cxl_sppr_ctx->instance;
+	ras_feature->mem_repair_ops = &cxl_sppr_ops;
+	ras_feature->ctx = cxl_sppr_ctx;
+
+	return 0;
+}
+
 int devm_cxl_memdev_edac_register(struct cxl_memdev *cxlmd)
 {
 	struct edac_dev_feature ras_features[CXL_NR_EDAC_DEV_FEATURES];
@@ -1704,6 +2020,16 @@ int devm_cxl_memdev_edac_register(struct cxl_memdev *cxlmd)
 		num_ras_features++;
 	}
 
+	rc = cxl_memdev_soft_ppr_init(cxlmd, &ras_features[num_ras_features],
+				      repair_inst);
+	if (rc < 0 && rc != -EOPNOTSUPP)
+		return rc;
+
+	if (rc != -EOPNOTSUPP) {
+		repair_inst++;
+		num_ras_features++;
+	}
+
 	if (repair_inst) {
 		struct cxl_mem_err_rec *array_rec =
 			devm_kzalloc(&cxlmd->dev, sizeof(*array_rec),
