Commit 29e9359
Merge tag 'cxl-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) updates from Dave Jiang:

 - Remove always-true condition in cxl features code

 - Add verification of CHBS length for CXL 2.0

 - Ignore interleave granularity when interleave ways is 1

 - Add update addressing missing MODULE_DESCRIPTION for cxl_test

 - A series of cleanups/refactors to prep for AMD Zen5 translate code

 - Clean %pa debug printk in core/hdm.c

 - Documentation updates:
     - Update to CXL Maturity Map
     - Fixes to source linking in CXL documentation
     - CXL documentation fixes, spelling corrections
     - A large collection of CXL documentation for the entire CXL
       subsystem, including documentation on CXL related platform and
       firmware notes

 - Remove redundant code of cxlctl_get_supported_features()

 - Series to support CXL RAS Features, including "Patrol Scrub Control",
   "Error Check Scrub", "Performance Maintenance" and "Memory Sparing".
   The series connects CXL to EDAC.

* tag 'cxl-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (53 commits)
  cxl/edac: Add CXL memory device soft PPR control feature
  cxl/edac: Add CXL memory device memory sparing control feature
  cxl/edac: Support for finding memory operation attributes from the current boot
  cxl/edac: Add support for PERFORM_MAINTENANCE command
  cxl/edac: Add CXL memory device ECS control feature
  cxl/edac: Add CXL memory device patrol scrub control feature
  cxl: Update prototype of function get_support_feature_info()
  EDAC: Update documentation for the CXL memory patrol scrub control feature
  cxl/features: Remove the inline specifier from to_cxlfs()
  cxl/feature: Remove redundant code of get supported features
  docs: ABI: Fix "firwmare" to "firmware"
  cxl/Documentation: Fix typo in sysfs write_bandwidth attribute path
  cxl: doc/linux/access-coordinates Update access coordinates calculation methods
  cxl: docs/platform/acpi/srat Add generic target documentation
  cxl: docs/platform/cdat reference documentation
  Documentation: Update the CXL Maturity Map
  cxl: Sync up the driver-api/cxl documentation
  cxl: docs - add self-referencing cross-links
  cxl: docs/allocation/hugepages
  cxl: docs/allocation/reclaim
  ...
2 parents a9dfb7d + 9f153b7 commit 29e9359


59 files changed: +6769, -266 lines changed

Documentation/ABI/testing/sysfs-bus-cxl

Lines changed: 2 additions & 2 deletions

@@ -242,7 +242,7 @@ Description:
 		decoding a Host Physical Address range. Note that this number
 		may be elevated without any regionX objects active or even
 		enumerated, as this may be due to decoders established by
-		platform firwmare or a previous kernel (kexec).
+		platform firmware or a previous kernel (kexec).


What:		/sys/bus/cxl/devices/decoderX.Y
@@ -572,7 +572,7 @@ Description:


 What:		/sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
-		/sys/bus/cxl/devices/regionZ/accessY/write_banwidth
+		/sys/bus/cxl/devices/regionZ/accessY/write_bandwidth
 Date:		Jan, 2024
 KernelVersion:	v6.9

Documentation/driver-api/cxl/access-coordinates.rst

Lines changed: 0 additions & 91 deletions
This file was deleted.
Lines changed: 60 additions & 0 deletions

.. SPDX-License-Identifier: GPL-2.0

===========
DAX Devices
===========
CXL capacity exposed as a DAX device can be accessed directly via mmap.
Users may wish to use this interface mechanism to write their own userland
CXL allocator, or to manage shared or persistent memory regions across multiple
hosts.

If the capacity is shared across hosts or persistent, appropriate flushing
mechanisms must be employed unless the region supports Snoop Back-Invalidate.

Note that mappings must be aligned (size and base) to the dax device's base
alignment, which is typically 2MB - but may be configured larger.

::

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define DEVICE_PATH "/dev/dax0.0" /* Replace with your DAX device path */
  #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) /* 4GB */

  int main() {
      int fd;
      void *mapped_addr;

      /* Open the DAX device */
      fd = open(DEVICE_PATH, O_RDWR);
      if (fd < 0) {
          perror("open");
          return -1;
      }

      /* Map the device into memory */
      mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
      if (mapped_addr == MAP_FAILED) {
          perror("mmap");
          close(fd);
          return -1;
      }

      printf("Mapped address: %p\n", mapped_addr);

      /* You can now access the device through the mapped address */
      uint64_t *ptr = (uint64_t *)mapped_addr;
      *ptr = 0x1234567890abcdef; /* Write a value to the device */
      printf("Value at address %p: 0x%016llx\n", ptr,
             (unsigned long long)*ptr);

      /* Clean up */
      munmap(mapped_addr, DEVICE_SIZE);
      close(fd);
      return 0;
  }
Lines changed: 32 additions & 0 deletions

.. SPDX-License-Identifier: GPL-2.0

==========
Huge Pages
==========

Contiguous Memory Allocator
===========================
CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
as the NUMA node hosting that capacity will be `Online` at the time CMA
carves out contiguous capacity.

CXL Memory deferred to the CXL Driver for configuration cannot have its
capacity allocated by CMA - as the NUMA node hosting the capacity is `Offline`
at :code:`__init` time - when CMA carves out contiguous capacity.

HugeTLB
=======
Different huge page sizes allow different memory configurations.

2MB Huge Pages
--------------
All CXL capacity regardless of configuration time or memory zone is eligible
for use as 2MB huge pages.

1GB Huge Pages
--------------
CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page
allocation.

CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic
Page allocation.
Lines changed: 85 additions & 0 deletions

.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
placed in.

This section mostly focuses on how these configurations affect the page
allocator (as of Linux v6.15) rather than the overall page allocator behavior.

NUMA nodes and mempolicy
========================
Unless a task explicitly registers a mempolicy, the default memory policy
of the linux kernel is to allocate memory from the `local NUMA node` first,
and fall back to other nodes only if the local node is pressured.

Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local. Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.


Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator attempts to allocate from the highest
available and compatible ZONE for an allocation from the local node first.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
will fall back to allocate from :code:`ZONE_NORMAL`.


Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration - all onlined in
:code:`ZONE_MOVABLE`.

Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::

  for (each zone in local_node):

    for (each node in fallback_order):

      attempt_allocation(gfp_flags);

Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation. As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.

This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
capacity - when that capacity is depleted, the page allocator will actually
prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

We may wish to invert this priority in future Linux versions.

If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
when the DRAM nodes are depleted. See the reclaim section for more details.


CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
containers to limit the accessibility of certain NUMA nodes for tasks in that
container. Users may wish to utilize this in multi-tenant systems where some
tasks prefer not to use slower memory.

In the reclaim section we'll discuss some limitations of this interface to
prevent demotions of shared data to CXL memory (if demotions are enabled).
Lines changed: 51 additions & 0 deletions

.. SPDX-License-Identifier: GPL-2.0

=======
Reclaim
=======
Another way CXL memory can be utilized *indirectly* is via the reclaim system
in :code:`mm/vmscan.c`. Reclaim is engaged when memory capacity on the system
becomes pressured based on global and cgroup-local `watermark` settings.

In this section we won't discuss the `watermark` configurations, just how CXL
memory can be consumed by various pieces of the reclaim system.

Demotion
========
By default, the reclaim system will prefer swap (or zswap) when reclaiming
memory. Enabling :code:`kernel/mm/numa/demotion_enabled` will cause vmscan
to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity
is available.

Demotion engages the :code:`mm/memory_tier.c` component to determine the
next demotion node. The next demotion node is based on the :code:`HMAT`
or :code:`CDAT` performance data.

cpusets.mems_allowed quirk
--------------------------
In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
when migrating pages. As a result, if demotion is enabled, vmscan cannot
guarantee isolation of a container's memory from nodes not set in mems_allowed.

In Linux v6.XX and up, demotion does attempt to respect
:code:`cpusets.mems_allowed`; however, certain classes of shared memory
originally instantiated by another cgroup (such as common libraries - e.g.
libc) may still be demoted. As a result, the mems_allowed interface still
cannot provide perfect isolation from the remote nodes.

ZSwap and Node Preference
=========================
In Linux v6.15 and below, ZSwap allocates memory from the local node of the
processor for the new pages being compressed. Since pages being compressed
are typically cold, the result is a cold page becomes promoted - only to
be later demoted as it ages off the LRU.

In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
as the allocation target for the compression page. This helps prevent
thrashing.

Demotion with ZSwap
===================
When enabling both Demotion and ZSwap, you create a situation where ZSwap
will prefer the slowest form of CXL memory by default until that tier of
memory is exhausted.
