Commit 39388d5
Merge tag 'cgroup-dmem-drm-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux into drm-next
DMEM cgroup pull request

This introduces a new cgroup controller to limit device memory. Notable
users would be DRM, dma-buf heaps, or v4l2.

This pull request is based on the series developed by Maarten Lankhorst,
Friedrich Vock, and me:
https://lore.kernel.org/all/[email protected]/

Signed-off-by: Dave Airlie <[email protected]>
From: Maxime Ripard <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20250110-cryptic-warm-mandrill-b71f5d@houat
2 parents: f600187 + dfe6aa1

20 files changed: +1194 additions, -32 deletions

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 51 additions & 7 deletions
@@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
      5-6. Device
      5-7. RDMA
        5-7-1. RDMA Interface Files
-     5-8. HugeTLB
-       5.8-1. HugeTLB Interface Files
-     5-9. Misc
-       5.9-1 Miscellaneous cgroup Interface Files
-       5.9-2 Migration and Ownership
-     5-10. Others
-       5-10-1. perf_event
+     5-8. DMEM
+     5-9. HugeTLB
+       5.9-1. HugeTLB Interface Files
+     5-10. Misc
+       5.10-1 Miscellaneous cgroup Interface Files
+       5.10-2 Migration and Ownership
+     5-11. Others
+       5-11-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -2626,6 +2627,49 @@ RDMA Interface Files
 	  mlx4_0 hca_handle=1 hca_object=20
 	  ocrdma1 hca_handle=1 hca_object=23
 
+DMEM
+----
+
+The "dmem" controller regulates the distribution and accounting of
+device memory regions. Because each memory region may have its own page
+size, which does not have to be equal to the system page size, the units
+are always bytes.
+
+DMEM Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  dmem.max, dmem.min, dmem.low
+	A readwrite nested-keyed file that exists for all cgroups except
+	the root one, describing the currently configured resource limit
+	for a region.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 1073741824
+	  drm/0000:03:00.0/stolen max
+
+	The semantics are the same as for the memory cgroup controller,
+	and are calculated in the same way.
+
+  dmem.capacity
+	A read-only file that describes the maximum region capacity.
+	It only exists on the root cgroup. Not all memory can be
+	allocated by cgroups, as the kernel reserves some for
+	internal use.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 8514437120
+	  drm/0000:03:00.0/stolen 67108864
+
+  dmem.current
+	A read-only file that describes current resource usage.
+	It exists for all cgroups except the root one.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 12550144
+	  drm/0000:03:00.0/stolen 8650752
+
 HugeTLB
 -------
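The nested-keyed files above take one "<region> <value>" pair per line. A minimal userspace sketch of configuring a limit, assuming cgroup2 is mounted at /sys/fs/cgroup and a child cgroup named "gpujobs" already exists (both assumptions, not part of this patch):

    /* Cap vram0 of the example xe device at 1 GiB by writing one
     * nested-keyed entry; keys that are not written keep their
     * current setting. Mount point and cgroup name are assumed. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            FILE *f = fopen("/sys/fs/cgroup/gpujobs/dmem.max", "w");

            if (!f) {
                    perror("fopen");
                    return EXIT_FAILURE;
            }

            fprintf(f, "drm/0000:03:00.0/vram0 1073741824\n");

            if (fclose(f)) {        /* write errors surface on close too */
                    perror("fclose");
                    return EXIT_FAILURE;
            }
            return EXIT_SUCCESS;
    }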

Documentation/core-api/cgroup.rst

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+==================
+Cgroup Kernel APIs
+==================
+
+Device Memory Cgroup API (dmemcg)
+=================================
+.. kernel-doc:: kernel/cgroup/dmem.c
+   :export:
+
Documentation/core-api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -109,6 +109,7 @@ more memory-management documentation in Documentation/mm/index.rst.
    dma-isa-lpc
    swiotlb
    mm-api
+   cgroup
    genalloc
    pin_user_pages
    boot-time-mm

Documentation/gpu/drm-compute.rst

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+==================================
+Long running workloads and compute
+==================================
+
+Long running workloads (compute) are workloads that will not complete in 10
+seconds (roughly the time a user will wait before reaching for the power
+button). This means that other techniques need to be used to manage those
+workloads, as they cannot use fences.
+
+Some hardware may schedule compute jobs and have no way to preempt them, or
+to have their memory swapped out from under them. Or they may simply want
+their workload not to be preempted or swapped out at all.
+
+This means that it differs from what is described in driver-api/dma-buf.rst.
+
+As with normal compute jobs, dma-fence may not be used at all, in this case
+not even to force preemption. The driver is simply forced to unmap a BO from
+the long-running compute job's address space on unbind immediately, without
+even waiting for the workload to complete. Effectively this terminates the
+workload when there is no hardware support to recover.
+
+Since this is undesirable, there need to be mitigations to prevent a workload
+from being terminated. There are several possible approaches, all with their
+advantages and drawbacks.
+
+The first approach you will likely try is to pin all buffers used by compute.
+This guarantees that the job will run uninterrupted, but also allows a trivial
+denial of service attack by pinning as much memory as possible, hogging all
+GPU memory, and possibly a huge chunk of CPU memory.
+
+A second approach that will work slightly better on its own is adding an option
+not to evict when creating a new job (of any kind). If all of userspace opts in
+to this flag, it would prevent cooperating userspace from force-terminating
+older compute jobs to start a new one.
+
+If job preemption and recoverable pagefaults are not available, those are the
+only approaches possible. So even with those, you want a separate way of
+controlling resources. The standard kernel way of doing so is cgroups.
+
+This creates a third option: using cgroups to prevent eviction. Both GPU and
+driver-allocated CPU memory would be accounted to the correct cgroup, and
+eviction would be made cgroup aware. This allows the GPU to be partitioned
+into cgroups, which will allow jobs to run next to each other without
+interference.
+
+The interface to the cgroup would be similar to the current CPU memory
+interface, with similar semantics for min/low/high/max, if eviction can
+be made cgroup aware.
+
+What should be noted is that each memory region (tiled memory, for example)
+should have its own accounting.
+
+The key is set to the region id set by the driver, for example "tile0".
+For the value of $card, we use drmGetUnique().
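The resulting key thus has the form "drm/$card/$regionid". A hedged userspace sketch of building it: the drmGetUnique() mentioned above presumably corresponds to the DRM_IOCTL_GET_UNIQUE ioctl, which libdrm exposes as drmGetBusid(); that mapping, the /dev/dri/card0 path, and the "tile0" region id are all assumptions:

    /* Build the "drm/$card/$regionid" key used by the dmem.* files. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <xf86drm.h>

    int main(void)
    {
            int fd = open("/dev/dri/card0", O_RDWR);
            char *unique;
            char key[256];

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* Wraps DRM_IOCTL_GET_UNIQUE; e.g. "0000:03:00.0", though
             * the exact form depends on the device. */
            unique = drmGetBusid(fd);
            if (!unique) {
                    close(fd);
                    return 1;
            }

            snprintf(key, sizeof(key), "drm/%s/tile0", unique);
            printf("%s\n", key);    /* matches the cgroup-v2.rst examples */

            drmFreeBusid(unique);
            close(fd);
            return 0;
    }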

drivers/gpu/drm/drm_drv.c

Lines changed: 32 additions & 0 deletions
@@ -26,6 +26,7 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/cgroup_dmem.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/module.h>
@@ -820,6 +821,37 @@ void drm_dev_put(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_put);
 
+static void drmm_cg_unregister_region(struct drm_device *dev, void *arg)
+{
+	dmem_cgroup_unregister_region(arg);
+}
+
+/**
+ * drmm_cgroup_register_region - Register a region of a DRM device to cgroups
+ * @dev: device for region
+ * @region_name: Region name for registering
+ * @size: Size of region in bytes
+ *
+ * This registers a dmem cgroup region for @dev under the name
+ * "drm/<unique>/<region_name>" and unregisters it automatically when
+ * @dev is released.
+ *
+ * Return: the region, an ERR_PTR on failure, or NULL when the dmem
+ * cgroup controller is not available.
+ */
+struct dmem_cgroup_region *drmm_cgroup_register_region(struct drm_device *dev, const char *region_name, u64 size)
+{
+	struct dmem_cgroup_region *region;
+	int ret;
+
+	region = dmem_cgroup_register_region(size, "drm/%s/%s", dev->unique, region_name);
+	if (IS_ERR_OR_NULL(region))
+		return region;
+
+	ret = drmm_add_action_or_reset(dev, drmm_cg_unregister_region, region);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return region;
+}
+EXPORT_SYMBOL_GPL(drmm_cgroup_register_region);
+
 static int create_compat_control_link(struct drm_device *dev)
 {
 	struct drm_minor *minor;
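For context, a minimal driver-side sketch of using the new helper at init time; the my_gpu structure and its vram_size field are hypothetical, and cleanup is handled by the managed action the helper installs, so no explicit unregistration is needed:

    #include <linux/cgroup_dmem.h>
    #include <linux/err.h>
    #include <drm/drm_drv.h>

    struct my_gpu {                         /* hypothetical driver data */
            struct drm_device drm;
            struct dmem_cgroup_region *vram_cg;
            u64 vram_size;
    };

    static int my_gpu_register_dmem(struct my_gpu *gpu)
    {
            struct dmem_cgroup_region *region;

            /* Appears in the dmem.* files as "drm/<unique>/vram0". */
            region = drmm_cgroup_register_region(&gpu->drm, "vram0",
                                                 gpu->vram_size);
            if (IS_ERR(region))
                    return PTR_ERR(region);

            /* NULL here just means the controller is not available. */
            gpu->vram_cg = region;
            return 0;
    }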

drivers/gpu/drm/ttm/tests/ttm_bo_test.c

Lines changed: 9 additions & 9 deletions
@@ -258,13 +258,13 @@ static void ttm_bo_unreserve_basic(struct kunit *test)
 	bo = ttm_bo_kunit_init(test, test->priv, BO_SIZE, NULL);
 	bo->priority = bo_prio;
 
-	err = ttm_resource_alloc(bo, place, &res1);
+	err = ttm_resource_alloc(bo, place, &res1, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	bo->resource = res1;
 
 	/* Add a dummy resource to populate LRU */
-	ttm_resource_alloc(bo, place, &res2);
+	ttm_resource_alloc(bo, place, &res2, NULL);
 
 	dma_resv_lock(bo->base.resv, NULL);
 	ttm_bo_unreserve(bo);
@@ -300,12 +300,12 @@ static void ttm_bo_unreserve_pinned(struct kunit *test)
 	dma_resv_lock(bo->base.resv, NULL);
 	ttm_bo_pin(bo);
 
-	err = ttm_resource_alloc(bo, place, &res1);
+	err = ttm_resource_alloc(bo, place, &res1, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	bo->resource = res1;
 
 	/* Add a dummy resource to the pinned list */
-	err = ttm_resource_alloc(bo, place, &res2);
+	err = ttm_resource_alloc(bo, place, &res2, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test,
 			list_is_last(&res2->lru.link, &priv->ttm_dev->unevictable), 1);
@@ -355,15 +355,15 @@ static void ttm_bo_unreserve_bulk(struct kunit *test)
 	ttm_bo_set_bulk_move(bo1, &lru_bulk_move);
 	dma_resv_unlock(bo1->base.resv);
 
-	err = ttm_resource_alloc(bo1, place, &res1);
+	err = ttm_resource_alloc(bo1, place, &res1, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	bo1->resource = res1;
 
 	dma_resv_lock(bo2->base.resv, NULL);
 	ttm_bo_set_bulk_move(bo2, &lru_bulk_move);
 	dma_resv_unlock(bo2->base.resv);
 
-	err = ttm_resource_alloc(bo2, place, &res2);
+	err = ttm_resource_alloc(bo2, place, &res2, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	bo2->resource = res2;
 
@@ -401,7 +401,7 @@ static void ttm_bo_put_basic(struct kunit *test)
 	bo = ttm_bo_kunit_init(test, test->priv, BO_SIZE, NULL);
 	bo->type = ttm_bo_type_device;
 
-	err = ttm_resource_alloc(bo, place, &res);
+	err = ttm_resource_alloc(bo, place, &res, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	bo->resource = res;
 
@@ -518,7 +518,7 @@ static void ttm_bo_pin_unpin_resource(struct kunit *test)
 
 	bo = ttm_bo_kunit_init(test, test->priv, BO_SIZE, NULL);
 
-	err = ttm_resource_alloc(bo, place, &res);
+	err = ttm_resource_alloc(bo, place, &res, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	bo->resource = res;
 
@@ -569,7 +569,7 @@ static void ttm_bo_multiple_pin_one_unpin(struct kunit *test)
 
 	bo = ttm_bo_kunit_init(test, test->priv, BO_SIZE, NULL);
 
-	err = ttm_resource_alloc(bo, place, &res);
+	err = ttm_resource_alloc(bo, place, &res, NULL);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	bo->resource = res;
 
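Every hunk in this file (and in the two test files below) is the same mechanical change: ttm_resource_alloc() gained a fourth argument, and the tests opt out by passing NULL. A hedged sketch of a caller that does want the extra information, assuming the new parameter is a struct dmem_cgroup_pool_state ** output reporting the cgroup pool the allocation was charged to (the parameter's type is not visible on this page):

    #include <drm/ttm/ttm_bo.h>
    #include <drm/ttm/ttm_resource.h>
    #include <linux/cgroup_dmem.h>

    /* Hypothetical caller; only the new NULL-able fourth argument of
     * ttm_resource_alloc() is taken from the hunks above. */
    static int my_alloc_tracked(struct ttm_buffer_object *bo,
                                const struct ttm_place *place,
                                struct ttm_resource **res)
    {
            struct dmem_cgroup_pool_state *pool = NULL;
            int err;

            err = ttm_resource_alloc(bo, place, res, &pool);
            if (err)
                    return err;

            /* 'pool' (possibly NULL) identifies the charged dmem cgroup
             * pool; cgroup-aware eviction logic could consult it. */
            return 0;
    }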

drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c

Lines changed: 2 additions & 2 deletions
@@ -542,7 +542,7 @@ static void ttm_bo_validate_no_placement_signaled(struct kunit *test)
 		bo->ttm = old_tt;
 	}
 
-	err = ttm_resource_alloc(bo, place, &bo->resource);
+	err = ttm_resource_alloc(bo, place, &bo->resource, NULL);
 	KUNIT_EXPECT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test, man->usage, size);
 
@@ -603,7 +603,7 @@ static void ttm_bo_validate_no_placement_not_signaled(struct kunit *test)
 	bo = ttm_bo_kunit_init(test, test->priv, size, NULL);
 	bo->type = params->bo_type;
 
-	err = ttm_resource_alloc(bo, place, &bo->resource);
+	err = ttm_resource_alloc(bo, place, &bo->resource, NULL);
 	KUNIT_EXPECT_EQ(test, err, 0);
 
 	placement = kunit_kzalloc(test, sizeof(*placement), GFP_KERNEL);

drivers/gpu/drm/ttm/tests/ttm_resource_test.c

Lines changed: 1 addition & 1 deletion
@@ -302,7 +302,7 @@ static void ttm_sys_man_free_basic(struct kunit *test)
 	res = kunit_kzalloc(test, sizeof(*res), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, res);
 
-	ttm_resource_alloc(bo, place, &res);
+	ttm_resource_alloc(bo, place, &res, NULL);
 
 	man = ttm_manager_type(priv->devs->ttm_dev, mem_type);
 	man->func->free(man, res);
