
Commit c84bf6d

lorenzo-stoakes authored and akpm00 committed
mm: introduce new .mmap_prepare() file callback
Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2.

During the mmap() of a file-backed mapping, we invoke the underlying driver file's mmap() callback in order to perform driver/file system initialisation of the underlying VMA. This has been a source of issues in the past, including a significant security concern relating to unwinding of error state discovered by Jann Horn, as fixed in commit 5de1950 ("mm: resolve faulty mmap_region() error path behaviour"), which performed the recent, significant rework of mmap() as a whole.

However, one fly in the ointment remains - drivers have a great deal of freedom in the .mmap() hook to manipulate VMA state (as well as page table state).

This can be problematic, as we can no longer reason sensibly about VMA state once the call is complete (the ability to do - anything - here does rather interfere with that). In addition, callers may choose to do odd or unusual things which might interfere with subsequent steps in the mmap() process, and may then raise an error, requiring very careful unwinding of state about which we can make no assumptions.

Rather than providing such an open-ended interface, this series provides a far more restrictive alternative - we expose a whitelist of fields which can be adjusted by the driver, along with immutable state upon which the driver can base such decisions:

	struct vm_area_desc {
		/* Immutable state. */
		struct mm_struct *mm;
		unsigned long start;
		unsigned long end;

		/* Mutable fields. Populated with initial state. */
		pgoff_t pgoff;
		struct file *file;
		vm_flags_t vm_flags;
		pgprot_t page_prot;

		/* Write-only fields. */
		const struct vm_operations_struct *vm_ops;
		void *private_data;
	};

The mmap logic then uses this updated state either to merge with an existing VMA or to establish a new one. This is achieved via the new file hook .mmap_prepare(), which is, importantly, invoked very early in the mmap() process. If an error arises, we can very simply abort the operation with very little unwinding of state required.

The existing logic contains another, related, peccadillo - since the .mmap() callback might do anything, it may also cause a previously unmergeable VMA to become mergeable with adjacent VMAs. Right now, the logic will retry such a merge only if the driver changes the VMA flags, and changes them in such a way that a merge might succeed (that is, the flags are not 'special' - they do not contain any of the flags specified in VM_SPECIAL).

This has also been the source of a great deal of pain - it's hard to reason about an .mmap() callback that might do - anything - but it's also hard to reason about setting up a VMA and writing to the maple tree, only to do it again utilising a great deal of shared state.

Since .mmap_prepare() sets fields before the first merge is even attempted, the use of this callback obviates the need for this retry merge logic.

A driver may specify either .mmap_prepare() or the deprecated .mmap() callback, but not both. In future we may add further callbacks beyond .mmap_prepare() to facilitate all use cases as we convert drivers.

In researching this change, I examined every .mmap() callback, and discovered only a very few that set VMA state in such a way that (a) the VMA flags changed and (b) this would be mergeable.

In the majority of cases, it turns out that drivers are mapping kernel memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable VM_SPECIAL flags.
Of those that remain, I identified a number of cases which are only applicable in DAX, setting the VM_HUGEPAGE flag:

* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()

For the removal of this re-merge to impact users, each of these cases would require a user to mmap() files using DAX, in parts, immediately adjacent to one another. This is a very unlikely use case, and so it does not appear worthwhile to adjust this functionality accordingly. We can, however, very quickly do so if needed by simply adding an .mmap_prepare() callback to these as required.

There are two further non-DAX cases I identified:

* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing it with VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.

Both of these cases again seem very unlikely to be mmap()'d immediately adjacent to one another in a fashion that would result in a merge.

Finally, we are left with one viable case:

* secretmem_mmap() - Sets VM_LOCKED and VM_DONTDUMP.

This is viable enough that the mm selftests trigger the logic as a matter of course. Therefore, this series replaces the secretmem_mmap() hook with secretmem_mmap_prepare().

This patch (of 3):

Provide a means by which drivers can specify which fields - of those permitted to be changed - should be altered prior to mmap()'ing a range (which may either result from a merge or from mapping an entirely new VMA).

Doing so is substantially safer than the existing .mmap() callback, which provides unrestricted access to the part-constructed VMA and permits drivers and file systems to do 'creative' things which make it hard to reason about the state of the VMA after the function returns.

The existing .mmap() callback's freedom has caused a great deal of issues, especially in error handling, as unwinding the mmap() state has proven to be non-trivial and caused significant issues in the past, for instance those addressed in commit 5de1950 ("mm: resolve faulty mmap_region() error path behaviour").

It also necessitates a second attempt at merge once the .mmap() callback has completed, which has caused issues in the past, is awkward, adds overhead and is difficult to reason about.

The .mmap_prepare() callback eliminates this requirement, as we can update fields prior to even attempting the first merge. It is safer, as we heavily restrict what can actually be modified, and because it is invoked very early in the mmap() process, error handling can be performed safely with very little unwinding of state required.

The .mmap_prepare() and deprecated .mmap() callbacks are mutually exclusive, so we permit only one to be invoked at a time.

Update the VMA userland test stubs to account for these changes.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/adb36a7c4affd7393b2fc4b54cc5cfe211e41f71.1746792520.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
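[Editor's illustration] To make the new contract concrete, here is a minimal sketch of what a driver conversion might look like. The driver name (exdev) and its vm_operations_struct are hypothetical; the flags mirror the secretmem case described above, and this code is illustrative, not taken from the series:

	/* Before: the deprecated hook receives the part-constructed VMA
	 * and may modify it arbitrarily. */
	static int exdev_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP);
		vma->vm_ops = &exdev_vm_ops;	/* hypothetical ops table */
		return 0;
	}

	/* After: only whitelisted vm_area_desc fields may change, and the
	 * hook runs before the first merge attempt. */
	static int exdev_mmap_prepare(struct vm_area_desc *desc)
	{
		desc->vm_flags |= VM_LOCKED | VM_DONTDUMP;
		desc->vm_ops = &exdev_vm_ops;
		return 0;
	}

	static const struct file_operations exdev_fops = {
		/* Mutually exclusive with .mmap. */
		.mmap_prepare	= exdev_mmap_prepare,
	};

Because the flags are already final when the first merge is attempted, a mapping like this needs no retry merge.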
1 parent d2def68 commit c84bf6d

6 files changed: 180 additions, 8 deletions

include/linux/fs.h

Lines changed: 25 additions & 0 deletions
@@ -2169,6 +2169,7 @@ struct file_operations {
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 				unsigned int poll_flags);
+	int (*mmap_prepare)(struct vm_area_desc *);
 } __randomize_layout;
 
 /* Supports async buffered reads */
@@ -2238,11 +2239,35 @@ struct inode_operations {
 	struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
 } ____cacheline_aligned;
 
+/* Did the driver provide valid mmap hook configuration? */
+static inline bool file_has_valid_mmap_hooks(struct file *file)
+{
+	bool has_mmap = file->f_op->mmap;
+	bool has_mmap_prepare = file->f_op->mmap_prepare;
+
+	/* Hooks are mutually exclusive. */
+	if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
+		return false;
+	if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
+		return false;
+
+	return true;
+}
+
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	if (WARN_ON_ONCE(file->f_op->mmap_prepare))
+		return -EINVAL;
+
 	return file->f_op->mmap(file, vma);
 }
 
+static inline int __call_mmap_prepare(struct file *file,
+				      struct vm_area_desc *desc)
+{
+	return file->f_op->mmap_prepare(desc);
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
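[Editor's illustration] For illustration, a sketch of how a caller might use these helpers to dispatch to whichever hook a driver provides, mirroring the do_mmap() gate in the mm/mmap.c hunk below. The wrapper function is hypothetical, not part of this patch:

	static int example_dispatch(struct file *file, struct vm_area_struct *vma,
				    struct vm_area_desc *desc)
	{
		/* Rejects files providing neither or both hooks, with a
		 * one-time warning. */
		if (!file_has_valid_mmap_hooks(file))
			return -ENODEV;

		if (file->f_op->mmap_prepare)
			return __call_mmap_prepare(file, desc);

		/* Legacy path; call_mmap() itself refuses any file that
		 * sets .mmap_prepare. */
		return call_mmap(file, vma);
	}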

include/linux/mm_types.h

Lines changed: 24 additions & 0 deletions
@@ -763,6 +763,30 @@ struct vma_numab_state {
 	int prev_scan_seq;
 };
 
+/*
+ * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
+ * manipulate mutable fields which will cause those fields to be updated in the
+ * resultant VMA.
+ *
+ * Helper functions are not required for manipulating any field.
+ */
+struct vm_area_desc {
+	/* Immutable state. */
+	struct mm_struct *mm;
+	unsigned long start;
+	unsigned long end;
+
+	/* Mutable fields. Populated with initial state. */
+	pgoff_t pgoff;
+	struct file *file;
+	vm_flags_t vm_flags;
+	pgprot_t page_prot;
+
+	/* Write-only fields. */
+	const struct vm_operations_struct *vm_ops;
+	void *private_data;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
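[Editor's illustration] The intended usage pattern is: immutable fields inform the decision, mutable and write-only fields record it. A hypothetical hook demonstrating this, loosely modelled on the DAX VM_HUGEPAGE cases listed in the commit message (the size check and ops table are invented for the example):

	static int example_mmap_prepare(struct vm_area_desc *desc)
	{
		/* Read immutable state to decide what to do... */
		if (desc->end - desc->start >= PMD_SIZE)
			desc->vm_flags |= VM_HUGEPAGE;

		/* ...and record the result in mutable/write-only fields. */
		desc->vm_ops = &example_vm_ops;	/* hypothetical ops table */
		desc->private_data = NULL;
		return 0;
	}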

mm/memory.c

Lines changed: 2 additions & 1 deletion
@@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	dump_page(page, "bad pte");
 	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
 		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
-	pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n",
+	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
 		 vma->vm_file,
 		 vma->vm_ops ? vma->vm_ops->fault : NULL,
 		 vma->vm_file ? vma->vm_file->f_op->mmap : NULL,
+		 vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL,
 		 mapping ? mapping->a_ops->read_folio : NULL);
 	dump_stack();
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);

mm/mmap.c

Lines changed: 1 addition & 1 deletion
@@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags &= ~VM_MAYEXEC;
 		}
 
-		if (!file->f_op->mmap)
+		if (!file_has_valid_mmap_hooks(file))
 			return -ENODEV;
 		if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
 			return -EINVAL;

mm/vma.c

Lines changed: 67 additions & 1 deletion
@@ -17,6 +17,11 @@ struct mmap_state {
 	unsigned long pglen;
 	unsigned long flags;
 	struct file *file;
+	pgprot_t page_prot;
+
+	/* User-defined fields, perhaps updated by .mmap_prepare(). */
+	const struct vm_operations_struct *vm_ops;
+	void *vm_private_data;
 
 	unsigned long charged;
 	bool retry_merge;
@@ -40,6 +45,7 @@ struct mmap_state {
 		.pglen = PHYS_PFN(len_),				\
 		.flags = flags_,					\
 		.file = file_,						\
+		.page_prot = vm_get_page_prot(flags_),			\
 	}
 
 #define VMG_MMAP_STATE(name, map_, vma_)				\
@@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map,
 	int error;
 
 	vma->vm_file = get_file(map->file);
+
+	if (!map->file->f_op->mmap)
+		return 0;
+
 	error = mmap_file(vma->vm_file, vma);
 	if (error) {
 		fput(vma->vm_file);
@@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	vma_iter_config(vmi, map->addr, map->end);
 	vma_set_range(vma, map->addr, map->end, map->pgoff);
 	vm_flags_init(vma, map->flags);
-	vma->vm_page_prot = vm_get_page_prot(map->flags);
+	vma->vm_page_prot = map->page_prot;
 
 	if (vma_iter_prealloc(vmi, vma)) {
 		error = -ENOMEM;
@@ -2528,17 +2538,70 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 	vma_set_page_prot(vma);
 }
 
+/*
+ * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that
+ * specifies it.
+ *
+ * This is called prior to any merge attempt, and updates whitelisted fields
+ * that are permitted to be updated by the caller.
+ *
+ * All but user-defined fields will be pre-populated with original values.
+ *
+ * Returns 0 on success, or an error code otherwise.
+ */
+static int call_mmap_prepare(struct mmap_state *map)
+{
+	int err;
+	struct vm_area_desc desc = {
+		.mm = map->mm,
+		.start = map->addr,
+		.end = map->end,
+
+		.pgoff = map->pgoff,
+		.file = map->file,
+		.vm_flags = map->flags,
+		.page_prot = map->page_prot,
+	};
+
+	/* Invoke the hook. */
+	err = __call_mmap_prepare(map->file, &desc);
+	if (err)
+		return err;
+
+	/* Update fields permitted to be changed. */
+	map->pgoff = desc.pgoff;
+	map->file = desc.file;
+	map->flags = desc.vm_flags;
+	map->page_prot = desc.page_prot;
+	/* User-defined fields. */
+	map->vm_ops = desc.vm_ops;
+	map->vm_private_data = desc.private_data;
+
+	return 0;
+}
+
+static void set_vma_user_defined_fields(struct vm_area_struct *vma,
+					struct mmap_state *map)
+{
+	if (map->vm_ops)
+		vma->vm_ops = map->vm_ops;
+	vma->vm_private_data = map->vm_private_data;
+}
+
 static unsigned long __mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
 		struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	int error;
+	bool have_mmap_prepare = file && file->f_op->mmap_prepare;
 	VMA_ITERATOR(vmi, mm, addr);
 	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
 
 	error = __mmap_prepare(&map, uf);
+	if (!error && have_mmap_prepare)
+		error = call_mmap_prepare(&map);
 	if (error)
 		goto abort_munmap;
@@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 		goto unacct_error;
 	}
 
+	if (have_mmap_prepare)
+		set_vma_user_defined_fields(vma, &map);
+
 	/* If flags changed, we might be able to merge, so try again. */
 	if (map.retry_merge) {
 		struct vm_area_struct *merged;
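[Editor's illustration] To make the ordering explicit, the new __mmap_region() flow can be condensed as follows. This is illustrative pseudocode summarising the hunks above, not additional patch content:

	error = __mmap_prepare(&map, uf);	 /* existing early checks */
	if (!error && have_mmap_prepare)
		error = call_mmap_prepare(&map); /* driver mutates map state only */
	if (error)
		goto abort_munmap;		 /* simple unwind: no VMA exists yet */

	/* ... the merge attempt / new VMA creation now sees the driver's
	 * final flags, so no retry merge is required for this path ... */

	if (have_mmap_prepare)
		set_vma_user_defined_fields(vma, &map);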

tools/testing/vma/vma_internal.h

Lines changed: 61 additions & 5 deletions
@@ -253,8 +253,40 @@ struct mm_struct {
 	unsigned long flags; /* Must use atomic bitops to access */
 };
 
+struct vm_area_struct;
+
+/*
+ * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
+ * manipulate mutable fields which will cause those fields to be updated in the
+ * resultant VMA.
+ *
+ * Helper functions are not required for manipulating any field.
+ */
+struct vm_area_desc {
+	/* Immutable state. */
+	struct mm_struct *mm;
+	unsigned long start;
+	unsigned long end;
+
+	/* Mutable fields. Populated with initial state. */
+	pgoff_t pgoff;
+	struct file *file;
+	vm_flags_t vm_flags;
+	pgprot_t page_prot;
+
+	/* Write-only fields. */
+	const struct vm_operations_struct *vm_ops;
+	void *private_data;
+};
+
+struct file_operations {
+	int (*mmap)(struct file *, struct vm_area_struct *);
+	int (*mmap_prepare)(struct vm_area_desc *);
+};
+
 struct file {
 	struct address_space *f_mapping;
+	const struct file_operations *f_op;
 };
 
 #define VMA_LOCK_OFFSET 0x40000000
@@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
 	vma->__vm_flags &= ~flags;
 }
 
-static inline int call_mmap(struct file *, struct vm_area_struct *)
-{
-	return 0;
-}
-
 static inline int shmem_zero_setup(struct vm_area_struct *)
 {
 	return 0;
@@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
 	(void)vma;
 }
 
+/* Did the driver provide valid mmap hook configuration? */
+static inline bool file_has_valid_mmap_hooks(struct file *file)
+{
+	bool has_mmap = file->f_op->mmap;
+	bool has_mmap_prepare = file->f_op->mmap_prepare;
+
+	/* Hooks are mutually exclusive. */
+	if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
+		return false;
+	if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
+		return false;
+
+	return true;
+}
+
+static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (WARN_ON_ONCE(file->f_op->mmap_prepare))
+		return -EINVAL;
+
+	return file->f_op->mmap(file, vma);
+}
+
+static inline int __call_mmap_prepare(struct file *file,
+				      struct vm_area_desc *desc)
+{
+	return file->f_op->mmap_prepare(desc);
+}
+
 #endif /* __MM_VMA_INTERNAL_H */
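[Editor's illustration] Finally, a sketch of how the updated stubs might be exercised from the userland VMA tests. The test function is hypothetical and assumes the harness's ASSERT_* macros are available:

	/* Hypothetical hook making an arbitrary whitelisted change. */
	static int test_prepare_hook(struct vm_area_desc *desc)
	{
		desc->vm_flags |= VM_DONTDUMP;
		return 0;
	}

	static bool test_mmap_hooks(void)
	{
		const struct file_operations fops = {
			.mmap_prepare = test_prepare_hook,
		};
		struct file file = { .f_op = &fops };
		struct vm_area_desc desc = { .start = 0x1000, .end = 0x2000 };

		/* Exactly one hook is set, so the configuration is valid. */
		ASSERT_TRUE(file_has_valid_mmap_hooks(&file));

		/* The hook's changes land in the descriptor. */
		ASSERT_EQ(__call_mmap_prepare(&file, &desc), 0);
		ASSERT_TRUE(desc.vm_flags & VM_DONTDUMP);

		return true;
	}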
