Commit 932f4a6

weiny2 authored and torvalds committed
mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Patch series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast(), which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm() call.

[1] https://lkml.org/lkml/2019/3/19/939

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers would like to do a longterm (user controlled) pin of pages with the fast variant of GUP for performance purposes.

Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for filesystems, and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration.

As a side effect some of the interfaces are cleaned up, but this is not the primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First, "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on whether we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name.

Second, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they rework the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast.

As an aside, Jason pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment.
</quote>

[1] https://lore.kernel.org/lkml/[email protected]/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mike Marshall <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
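The core idea of the message, replacing a dedicated get_user_pages_longterm() entry point with a flag passed to the ordinary call, can be illustrated with a small userspace model. This is only a sketch and not kernel code: `struct mock_vma`, `mock_get_user_pages()` and `MOCK_ERR_FSDAX` are hypothetical names invented for the example, the value of FOLL_WRITE is assumed (it is not part of this diff), and the real kernel pins pages and walks VMAs in ways this model does not attempt to reproduce.

```c
#include <stdbool.h>
#include <stddef.h>

/* FOLL_LONGTERM matches the value added to include/linux/mm.h by this commit.
 * FOLL_WRITE is not shown in the diff; its value here is an assumption. */
#define FOLL_WRITE    0x01u
#define FOLL_LONGTERM 0x10000u

/* Hypothetical error code for the model; the commit text names no errno. */
#define MOCK_ERR_FSDAX 1

/* Hypothetical stand-in for struct vm_area_struct: only tracks FS DAX. */
struct mock_vma {
	bool is_fsdax;
};

/*
 * Model of a single GUP entry point that takes the longterm request as a
 * flag. Without FOLL_LONGTERM every "page" is pinned. With FOLL_LONGTERM
 * the call fails if any backing VMA is FS DAX, which is the protection the
 * commit message describes.
 */
long mock_get_user_pages(const struct mock_vma *vmas, long nr_pages,
			 unsigned int gup_flags)
{
	if (gup_flags & FOLL_LONGTERM) {
		for (long i = 0; i < nr_pages; i++) {
			if (vmas[i].is_fsdax)
				return -MOCK_ERR_FSDAX;
		}
	}
	return nr_pages; /* number of pages "pinned" */
}
```

A caller that previously used the separate longterm function would, in this model, pass `FOLL_WRITE | FOLL_LONGTERM` to `mock_get_user_pages()`, which mirrors the shape of the call-site changes in the diffs below.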
1 parent a222f34 commit 932f4a6

11 files changed: +173 -108 lines changed

arch/powerpc/mm/book3s64/iommu_api.c

Lines changed: 3 additions & 2 deletions
@@ -141,8 +141,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (entry = 0; entry < entries; entry += chunk) {
 		unsigned long n = min(entries - entry, chunk);
 
-		ret = get_user_pages_longterm(ua + (entry << PAGE_SHIFT), n,
-				FOLL_WRITE, mem->hpages + entry, NULL);
+		ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+				FOLL_WRITE | FOLL_LONGTERM,
+				mem->hpages + entry, NULL);
 		if (ret == n) {
 			pinned += n;
 			continue;

drivers/infiniband/core/umem.c

Lines changed: 3 additions & 2 deletions
@@ -295,10 +295,11 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 
 	while (npages) {
 		down_read(&mm->mmap_sem);
-		ret = get_user_pages_longterm(cur_base,
+		ret = get_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     gup_flags, page_list, NULL);
+				     gup_flags | FOLL_LONGTERM,
+				     page_list, NULL);
 		if (ret < 0) {
 			up_read(&mm->mmap_sem);
 			goto umem_release;

drivers/infiniband/hw/qib/qib_user_pages.c

Lines changed: 4 additions & 4 deletions
@@ -114,10 +114,10 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 
 	down_read(&current->mm->mmap_sem);
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages_longterm(start_page + got * PAGE_SIZE,
-					      num_pages - got,
-					      FOLL_WRITE | FOLL_FORCE,
-					      p + got, NULL);
+		ret = get_user_pages(start_page + got * PAGE_SIZE,
+				     num_pages - got,
+				     FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
+				     p + got, NULL);
 		if (ret < 0) {
 			up_read(&current->mm->mmap_sem);
 			goto bail_release;

drivers/infiniband/hw/usnic/usnic_uiom.c

Lines changed: 5 additions & 4 deletions
@@ -143,10 +143,11 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages_longterm(cur_base,
-					min_t(unsigned long, npages,
-					PAGE_SIZE / sizeof(struct page *)),
-					gup_flags, page_list, NULL);
+		ret = get_user_pages(cur_base,
+				     min_t(unsigned long, npages,
+				     PAGE_SIZE / sizeof(struct page *)),
+				     gup_flags | FOLL_LONGTERM,
+				     page_list, NULL);
 
 		if (ret < 0)
 			goto out;

drivers/media/v4l2-core/videobuf-dma-sg.c

Lines changed: 3 additions & 3 deletions
@@ -186,12 +186,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
-			     flags, dma->pages, NULL);
+	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+			     flags | FOLL_LONGTERM, dma->pages, NULL);
 
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;
-		dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+		dprintk(1, "get_user_pages: err=%d [%d]\n", err,
 			dma->nr_pages);
 		return err < 0 ? err : -EINVAL;
 	}

drivers/vfio/vfio_iommu_type1.c

Lines changed: 2 additions & 1 deletion
@@ -358,7 +358,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 	down_read(&mm->mmap_sem);
 	if (mm == current->mm) {
-		ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas);
+		ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
+				     vmas);
 	} else {
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
 					    vmas, NULL);

fs/io_uring.c

Lines changed: 3 additions & 2 deletions
@@ -2697,8 +2697,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ret = 0;
 		down_read(&current->mm->mmap_sem);
-		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
-						pages, vmas);
+		pret = get_user_pages(ubuf, nr_pages,
+				      FOLL_WRITE | FOLL_LONGTERM,
+				      pages, vmas);
 		if (pret == nr_pages) {
 			/* don't support file backed memory */
 			for (j = 0; j < nr_pages; j++) {

include/linux/mm.h

Lines changed: 28 additions & 13 deletions
@@ -1505,19 +1505,6 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
 
-#if defined(CONFIG_FS_DAX) || defined(CONFIG_CMA)
-long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
-			    unsigned int gup_flags, struct page **pages,
-			    struct vm_area_struct **vmas);
-#else
-static inline long get_user_pages_longterm(unsigned long start,
-		unsigned long nr_pages, unsigned int gup_flags,
-		struct page **pages, struct vm_area_struct **vmas)
-{
-	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
-}
-#endif /* CONFIG_FS_DAX */
-
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
@@ -2583,6 +2570,34 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
+#define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+
+/*
+ * NOTE on FOLL_LONGTERM:
+ *
+ * FOLL_LONGTERM indicates that the page will be held for an indefinite time
+ * period _often_ under userspace control.  This is contrasted with
+ * iov_iter_get_pages() where usages which are transient.
+ *
+ * FIXME: For pages which are part of a filesystem, mappings are subject to the
+ * lifetime enforced by the filesystem and we need guarantees that longterm
+ * users like RDMA and V4L2 only establish mappings which coordinate usage with
+ * the filesystem.  Ideas for this coordination include revoking the longterm
+ * pin, delaying writeback, bounce buffer page writeback, etc.  As FS DAX was
+ * added after the problem with filesystems was found FS DAX VMAs are
+ * specifically failed.  Filesystem pages are still subject to bugs and use of
+ * FOLL_LONGTERM should be avoided on those pages.
+ *
+ * FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
+ * Currently only get_user_pages() and get_user_pages_fast() support this flag
+ * and calls to get_user_pages_[un]locked are specifically not allowed.  This
+ * is due to an incompatibility with the FS DAX check and
+ * FAULT_FLAG_ALLOW_RETRY
+ *
+ * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
+ * that region.  And so CMA attempts to migrate the page before pinning when
+ * FOLL_LONGTERM is specified.
+ */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
 {
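
The NOTE on FOLL_LONGTERM in the hunk above states two things that can be checked mechanically: the new flag occupies its own bit next to the existing FOLL_* values, and only get_user_pages() and get_user_pages_fast() accept it. The following is a hedged userspace sketch of those two rules, not kernel code. The flag values are copied from the hunk (with a `u` suffix added for the example), while `enum mock_gup_variant`, `is_single_bit()` and `mock_gup_flags_allowed()` are hypothetical names that do not exist in the kernel.

```c
#include <stdbool.h>

/* Values copied from the include/linux/mm.h hunk; the u suffix is added
 * here so the example works on unsigned values. */
#define FOLL_REMOTE   0x2000u
#define FOLL_COW      0x4000u
#define FOLL_ANON     0x8000u
#define FOLL_LONGTERM 0x10000u

/* Hypothetical labels for the GUP entry points named in the comment. */
enum mock_gup_variant {
	MOCK_GUP,          /* get_user_pages() */
	MOCK_GUP_FAST,     /* get_user_pages_fast() */
	MOCK_GUP_LOCKED,   /* get_user_pages_locked() */
	MOCK_GUP_UNLOCKED  /* get_user_pages_unlocked() */
};

/* True when exactly one bit is set, i.e. the value is usable as a flag. */
bool is_single_bit(unsigned int v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Model of the rule in the comment: FOLL_LONGTERM is only accepted by the
 * plain and fast variants; the _locked and _unlocked variants reject it.
 * Flag combinations without FOLL_LONGTERM are not restricted by this model.
 */
bool mock_gup_flags_allowed(enum mock_gup_variant variant,
			    unsigned int gup_flags)
{
	if (!(gup_flags & FOLL_LONGTERM))
		return true;
	return variant == MOCK_GUP || variant == MOCK_GUP_FAST;
}
```

Because 0x10000 is bit 16, it does not fit in a 16-bit quantity; the call sites in this commit pass flags as `unsigned int`, which is wide enough on the platforms Linux supports.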
