Commit a939888

minchank authored and torvalds committed
zram: support idle/huge page writeback
Add a new feature "zram idle/huge page writeback". In the zram-swap use case, zram usually has many idle/huge swap pages. It's pointless to keep them in memory (i.e., zram). To solve this problem, this feature introduces idle/huge page writeback to the backing device, so the goal is to save more memory space on embedded systems.

The normal sequence to use the idle/huge page writeback feature is as follows:

	while (1) {
		# mark allocated zram slots as idle
		echo all > /sys/block/zram0/idle
		# leave system working for several hours
		# Unless there is an access to some blocks on zram,
		# they are still IDLE-marked pages.
		echo "idle" > /sys/block/zram0/writeback
		or/and
		echo "huge" > /sys/block/zram0/writeback
		# write the IDLE- or/and HUGE-marked slots to the backing
		# device and free the memory.
	}

Per the discussion at https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u, this patch removes the direct incompressible page writeback feature (d2afd25114f4 ("zram: write incompressible pages to backing device")).

Concerns from Sergey:
== &< ==
"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and uncontrollable; it depends on data patterns and compression algorithms, while "IDLE writeback" is predictable.

I even suspect that, *ideally*, we can remove "incompressible writeback". "IDLE pages" is a superset which also includes "incompressible" pages. So, technically, we can still do "incompressible writeback" from the "IDLE writeback" path, but a much more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback" around. ZRAM is especially popular on devices which do suffer from flash wearout, so I can see the "incompressible writeback" path becoming dead code, long term.
== &< ==

Concerns from Minchan:
== &< ==
My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this implementation, both hugepage and idlepage writeback will turn on. However, some users want to enable only idlepage writeback, so we would need to introduce an on/off knob for hugepage, or a new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those use cases. I don't want to make it complicated *if possible*.

Long term, I imagine we need to make the VM aware of a new swap hierarchy, a little different from the current one. For example, a first high-priority swap device could return -EIO or -ENOCOMP, and swap would then fall back to the next lower-priority swap device. With that, hugepage writeback would work transparently.

So we could regard it as a regression, because incompressible pages no longer go to the backing storage automatically. Instead, the user should do it manually via "echo huge > /sys/block/zram/writeback".
== &< ==

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Joey Pabalinas <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent e82592c commit a939888

File tree

5 files changed (+209 additions, -79 deletions)

Documentation/ABI/testing/sysfs-block-zram

Lines changed: 7 additions & 0 deletions

@@ -106,3 +106,10 @@ Description:
 		idle file is write-only and mark zram slot as idle.
 		If system has mounted debugfs, user can see which slots
 		are idle via /sys/kernel/debug/zram/zram<id>/block_state
+
+What:		/sys/block/zram<id>/writeback
+Date:		November 2018
+Contact:	Minchan Kim <[email protected]>
+Description:
+		The writeback file is write-only and trigger idle and/or
+		huge page writeback to backing device.

Documentation/blockdev/zram.txt

Lines changed: 24 additions & 4 deletions

@@ -238,11 +238,31 @@ line of text and contains the following stats separated by whitespace:
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+	"echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before disksize setting. It supports only partition at this moment.
+If admin want to use incompressible page writeback, they could do via
+
+	"echo huge > /sys/block/zramX/write"
+
+To use idle page writeback, first, user need to declare zram pages
+as idle.
+
+	"echo all > /sys/block/zramX/idle"
+
+From now on, any pages on zram are idle pages. The idle mark
+will be removed until someone request access of the block.
+IOW, unless there is access request, those pages are still idle pages.
+
+Admin can request writeback of those idle pages at right timing via
+
+	"echo idle > /sys/block/zramX/writeback"
+
+With the command, zram writeback idle pages from memory to the storage.
 
 = memory tracking
 
drivers/block/zram/Kconfig

Lines changed: 4 additions & 1 deletion

@@ -15,14 +15,17 @@ config ZRAM
 	  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-	bool "Write back incompressible page to backing device"
+	bool "Write back incompressible or idle page to backing device"
	depends on ZRAM
	help
	  With incompressible page, there is no memory saving to keep it
	  in memory. Instead, write it out to backing device.
	  For this feature, admin should set up backing device via
	  /sys/block/zramX/backing_dev.
 
+	  With /sys/block/zramX/{idle,writeback}, application could ask
+	  idle page's writeback to the backing device to save in memory.
+
	  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_MEMORY_TRACKING

drivers/block/zram/zram_drv.c

Lines changed: 173 additions & 74 deletions

@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
 static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+				u32 index, int offset, struct bio *bio);
+
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram)
	return zram->disksize;
 }
 
-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-	return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
-			zram->table[index].handle;
-}
-
 static inline struct zram *dev_to_zram(struct device *dev)
 {
	return (struct zram *)dev_to_disk(dev)->private_data;
@@ -138,6 +134,13 @@ static void zram_set_obj_size(struct zram *zram,
	zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+	return zram_get_obj_size(zram, index) ||
+			zram_test_flag(zram, index, ZRAM_SAME) ||
+			zram_test_flag(zram, index, ZRAM_WB);
+}
+
 #if PAGE_SIZE != 4096
 static inline bool is_partial_io(struct bio_vec *bvec)
 {
@@ -308,10 +311,14 @@ static ssize_t idle_store(struct device *dev,
	}
 
	for (index = 0; index < nr_pages; index++) {
+		/*
+		 * Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race.
+		 * See the comment in writeback_store.
+		 */
		zram_slot_lock(zram, index);
-		if (!zram_allocated(zram, index))
+		if (!zram_allocated(zram, index) ||
+				zram_test_flag(zram, index, ZRAM_UNDER_WB))
			goto next;
-
		zram_set_flag(zram, index, ZRAM_IDLE);
 next:
		zram_slot_unlock(zram, index);
@@ -546,6 +553,158 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
	return 1;
 }
 
+#define HUGE_WRITEBACK 0x1
+#define IDLE_WRITEBACK 0x2
+
+static ssize_t writeback_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+	unsigned long index;
+	struct bio bio;
+	struct bio_vec bio_vec;
+	struct page *page;
+	ssize_t ret, sz;
+	char mode_buf[8];
+	unsigned long mode = -1UL;
+	unsigned long blk_idx = 0;
+
+	sz = strscpy(mode_buf, buf, sizeof(mode_buf));
+	if (sz <= 0)
+		return -EINVAL;
+
+	/* ignore trailing newline */
+	if (mode_buf[sz - 1] == '\n')
+		mode_buf[sz - 1] = 0x00;
+
+	if (!strcmp(mode_buf, "idle"))
+		mode = IDLE_WRITEBACK;
+	else if (!strcmp(mode_buf, "huge"))
+		mode = HUGE_WRITEBACK;
+
+	if (mode == -1UL)
+		return -EINVAL;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		ret = -EINVAL;
+		goto release_init_lock;
+	}
+
+	if (!zram->backing_dev) {
+		ret = -ENODEV;
+		goto release_init_lock;
+	}
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		ret = -ENOMEM;
+		goto release_init_lock;
+	}
+
+	for (index = 0; index < nr_pages; index++) {
+		struct bio_vec bvec;
+
+		bvec.bv_page = page;
+		bvec.bv_len = PAGE_SIZE;
+		bvec.bv_offset = 0;
+
+		if (!blk_idx) {
+			blk_idx = alloc_block_bdev(zram);
+			if (!blk_idx) {
+				ret = -ENOSPC;
+				break;
+			}
+		}
+
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index))
+			goto next;
+
+		if (zram_test_flag(zram, index, ZRAM_WB) ||
+				zram_test_flag(zram, index, ZRAM_SAME) ||
+				zram_test_flag(zram, index, ZRAM_UNDER_WB))
+			goto next;
+
+		if ((mode & IDLE_WRITEBACK &&
+			  !zram_test_flag(zram, index, ZRAM_IDLE)) &&
+		    (mode & HUGE_WRITEBACK &&
+			  !zram_test_flag(zram, index, ZRAM_HUGE)))
+			goto next;
+		/*
+		 * Clearing ZRAM_UNDER_WB is duty of caller.
+		 * IOW, zram_free_page never clear it.
+		 */
+		zram_set_flag(zram, index, ZRAM_UNDER_WB);
+		/* Need for hugepage writeback racing */
+		zram_set_flag(zram, index, ZRAM_IDLE);
+		zram_slot_unlock(zram, index);
+		if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
+			zram_slot_lock(zram, index);
+			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+			zram_clear_flag(zram, index, ZRAM_IDLE);
+			zram_slot_unlock(zram, index);
+			continue;
+		}
+
+		bio_init(&bio, &bio_vec, 1);
+		bio_set_dev(&bio, zram->bdev);
+		bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
+		bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
+
+		bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
+				bvec.bv_offset);
+		/*
+		 * XXX: A single page IO would be inefficient for write
+		 * but it would be not bad as starter.
+		 */
+		ret = submit_bio_wait(&bio);
+		if (ret) {
+			zram_slot_lock(zram, index);
+			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+			zram_clear_flag(zram, index, ZRAM_IDLE);
+			zram_slot_unlock(zram, index);
+			continue;
+		}
+
+		/*
+		 * We released zram_slot_lock so need to check if the slot was
+		 * changed. If there is freeing for the slot, we can catch it
+		 * easily by zram_allocated.
+		 * A subtle case is the slot is freed/reallocated/marked as
+		 * ZRAM_IDLE again. To close the race, idle_store doesn't
+		 * mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB.
+		 * Thus, we could close the race by checking ZRAM_IDLE bit.
+		 */
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index) ||
+			  !zram_test_flag(zram, index, ZRAM_IDLE)) {
+			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+			zram_clear_flag(zram, index, ZRAM_IDLE);
+			goto next;
+		}
+
+		zram_free_page(zram, index);
+		zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+		zram_set_flag(zram, index, ZRAM_WB);
+		zram_set_element(zram, index, blk_idx);
+		blk_idx = 0;
+		atomic64_inc(&zram->stats.pages_stored);
+next:
+		zram_slot_unlock(zram, index);
+	}
+
+	if (blk_idx)
+		free_block_bdev(zram, blk_idx);
+	ret = len;
+	__free_page(page);
+release_init_lock:
+	up_read(&zram->init_lock);
+
+	return ret;
+}
+
 struct zram_work {
	struct work_struct work;
	struct zram *zram;
@@ -603,57 +762,8 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
	else
		return read_from_bdev_async(zram, bvec, entry, parent);
 }
-
-static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
-				u32 index, struct bio *parent,
-				unsigned long *pentry)
-{
-	struct bio *bio;
-	unsigned long entry;
-
-	bio = bio_alloc(GFP_ATOMIC, 1);
-	if (!bio)
-		return -ENOMEM;
-
-	entry = alloc_block_bdev(zram);
-	if (!entry) {
-		bio_put(bio);
-		return -ENOSPC;
-	}
-
-	bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
-	bio_set_dev(bio, zram->bdev);
-	if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				bvec->bv_offset)) {
-		bio_put(bio);
-		free_block_bdev(zram, entry);
-		return -EIO;
-	}
-
-	if (!parent) {
-		bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
-		bio->bi_end_io = zram_page_end_io;
-	} else {
-		bio->bi_opf = parent->bi_opf;
-		bio_chain(bio, parent);
-	}
-
-	submit_bio(bio);
-	*pentry = entry;
-
-	return 0;
-}
-
 #else
 static inline void reset_bdev(struct zram *zram) {};
-static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
-				u32 index, struct bio *parent,
-				unsigned long *pentry)
-{
-	return -EIO;
-}
-
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
			unsigned long entry, struct bio *parent, bool sync)
 {
@@ -1006,7 +1116,8 @@ static void zram_free_page(struct zram *zram, size_t index)
	atomic64_dec(&zram->stats.pages_stored);
	zram_set_handle(zram, index, 0);
	zram_set_obj_size(zram, index, 0);
-	WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK));
+	WARN_ON_ONCE(zram->table[index].flags &
+			~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
 }
 
 static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
@@ -1115,7 +1226,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
	struct page *page = bvec->bv_page;
	unsigned long element = 0;
	enum zram_pageflags flags = 0;
-	bool allow_wb = true;
 
	mem = kmap_atomic(page);
	if (page_same_filled(mem, &element)) {
@@ -1140,21 +1250,8 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
		return ret;
	}
 
-	if (unlikely(comp_len >= huge_class_size)) {
+	if (comp_len >= huge_class_size)
		comp_len = PAGE_SIZE;
-		if (zram->backing_dev && allow_wb) {
-			zcomp_stream_put(zram->comp);
-			ret = write_to_bdev(zram, bvec, index, bio, &element);
-			if (!ret) {
-				flags = ZRAM_WB;
-				ret = 1;
-				goto out;
-			}
-			allow_wb = false;
-			goto compress_again;
-		}
-	}
-
	/*
	 * handle allocation has 2 paths:
	 * a) fast path is executed with preemption disabled (for
@@ -1643,6 +1740,7 @@ static DEVICE_ATTR_RW(max_comp_streams);
 static DEVICE_ATTR_RW(comp_algorithm);
 #ifdef CONFIG_ZRAM_WRITEBACK
 static DEVICE_ATTR_RW(backing_dev);
+static DEVICE_ATTR_WO(writeback);
 #endif
 
 static struct attribute *zram_disk_attrs[] = {
@@ -1657,6 +1755,7 @@ static struct attribute *zram_disk_attrs[] = {
	&dev_attr_comp_algorithm.attr,
 #ifdef CONFIG_ZRAM_WRITEBACK
	&dev_attr_backing_dev.attr,
+	&dev_attr_writeback.attr,
 #endif
	&dev_attr_io_stat.attr,
	&dev_attr_mm_stat.attr,
