Skip to content

Commit 088737f

Browse files
committed
Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty
2 parents 33198c1 + 333427a commit 088737f

File tree

24 files changed

+572
-63
lines changed

24 files changed

+572
-63
lines changed

Documentation/filesystems/vfs.txt

Lines changed: 41 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -576,7 +576,43 @@ should clear PG_Dirty and set PG_Writeback. It can be actually
576576
written at any point after PG_Dirty is clear. Once it is known to be
577577
safe, PG_Writeback is cleared.
578578

579-
Writeback makes use of a writeback_control structure...
579+
Writeback makes use of a writeback_control structure to direct the
580+
operations. This gives the the writepage and writepages operations some
581+
information about the nature of and reason for the writeback request,
582+
and the constraints under which it is being done. It is also used to
583+
return information back to the caller about the result of a writepage or
584+
writepages request.
585+
586+
Handling errors during writeback
587+
--------------------------------
588+
Most applications that do buffered I/O will periodically call a file
589+
synchronization call (fsync, fdatasync, msync or sync_file_range) to
590+
ensure that data written has made it to the backing store. When there
591+
is an error during writeback, they expect that error to be reported when
592+
a file sync request is made. After an error has been reported on one
593+
request, subsequent requests on the same file descriptor should return
594+
0, unless further writeback errors have occurred since the previous file
595+
syncronization.
596+
597+
Ideally, the kernel would report errors only on file descriptions on
598+
which writes were done that subsequently failed to be written back. The
599+
generic pagecache infrastructure does not track the file descriptions
600+
that have dirtied each individual page however, so determining which
601+
file descriptors should get back an error is not possible.
602+
603+
Instead, the generic writeback error tracking infrastructure in the
604+
kernel settles for reporting errors to fsync on all file descriptions
605+
that were open at the time that the error occurred. In a situation with
606+
multiple writers, all of them will get back an error on a subsequent fsync,
607+
even if all of the writes done through that particular file descriptor
608+
succeeded (or even if there were no writes on that file descriptor at all).
609+
610+
Filesystems that wish to use this infrastructure should call
611+
mapping_set_error to record the error in the address_space when it
612+
occurs. Then, after writing back data from the pagecache in their
613+
file->fsync operation, they should call file_check_and_advance_wb_err to
614+
ensure that the struct file's error cursor has advanced to the correct
615+
point in the stream of errors emitted by the backing device(s).
580616

581617
struct address_space_operations
582618
-------------------------------
@@ -804,7 +840,8 @@ struct address_space_operations {
804840
The File Object
805841
===============
806842

807-
A file object represents a file opened by a process.
843+
A file object represents a file opened by a process. This is also known
844+
as an "open file description" in POSIX parlance.
808845

809846

810847
struct file_operations
@@ -887,7 +924,8 @@ otherwise noted.
887924

888925
release: called when the last reference to an open file is closed
889926

890-
fsync: called by the fsync(2) system call
927+
fsync: called by the fsync(2) system call. Also see the section above
928+
entitled "Handling errors during writeback".
891929

892930
fasync: called by the fcntl(2) system call when asynchronous
893931
(non-blocking) mode is enabled for a file

MAINTAINERS

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5069,6 +5069,12 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/kristoffer/linux-hpc.git
50695069
F: drivers/video/fbdev/s1d13xxxfb.c
50705070
F: include/video/s1d13xxxfb.h
50715071

5072+
ERRSEQ ERROR TRACKING INFRASTRUCTURE
5073+
M: Jeff Layton <[email protected]>
5074+
S: Maintained
5075+
F: lib/errseq.c
5076+
F: include/linux/errseq.h
5077+
50725078
ET131X NETWORK DRIVER
50735079
M: Mark Einon <[email protected]>
50745080
S: Odd Fixes

drivers/dax/device.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
499499
inode->i_mapping = __dax_inode->i_mapping;
500500
inode->i_mapping->host = __dax_inode;
501501
filp->f_mapping = inode->i_mapping;
502+
filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
502503
filp->private_data = dev_dax;
503504
inode->i_flags = S_DAX;
504505

fs/block_dev.c

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -632,7 +632,7 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
632632
struct block_device *bdev = I_BDEV(bd_inode);
633633
int error;
634634

635-
error = filemap_write_and_wait_range(filp->f_mapping, start, end);
635+
error = file_write_and_wait_range(filp, start, end);
636636
if (error)
637637
return error;
638638

@@ -1751,6 +1751,7 @@ static int blkdev_open(struct inode * inode, struct file * filp)
17511751
return -ENOMEM;
17521752

17531753
filp->f_mapping = bdev->bd_inode->i_mapping;
1754+
filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
17541755

17551756
return blkdev_get(bdev, filp->f_mode, filp);
17561757
}

fs/btrfs/file.c

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2032,7 +2032,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
20322032
struct btrfs_root *root = BTRFS_I(inode)->root;
20332033
struct btrfs_trans_handle *trans;
20342034
struct btrfs_log_ctx ctx;
2035-
int ret = 0;
2035+
int ret = 0, err;
20362036
bool full_sync = 0;
20372037
u64 len;
20382038

@@ -2051,7 +2051,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
20512051
*/
20522052
ret = start_ordered_ops(inode, start, end);
20532053
if (ret)
2054-
return ret;
2054+
goto out;
20552055

20562056
inode_lock(inode);
20572057
atomic_inc(&root->log_batch);
@@ -2156,10 +2156,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
21562156
* An ordered extent might have started before and completed
21572157
* already with io errors, in which case the inode was not
21582158
* updated and we end up here. So check the inode's mapping
2159-
* flags for any errors that might have happened while doing
2160-
* writeback of file data.
2159+
* for any errors that might have happened since we last
2160+
* checked called fsync.
21612161
*/
2162-
ret = filemap_check_errors(inode->i_mapping);
2162+
ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err);
21632163
inode_unlock(inode);
21642164
goto out;
21652165
}
@@ -2248,6 +2248,9 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
22482248
ret = btrfs_end_transaction(trans);
22492249
}
22502250
out:
2251+
err = file_check_and_advance_wb_err(file);
2252+
if (!ret)
2253+
ret = err;
22512254
return ret > 0 ? -EIO : ret;
22522255
}
22532256

fs/buffer.c

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -178,7 +178,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
178178
set_buffer_uptodate(bh);
179179
} else {
180180
buffer_io_error(bh, ", lost sync page write");
181-
set_buffer_write_io_error(bh);
181+
mark_buffer_write_io_error(bh);
182182
clear_buffer_uptodate(bh);
183183
}
184184
unlock_buffer(bh);
@@ -352,8 +352,7 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
352352
set_buffer_uptodate(bh);
353353
} else {
354354
buffer_io_error(bh, ", lost async page write");
355-
mapping_set_error(page->mapping, -EIO);
356-
set_buffer_write_io_error(bh);
355+
mark_buffer_write_io_error(bh);
357356
clear_buffer_uptodate(bh);
358357
SetPageError(page);
359358
}
@@ -481,8 +480,6 @@ static void __remove_assoc_queue(struct buffer_head *bh)
481480
{
482481
list_del_init(&bh->b_assoc_buffers);
483482
WARN_ON(!bh->b_assoc_map);
484-
if (buffer_write_io_error(bh))
485-
set_bit(AS_EIO, &bh->b_assoc_map->flags);
486483
bh->b_assoc_map = NULL;
487484
}
488485

@@ -1181,6 +1178,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
11811178
}
11821179
EXPORT_SYMBOL(mark_buffer_dirty);
11831180

1181+
void mark_buffer_write_io_error(struct buffer_head *bh)
1182+
{
1183+
set_buffer_write_io_error(bh);
1184+
/* FIXME: do we need to set this in both places? */
1185+
if (bh->b_page && bh->b_page->mapping)
1186+
mapping_set_error(bh->b_page->mapping, -EIO);
1187+
if (bh->b_assoc_map)
1188+
mapping_set_error(bh->b_assoc_map, -EIO);
1189+
}
1190+
EXPORT_SYMBOL(mark_buffer_write_io_error);
1191+
11841192
/*
11851193
* Decrement a buffer_head's reference count. If all buffers against a page
11861194
* have zero reference count, are clean and unlocked, and if the page is clean
@@ -3282,8 +3290,6 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
32823290

32833291
bh = head;
32843292
do {
3285-
if (buffer_write_io_error(bh) && page->mapping)
3286-
mapping_set_error(page->mapping, -EIO);
32873293
if (buffer_busy(bh))
32883294
goto failed;
32893295
bh = bh->b_this_page;

fs/dax.c

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -855,8 +855,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
855855

856856
ret = dax_writeback_one(bdev, dax_dev, mapping,
857857
indices[i], pvec.pages[i]);
858-
if (ret < 0)
858+
if (ret < 0) {
859+
mapping_set_error(mapping, ret);
859860
goto out;
861+
}
860862
}
861863
start_index = indices[pvec.nr - 1] + 1;
862864
}

fs/ext2/file.c

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -174,15 +174,12 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
174174
{
175175
int ret;
176176
struct super_block *sb = file->f_mapping->host->i_sb;
177-
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
178177

179178
ret = generic_file_fsync(file, start, end, datasync);
180-
if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
179+
if (ret == -EIO)
181180
/* We don't really know where the IO error happened... */
182181
ext2_error(sb, __func__,
183182
"detected IO error when writing metadata buffers");
184-
ret = -EIO;
185-
}
186183
return ret;
187184
}
188185

fs/ext4/fsync.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -124,7 +124,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
124124
goto out;
125125
}
126126

127-
ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
127+
ret = file_write_and_wait_range(file, start, end);
128128
if (ret)
129129
return ret;
130130
/*

fs/file_table.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t mode,
168168
file->f_path = *path;
169169
file->f_inode = path->dentry->d_inode;
170170
file->f_mapping = path->dentry->d_inode->i_mapping;
171+
file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
171172
if ((mode & FMODE_READ) &&
172173
likely(fop->read || fop->read_iter))
173174
mode |= FMODE_CAN_READ;

fs/gfs2/lops.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -180,7 +180,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
180180
bh = bh->b_this_page;
181181
do {
182182
if (error)
183-
set_buffer_write_io_error(bh);
183+
mark_buffer_write_io_error(bh);
184184
unlock_buffer(bh);
185185
next = bh->b_this_page;
186186
size -= bh->b_size;

fs/jbd2/commit.c

Lines changed: 4 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -263,18 +263,10 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
263263
continue;
264264
jinode->i_flags |= JI_COMMIT_RUNNING;
265265
spin_unlock(&journal->j_list_lock);
266-
err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
267-
if (err) {
268-
/*
269-
* Because AS_EIO is cleared by
270-
* filemap_fdatawait_range(), set it again so
271-
* that user process can get -EIO from fsync().
272-
*/
273-
mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
274-
275-
if (!ret)
276-
ret = err;
277-
}
266+
err = filemap_fdatawait_keep_errors(
267+
jinode->i_vfs_inode->i_mapping);
268+
if (!ret)
269+
ret = err;
278270
spin_lock(&journal->j_list_lock);
279271
jinode->i_flags &= ~JI_COMMIT_RUNNING;
280272
smp_mb();

fs/libfs.c

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -974,7 +974,7 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
974974
int err;
975975
int ret;
976976

977-
err = filemap_write_and_wait_range(inode->i_mapping, start, end);
977+
err = file_write_and_wait_range(file, start, end);
978978
if (err)
979979
return err;
980980

@@ -991,6 +991,10 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
991991

992992
out:
993993
inode_unlock(inode);
994+
/* check and advance again to catch errors after syncing out buffers */
995+
err = file_check_and_advance_wb_err(file);
996+
if (ret == 0)
997+
ret = err;
994998
return ret;
995999
}
9961000
EXPORT_SYMBOL(__generic_file_fsync);

fs/open.c

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -707,6 +707,9 @@ static int do_dentry_open(struct file *f,
707707
f->f_inode = inode;
708708
f->f_mapping = inode->i_mapping;
709709

710+
/* Ensure that we skip any errors that predate opening of the file */
711+
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
712+
710713
if (unlikely(f->f_flags & O_PATH)) {
711714
f->f_mode = FMODE_PATH;
712715
f->f_op = &empty_fops;

fs/xfs/xfs_file.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -140,7 +140,7 @@ xfs_file_fsync(
140140

141141
trace_xfs_file_fsync(ip);
142142

143-
error = filemap_write_and_wait_range(inode->i_mapping, start, end);
143+
error = file_write_and_wait_range(file, start, end);
144144
if (error)
145145
return error;
146146

include/linux/buffer_head.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -149,6 +149,7 @@ void buffer_check_dirty_writeback(struct page *page,
149149
*/
150150

151151
void mark_buffer_dirty(struct buffer_head *bh);
152+
void mark_buffer_write_io_error(struct buffer_head *bh);
152153
void init_buffer(struct buffer_head *, bh_end_io_t *, void *);
153154
void touch_buffer(struct buffer_head *bh);
154155
void set_bh_page(struct buffer_head *bh,

include/linux/errseq.h

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
#ifndef _LINUX_ERRSEQ_H
2+
#define _LINUX_ERRSEQ_H
3+
4+
/* See lib/errseq.c for more info */
5+
6+
typedef u32 errseq_t;
7+
8+
errseq_t __errseq_set(errseq_t *eseq, int err);
9+
static inline void errseq_set(errseq_t *eseq, int err)
10+
{
11+
/* Optimize for the common case of no error */
12+
if (unlikely(err))
13+
__errseq_set(eseq, err);
14+
}
15+
16+
errseq_t errseq_sample(errseq_t *eseq);
17+
int errseq_check(errseq_t *eseq, errseq_t since);
18+
int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
19+
#endif

0 commit comments

Comments
 (0)