Commit 7bb46a6

npiggin@suse.de authored and Al Viro committed
fs: introduce new truncate sequence
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.

simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.

simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).

To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
  the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
  the event of write failure after allocation, so this must be performed
  in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
  cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
  variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
  to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.

Big problem with the previous calling sequence: the filesystem is not
called until i_size has already changed. This means it is not allowed to
fail the call, and also it does not know what the previous i_size was. Also,
generic code calling vmtruncate to truncate allocated blocks in case of
error had no good way to return a meaningful error (or, for example,
atomically handle block deallocation).

Cc: Christoph Hellwig <[email protected]>
Acked-by: Jan Kara <[email protected]>
Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Al Viro <[email protected]>
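
The ordering problem described in the commit message can be sketched in user space. The model below is only an illustration under stated assumptions: `toy_inode`, `toy_newsize_ok`, `toy_free_blocks` and `toy_setsize` are hypothetical names, not kernel APIs, and the exact order of steps inside a real ->setattr is up to each filesystem. It shows the two properties the new sequence allows: the filesystem sees the old size, and it can fail before the size changes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 4096

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	int64_t size;      /* models i_size */
	int64_t max_size;  /* models the filesystem's size limit */
	int64_t blocks;    /* number of allocated blocks */
	int fail_free;     /* when set, block freeing reports an I/O error */
};

/* Loosely models the role of inode_newsize_ok: validate before changing. */
static int toy_newsize_ok(const struct toy_inode *inode, int64_t newsize)
{
	if (newsize < 0)
		return -EINVAL;
	if (newsize > inode->max_size)
		return -EFBIG;
	return 0;
}

/* Filesystem-specific work (freeing blocks) that is allowed to fail. */
static int toy_free_blocks(struct toy_inode *inode, int64_t newsize)
{
	int64_t needed = (newsize + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;

	if (inode->fail_free)
		return -EIO;
	if (inode->blocks > needed)
		inode->blocks = needed;
	return 0;
}

/* ATTR_SIZE handling driven by the filesystem itself. */
static int toy_setsize(struct toy_inode *inode, int64_t newsize)
{
	int64_t oldsize = inode->size;  /* the old size is still known here */
	int err = toy_newsize_ok(inode, newsize);

	if (err)
		return err;
	if (newsize < oldsize) {
		err = toy_free_blocks(inode, newsize);
		if (err)
			return err;     /* size is left untouched on failure */
	}
	inode->size = newsize;
	return 0;
}
```

In this sketch a failed shrink leaves `size` at its previous value and the error reaches the caller, which the old setattr > vmtruncate > truncate ordering could not express.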
1 parent 7000d3c commit 7bb46a6

8 files changed, +300 -63 lines changed

Documentation/filesystems/vfs.txt

Lines changed: 6 additions & 1 deletion
@@ -401,11 +401,16 @@ otherwise noted.
 	started might not be in the page cache at the end of the
 	walk).

-  truncate: called by the VFS to change the size of a file. The
+  truncate: Deprecated. This will not be called if ->setsize is defined.
+	Called by the VFS to change the size of a file. The
 	i_size field of the inode is set to the desired size by the
 	VFS before this method is called. This method is called by
 	the truncate(2) system call and related functionality.

+	Note: ->truncate and vmtruncate are deprecated. Do not add new
+	instances/calls of these. Filesystems should be converted to do their
+	truncate sequence via ->setattr().
+
   permission: called by the VFS to check for access rights on a POSIX-like
 	filesystem.

fs/attr.c

Lines changed: 40 additions & 10 deletions
@@ -67,14 +67,14 @@ EXPORT_SYMBOL(inode_change_ok);
  * @offset: the new size to assign to the inode
  * @Returns: 0 on success, -ve errno on failure
  *
+ * inode_newsize_ok must be called with i_mutex held.
+ *
  * inode_newsize_ok will check filesystem limits and ulimits to check that the
  * new inode size is within limits. inode_newsize_ok will also send SIGXFSZ
  * when necessary. Caller must not proceed with inode size change if failure is
  * returned. @inode must be a file (not directory), with appropriate
  * permissions to allow truncate (inode_newsize_ok does NOT check these
  * conditions).
- *
- * inode_newsize_ok must be called with i_mutex held.
  */
 int inode_newsize_ok(const struct inode *inode, loff_t offset)
 {
@@ -104,17 +104,25 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);

-int inode_setattr(struct inode * inode, struct iattr * attr)
+/**
+ * generic_setattr - copy simple metadata updates into the generic inode
+ * @inode: the inode to be updated
+ * @attr: the new attributes
+ *
+ * generic_setattr must be called with i_mutex held.
+ *
+ * generic_setattr updates the inode's metadata with that specified
+ * in attr. Noticably missing is inode size update, which is more complex
+ * as it requires pagecache updates. See simple_setsize.
+ *
+ * The inode is not marked as dirty after this operation. The rationale is
+ * that for "simple" filesystems, the struct inode is the inode storage.
+ * The caller is free to mark the inode dirty afterwards if needed.
+ */
+void generic_setattr(struct inode *inode, const struct iattr *attr)
 {
 	unsigned int ia_valid = attr->ia_valid;

-	if (ia_valid & ATTR_SIZE &&
-	    attr->ia_size != i_size_read(inode)) {
-		int error = vmtruncate(inode, attr->ia_size);
-		if (error)
-			return error;
-	}
-
 	if (ia_valid & ATTR_UID)
 		inode->i_uid = attr->ia_uid;
 	if (ia_valid & ATTR_GID)
@@ -135,6 +143,28 @@ int inode_setattr(struct inode * inode, struct iattr * attr)
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
 	}
+}
+EXPORT_SYMBOL(generic_setattr);
+
+/*
+ * note this function is deprecated, the new truncate sequence should be
+ * used instead -- see eg. simple_setsize, generic_setattr.
+ */
+int inode_setattr(struct inode *inode, const struct iattr *attr)
+{
+	unsigned int ia_valid = attr->ia_valid;
+
+	if (ia_valid & ATTR_SIZE &&
+	    attr->ia_size != i_size_read(inode)) {
+		int error;
+
+		error = vmtruncate(inode, attr->ia_size);
+		if (error)
+			return error;
+	}
+
+	generic_setattr(inode, attr);
+
 	mark_inode_dirty(inode);

 	return 0;
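
The split between generic_setattr and the deprecated inode_setattr can be modelled in user space. This is a rough sketch, not kernel code: the `toy_*` types, the `TOY_ATTR_*` flag values and the simplified field set are all assumptions made for illustration (timestamps and the S_ISGID handling are left out). It shows the two documented properties: the generic copy ignores the size attribute and does not mark the inode dirty, while the legacy-style wrapper handles size first and then marks the inode dirty.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values; the kernel's ATTR_* constants are not reproduced. */
enum {
	TOY_ATTR_MODE = 1 << 0,
	TOY_ATTR_UID  = 1 << 1,
	TOY_ATTR_GID  = 1 << 2,
	TOY_ATTR_SIZE = 1 << 3,
};

/* Hypothetical stand-ins for struct iattr and struct inode. */
struct toy_iattr {
	unsigned int valid;   /* which fields below are meaningful */
	uint32_t uid, gid, mode;
	int64_t size;
};

struct toy_inode {
	uint32_t uid, gid, mode;
	int64_t size;
	int dirty;
};

/* Copies only the simple attributes selected by attr->valid. */
static void toy_generic_setattr(struct toy_inode *inode,
				const struct toy_iattr *attr)
{
	if (attr->valid & TOY_ATTR_UID)
		inode->uid = attr->uid;
	if (attr->valid & TOY_ATTR_GID)
		inode->gid = attr->gid;
	if (attr->valid & TOY_ATTR_MODE)
		inode->mode = attr->mode;
	/* TOY_ATTR_SIZE is deliberately ignored; dirty is not touched. */
}

/* Legacy-style wrapper: size change first, then the generic copy. */
static int toy_setattr(struct toy_inode *inode, const struct toy_iattr *attr)
{
	if ((attr->valid & TOY_ATTR_SIZE) && attr->size != inode->size) {
		if (attr->size < 0)
			return -EINVAL;  /* fail before anything is modified */
		inode->size = attr->size;
	}
	toy_generic_setattr(inode, attr);
	inode->dirty = 1;
	return 0;
}
```

A filesystem following the new sequence would call only the generic copy from its own ->setattr and decide for itself when the inode becomes dirty.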

fs/buffer.c

Lines changed: 98 additions & 25 deletions
@@ -1949,14 +1949,11 @@ static int __block_commit_write(struct inode *inode, struct page *page,
 }

 /*
- * block_write_begin takes care of the basic task of block allocation and
- * bringing partial write blocks uptodate first.
- *
- * If *pagep is not NULL, then block_write_begin uses the locked page
- * at *pagep rather than allocating its own. In this case, the page will
- * not be unlocked or deallocated on failure.
+ * Filesystems implementing the new truncate sequence should use the
+ * _newtrunc postfix variant which won't incorrectly call vmtruncate.
+ * The filesystem needs to handle block truncation upon failure.
  */
-int block_write_begin(struct file *file, struct address_space *mapping,
+int block_write_begin_newtrunc(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata,
 			get_block_t *get_block)
@@ -1992,20 +1989,50 @@ int block_write_begin(struct file *file, struct address_space *mapping,
 			unlock_page(page);
 			page_cache_release(page);
 			*pagep = NULL;
-
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size. Trim these off again. Don't need
-			 * i_size_read because we hold i_mutex.
-			 */
-			if (pos + len > inode->i_size)
-				vmtruncate(inode, inode->i_size);
 		}
 	}

 out:
 	return status;
 }
+EXPORT_SYMBOL(block_write_begin_newtrunc);
+
+/*
+ * block_write_begin takes care of the basic task of block allocation and
+ * bringing partial write blocks uptodate first.
+ *
+ * If *pagep is not NULL, then block_write_begin uses the locked page
+ * at *pagep rather than allocating its own. In this case, the page will
+ * not be unlocked or deallocated on failure.
+ */
+int block_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block)
+{
+	int ret;
+
+	ret = block_write_begin_newtrunc(file, mapping, pos, len, flags,
+					pagep, fsdata, get_block);
+
+	/*
+	 * prepare_write() may have instantiated a few blocks
+	 * outside i_size. Trim these off again. Don't need
+	 * i_size_read because we hold i_mutex.
+	 *
+	 * Filesystems which pass down their own page also cannot
+	 * call into vmtruncate here because it would lead to lock
+	 * inversion problems (*pagep is locked). This is a further
+	 * example of where the old truncate sequence is inadequate.
+	 */
+	if (unlikely(ret) && *pagep == NULL) {
+		loff_t isize = mapping->host->i_size;
+		if (pos + len > isize)
+			vmtruncate(mapping->host, isize);
+	}
+
+	return ret;
+}
 EXPORT_SYMBOL(block_write_begin);

 int block_write_end(struct file *file, struct address_space *mapping,
@@ -2324,7 +2351,7 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
  * For moronic filesystems that do not allow holes in file.
  * We may have to extend the file.
  */
-int cont_write_begin(struct file *file, struct address_space *mapping,
+int cont_write_begin_newtrunc(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata,
 			get_block_t *get_block, loff_t *bytes)
@@ -2345,11 +2372,30 @@ int cont_write_begin(struct file *file, struct address_space *mapping,
 	}

 	*pagep = NULL;
-	err = block_write_begin(file, mapping, pos, len,
+	err = block_write_begin_newtrunc(file, mapping, pos, len,
 				flags, pagep, fsdata, get_block);
 out:
 	return err;
 }
+EXPORT_SYMBOL(cont_write_begin_newtrunc);
+
+int cont_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block, loff_t *bytes)
+{
+	int ret;
+
+	ret = cont_write_begin_newtrunc(file, mapping, pos, len, flags,
+					pagep, fsdata, get_block, bytes);
+	if (unlikely(ret)) {
+		loff_t isize = mapping->host->i_size;
+		if (pos + len > isize)
+			vmtruncate(mapping->host, isize);
+	}
+
+	return ret;
+}
 EXPORT_SYMBOL(cont_write_begin);

 int block_prepare_write(struct page *page, unsigned from, unsigned to,
@@ -2381,7 +2427,7 @@ EXPORT_SYMBOL(block_commit_write);
  *
  * We are not allowed to take the i_mutex here so we have to play games to
  * protect against truncate races as the page could now be beyond EOF. Because
- * vmtruncate() writes the inode size before removing pages, once we have the
+ * truncate writes the inode size before removing pages, once we have the
  * page lock we can determine safely if the page is beyond EOF. If it is not
  * beyond EOF, then the page is guaranteed safe against truncation until we
  * unlock the page.
@@ -2464,10 +2510,11 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
 }

 /*
- * On entry, the page is fully not uptodate.
- * On exit the page is fully uptodate in the areas outside (from,to)
+ * Filesystems implementing the new truncate sequence should use the
+ * _newtrunc postfix variant which won't incorrectly call vmtruncate.
+ * The filesystem needs to handle block truncation upon failure.
  */
-int nobh_write_begin(struct file *file, struct address_space *mapping,
+int nobh_write_begin_newtrunc(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata,
 			get_block_t *get_block)
@@ -2500,8 +2547,8 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
 		unlock_page(page);
 		page_cache_release(page);
 		*pagep = NULL;
-		return block_write_begin(file, mapping, pos, len, flags, pagep,
-					fsdata, get_block);
+		return block_write_begin_newtrunc(file, mapping, pos, len,
+					flags, pagep, fsdata, get_block);
 	}

 	if (PageMappedToDisk(page))
@@ -2605,8 +2652,34 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
 	page_cache_release(page);
 	*pagep = NULL;

-	if (pos + len > inode->i_size)
-		vmtruncate(inode, inode->i_size);
+	return ret;
+}
+EXPORT_SYMBOL(nobh_write_begin_newtrunc);
+
+/*
+ * On entry, the page is fully not uptodate.
+ * On exit the page is fully uptodate in the areas outside (from,to)
+ */
+int nobh_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block)
+{
+	int ret;
+
+	ret = nobh_write_begin_newtrunc(file, mapping, pos, len, flags,
+					pagep, fsdata, get_block);
+
+	/*
+	 * prepare_write() may have instantiated a few blocks
+	 * outside i_size. Trim these off again. Don't need
+	 * i_size_read because we hold i_mutex.
+	 */
+	if (unlikely(ret)) {
+		loff_t isize = mapping->host->i_size;
+		if (pos + len > isize)
+			vmtruncate(mapping->host, isize);
+	}

 	return ret;
 }
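
The fs/buffer.c changes all follow one pattern: the real work moves into a _newtrunc variant that never trims, and the old name becomes a thin wrapper that trims on failure. A minimal user-space sketch of that pattern follows; `toy_file`, `toy_trim` and the two `toy_write_begin*` functions are hypothetical names, and "allocation" is reduced to a single byte counter purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a file: logical size plus allocated extent. */
struct toy_file {
	int64_t isize;      /* models i_size */
	int64_t allocated;  /* bytes of block space currently instantiated */
	int fail_write;     /* when set, the write setup fails after allocating */
};

/* Models trimming blocks beyond the given size. */
static void toy_trim(struct toy_file *f, int64_t size)
{
	if (f->allocated > size)
		f->allocated = size;
}

/*
 * New-style variant: may instantiate blocks past isize and may fail,
 * but never trims. Cleanup is left to the caller (the filesystem).
 */
static int toy_write_begin_newtrunc(struct toy_file *f, int64_t pos, int64_t len)
{
	int64_t end = pos + len;

	if (end > f->allocated)
		f->allocated = end;
	return f->fail_write ? -EIO : 0;
}

/* Legacy-style wrapper: same work, then trims to isize on failure. */
static int toy_write_begin(struct toy_file *f, int64_t pos, int64_t len)
{
	int ret = toy_write_begin_newtrunc(f, pos, len);

	if (ret && pos + len > f->isize)
		toy_trim(f, f->isize);
	return ret;
}
```

With the new-style variant the caller sees the error while the extra space is still allocated, so it can pick its own cleanup and report its own failure, which matches the commit message's point about handling errors in the filesystem code.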

fs/direct-io.c

Lines changed: 40 additions & 21 deletions
@@ -1134,27 +1134,8 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	return ret;
 }

-/*
- * This is a library function for use by filesystem drivers.
- *
- * The locking rules are governed by the flags parameter:
- * - if the flags value contains DIO_LOCKING we use a fancy locking
- *   scheme for dumb filesystems.
- *   For writes this function is called under i_mutex and returns with
- *   i_mutex held, for reads, i_mutex is not held on entry, but it is
- *   taken and dropped again before returning.
- *   For reads and writes i_alloc_sem is taken in shared mode and released
- *   on I/O completion (which may happen asynchronously after returning to
- *   the caller).
- *
- * - if the flags value does NOT contain DIO_LOCKING we don't use any
- *   internal locking but rather rely on the filesystem to synchronize
- *   direct I/O reads/writes versus each other and truncate.
- *   For reads and writes both i_mutex and i_alloc_sem are not held on
- *   entry and are never taken.
- */
 ssize_t
-__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+__blockdev_direct_IO_newtrunc(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
 	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
 	dio_submit_t submit_io, int flags)
@@ -1247,22 +1228,60 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 				nr_segs, blkbits, get_block, end_io,
 				submit_io, dio);

+out:
+	return retval;
+}
+EXPORT_SYMBOL(__blockdev_direct_IO_newtrunc);
+
+/*
+ * This is a library function for use by filesystem drivers.
+ *
+ * The locking rules are governed by the flags parameter:
+ * - if the flags value contains DIO_LOCKING we use a fancy locking
+ *   scheme for dumb filesystems.
+ *   For writes this function is called under i_mutex and returns with
+ *   i_mutex held, for reads, i_mutex is not held on entry, but it is
+ *   taken and dropped again before returning.
+ *   For reads and writes i_alloc_sem is taken in shared mode and released
+ *   on I/O completion (which may happen asynchronously after returning to
+ *   the caller).
+ *
+ * - if the flags value does NOT contain DIO_LOCKING we don't use any
+ *   internal locking but rather rely on the filesystem to synchronize
+ *   direct I/O reads/writes versus each other and truncate.
+ *   For reads and writes both i_mutex and i_alloc_sem are not held on
+ *   entry and are never taken.
+ */
+ssize_t
+__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, const struct iovec *iov, loff_t offset,
+	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
+	dio_submit_t submit_io, int flags)
+{
+	ssize_t retval;
+
+	retval = __blockdev_direct_IO_newtrunc(rw, iocb, inode, bdev, iov,
+			offset, nr_segs, get_block, end_io, submit_io, flags);
 	/*
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again for DIO_LOCKING.
+	 * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this in
+	 * their own manner. This is a further example of where the old
+	 * truncate sequence is inadequate.
 	 *
 	 * NOTE: filesystems with their own locking have to handle this
 	 * on their own.
 	 */
 	if (flags & DIO_LOCKING) {
 		if (unlikely((rw & WRITE) && retval < 0)) {
 			loff_t isize = i_size_read(inode);
+			loff_t end = offset + iov_length(iov, nr_segs);
+
 			if (end > isize)
 				vmtruncate(inode, isize);
 		}
 	}

-out:
 	return retval;
 }
 EXPORT_SYMBOL(__blockdev_direct_IO);
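
The trim decision in the legacy __blockdev_direct_IO wrapper depends on where the write would have ended, which it now recomputes from the iovec. The sketch below mirrors that decision in user space. It uses the POSIX `struct iovec` from `<sys/uio.h>`, while `toy_iov_length` and `toy_needs_trim` are hypothetical helpers written for this example; they stand in for the kernel's iov_length and the inline condition, and the flag arguments are plain ints rather than DIO_LOCKING or WRITE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Sums the segment lengths, standing in for the kernel's iov_length. */
static size_t toy_iov_length(const struct iovec *iov, unsigned long nr_segs)
{
	size_t total = 0;

	assert(iov != NULL || nr_segs == 0);
	for (unsigned long seg = 0; seg < nr_segs; seg++)
		total += iov[seg].iov_len;
	return total;
}

/*
 * Returns 1 when the legacy wrapper would trim blocks back to isize:
 * only for the locking mode, only for failed writes, and only when the
 * attempted write extended past the current size.
 */
static int toy_needs_trim(int locking, int is_write, long retval,
			  int64_t offset, const struct iovec *iov,
			  unsigned long nr_segs, int64_t isize)
{
	int64_t end;

	if (!locking || !is_write || retval >= 0)
		return 0;
	end = offset + (int64_t)toy_iov_length(iov, nr_segs);
	return end > isize;
}
```

Callers that do their own locking fall into the `!locking` branch here, which corresponds to the note that such filesystems have to trim in their own way.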
