Skip to content

Commit 41bd606

Browse files
fdmananakdave
authored andcommitted
Btrfs: fix fsync of files with multiple hard links in new directories
The log tree has a long standing problem that when a file is fsync'ed we only check for new ancestors, created in the current transaction, by following only the hard link for which the fsync was issued. We follow the ancestors using the VFS' dget_parent() API. This means that if we create a new link for a file in a directory that is new (or in an any other new ancestor directory) and then fsync the file using an old hard link, we end up not logging the new ancestor, and on log replay that new hard link and ancestor do not exist. In some cases, involving renames, the file will not exist at all. Example: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt mkdir /mnt/A touch /mnt/foo ln /mnt/foo /mnt/A/bar xfs_io -c fsync /mnt/foo <power failure> In this example after log replay only the hard link named 'foo' exists and directory A does not exist, which is unexpected. In other major linux filesystems, such as ext4, xfs and f2fs for example, both hard links exist and so does directory A after mounting again the filesystem. Checking if any new ancestors are new and need to be logged was added in 2009 by commit 12fcfd2 ("Btrfs: tree logging unlink/rename fixes"), however only for the ancestors of the hard link (dentry) for which the fsync was issued, instead of checking for all ancestors for all of the inode's hard links. So fix this by tracking the id of the last transaction where a hard link was created for an inode and then on fsync fallback to a full transaction commit when an inode has more than one hard link and at least one new hard link was created in the current transaction. This is the simplest solution since this is not a common use case (adding frequently hard links for which there's an ancestor created in the current transaction and then fsync the file). In case it ever becomes a common use case, a solution that consists of iterating the fs/subvol btree for each hard link and check if any ancestor is new, could be implemented. This solves many unexpected scenarios reported by Jayashree Mohan and Vijay Chidambaram, and for which there is a new test case for fstests under review. Fixes: 12fcfd2 ("Btrfs: tree logging unlink/rename fixes") CC: [email protected] # 4.4+ Reported-by: Vijay Chidambaram <[email protected]> Reported-by: Jayashree Mohan <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent bbe339c commit 41bd606

File tree

3 files changed

+39
-0
lines changed

3 files changed

+39
-0
lines changed

fs/btrfs/btrfs_inode.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -147,6 +147,12 @@ struct btrfs_inode {
147147
*/
148148
u64 last_unlink_trans;
149149

150+
/*
151+
* Track the transaction id of the last transaction used to create a
152+
* hard link for the inode. This is used by the log tree (fsync).
153+
*/
154+
u64 last_link_trans;
155+
150156
/*
151157
* Number of bytes outstanding that are going to need csums. This is
152158
* used in ENOSPC accounting.

fs/btrfs/inode.c

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3658,6 +3658,21 @@ static int btrfs_read_locked_inode(struct inode *inode,
36583658
* inode is not a directory, logging its parent unnecessarily.
36593659
*/
36603660
BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans;
3661+
/*
3662+
* Similar reasoning for last_link_trans, needs to be set otherwise
3663+
* for a case like the following:
3664+
*
3665+
* mkdir A
3666+
* touch foo
3667+
* ln foo A/bar
3668+
* echo 2 > /proc/sys/vm/drop_caches
3669+
* fsync foo
3670+
* <power failure>
3671+
*
3672+
* Would result in link bar and directory A not existing after the power
3673+
* failure.
3674+
*/
3675+
BTRFS_I(inode)->last_link_trans = BTRFS_I(inode)->last_trans;
36613676

36623677
path->slots[0]++;
36633678
if (inode->i_nlink != 1 ||
@@ -6597,6 +6612,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
65976612
if (err)
65986613
goto fail;
65996614
}
6615+
BTRFS_I(inode)->last_link_trans = trans->transid;
66006616
d_instantiate(dentry, inode);
66016617
ret = btrfs_log_new_name(trans, BTRFS_I(inode), NULL, parent,
66026618
true, NULL);
@@ -9123,6 +9139,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
91239139
ei->index_cnt = (u64)-1;
91249140
ei->dir_index = 0;
91259141
ei->last_unlink_trans = 0;
9142+
ei->last_link_trans = 0;
91269143
ei->last_log_commit = 0;
91279144

91289145
spin_lock_init(&ei->lock);

fs/btrfs/tree-log.c

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5758,6 +5758,22 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
57585758
goto end_trans;
57595759
}
57605760

5761+
/*
5762+
* If a new hard link was added to the inode in the current transaction
5763+
* and its link count is now greater than 1, we need to fallback to a
5764+
* transaction commit, otherwise we can end up not logging all its new
5765+
* parents for all the hard links. Here just from the dentry used to
5766+
* fsync, we can not visit the ancestor inodes for all the other hard
5767+
* links to figure out if any is new, so we fallback to a transaction
5768+
* commit (instead of adding a lot of complexity of scanning a btree,
5769+
* since this scenario is not a common use case).
5770+
*/
5771+
if (inode->vfs_inode.i_nlink > 1 &&
5772+
inode->last_link_trans > last_committed) {
5773+
ret = -EMLINK;
5774+
goto end_trans;
5775+
}
5776+
57615777
while (1) {
57625778
if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
57635779
break;

0 commit comments

Comments
 (0)