Skip to content

Commit fc0561c

Browse files
Dave Chinnerdchinner
authored andcommitted
xfs: optimise away log forces on timestamp updates for fdatasync
xfs: timestamp updates cause excessive fdatasync log traffic Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. There are two parts to this problem. The first is that XFS relogs dirty regions in subsequent transactions, so it carries around the fields that have been dirtied since the last time the inode was written back to disk, not since the last time the inode was forced into the log. The second part is that on v5 filesystems, the inode change count update during inode dirtying also sets the XFS_ILOG_CORE flag, so on v5 filesystems this makes a timestamp update dirty the entire inode. As a result when fdatasync is run, it looks at the dirty fields in the inode, and sees more than just the timestamp flag, even though the only metadata change since the last fdatasync was just the timestamps. Hence we force the log on every subsequent fdatasync even though it is not needed. To fix this, add a new field to the inode log item that tracks changes since the last time fsync/fdatasync forced the log to flush the changes to the journal. This flag is updated when we dirty the inode, but we do it before updating the change count so it does not carry the "core dirty" flag from timestamp updates. The fields are zeroed when the inode is marked clean (due to writeback/freeing) or when an fsync/datasync forces the log. Hence if we only dirty the timestamps on the inode between fsync/fdatasync calls, the fdatasync will not trigger another log force. Over 100 runs of the test program: Ext4 baseline: runtime: 1.63s +/- 0.24s avg lat: 1.59ms +/- 0.24ms iops: ~2000 XFS, vanilla kernel: runtime: 2.45s +/- 0.18s avg lat: 2.39ms +/- 0.18ms log forces: ~400/s iops: ~1000 XFS, patched kernel: runtime: 1.49s +/- 0.26s avg lat: 1.46ms +/- 0.25ms log forces: ~30/s iops: ~1500 Reported-by: Sage Weil <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
1 parent af3b638 commit fc0561c

File tree

5 files changed

+29
-5
lines changed

5 files changed

+29
-5
lines changed

fs/xfs/xfs_file.c

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -242,19 +242,30 @@ xfs_file_fsync(
242242
}
243243

244244
/*
245-
* All metadata updates are logged, which means that we just have
246-
* to flush the log up to the latest LSN that touched the inode.
245+
* All metadata updates are logged, which means that we just have to
246+
* flush the log up to the latest LSN that touched the inode. If we have
247+
* concurrent fsync/fdatasync() calls, we need them to all block on the
248+
* log force before we clear the ili_fsync_fields field. This ensures
249+
* that we don't get a racing sync operation that does not wait for the
250+
* metadata to hit the journal before returning. If we race with
251+
* clearing the ili_fsync_fields, then all that will happen is the log
252+
* force will do nothing as the lsn will already be on disk. We can't
253+
* race with setting ili_fsync_fields because that is done under
254+
* XFS_ILOCK_EXCL, and that can't happen because we hold the lock shared
255+
* until after the ili_fsync_fields is cleared.
247256
*/
248257
xfs_ilock(ip, XFS_ILOCK_SHARED);
249258
if (xfs_ipincount(ip)) {
250259
if (!datasync ||
251-
(ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP))
260+
(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
252261
lsn = ip->i_itemp->ili_last_lsn;
253262
}
254-
xfs_iunlock(ip, XFS_ILOCK_SHARED);
255263

256-
if (lsn)
264+
if (lsn) {
257265
error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
266+
ip->i_itemp->ili_fsync_fields = 0;
267+
}
268+
xfs_iunlock(ip, XFS_ILOCK_SHARED);
258269

259270
/*
260271
* If we only have a single device, and the log force about was

fs/xfs/xfs_inode.c

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2365,6 +2365,7 @@ xfs_ifree_cluster(
23652365

23662366
iip->ili_last_fields = iip->ili_fields;
23672367
iip->ili_fields = 0;
2368+
iip->ili_fsync_fields = 0;
23682369
iip->ili_logged = 1;
23692370
xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
23702371
&iip->ili_item.li_lsn);
@@ -3560,6 +3561,7 @@ xfs_iflush_int(
35603561
*/
35613562
iip->ili_last_fields = iip->ili_fields;
35623563
iip->ili_fields = 0;
3564+
iip->ili_fsync_fields = 0;
35633565
iip->ili_logged = 1;
35643566

35653567
xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,

fs/xfs/xfs_inode_item.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -719,6 +719,7 @@ xfs_iflush_abort(
719719
* attempted.
720720
*/
721721
iip->ili_fields = 0;
722+
iip->ili_fsync_fields = 0;
722723
}
723724
/*
724725
* Release the inode's flush lock since we're done with it.

fs/xfs/xfs_inode_item.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
3434
unsigned short ili_logged; /* flushed logged data */
3535
unsigned int ili_last_fields; /* fields when flushed */
3636
unsigned int ili_fields; /* fields to be logged */
37+
unsigned int ili_fsync_fields; /* logged since last fsync */
3738
} xfs_inode_log_item_t;
3839

3940
static inline int xfs_inode_clean(xfs_inode_t *ip)

fs/xfs/xfs_trans_inode.c

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -107,6 +107,15 @@ xfs_trans_log_inode(
107107
ASSERT(ip->i_itemp != NULL);
108108
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
109109

110+
/*
111+
* Record the specific change for fdatasync optimisation. This
112+
* allows fdatasync to skip log forces for inodes that are only
113+
* timestamp dirty. We do this before the change count so that
114+
* the core being logged in this case does not impact on fdatasync
115+
* behaviour.
116+
*/
117+
ip->i_itemp->ili_fsync_fields |= flags;
118+
110119
/*
111120
* First time we log the inode in a transaction, bump the inode change
112121
* counter if it is configured for this to occur. We don't use

0 commit comments

Comments
 (0)