Commit 111c1aa

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "In addition to some ext4 bug fixes and cleanups, this cycle we add the
  orphan_file feature, which eliminates bottlenecks when doing a large
  number of parallel truncates and file deletions, and move the discard
  operation out of the jbd2 commit thread when using the discard mount
  option, to better support devices with slow discard operations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: make the updating inode data procedure atomic
  ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
  ext4: move inode eio simulation behind io completeion
  ext4: Improve scalability of ext4 orphan file handling
  ext4: Orphan file documentation
  ext4: Speedup ext4 orphan inode handling
  ext4: Move orphan inode handling into a separate file
  ext4: Support for checksumming from journal triggers
  ext4: fix race writing to an inline_data file while its xattrs are changing
  jbd2: add sparse annotations for add_transaction_credits()
  ext4: fix sparse warnings
  ext4: Make sure quota files are not grabbed accidentally
  ext4: fix e2fsprogs checksum failure for mounted filesystem
  ext4: if zeroout fails fall back to splitting the extent node
  ext4: reduce arguments of ext4_fc_add_dentry_tlv
  ext4: flush background discard kwork when retry allocation
  ext4: get discard out of jbd2 commit kthread contex
  ext4: remove the repeated comment of ext4_trim_all_free
  ext4: add new helper interface ext4_try_to_trim_range()
  ext4: remove the 'group' parameter of ext4_trim_extent
  ...
2 parents 815409a + baaae97 commit 111c1aa

27 files changed, 1443 insertions(+), 731 deletions(-)

Documentation/filesystems/ext4/globals.rst

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ have static metadata at fixed locations.
 .. include:: bitmaps.rst
 .. include:: mmp.rst
 .. include:: journal.rst
+.. include:: orphan.rst

Documentation/filesystems/ext4/inodes.rst

Lines changed: 5 additions & 5 deletions
@@ -498,11 +498,11 @@ structure -- inode change time (ctime), access time (atime), data
 modification time (mtime), and deletion time (dtime). The four fields
 are 32-bit signed integers that represent seconds since the Unix epoch
 (1970-01-01 00:00:00 GMT), which means that the fields will overflow in
-January 2038. For inodes that are not linked from any directory but are
-still open (orphan inodes), the dtime field is overloaded for use with
-the orphan list. The superblock field ``s_last_orphan`` points to the
-first inode in the orphan list; dtime is then the number of the next
-orphaned inode, or zero if there are no more orphans.
+January 2038. If the filesystem does not have orphan_file feature, inodes
+that are not linked from any directory but are still open (orphan inodes) have
+the dtime field overloaded for use with the orphan list. The superblock field
+``s_last_orphan`` points to the first inode in the orphan list; dtime is then
+the number of the next orphaned inode, or zero if there are no more orphans.

 If the inode structure size ``sb->s_inode_size`` is larger than 128
 bytes and the ``i_inode_extra`` field is large enough to encompass the
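The legacy scheme described in this hunk is a singly linked list threaded through the inodes themselves. As a rough illustration only, the walk can be sketched in userspace C as below. The `toy_inode` type, the flat in-memory table, and `walk_orphan_list()` are hypothetical stand-ins invented for this sketch; they are not kernel structures or kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an on-disk inode; only the overloaded field. */
struct toy_inode {
	uint32_t i_dtime;	/* next orphan inode number, 0 ends the list */
};

/*
 * Follow the list that starts at s_last_orphan. Inode numbers are 1-based,
 * so inode N lives in table[N - 1]. Visited inode numbers are stored in
 * out[] (up to out_cap). Returns the number of orphans visited, or -1 if
 * the chain leaves the table or is longer than the table (a loop).
 */
int walk_orphan_list(const struct toy_inode *table, uint32_t nr_inodes,
		     uint32_t s_last_orphan, uint32_t *out, size_t out_cap)
{
	size_t n = 0;
	uint32_t ino = s_last_orphan;

	while (ino != 0) {
		if (ino > nr_inodes || n >= nr_inodes)
			return -1;
		if (n < out_cap)
			out[n] = ino;
		n++;
		ino = table[ino - 1].i_dtime;
	}
	return (int)n;
}
```

Every insertion and removal has to touch the single list head in the superblock, which is the serialization point the orphan file is meant to avoid.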

Documentation/filesystems/ext4/orphan.rst

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Orphan file
+-----------
+
+In unix there can inodes that are unlinked from directory hierarchy but that
+are still alive because they are open. In case of crash the filesystem has to
+clean up these inodes as otherwise they (and the blocks referenced from them)
+would leak. Similarly if we truncate or extend the file, we need not be able
+to perform the operation in a single journalling transaction. In such case we
+track the inode as orphan so that in case of crash extra blocks allocated to
+the file get truncated.
+
+Traditionally ext4 tracks orphan inodes in a form of single linked list where
+superblock contains the inode number of the last orphan inode (s\_last\_orphan
+field) and then each inode contains inode number of the previously orphaned
+inode (we overload i\_dtime inode field for this). However this filesystem
+global single linked list is a scalability bottleneck for workloads that result
+in heavy creation of orphan inodes. When orphan file feature
+(COMPAT\_ORPHAN\_FILE) is enabled, the filesystem has a special inode
+(referenced from the superblock through s\_orphan_file_inum) with several
+blocks. Each of these blocks has a structure:
+
+.. list-table::
+   :widths: 8 8 24 40
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - Array of \_\_le32 entries
+     - Orphan inode entries
+     - Each \_\_le32 entry is either empty (0) or it contains inode number of
+       an orphan inode.
+   * - blocksize - 8
+     - \_\_le32
+     - ob\_magic
+     - Magic value stored in orphan block tail (0x0b10ca04)
+   * - blocksize - 4
+     - \_\_le32
+     - ob\_checksum
+     - Checksum of the orphan block.
+
+When a filesystem with orphan file feature is writeably mounted, we set
+RO\_COMPAT\_ORPHAN\_PRESENT feature in the superblock to indicate there may
+be valid orphan entries. In case we see this feature when mounting the
+filesystem, we read the whole orphan file and process all orphan inodes found
+there as usual. When cleanly unmounting the filesystem we remove the
+RO\_COMPAT\_ORPHAN\_PRESENT feature to avoid unnecessary scanning of the orphan
+file and also make the filesystem fully compatible with older kernels.
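The block layout in the table above can be read back with a few lines of userspace C. This is only a sketch under stated assumptions: `orphan_slots_per_block()`, `get_le32()` and `scan_orphan_block()` are hypothetical helper names made up for illustration, and the checksum in the last four bytes is deliberately not verified because the documentation above does not say how it is computed.

```c
#include <stddef.h>
#include <stdint.h>

#define ORPHAN_BLOCK_MAGIC 0x0b10ca04u	/* value from the table above */
#define ORPHAN_TAIL_SIZE   8u		/* ob_magic + ob_checksum */

/* Number of __le32 inode slots that fit in front of the 8-byte tail. */
size_t orphan_slots_per_block(size_t blocksize)
{
	return (blocksize - ORPHAN_TAIL_SIZE) / sizeof(uint32_t);
}

/* Decode a little-endian 32-bit value regardless of host byte order. */
uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Collect the non-zero entries of one orphan block into out[] (up to
 * out_cap). Returns the number of orphan entries found, or -1 if the
 * block is too small or ob_magic does not match. ob_checksum is ignored.
 */
int scan_orphan_block(const uint8_t *block, size_t blocksize,
		      uint32_t *out, size_t out_cap)
{
	size_t slots, i;
	int found = 0;

	if (blocksize < ORPHAN_TAIL_SIZE)
		return -1;
	if (get_le32(block + blocksize - 8) != ORPHAN_BLOCK_MAGIC)
		return -1;

	slots = orphan_slots_per_block(blocksize);
	for (i = 0; i < slots; i++) {
		uint32_t ino = get_le32(block + i * sizeof(uint32_t));

		if (ino == 0)
			continue;	/* empty slot */
		if ((size_t)found < out_cap)
			out[found] = ino;
		found++;
	}
	return found;
}
```

With a 1024-byte block this gives 254 slots. Because each slot is independent, concurrent tasks can claim different slots without contending on one list head, which is consistent with the scalability goal stated in the commit message.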

Documentation/filesystems/ext4/special_inodes.rst

Lines changed: 17 additions & 0 deletions
@@ -36,3 +36,20 @@ ext4 reserves some inode for special features, as follows:
    * - 11
      - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock.

+Note that there are also some inodes allocated from non-reserved inode numbers
+for other filesystem features which are not referenced from standard directory
+hierarchy. These are generally reference from the superblock. They are:
+
+.. list-table::
+   :widths: 20 50
+   :header-rows: 1
+
+   * - Superblock field
+     - Description
+
+   * - s\_lpf\_ino
+     - Inode number of lost+found directory.
+   * - s\_prj\_quota\_inum
+     - Inode number of quota file tracking project quotas
+   * - s\_orphan\_file\_inum
+     - Inode number of file tracking orphan inodes.

Documentation/filesystems/ext4/super.rst

Lines changed: 14 additions & 1 deletion
@@ -479,7 +479,11 @@ The ext4 superblock is laid out as follows in
      - Filename charset encoding flags.
    * - 0x280
      - \_\_le32
-     - s\_reserved[95]
+     - s\_orphan\_file\_inum
+     - Orphan file inode number.
+   * - 0x284
+     - \_\_le32
+     - s\_reserved[94]
      - Padding to the end of the block.
    * - 0x3FC
      - \_\_le32
@@ -603,6 +607,11 @@ following:
        the journal, JBD2 incompat feature
        (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT) gets
        set (COMPAT\_FAST\_COMMIT).
+   * - 0x1000
+     - Orphan file allocated. This is the special file for more efficient
+       tracking of unlinked but still open inodes. When there may be any
+       entries in the file, we additionally set proper rocompat feature
+       (RO\_COMPAT\_ORPHAN\_PRESENT).

 .. _super_incompat:

@@ -713,6 +722,10 @@ the following:
      - Filesystem tracks project quotas. (RO\_COMPAT\_PROJECT)
    * - 0x8000
      - Verity inodes may be present on the filesystem. (RO\_COMPAT\_VERITY)
+   * - 0x10000
+     - Indicates orphan file may have valid orphan entries and thus we need
+       to clean them up when mounting the filesystem
+       (RO\_COMPAT\_ORPHAN\_PRESENT).

 .. _super_def_hash:
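The two feature bits and the new superblock field documented in these hunks combine into a small mount-time decision. The following is a simplified sketch: the flag values and offsets come from the tables above, while `enum orphan_action`, `orphan_mount_action()` and `reserved_end_offset()` are hypothetical names for illustration. The real mount path may do more than this (for example, additional validation or fallback handling).

```c
#include <stdint.h>

/* Values taken from the superblock documentation in this diff. */
#define COMPAT_ORPHAN_FILE		0x1000u
#define RO_COMPAT_ORPHAN_PRESENT	0x10000u
#define S_RESERVED_OFFSET		0x284u
#define S_RESERVED_WORDS		94u

/* Hypothetical summary of what a mount would have to do about orphans. */
enum orphan_action {
	ORPHAN_LEGACY_LIST,	/* no orphan file: use s_last_orphan chain */
	ORPHAN_FILE_CLEAN,	/* orphan file exists, no entries expected */
	ORPHAN_FILE_SCAN	/* orphan file may hold entries: scan it */
};

enum orphan_action orphan_mount_action(uint32_t compat, uint32_t ro_compat)
{
	if (!(compat & COMPAT_ORPHAN_FILE))
		return ORPHAN_LEGACY_LIST;
	if (ro_compat & RO_COMPAT_ORPHAN_PRESENT)
		return ORPHAN_FILE_SCAN;
	return ORPHAN_FILE_CLEAN;
}

/* Byte offset just past s_reserved[]; the next field must start here. */
uint32_t reserved_end_offset(void)
{
	return S_RESERVED_OFFSET + S_RESERVED_WORDS * (uint32_t)sizeof(uint32_t);
}
```

The offset helper doubles as a consistency check on the layout change: shrinking the padding from 95 to 94 words while moving its start from 0x280 to 0x284 keeps the following field at 0x3FC, as the unchanged context line in the first hunk shows.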

fs/ext4/Makefile

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ ext4-y := balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
 		indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
 		mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
 		super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
-		xattr_user.o fast_commit.o
+		xattr_user.o fast_commit.o orphan.o

 ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
 ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o

fs/ext4/balloc.c

Lines changed: 7 additions & 1 deletion
@@ -652,8 +652,14 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 	 * possible we just missed a transaction commit that did so
 	 */
 	smp_mb();
-	if (sbi->s_mb_free_pending == 0)
+	if (sbi->s_mb_free_pending == 0) {
+		if (test_opt(sb, DISCARD)) {
+			atomic_inc(&sbi->s_retry_alloc_pending);
+			flush_work(&sbi->s_discard_work);
+			atomic_dec(&sbi->s_retry_alloc_pending);
+		}
 		return ext4_has_free_clusters(sbi, 1, 0);
+	}

 	/*
 	 * it's possible we've just missed a transaction commit here,
