Commit efb2aef

erofs: add encoded extent on-disk definition
Previously, EROFS provided both compact and non-compact compressed
indexes to keep the necessary hints for each logical block, enabling
O(1) random indexing. This approach was originally designed for small
compression units (e.g., 4KiB), where compressed data is strictly
block-aligned via fixed-sized output compression.

However, EROFS now supports big pclusters up to 1MiB, and many users use
large configurations to minimize image sizes. For such configurations,
the total number of extents decreases significantly (e.g., only 1,024
extents for a 1GiB file using 1MiB pclusters), so runtime metadata
overhead becomes negligible compared to data I/O and decoding costs.

Additionally, some popular compression algorithms (mainly Zstd) still
lack native fixed-sized output compression support (although it's
planned by their authors). Instead of just waiting for compressor
improvements, let's adopt byte-oriented extents, allowing these
compressors to retain their current methods.

For example, it speeds up Zstd compression a lot:

  Processor: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz * 96
  Dataset: enwik9

  Build time   Size       Type  Command Line
  3m52.339s    266653696  FO    -C524288 -zzstd,22
  3m48.549s    266174464  FO    -E48bit -C524288 -zzstd,22
  0m12.821s    272134144  FI    -E48bit -C1048576 --max-extent-bytes=1048576 -zzstd,22
  0m14.528s    248987648  FO    -C1048576 -zlzma,9
  0m14.605s    248504320  FO    -E48bit -C1048576 -zlzma,9

Encoded extents are structured as an array of `struct z_erofs_extent`,
sorted by logical address in ascending order:

  __le32 plen       // encoded length, algorithm id and flags
  __le32 pstart_lo  // physical offset LSB
  __le32 pstart_hi  // physical offset MSB
  __le32 lstart_lo  // logical offset
  __le32 lstart_hi  // logical offset MSB
  ..

Note that prefixed reduced records can be used to minimize metadata for
specific cases (e.g. if lstart fits in 32 bits, a record shrinks from 32
to 16 bytes).

If the logical lengths of all encoded extents are the same, 4-byte
(plen) and 8-byte (plen, pstart_lo) records can be used. Otherwise,
16-byte (plen .. lstart_lo) and 32-byte full records have to be used
instead.

If 16-byte and 32-byte records are used, the total number of extents is
kept in `struct z_erofs_map_header`, and binary search can be applied to
them. Note that `eytzinger order` is not considered because sequential
data access is important.

If 4-byte records are used, an 8-byte start physical offset is stored
between `struct z_erofs_map_header` and the `plen` array.

In addition, 64-bit physical offsets can be applied with the new encoded
extent format to match full 48-bit block addressing.

Remove redundant comments around `struct z_erofs_lcluster_index` too.

Signed-off-by: Gao Xiang <[email protected]>
Acked-by: Chao Yu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
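
The binary search mentioned in the message can be sketched as below. This is a minimal illustration rather than the kernel's lookup code: `struct extent_rec`, `extent_lstart()` and `extent_lookup()` are hypothetical names, and the fields are kept in host byte order instead of the on-disk `__le32` type to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian stand-in for a full 32-byte record (hypothetical name). */
struct extent_rec {
	uint32_t plen;
	uint32_t pstart_lo, pstart_hi;
	uint32_t lstart_lo, lstart_hi;
	uint8_t reserved[12];
};

/* Combine the two 32-bit halves into a 64-bit logical start offset. */
static uint64_t extent_lstart(const struct extent_rec *e)
{
	return ((uint64_t)e->lstart_hi << 32) | e->lstart_lo;
}

/*
 * Return the index of the last extent whose logical start is <= la,
 * or -1 if la precedes the first extent. The array must be sorted by
 * logical start in ascending order, as the commit message requires.
 */
static long extent_lookup(const struct extent_rec *recs, size_t n, uint64_t la)
{
	size_t lo = 0, hi = n;	/* search window is [lo, hi) */

	assert(recs != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (extent_lstart(&recs[mid]) <= la)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (long)lo - 1;
}
```

Because the records stay in plain ascending order, a sequential reader can simply step to the next index after a lookup, which is presumably why the message prefers this layout over an Eytzinger arrangement.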
1 parent fe1e57d commit efb2aef

3 files changed

+58
-67
lines changed

fs/erofs/erofs_fs.h

Lines changed: 45 additions & 54 deletions
@@ -331,21 +331,20 @@ struct z_erofs_zstd_cfgs {
 #define Z_EROFS_ZSTD_MAX_DICT_SIZE Z_EROFS_PCLUSTER_MAX_SIZE
 
 /*
- * bit 0 : COMPACTED_2B indexes (0 - off; 1 - on)
- *  e.g. for 4k logical cluster size, 4B if compacted 2B is off;
- *  (4B) + 2B + (4B) if compacted 2B is on.
- * bit 1 : HEAD1 big pcluster (0 - off; 1 - on)
- * bit 2 : HEAD2 big pcluster (0 - off; 1 - on)
- * bit 3 : tailpacking inline pcluster (0 - off; 1 - on)
- * bit 4 : interlaced plain pcluster (0 - off; 1 - on)
- * bit 5 : fragment pcluster (0 - off; 1 - on)
+ * Enable COMPACTED_2B for EROFS_INODE_COMPRESSED_COMPACT inodes:
+ *   4B (disabled) vs 4B+2B+4B (enabled)
  */
 #define Z_EROFS_ADVISE_COMPACTED_2B 0x0001
+/* Enable extent metadata for EROFS_INODE_COMPRESSED_FULL inodes */
+#define Z_EROFS_ADVISE_EXTENTS 0x0001
 #define Z_EROFS_ADVISE_BIG_PCLUSTER_1 0x0002
 #define Z_EROFS_ADVISE_BIG_PCLUSTER_2 0x0004
 #define Z_EROFS_ADVISE_INLINE_PCLUSTER 0x0008
 #define Z_EROFS_ADVISE_INTERLACED_PCLUSTER 0x0010
 #define Z_EROFS_ADVISE_FRAGMENT_PCLUSTER 0x0020
+/* Indicate the record size for each extent if extent metadata is used */
+#define Z_EROFS_ADVISE_EXTRECSZ_BIT 1
+#define Z_EROFS_ADVISE_EXTRECSZ_MASK 0x3
 
 #define Z_EROFS_FRAGMENT_INODE_BIT 7
 struct z_erofs_map_header {
@@ -357,45 +356,24 @@ struct z_erofs_map_header {
 			/* indicates the encoded size of tailpacking data */
 			__le16 h_idata_size;
 		};
+		__le32 h_extents_lo; /* extent count LSB */
 	};
 	__le16 h_advise;
-	/*
-	 * bit 0-3 : algorithm type of head 1 (logical cluster type 01);
-	 * bit 4-7 : algorithm type of head 2 (logical cluster type 11).
-	 */
-	__u8 h_algorithmtype;
-	/*
-	 * bit 0-2 : logical cluster bits - 12, e.g. 0 for 4096;
-	 * bit 3-6 : reserved;
-	 * bit 7 : move the whole file into packed inode or not.
-	 */
-	__u8 h_clusterbits;
+	union {
+		struct {
+			/* algorithm type (bit 0-3: HEAD1; bit 4-7: HEAD2) */
+			__u8 h_algorithmtype;
+			/*
+			 * bit 0-3 : logical cluster bits - blkszbits
+			 * bit 4-6 : reserved
+			 * bit 7 : pack the whole file into packed inode
+			 */
+			__u8 h_clusterbits;
+		};
+		__le16 h_extents_hi; /* extent count MSB */
+	};
 };
 
-/*
- * On-disk logical cluster type:
- *    0 - literal (uncompressed) lcluster
- *    1,3 - compressed lcluster (for HEAD lclusters)
- *    2 - compressed lcluster (for NONHEAD lclusters)
- *
- * In detail,
- *    0 - literal (uncompressed) lcluster,
- *        di_advise = 0
- *        di_clusterofs = the literal data offset of the lcluster
- *        di_blkaddr = the blkaddr of the literal pcluster
- *
- *    1,3 - compressed lcluster (for HEAD lclusters)
- *        di_advise = 1 or 3
- *        di_clusterofs = the decompressed data offset of the lcluster
- *        di_blkaddr = the blkaddr of the compressed pcluster
- *
- *    2 - compressed lcluster (for NONHEAD lclusters)
- *        di_advise = 2
- *        di_clusterofs =
- *           the decompressed data offset in its own HEAD lcluster
- *        di_u.delta[0] = distance to this HEAD lcluster
- *        di_u.delta[1] = distance to the next HEAD lcluster
- */
 enum {
 	Z_EROFS_LCLUSTER_TYPE_PLAIN = 0,
 	Z_EROFS_LCLUSTER_TYPE_HEAD1 = 1,
@@ -409,11 +387,7 @@ enum {
 /* (noncompact only, HEAD) This pcluster refers to partial decompressed data */
 #define Z_EROFS_LI_PARTIAL_REF (1 << 15)
 
-/*
- * D0_CBLKCNT will be marked _only_ at the 1st non-head lcluster to store the
- * compressed block count of a compressed extent (in logical clusters, aka.
- * block count of a pcluster).
- */
+/* Set on 1st non-head lcluster to store compressed block counti (in blocks) */
 #define Z_EROFS_LI_D0_CBLKCNT (1 << 11)
 
 struct z_erofs_lcluster_index {
@@ -422,19 +396,36 @@ struct z_erofs_lcluster_index {
 	__le16 di_clusterofs;
 
 	union {
-		/* for the HEAD lclusters */
-		__le32 blkaddr;
+		__le32 blkaddr; /* for the HEAD lclusters */
 		/*
-		 * for the NONHEAD lclusters
 		 * [0] - distance to its HEAD lcluster
 		 * [1] - distance to the next HEAD lcluster
 		 */
-		__le16 delta[2];
+		__le16 delta[2]; /* for the NONHEAD lclusters */
 	} di_u;
 };
 
-#define Z_EROFS_FULL_INDEX_ALIGN(end) \
-	(ALIGN(end, 8) + sizeof(struct z_erofs_map_header) + 8)
+#define Z_EROFS_MAP_HEADER_END(end) \
+	(ALIGN(end, 8) + sizeof(struct z_erofs_map_header))
+#define Z_EROFS_FULL_INDEX_START(end) (Z_EROFS_MAP_HEADER_END(end) + 8)
+
+#define Z_EROFS_EXTENT_PLEN_PARTIAL BIT(27)
+#define Z_EROFS_EXTENT_PLEN_FMT_BIT 28
+#define Z_EROFS_EXTENT_PLEN_MASK ((Z_EROFS_PCLUSTER_MAX_SIZE << 1) - 1)
+struct z_erofs_extent {
+	__le32 plen; /* encoded length */
+	__le32 pstart_lo; /* physical offset */
+	__le32 pstart_hi; /* physical offset MSB */
+	__le32 lstart_lo; /* logical offset */
+	__le32 lstart_hi; /* logical offset MSB (>= 4GiB inodes) */
+	__u8 reserved[12]; /* for future use */
+};
+
+static inline int z_erofs_extent_recsize(unsigned int advise)
+{
+	return 4 << ((advise >> Z_EROFS_ADVISE_EXTRECSZ_BIT) &
+		Z_EROFS_ADVISE_EXTRECSZ_MASK);
+}
 
 /* check the EROFS on-disk layout strictly at compile time */
 static inline void erofs_check_ondisk_layout_definitions(void)
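
To make the new bit fields concrete, here is a small userspace sketch of how the record size, the `plen` sub-fields and the split extent count could be decoded. It mirrors the macros added in this file but is not kernel code. The shortened names, `struct plen_fields`, `decode_plen()` and `extent_count()` are hypothetical. It assumes that `Z_EROFS_PCLUSTER_MAX_SIZE` is 1MiB (the limit the commit message quotes), that the bits from `Z_EROFS_EXTENT_PLEN_FMT_BIT` upward carry the algorithm id, and that `h_extents_hi` supplies bits 32-47 of the count.

```c
#include <assert.h>
#include <stdint.h>

#define PCLUSTER_MAX_SIZE	(1024 * 1024)	/* assumed: 1MiB big pcluster limit */
#define ADVISE_EXTRECSZ_BIT	1
#define ADVISE_EXTRECSZ_MASK	0x3
#define EXTENT_PLEN_PARTIAL	(1U << 27)
#define EXTENT_PLEN_FMT_BIT	28
#define EXTENT_PLEN_MASK	((PCLUSTER_MAX_SIZE << 1) - 1)

/* With a 1MiB limit the length field spans the low 21 bits. */
static_assert(EXTENT_PLEN_MASK == 0x1FFFFF, "unexpected plen mask");

/* Same arithmetic as z_erofs_extent_recsize(): 4, 8, 16 or 32 bytes. */
static unsigned int extent_recsize(unsigned int advise)
{
	return 4U << ((advise >> ADVISE_EXTRECSZ_BIT) & ADVISE_EXTRECSZ_MASK);
}

/* Hypothetical container for the decoded plen sub-fields. */
struct plen_fields {
	uint32_t len;		/* encoded (physical) length in bytes */
	unsigned int fmt;	/* assumed: algorithm id in the top bits */
	int partial;		/* Z_EROFS_EXTENT_PLEN_PARTIAL flag */
};

static struct plen_fields decode_plen(uint32_t plen)
{
	struct plen_fields f = {
		.len = plen & EXTENT_PLEN_MASK,
		.fmt = plen >> EXTENT_PLEN_FMT_BIT,
		.partial = !!(plen & EXTENT_PLEN_PARTIAL),
	};
	return f;
}

/* Assumed combination of h_extents_lo (32 bits) and h_extents_hi (16 bits). */
static uint64_t extent_count(uint32_t lo, uint16_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

The record-size selector sits in bits 1-2 of `h_advise`, the same positions as `Z_EROFS_ADVISE_BIG_PCLUSTER_1` and `Z_EROFS_ADVISE_BIG_PCLUSTER_2`, so those bits appear to be reinterpreted once `Z_EROFS_ADVISE_EXTENTS` is set.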

fs/erofs/internal.h

Lines changed: 1 addition & 1 deletion
@@ -262,7 +262,7 @@ struct erofs_inode {
 		struct {
 			unsigned short z_advise;
 			unsigned char z_algorithmtype[2];
-			unsigned char z_logical_clusterbits;
+			unsigned char z_lclusterbits;
 			unsigned long z_tailextent_headlcn;
 			erofs_off_t z_fragmentoff;
 			unsigned short z_idata_size;

fs/erofs/zmap.c

Lines changed: 12 additions & 12 deletions
@@ -25,7 +25,7 @@ static int z_erofs_load_full_lcluster(struct z_erofs_maprecorder *m,
 {
 	struct inode *const inode = m->inode;
 	struct erofs_inode *const vi = EROFS_I(inode);
-	const erofs_off_t pos = Z_EROFS_FULL_INDEX_ALIGN(erofs_iloc(inode) +
+	const erofs_off_t pos = Z_EROFS_FULL_INDEX_START(erofs_iloc(inode) +
 			vi->inode_isize + vi->xattr_isize) +
 			lcn * sizeof(struct z_erofs_lcluster_index);
 	struct z_erofs_lcluster_index *di;
@@ -40,7 +40,7 @@ static int z_erofs_load_full_lcluster(struct z_erofs_maprecorder *m,
 	advise = le16_to_cpu(di->di_advise);
 	m->type = advise & Z_EROFS_LI_LCLUSTER_TYPE_MASK;
 	if (m->type == Z_EROFS_LCLUSTER_TYPE_NONHEAD) {
-		m->clusterofs = 1 << vi->z_logical_clusterbits;
+		m->clusterofs = 1 << vi->z_lclusterbits;
 		m->delta[0] = le16_to_cpu(di->di_u.delta[0]);
 		if (m->delta[0] & Z_EROFS_LI_D0_CBLKCNT) {
 			if (!(vi->z_advise & (Z_EROFS_ADVISE_BIG_PCLUSTER_1 |
@@ -55,7 +55,7 @@ static int z_erofs_load_full_lcluster(struct z_erofs_maprecorder *m,
 	} else {
 		m->partialref = !!(advise & Z_EROFS_LI_PARTIAL_REF);
 		m->clusterofs = le16_to_cpu(di->di_clusterofs);
-		if (m->clusterofs >= 1 << vi->z_logical_clusterbits) {
+		if (m->clusterofs >= 1 << vi->z_lclusterbits) {
 			DBG_BUGON(1);
 			return -EFSCORRUPTED;
 		}
@@ -102,9 +102,9 @@ static int z_erofs_load_compact_lcluster(struct z_erofs_maprecorder *m,
 {
 	struct inode *const inode = m->inode;
 	struct erofs_inode *const vi = EROFS_I(inode);
-	const erofs_off_t ebase = sizeof(struct z_erofs_map_header) +
-		ALIGN(erofs_iloc(inode) + vi->inode_isize + vi->xattr_isize, 8);
-	const unsigned int lclusterbits = vi->z_logical_clusterbits;
+	const erofs_off_t ebase = Z_EROFS_MAP_HEADER_END(erofs_iloc(inode) +
+			vi->inode_isize + vi->xattr_isize);
+	const unsigned int lclusterbits = vi->z_lclusterbits;
 	const unsigned int totalidx = erofs_iblks(inode);
 	unsigned int compacted_4b_initial, compacted_2b, amortizedshift;
 	unsigned int vcnt, lo, lobits, encodebits, nblk, bytes;
@@ -255,7 +255,7 @@ static int z_erofs_extent_lookback(struct z_erofs_maprecorder *m,
 {
 	struct super_block *sb = m->inode->i_sb;
 	struct erofs_inode *const vi = EROFS_I(m->inode);
-	const unsigned int lclusterbits = vi->z_logical_clusterbits;
+	const unsigned int lclusterbits = vi->z_lclusterbits;
 
 	while (m->lcn >= lookback_distance) {
 		unsigned long lcn = m->lcn - lookback_distance;
@@ -304,7 +304,7 @@ static int z_erofs_get_extent_compressedlen(struct z_erofs_maprecorder *m,
 	if ((m->headtype == Z_EROFS_LCLUSTER_TYPE_HEAD1 && !bigpcl1) ||
 	    ((m->headtype == Z_EROFS_LCLUSTER_TYPE_PLAIN ||
 	      m->headtype == Z_EROFS_LCLUSTER_TYPE_HEAD2) && !bigpcl2) ||
-	    (lcn << vi->z_logical_clusterbits) >= inode->i_size)
+	    (lcn << vi->z_lclusterbits) >= inode->i_size)
 		m->compressedblks = 1;
 
 	if (m->compressedblks)
@@ -354,7 +354,7 @@ static int z_erofs_get_extent_decompressedlen(struct z_erofs_maprecorder *m)
 	struct inode *inode = m->inode;
 	struct erofs_inode *vi = EROFS_I(inode);
 	struct erofs_map_blocks *map = m->map;
-	unsigned int lclusterbits = vi->z_logical_clusterbits;
+	unsigned int lclusterbits = vi->z_lclusterbits;
 	u64 lcn = m->lcn, headlcn = map->m_la >> lclusterbits;
 	int err;
 
@@ -398,16 +398,16 @@ static int z_erofs_do_map_blocks(struct inode *inode,
 	struct super_block *sb = inode->i_sb;
 	bool fragment = vi->z_advise & Z_EROFS_ADVISE_FRAGMENT_PCLUSTER;
 	bool ztailpacking = vi->z_idata_size;
+	unsigned int lclusterbits = vi->z_lclusterbits;
 	struct z_erofs_maprecorder m = {
 		.inode = inode,
 		.map = map,
 	};
 	int err = 0;
-	unsigned int lclusterbits, endoff, afmt;
+	unsigned int endoff, afmt;
 	unsigned long initial_lcn;
 	unsigned long long ofs, end;
 
-	lclusterbits = vi->z_logical_clusterbits;
 	ofs = flags & EROFS_GET_BLOCKS_FINDTAIL ? inode->i_size - 1 : map->m_la;
 	initial_lcn = ofs >> lclusterbits;
 	endoff = ofs & ((1 << lclusterbits) - 1);
@@ -569,6 +569,7 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
 		goto done;
 	}
 	vi->z_advise = le16_to_cpu(h->h_advise);
+	vi->z_lclusterbits = sb->s_blocksize_bits + (h->h_clusterbits & 15);
 	vi->z_algorithmtype[0] = h->h_algorithmtype & 15;
 	vi->z_algorithmtype[1] = h->h_algorithmtype >> 4;
 	if (vi->z_advise & Z_EROFS_ADVISE_FRAGMENT_PCLUSTER)
@@ -585,7 +586,6 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
 		goto out_put_metabuf;
 	}
 
-	vi->z_logical_clusterbits = sb->s_blocksize_bits + (h->h_clusterbits & 7);
 	if (!erofs_sb_has_big_pcluster(EROFS_SB(sb)) &&
 	    vi->z_advise & (Z_EROFS_ADVISE_BIG_PCLUSTER_1 |
 			    Z_EROFS_ADVISE_BIG_PCLUSTER_2)) {
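
The offset and cluster-size arithmetic that zmap.c now relies on can be sketched in userspace as follows. This is an illustration only: `MAP_HEADER_SIZE`, `ALIGN_UP()` and the three helper functions are hypothetical names, and the header size of 8 bytes is an assumption inferred from the fields visible in the diff (a 4-byte union, `h_advise`, and a 2-byte union).

```c
#include <assert.h>
#include <stdint.h>

#define MAP_HEADER_SIZE	8	/* assumed sizeof(struct z_erofs_map_header) */
/* Round x up to a multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

static_assert(MAP_HEADER_SIZE % 8 == 0, "header keeps metadata 8-byte aligned");

/* Mirrors Z_EROFS_MAP_HEADER_END(): first byte after the map header. */
static uint64_t map_header_end(uint64_t end)
{
	return ALIGN_UP(end, 8) + MAP_HEADER_SIZE;
}

/* Mirrors Z_EROFS_FULL_INDEX_START(): full indexes begin 8 bytes later. */
static uint64_t full_index_start(uint64_t end)
{
	return map_header_end(end) + 8;
}

/* Mirrors the new z_lclusterbits computation (4-bit field, mask 15). */
static unsigned int lclusterbits(unsigned int blkszbits, uint8_t h_clusterbits)
{
	return blkszbits + (h_clusterbits & 15);
}
```

With 4KiB blocks (`blkszbits` of 12), a stored value of 8 yields 20 bits, i.e. a 1MiB logical cluster. The previous mask of 7 would have truncated that value to 0, which is why the field is widened to bits 0-3.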
