-
Notifications
You must be signed in to change notification settings - Fork 4
Bitstream Format
A Kanzi bitstream usually starts with a header but can be headless.
In the latter case, a decompressor must know specific information about the bitstream in order to decompress it:
- the bitstream version (less than 16)
- the block size (between MIN_BITSTREAM_BLOCK_SIZE=1024 and MAX_BITSTREAM_BLOCK_SIZE=1024*0124*1024)
- the name of the entropy codec
- the list of names of the transforms (up to 8)
- the checksum type (0, 32 or 64 as the number of bits).
Except for headless mode, the header is present once at the beginning of the stream.
The structure of the header for bitstream version 6 is described below.
bits | role | value | comment | name |
32 | magic number | 0x4B414E5A | 'KANZ' | |
4 | bit stream version | 6 | current version | bsVersion |
2 | block checksum | 0 to 2 | 0: no checksum, 1: 32 bit, 2: 64 bits, 3: invalid | chkSize |
5 | entropy | see table | one among 32 | _entropyType |
48 | transforms | see table | up to 8 among 64 (8*6 bits) | _transformType |
28 | block size / 16 | 0 to (2^28)-1 | block size in bytes between 1024 and 1024*1024*1024 | _blockSize |
2 | bits for input size | 0 to 3 | 0: not provided or >= 2^48, 1: <2^16, 2: <2^32, 3: <2^48 | szMask |
0,16,32,48 | original size(*) | 0 to (2^48)-1 | if present, checked by the decompressor | _outputSize |
15 | padding | 0 | reserved for future use | |
24 | header checksum | 0 to (2^24)-1 | see formula below, checked by the decompressor | checksum |
(*) If the input size is not available to the compressor, this field is skipped.
Formula to compute header checksum:
// bsVersion, _entropyType, _transformType, _outputSize // and _blockSize are all read from the header (see above) uint32 seed = 0x01030507 * uint32(bsVersion); const uint32 HASH = 0x1E35A7BD; uint32 cksum = HASH * seed; cksum ^= (HASH * uint32(~chkSize)); cksum ^= (HASH * uint32(~_entropyType)); cksum ^= (HASH * uint32((~_transformType) >> 32)); cksum ^= (HASH * uint32(~_transformType)); cksum ^= (HASH * uint32(~_blockSize)); if (szMask != 0) { cksum ^= (HASH * uint32((~_outputSize) >> 32)); cksum ^= (HASH * uint32(~_outputSize)); } cksum = (cksum >> 23) ^ (cksum >> 3); cksum &= ((1 << 24) - 1);
This checksum must be identical to the 24 bit checksum value at the end of the bitstream header, otherwise the header is corrupted.
entropy types:
- NONE_TYPE = 0; No compression
- HUFFMAN_TYPE = 1; Huffman
- FPAQ_TYPE = 2; Fast PAQ (order 0)
- PAQ_TYPE = 3; PAQ (Obsolete, not supported)
- RANGE_TYPE = 4; Range
- ANS0_TYPE = 5; Asymmetric Numerical System order 0
- CM_TYPE = 6; Context Model
- TPAQ_TYPE = 7; Tangelo PAQ
- ANS1_TYPE = 8; Asymmetric Numerical System order 1
- TPAQX_TYPE = 9; Tangelo PAQ Extra
transform types:
- NONE_TYPE = 0; Copy
- BWT_TYPE = 1; Burrows Wheeler
- BWTS_TYPE = 2; Burrows Wheeler Scott
- LZ_TYPE = 3; Lempel Ziv
- SNAPPY_TYPE = 4; Snappy (obsolete, not supported)
- RLT_TYPE = 5; Run Length
- ZRLT_TYPE = 6; Zero Run Length
- MTFT_TYPE = 7; Move To Front
- RANK_TYPE = 8; Rank
- EXE_TYPE = 9; Codec for executables
- DICT_TYPE = 10; Text codec
- ROLZ_TYPE = 11; ROLZ codec
- ROLZX_TYPE = 12; ROLZ Extra codec
- SRT_TYPE = 13; Sorted Rank
- LZP_TYPE = 14; Lempel Ziv Predict
- MM_TYPE = 15; Multimedia (FSD) codec
- LZX_TYPE = 16; Lempel Ziv Extra
- UTF_TYPE = 17; UTF codec
- PACK_TYPE = 18; Pack codec
- DNA_TYPE = 19; DNA codec
After the header, one or many blocks must be present.
The bitstream must end with an empty block (which signals a valid end of stream).
Each block starts with a header with the following format:
bits | optional? | role | comment |
5 | no | log(cbs) - 3 | block size <= 1GB |
log(cbs) | no | cbs | compressed block size (cbs), if 0 last block |
The compressed block size is 1 << log(cbs) bits.
The last (empty) block is just one byte: five 0 bits for log(code block size)-3 and three 0 bits for the block size.
Each block header (except for the last empty block) is followed by block data.
bits | optional? | role | comment |
8 | no | block mode | see format below |
8 | yes | block transform skip flags | only present if more than 4 transforms |
8*ps | no | data | pre-transform size ps = 1+((mode>>5)&0x03) bytes |
0,32,64 | yes | block checksum | XXHash32 or XXHash64 block checksum |
The pre-transform size is the size of the data after entropy decoding and before applying inverse transforms.
// mode | 0b1yy0xxxx => copy block // | 0b0yy00000 => size(size(block))-1 // case 4 transforms or less // | 0b0001xxxx => transform sequence skip flags (1 means skip) // case more than 4 transforms // | 0b0yy00000 0bxxxxxxxx => transform sequence skip flags in next byte (1 means skip)
The skipFlag byte contains information about which transforms to apply (starting from the MSB).
Example:
TEXT+UTF+BWT+RANK+ZRLT
skipFlag = 0b01000111 => process Text (0), skip UTF (1), process BWT, RANK and ZRLT (0).
How to decompress a block:
- Extract the compressed data size from the block header
- Read the mode and the block data
- If the block is a "copy block", just copy the data to the output
- Else, entropy decode the data, then run sequentially every inverse transform on the decoded data (except those with a skip flag set to 1)
- If a checksum was present, compute checksum of the decompressed data and compare with the checksum provided in the bitstream