Bitstream Format

Introduction

A Kanzi bitstream usually starts with a header but can be headless.

In the latter case, a decompressor must know specific information about the bitstream in order to decompress it:

the bitstream version (less than 16)
the block size (between MIN_BITSTREAM_BLOCK_SIZE=1024 and MAX_BITSTREAM_BLOCK_SIZE=1024*0124*1024)
the name of the entropy codec
the list of names of the transforms (up to 8)
the checksum type (0, 32 or 64 as the number of bits).

Bitstream header format

Except for headless mode, the header is present once at the beginning of the stream.

The structure of the header for bitstream version 6 is described below.

bits	role	value	comment	name
32	magic number	0x4B414E5A	'KANZ'
4	bit stream version	6	current version	bsVersion
2	block checksum	0 to 2	0: no checksum, 1: 32 bit, 2: 64 bits, 3: invalid	chkSize
5	entropy	see table	one among 32	_entropyType
48	transforms	see table	up to 8 among 64 (8*6 bits)	_transformType
28	block size / 16	0 to (2^28)-1	block size in bytes between 1024 and 102410241024	_blockSize
2	bits for input size	0 to 3	0: not provided or >= 2^48, 1: <2^16, 2: <2^32, 3: <2^48	szMask
0,16,32,48	original size(*)	0 to (2^48)-1	if present, checked by the decompressor	_outputSize
15	padding	0	reserved for future use
24	header checksum	0 to (2^24)-1	see formula below, checked by the decompressor	checksum

(*) If the input size is not available to the compressor, this field is skipped.

Formula to compute header checksum:

   // bsVersion, _entropyType, _transformType, _outputSize 
   // and _blockSize are all read from the header (see above)

   uint32 seed = 0x01030507 * uint32(bsVersion);
   const uint32 HASH = 0x1E35A7BD;
   uint32 cksum = HASH * seed;
   cksum ^= (HASH * uint32(~chkSize));
   cksum ^= (HASH * uint32(~_entropyType));
   cksum ^= (HASH * uint32((~_transformType) >> 32));
   cksum ^= (HASH * uint32(~_transformType));
   cksum ^= (HASH * uint32(~_blockSize));

   if (szMask != 0) {
       cksum ^= (HASH * uint32((~_outputSize) >> 32));
       cksum ^= (HASH * uint32(~_outputSize));
   }

   cksum = (cksum >> 23) ^ (cksum >> 3);
   cksum &= ((1 << 24) - 1);

This checksum must be identical to the 24 bit checksum value at the end of the bitstream header, otherwise the header is corrupted.

entropy types:

NONE_TYPE = 0; No compression
HUFFMAN_TYPE = 1; Huffman
FPAQ_TYPE = 2; Fast PAQ (order 0)
PAQ_TYPE = 3; PAQ (Obsolete, not supported)
RANGE_TYPE = 4; Range
ANS0_TYPE = 5; Asymmetric Numerical System order 0
CM_TYPE = 6; Context Model
TPAQ_TYPE = 7; Tangelo PAQ
ANS1_TYPE = 8; Asymmetric Numerical System order 1
TPAQX_TYPE = 9; Tangelo PAQ Extra

transform types:

NONE_TYPE = 0; Copy
BWT_TYPE = 1; Burrows Wheeler
BWTS_TYPE = 2; Burrows Wheeler Scott
LZ_TYPE = 3; Lempel Ziv
SNAPPY_TYPE = 4; Snappy (obsolete, not supported)
RLT_TYPE = 5; Run Length
ZRLT_TYPE = 6; Zero Run Length
MTFT_TYPE = 7; Move To Front
RANK_TYPE = 8; Rank
EXE_TYPE = 9; Codec for executables
DICT_TYPE = 10; Text codec
ROLZ_TYPE = 11; ROLZ codec
ROLZX_TYPE = 12; ROLZ Extra codec
SRT_TYPE = 13; Sorted Rank
LZP_TYPE = 14; Lempel Ziv Predict
MM_TYPE = 15; Multimedia (FSD) codec
LZX_TYPE = 16; Lempel Ziv Extra
UTF_TYPE = 17; UTF codec
PACK_TYPE = 18; Pack codec
DNA_TYPE = 19; DNA codec

Block format

After the header, one or many blocks must be present.

The bitstream must end with an empty block (which signals a valid end of stream).

Each block starts with a header with the following format:

bits	optional?	role	comment
5	no	log(cbs) - 3	block size <= 1GB
log(cbs)	no	cbs	compressed block size (cbs), if 0 last block

The compressed block size is 1 << log(cbs) bits.

The last (empty) block is just one byte: five 0 bits for log(code block size)-3 and three 0 bits for the block size.

Each block header (except for the last empty block) is followed by block data.

bits	optional?	role	comment
8	no	block mode	see format below
8	yes	block transform skip flags	only present if more than 4 transforms
8*ps	no	data	pre-transform size ps = 1+((mode>>5)&0x03) bytes
0,32,64	yes	block checksum	XXHash32 or XXHash64 block checksum

The pre-transform size is the size of the data after entropy decoding and before applying inverse transforms.

// mode | 0b1yy0xxxx => copy block
//      | 0b0yy00000 => size(size(block))-1
//  case 4 transforms or less
//      | 0b0001xxxx => transform sequence skip flags (1 means skip)
//  case more than 4 transforms
//      | 0b0yy00000 0bxxxxxxxx => transform sequence skip flags in next byte (1 means skip)

The skipFlag byte contains information about which transforms to apply (starting from the MSB).

Example:

TEXT+UTF+BWT+RANK+ZRLT

skipFlag = 0b01000111 => process Text (0), skip UTF (1), process BWT, RANK and ZRLT (0).

How to decompress a block:

Extract the compressed data size from the block header
Read the mode and the block data
If the block is a "copy block", just copy the data to the output
Else, entropy decode the data, then run sequentially every inverse transform on the decoded data (except those with a skip flag set to 1)
If a checksum was present, compute checksum of the decompressed data and compare with the checksum provided in the bitstream

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bitstream Format

Introduction

Bitstream header format

Block format

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally