Skip to content

Bitstream Format

flanglet edited this page May 13, 2025 · 10 revisions

Introduction

A Kanzi bitstream usually starts with a header but can be headless.

In the latter case, a decompressor must know specific information about the bitstream in order to decompress it:

  • the bitstream version (less than 16)
  • the block size (between MIN_BITSTREAM_BLOCK_SIZE=1024 and MAX_BITSTREAM_BLOCK_SIZE=1024*0124*1024)
  • the name of the entropy codec
  • the list of names of the transforms (up to 8)
  • the checksum type (0, 32 or 64 as the number of bits).

Bitstream header format

Except for headless mode, the header is present once at the beginning of the stream.

The structure of the header for bitstream version 6 is described below.

bits role value comment name
32 magic number 0x4B414E5A 'KANZ'
4 bit stream version 6 current version bsVersion
2 block checksum 0 to 2 0: no checksum, 1: 32 bit, 2: 64 bits, 3: invalid chkSize
5 entropy see table one among 32 _entropyType
48 transforms see table up to 8 among 64 (8*6 bits) _transformType
28 block size / 16 0 to (2^28)-1 block size in bytes between 1024 and 1024*1024*1024 _blockSize
2 bits for input size 0 to 3 0: not provided or >= 2^48, 1: <2^16, 2: <2^32, 3: <2^48 szMask
0,16,32,48 original size(*) 0 to (2^48)-1 if present, checked by the decompressor _outputSize
15 padding 0 reserved for future use
24 header checksum 0 to (2^24)-1 see formula below, checked by the decompressor checksum

(*) If the input size is not available to the compressor, this field is skipped.

Formula to compute header checksum:

   // bsVersion, _entropyType, _transformType, _outputSize 
   // and _blockSize are all read from the header (see above)

   uint32 seed = 0x01030507 * uint32(bsVersion);
   const uint32 HASH = 0x1E35A7BD;
   uint32 cksum = HASH * seed;
   cksum ^= (HASH * uint32(~chkSize));
   cksum ^= (HASH * uint32(~_entropyType));
   cksum ^= (HASH * uint32((~_transformType) >> 32));
   cksum ^= (HASH * uint32(~_transformType));
   cksum ^= (HASH * uint32(~_blockSize));

   if (szMask != 0) {
       cksum ^= (HASH * uint32((~_outputSize) >> 32));
       cksum ^= (HASH * uint32(~_outputSize));
   }

   cksum = (cksum >> 23) ^ (cksum >> 3);
   cksum &= ((1 << 24) - 1);

This checksum must be identical to the 24 bit checksum value at the end of the bitstream header, otherwise the header is corrupted.

entropy types:

  • NONE_TYPE = 0; No compression
  • HUFFMAN_TYPE = 1; Huffman
  • FPAQ_TYPE = 2; Fast PAQ (order 0)
  • PAQ_TYPE = 3; PAQ (Obsolete, not supported)
  • RANGE_TYPE = 4; Range
  • ANS0_TYPE = 5; Asymmetric Numerical System order 0
  • CM_TYPE = 6; Context Model
  • TPAQ_TYPE = 7; Tangelo PAQ
  • ANS1_TYPE = 8; Asymmetric Numerical System order 1
  • TPAQX_TYPE = 9; Tangelo PAQ Extra

transform types:

  • NONE_TYPE = 0; Copy
  • BWT_TYPE = 1; Burrows Wheeler
  • BWTS_TYPE = 2; Burrows Wheeler Scott
  • LZ_TYPE = 3; Lempel Ziv
  • SNAPPY_TYPE = 4; Snappy (obsolete, not supported)
  • RLT_TYPE = 5; Run Length
  • ZRLT_TYPE = 6; Zero Run Length
  • MTFT_TYPE = 7; Move To Front
  • RANK_TYPE = 8; Rank
  • EXE_TYPE = 9; Codec for executables
  • DICT_TYPE = 10; Text codec
  • ROLZ_TYPE = 11; ROLZ codec
  • ROLZX_TYPE = 12; ROLZ Extra codec
  • SRT_TYPE = 13; Sorted Rank
  • LZP_TYPE = 14; Lempel Ziv Predict
  • MM_TYPE = 15; Multimedia (FSD) codec
  • LZX_TYPE = 16; Lempel Ziv Extra
  • UTF_TYPE = 17; UTF codec
  • PACK_TYPE = 18; Pack codec
  • DNA_TYPE = 19; DNA codec

Block format

After the header, one or many blocks must be present.

The bitstream must end with an empty block (which signals a valid end of stream).

Each block starts with a header with the following format:

bits optional? role comment
5 no log(cbs) - 3 block size <= 1GB
log(cbs) no cbs compressed block size (cbs), if 0 last block

The compressed block size is 1 << log(cbs) bits.

The last (empty) block is just one byte: five 0 bits for log(code block size)-3 and three 0 bits for the block size.

Each block header (except for the last empty block) is followed by block data.

bits optional? role comment
8 no block mode see format below
8 yes block transform skip flags only present if more than 4 transforms
8*ps no data pre-transform size ps = 1+((mode>>5)&0x03) bytes
0,32,64 yes block checksum XXHash32 or XXHash64 block checksum

The pre-transform size is the size of the data after entropy decoding and before applying inverse transforms.

// mode | 0b1yy0xxxx => copy block
//      | 0b0yy00000 => size(size(block))-1
//  case 4 transforms or less
//      | 0b0001xxxx => transform sequence skip flags (1 means skip)
//  case more than 4 transforms
//      | 0b0yy00000 0bxxxxxxxx => transform sequence skip flags in next byte (1 means skip)

The skipFlag byte contains information about which transforms to apply (starting from the MSB).

Example:

TEXT+UTF+BWT+RANK+ZRLT

skipFlag = 0b01000111 => process Text (0), skip UTF (1), process BWT, RANK and ZRLT (0).

How to decompress a block:

  • Extract the compressed data size from the block header
  • Read the mode and the block data
  • If the block is a "copy block", just copy the data to the output
  • Else, entropy decode the data, then run sequentially every inverse transform on the decoded data (except those with a skip flag set to 1)
  • If a checksum was present, compute checksum of the decompressed data and compare with the checksum provided in the bitstream
Clone this wiki locally