Sunday, September 24, 2006

Xdelta3 has support for secondary compression, part of VCDIFF (RFC 3284) that allows external compression algorithms for the three parts of a VCDIFF window (instruction, address, data). VCDIFF is entirely based on byte codes, not variable-length codes.

Most compression programs are based on Huffman coding, and there's the well-known Burrows–Wheeler transform implemented by bzip2. But there's more to it. I asked Julian Seward and he pointed me at two very interesting articles by D. J. Wheeler.

The decription, and the code. Fascinating.

The algorithm works with multiple Huffman-code tables and several iterations. The input is divided into fixed-length chunks, each chunk assigned to a Huffman table. Each iteration, chunks are assigned to the table which gives the shortest encoding, then the tables are recomputed according to the chunks they were assigned.

The chunk encodings plus the chunk–table assignments are then transformed by move-to-front. Move-to-front works very well (especially following Burrows–Wheeler), but it presents another problem for later Huffman coding, because after move-to-front coding there tend to be many 0s, and a symbol having frequency greater than 50% will not be efficiently encoded by Huffman coding—even the shortest 1-bit code has redundency. To address this problem, the 0 symbol is replaced by two symbols (call them 0_0 and 0_1). These two symbols are used to code the run-length of 0s in binary.

What really fascinates me is how Wheeler does this in 780 lines of code—including the Burrows–Wheeler transform. Amazing.

Xdelta3 has a secondary compressor based on DJW enabled by -S djw (2000 lines w/ no Burrows–Wheeler transform). For comparison (*), there's another secondary compression (enabled by -S fgk), based on D. E. Knuth's Dynamic Huffman coding. A dynamic huffman code updates code-table frequencies after each symbol is encoded or decoded. The routines are efficient (it's Knuth, after all), but still slower than DJW.

Since there are no standards written for secondary compression, secondary compression is turned off by default, but I recommend giving -S djw a try. Sample results:

xdelta -e -1613,032
xdelta -e -9610,560
xdelta -e -1 -S djw461,298
xdelta -e -9 -S djw458,859
xdelta -e -1 -S fgk476,742
xdelta -e -9 -S fgk474,244

(*) only because I once implemented FGK in school

Comments: Post a Comment

This page is powered by Blogger. Isn't yours?