Sunday 31 October 2010

On lrzip and packing bytes

As some of you may be aware I maintain a compression application called lrzip which is an evolution of Tridge's rzip which does a range coding pre-processing pass and then compresses the data with bzip2. Lrzip does the compression with a selection of backend compressors, using lzma by default but offering extremely fast with lzo or extremely slow with zpaq. The other thing lrzip does differently is I updated all the code to work properly on 64 bit machines with super large sizes. The idea behind lrzip is that as machines get more ram, and file sizes get bigger, the compression it will be capable of achieving will keep getting better, as it uses compression windows limited only in size by available ram. It is not a complex mature application with lots of features like 7z, but simply offers solid compression at good speeds.

See more about it with benchmarks and other advertising blurbs here:
lrzip.kolivas.org

Anyway when I updated it to 64 bit, my conversion of the code was a little inefficient, simply changing all the offsets in the pre-processor stage from 16 and 32 bit variables to 64 bit ones. This lost some of the efficiency of the compression because suddenly 2 and 4 byte entries were now 8 bytes long. However, since the 8 bytes were mostly zeroes, they would compress quite well since the backend compression pass was done next. Still, it was clear this was inefficient. So I did some experimenting.

The first thing I did was to base the byte width on the file size of the archive, thus allowing very small archives to have even less than 4 byte wide entries, and most archives would likely be 4-5 bytes wide. 5 bytes gives us 40 bits or files up to 1,099,511,627,776 in size. The compression gains were small but very real, and given that with zpaq compression lrzip's overall compression is rather extreme, any gains are worthwhile chasing.

The next thing I tried was packing bits. As most offsets were likely to be a lot less than the full file size, then it is possible to just make each entry only as many bytes wide as the data itself. To do this, I encoded how many bytes wide the data was within the data itself, by adding 3 bits to it. Having only 61 bits for the data itself of a 64 bit variable is not a problem since a 61 bit sized file is up to 2,305,843,009,213,693,952 in size.

Here's a rundown of how the code looked for encode/decode:
static inline void put_vbytes(void *ss, int stream, i64 s)
{
 uchar encoded_bytes;
 int bytes, bits = 0;

 while (s > (1LL << bits))
  bits++;
 /* Add one bit for bit 0, and 3 bits for the encoded number of bytes */
 bits += 4;

 bytes = bits / 8;
 /* Round up to byte sized */
 if (bits % 8)
  bytes++;

 /* Always assume there will be one byte, so we only need 3 bits to count
  * the remaining 7 possible. Left shift s by 3 bits and then encode the
  * 3 bits of remaining bytes into the right-most bits of s that will be
  * written thus making them first */
 encoded_bytes = bytes - 1;
 s <<= 3;
 s |= (i64)encoded_bytes;
 put_vchars(ss, stream, s, bytes);
}


static inline i64 read_vbytes(void *ss, int stream)
{
 uchar encoded_bytes, sb = read_u8(ss, stream);
 i64 s;
 int bytes;

 /* Grab the right most 3 bits. It has encoded how many remaining
  * bytes this entry is */
 encoded_bytes = sb & 7;
 s = sb;
 for (bytes = 1; bytes <= encoded_bytes; bytes++) {
  int bits = bytes * 8;

  sb = read_u8(ss, stream);
  s |= (i64)sb << bits;
 }
 s >>= 3;
 return s;
}

Anyway after a lot of fluffing around to actually make this code work since I had numerous mistakes while getting this far, I finally got around to testing it. The results of the pre-processing stage were very promising with the data getting a lot smaller. Then after the backend compression stage kicked in, the final compressed file size was almost the same as it was with just the 8 byte wide entries already in use in the current lrzip version 0.47 release. I sat there looking at it thinking "well that was a complete waste of time". So after thinking about it a little more, I realised that this bit-packing I was doing is basically just another form of compression, so it was making the compression stage less efficient. Incidentally, some file formats and compression formats I believe use a similar form of bit packing.

So I'll be throwing away the bit-packing idea and going back to the fixed width idea since the final compression size will be smaller that way. However I wanted to put to do something useful with the code since I made it and figured I should put it on this blog since I'm about to delete it from the actual code :) So back to the fixed byte width coding, hopefully with a new version of lrzip not too far away.

No comments:

Post a Comment