The problem he was encountering was on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command. On decompressing the file and comparing it, there would be differences in the file at random places. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).
Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently internally used crc32 to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect what is the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum which is standard on all modern linux installations to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further, and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.
One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.
Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.
Here's the latest results from my standard 10GB virtual image compression test:
Compression Size Percentage Compress Time Decompress Time None 10737418240 100.0 gzip 2772899756 25.8 5m47s 2m46s bzip2 2704781700 25.2 16m15s 6m19s xz 2272322208 21.2 50m58s 3m52s 7z 2242897134 20.9 26m36s 5m41s lrzip 1372218189 12.8 10m23s 2m53s lrzip -U 1095735108 10.2 8m44s 2m45s lrzip -l 1831894161 17.1 4m53s 2m37s lrzip -lU 1414959433 13.2 4m48s 2m38s lrzip -zU 1067075961 9.9 69m36s 69m35s
Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:
Detected lrzip version 0.5 file. ../temp/enwik8.lrz: lrzip version: 0.5 file Compression: rzip + lzma Decompressed file size: 100000000 Compressed file size: 26642293 Compression ratio: 3.753 MD5 used for integrity testing MD5: a1fa5ffddb56f4953e226637dabbb36a Rzip chunk 1: Stream: 0 Offset: 25 Block Comp Percent Size 1 lzma 58.1% 867413 / 1493985 Offset: 22687516 Head: 0 Stream: 1 Offset: 25 Block Comp Percent Size 1 lzma 28.8% 5756191 / 20000000 Offset: 75 Head: 5756266 2 lzma 28.4% 5681891 / 20000000 Offset: 5756291 Head: 11438182 3 lzma 28.2% 5630256 / 20000000 Offset: 11438207 Head: 17068463 4 lzma 28.1% 5619003 / 20000000 Offset: 17068488 Head: 23554929 5 lzma 28.5% 3087298 / 10841364 Offset: 23554954 Head: 0 Rzip compression: 92.3% 92335349 / 100000000 Back end compression: 28.9% 26642052 / 92335349 Overall compression: 26.6% 26642052 / 100000000
I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.
That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip