Tuesday, 8 March 2011

lrzip 0.571

I had set out to do some major surgery to lrzip to make it work better with STDIN/STDOUT and not use physical temporary files at all. However, as I mentioned in my last blog post, my approach of using shared memory as files was a failure as it only really worked on linux. So after much angst, I decided to simply put together all the minor bugfixes that I had accumulated and bring out a new version. So here is a mainly small bugfix release lrzip version 0.571:
lrzip on freshmeat

On a related note I chanced upon this article recently which compared compression tools:
best-linux-compression-tool-8-utilities-tested
tl:dr lrzip wins

Suffice to say that lrzip did pretty well, despite them using a much older version that predates any of the multi-threading enhancements!

Monday, 7 March 2011

Shared memory on Darwin and lrzip

So I've spent the last month working on lrzip trying to make it work without temporary files when working from STDIN +/- to STDOUT. Virtually all of that work has gone into creating pseudo files in memory using posix shared memory in the form of shm_open() and friends. To complete the work,I even needed to upgrade the file format for lrzip when files are dumped to stdout and beyond a certain size. As the code has taken shape and is almost finalised, I have tried to get as much testing as possible before making a release. Lo and behold, OS-X screws me over yet again. I discover that the so-called portable posix shared memory that Darwin is supposed to support, is not fully featured. By that I mean I can open an shm file with shm_open and then I can mmap it. But I can't actually access the file as a file... which was entirely the point. Instead I get the meaningless "operation not supported" when trying to write to the file, or a failure to seek on it entirely. What's more is the open_shm doesn't really work with create and truncate at the same time too... So for now, Darwin will be stuck on physical temporary files while the rest of us move on. Sigh... Someone pointed out that perhaps the whole treating of the shared memory object as a file isn't actually part of the posix guarantee at all, so once again I've gotten carried away with something that isn't portable.

EDIT: After a bit more thought I'm going to abandon all this use of shared memory and reconsider how I might do stdin/out without temporary files yet again.

Friday, 25 February 2011

lrzip-0.570 for the uncorrupt.

When I last blogged about lrzip, I mentioned the corruption on decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly so I set out on a major hunt to eliminate it, and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short, the corruption was NEVER THERE.

The problem he was encountering was on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command. On decompressing the file and comparing it, there would be differences in the file at random places. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).

Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently internally used crc32 to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect what is the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum which is standard on all modern linux installations to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further, and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.

One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.

Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.

Here's the latest results from my standard 10GB virtual image compression test:
Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756    25.8          5m47s        2m46s
bzip2        2704781700    25.2         16m15s        6m19s
xz           2272322208    21.2         50m58s        3m52s
7z           2242897134    20.9         26m36s        5m41s
lrzip        1372218189    12.8         10m23s        2m53s
lrzip -U     1095735108    10.2          8m44s        2m45s
lrzip -l     1831894161    17.1          4m53s        2m37s
lrzip -lU    1414959433    13.2          4m48s        2m38s
lrzip -zU    1067075961     9.9         69m36s       69m35s

Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:

Detected lrzip version 0.5 file.
../temp/enwik8.lrz:
lrzip version: 0.5 file
Compression: rzip + lzma
Decompressed file size: 100000000
Compressed file size: 26642293
Compression ratio: 3.753
MD5 used for integrity testing
MD5: a1fa5ffddb56f4953e226637dabbb36a
Rzip chunk 1:
Stream: 0
Offset: 25
Block   Comp    Percent Size
1       lzma    58.1%   867413 / 1493985        Offset: 22687516        Head: 0
Stream: 1
Offset: 25
Block   Comp    Percent Size
1       lzma    28.8%   5756191 / 20000000      Offset: 75      Head: 5756266
2       lzma    28.4%   5681891 / 20000000      Offset: 5756291 Head: 11438182
3       lzma    28.2%   5630256 / 20000000      Offset: 11438207        Head: 17068463
4       lzma    28.1%   5619003 / 20000000      Offset: 17068488        Head: 23554929
5       lzma    28.5%   3087298 / 10841364      Offset: 23554954        Head: 0
Rzip compression: 92.3% 92335349 / 100000000
Back end compression: 28.9% 26642052 / 92335349
Overall compression: 26.6% 26642052 / 100000000

I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.

That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip

Monday, 14 February 2011

2.6.37-ck2 minor fixes.

I've put up a small updated -ck version for 2.6.37. There are only 2 changes: 1. The build fix for it not compiling with CPU HOTPLUG disabled, and 2. I've dropped the patch called mm-make_swappiness_really_mean_it.patch . This second patch broke a while back and I never noticed because I had swap disabled, sorry. It works better with it disabled. Note that the ubuntu packages I recently made available include this change which I quietly snuck in, but I will make ck2 packages officially available just to avoid confusion ;) If you've built your own 2.6.37-ck1 then I recommend upgrading only if you have swap enabled on your machine (which most people still do). Alternatively if you built it with CPU_HOTPLUG enabled just to make it build, then you _can_ upgrade to this version to build with it disabled but you won't notice a difference with the feature disabled - your kernel will only be slightly smaller. It might be a few nanoseconds faster..

Get it here:
2.6.37-ck2