Friday, 25 February 2011

lrzip-0.570 for the uncorrupt.

When I last blogged about lrzip, I mentioned the corruption-on-decompression issue a user was seeing in the field. Not surprisingly, this bug worried me greatly, so I set out on a major hunt to eliminate it and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short: the corruption was NEVER THERE.

The problem he was encountering was this: after decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command, and the decompressed file would differ from the original at random places. This made me think there was a memory corruption somewhere in lrzip. However, he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze, so we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on the copy. Lo and behold, just copying a file of that size would randomly produce a file with differences in it. This is a disturbing bug, and had it been confined to one machine, it would have pointed the finger at the hardware. However, he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway, this particular problem now needs to be tackled elsewhere (i.e. Debian).

Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression, when it is mandatory that what ends up on disk is the same as what was originally compressed. Until now, lrzip has silently used crc32 internally to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once files reach gigabyte sizes (collisions being different files that produce the same CRC value). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file the value will never be the same. However, the CRC was only being done on each decompressed chunk and not on the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so the file format remains compatible with all previous 0.5 versions of lrzip; I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect the most powerful hash function present in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum, which is standard on all modern Linux installations, to compare the value to the one stored in the archive on either compression or decompression. I took this idea one step further and added an option to lrzip (-c) to actually do an md5 of the file that has been written to disk on decompression. This is to ensure that what is written to disk is what was actually extracted! The Debian Lenny bug is what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.
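
For the curious, the hashing side of this is straightforward. Below is a minimal sketch of the kind of incremental MD5 accumulation involved, assuming the gnulib-style md5.h interface that coreutils provides (md5_init_ctx, md5_process_bytes, md5_finish_ctx); the wrapper function names are mine rather than lrzip's.

/* A minimal sketch of per-chunk MD5 accumulation, assuming the gnulib-style
 * md5.h interface that ships with coreutils (md5_init_ctx, md5_process_bytes,
 * md5_finish_ctx). The surrounding function and variable names are
 * illustrative only and are not taken from the lrzip source. */
#include <stdio.h>
#include <stdint.h>
#include "md5.h"

static struct md5_ctx ctx;
static uint32_t digest[4];      /* 16 bytes, kept 32-bit aligned for md5_finish_ctx */

void hash_init(void)
{
        md5_init_ctx(&ctx);
}

/* Called once per decompressed chunk as it is produced, so the whole file
 * never has to be re-read just to hash it. */
void hash_update(const void *buf, size_t len)
{
        md5_process_bytes(buf, len, &ctx);
}

/* At the end of the stream: finalise and print in the same form as md5sum,
 * so the value can be compared against the one stored in the archive. */
void hash_final(void)
{
        const unsigned char *p = (const unsigned char *)digest;
        int i;

        md5_finish_ctx(&ctx, digest);
        for (i = 0; i < 16; i++)
                printf("%02x", p[i]);
        printf("\n");
}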

One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.
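
The change itself is conceptually tiny. Here is an illustrative sketch (with made-up names, not lrzip's) of what swapping a binary lock from an unnamed POSIX semaphore to a pthread mutex amounts to.

/* Illustrative sketch only: a binary lock that used to be an unnamed POSIX
 * semaphore (sem_init/sem_wait/sem_post), which OS X compiles but reports as
 * unimplemented at run time, expressed as a pthread mutex instead. The lock
 * and function names are made up for this example. */
#include <pthread.h>

static pthread_mutex_t output_lock = PTHREAD_MUTEX_INITIALIZER;

void lock_output(void)
{
        pthread_mutex_lock(&output_lock);       /* was: sem_wait(&sem) */
}

void unlock_output(void)
{
        pthread_mutex_unlock(&output_lock);     /* was: sem_post(&sem) */
}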

Over the last few revisions of lrzip, apart from the multi-threading changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) have unfortunately also slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation, so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash checks of both md5 and crc, which would otherwise have slowed things down further with an extra pass, are now done only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long-term battle with lrzip, never fully resolved, is deciding how much ram to make available for each stage of the rzip pre-processing and then for each thread doing compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to account more accurately for how much ram each thread would use, and push up the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway, the final upshot is that although more threads are in use now (which would normally worsen compression), the compression ratio has stayed approximately the same and it is actually faster.
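
The "one extra thread" logic amounts to something like the following hypothetical sketch; sysconf() is standard, but the function name is mine, not lrzip's.

/* Hypothetical sketch of the "one more thread than CPUs" idea: thread
 * start-up is partly serialised, so an extra worker helps keep every core
 * busy. The name decide_thread_count is illustrative only. */
#include <unistd.h>

long decide_thread_count(void)
{
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);

        if (cpus < 1)
                cpus = 1;
        return cpus + 1;        /* one extra thread over the CPU count */
}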

Here are the latest results from my standard 10GB virtual image compression test:
Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756    25.8          5m47s        2m46s
bzip2        2704781700    25.2         16m15s        6m19s
xz           2272322208    21.2         50m58s        3m52s
7z           2242897134    20.9         26m36s        5m41s
lrzip        1372218189    12.8         10m23s        2m53s
lrzip -U     1095735108    10.2          8m44s        2m45s
lrzip -l     1831894161    17.1          4m53s        2m37s
lrzip -lU    1414959433    13.2          4m48s        2m38s
lrzip -zU    1067075961     9.9         69m36s       69m35s

Lots of other internal changes have gone into it that are too numerous to describe in depth here (see the Changelog for the short summary), but some user-visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files", while by default corrupt/broken files generated during compression/decompression are automatically deleted. The -i information option now gives more information, and has verbose(+) modes to give a breakdown of the lrzip archive, like the following -vvi example:

Detected lrzip version 0.5 file.
../temp/enwik8.lrz:
lrzip version: 0.5 file
Compression: rzip + lzma
Decompressed file size: 100000000
Compressed file size: 26642293
Compression ratio: 3.753
MD5 used for integrity testing
MD5: a1fa5ffddb56f4953e226637dabbb36a
Rzip chunk 1:
Stream: 0
Offset: 25
Block   Comp    Percent Size
1       lzma    58.1%   867413 / 1493985        Offset: 22687516        Head: 0
Stream: 1
Offset: 25
Block   Comp    Percent Size
1       lzma    28.8%   5756191 / 20000000      Offset: 75      Head: 5756266
2       lzma    28.4%   5681891 / 20000000      Offset: 5756291 Head: 11438182
3       lzma    28.2%   5630256 / 20000000      Offset: 11438207        Head: 17068463
4       lzma    28.1%   5619003 / 20000000      Offset: 17068488        Head: 23554929
5       lzma    28.5%   3087298 / 10841364      Offset: 23554954        Head: 0
Rzip compression: 92.3% 92335349 / 100000000
Back end compression: 28.9% 26642052 / 92335349
Overall compression: 26.6% 26642052 / 100000000

I didn't bother blogging about version 0.560 because 0.570 was under heavy development all the while, and I figured I'd wrap it all up as one nice big update instead. I'm also very pleased that Peter Hyman, who contributed code to lrzip some time ago, has once again started contributing code.

That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip

17 comments:

  1. Con: What CPU and what versions of gzip, bzip2, xz, 7z have you been using for these benchmarks? Command line arguments?

  2. The options used were just the defaults except where noted. CPU was a quad-core 3GHz Core 2.

    Versions are as follows (straight from ubuntu packages):
    gzip 1.3.12
    bzip2, Version 1.0.5
    xz 4.999.9beta
    7-Zip (A) 9.04 beta

    Full benchmarks and more details are here:
    http://ck.kolivas.org/apps/lrzip/README.benchmarks

  3. Hi Con,

    a question about lrzip. I used "lrzip ./directory" to compress my ZEN kernel branch, about 2GB so far. I know you created the lrztar command for this, but lrzip appeared to be doing something, so I gave it a try. After 1 hour I cancelled the operation. Up to that point lrzip had been using the CPU heavily, but the result was a directory.lrzip file of only a few KB. OK, so the question: what had lrzip been doing? (Maybe you could implement some sanity check for file vs. directory :-) ). After that I checked the lrztar command. Is it correct that this will tar ./directory, then lrzip the resulting tar file, and after that operation delete the tar file? That means I need nearly double the size in free space to lrztar a directory? Or does it make sense to use the --use-compress-program= switch of tar?

    OK, and here are some results (dual-core 2.2 GHz, 64-bit openSUSE):
    2.1GB tar of the git tree of the Zen Linux kernel:
    gzip: 11min 920MB
    lrzip -lU: 5min 850MB
    lrzip: 16min 630MB
    lrzip -U: 22min 630MB

    To be fair, the gzip run was combined with tar on the whole directory, while the lrzip runs were done on the tar file.

    CU sysitos

  4. Hey Mike

    I actually have no idea what it's doing when you pass it a directory. It's definitely a bug and it needs to detect that it hasn't been passed a file. That's a good idea.

    Now about the whole tar issue: rzip - which became lrzip - was never designed to work on stdin/stdout. Because it has to pretty much read from one end of the file to the other on both the compression and decompression side to derive the benefits of the rzip preprocessing, it means that the whole input file, buffers for compression/decompression, and the output file, all need to fit into ram if it's to work on stdin/stdout. This is a major limitation of lrzip, and whether you use tar --use-compress-program or not, it ends up generating faked temporary files taking up double the space during the process. A relatively recent change to lrzip is the ability to compress from stdin without a temporary file, but because it stores it all in ram, it actually is less efficient than compressing whole files. For a long time I've been considering ways to make it work properly on stdin/stdout and they all point to some problems - that I'd need to change the file format yet again, and that it may be impossible to decompress a file from stdin on a machine with less ram than the one it was compressed on. To change the file format and then not be able to actually cope in all circumstances seems a complete waste. Nonetheless I am still trying to find solutions, but all of them require major code surgery. As for your times and sizes, they look nice. Interestingly you didn't derive any benefit from -U suggesting not much redundancy in your data.

  5. damn con, you code the best stuff ever. this utility is amazing. kudos. the compression ratio is amazing in this thing. i love your work man, keep it up.

  6. Hi Con,

    sometimes my free space is limited, and so I asked about a pipe with tar. It would be nice if there were only one step to produce the archive for a directory tree. I see the problem, though: it's better to know which data is to be compressed than to wait for whatever comes down the pipe ;)
    But the problem with the compression window, and with unpacking without enough memory, still exists with the 64/32-bit versions, or am I wrong?
    Btw, the md5 check is a fine security/verification feature; shouldn't it be the default on decompression?

    Another suggestion: lrzip help output:
    Usage: lrzip [options]
    better: Usage: lrzip [options] (as you have already done in the man page). So a dummy could see that it's only for one file ;)

    CU sysitos

  7. Hi Con,

    there was some stripping on the written text, so here again:

    lrzip [options] file...
    better: lrzip [options] file

    CU sysitos

  8. Yes I understand, but for the reasons I already said, piping the data via stdin still generates temporary files. I'm currently working hard to minimise this problem and you can check out the progress here:
    https://github.com/ckolivas/lrzip/tree/stdinout
    It's a very big rewrite and compression is not as good when you use stdin/out, but so many people have requested not using temporary files that I've been trying to find a way to implement it. I may have to change the file format (perhaps go to version 0.600) for the most benefit though.

  9. md5 checking of decompressed data is always performed on decompression, whether it shows it or not. What is NOT done is to automatically verify what is written to disk as well. Normally you would trust your filesystem and operating system, and no other compression program does a verify on disk by default!

  10. Hi Con,

    I saw something about stdinout on my git pull. But it seems that something went wrong, because at the moment I don't have a ./configure file anymore, and so I can't compile it. Or must I change the branch or do other things with git (still a noob with git, so sorry for the question)?

    PS: Even if the compression with stdinout isn't as good as the file method, I assume you still beat the others ;)

    CU sysitos

  11. It has a new build system and that's not a release, so you need to run
    ./autogen.sh
    first.

  12. Hi Con,

    how can I tell if I am using the new stdinout lrzip? I changed to the stdinout branch in git, then ran ./autogen.sh etc. But the version output still shows:
    Will not read stdin from a terminal. Use -f to override.
    lrzip version 0.570
    And it seems that piping or tar --use-compress-program=lrzip doesn't work. I know, it's still an early beta ;)

    Thanks.
    CU sysitos

  13. If it says "will not read stdin from a terminal" it's because you're not piping anything into it? Unless you have one of the earlier broken versions from git. It is still under heavy development. The current git version works fine here from pipes, with redirection, and with --use-compress-program on tar. I haven't put up the version number yet because it's not ready for release :P

  14. Hi Con, I am working with the newest git version of the stdinout branch, but piping doesn't work. So it seems that I made a mistake somewhere.

    "lrzip < input > output" does work
    "cat input | lrzip > output" doesn't work (broken pipe)
    Same valid for tar.

    So I will wait until you merge the 2 branches in git or release a new version ;) I can't find my error.

    Thanks so far.
    CU sysitos

  15. cat input | lrzip > output works for me
    tar works for me

    You'll have to email me exactly what your command lines are and what the errors are. I'm unable to reproduce any of the problems you describe: kernel@kolivas.org

  16. Hi Con, it does work now for me too (since yesterday). But because there is also a version change, I decided to wait for your official announcement ;)

    I did some tests with my Linux kernel source tree, and it looks very good.
    tar | lrzip -l -> is quicker and compresses better than tgz
    tar -I lrzip -> is quicker and compresses better than tar.xz

    "tar -I lrzip" and "tar && lrzip" are nearly equal in compression ratio, but the second one, with the additional tar file, is even quicker, which surprised me.

    Btw, the lrztar wrapper still uses a temporary file.

    CU sysitos

  17. Thanks

    It will always compress better and faster with a real temporary file for the reasons I've already outlined. The only reason I have upgraded the stdinout support is because people keep asking me over and over for it. There is a lot more you can do with solid files that you simply cannot do when reading from or writing to a blind pipe. The design of lrzip works precisely in a way that needs solid files and this latest experimental version works by creating temporary files as large as possible in ram. I'm not sure what I should do with lrztar just yet. The official release of the next version will still be a while away as I finalise what I will include in it.
