Thursday, 8 March 2012
I haven't been idle all this time. In all honesty I've spent a lot of time playing with my bitcoin mining code in cgminer, and after a couple of hardware failures with my mining rig I was finally able to turn my attention back to lrzip, which has been fairly quiet in the bug reporting department since the last stable release.
Here is version 0.610 (when freecode updates):
http://freecode.com/projects/long-range-zip
Or directly:
http://lrzip.kolivas.org
The major feature addition in this latest version is largely thanks to the work of Michael Blumenkrantz, in the form of a liblrzip library! This means you can now link lrzip compression and decompression into any other application. There is a range of APIs available, from the simplest gzip-equivalent lrzip_compress and lrzip_decompress functions to ones that expose all the low-level knobs and dials in lrzip. There are also a couple of demo examples of using these functions included in the source code, but in their simplest form they should be very easy to use. I look forward to seeing package managers, archivers and maybe even git and so on having lrzip support in the future ;) Being a large chunk of new code, it is not entirely impossible that bugs or issues with using the library will show up.
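To give a flavour of the simple end of the API, here's a minimal sketch of buffer-to-buffer compression and decompression. The prototypes (and header name) I use below are assumptions on my part, guessing at gzip-style allocate-for-you semantics, so check Lrzip.h and the bundled examples for the real signatures before borrowing any of this.

/* Hypothetical sketch only - the prototypes and header name are
 * assumptions; consult Lrzip.h and the demo code for the real API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Lrzip.h>

int main(void)
{
    const char *text = "hello hello hello hello hello hello";
    void *comp = NULL, *decomp = NULL;
    unsigned long comp_len = 0, decomp_len = 0;

    /* assumed: allocates *dest itself, returns false on failure */
    if (!lrzip_compress(&comp, &comp_len, text, strlen(text)))
        return 1;
    printf("compressed %zu -> %lu bytes\n", strlen(text), comp_len);

    if (!lrzip_decompress(&decomp, &decomp_len, comp, comp_len))
        return 1;
    printf("round trip: %.*s\n", (int)decomp_len, (char *)decomp);

    free(comp);
    free(decomp);
    return 0;
}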
WHATS-NEW summary:
The new liblrzip library allows you to add lrzip compression and decompression
to other applications with either simple lrzip_compress and lrzip_decompress
functions or fine control over all the options with low level functions.
Faster rzip stage when files are large enough to require the sliding mmap
feature (usually >1/3 of ram) and in unlimited mode.
A bug where multiple files being compressed or decompressed from the one
command line could become corrupted has been fixed.
Modification date of the decompressed file is now set to that of the lrzip
archive (support for storing the original file's date would require modifying
the archive format again).
Compilation warning fixes.
Make lrztar work with directories with spaces in their names.
The full changelog follows:
* Implement complete set of liblrzip libraries, documentation and example uses
with support for simple lrzip_compress() and lrzip_decompress() or complete
fine-grained control over all compression and decompression options.
* Use as much of the low buffer as possible with a single memcpy before going
fine-grained byte by byte.
* Preserve the compressed time on decompression where suitable.
* Store a copy of the control struct to be reused on subsequent files, preventing
variables modified in the control struct during the first file from corrupting
compression/decompression of the second file.
* Explicitly select C99 to avoid certain warnings.
* Generic modifications to silence -Wextra warnings.
* Fix typos.
* Use an array of parameters in lrztar to allow working with directories with
spaces in their names.
Enjoy.
Friday, 25 February 2011
lrzip-0.570 for the uncorrupt.
When I last blogged about lrzip, I mentioned the corruption-on-decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly, so I set out on a major hunt to eliminate it and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short: the corruption was NEVER THERE.
The problem he was encountering: after decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command, and there would be differences at random places in the file. This made me think there was a memory corruption somewhere in lrzip. However, he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze, so we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on the copy. Lo and behold, just copying a file of that size would randomly produce a file with differences in it. This is a disturbing bug, and had it been confined to one machine it would have pointed the finger at the hardware. However, he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway, this particular problem now needs to be tackled elsewhere (i.e. Debian).
Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression, when it is mandatory that what ends up on disk is the same as what was originally compressed. Till now, lrzip has silently used crc32 internally to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it produces quite a few collisions once files reach the gigabyte range (collisions being different files with the same CRC value). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file the value will never be the same. However, the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so the file format remains compatible with all previous 0.5 versions of lrzip; I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect the strongest hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum, which is standard on all modern linux installations, to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further and added an option to lrzip (-c) to do an md5 on the file that has been written to disk on decompression. This ensures that what is written on disk is what was actually extracted! The Debian Lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.
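The idea behind the -c check is simple enough to sketch. This is not lrzip's actual code (lrzip imports its md5 routines from coreutils); it's just an illustration of the general technique, using OpenSSL's md5 for brevity: hash each block as it's written out, then compare the digest against the one stored in the archive.

/* Illustration of the technique only, not lrzip's actual code.
 * Hash each decompressed block as it is written, then compare the
 * digest against the one stored in the archive. */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

/* Returns 0 if what was written to 'out' matches 'stored_digest'. */
static int write_and_verify(FILE *out, const unsigned char *blocks[],
                            const size_t lens[], int nblocks,
                            const unsigned char stored_digest[MD5_DIGEST_LENGTH])
{
    MD5_CTX ctx;
    unsigned char digest[MD5_DIGEST_LENGTH];
    int i;

    MD5_Init(&ctx);
    for (i = 0; i < nblocks; i++) {
        MD5_Update(&ctx, blocks[i], lens[i]);  /* hash the block... */
        fwrite(blocks[i], 1, lens[i], out);    /* ...then write it */
    }
    MD5_Final(digest, &ctx);
    return memcmp(digest, stored_digest, MD5_DIGEST_LENGTH);
}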
One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.
Over the last few revisions of lrzip, apart from the multi-threading changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) have unfortunately also slowed it down somewhat. Being a CPU scheduler nut, I wasn't satisfied with this situation, so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash checks (both md5 and crc), which would otherwise have slowed it down with an extra pass, are now done only on parts of the main file that are already buffered. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start-up of threads is partially serialised, which means we need more threads available than CPUs. One long-term battle with lrzip, never fully resolved, is how much ram to make available for each stage of the rzip pre-processing and then for each compression thread. After looking into the internals of the memory-hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push up the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. The final upshot is that although more threads are in use now (which would normally hurt the compression ratio), compression has been kept approximately the same but is actually faster.
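The thread-spawning side of that is easy to sketch (illustrative only, not lrzip's exact accounting): one thread more than there are online CPUs, with the usable ram budget split between them.

/* Illustrative sketch, not lrzip's exact accounting. */
#include <unistd.h>

static long compression_threads(void)
{
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);

    /* thread start-up is partially serialised, so spawn one more
     * thread than there are CPUs to keep them all busy */
    return (cpus > 0 ? cpus : 1) + 1;
}

static unsigned long long ram_per_thread(unsigned long long usable_ram)
{
    /* the bigger each thread's block, the better the compression,
     * so divide the whole budget between the threads */
    return usable_ram / compression_threads();
}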
Here's the latest results from my standard 10GB virtual image compression test:

Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756   25.8        5m47s          2m46s
bzip2        2704781700   25.2        16m15s         6m19s
xz           2272322208   21.2        50m58s         3m52s
7z           2242897134   20.9        26m36s         5m41s
lrzip        1372218189   12.8        10m23s         2m53s
lrzip -U     1095735108   10.2        8m44s          2m45s
lrzip -l     1831894161   17.1        4m53s          2m37s
lrzip -lU    1414959433   13.2        4m48s          2m38s
lrzip -zU    1067075961   9.9         69m36s         69m35s
Lots of other internal changes have gone in that are too numerous to cover in depth here (see the Changelog for the short summary), but some user-visible changes have also been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files", while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and in verbose mode gives a breakdown of the lrzip archive, as in the following -vvi example:

Detected lrzip version 0.5 file.
../temp/enwik8.lrz:
lrzip version: 0.5 file
Compression: rzip + lzma
Decompressed file size: 100000000
Compressed file size: 26642293
Compression ratio: 3.753
MD5 used for integrity testing
MD5: a1fa5ffddb56f4953e226637dabbb36a
Rzip chunk 1:
Stream: 0 Offset: 25
Block Comp Percent Size
1 lzma 58.1% 867413 / 1493985 Offset: 22687516 Head: 0
Stream: 1 Offset: 25
Block Comp Percent Size
1 lzma 28.8% 5756191 / 20000000 Offset: 75 Head: 5756266
2 lzma 28.4% 5681891 / 20000000 Offset: 5756291 Head: 11438182
3 lzma 28.2% 5630256 / 20000000 Offset: 11438207 Head: 17068463
4 lzma 28.1% 5619003 / 20000000 Offset: 17068488 Head: 23554929
5 lzma 28.5% 3087298 / 10841364 Offset: 23554954 Head: 0
Rzip compression: 92.3% 92335349 / 100000000
Back end compression: 28.9% 26642052 / 92335349
Overall compression: 26.6% 26642052 / 100000000
I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.
That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip
Saturday, 6 November 2010
lrzip-0.5.2 with unlimited size compression windows unconstrained by ram.
After my last blog, all I needed to do was sleep on the "sliding mmap" implementation I had so far and think about how to improve it so it would work in some reasonable time frame, instead of 100 to 1000 times slower. Yes, I know lrzip isn't as exciting as BFS to everyone else, but I'm having fun hacking on it.
The hash search in the rzip phase of lrzip moves progressively along the file offset, and match lengths are up to 65536 bytes long. So I changed my two sliding buffers so that the large one moves along with the progressive file offset, and the smaller one is 65536 bytes long. Well, the results were dramatic. Instead of being 100 times slower, my first attempt showed it to be only 3 times slower. So I cleaned it up and started benchmarking. Not only was it faster than the earlier implementation; surprisingly, it's sometimes faster to compress with it enabled than not! So it was clear I was onto a winner. I tried machines with a few different memory configurations and different file sizes, and I can see that as the amount of redundancy in the original file goes up, and as the file size deviates further from the ram size, it slows down more and more, but it's still a usable speed.
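Here's a sketch of the scheme (the names and structure are mine for illustration, not the actual lrzip internals): keep the big mapping tracking the progressive file offset, and remap a small 64KB window on demand for any read that falls outside it.

/* Sketch only - not the actual lrzip internals. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SMALL_WIN 65536 /* matches the maximum rzip match length */

struct sliding_map {
    int fd;
    uint8_t *big;           /* large mapping, moves with the offset */
    int64_t big_off, big_len;
    uint8_t *small;         /* 64k window remapped on demand */
    int64_t small_off;
};

static int sliding_get_byte(struct sliding_map *sm, int64_t off)
{
    /* the common case: the byte is in the big buffer, so it's free */
    if (off >= sm->big_off && off < sm->big_off + sm->big_len)
        return sm->big[off - sm->big_off];

    /* otherwise slide the small window - an munmap+mmap, which is
     * why matches landing outside ram cost more */
    if (!sm->small || off < sm->small_off ||
        off >= sm->small_off + SMALL_WIN) {
        int64_t new_off = off & ~((int64_t)SMALL_WIN - 1);
        uint8_t *w;

        if (sm->small)
            munmap(sm->small, SMALL_WIN);
        w = mmap(NULL, SMALL_WIN, PROT_READ, MAP_SHARED, sm->fd, new_off);
        if (w == MAP_FAILED) {
            sm->small = NULL;
            return -1;
        }
        sm->small = w;
        sm->small_off = new_off;
    }
    return sm->small[off - sm->small_off];
}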
Using the classic 10GB virtual image file on an 8GB machine as the example, here's how it fared with the various options:

Options  Size        Compress  Decompress
-l       1793312108  5m13s     3m12s
-lM      1413268368  4m18s     2m54s
-lU      1413268368  6m05s     2m54s
Not bad at all really, considering it manages to mmap about 5.5GB and then the other 4.5GB depends on the sliding feature to work.
So I fixed a swag of other things, updated the docs and made all new benchmarks, and decided I couldn't sit on it any longer. At some stage you have to release, so version 0.5.1 is out, with the goodness of unlimited size compression windows, unconstrained by ram.
EDIT: Updated link to 0.5.2
lrzip-0.5.2
Here's some more fun benchmarks:
These are benchmarks performed on a 2.53GHz dual core Intel Core2 with 4GB of ram, using lrzip v0.5.1. Note that it was running with a 32 bit userspace, so only 2GB addressing was possible. However, the benchmark was run with the -U option, allowing the whole file to be treated as one large compression window.
Tarball of 6 consecutive kernel trees, linux-2.6.31 to linux-2.6.36:

Compression  Size        Percentage  Compress  Decompress
None         2373713920  100
7z           344088002   14.5        17m26s    1m22s
lrzip -l     223130711   9.4         05m21s    1m01s
lrzip -U     73356070    3.1         08m53s    43s
lrzip -Ul    158851141   6.7         04m31s    35s
lrzip -Uz    62614573    2.6         24m42s    25m30s
Kick arse.
Update: Again I managed to break the Darwin build in 0.5.1. This time it's a simple missing ';' at the end of line 536 in main.c. I'll confirm that it otherwise works on Darwin and do a quick bump to a new version. (edit: done)
EDIT: Just for grins, I tried compressing a 50GB freenet datastore on the 8GB ram machine with just the one window as a test. It succeeded with flying colours in about 2 hours, which is impressive given that the data is of very low compressibility (it's all encrypted) and the window was more than 6x the size of the ram.
Compression Ratio: 1.022. Average Compression Speed: 7.266MB/s.
Total time: 01:59:3563.609
It also decompressed fine:
Total time: 01:02:3566.894
Note that it was on an external USB hard drive which isn't the fastest thing to read/write from and reading and writing 50GB to the same drive probably would take an hour itself, so that's why it takes so long. The compression time frame looks good in light of that :)