So the latest version of lrzip seems to be working well in the field with very few bug reports which is nice considering the magnitude of the changes that went into 0.600. I let it stew for a while at 0.601 while I shook out any obvious bugs and am releasing a new stable version that is mostly a bugfix release. Here's the what's new entry for this version:
Now builds on Cygwin.
Fixed wrong symlinks which broke some package generation.
Imposed limits for 32bit machines with way too much ram for their own good.
Disable md5 generation on Apple for now since it's faulty.
Displays full version with -V.
Checks for podman on ./configure
File permissions are better carried over instead of being only 0600.
The only new "feature" is building on cygwin which was contributed by Тулебаев Салават. Thanks!
Just a reminder for what sort of data lrzip works particularly well on:
linux-2.6.0-2.6.38.tar.lrz
This is a tarball of all 39 stable kernel releases from 2.6.0 to 2.6.38 and is only 160MB.
Decompressed file size: 10618664960
Compressed file size: 168125950
Compression ratio: 63.159
Enjoy!
lrzip 0.602 at freshmeat
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Showing posts with label lrzip. Show all posts
Showing posts with label lrzip. Show all posts
Wednesday, 13 April 2011
Thursday, 7 April 2011
Quick lrzip comparison
So I was building a kernel package for myself and when I was done saw a large directory and thought, what the heck, I'll compress it with lrzip for grins.
These were done on a quad core 3GHz with 8GB ram using nothing but default options. Note that lrzip was done using the lrztar wrapper to make the comparison fair since it's just a single command without temporary files, so it took 3 passes to compress. If I compressed it with temporary files it would be smaller again (i.e. tar cf directory && lrzip)
Of course there is parallel bzip2 which speeds up bzip2's compression but has no effect on compression. And then there is xz which slows down 7z's compression and has no effect on compression. So why is xz becoming the new defacto standard? Probably because it's not aimed squarely at the windows market the way 7z is and feels more politically correct to the linux crowd. Maybe it's because it's called lzma2 that it must be better than lzma since it has a higher version number. I've never seen any performance or compression advantage of any significance with lzma2 versus lzma. Personally I'm relatively unimpressed with xz so I don't really understand it. Even less understandable is why the kernel has both lzma AND xz support in it now.
I wish I could get people more excited about lrzip :)
bzip2 size: 1831647009 time: 12m50s (command "tar cjf") 7z size: 945054166 time: 36m23s (command "7za a") lrzip size: 586087630 time: 17m09s (command "lrztar")
These were done on a quad core 3GHz with 8GB ram using nothing but default options. Note that lrzip was done using the lrztar wrapper to make the comparison fair since it's just a single command without temporary files, so it took 3 passes to compress. If I compressed it with temporary files it would be smaller again (i.e. tar cf directory && lrzip)
Of course there is parallel bzip2 which speeds up bzip2's compression but has no effect on compression. And then there is xz which slows down 7z's compression and has no effect on compression. So why is xz becoming the new defacto standard? Probably because it's not aimed squarely at the windows market the way 7z is and feels more politically correct to the linux crowd. Maybe it's because it's called lzma2 that it must be better than lzma since it has a higher version number. I've never seen any performance or compression advantage of any significance with lzma2 versus lzma. Personally I'm relatively unimpressed with xz so I don't really understand it. Even less understandable is why the kernel has both lzma AND xz support in it now.
I wish I could get people more excited about lrzip :)
Wednesday, 23 March 2011
lrzip-0.600, STDIN/STDOUT and Encryption
TL;DR I haven't got a BFS for 2.6.38 yet.
The short short changelog: Lrzip now supports stdin/stdout fully on both compression and decompression without using temporary files. It also now supports very high grade password protected encryption, faster decompression and testing, big endian machines, and has had numerous minor fixes. There is still no progress on BFS for 2.6.38, but the fact that I've released this version of lrzip means I will likely be looking at it soon.
Lrzip was originally based on rzip code by Andrew Tridgell and for a very very long time full stdin and stdout support was considered so orthogonal to the design that it would be virtually impossible to implement. The reason for this is it compressed data in 2 separate passes, and worked over chunks of data that would totally fill ram and then store it, move on to the next chunk and so on, then it would go back and fill in all the header details in the various compression chunks and finally go back to the start of the file and write down the size of the file. This meant that unless the whole file and two separate compression phases all requiring ram, could all fit into ram, it was impossible to write that chunk to disk. If that part wasn't bad enough, on decompression, it would decompress each chunk, then write down parts to disk, decompress the next chunk, then refer -back- to parts already written to copy them to the new parts, then move onto the next chunk. It's not hard to see why this would make it impossible to read from stdin and write to stdout with all that fast forwarding and rewinding on the file as it's being read from or written to. Basically unless you had about 6 times as much RAM as the size of the file you're compressing it is impossible to fit all that into memory and work on it at once.
One thing I've always kept in mind when developing lrzip is to design a compression format that scales. And by scales I mean as CPUs get faster, computers get more ram, hard drives get faster, and files get bigger, the compression should actually get better. One of the most limiting things about other compression formats is they pretty much generate a static size regardless of how good the machine is they're working on. They also have not scaled very well with increasing CPU cores. For some time it has occurred to me that one day PCs would have enough RAM that the limiting aspect of fitting all the data of a file into RAM to work on it as I described above would no longer be an issue. Of course files will always get larger as well and there will always be files larger than the amount of RAM you have. To make lrzip work on large files, it was going to be necessary to rewrite the file format such that it could do "chunks" of independent data, write them to stdout, and then move on. So basically the major rework that went into this version of lrzip does precisely that. It carefully tries to figure out how much ram it can allocate to a temporary input buffer, temporary output buffer, rzip compression first stage and multiple compression back end stages, and then starts constructing the archive to/from the temporary buffers before dumping a chunk to disk.
Implementing the temporary buffers was no minor task either. Each phase had to take into account different aspects of whether there would be multiple reads/writes, modify in ram or on disk and safely keep track of it. Basically each different combination of compress/decompress stdin/stdout needed a different implementation to work so I incrementally implemented one after the other. The one caveat remains, though, that you might end up decompressing a file to STDOUT on a machine with MUCH less RAM than the machine it was compressed on. So I needed to also implement an opt-out where if it all failed to fit in RAM, it was still able to safely dump it out to temporary files on disk. This is required on decompression from STDIN to create a temporary file there, and decompression to STDOUT (especially if the file wasn't generated via a STDOUT archive in the first place). As you can well imagine all this took a lot of time and testing but all appears to be working nicely now. Furthermore, if you are unlucky enough to need temporary files for the above said reason, lrzip can now "chunk" them as well so it will use less temporary space than before. It's important to note when using STDIN+STDOUT that the compression on files greater than about 25% RAM size will not be as good as when you compress the file directly on disk. The reason has to do with just how much you can fit in ram at any one time and what size compression window can be allocated. Nonetheless 25% of ram on a modern machine is a fair chunk (and gets better every year :)
One of the bonuses of implementing decompression to a temporary buffer was that it was possible to modify even the normal decompression to use this as much as possible. As decompression writes to disk and then reads back from the file it writes to, this can be VERY slow if it involves lots of small reads and writes (and this sometimes happens with large files with lots of small similar portions). Doing this in RAM is much faster, and if you're just testing the file, there is no need to even write it to disk in the end. Thus decompression and especially testing received a healthy speed-up as a result.
After finally getting all that out of the way and having to create a new file format to accommodate it, it seemed a shame to have to change the file format again in the near future to support the 2nd most requested feature of lrzip - encryption. So instead of releasing it as it was, I started getting advice on how I might implement encryption and was lucky enough to recruit the help of Serge Belyshev who was actually reading all about encryption techniques. Unfortunately it also meant that instead of just slapping together some basic password protection and simple encryption mechanism, we ended up trying to generate some absolutely ludicrously secure encryption for lrzip.
You're probably aware that there are 2 basic ways to circumvent encryption: brute force and backdoors. If we stick to encryption techniques for which there are no known current weaknesses, nor backdoors, then the only way to break them is brute force. This means attempting as many password combinations as you can in the shortest possible time to see if they're correct. The average modern CPU can test about 1 million hashed passwords per second using sha512. However if you multiply hash the password (i.e. do it over and over with slight changes), then you have to multiply hash it on testing the password. So if you multiply hash it 1 million times, it takes one second just to try each password. Of course this is only for today's PCs so I thought up that we could increase the amount of hashing by date. So we went back to Moore's law which predicts how much CPUs increase in power each year, and look at the system date when hashing the password, and then decide how many times to hash the password. That means that 10 years from now (if Moore's law still holds, which it has for 50 years now), it will still take 1 second for each password on a newly encrypted file.
After hashing the password with sha512 millions of times, we added random salt to each compression block after the 2nd compression stage and used AES128 block encryption on the data. Then, and this was the hardest part of all, once each rzip chunk is complete, we go back and overwrite each block header encrypted with yet more random salt, and finally encrypt even the md5 sum at the end. This means that you end up with a file that is recognisable by its header as being an lrzip file, but has basically random indecipherable data after that so there is no way of knowing how many chunks the data is, how compressed it is, nor how big the final file is. It also means that each section that is encrypted is encrypted with a different cipher and if you compress exactly the same file with exactly the same password, it will (virtually) never produce the same file twice. To be honest I think we overdid it by anyone's standards. However since I was implementing it, it seemed illogical to just do a half-arsed effort. By far the weakest link in this equation is the password you choose when encrypting, but it does accept passwords up to 500 bytes in length. If you lose your password, consider the data irrevocably lost as there is no known way of retrieving it.
Michael Blumenkrantz was also kind enough to hack on lrzip and implemented a completely new autoconf style configuration mechanism which should hold lrzip in good stead for the future, being more robust and flexible. He then went on to abstract out most functions in a way that will make it possible for him to implement a liblrzip library in the future. That's a rather exciting prospect, but I'm glad it's him and not me that wants to implement that ;)
Finally with more help we tracked down some big endian bugs that had crept in and made lrzip not work on BE machines, I did some testing on my wife's mac to make sure it built and all of the normal compression/decompression and encryption worked, and fixed up and tweaked too many other things to mention in the code.
All in all I'm very pleased with the result, and as much as I'd love for it to be super stable and bug free, no doubt something will show up.
So do your worst, have fun and make sure to tell everyone where you got it:
LRZIP FRESHMEAT
And now a small rest before I look at the kernel again...
EDIT: In the install it seems we forgot the lrzuntar, lrunzip wrappers and the extra manpages. I have rushed out a v0.601 release with these changes.
The short short changelog: Lrzip now supports stdin/stdout fully on both compression and decompression without using temporary files. It also now supports very high grade password protected encryption, faster decompression and testing, big endian machines, and has had numerous minor fixes. There is still no progress on BFS for 2.6.38, but the fact that I've released this version of lrzip means I will likely be looking at it soon.
Lrzip was originally based on rzip code by Andrew Tridgell and for a very very long time full stdin and stdout support was considered so orthogonal to the design that it would be virtually impossible to implement. The reason for this is it compressed data in 2 separate passes, and worked over chunks of data that would totally fill ram and then store it, move on to the next chunk and so on, then it would go back and fill in all the header details in the various compression chunks and finally go back to the start of the file and write down the size of the file. This meant that unless the whole file and two separate compression phases all requiring ram, could all fit into ram, it was impossible to write that chunk to disk. If that part wasn't bad enough, on decompression, it would decompress each chunk, then write down parts to disk, decompress the next chunk, then refer -back- to parts already written to copy them to the new parts, then move onto the next chunk. It's not hard to see why this would make it impossible to read from stdin and write to stdout with all that fast forwarding and rewinding on the file as it's being read from or written to. Basically unless you had about 6 times as much RAM as the size of the file you're compressing it is impossible to fit all that into memory and work on it at once.
One thing I've always kept in mind when developing lrzip is to design a compression format that scales. And by scales I mean as CPUs get faster, computers get more ram, hard drives get faster, and files get bigger, the compression should actually get better. One of the most limiting things about other compression formats is they pretty much generate a static size regardless of how good the machine is they're working on. They also have not scaled very well with increasing CPU cores. For some time it has occurred to me that one day PCs would have enough RAM that the limiting aspect of fitting all the data of a file into RAM to work on it as I described above would no longer be an issue. Of course files will always get larger as well and there will always be files larger than the amount of RAM you have. To make lrzip work on large files, it was going to be necessary to rewrite the file format such that it could do "chunks" of independent data, write them to stdout, and then move on. So basically the major rework that went into this version of lrzip does precisely that. It carefully tries to figure out how much ram it can allocate to a temporary input buffer, temporary output buffer, rzip compression first stage and multiple compression back end stages, and then starts constructing the archive to/from the temporary buffers before dumping a chunk to disk.
Implementing the temporary buffers was no minor task either. Each phase had to take into account different aspects of whether there would be multiple reads/writes, modify in ram or on disk and safely keep track of it. Basically each different combination of compress/decompress stdin/stdout needed a different implementation to work so I incrementally implemented one after the other. The one caveat remains, though, that you might end up decompressing a file to STDOUT on a machine with MUCH less RAM than the machine it was compressed on. So I needed to also implement an opt-out where if it all failed to fit in RAM, it was still able to safely dump it out to temporary files on disk. This is required on decompression from STDIN to create a temporary file there, and decompression to STDOUT (especially if the file wasn't generated via a STDOUT archive in the first place). As you can well imagine all this took a lot of time and testing but all appears to be working nicely now. Furthermore, if you are unlucky enough to need temporary files for the above said reason, lrzip can now "chunk" them as well so it will use less temporary space than before. It's important to note when using STDIN+STDOUT that the compression on files greater than about 25% RAM size will not be as good as when you compress the file directly on disk. The reason has to do with just how much you can fit in ram at any one time and what size compression window can be allocated. Nonetheless 25% of ram on a modern machine is a fair chunk (and gets better every year :)
One of the bonuses of implementing decompression to a temporary buffer was that it was possible to modify even the normal decompression to use this as much as possible. As decompression writes to disk and then reads back from the file it writes to, this can be VERY slow if it involves lots of small reads and writes (and this sometimes happens with large files with lots of small similar portions). Doing this in RAM is much faster, and if you're just testing the file, there is no need to even write it to disk in the end. Thus decompression and especially testing received a healthy speed-up as a result.
After finally getting all that out of the way and having to create a new file format to accommodate it, it seemed a shame to have to change the file format again in the near future to support the 2nd most requested feature of lrzip - encryption. So instead of releasing it as it was, I started getting advice on how I might implement encryption and was lucky enough to recruit the help of Serge Belyshev who was actually reading all about encryption techniques. Unfortunately it also meant that instead of just slapping together some basic password protection and simple encryption mechanism, we ended up trying to generate some absolutely ludicrously secure encryption for lrzip.
You're probably aware that there are 2 basic ways to circumvent encryption: brute force and backdoors. If we stick to encryption techniques for which there are no known current weaknesses, nor backdoors, then the only way to break them is brute force. This means attempting as many password combinations as you can in the shortest possible time to see if they're correct. The average modern CPU can test about 1 million hashed passwords per second using sha512. However if you multiply hash the password (i.e. do it over and over with slight changes), then you have to multiply hash it on testing the password. So if you multiply hash it 1 million times, it takes one second just to try each password. Of course this is only for today's PCs so I thought up that we could increase the amount of hashing by date. So we went back to Moore's law which predicts how much CPUs increase in power each year, and look at the system date when hashing the password, and then decide how many times to hash the password. That means that 10 years from now (if Moore's law still holds, which it has for 50 years now), it will still take 1 second for each password on a newly encrypted file.
After hashing the password with sha512 millions of times, we added random salt to each compression block after the 2nd compression stage and used AES128 block encryption on the data. Then, and this was the hardest part of all, once each rzip chunk is complete, we go back and overwrite each block header encrypted with yet more random salt, and finally encrypt even the md5 sum at the end. This means that you end up with a file that is recognisable by its header as being an lrzip file, but has basically random indecipherable data after that so there is no way of knowing how many chunks the data is, how compressed it is, nor how big the final file is. It also means that each section that is encrypted is encrypted with a different cipher and if you compress exactly the same file with exactly the same password, it will (virtually) never produce the same file twice. To be honest I think we overdid it by anyone's standards. However since I was implementing it, it seemed illogical to just do a half-arsed effort. By far the weakest link in this equation is the password you choose when encrypting, but it does accept passwords up to 500 bytes in length. If you lose your password, consider the data irrevocably lost as there is no known way of retrieving it.
Michael Blumenkrantz was also kind enough to hack on lrzip and implemented a completely new autoconf style configuration mechanism which should hold lrzip in good stead for the future, being more robust and flexible. He then went on to abstract out most functions in a way that will make it possible for him to implement a liblrzip library in the future. That's a rather exciting prospect, but I'm glad it's him and not me that wants to implement that ;)
Finally with more help we tracked down some big endian bugs that had crept in and made lrzip not work on BE machines, I did some testing on my wife's mac to make sure it built and all of the normal compression/decompression and encryption worked, and fixed up and tweaked too many other things to mention in the code.
All in all I'm very pleased with the result, and as much as I'd love for it to be super stable and bug free, no doubt something will show up.
So do your worst, have fun and make sure to tell everyone where you got it:
LRZIP FRESHMEAT
And now a small rest before I look at the kernel again...
EDIT: In the install it seems we forgot the lrzuntar, lrunzip wrappers and the extra manpages. I have rushed out a v0.601 release with these changes.
Tuesday, 8 March 2011
lrzip 0.571
I had set out to do some major surgery to lrzip to make it work better with STDIN/STDOUT and not use physical temporary files at all. However, as I mentioned in my last blog post, my approach of using shared memory as files was a failure as it only really worked on linux. So after much angst, I decided to simply put together all the minor bugfixes that I had accumulated and bring out a new version. So here is a mainly small bugfix release lrzip version 0.571:
lrzip on freshmeat
On a related note I chanced upon this article recently which compared compression tools:
best-linux-compression-tool-8-utilities-tested
tl:dr lrzip wins
Suffice to say that lrzip did pretty well, despite them using a much older version that predates any of the multi-threading enhancements!
lrzip on freshmeat
On a related note I chanced upon this article recently which compared compression tools:
best-linux-compression-tool-8-utilities-tested
tl:dr lrzip wins
Suffice to say that lrzip did pretty well, despite them using a much older version that predates any of the multi-threading enhancements!
Monday, 7 March 2011
Shared memory on Darwin and lrzip
So I've spent the last month working on lrzip trying to make it work without temporary files when working from STDIN +/- to STDOUT. Virtually all of that work has gone into creating pseudo files in memory using posix shared memory in the form of shm_open() and friends. To complete the work,I even needed to upgrade the file format for lrzip when files are dumped to stdout and beyond a certain size. As the code has taken shape and is almost finalised, I have tried to get as much testing as possible before making a release. Lo and behold, OS-X screws me over yet again. I discover that the so-called portable posix shared memory that Darwin is supposed to support, is not fully featured. By that I mean I can open an shm file with shm_open and then I can mmap it. But I can't actually access the file as a file... which was entirely the point. Instead I get the meaningless "operation not supported" when trying to write to the file, or a failure to seek on it entirely. What's more is the open_shm doesn't really work with create and truncate at the same time too... So for now, Darwin will be stuck on physical temporary files while the rest of us move on. Sigh... Someone pointed out that perhaps the whole treating of the shared memory object as a file isn't actually part of the posix guarantee at all, so once again I've gotten carried away with something that isn't portable.
EDIT: After a bit more thought I'm going to abandon all this use of shared memory and reconsider how I might do stdin/out without temporary files yet again.
EDIT: After a bit more thought I'm going to abandon all this use of shared memory and reconsider how I might do stdin/out without temporary files yet again.
Friday, 25 February 2011
lrzip-0.570 for the uncorrupt.
When I last blogged about lrzip, I mentioned the corruption on decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly so I set out on a major hunt to eliminate it, and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short, the corruption was NEVER THERE.
The problem he was encountering was on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command. On decompressing the file and comparing it, there would be differences in the file at random places. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).
Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently internally used crc32 to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect what is the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum which is standard on all modern linux installations to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further, and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.
One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.
Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.
Here's the latest results from my standard 10GB virtual image compression test:
Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:
I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.
That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip
The problem he was encountering was on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command. On decompressing the file and comparing it, there would be differences in the file at random places. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).
Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently internally used crc32 to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect what is the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum which is standard on all modern linux installations to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further, and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.
One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.
Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.
Here's the latest results from my standard 10GB virtual image compression test:
Compression Size Percentage Compress Time Decompress Time None 10737418240 100.0 gzip 2772899756 25.8 5m47s 2m46s bzip2 2704781700 25.2 16m15s 6m19s xz 2272322208 21.2 50m58s 3m52s 7z 2242897134 20.9 26m36s 5m41s lrzip 1372218189 12.8 10m23s 2m53s lrzip -U 1095735108 10.2 8m44s 2m45s lrzip -l 1831894161 17.1 4m53s 2m37s lrzip -lU 1414959433 13.2 4m48s 2m38s lrzip -zU 1067075961 9.9 69m36s 69m35s
Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:
Detected lrzip version 0.5 file. ../temp/enwik8.lrz: lrzip version: 0.5 file Compression: rzip + lzma Decompressed file size: 100000000 Compressed file size: 26642293 Compression ratio: 3.753 MD5 used for integrity testing MD5: a1fa5ffddb56f4953e226637dabbb36a Rzip chunk 1: Stream: 0 Offset: 25 Block Comp Percent Size 1 lzma 58.1% 867413 / 1493985 Offset: 22687516 Head: 0 Stream: 1 Offset: 25 Block Comp Percent Size 1 lzma 28.8% 5756191 / 20000000 Offset: 75 Head: 5756266 2 lzma 28.4% 5681891 / 20000000 Offset: 5756291 Head: 11438182 3 lzma 28.2% 5630256 / 20000000 Offset: 11438207 Head: 17068463 4 lzma 28.1% 5619003 / 20000000 Offset: 17068488 Head: 23554929 5 lzma 28.5% 3087298 / 10841364 Offset: 23554954 Head: 0 Rzip compression: 92.3% 92335349 / 100000000 Back end compression: 28.9% 26642052 / 92335349 Overall compression: 26.6% 26642052 / 100000000
I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.
That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip
Friday, 11 February 2011
lrzip-0.552 Random fixes.
It had been a while since lrzip had received attention, and a number of things came up that made me give it some attention. First, it wouldn't compile on FreeBSD because it doesn't implement mremap. So I wrapped that in a fake mremap the way I managed the OSX one. Then OSX would compile the existing code fine but would fail to actually work because it would say "unimplemented" when you tried to run it due to OSX not having unnamed semaphores implemented. I've mentioned this before, but named semaphores are a real pain to work with and unnamed ones are very convenient. So I wrapped all the threading calls for OSX to not actually thread and ignore all the calls to semaphore functions. It means OSX doesn't benefit from the extra threading added after version 0.530 but at least it actually works now.
Finally, there was an alarming report I received about file corruption on decompression from a user who had a 20GB log that he was compressing. The generated archive would decompress fine with version 0.530 but it randomly would silently corrupt somewhere in the decompression with any version after that and it was only noticeable after comparing to the original file. Now, try as hard as I could at home, I couldn't reproduce this bug. So with the help of this user doing possibly hundreds of tests for me and with debugging code I found a trail of possible causes. Disabling the decompression threading code surprisingly didn't fix it, confirming the bug was elsewhere. After an extensive search I found there were some mmap calls in lrzip that weren't being careful about being aligned to page size, hence writing to the buffer generated would have random results. It's surprising that no more corruption was seen but presumably the particular buffer in question never had that much data so it would fit in whatever amount happened to be allocated. Presumably the larger file is what made it easier to trigger. That would certainly have explained random position failures, but it didn't really explain why it only started happening after 0.530.
Anyway after converting the mmap calls to ordinary malloc calls, decompression would now actually fail with a crc error. Lrzip does a crc check on the data generated, and compares it to the crc stored in the archive. However if a string of zeroes is generated, and then the crc is also read as zero, it can look ok. I'm assuming this is why it was silently corrupting it. The crc errors at least gave me a trail of possible sources for the error. After much scratching of heads, heavy drinking and sleepless nights, I found that in the original rzip code that lrzip came from was a function that was designed to read a certain amount of data, then return how much it had returned. The thing is, the function that called it (unzip_literal) didn't check the return value and just assumed it always read all the data asked of it. Hitting this codepath is obviously extremely rare but there it was. Fixing that resolved the instant crc error and made it mostly reliable.
Now I'd love to say it's 100% reliable, but after running the decompression hundreds of times just to be certain, it failed on one occasion. So there's still potential for failure, possibly somewhere else, in the code on decompressing extremely large archives of 20GB or more. Given the number of fixes I put into lrzip, and that there were obvious bugs in the previous version, I've released the cumulative fixes as version 0.552. I've also put a warning in there that all very large decompressed files should be checked with md5sum or equivalent to ensure they are not corrupted.
A handful of other minor fixes went in too, so this is basically a bugfix release only. So if you're using lrzip, I highly recommend upgrading to version 0.552.
Freshmeat link (sometimes takes a while for new release to appear):
lrzip
Now for more debugging.
Finally, there was an alarming report I received about file corruption on decompression from a user who had a 20GB log that he was compressing. The generated archive would decompress fine with version 0.530 but it randomly would silently corrupt somewhere in the decompression with any version after that and it was only noticeable after comparing to the original file. Now, try as hard as I could at home, I couldn't reproduce this bug. So with the help of this user doing possibly hundreds of tests for me and with debugging code I found a trail of possible causes. Disabling the decompression threading code surprisingly didn't fix it, confirming the bug was elsewhere. After an extensive search I found there were some mmap calls in lrzip that weren't being careful about being aligned to page size, hence writing to the buffer generated would have random results. It's surprising that no more corruption was seen but presumably the particular buffer in question never had that much data so it would fit in whatever amount happened to be allocated. Presumably the larger file is what made it easier to trigger. That would certainly have explained random position failures, but it didn't really explain why it only started happening after 0.530.
Anyway after converting the mmap calls to ordinary malloc calls, decompression would now actually fail with a crc error. Lrzip does a crc check on the data generated, and compares it to the crc stored in the archive. However if a string of zeroes is generated, and then the crc is also read as zero, it can look ok. I'm assuming this is why it was silently corrupting it. The crc errors at least gave me a trail of possible sources for the error. After much scratching of heads, heavy drinking and sleepless nights, I found that in the original rzip code that lrzip came from was a function that was designed to read a certain amount of data, then return how much it had returned. The thing is, the function that called it (unzip_literal) didn't check the return value and just assumed it always read all the data asked of it. Hitting this codepath is obviously extremely rare but there it was. Fixing that resolved the instant crc error and made it mostly reliable.
Now I'd love to say it's 100% reliable, but after running the decompression hundreds of times just to be certain, it failed on one occasion. So there's still potential for failure, possibly somewhere else, in the code on decompressing extremely large archives of 20GB or more. Given the number of fixes I put into lrzip, and that there were obvious bugs in the previous version, I've released the cumulative fixes as version 0.552. I've also put a warning in there that all very large decompressed files should be checked with md5sum or equivalent to ensure they are not corrupted.
A handful of other minor fixes went in too, so this is basically a bugfix release only. So if you're using lrzip, I highly recommend upgrading to version 0.552.
Freshmeat link (sometimes takes a while for new release to appear):
lrzip
Now for more debugging.
Sunday, 12 December 2010
lrzip-0.551, overlapping streams for faster compression
A few bug reports have been trickling in since I last released a version of lrzip, which as I said last time is good and bad. It's helped direct my workflow on this, even though I keep trying to let it settle down. I'd love to leave the code since it mostly works quite well, and there is a common mantra that "release early, release often" is considered good, but in practice that doesn't seem to work. If you release stuff while it's still broken, it will get a bad reputation and people will dismiss it that time around and often not give it a chance again. So if there are show stopper bugs, they need attending to ASAP.
So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.
While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to 9 compression did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all round compromise level for lzma. This also was affecting the window sizes internally used by the lzma library. At level 5, the windows are 16MB in size, and go up to 64MB for level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread would have been needing up to 800MB each. This explained why people were getting lzma failing on window sizes and complaining despite apparently enough ram. While the new level does lose slightly on compression, it does also improve compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".
With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s
And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.
Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks
On version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1 to make decompression more robust for out-of-order storage in the archive. Well unfortunately this doubled the possible number of threads, and this brought a now all-too-familiar problem: running out of ram on big files. In .550 I changed it to only use one thread for stream 0 rather than multiple ones since it's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.
I put some effort into tackling the dreaded running out of ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and instead of failing completely, it will then wait for the previous thread to complete its compression before trying again. This serialises work that would have otherwise been in parallel. Thus less ram is required since only one thread is likely to be using ram. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resorting to bzip if it never succeeds. This should make running out of memory a non-fatal problem since it will just try again using less and being slower.
On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.
My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.
I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P
So I was very proud of this release and made it go gold as version 0.550...
About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.
So get it here:
http://freshmeat.net/projects/lrzip
No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.
Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.
EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.
---
Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.
Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A2 the "matches"
Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.
All this used to be done in series.
Version 0.530 made the last chopping up part be done in parallel since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However when once streams A0 and A1 would have finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before lrzip would move onto chunk B.
Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.
---
This means that on faster compression (lzo, gzip and bzip2), the rate limiting step is the stream production of A0 and A1. The pipe dream is to parallelise those, but they rely on using as much ram as possible to make the rzip stage effective so it would compromise compression greatly (and be messy as hell to implement).
So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.
While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to 9 compression did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all round compromise level for lzma. This also was affecting the window sizes internally used by the lzma library. At level 5, the windows are 16MB in size, and go up to 64MB for level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread would have been needing up to 800MB each. This explained why people were getting lzma failing on window sizes and complaining despite apparently enough ram. While the new level does lose slightly on compression, it does also improve compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".
With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s
And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.
Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks
On version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1 to make decompression more robust for out-of-order storage in the archive. Well unfortunately this doubled the possible number of threads, and this brought a now all-too-familiar problem: running out of ram on big files. In .550 I changed it to only use one thread for stream 0 rather than multiple ones since it's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.
I put some effort into tackling the dreaded running out of ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and instead of failing completely, it will then wait for the previous thread to complete its compression before trying again. This serialises work that would have otherwise been in parallel. Thus less ram is required since only one thread is likely to be using ram. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resorting to bzip if it never succeeds. This should make running out of memory a non-fatal problem since it will just try again using less and being slower.
On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.
My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.
I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P
So I was very proud of this release and made it go gold as version 0.550...
About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.
So get it here:
http://freshmeat.net/projects/lrzip
No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.
Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.
EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.
---
Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.
Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A2 the "matches"
Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.
All this used to be done in series.
Version 0.530 made the last chopping up part be done in parallel since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However when once streams A0 and A1 would have finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before lrzip would move onto chunk B.
Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.
---
This means that on faster compression (lzo, gzip and bzip2), the rate limiting step is the stream production of A0 and A1. The pipe dream is to parallelise those, but they rely on using as much ram as possible to make the rzip stage effective so it would compromise compression greatly (and be messy as hell to implement).
Tuesday, 7 December 2010
lrzip-0.544 and fudging with memory allocation on 32 bits
It never ceases to amaze me that no matter how much I try and keep the memory allocation paths in check on lrzip, userspace will continue to fail me somehow. I had reports of memory allocation failures on 32 bit machines running lrzip with big files and started investigating again. Even if you have 64GB of ram running on a 32 bit kernel with PAE enabled, you can only allocate up to 2GB maximum with malloc and friends, due to ssize_t being a 32 bit signed long, giving you only 31 bits of positive to work with (which works out to 2GB). I had taken that into account, but it seems that 32 bit userspace really doesn't even like allocating more than 2GB in total for the one application.
Since lrzip can use much smaller allocation now with the use of sliding mmap, I set out to just use even smaller base mmap on 32 bits rather than allocating up to 2GB, and relying on sliding mmap for the rest. This fixed the memory allocation errors on 32 bits, but left me with the lzma back end failing for some users out there, reporting "try decreasing window size". Even though there is plenty of ram available now, it occasionally returns with a memory error. Since I haven't actually written any the lzma library myself, I really didn't know where to begin looking. I tried decreasing the window size passed to lzma but it was failing sometimes even on windows of only 60MB despite gigabytes of ram spare. So after much battling with it I decided to just let it fail, and then fall back to bzip2 compression if lzma failed to compress that block. This means that lrzip will generate a mixed compression file which is perfectly fine, but at least it will not leave blocks uncompressed. None of this is a problem on 64 bits mind you, but there are plenty of people running 32 bits still out there and will be for some time to come.
The other issue that arose was that uclibc doesn't appear to support the sysconf() function fully, not returning a valid value for ram (it returns a negative value) so I had to add a fallback to reading /proc for that.
The multithreaded compression (till now) worked off dividing up the size of the file by the number of threads, and then when a chunk of that size was reached, it would pass it onto a thread and then continue. The problem with this approach is the file is rzip preprocessed before passing it on to a back end compression thread, so it's usually smaller than this. I modified it to put data to threads at regular intervals during the rzip hash search instead, thus spawning threads at even timeframes to hopefully use more CPU time on SMP machines. What's interesting about this, though, is that the chunks compress markedly differently because they're different parts of the rzip processed data and can be tiny or massive after that processing. It made for only very modest improvements in compression times. It has an advantage on decompression, though, because each chunk after decompression is the same size, and it can take quite a while to write say a 2.5GB chunk at a time, thus allowing the next thread to keep working on the chunk after that while it's busy writing the first chunk.
When I last posted, I was unhappy about the multithreaded decompression component so I rewote part of that as well. Now, instead of guessing what stream is likely to be read next, it simply spawns two groups of threads, one group for each stream. The vast majority of the time, archives end up with just one block of stream 0 and lots of blocks of stream 1, so it won't make a big difference, but hopefully it's more robust than it was before. It does mean, however, that you can potentially spawn twice as many threads as you have CPUs which could be a problem with memory usage.
So what's next for lrzip? Well I keep trying to let it stabilise instead of doing more work on it, but there are more people using it now which means I get more bug reports. That's both good and bad ;) The things I see that still need addressing are more ability to cope with failures to do with memory allocation. Unlike other compression applications which use fixed upper sized compression windows, the main challenge with lrzip is to try and use as much memory as possible without running out of memory. This means that the main challenges aren't to do with the algorithms themselves, but memory usage. Multithreading has just made that even more complex since each thread needs its own memory to work with. I need to put more work into coping with trying to allocate too much ram and then cut back on thread usage, but backtracking like that is a little tricky with the code the way it is at the moment. If you do get memory problems, try setting -p values to less than your CPUs manually.
Also, it still will not work on macOSX, with the last good version for that being 0.520. I need a little rest before tackling that :s
Anyway, the new version is out there, and to indicate that it's pretty much just a bug fix and that everyone should upgrade, I've only pushed the minor subversion up to 0.544.
http://freshmeat.net/projects/lrzip
Since lrzip can use much smaller allocation now with the use of sliding mmap, I set out to just use even smaller base mmap on 32 bits rather than allocating up to 2GB, and relying on sliding mmap for the rest. This fixed the memory allocation errors on 32 bits, but left me with the lzma back end failing for some users out there, reporting "try decreasing window size". Even though there is plenty of ram available now, it occasionally returns with a memory error. Since I haven't actually written any the lzma library myself, I really didn't know where to begin looking. I tried decreasing the window size passed to lzma but it was failing sometimes even on windows of only 60MB despite gigabytes of ram spare. So after much battling with it I decided to just let it fail, and then fall back to bzip2 compression if lzma failed to compress that block. This means that lrzip will generate a mixed compression file which is perfectly fine, but at least it will not leave blocks uncompressed. None of this is a problem on 64 bits mind you, but there are plenty of people running 32 bits still out there and will be for some time to come.
The other issue that arose was that uclibc doesn't appear to support the sysconf() function fully, not returning a valid value for ram (it returns a negative value) so I had to add a fallback to reading /proc for that.
The multithreaded compression (till now) worked off dividing up the size of the file by the number of threads, and then when a chunk of that size was reached, it would pass it onto a thread and then continue. The problem with this approach is the file is rzip preprocessed before passing it on to a back end compression thread, so it's usually smaller than this. I modified it to put data to threads at regular intervals during the rzip hash search instead, thus spawning threads at even timeframes to hopefully use more CPU time on SMP machines. What's interesting about this, though, is that the chunks compress markedly differently because they're different parts of the rzip processed data and can be tiny or massive after that processing. It made for only very modest improvements in compression times. It has an advantage on decompression, though, because each chunk after decompression is the same size, and it can take quite a while to write say a 2.5GB chunk at a time, thus allowing the next thread to keep working on the chunk after that while it's busy writing the first chunk.
When I last posted, I was unhappy about the multithreaded decompression component so I rewote part of that as well. Now, instead of guessing what stream is likely to be read next, it simply spawns two groups of threads, one group for each stream. The vast majority of the time, archives end up with just one block of stream 0 and lots of blocks of stream 1, so it won't make a big difference, but hopefully it's more robust than it was before. It does mean, however, that you can potentially spawn twice as many threads as you have CPUs which could be a problem with memory usage.
So what's next for lrzip? Well I keep trying to let it stabilise instead of doing more work on it, but there are more people using it now which means I get more bug reports. That's both good and bad ;) The things I see that still need addressing are more ability to cope with failures to do with memory allocation. Unlike other compression applications which use fixed upper sized compression windows, the main challenge with lrzip is to try and use as much memory as possible without running out of memory. This means that the main challenges aren't to do with the algorithms themselves, but memory usage. Multithreading has just made that even more complex since each thread needs its own memory to work with. I need to put more work into coping with trying to allocate too much ram and then cut back on thread usage, but backtracking like that is a little tricky with the code the way it is at the moment. If you do get memory problems, try setting -p values to less than your CPUs manually.
Also, it still will not work on macOSX, with the last good version for that being 0.520. I need a little rest before tackling that :s
Anyway, the new version is out there, and to indicate that it's pretty much just a bug fix and that everyone should upgrade, I've only pushed the minor subversion up to 0.544.
http://freshmeat.net/projects/lrzip
Monday, 29 November 2010
lrzip-0.543, random fixes
Yep, there are random fixes in this one.
Oh wait, you probably want more detail than that.
Lrzip splits the data into two streams during its rzip stage, one with the "dictionary" parts it ends up finding multiple copies of, and the rest. It turns out that during the multithreaded decompression modification I made the naive assumption that there would be only one chunk of stream 0 data per compressed window. Unfortunately I can't tell what the data is until I've already read it, and throwing lots of chunks into threads to decompress means I have a whole lot of data potentially in a different order to how the rzip re-expansion stage will expect it. I was loathe to modifying the lrzip archive format yet again so soon after changing it, so I used the help of the only person I could find reproducing a decompression problem to try and find how to predict what stream the data came from and then modified the decompression stage accordingly. I'm still not happy with it because it still feels fragile, but I need more time to investigate to ensure I'm doing the right thing. This is unlikely to hit anyone trying to decompress anything less than many many gigabytes in size (the archive is still fine, and the previous non-multithreaded one should decompress it ok).
Much more importantly for the common case usage, I limited the lrzip compression window on 0.542 to only 300MB because it seemed to break on larger windows on 32 bit. Unfortunately, I did it -before- chopping it up into smaller chunks for multithreading, and for all architectures, not just 32 bit. So I fixed that which means the default compression window should be significantly larger on 0.543.
Finally it was clear that the rate limiting step on multithreaded workloads was the rzip stage on extra large files, and since the compression/decompression can now begin as a separate thread to the rzip component, I made the rzip process run at a lower 'nice' level than the back end threads, which afforded a small speed up too.
I still have more planned for lrzip but was hoping to take a small break for a while to do other things now that it's more or less stable. In the meantime, apparently when it's run on a system with uclibc instead of glibc, sysconf() fails to report ram properly which needs fixing. That should be easy enough to workaround, but is a little disappointing from uclibc. The other planned changes involve committing a more robust fix for the multiple stream 0 archives, and more multithreading improvements. Currently one chunk at a time is either compressed or decompressed at the same time as the rzip preprocessing/expansion stage and then all the threads need to complete before moving onto the next chunk. The next logical change is to move the threading even higher up the chain to be able to process multiple chunks concurrently, keeping all CPUs busy at all times. This last change is unlikely to make a huge difference on default settings, but should speed up zpaq based de/compression even more when a file is large enough (bigger than available ram).
Oh and working on OSX still needs fixing (see previous post about named semaphores).
Oh wait, you probably want more detail than that.
Lrzip splits the data into two streams during its rzip stage, one with the "dictionary" parts it ends up finding multiple copies of, and the rest. It turns out that during the multithreaded decompression modification I made the naive assumption that there would be only one chunk of stream 0 data per compressed window. Unfortunately I can't tell what the data is until I've already read it, and throwing lots of chunks into threads to decompress means I have a whole lot of data potentially in a different order to how the rzip re-expansion stage will expect it. I was loathe to modifying the lrzip archive format yet again so soon after changing it, so I used the help of the only person I could find reproducing a decompression problem to try and find how to predict what stream the data came from and then modified the decompression stage accordingly. I'm still not happy with it because it still feels fragile, but I need more time to investigate to ensure I'm doing the right thing. This is unlikely to hit anyone trying to decompress anything less than many many gigabytes in size (the archive is still fine, and the previous non-multithreaded one should decompress it ok).
Much more importantly for the common case usage, I limited the lrzip compression window on 0.542 to only 300MB because it seemed to break on larger windows on 32 bit. Unfortunately, I did it -before- chopping it up into smaller chunks for multithreading, and for all architectures, not just 32 bit. So I fixed that which means the default compression window should be significantly larger on 0.543.
Finally it was clear that the rate limiting step on multithreaded workloads was the rzip stage on extra large files, and since the compression/decompression can now begin as a separate thread to the rzip component, I made the rzip process run at a lower 'nice' level than the back end threads, which afforded a small speed up too.
I still have more planned for lrzip but was hoping to take a small break for a while to do other things now that it's more or less stable. In the meantime, apparently when it's run on a system with uclibc instead of glibc, sysconf() fails to report ram properly which needs fixing. That should be easy enough to workaround, but is a little disappointing from uclibc. The other planned changes involve committing a more robust fix for the multiple stream 0 archives, and more multithreading improvements. Currently one chunk at a time is either compressed or decompressed at the same time as the rzip preprocessing/expansion stage and then all the threads need to complete before moving onto the next chunk. The next logical change is to move the threading even higher up the chain to be able to process multiple chunks concurrently, keeping all CPUs busy at all times. This last change is unlikely to make a huge difference on default settings, but should speed up zpaq based de/compression even more when a file is large enough (bigger than available ram).
Oh and working on OSX still needs fixing (see previous post about named semaphores).
Monday, 22 November 2010
lrzip-0.542, mmap windows and overcommit
I started out this blog entry with a lot more about the tty patch thingy and then decided against it. Suffice to say I don't like heuristics being special coded into the scheduler.
So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because after fixing it, a very large file took 1/4 the time it was taking to compress down. Overall sliding mmap has gotten a lot faster and useful.
I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop and the default windows take 20 minutes to compress, while unlimited sized windows with sliding mmap took 80 minutes. Not bad given how much more compression it ends up giving.
Now because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo) I wanted to check what happened with memory allocation on linux and just how much the Virtual Memory subsystem will happily give you since lrzip tests before it allocates a window. With a file backed mmap it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. Now this is a disaster because imagine reading that file linearly into ram as lrzip is working on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram used), it has all this stuff in ram which is now a complete waste and starts faking caching the rest. It never really has any useful amount in ram at any one time since it ends up dropping everything behind. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip just worked that way by using whatever mmap would give you.
So I spent days trying various combinations of sizes of window that would work out optimal (with and without swap) and kept coming up with the same answer: that having a main buffer of ~2/3 of the ram gives the best speed performance. It left me without an answer for what to do with -M mode. After enough playing around I decided to make the sliding mmap mode kick in whenever the compression window is larger than 2/3 ram as it speeds up the performance noticeably compared to just mapping in the whole file and never really caching anything. But I had to make up some upper limit to how big -M would work with and 1.5 times the ram size worked out to be a reasonable compromise in how much it would slow down, and how much it would hit swap.
Lrzip can hit swap real hard was the conclusion by the way... and it's not like it speeds anything up or allows any more ram to be used compared to just disabling it. The only
time swap made a difference to the testing was how much anonymous ram could be mmapped (as in just free ram versus caching a file that's on disk). The linux vm allows you to pretty much allocate everything you have as real ram and swap. You should see what the desktop was like after it finished compressing a huge file - clicking on anything took ages since pretty much every useful application was sitting on swap instead of in ram. I had to test it with a real HD since my main desktop now uses an SSD to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure be vigilant with backups and so on cause people don't trust them blah blah).
Despite choosing 2/3 as the amount of ram to allocate, lrzip still actually tests to ensure it won't fail and then decreases the amount till it succeeds. It's rather tricky making sure there will be enough ram for all the compression components because it needs enough ram to allocate to caching the file on disk, enough for 2 compression streams possibly as large as that first one, and then enough ram to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect that most things in ram will be trashed (all for a good cause of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use the sliding mmap for anything bigger than 700MB there and that's made a big difference to how effectively a 32 bit machine can compress large files too.
Lrzip has been building successfully on and off on mac osx for a while now. Every 2nd or 3rd release I break it since I can't test it myself. It turns out that I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out that mac osx just doesn't support them. It builds fine, but then when you run it, it says function not implemented. That's pretty sad... named semaphores are much clunkier and according to the manpage "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure if that's adhered to, but that's rather awkward if you don't clean up after yourself very well when you abort your program. At some stage if I can find the enthusiasm, I might try and get the multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or use them selectively on osx.
All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.
Lrzip project page on freshmeat
EDIT: I keep getting people ask me what the big deal is with 32 bits and lrzip, since 32 bits with PAE should be able to address up to 64GB ram. That may well be the case, but the gnu libraries on the 32 bit userspace take a size_t on malloc and mmap, and they are the size of a signed long, which is 4 bytes on 32 bits, and 8 bytes on 64 bits. So the most you can address in one malloc with 32 bit userspace is up to 31 bits, or a bit over 2GB.
So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because after fixing it, a very large file took 1/4 the time it was taking to compress down. Overall sliding mmap has gotten a lot faster and useful.
I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop and the default windows take 20 minutes to compress, while unlimited sized windows with sliding mmap took 80 minutes. Not bad given how much more compression it ends up giving.
Now because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo) I wanted to check what happened with memory allocation on linux and just how much the Virtual Memory subsystem will happily give you since lrzip tests before it allocates a window. With a file backed mmap it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. Now this is a disaster because imagine reading that file linearly into ram as lrzip is working on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram used), it has all this stuff in ram which is now a complete waste and starts faking caching the rest. It never really has any useful amount in ram at any one time since it ends up dropping everything behind. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip just worked that way by using whatever mmap would give you.
So I spent days trying various combinations of sizes of window that would work out optimal (with and without swap) and kept coming up with the same answer: that having a main buffer of ~2/3 of the ram gives the best speed performance. It left me without an answer for what to do with -M mode. After enough playing around I decided to make the sliding mmap mode kick in whenever the compression window is larger than 2/3 ram as it speeds up the performance noticeably compared to just mapping in the whole file and never really caching anything. But I had to make up some upper limit to how big -M would work with and 1.5 times the ram size worked out to be a reasonable compromise in how much it would slow down, and how much it would hit swap.
Lrzip can hit swap real hard was the conclusion by the way... and it's not like it speeds anything up or allows any more ram to be used compared to just disabling it. The only
time swap made a difference to the testing was how much anonymous ram could be mmapped (as in just free ram versus caching a file that's on disk). The linux vm allows you to pretty much allocate everything you have as real ram and swap. You should see what the desktop was like after it finished compressing a huge file - clicking on anything took ages since pretty much every useful application was sitting on swap instead of in ram. I had to test it with a real HD since my main desktop now uses an SSD to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure be vigilant with backups and so on cause people don't trust them blah blah).
Despite choosing 2/3 as the amount of ram to allocate, lrzip still actually tests to ensure it won't fail and then decreases the amount till it succeeds. It's rather tricky making sure there will be enough ram for all the compression components because it needs enough ram to allocate to caching the file on disk, enough for 2 compression streams possibly as large as that first one, and then enough ram to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect that most things in ram will be trashed (all for a good cause of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use the sliding mmap for anything bigger than 700MB there and that's made a big difference to how effectively a 32 bit machine can compress large files too.
Lrzip has been building successfully on and off on mac osx for a while now. Every 2nd or 3rd release I break it since I can't test it myself. It turns out that I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out that mac osx just doesn't support them. It builds fine, but then when you run it, it says function not implemented. That's pretty sad... named semaphores are much clunkier and according to the manpage "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure if that's adhered to, but that's rather awkward if you don't clean up after yourself very well when you abort your program. At some stage if I can find the enthusiasm, I might try and get the multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or use them selectively on osx.
All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.
Lrzip project page on freshmeat
EDIT: I keep getting people ask me what the big deal is with 32 bits and lrzip, since 32 bits with PAE should be able to address up to 64GB ram. That may well be the case, but the gnu libraries on the 32 bit userspace take a size_t on malloc and mmap, and they are the size of a signed long, which is 4 bytes on 32 bits, and 8 bytes on 64 bits. So the most you can address in one malloc with 32 bit userspace is up to 31 bits, or a bit over 2GB.
Wednesday, 17 November 2010
lrzip-0.540 and multithreaded decompression.
Unable to put it down, I tackled the last big thing I had planned for lrzip in the short term future: Multithreaded decompression. In a similar fashion (but in reverse obviously) to how I tackled it on compression, I modified it to take the file and spit out chunks to as many threads as there are CPU, and let them all start decompressing each chunk independently. Then as soon as the first thread is complete, it hand its buffer over to the rzip stage and makes itself available to take on more chunks and so on.
This technique needs to have a file that was compressed with enough chunks in the first place, but the absolute minimum a file will already be is two chunks due to the 2 streams that rzip uses every time, but will obviously scale best with files already compressed with the multithreaded version.
So how does it perform? Well much like the compression side, the slower the backend compressor in use, the better the speedup with more CPUs.
Going to my regular benchmark, here are the before and after effects on decompression on quad core:
Note that the -z compress/decompress had a slightly braindamaged screen output in v0.530, and in this benchmark above, so it actually didn't perform as well as it would in v0.540 since that's been fixed. Clearly, though, it's massively faster.
Summary: Kick arse.
So I'm very pleased with the performance of lrzip on SMP now. With the help of a friend online who had access to a 48 CPU machine (no he wasn't running BFS :\), we tried various files to see how much we could get lrzip to scale. To actually use all the CPUs, we needed a file large enough that had enough work to distribute to most of the CPUs, and then keep them busy for long enough to show the effect of that scaling.
lrzip -z linux-2.6.36.tar
real 1m6.552s
user 17m16.660s
So this ran about 17x faster than it would have run single threaded, but still wasn't large enough.
The kde source fit that reasonably well, at 1967093760 bytes.
lrzip -z kde.tar
real 4m33.285s
user 92m26.650s
This one ran over 20x faster than it would have run single threaded. I don't know what the upper limit will be, but clearly this has been a massive improvement in performance, and brings zpaq into usable speeds with enough cores.
Get it here (too lazy to post on freshmeat right now): lrzip.kolivas.org
This technique needs to have a file that was compressed with enough chunks in the first place, but the absolute minimum a file will already be is two chunks due to the 2 streams that rzip uses every time, but will obviously scale best with files already compressed with the multithreaded version.
So how does it perform? Well much like the compression side, the slower the backend compressor in use, the better the speedup with more CPUs.
Going to my regular benchmark, here are the before and after effects on decompression on quad core:
Options Singlethread Multithread lrzip 4m32s 3m07s lrzip -M 4m05s 2m50s lrzip -l 3m12s 2m23s lrzip -lM 2m57s 2m20s lrzip -zM 04h08m 72m0s
Note that the -z compress/decompress had a slightly braindamaged screen output in v0.530, and in this benchmark above, so it actually didn't perform as well as it would in v0.540 since that's been fixed. Clearly, though, it's massively faster.
Summary: Kick arse.
So I'm very pleased with the performance of lrzip on SMP now. With the help of a friend online who had access to a 48 CPU machine (no he wasn't running BFS :\), we tried various files to see how much we could get lrzip to scale. To actually use all the CPUs, we needed a file large enough that had enough work to distribute to most of the CPUs, and then keep them busy for long enough to show the effect of that scaling.
lrzip -z linux-2.6.36.tar
real 1m6.552s
user 17m16.660s
So this ran about 17x faster than it would have run single threaded, but still wasn't large enough.
The kde source fit that reasonably well, at 1967093760 bytes.
lrzip -z kde.tar
real 4m33.285s
user 92m26.650s
This one ran over 20x faster than it would have run single threaded. I don't know what the upper limit will be, but clearly this has been a massive improvement in performance, and brings zpaq into usable speeds with enough cores.
Get it here (too lazy to post on freshmeat right now): lrzip.kolivas.org
Saturday, 13 November 2010
lrzip-0.530 and multithreading speed-ups.
As planned I finally manage to thread the compression phase of lrzip to parallelise as much as possible. Originally it used to feed data into a buffer which rzip acted on, and then when it was full it was handed over to the backend compressor, all of which were single threaded except for lzma which didn't even scale beyond 2 CPUs. The implementation in the latest released version of lrzip, 0.530, does it the way I mentioned in my previous blog post.
First it feeds data into the rzip buffer which preprocesses data until it has enough to pass onto a compression thread. Then it hands the data to a compression thread and continues reading more data and lets rzip work on it while the compression thread is doing the 2nd phase compression in the background. Once rzip has enough data for another thread, it spawns another thread and so on, until there are as many threads as CPUs, and then keeps reading until the first thread is free and reuses it, and so on.
Well the results of this are about as good as I could have hoped for. While the faster lzo compression backend only gains a small speedup, the slower the backend, the bigger the speedup. It becomes impressive once zpaq is in use, where I was able to get a 4x speedup on a quad core. That makes lrzip with zpaq almost as fast as regular xz! However, since zpaq takes just as long to decompress as it does to compress, and I haven't threaded the decompression phase, it ends up taking 4x longer to decompress than it did to compress (grin). So zpaq isn't -quite- at the usable stage just yet, but it may well be in the near future.
So what everyone has been waiting for (all 3 of you), benchmarks! 10GB virtual image file being compressed on a quad core 3GHz from an SSD.
Kick arse.
Get it here, and remember freshmeat may not have updated their download links yet:
lrzip
Next stop, to parallelise the decompression phase. I doubt anything but zpaq will really benefit from this, but it would be great to have a zpaq based compression format that is useably fast.
First it feeds data into the rzip buffer which preprocesses data until it has enough to pass onto a compression thread. Then it hands the data to a compression thread and continues reading more data and lets rzip work on it while the compression thread is doing the 2nd phase compression in the background. Once rzip has enough data for another thread, it spawns another thread and so on, until there are as many threads as CPUs, and then keeps reading until the first thread is free and reuses it, and so on.
Well the results of this are about as good as I could have hoped for. While the faster lzo compression backend only gains a small speedup, the slower the backend, the bigger the speedup. It becomes impressive once zpaq is in use, where I was able to get a 4x speedup on a quad core. That makes lrzip with zpaq almost as fast as regular xz! However, since zpaq takes just as long to decompress as it does to compress, and I haven't threaded the decompression phase, it ends up taking 4x longer to decompress than it did to compress (grin). So zpaq isn't -quite- at the usable stage just yet, but it may well be in the near future.
So what everyone has been waiting for (all 3 of you), benchmarks! 10GB virtual image file being compressed on a quad core 3GHz from an SSD.
Compression Size Compress Decompress None 10737418240 gzip 2772899756 05m47s 2m46s bzip2 2704781700 16m15s 6m19s xz 2272322208 50m58s 3m52s 7z 2242897134 26m36s 5m41s lrzip 1299228155 16m12s 4m32s lrzip -M 1079682231 12m03s 4m05s lrzip -l 1754694010 05m30s 3m12s lrzip -lM 1414958844 05m15s 2m57s lrzip -zM 1066902006 71m20s 04h08m
Kick arse.
Get it here, and remember freshmeat may not have updated their download links yet:
lrzip
Next stop, to parallelise the decompression phase. I doubt anything but zpaq will really benefit from this, but it would be great to have a zpaq based compression format that is useably fast.
Thursday, 11 November 2010
lrzip scalability improvements
I've been looking at the benchmarks I've accumulated for lrzip's performance and have been mostly pleased. The speed/size performance or pure size performance is very good compared to other compression algorithms, but not much has been done to date to make it scale better on SMP. It scales slightly during lzma compression, but even there it's a clunky threading library attached to it that increases the CPU usage up to a bit over 2 CPUs for about 50% better performance.
So I've set out to make significant improvements to the scalability of lrzip on SMP, knowing it will slightly adversely affect compression. There are two stages to compression on lrzip, the preprocessing rzip stage and the back end compression 2nd stage. Improving the scalability of the first stage looks highly improbable (I'll never admit to anything being impossible). However, if I split up the chunks handed to the back end compressor and spread it over each thread, it should scale better. So the first step I did here is to do precisely that: Grab all the data from rzip and then split it up to a fixed number of block sizes and use every CPU available, provided each block is at least a minimum size to preserve compression efficiency. I've done precisely that and committed it to the git tree:
lrzip repo
Now it would be nice if this made the backend compression 4 x faster on quad core, but there are a number of reasons this doesn't happen. Mainly it's because they're all competing for cache and memory bandwidth, and (less significantly) the blocks end up compressing different degrees so not all threads finish at the same time. Still, the classic 23 minute 10GB benchmark now dropped to 18 minutes which is a fairly significant improvement given that previous lzma compression was already lightly threaded. The file was only marginally bigger so it's definitely a win.
Next up is to try and break up the rzip stream and hand it to the compression back end before the stream has even finished, to parallelise the rzip phase with the compression phase. This will not likely get to use *every* CPU at the same time (except in the case of zpaq), but will actually be faster than the previous approach. This work is much more invasive than the previous one and is currently underway.
After that I'm going to look at decompression and try and parallelise the back end decompression phase. While decompression is already quite fast, who can ever complain about more speed?
All in all this has been loads of fun to hack on. I'd never thought I'd say this, but maybe it would be nice to have a sea change and get paid to hack part-time and do medicine only part-time.
So I've set out to make significant improvements to the scalability of lrzip on SMP, knowing it will slightly adversely affect compression. There are two stages to compression on lrzip, the preprocessing rzip stage and the back end compression 2nd stage. Improving the scalability of the first stage looks highly improbable (I'll never admit to anything being impossible). However, if I split up the chunks handed to the back end compressor and spread it over each thread, it should scale better. So the first step I did here is to do precisely that: Grab all the data from rzip and then split it up to a fixed number of block sizes and use every CPU available, provided each block is at least a minimum size to preserve compression efficiency. I've done precisely that and committed it to the git tree:
lrzip repo
Now it would be nice if this made the backend compression 4 x faster on quad core, but there are a number of reasons this doesn't happen. Mainly it's because they're all competing for cache and memory bandwidth, and (less significantly) the blocks end up compressing different degrees so not all threads finish at the same time. Still, the classic 23 minute 10GB benchmark now dropped to 18 minutes which is a fairly significant improvement given that previous lzma compression was already lightly threaded. The file was only marginally bigger so it's definitely a win.
Next up is to try and break up the rzip stream and hand it to the compression back end before the stream has even finished, to parallelise the rzip phase with the compression phase. This will not likely get to use *every* CPU at the same time (except in the case of zpaq), but will actually be faster than the previous approach. This work is much more invasive than the previous one and is currently underway.
After that I'm going to look at decompression and try and parallelise the back end decompression phase. While decompression is already quite fast, who can ever complain about more speed?
All in all this has been loads of fun to hack on. I'd never thought I'd say this, but maybe it would be nice to have a sea change and get paid to hack part-time and do medicine only part-time.
Monday, 8 November 2010
lrzip-0.520 and massive kernel compression experiment.
So it's another day and it's time for a new version. Actually version 0.520 is the same as version 0.5.2, it's just that I had it pointed out to me that distros (and github and most everything else) don't really like 3 point numbering schemes. So I have to go back to 2 point version numbering and this is otherwise the same code. I'm not planning on any more changes in the immediate future so hopefully this will calm down for a while and give people a change to actually use it.
Anyway I had a bit of fun with testing out lrzip at the extremes of what it does well and ended up posting my results to the linux kernel mailing list since it involved linux tarballs. Here is a copy of the email as I sent it.
---
Let's do this backwards. Data first.
tarball of every 3 point linux kernel from 2.6.0 to 2.6.36
9748285440 linux-2.6.0-2.6.36.tar
compressed file
136516013 linux-2.6.0-2.6.36.tar.lrz
Compressed size: 1.4%
Compression time:
00:19:22.086
Decompression time:
00:03:02.564
lrzip.kolivas.org
Now for the introduction
Lrzip is a compression application I've been working on that is based on the
original excellent application rzip by Andrew Tridgell. Rzip worked by having
a preprocessor stage with a massive compression window up to 900MB for long
distance redundancy compaction and then compressing the data with bz2. Lrzip
was a project I began based on rzip which tried to extend the initial
compression window beyond 900MB and to use lzma as the back end for the 2nd
stage compression. The idea was that as file sizes got bigger, and machines
had more ram, it would keep getting more and more useful.
After many iterations on lrzip, I've been able to significantly expand on this
idea and make it address 64 bit sized files, ram sizes, and bigger windows.
Previously the limit was based on available ram and how much of the file being
compressed could be mmapped. The current version (0.5.2) is able to use
unlimited sized compression windows for the preprocessor stage using two
sliding mmap windows I invented to match the hash search algorithm of rzip
which can go beyond the size of the ram available. Basically the larger the
file, and the more redundant information in the file, the more the compression
is you can achieve.
Anyway the relevance of this to kernel folk is that the linux kernel is a
perfect test case for long distance redundancy when multiple kernel trees are
compressed.
So here's the lowdown. All of these results were conducted on a 3GHz quad core
64 bit machine with 8GB ram on a 160GB SSD so hard drive speed is a
(relatively) insignificant part of the times.
Starting with linux kernel 2.6.36
413511680 linux-2.6.36.tar
70277083 linux-2.6.36.tar.bz2
59905365 linux-2.6.36.tar.lrz
Compression:
Compression Ratio: 6.903. Average Compression Speed: 1.582MB/s.
Total time: 00:04:08.896
Decompression:
Average DeCompression Speed: 17.130MB/s
Total time: 00:00:22.557
So trying my best to show off what lrzip is good at, I grabbed every 3 point
2.6 linux kernel release. That's every release from 2.6.0 to 2.6.36 as a
tarball, not as patches. It came to a 9.1GB file. Now each kernel tree is
going to be repeating an -awful- lot of data, but given that they have gotten
progressively larger, and span gigabytes, normal compression doesn't really
offer much. This is where the rzip compaction stage over the full size of the
file comes in handy. The more redundant information there is over large
distances, the better. [insert joke about the linux kernel and redundant
information here]. (-MU means Maximum Unlimited; maximum ram and unlimited
window size).
9748285440 linux-2.6.0-2.6.36.tar
lrzip -MU linux-2.6.0-2.6.36.tar
136516013 linux-2.6.0-2.6.36.tar.lrz
Compression:
Compression Ratio: 71.408. Average Compression Speed: 8.000MB/s.
Total time: 00:19:22.086
That's 9.1GB to 130MB (1.4%) in 20 minutes.
Decompression:
Average DeCompression Speed: 51.359MB/s
Total time: 00:03:02.564
At 130MB, it's small enough for me to even offer the entire 3 point stable
release for download from my piddly little server. So here's the file, if
anyone wanted to confirm its validity, or just to download them all :)
linux-2.6.0-2.6.36.tar.lrz
lrzip also has an lzo and zpaq backend for those who want to extract every
last drop of compression out of it at the cost of a bit more time. zpaq is
EIGHT TIMES slower during the backend compression phase compared to lzma
whereas lzo is almost instant after the rzip phase. Note that this file loses
most of its size during the preprocessor stage so it doesn't end up taking 8x
longer to compress with zpaq:
lrzip -MUz linux-2.6.0-2.6.36.tar
121021214 linux-2.6.0-2.6.36.tar.lrz
Compression:
Compression Ratio: 80.550. Average Compression Speed: 3.041MB/s.
Total time: 00:50:56.537
That's 9.1GB to 115MB (1.2%) in 51 mins.
Decompression time is about 52 minutes (yes it's longer than compression!)
Now, I am NOT proposing lrzip be used as a drop in replacement for the
existing compression formats in use. It does work on stdin/stdout but
inefficiently except on compression from stdin. It does not have any archive
facilities beyond compressing the one file, so it comes with an lrztar and
lrzuntar wrapper to do basic directory archiving. It uses a LOT of ram, and
although it has the ability to use unlimited sized compression windows
unconstrained by ram, the bigger the discrepancy between the file size and the
ram, the slower it will get. (Decompression typically uses 10x less ram than
compression).
Nevertheless, I home some people with lots of similar files lying around, like
kernel hackers, will hopefully find it a useful tool, as will people with
database dumps and the like. I also wonder if features of this compression
approach might be helpful for other projects that deal with huge amounts of
data and have large memory requirements. When I mentioned the sliding mmap
feature online, git using lots of memory during its compression phase was
mentioned, so perhaps there are other uses for the sliding mmap idea as a way
of reducing memory usage. There's some documentation of it in the code in
rzip.c and I wrote briefly about it in my blog: http://ck-hack.blogspot.com
Anyway I had a bit of fun with testing out lrzip at the extremes of what it does well and ended up posting my results to the linux kernel mailing list since it involved linux tarballs. Here is a copy of the email as I sent it.
---
Let's do this backwards. Data first.
tarball of every 3 point linux kernel from 2.6.0 to 2.6.36
9748285440 linux-2.6.0-2.6.36.tar
compressed file
136516013 linux-2.6.0-2.6.36.tar.lrz
Compressed size: 1.4%
Compression time:
00:19:22.086
Decompression time:
00:03:02.564
lrzip.kolivas.org
Now for the introduction
Lrzip is a compression application I've been working on that is based on the
original excellent application rzip by Andrew Tridgell. Rzip worked by having
a preprocessor stage with a massive compression window up to 900MB for long
distance redundancy compaction and then compressing the data with bz2. Lrzip
was a project I began based on rzip which tried to extend the initial
compression window beyond 900MB and to use lzma as the back end for the 2nd
stage compression. The idea was that as file sizes got bigger, and machines
had more ram, it would keep getting more and more useful.
After many iterations on lrzip, I've been able to significantly expand on this
idea and make it address 64 bit sized files, ram sizes, and bigger windows.
Previously the limit was based on available ram and how much of the file being
compressed could be mmapped. The current version (0.5.2) is able to use
unlimited sized compression windows for the preprocessor stage using two
sliding mmap windows I invented to match the hash search algorithm of rzip
which can go beyond the size of the ram available. Basically the larger the
file, and the more redundant information in the file, the more the compression
is you can achieve.
Anyway the relevance of this to kernel folk is that the linux kernel is a
perfect test case for long distance redundancy when multiple kernel trees are
compressed.
So here's the lowdown. All of these results were conducted on a 3GHz quad core
64 bit machine with 8GB ram on a 160GB SSD so hard drive speed is a
(relatively) insignificant part of the times.
Starting with linux kernel 2.6.36
413511680 linux-2.6.36.tar
70277083 linux-2.6.36.tar.bz2
59905365 linux-2.6.36.tar.lrz
Compression:
Compression Ratio: 6.903. Average Compression Speed: 1.582MB/s.
Total time: 00:04:08.896
Decompression:
Average DeCompression Speed: 17.130MB/s
Total time: 00:00:22.557
So trying my best to show off what lrzip is good at, I grabbed every 3 point
2.6 linux kernel release. That's every release from 2.6.0 to 2.6.36 as a
tarball, not as patches. It came to a 9.1GB file. Now each kernel tree is
going to be repeating an -awful- lot of data, but given that they have gotten
progressively larger, and span gigabytes, normal compression doesn't really
offer much. This is where the rzip compaction stage over the full size of the
file comes in handy. The more redundant information there is over large
distances, the better. [insert joke about the linux kernel and redundant
information here]. (-MU means Maximum Unlimited; maximum ram and unlimited
window size).
9748285440 linux-2.6.0-2.6.36.tar
lrzip -MU linux-2.6.0-2.6.36.tar
136516013 linux-2.6.0-2.6.36.tar.lrz
Compression:
Compression Ratio: 71.408. Average Compression Speed: 8.000MB/s.
Total time: 00:19:22.086
That's 9.1GB to 130MB (1.4%) in 20 minutes.
Decompression:
Average DeCompression Speed: 51.359MB/s
Total time: 00:03:02.564
At 130MB, it's small enough for me to even offer the entire 3 point stable
release for download from my piddly little server. So here's the file, if
anyone wanted to confirm its validity, or just to download them all :)
linux-2.6.0-2.6.36.tar.lrz
lrzip also has an lzo and zpaq backend for those who want to extract every
last drop of compression out of it at the cost of a bit more time. zpaq is
EIGHT TIMES slower during the backend compression phase compared to lzma
whereas lzo is almost instant after the rzip phase. Note that this file loses
most of its size during the preprocessor stage so it doesn't end up taking 8x
longer to compress with zpaq:
lrzip -MUz linux-2.6.0-2.6.36.tar
121021214 linux-2.6.0-2.6.36.tar.lrz
Compression:
Compression Ratio: 80.550. Average Compression Speed: 3.041MB/s.
Total time: 00:50:56.537
That's 9.1GB to 115MB (1.2%) in 51 mins.
Decompression time is about 52 minutes (yes it's longer than compression!)
Now, I am NOT proposing lrzip be used as a drop in replacement for the
existing compression formats in use. It does work on stdin/stdout but
inefficiently except on compression from stdin. It does not have any archive
facilities beyond compressing the one file, so it comes with an lrztar and
lrzuntar wrapper to do basic directory archiving. It uses a LOT of ram, and
although it has the ability to use unlimited sized compression windows
unconstrained by ram, the bigger the discrepancy between the file size and the
ram, the slower it will get. (Decompression typically uses 10x less ram than
compression).
Nevertheless, I home some people with lots of similar files lying around, like
kernel hackers, will hopefully find it a useful tool, as will people with
database dumps and the like. I also wonder if features of this compression
approach might be helpful for other projects that deal with huge amounts of
data and have large memory requirements. When I mentioned the sliding mmap
feature online, git using lots of memory during its compression phase was
mentioned, so perhaps there are other uses for the sliding mmap idea as a way
of reducing memory usage. There's some documentation of it in the code in
rzip.c and I wrote briefly about it in my blog: http://ck-hack.blogspot.com
Saturday, 6 November 2010
lrzip-0.5.2 with unlimited size compression windows unconstrained by ram.
After my last blog, all I needed to do was sleep on the "sliding mmap" implementation I had so far to think about how to improve it to make it work in some reasonable time frame instead of 100 to 1000 times slower. Yes, I know lrzip isn't as exciting as BFS to everyone else, but I'm having fun hacking on it.
The hash search in the rzip phase of lrzip moves progressively along the file offset, and the match lengths are up to 65536 bytes long. So I changed my 2 sliding buffers to have the one large one move along with the progressive file offset, and the smaller one to be 65536 bytes long. Well the results were dramatic. Instead of being 100 times slower, my first attempt showed it to be 3 times slower. So I cleaned it up and started benchmarking. Not only was it faster than the earlier implementation, surprisingly, sometimes it's faster to compress with it enabled than not! So it was clear I was onto a winner. I tried a few different memory configuration machines and different file sizes, and I can see that as the amount of redundancy goes up in the original file, and as the file deviates further from the ram size, it slows down more and more, but it's still a usable speed.
Using the classic 10GB virtual image file on an 8GB machine example, here's how it fared with the various options:
Not bad at all really, considering it manages to mmap about 5.5GB and then the other 4.5GB depends on the sliding feature to work.
So I fixed a swag of other things, updated the docs and made all new benchmarks, and decided I couldn't sit on it any longer. At some stage you have to release, so version 0.5.1 is out, with the goodness of unlimited size compression windows, unconstrained by ram.
EDIT: Updated link to 0.5.2
lrzip-0.5.2
Here's some more fun benchmarks:
These are benchmarks performed on a 2.53Ghz dual core Intel Core2 with 4GB ram using lrzip v0.5.1. Note that it was running with a 32 bit userspace so only 2GB addressing was possible. However the benchmark was run with the -U option allowing the whole file to be treated as one large compression window.
Tarball of 6 consecutive kernel trees, linux-2.6.31 to linux-2.6.36
Kick arse.
Update:Again I managed to break the Darwin build in 0.5.1. This time it's a simple missing ; at the end of line 536 in main.c . I'll confirm that it works otherwise on Darwin and do a quick bump to a new version. (edit:done)
EDIT: Just for grins, I tried compressing a 50GB freenet datastore on the 8GB ram machine with just the one window as a test. It succeeded with flying colours in about 2 hours. Given the nature of the data that is of very low compressibility due to all being encrypted, and the window being more than 6* the size of the ram.
Compression Ratio: 1.022. Average Compression Speed: 7.266MB/s.
Total time: 01:59:3563.609
It also decompressed fine:
Total time: 01:02:3566.894
Note that it was on an external USB hard drive which isn't the fastest thing to read/write from and reading and writing 50GB to the same drive probably would take an hour itself, so that's why it takes so long. The compression time frame looks good in light of that :)
The hash search in the rzip phase of lrzip moves progressively along the file offset, and the match lengths are up to 65536 bytes long. So I changed my 2 sliding buffers to have the one large one move along with the progressive file offset, and the smaller one to be 65536 bytes long. Well the results were dramatic. Instead of being 100 times slower, my first attempt showed it to be 3 times slower. So I cleaned it up and started benchmarking. Not only was it faster than the earlier implementation, surprisingly, sometimes it's faster to compress with it enabled than not! So it was clear I was onto a winner. I tried a few different memory configuration machines and different file sizes, and I can see that as the amount of redundancy goes up in the original file, and as the file deviates further from the ram size, it slows down more and more, but it's still a usable speed.
Using the classic 10GB virtual image file on an 8GB machine example, here's how it fared with the various options:
Options Size Compress Decompress -l 1793312108 5m13s 3m12s -lM 1413268368 4m18s 2m54s -lU 1413268368 6m05s 2m54s
Not bad at all really, considering it manages to mmap about 5.5GB and then the other 4.5GB depends on the sliding feature to work.
So I fixed a swag of other things, updated the docs and made all new benchmarks, and decided I couldn't sit on it any longer. At some stage you have to release, so version 0.5.1 is out, with the goodness of unlimited size compression windows, unconstrained by ram.
EDIT: Updated link to 0.5.2
lrzip-0.5.2
Here's some more fun benchmarks:
These are benchmarks performed on a 2.53Ghz dual core Intel Core2 with 4GB ram using lrzip v0.5.1. Note that it was running with a 32 bit userspace so only 2GB addressing was possible. However the benchmark was run with the -U option allowing the whole file to be treated as one large compression window.
Tarball of 6 consecutive kernel trees, linux-2.6.31 to linux-2.6.36
Compression Size Percentage Compress Decompress None 2373713920 100 7z 344088002 14.5 17m26s 1m22s lrzip -l 223130711 9.4 05m21s 1m01s lrzip -U 73356070 3.1 08m53s 43s lrzip -Ul 158851141 6.7 04m31s 35s lrzip -Uz 62614573 2.6 24m42s 25m30s
Kick arse.
Update:Again I managed to break the Darwin build in 0.5.1. This time it's a simple missing ; at the end of line 536 in main.c . I'll confirm that it works otherwise on Darwin and do a quick bump to a new version. (edit:done)
EDIT: Just for grins, I tried compressing a 50GB freenet datastore on the 8GB ram machine with just the one window as a test. It succeeded with flying colours in about 2 hours. Given the nature of the data that is of very low compressibility due to all being encrypted, and the window being more than 6* the size of the ram.
Compression Ratio: 1.022. Average Compression Speed: 7.266MB/s.
Total time: 01:59:3563.609
It also decompressed fine:
Total time: 01:02:3566.894
Note that it was on an external USB hard drive which isn't the fastest thing to read/write from and reading and writing 50GB to the same drive probably would take an hour itself, so that's why it takes so long. The compression time frame looks good in light of that :)
Friday, 5 November 2010
Lrzip and sliding mmap
After my last blog I completed work on releasing a new version of lrzip. The most user visible changes would be the new file format which is slightly more compact and a little faster on encoding. Behind the scenes, though, quite a lot of work went into fixing random minor issues, improving the memory allocation code and making the maximum compression windows larger with less chance of failure by falling back as much as possible.
One other major change was to try and implement stdin/stdout properly on lrzip instead of faking it by using temporary files as it did on the last version. As you may, or more likely, may not be aware, the original rzip which lrzip was based on did not work with stdin/stdout. It was one of the most commonly requested features of lrzip, and I implemented it by simply creating temporary files on readin, and then working off that, and a temporary file on compression and then dumping it to stdout once complete. I've been thinking for a long time about just how I could do it without the temporary files as there were certain limitations. As far as stdin goes, oOn compression, the file it's compressing needs to be mmappable. On decompression, the compressed file needs to be seekable. Neither of those can be done on stdin. As far as stdout goes, the output file needs to be seekable/readable while it's being written, and the header needs to be written at the start of the file on compression as well. Neither of those can be done on stdout.
After much fluffing around, the best I could do was to implement stdin without a temporary file. As lrzip works off an mmapped file, I implemented it with an anonymous mapping that reads as much as it can from stdin before working with it. The problem with this is we can't know how big the file will be as it's being read, so the anonymous mapping is made as big as is possible. There is some slight compression lost by doing this as one cannot map any more than how much real ram can be allocated, and the block headers need to be generous in predicting how long the next data block will be when they're written, but these are not major problems.
After releasing version 0.50, it didn't take long before someone spotted a bug to do with building it on OS X. It turns out that OS X doesn't have mremap implemented. So with the help of a few people online who had OS X, we got to fixing the build problem and I also noticed that due to a wrongly made workaround for OS X not having memopen, even the linux builds would be getting the fake memopen so I fixed that and committed it to the git repo. Not enough for a new version yet.
Anyway all of that got me thinking about how to manage the whole compression window failing issue since I started implementing things to workaround failing to allocate enough ram. Up to version 0.5, a compression window is "chosen" by lrzip based on a 2/3 of ram rule, or if set by the user. This was rather prone to failure for numerous obvious reasons, and the fact that further ram needed to be allocated as the backend compression phase would kick in.
When I first started playing with lrzip as an entity, one thing struck me. The major advantage to using the rzip first stage is proportional to the size of the compression window, and that window is limited by how much of a file you can mmap. Now with linux using virtual addressing and being an aggressive overcommitter, it turns out that amount of ram can be quite a lot, but still limited by how much real ram you have. Also if you dedicate all your ram to the first stage, you're left with not much ram for the backend compressor, and lzma quite benefits from a lot of ram too. So a few years back I half-heartedly mentioned the idea of a "sliding mmap" which could look like you have the whole file mmapped, yet did not need to dedicate your entire ram to it, as the "window" of which part of the buffer you were currently accessing would simply move over that part of the file, and you'd just have a virtual offset into one massive buffer.
So I set out to build my sliding mmap. After screwing around for many hours just to sort out pointers and variable types and all sorts of other nonsense that I'm not much good with, I finally had something that started taking shape. The first thing I experimented with was one large buffer that would move as per my original concept above. Well this would be fine if you were mostly reading linearly from a file, but the hash search of a huge compression window does this NOT. It reads from one end to the other in 2 sort of streams at a time, so there were literally billions on of mmap/munmaps occurring on a large file (since that's what I was designing it for) and almost no progress was being made. So then I experimented with splitting the buffer into two half sized ones and moving just the top one. Not really any better. Then I created a mini algorithm for sliding both of those two buffers up and down. Still no better. I tried tossing the mmap out completely and using single reads from disk per byte either via pread or fread. That was even worse. I tried mallocing just one page and moving that and it was marginally better. I kept hoping I'd make progress, as I'm nothing if not persistent :P Finally I found what worked best was to still have as much ram mapped as possible since this is fast, and use a one page sized buffer (4096 bytes) move up and down once we read beyond the base mmapped buffer.
How well does it perform? Well somewhere between 100 and 1000 times slower than a real mmap once reading beyond the initial buffer, but just as fast during that initial buffer. So if you can manage to fit most of a file into ram, this will get you the rest of the way. It takes away one of the major advantages of the rzip preprocessor stage, being that it's very fast, but then again lrzip does also do the ultra slow thing when in zpaq mode. I have ideas for how I might be able to tweak it further, so we'll see :) It was fun to implement, and it's nice to know that I can claim that lrzip has "unlimited sized compression windows" now. I've since committed that to the git repo.
Probably more useful than that feature is that I put a lot more effort into maximising the size of windows by throwing away the "guessing" how big a window it can use. Now it just allocates as much as it can through testing, and then does the same on the backend compression phase so it wont run out of ram later on. Simply put, the larger the window for the preprocessor stage, the better, and if it's as big as the file, even better, and this does its best to make that as big as possible (with -M).
So here's some numbers, choosing what lrzip is best at; a 10GB virtual image (so lots of redundant data over a big file) being compressed on a machine with 8GB ram on an SSD drive (this is an excerpt from the benchmarks on my site):
-l mode is using lzo as the backend
-M mode uses the maximum possible ram. In this case, lrzip succeeded in fitting the entire 10GB file into one compression window.
I'd love to tell you how well zpaq fared, but it hasn't finished yet since it's so slow :P It used to take around 4 hours to do this benchmark. I'll update this post tomorrow with them. (edit:done).
Hopefully I can finish cleaning a few more things up and release version 0.5.1 soon with these changes, but I need to go to bed. They're all committed into the git repo already.
lrzip.kolivas.org
UPDATE: I've played with the sliding mmap implementation some more, in line with how lrzip works, and found that if I make the smaller buffer 64k it matches the size of the search function, and if I slide the larger buffer along with the main hash search it makes for a MASSIVE speed improvement over the earlier implementation. It is in fact usable speed now, as I just tested a 5GB file on a laptop that could normally only use a 2GB window, but used unlimited mode having an effective 5GB window, and it completed in 15 minutes, which is very reasonable! It decompressed in under 4 minutes. The changes have been committed to the git repo.
One other major change was to try and implement stdin/stdout properly on lrzip instead of faking it by using temporary files as it did on the last version. As you may, or more likely, may not be aware, the original rzip which lrzip was based on did not work with stdin/stdout. It was one of the most commonly requested features of lrzip, and I implemented it by simply creating temporary files on readin, and then working off that, and a temporary file on compression and then dumping it to stdout once complete. I've been thinking for a long time about just how I could do it without the temporary files as there were certain limitations. As far as stdin goes, oOn compression, the file it's compressing needs to be mmappable. On decompression, the compressed file needs to be seekable. Neither of those can be done on stdin. As far as stdout goes, the output file needs to be seekable/readable while it's being written, and the header needs to be written at the start of the file on compression as well. Neither of those can be done on stdout.
After much fluffing around, the best I could do was to implement stdin without a temporary file. As lrzip works off an mmapped file, I implemented it with an anonymous mapping that reads as much as it can from stdin before working with it. The problem with this is we can't know how big the file will be as it's being read, so the anonymous mapping is made as big as is possible. There is some slight compression lost by doing this as one cannot map any more than how much real ram can be allocated, and the block headers need to be generous in predicting how long the next data block will be when they're written, but these are not major problems.
After releasing version 0.50, it didn't take long before someone spotted a bug to do with building it on OS X. It turns out that OS X doesn't have mremap implemented. So with the help of a few people online who had OS X, we got to fixing the build problem and I also noticed that due to a wrongly made workaround for OS X not having memopen, even the linux builds would be getting the fake memopen so I fixed that and committed it to the git repo. Not enough for a new version yet.
Anyway all of that got me thinking about how to manage the whole compression window failing issue since I started implementing things to workaround failing to allocate enough ram. Up to version 0.5, a compression window is "chosen" by lrzip based on a 2/3 of ram rule, or if set by the user. This was rather prone to failure for numerous obvious reasons, and the fact that further ram needed to be allocated as the backend compression phase would kick in.
When I first started playing with lrzip as an entity, one thing struck me. The major advantage to using the rzip first stage is proportional to the size of the compression window, and that window is limited by how much of a file you can mmap. Now with linux using virtual addressing and being an aggressive overcommitter, it turns out that amount of ram can be quite a lot, but still limited by how much real ram you have. Also if you dedicate all your ram to the first stage, you're left with not much ram for the backend compressor, and lzma quite benefits from a lot of ram too. So a few years back I half-heartedly mentioned the idea of a "sliding mmap" which could look like you have the whole file mmapped, yet did not need to dedicate your entire ram to it, as the "window" of which part of the buffer you were currently accessing would simply move over that part of the file, and you'd just have a virtual offset into one massive buffer.
So I set out to build my sliding mmap. After screwing around for many hours just to sort out pointers and variable types and all sorts of other nonsense that I'm not much good with, I finally had something that started taking shape. The first thing I experimented with was one large buffer that would move as per my original concept above. Well this would be fine if you were mostly reading linearly from a file, but the hash search of a huge compression window does this NOT. It reads from one end to the other in 2 sort of streams at a time, so there were literally billions on of mmap/munmaps occurring on a large file (since that's what I was designing it for) and almost no progress was being made. So then I experimented with splitting the buffer into two half sized ones and moving just the top one. Not really any better. Then I created a mini algorithm for sliding both of those two buffers up and down. Still no better. I tried tossing the mmap out completely and using single reads from disk per byte either via pread or fread. That was even worse. I tried mallocing just one page and moving that and it was marginally better. I kept hoping I'd make progress, as I'm nothing if not persistent :P Finally I found what worked best was to still have as much ram mapped as possible since this is fast, and use a one page sized buffer (4096 bytes) move up and down once we read beyond the base mmapped buffer.
How well does it perform? Well somewhere between 100 and 1000 times slower than a real mmap once reading beyond the initial buffer, but just as fast during that initial buffer. So if you can manage to fit most of a file into ram, this will get you the rest of the way. It takes away one of the major advantages of the rzip preprocessor stage, being that it's very fast, but then again lrzip does also do the ultra slow thing when in zpaq mode. I have ideas for how I might be able to tweak it further, so we'll see :) It was fun to implement, and it's nice to know that I can claim that lrzip has "unlimited sized compression windows" now. I've since committed that to the git repo.
Probably more useful than that feature is that I put a lot more effort into maximising the size of windows by throwing away the "guessing" how big a window it can use. Now it just allocates as much as it can through testing, and then does the same on the backend compression phase so it wont run out of ram later on. Simply put, the larger the window for the preprocessor stage, the better, and if it's as big as the file, even better, and this does its best to make that as big as possible (with -M).
So here's some numbers, choosing what lrzip is best at; a 10GB virtual image (so lots of redundant data over a big file) being compressed on a machine with 8GB ram on an SSD drive (this is an excerpt from the benchmarks on my site):
Compression Size Compress Decompress None 10737418240 gzip 2772899756 05m47.35s 2m46.77s bzip2 2704781700 20m34.26s 7m51.36s xz 2272322208 58m26.82s 4m46.15s 7z 2242897134 29m28.15s 6m35.95s lrzip 1354237684 29m13.40s 6m55.44s lrzip -M 1079528708 23m44.22s 4m05.46s lrzip -l 1793312108 05m13.24s 3m12.88s lrzip -lM 1413268368 04m18.33s 2m54.65s lrzip -z 1299844906 04h32m14s 04h33m lrzip -zM 1066902006 04h07m14s 04h08m
-l mode is using lzo as the backend
-M mode uses the maximum possible ram. In this case, lrzip succeeded in fitting the entire 10GB file into one compression window.
I'd love to tell you how well zpaq fared, but it hasn't finished yet since it's so slow :P It used to take around 4 hours to do this benchmark. I'll update this post tomorrow with them. (edit:done).
Hopefully I can finish cleaning a few more things up and release version 0.5.1 soon with these changes, but I need to go to bed. They're all committed into the git repo already.
lrzip.kolivas.org
UPDATE: I've played with the sliding mmap implementation some more, in line with how lrzip works, and found that if I make the smaller buffer 64k it matches the size of the search function, and if I slide the larger buffer along with the main hash search it makes for a MASSIVE speed improvement over the earlier implementation. It is in fact usable speed now, as I just tested a 5GB file on a laptop that could normally only use a 2GB window, but used unlimited mode having an effective 5GB window, and it completed in 15 minutes, which is very reasonable! It decompressed in under 4 minutes. The changes have been committed to the git repo.
Sunday, 31 October 2010
On lrzip and packing bytes
As some of you may be aware I maintain a compression application called lrzip which is an evolution of Tridge's rzip which does a range coding pre-processing pass and then compresses the data with bzip2. Lrzip does the compression with a selection of backend compressors, using lzma by default but offering extremely fast with lzo or extremely slow with zpaq. The other thing lrzip does differently is I updated all the code to work properly on 64 bit machines with super large sizes. The idea behind lrzip is that as machines get more ram, and file sizes get bigger, the compression it will be capable of achieving will keep getting better, as it uses compression windows limited only in size by available ram. It is not a complex mature application with lots of features like 7z, but simply offers solid compression at good speeds.
See more about it with benchmarks and other advertising blurbs here:
lrzip.kolivas.org
Anyway when I updated it to 64 bit, my conversion of the code was a little inefficient, simply changing all the offsets in the pre-processor stage from 16 and 32 bit variables to 64 bit ones. This lost some of the efficiency of the compression because suddenly 2 and 4 byte entries were now 8 bytes long. However, since the 8 bytes were mostly zeroes, they would compress quite well since the backend compression pass was done next. Still, it was clear this was inefficient. So I did some experimenting.
The first thing I did was to base the byte width on the file size of the archive, thus allowing very small archives to have even less than 4 byte wide entries, and most archives would likely be 4-5 bytes wide. 5 bytes gives us 40 bits or files up to 1,099,511,627,776 in size. The compression gains were small but very real, and given that with zpaq compression lrzip's overall compression is rather extreme, any gains are worthwhile chasing.
The next thing I tried was packing bits. As most offsets were likely to be a lot less than the full file size, then it is possible to just make each entry only as many bytes wide as the data itself. To do this, I encoded how many bytes wide the data was within the data itself, by adding 3 bits to it. Having only 61 bits for the data itself of a 64 bit variable is not a problem since a 61 bit sized file is up to 2,305,843,009,213,693,952 in size.
Here's a rundown of how the code looked for encode/decode:
Anyway after a lot of fluffing around to actually make this code work since I had numerous mistakes while getting this far, I finally got around to testing it. The results of the pre-processing stage were very promising with the data getting a lot smaller. Then after the backend compression stage kicked in, the final compressed file size was almost the same as it was with just the 8 byte wide entries already in use in the current lrzip version 0.47 release. I sat there looking at it thinking "well that was a complete waste of time". So after thinking about it a little more, I realised that this bit-packing I was doing is basically just another form of compression, so it was making the compression stage less efficient. Incidentally, some file formats and compression formats I believe use a similar form of bit packing.
So I'll be throwing away the bit-packing idea and going back to the fixed width idea since the final compression size will be smaller that way. However I wanted to put to do something useful with the code since I made it and figured I should put it on this blog since I'm about to delete it from the actual code :) So back to the fixed byte width coding, hopefully with a new version of lrzip not too far away.
See more about it with benchmarks and other advertising blurbs here:
lrzip.kolivas.org
Anyway when I updated it to 64 bit, my conversion of the code was a little inefficient, simply changing all the offsets in the pre-processor stage from 16 and 32 bit variables to 64 bit ones. This lost some of the efficiency of the compression because suddenly 2 and 4 byte entries were now 8 bytes long. However, since the 8 bytes were mostly zeroes, they would compress quite well since the backend compression pass was done next. Still, it was clear this was inefficient. So I did some experimenting.
The first thing I did was to base the byte width on the file size of the archive, thus allowing very small archives to have even less than 4 byte wide entries, and most archives would likely be 4-5 bytes wide. 5 bytes gives us 40 bits or files up to 1,099,511,627,776 in size. The compression gains were small but very real, and given that with zpaq compression lrzip's overall compression is rather extreme, any gains are worthwhile chasing.
The next thing I tried was packing bits. As most offsets were likely to be a lot less than the full file size, then it is possible to just make each entry only as many bytes wide as the data itself. To do this, I encoded how many bytes wide the data was within the data itself, by adding 3 bits to it. Having only 61 bits for the data itself of a 64 bit variable is not a problem since a 61 bit sized file is up to 2,305,843,009,213,693,952 in size.
Here's a rundown of how the code looked for encode/decode:
static inline void put_vbytes(void *ss, int stream, i64 s)
{
uchar encoded_bytes;
int bytes, bits = 0;
while (s > (1LL << bits))
bits++;
/* Add one bit for bit 0, and 3 bits for the encoded number of bytes */
bits += 4;
bytes = bits / 8;
/* Round up to byte sized */
if (bits % 8)
bytes++;
/* Always assume there will be one byte, so we only need 3 bits to count
* the remaining 7 possible. Left shift s by 3 bits and then encode the
* 3 bits of remaining bytes into the right-most bits of s that will be
* written thus making them first */
encoded_bytes = bytes - 1;
s <<= 3;
s |= (i64)encoded_bytes;
put_vchars(ss, stream, s, bytes);
}
static inline i64 read_vbytes(void *ss, int stream)
{
uchar encoded_bytes, sb = read_u8(ss, stream);
i64 s;
int bytes;
/* Grab the right most 3 bits. It has encoded how many remaining
* bytes this entry is */
encoded_bytes = sb & 7;
s = sb;
for (bytes = 1; bytes <= encoded_bytes; bytes++) {
int bits = bytes * 8;
sb = read_u8(ss, stream);
s |= (i64)sb << bits;
}
s >>= 3;
return s;
}
Anyway after a lot of fluffing around to actually make this code work since I had numerous mistakes while getting this far, I finally got around to testing it. The results of the pre-processing stage were very promising with the data getting a lot smaller. Then after the backend compression stage kicked in, the final compressed file size was almost the same as it was with just the 8 byte wide entries already in use in the current lrzip version 0.47 release. I sat there looking at it thinking "well that was a complete waste of time". So after thinking about it a little more, I realised that this bit-packing I was doing is basically just another form of compression, so it was making the compression stage less efficient. Incidentally, some file formats and compression formats I believe use a similar form of bit packing.
So I'll be throwing away the bit-packing idea and going back to the fixed width idea since the final compression size will be smaller that way. However I wanted to put to do something useful with the code since I made it and figured I should put it on this blog since I'm about to delete it from the actual code :) So back to the fixed byte width coding, hopefully with a new version of lrzip not too far away.
Subscribe to:
Posts (Atom)