Showing posts with label multithreading. Show all posts
Showing posts with label multithreading. Show all posts

Friday, 25 February 2011

lrzip-0.570 for the uncorrupt.

When I last blogged about lrzip, I mentioned the corruption on decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly so I set out on a major hunt to eliminate it, and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short, the corruption was NEVER THERE.

The problem he was encountering was on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command. On decompressing the file and comparing it, there would be differences in the file at random places. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).

Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently internally used crc32 to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However the crc was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect what is the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum which is standard on all modern linux installations to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further, and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.

One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.

Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.

Here's the latest results from my standard 10GB virtual image compression test:
Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756    25.8          5m47s        2m46s
bzip2        2704781700    25.2         16m15s        6m19s
xz           2272322208    21.2         50m58s        3m52s
7z           2242897134    20.9         26m36s        5m41s
lrzip        1372218189    12.8         10m23s        2m53s
lrzip -U     1095735108    10.2          8m44s        2m45s
lrzip -l     1831894161    17.1          4m53s        2m37s
lrzip -lU    1414959433    13.2          4m48s        2m38s
lrzip -zU    1067075961     9.9         69m36s       69m35s

Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:

Detected lrzip version 0.5 file.
../temp/enwik8.lrz:
lrzip version: 0.5 file
Compression: rzip + lzma
Decompressed file size: 100000000
Compressed file size: 26642293
Compression ratio: 3.753
MD5 used for integrity testing
MD5: a1fa5ffddb56f4953e226637dabbb36a
Rzip chunk 1:
Stream: 0
Offset: 25
Block   Comp    Percent Size
1       lzma    58.1%   867413 / 1493985        Offset: 22687516        Head: 0
Stream: 1
Offset: 25
Block   Comp    Percent Size
1       lzma    28.8%   5756191 / 20000000      Offset: 75      Head: 5756266
2       lzma    28.4%   5681891 / 20000000      Offset: 5756291 Head: 11438182
3       lzma    28.2%   5630256 / 20000000      Offset: 11438207        Head: 17068463
4       lzma    28.1%   5619003 / 20000000      Offset: 17068488        Head: 23554929
5       lzma    28.5%   3087298 / 10841364      Offset: 23554954        Head: 0
Rzip compression: 92.3% 92335349 / 100000000
Back end compression: 28.9% 26642052 / 92335349
Overall compression: 26.6% 26642052 / 100000000

I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.

That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip

Sunday, 12 December 2010

lrzip-0.551, overlapping streams for faster compression

A few bug reports have been trickling in since I last released a version of lrzip, which as I said last time is good and bad. It's helped direct my workflow on this, even though I keep trying to let it settle down. I'd love to leave the code since it mostly works quite well, and there is a common mantra that "release early, release often" is considered good, but in practice that doesn't seem to work. If you release stuff while it's still broken, it will get a bad reputation and people will dismiss it that time around and often not give it a chance again. So if there are show stopper bugs, they need attending to ASAP.

So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.

While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to 9 compression did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all round compromise level for lzma. This also was affecting the window sizes internally used by the lzma library. At level 5, the windows are 16MB in size, and go up to 64MB for level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread would have been needing up to 800MB each. This explained why people were getting lzma failing on window sizes and complaining despite apparently enough ram. While the new level does lose slightly on compression, it does also improve compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".

With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s

And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.

Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks

On version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1 to make decompression more robust for out-of-order storage in the archive. Well unfortunately this doubled the possible number of threads, and this brought a now all-too-familiar problem: running out of ram on big files. In .550 I changed it to only use one thread for stream 0 rather than multiple ones since it's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.

I put some effort into tackling the dreaded running out of ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and instead of failing completely, it will then wait for the previous thread to complete its compression before trying again. This serialises work that would have otherwise been in parallel. Thus less ram is required since only one thread is likely to be using ram. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resorting to bzip if it never succeeds. This should make running out of memory a non-fatal problem since it will just try again using less and being slower.

On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.

My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.

I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P

So I was very proud of this release and made it go gold as version 0.550...

About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.

So get it here:
http://freshmeat.net/projects/lrzip

No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.

Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.

EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.

---

Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.

Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A2 the "matches"

Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.

All this used to be done in series.

Version 0.530 made the last chopping up part be done in parallel since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However when once streams A0 and A1 would have finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before lrzip would move onto chunk B.

Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.

---

This means that on faster compression (lzo, gzip and bzip2), the rate limiting step is the stream production of A0 and A1. The pipe dream is to parallelise those, but they rely on using as much ram as possible to make the rzip stage effective so it would compromise compression greatly (and be messy as hell to implement).

Wednesday, 17 November 2010

lrzip-0.540 and multithreaded decompression.

Unable to put it down, I tackled the last big thing I had planned for lrzip in the short term future: Multithreaded decompression. In a similar fashion (but in reverse obviously) to how I tackled it on compression, I modified it to take the file and spit out chunks to as many threads as there are CPU, and let them all start decompressing each chunk independently. Then as soon as the first thread is complete, it hand its buffer over to the rzip stage and makes itself available to take on more chunks and so on.

This technique needs to have a file that was compressed with enough chunks in the first place, but the absolute minimum a file will already be is two chunks due to the 2 streams that rzip uses every time, but will obviously scale best with files already compressed with the multithreaded version.

So how does it perform? Well much like the compression side, the slower the backend compressor in use, the better the speedup with more CPUs.

Going to my regular benchmark, here are the before and after effects on decompression on quad core:

Options   Singlethread  Multithread
lrzip      4m32s         3m07s
lrzip -M   4m05s         2m50s
lrzip -l   3m12s         2m23s
lrzip -lM  2m57s         2m20s
lrzip -zM  04h08m        72m0s

Note that the -z compress/decompress had a slightly braindamaged screen output in v0.530, and in this benchmark above, so it actually didn't perform as well as it would in v0.540 since that's been fixed. Clearly, though, it's massively faster.

Summary: Kick arse.

So I'm very pleased with the performance of lrzip on SMP now. With the help of a friend online who had access to a 48 CPU machine (no he wasn't running BFS :\), we tried various files to see how much we could get lrzip to scale. To actually use all the CPUs, we needed a file large enough that had enough work to distribute to most of the CPUs, and then keep them busy for long enough to show the effect of that scaling.

lrzip -z linux-2.6.36.tar
real 1m6.552s
user 17m16.660s

So this ran about 17x faster than it would have run single threaded, but still wasn't large enough.

The kde source fit that reasonably well, at 1967093760 bytes.

lrzip -z kde.tar
real 4m33.285s
user 92m26.650s

This one ran over 20x faster than it would have run single threaded. I don't know what the upper limit will be, but clearly this has been a massive improvement in performance, and brings zpaq into usable speeds with enough cores.

Get it here (too lazy to post on freshmeat right now): lrzip.kolivas.org

Saturday, 13 November 2010

lrzip-0.530 and multithreading speed-ups.

As planned I finally manage to thread the compression phase of lrzip to parallelise as much as possible. Originally it used to feed data into a buffer which rzip acted on, and then when it was full it was handed over to the backend compressor, all of which were single threaded except for lzma which didn't even scale beyond 2 CPUs. The implementation in the latest released version of lrzip, 0.530, does it the way I mentioned in my previous blog post.

First it feeds data into the rzip buffer which preprocesses data until it has enough to pass onto a compression thread. Then it hands the data to a compression thread and continues reading more data and lets rzip work on it while the compression thread is doing the 2nd phase compression in the background. Once rzip has enough data for another thread, it spawns another thread and so on, until there are as many threads as CPUs, and then keeps reading until the first thread is free and reuses it, and so on.

Well the results of this are about as good as I could have hoped for. While the faster lzo compression backend only gains a small speedup, the slower the backend, the bigger the speedup. It becomes impressive once zpaq is in use, where I was able to get a 4x speedup on a quad core. That makes lrzip with zpaq almost as fast as regular xz! However, since zpaq takes just as long to decompress as it does to compress, and I haven't threaded the decompression phase, it ends up taking 4x longer to decompress than it did to compress (grin). So zpaq isn't -quite- at the usable stage just yet, but it may well be in the near future.

So what everyone has been waiting for (all 3 of you), benchmarks! 10GB virtual image file being compressed on a quad core 3GHz from an SSD.

Compression Size        Compress  Decompress
None        10737418240
gzip        2772899756   05m47s    2m46s
bzip2       2704781700   16m15s    6m19s
xz          2272322208   50m58s    3m52s
7z          2242897134   26m36s    5m41s
lrzip       1299228155   16m12s    4m32s
lrzip -M    1079682231   12m03s    4m05s
lrzip -l    1754694010   05m30s    3m12s
lrzip -lM   1414958844   05m15s    2m57s
lrzip -zM   1066902006   71m20s    04h08m

Kick arse.

Get it here, and remember freshmeat may not have updated their download links yet:
lrzip

Next stop, to parallelise the decompression phase. I doubt anything but zpaq will really benefit from this, but it would be great to have a zpaq based compression format that is useably fast.