I've been looking at the benchmarks I've accumulated for lrzip's performance and have been mostly pleased. The speed/size performance or pure size performance is very good compared to other compression algorithms, but not much has been done to date to make it scale better on SMP. It scales slightly during lzma compression, but even there it's a clunky threading library attached to it that increases the CPU usage up to a bit over 2 CPUs for about 50% better performance.
So I've set out to make significant improvements to the scalability of lrzip on SMP, knowing it will slightly adversely affect compression. There are two stages to compression on lrzip, the preprocessing rzip stage and the back end compression 2nd stage. Improving the scalability of the first stage looks highly improbable (I'll never admit to anything being impossible). However, if I split up the chunks handed to the back end compressor and spread it over each thread, it should scale better. So the first step I did here is to do precisely that: Grab all the data from rzip and then split it up to a fixed number of block sizes and use every CPU available, provided each block is at least a minimum size to preserve compression efficiency. I've done precisely that and committed it to the git tree:
Now it would be nice if this made the backend compression 4 x faster on quad core, but there are a number of reasons this doesn't happen. Mainly it's because they're all competing for cache and memory bandwidth, and (less significantly) the blocks end up compressing different degrees so not all threads finish at the same time. Still, the classic 23 minute 10GB benchmark now dropped to 18 minutes which is a fairly significant improvement given that previous lzma compression was already lightly threaded. The file was only marginally bigger so it's definitely a win.
Next up is to try and break up the rzip stream and hand it to the compression back end before the stream has even finished, to parallelise the rzip phase with the compression phase. This will not likely get to use *every* CPU at the same time (except in the case of zpaq), but will actually be faster than the previous approach. This work is much more invasive than the previous one and is currently underway.
After that I'm going to look at decompression and try and parallelise the back end decompression phase. While decompression is already quite fast, who can ever complain about more speed?
All in all this has been loads of fun to hack on. I'd never thought I'd say this, but maybe it would be nice to have a sea change and get paid to hack part-time and do medicine only part-time.
Post a Comment