Wednesday, 17 November 2010

Create task groups by TTY comment

I've had every man and his dog either drop into IRC or email me asking for my thoughts on the patch to group tasks by tty layer, discussed here: Phoronix link and slashdot. I guess people didn't understand my 2.6.36-ck1 announcement clearly enough, so I'll quote it again here:

Those following the development of the patches for interactivity at massive
load, I have COMPLETELY DROPPED them as they introduce regressions at normal
workloads, and I cannot under any circumstances approve changes to improve
behaviour at ridiculous workloads which affect regular ones. I still see
precisely zero point at optimising for absurd workloads. Proving how many
un-niced jobs you can throw at your kernel compiles is not a measure of one's
prowess. It is just a mindless test. 

Remember, I had already developed a hierarchical tree-based penalty patch for BFS and blogged about it here. I can do it in a 10 line patch for BFS, but it introduced regressions, which is why I dropped it (see earlier blog entry here: further-updates-on-hierarchical-tree).

Again, I can't for the life of me see why you'd optimise for make -j64 on a quad core machine. It is one workload, unique to people who compile all the time, but done in a way you wouldn't normally do it anyway. It is not going to magically make anything else better. If for some god-forsaken reason you wanted to do that, you could already do that with nice, or even better, by running it SCHED_IDLEPRIO.

nice -n 19 make -j64 blahblah
or
schedtool -D -e make -j64 blahblah

It's not really that hard folks...

And if you really really really still want the feature for BFS, the patch that does the hierarchical tree based penalty is rolled into a bigger patch (so a lot more than just the 10 lines I mentioned) that can also group threads and enable/disable the features and it's still here:

bfs357-penalise_fork_depth_account_threads.patch

It is worth noting also that the mainline approach costs you in throughput, whereas this patch is virtually free.

EDIT: I forgot to mention that for YEARS now I've been using my "toolsched" wrapper scripts that do this automatically. See toolsched for the scripts. Make always starts as SCHED_IDLEPRIO for me at home.

lrzip-0.540 and multithreaded decompression.

Unable to put it down, I tackled the last big thing I had planned for lrzip in the short term: multithreaded decompression. In a similar fashion (but in reverse, obviously) to how I tackled it on compression, I modified it to take the file and spit out chunks to as many threads as there are CPUs, and let them all start decompressing each chunk independently. Then, as soon as the first thread is complete, it hands its buffer over to the rzip stage and makes itself available to take on more chunks, and so on.

This technique needs a file that was compressed with enough chunks in the first place. The absolute minimum any file will have is two chunks, due to the 2 streams that rzip uses every time, but it will obviously scale best with files already compressed by the multithreaded version.
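lrzip itself is written in C, but the dispatch pattern above can be sketched in a few lines of Python. Everything here is an illustrative stand-in, not lrzip's actual code: `decompress_chunk` just lower-cases text in place of running a real backend decompressor.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the backend decompressor (lzma/zpaq/etc. in real lrzip):
# lower-casing "undoes" a pretend upper-casing compression.
def decompress_chunk(chunk):
    return chunk.lower()

def decompress_stream(chunks, cpus=4):
    # Hand every chunk to a worker thread up front, but consume results
    # strictly in submission order, so the rzip stage always sees
    # chunk i before chunk i+1, as described above.
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        futures = [pool.submit(decompress_chunk, c) for c in chunks]
        return "".join(f.result() for f in futures)

print(decompress_stream(["ALPHA ", "BETA ", "GAMMA ", "DELTA"]))
# -> alpha beta gamma delta
```

The key point is that the workers run concurrently, while the in-order consumption keeps the output stream identical to the single-threaded case.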

So how does it perform? Well, much like the compression side: the slower the backend compressor in use, the bigger the speedup with more CPUs.

Going to my regular benchmark, here are the before and after effects on decompression on quad core:

Options    Singlethread  Multithread
lrzip      4m32s         3m07s
lrzip -M   4m05s         2m50s
lrzip -l   3m12s         2m23s
lrzip -lM  2m57s         2m20s
lrzip -zM  4h08m         1h12m

Note that the -z compress/decompress had slightly braindamaged screen output in v0.530, including in the benchmark above, so it didn't perform quite as well there as it will in v0.540, where that's been fixed. Clearly, though, it's massively faster.

Summary: Kick arse.

So I'm very pleased with the performance of lrzip on SMP now. With the help of a friend online who had access to a 48 CPU machine (no he wasn't running BFS :\), we tried various files to see how much we could get lrzip to scale. To actually use all the CPUs, we needed a file large enough that had enough work to distribute to most of the CPUs, and then keep them busy for long enough to show the effect of that scaling.

lrzip -z linux-2.6.36.tar
real 1m6.552s
user 17m16.660s

So this ran about 17x faster than it would have run single threaded, but still wasn't large enough.

The kde source fit that reasonably well, at 1967093760 bytes.

lrzip -z kde.tar
real 4m33.285s
user 92m26.650s

This one ran over 20x faster than it would have run single threaded. I don't know what the upper limit will be, but clearly this has been a massive improvement in performance, and brings zpaq into usable speeds with enough cores.

Get it here (too lazy to post on freshmeat right now): lrzip.kolivas.org

Saturday, 13 November 2010

lrzip-0.530 and multithreading speed-ups.

As planned, I finally managed to thread the compression phase of lrzip to parallelise it as much as possible. Originally it fed data into a buffer which rzip acted on, and when that was full it was handed over to the backend compressor, all of which were single threaded except for lzma, which didn't even scale beyond 2 CPUs. The implementation in the latest released version of lrzip, 0.530, does it the way I mentioned in my previous blog post.

First it feeds data into the rzip buffer which preprocesses data until it has enough to pass onto a compression thread. Then it hands the data to a compression thread and continues reading more data and lets rzip work on it while the compression thread is doing the 2nd phase compression in the background. Once rzip has enough data for another thread, it spawns another thread and so on, until there are as many threads as CPUs, and then keeps reading until the first thread is free and reuses it, and so on.
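The shape of that pipeline can be sketched in Python. This is a simplification under stated assumptions: zlib stands in for the real backends, the fixed `chunk_size` slicing stands in for rzip deciding it "has enough data", and `compress_stream` is a made-up name, not anything in lrzip.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_stream(data, chunk_size, cpus=4):
    # Cut the input into chunk_size pieces (standing in for what the
    # rzip preprocessing stage hands over) and give each piece to a
    # worker thread. The pool spawns up to `cpus` workers and reuses a
    # thread as soon as it is free, like the "first thread free" reuse
    # described above.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        return list(pool.map(zlib.compress, chunks))

pieces = compress_stream(b"some highly compressible data " * 1000, 8192)
print(len(pieces), "compressed chunks")
# -> 4 compressed chunks
```

The win comes from the backend compression of chunk N running in the background while the preprocessing of chunk N+1 is still going on.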

Well, the results of this are about as good as I could have hoped for. While the fast lzo compression backend only gains a small speedup, the slower the backend, the bigger the speedup. It becomes impressive once zpaq is in use, where I was able to get a 4x speedup on a quad core. That makes lrzip with zpaq almost as fast as regular xz! However, since zpaq takes just as long to decompress as it does to compress, and I haven't threaded the decompression phase, it ends up taking 4x longer to decompress than it did to compress (grin). So zpaq isn't -quite- at the usable stage just yet, but it may well be in the near future.

So what everyone has been waiting for (all 3 of you), benchmarks! 10GB virtual image file being compressed on a quad core 3GHz from an SSD.

Compressor  Size (bytes)  Compress  Decompress
None        10737418240
gzip        2772899756    5m47s     2m46s
bzip2       2704781700    16m15s    6m19s
xz          2272322208    50m58s    3m52s
7z          2242897134    26m36s    5m41s
lrzip       1299228155    16m12s    4m32s
lrzip -M    1079682231    12m03s    4m05s
lrzip -l    1754694010    5m30s     3m12s
lrzip -lM   1414958844    5m15s     2m57s
lrzip -zM   1066902006    71m20s    4h08m

Kick arse.

Get it here, and remember freshmeat may not have updated their download links yet:
lrzip

Next stop, to parallelise the decompression phase. I doubt anything but zpaq will really benefit from this, but it would be great to have a zpaq-based compression format that is usably fast.

Thursday, 11 November 2010

lrzip scalability improvements

I've been looking at the benchmarks I've accumulated for lrzip's performance and have been mostly pleased. The speed/size trade-off, and the pure size performance, is very good compared to other compression programs, but not much has been done to date to make it scale better on SMP. It scales slightly during lzma compression, but even there it's via a clunky threading library bolted on, which pushes CPU usage to a bit over 2 CPUs for about 50% better performance.

So I've set out to make significant improvements to the scalability of lrzip on SMP, knowing it will slightly adversely affect compression. There are two stages to compression in lrzip: the preprocessing rzip stage and the backend compression 2nd stage. Improving the scalability of the first stage looks highly unlikely (I'll never admit to anything being impossible). However, if I split up the chunks handed to the backend compressor and spread them over each thread, it should scale better. So the first step was precisely that: grab all the data from rzip, split it into blocks, and use every CPU available, provided each block is at least a minimum size, to preserve compression efficiency. I've done precisely that and committed it to the git tree:

lrzip repo
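The minimum-block-size trade-off just described might look something like this hypothetical helper (a sketch of the idea, not lrzip's actual code; the name `split_blocks` and its thresholds are made up):

```python
def split_blocks(total, cpus, min_block):
    # Split `total` bytes over up to `cpus` backend threads, but never
    # let a block drop below min_block, since blocks that are too small
    # hurt compression efficiency.
    threads = max(1, min(cpus, total // min_block))
    block = -(-total // threads)  # ceiling division
    return threads, block

MB = 1 << 20
print(split_blocks(400 * MB, 4, 32 * MB))  # big input: all 4 CPUs get a share
print(split_blocks(50 * MB, 4, 32 * MB))   # small input: one larger block instead
```

With plenty of data every CPU gets a block; with a small input it falls back to fewer, larger blocks rather than shredding the data into pieces too small to compress well.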

Now it would be nice if this made the backend compression 4x faster on quad core, but there are a number of reasons that doesn't happen. Mainly, the threads are all competing for cache and memory bandwidth, and (less significantly) the blocks compress to different degrees, so not all threads finish at the same time. Still, the classic 23 minute 10GB benchmark dropped to 18 minutes, which is a fairly significant improvement given that the previous lzma compression was already lightly threaded. The file was only marginally bigger, so it's definitely a win.

Next up is to try and break up the rzip stream and hand it to the compression backend before the stream has even finished, to parallelise the rzip phase with the compression phase. This likely won't use *every* CPU at the same time (except in the case of zpaq), but it will actually be faster than the previous approach. This work is much more invasive than the previous one and is currently underway.

After that I'm going to look at decompression and try and parallelise the back end decompression phase. While decompression is already quite fast, who can ever complain about more speed?

All in all, this has been loads of fun to hack on. I never thought I'd say this, but maybe it would be nice to have a sea change and get paid to hack part-time and do medicine only part-time.