Saturday, 26 March 2011

BFS and optimising a global scheduler for per-cpu frequency scaling

TL;DR BFS now faster with cpu frequency scaling

The last update to BFS brought some performance enhancements to threaded CPUs like the i5/i7. One thing that has troubled me for some time is that CPU frequency scaling, in the form of ondemand and friends, is based on the "local load" of each CPU: a CPU is scaled up in the presence of load on that CPU alone. BFS, being a global scheduler, has almost no notion of "local load". While there is some rudimentary code in BFS to indicate whether one CPU is currently slightly busier than the average, the fact that load in BFS is a whole-system concept rather than a per-CPU one means that CPU frequency scaling does not work as well as it could. So I set out to tackle exactly that issue with this latest test patch.

bfs363-test2.patch

In this patch, a special deadline selection process is invoked only when a "scaling" type CPU frequency governor is in use, on a per-CPU basis. This special selection process applies a kind of soft affinity to CPU bound tasks (i.e. ones that don't ever sleep) and tries to keep them on one CPU more than BFS normally would, to the point of allowing a CPU to go idle instead of pulling such a task away from its last CPU. It does this with a simple counter of the number of times the task would otherwise have been selected by another CPU, capped at the total number of other CPUs; once the counter exceeds that limit, the task is allowed to move. Tasks still move very easily compared to the mainline scheduler, which has a very strong notion of locality and works the other way around (moving tasks only when things look badly unbalanced). Freshly waking tasks ignore this entirely and behave as per usual BFS, thereby maintaining BFS' very low latency, and if a non-scaling governor is used (currently performance or powersave), BFS reverts to the regular deadline selection. This can slightly increase the latency of CPU bound tasks, but because the CPU they're bound to will now more readily increase its frequency with (for example) the ondemand governor, that latency is offset by faster completion of their work.

All in all it addresses the fact that CPU frequency scaling in mainline is designed around per-CPU scheduling, not the global scheduling BFS does. The most extreme example in testing, on a heavily threaded multicore CPU (an i7), shows a 30% increase in throughput for a single threaded task run on a 4 thread CPU. That's a massive change. Not only does the workload complete faster, but battery usage will almost certainly be reduced. In practice most workloads are more mixed than this, so don't expect a sudden 30% overall boost in performance. The patch also has no effect without CPU frequency scaling, and no effect on single CPU machines, but the more cores and threads you have, the greater the benefit. Since some new Linux distributions now no longer even allow you to change the governor and simply default to ondemand, this change will be essential. After a bit of testing, and possibly more tweaking, it will constitute the changes for the next BFS release.

Please test and report back how this patch performs for you if you have more than one core and run CPU frequency scaling.

Thursday, 24 March 2011

2.6.38-ck1, BFS 0.363 for 2.6.38

TL;DR This post isn't about lrzip at last.

I've ported BFS and all previous -ck patches to 2.6.38.

2.6.38-ck1 can be grabbed here:
2.6.38-ck1

Ubuntu packages here:
Ubuntu Packages

and BFS for 2.6.38 can be grabbed here:
BFS 2.6.38

Apart from slight architectural changes between the kernel versions, and YET ANOTHER mainline rewrite of the CPU offlining code for suspend to ram/disk (which always causes problems with porting BFS since I have to rewrite my own parts of that code), this is the same BFS v0.363 as per the last release.

There is no "autogrouping by SID" in BFS or -ck. I remain unconvinced of any tangible benefit of such an approach for real world usage, and wary of its potential for problems and of the inability to apportion CPU the way you actually want to.

Please report your experiences, but only if they're meaningful. I don't care how your PC performs if you do make -j 4096 unless you happen to have 4096 CPUs :)

Please enjoy! お楽しみ下さい

Wednesday, 23 March 2011

lrzip-0.600, STDIN/STDOUT and Encryption

TL;DR I haven't got a BFS for 2.6.38 yet.

The short short changelog: Lrzip now supports stdin/stdout fully on both compression and decompression without using temporary files. It also now supports very high grade password protected encryption, faster decompression and testing, big endian machines, and has had numerous minor fixes. There is still no progress on BFS for 2.6.38, but the fact that I've released this version of lrzip means I will likely be looking at it soon.

Lrzip was originally based on rzip code by Andrew Tridgell, and for a very long time full stdin and stdout support was considered so contrary to the design that it would be virtually impossible to implement. The reason is that it compressed data in two separate passes, working over chunks of data large enough to totally fill RAM: store one chunk, move on to the next, and so on; then go back and fill in the header details in the various compression chunks; and finally go back to the start of the file and write down the size of the file. This meant that unless the whole file, plus two separate compression phases each requiring RAM, could all fit into memory at once, it was impossible to stream a chunk out as it was completed. If that weren't bad enough, on decompression it would decompress each chunk, write parts to disk, decompress the next chunk, then refer -back- to parts already written to copy them into the new parts, and then move on to the next chunk. It's not hard to see why all that fast forwarding and rewinding over the file, as it's being read or written, makes reading from stdin and writing to stdout impossible. Basically, unless you had about 6 times as much RAM as the size of the file you're compressing, it was impossible to fit it all into memory and work on it at once.

One thing I've always kept in mind while developing lrzip is to design a compression format that scales. By scales I mean that as CPUs get faster, computers get more RAM, hard drives get faster, and files get bigger, the compression should actually get better. One of the most limiting things about other compression formats is that they produce much the same output size regardless of how capable the machine they run on is, and they have not scaled well with increasing CPU cores. For some time it has occurred to me that one day PCs would have enough RAM that fitting all of a file's data into memory, as described above, would no longer be the limiting factor. Of course files will always get larger too, and there will always be files larger than the amount of RAM you have. To make lrzip work on large files, it was necessary to rewrite the file format so that it could produce "chunks" of independent data, write them to stdout, and move on. The major rework that went into this version of lrzip does precisely that: it carefully works out how much RAM it can allocate to a temporary input buffer, a temporary output buffer, the rzip first stage, and multiple back end compression stages, then constructs the archive to/from the temporary buffers before dumping each chunk to disk.

Implementing the temporary buffers was no minor task either. Each phase had to account for whether there would be multiple reads and writes, whether to modify data in RAM or on disk, and how to safely keep track of it all. Each combination of compress/decompress with stdin/stdout needed a different implementation, so I implemented them incrementally, one after the other. One caveat remains, though: you might end up decompressing a file to STDOUT on a machine with MUCH less RAM than the machine it was compressed on. So I also needed to implement a fallback where, if it all failed to fit in RAM, lrzip could still safely dump the data out to temporary files on disk. This is required on decompression from STDIN, which creates a temporary file, and on decompression to STDOUT (especially if the file wasn't generated as a STDOUT archive in the first place). As you can imagine, all this took a lot of time and testing, but it all appears to be working nicely now. Furthermore, if you are unlucky enough to need temporary files for the reasons above, lrzip can now "chunk" them as well, so it uses less temporary space than before. It's important to note that when using STDIN+STDOUT, compression of files larger than about 25% of RAM will not be as good as compressing the file directly on disk. The reason comes down to how much can fit in RAM at any one time and how large a compression window can be allocated. Nonetheless, 25% of RAM on a modern machine is a fair chunk (and gets better every year :)
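The "work in RAM, spill to disk only if we must" fallback can be sketched like this. The names and the exact policy are my own illustration, not lrzip's actual code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of a RAM-first scratch buffer with a temp-file
   fallback, as described above. Not lrzip's real implementation. */
struct scratch {
    unsigned char *ram;  /* non-NULL when the buffer fits in memory */
    FILE *tmp;           /* temp file, used only when malloc() fails */
};

/* Returns 0 for the RAM fast path, 1 for the temp-file fallback,
   -1 on total failure. */
static int scratch_open(struct scratch *s, size_t want)
{
    s->tmp = NULL;
    s->ram = malloc(want);
    if (s->ram)
        return 0;            /* fast path: all reads/writes in RAM */
    s->tmp = tmpfile();      /* slow path: unnamed temp file on disk */
    return s->tmp ? 1 : -1;
}

static void scratch_close(struct scratch *s)
{
    free(s->ram);
    if (s->tmp)
        fclose(s->tmp);      /* unnamed temp file vanishes on close */
}
```

The decompress-to-STDOUT case on a low-RAM machine is exactly the situation where the `tmpfile()` branch would be taken.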

One of the bonuses of implementing decompression to a temporary buffer was that it was possible to modify even the normal decompression to use this as much as possible. As decompression writes to disk and then reads back from the file it writes to, this can be VERY slow if it involves lots of small reads and writes (and this sometimes happens with large files with lots of small similar portions). Doing this in RAM is much faster, and if you're just testing the file, there is no need to even write it to disk in the end. Thus decompression and especially testing received a healthy speed-up as a result.

After finally getting all that out of the way, having created a new file format to accommodate it, it seemed a shame to have to change the format again in the near future to support the second most requested lrzip feature - encryption. So instead of releasing it as it was, I started seeking advice on how to implement encryption and was lucky enough to recruit the help of Serge Belyshev, who had been reading up on encryption techniques. Unfortunately it also meant that instead of just slapping together basic password protection and a simple encryption mechanism, we ended up generating some absolutely ludicrously secure encryption for lrzip.

You're probably aware that there are two basic ways to circumvent encryption: brute force and backdoors. If we stick to encryption techniques with no known current weaknesses and no backdoors, the only way to break them is brute force: attempting as many password combinations as possible in the shortest possible time to see if any is correct. The average modern CPU can test about 1 million sha512-hashed passwords per second. However, if you multiply hash the password (i.e. hash it over and over with slight changes each round), then an attacker has to do the same repeated hashing for every password they test. Hash it 1 million times, and each password attempt takes a full second. Of course this only holds for today's PCs, so we decided to scale the amount of hashing by date: Moore's law predicts how much CPUs increase in power each year, so we look at the system date when hashing the password and decide from it how many times to hash. That means that 10 years from now (if Moore's law still holds, as it has for 50 years), it will still take about 1 second per password attempt on a newly encrypted file.

After hashing the password with sha512 millions of times, we add random salt to each compression block after the second compression stage and encrypt the data with AES128 block encryption. Then, and this was the hardest part of all, once each rzip chunk is complete, we go back and overwrite each block header, encrypted with yet more random salt, and finally encrypt even the md5 sum at the end. You end up with a file that is recognisable by its header as an lrzip file, but contains essentially random, indecipherable data after that: there is no way of knowing how many chunks the data is split into, how compressed it is, or how big the final file is. It also means each encrypted section is salted differently, so compressing exactly the same file with exactly the same password will (virtually) never produce the same file twice. To be honest I think we overdid it by anyone's standards, but since I was implementing it, it seemed illogical to do a half-arsed effort. By far the weakest link in this equation is the password you choose when encrypting, though lrzip accepts passwords up to 500 bytes in length. If you lose your password, consider the data irrevocably lost, as there is no known way of retrieving it.

Michael Blumenkrantz was also kind enough to hack on lrzip and implemented a completely new autoconf style configuration mechanism which should stand lrzip in good stead for the future, being more robust and flexible. He then went on to abstract out most functions in a way that will make it possible for him to implement a liblrzip library in the future. That's a rather exciting prospect, but I'm glad it's him and not me that wants to implement it ;)

Finally, with more help, we tracked down some big endian bugs that had crept in and made lrzip not work on BE machines. I did some testing on my wife's Mac to make sure it built and that normal compression/decompression and encryption all worked, and fixed up and tweaked too many other things in the code to mention.

All in all I'm very pleased with the result, and as much as I'd love for it to be super stable and bug free, no doubt something will show up.

So do your worst, have fun and make sure to tell everyone where you got it:
LRZIP FRESHMEAT

And now a small rest before I look at the kernel again...

EDIT: In the install it seems we forgot the lrzuntar, lrunzip wrappers and the extra manpages. I have rushed out a v0.601 release with these changes.

Thursday, 17 March 2011

2.6.38 and BFS/-ck releases

2.6.38 came out faster than I was expecting, and I've made no effort as yet to port BFS or -ck to it. It looked like there were still bug reports against the last -rc on the linux kernel mailing list, so I wasn't expecting the "stable" release to come out yet. Furthermore, I've been working very hard on the next major release of lrzip, which is a massive rewrite, so that has completely occupied my time and I'm not done with it yet. There will likely be quite some delay before I can release an official BFS/-ck for 2.6.38. I also want to watch what fallout, if any, there is from the autogrouping code and decide what to do with respect to that feature. I'm sure some unofficial BFS ports will be available soon enough, but obviously I won't be able to say with any confidence whether they can be trusted or not.