TL;DR BFS now faster with cpu frequency scaling
The last update to BFS brought some performance enhancements to threaded CPUs like the i5/i7 etc. One thing that has troubled me for some time with BFS is that CPU frequency scaling, in the form of ondemand and friends, is based on the "local load" on a CPU, and scales one CPU up in the presence of load on that CPU alone, yet BFS has almost no notion of "local load" since it's a global scheduler. While there is some rudimentary code in BFS to say whether one CPU is currently slightly busier than the average, the fact that load is a whole-system concept rather than a per-CPU one in BFS means that CPU frequency scaling is not going to work as well. So I set out to tackle that exact issue with this latest test patch.
bfs363-test2.patch
In this patch, a special deadline selection process is invoked, on a per-CPU basis, only when a "scaling" type CPU frequency governor is in use. This special deadline selection process creates a kind of soft affinity for CPU-bound tasks (i.e. ones that don't ever sleep) and tries to keep them on one CPU over and above the normal amount in BFS, to the point of allowing a CPU to go idle instead of pulling such a task away from its last CPU. It does this with a simple counter of the number of times the task would otherwise have been selected by another CPU, kept under the total number of other CPUs. If the counter goes over that limit, the task is allowed to move. Tasks still move very easily compared to the mainline scheduler, which has a very strong notion of locality and works the other way around (moving tasks only if things look really badly unbalanced). Freshly waking tasks completely ignore this and behave as per usual BFS, thereby maintaining BFS' very low latency, and if a non-scaling governor is used (currently performance or powersave), BFS reverts to the regular deadline selection. This can slightly increase the latency of CPU-bound tasks, but because the CPU they're bound to will now more easily increase its frequency with (for example) the ondemand governor, this latency is offset by faster completion of their work.
All in all it addresses the fact that CPU frequency scaling in mainline is designed around per-CPU scheduling, not the global scheduling that BFS does. The most extreme example of this in testing on a heavily threaded multicore CPU (i7) shows a 30% increase in throughput on a single threaded task when run on a 4 thread CPU. That's a massive change. Not only will the workload complete faster, but battery usage will almost certainly be reduced. In practice most workloads are more mixed than this, so don't expect a sudden 30% overall boost in performance. It also has no effect without CPU frequency scaling, and no effect on single CPU machines, but the more cores and threads you have, the greater the benefit. Since some new Linux distributions now no longer even allow you to change the governor and just set it to ondemand by default, this change will be essential. After a bit of testing, and possibly more tweaking, it will constitute the changes for the next BFS release.
Please test and report back how this patch performs for you if you have more than one core and run CPU frequency scaling.
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Saturday, 26 March 2011
Thursday, 24 March 2011
2.6.38-ck1, BFS 0.363 for 2.6.38
TL;DR This post isn't about lrzip at last.
I've ported BFS and all previous -ck patches to 2.6.38.
2.6.38-ck1 can be grabbed here:
2.6.38-ck1
Ubuntu packages here:
Ubuntu Packages
and BFS for 2.6.38 can be grabbed here:
BFS 2.6.38
Apart from slight architectural changes between the kernel versions, and YET ANOTHER mainline rewrite of the CPU offlining code for suspend to ram/disk (which always causes problems with porting BFS since I have to rewrite my own parts of that code), this is the same BFS v0.363 as per the last release.
There is no "autogrouping by SID" in BFS or -ck. I remain unconvinced of any tangible benefit of such an approach for real world usage, given the potential for problems and the inability to apportion CPU when you actually want to.
Please report your experiences, but only if they're meaningful. I don't care how your PC performs if you do make -j 4096 unless you happen to have 4096 CPUs :)
Please enjoy! お楽しみ下さい
Wednesday, 23 March 2011
lrzip-0.600, STDIN/STDOUT and Encryption
TL;DR I haven't got a BFS for 2.6.38 yet.
The short short changelog: Lrzip now supports stdin/stdout fully on both compression and decompression without using temporary files. It also now supports very high grade password protected encryption, faster decompression and testing, big endian machines, and has had numerous minor fixes. There is still no progress on BFS for 2.6.38, but the fact that I've released this version of lrzip means I will likely be looking at it soon.
Lrzip was originally based on rzip code by Andrew Tridgell, and for a very long time full stdin and stdout support was considered so orthogonal to the design that it would be virtually impossible to implement. The reason is that it compressed data in 2 separate passes, working over chunks of data that would totally fill RAM: it would store a chunk, move on to the next chunk and so on, then go back and fill in all the header details in the various compression chunks, and finally go back to the start of the file to write down the size of the file. This meant that unless the whole file, plus the two separate compression phases that each require RAM, could all fit into RAM at once, it was impossible to avoid seeking back over the output. If that wasn't bad enough, on decompression it would decompress each chunk, write parts to disk, decompress the next chunk, refer -back- to parts already written to copy them into the new parts, then move on to the next chunk. It's not hard to see why all that fast forwarding and rewinding on the file as it's being read or written makes it impossible to read from stdin and write to stdout. Basically, unless you had about 6 times as much RAM as the size of the file you're compressing, it was impossible to fit all that into memory and work on it at once.
One thing I've always kept in mind when developing lrzip is to design a compression format that scales. By scales I mean that as CPUs get faster, computers get more RAM, hard drives get faster, and files get bigger, the compression should actually get better. One of the most limiting things about other compression formats is that they generate pretty much the same size regardless of how good the machine they're working on is. They also have not scaled very well with increasing CPU cores. For some time it has occurred to me that one day PCs would have enough RAM that fitting all the data of a file into RAM, as I described above, would no longer be the limiting factor. Of course files will always get larger as well, and there will always be files larger than the amount of RAM you have. To make lrzip work on large files, it was going to be necessary to rewrite the file format such that it could do "chunks" of independent data, write them to stdout, and then move on. So the major rework that went into this version of lrzip does precisely that. It carefully tries to figure out how much RAM it can allocate to a temporary input buffer, a temporary output buffer, the rzip first stage and the multiple back end compression stages, and then starts constructing the archive to/from the temporary buffers before dumping each chunk to disk.
Implementing the temporary buffers was no minor task either. Each phase had to take into account whether there would be multiple reads/writes, whether to modify in RAM or on disk, and safely keep track of it all. Each different combination of compress/decompress and stdin/stdout needed a different implementation, so I implemented them incrementally, one after the other. One caveat remains, though: you might end up decompressing a file to STDOUT on a machine with MUCH less RAM than the machine it was compressed on. So I also needed to implement a fallback where, if it all failed to fit in RAM, lrzip could still safely dump to temporary files on disk. This is required on decompression from STDIN (to create a temporary file there) and on decompression to STDOUT (especially if the file wasn't generated via a STDOUT archive in the first place). As you can imagine, all this took a lot of time and testing, but it all appears to be working nicely now. Furthermore, if you are unlucky enough to need temporary files for the aforementioned reason, lrzip can now "chunk" them as well, so it will use less temporary space than before. It's important to note when using STDIN+STDOUT that compression of files greater than about 25% of RAM size will not be as good as when you compress the file directly on disk. The reason has to do with just how much you can fit in RAM at any one time and what size compression window can be allocated. Nonetheless 25% of RAM on a modern machine is a fair chunk (and gets better every year :)
One of the bonuses of implementing decompression to a temporary buffer was that it was possible to modify even the normal decompression to use this as much as possible. As decompression writes to disk and then reads back from the file it writes to, this can be VERY slow if it involves lots of small reads and writes (and this sometimes happens with large files with lots of small similar portions). Doing this in RAM is much faster, and if you're just testing the file, there is no need to even write it to disk in the end. Thus decompression and especially testing received a healthy speed-up as a result.
After finally getting all that out of the way, and having had to create a new file format to accommodate it, it seemed a shame to have to change the file format again in the near future to support the 2nd most requested feature of lrzip - encryption. So instead of releasing it as it was, I started getting advice on how I might implement encryption and was lucky enough to recruit the help of Serge Belyshev, who had actually been reading all about encryption techniques. Unfortunately it also meant that instead of just slapping together some basic password protection and a simple encryption mechanism, we ended up trying to generate some absolutely ludicrously secure encryption for lrzip.
You're probably aware that there are 2 basic ways to circumvent encryption: brute force and backdoors. If we stick to encryption techniques with no known current weaknesses and no backdoors, then the only way to break them is brute force: attempting as many password combinations as you can in the shortest possible time to see if they're correct. The average modern CPU can test about 1 million hashed passwords per second using sha512. However if you multiply-hash the password (i.e. hash it over and over with slight changes), then it has to be multiply-hashed the same way when testing a password. So if you multiply-hash it 1 million times, it takes one second just to try each password. Of course this is only for today's PCs, so I figured we could scale the amount of hashing by date. Moore's law predicts how much CPUs increase in power each year, so we look at the system date when hashing the password and use it to decide how many times to hash. That means that 10 years from now (if Moore's law still holds, as it has for 50 years now), it will still take about 1 second to try each password against a newly encrypted file.
After hashing the password with sha512 millions of times, we added random salt to each compression block after the 2nd compression stage and used AES128 block encryption on the data. Then, and this was the hardest part of all, once each rzip chunk is complete, we go back and overwrite each block header encrypted with yet more random salt, and finally encrypt even the md5 sum at the end. This means you end up with a file that is recognisable by its header as an lrzip file, but has basically random, indecipherable data after that, so there is no way of knowing how many chunks the data is in, how compressed it is, or how big the final file is. It also means that each encrypted section is encrypted differently, and if you compress exactly the same file with exactly the same password, it will (virtually) never produce the same file twice. To be honest I think we overdid it by anyone's standards, but since I was implementing it, it seemed illogical to do a half-arsed effort. By far the weakest link in this equation is the password you choose when encrypting, though it does accept passwords up to 500 bytes in length. If you lose your password, consider the data irrevocably lost, as there is no known way of retrieving it.
Michael Blumenkrantz was also kind enough to hack on lrzip and implemented a completely new autoconf style configuration mechanism which should hold lrzip in good stead for the future, being more robust and flexible. He then went on to abstract out most functions in a way that will make it possible for him to implement a liblrzip library in the future. That's a rather exciting prospect, but I'm glad it's him and not me that wants to implement that ;)
Finally, with more help, we tracked down some big endian bugs that had crept in and made lrzip not work on BE machines. I did some testing on my wife's Mac to make sure it built and that all of the normal compression/decompression and encryption worked, and fixed up and tweaked too many other things in the code to mention.
All in all I'm very pleased with the result, and as much as I'd love for it to be super stable and bug free, no doubt something will show up.
So do your worst, have fun and make sure to tell everyone where you got it:
LRZIP FRESHMEAT
And now a small rest before I look at the kernel again...
EDIT: In the install it seems we forgot the lrzuntar, lrunzip wrappers and the extra manpages. I have rushed out a v0.601 release with these changes.
Thursday, 17 March 2011
2.6.38 and BFS/-ck releases
2.6.38 came out faster than I was expecting, and I've made no effort as of yet to port BFS or -ck to it. It looked like there were still bug reports on the last -rc on the linux kernel mailing list so I wasn't expecting the "stable" release to come out yet. Furthermore, I've been working very hard on the next major release of lrzip which is a massive rewrite so that has completely occupied my time and I am not done with it yet. So there will likely be quite some delay before I can release an official BFS/CK for 2.6.38. I also would like to watch what fallout, if any, there is from the autogrouping code and decide what I should do with respect to that feature. I'm sure some unofficial BFS ports will be available soon enough but obviously I won't be able to say with any confidence whether they can be trusted or not.
Tuesday, 8 March 2011
lrzip 0.571
I had set out to do some major surgery to lrzip to make it work better with STDIN/STDOUT and not use physical temporary files at all. However, as I mentioned in my last blog post, my approach of using shared memory as files was a failure as it only really worked on linux. So after much angst, I decided to simply put together all the minor bugfixes that I had accumulated and bring out a new version. So here is a mainly small bugfix release lrzip version 0.571:
lrzip on freshmeat
On a related note I chanced upon this article recently which compared compression tools:
best-linux-compression-tool-8-utilities-tested
TL;DR lrzip wins
Suffice to say that lrzip did pretty well, despite them using a much older version that predates any of the multi-threading enhancements!
Monday, 7 March 2011
Shared memory on Darwin and lrzip
So I've spent the last month working on lrzip trying to make it work without temporary files when working from STDIN and/or to STDOUT. Virtually all of that work has gone into creating pseudo files in memory using POSIX shared memory in the form of shm_open() and friends. To complete the work, I even needed to upgrade the file format used when files are dumped to stdout beyond a certain size. As the code has taken shape and is almost finalised, I have tried to get as much testing as possible before making a release. Lo and behold, OS X screws me over yet again. I discovered that the so-called portable POSIX shared memory that Darwin is supposed to support is not fully featured. By that I mean I can open an shm file with shm_open() and then I can mmap() it, but I can't actually access the file as a file... which was entirely the point. Instead I get the meaningless "operation not supported" when trying to write to the file, or a failure to seek on it entirely. What's more, shm_open() doesn't really work with create and truncate at the same time either... So for now, Darwin will be stuck on physical temporary files while the rest of us move on. Sigh... Someone pointed out that treating the shared memory object as a file may not actually be part of the POSIX guarantee at all, so once again I've gotten carried away with something that isn't portable.
EDIT: After a bit more thought I'm going to abandon all this use of shared memory and reconsider how I might do stdin/out without temporary files yet again.
Friday, 25 February 2011
lrzip-0.570 for the uncorrupt.
When I last blogged about lrzip, I mentioned the corruption on decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly so I set out on a major hunt to eliminate it, and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short, the corruption was NEVER THERE.
The problem he was encountering: on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command, and there would be differences at random places in the file. This made me think there was a memory corruption somewhere in lrzip. However he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze, so we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on the copy. Lo and behold, just copying a file of that size would randomly produce a file with differences in it. This is a disturbing bug, and had it been confined to one machine it would have pointed the finger at the hardware. However he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway, this particular problem now needs to be tackled elsewhere (i.e. Debian).
Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression, when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently used crc32 internally to check the integrity of each decompressed block before writing it to disk. crc32 still has its place and is very simple, but it has quite a few collisions once you have files of gigabyte size (collisions being different files with the same CRC value). Fortunately, even with a hash as simple as CRC, if only one byte changes in a file the value will never be the same. However the CRC was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5.

After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an md5 check during the compression phase and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so the file format remains compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect the most powerful hash function present in the archive and use that to check the integrity of the data. One major advantage of md5 is that you can also use md5sum, which is standard on all modern linux installations, to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further and added an option to lrzip (-c) to actually do an md5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian Lenny bug was what made me think this would be a useful feature. I've also added the ability to display the md5 hash value with a new -H option, even if the archive was not originally stored with an md5 value.
One broken "feature" for a while now has been multi-threading on OS X. I have blogged previously about how OS X will happily compile software that uses unnamed semaphores, yet when you try to run the program it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me that I didn't need semaphores at all and could make do with pthread mutexes, which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS X works nicely. I've had one user report that it scales very well on his 8-way machine.
Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.
Here's the latest results from my standard 10GB virtual image compression test:
Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:
I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.
That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip
The problem he was encountering: on decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command, and there would be differences in the file at random places. This made me think there was memory corruption somewhere in lrzip. However, he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze, so we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on it. Lo and behold, just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, it would have pointed the finger at the hardware. However, he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway, this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).
Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Until now, lrzip has silently used CRC32 internally to check the integrity of each decompressed block before writing it to disk. CRC32 still has its place and is very simple, but it produces quite a few collisions once files reach gigabyte sizes (collisions being different files that share the same CRC value). Fortunately, even with a hash as simple as CRC32, if only one byte changes in a file the value will never be the same. However, the CRC was only being done on each decompressed chunk and not the whole file. So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an MD5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so the file format is still compatible with all previous 0.5 versions of lrzip; I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect the strongest hash function present in the archive and use that to check the integrity of the data. One major advantage of MD5 is that you can also use md5sum, which is standard on all modern Linux installations, to compare the value to that stored in the archive on either compression or decompression. I took this idea one step further and added an option to lrzip (-c) to actually run MD5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian Lenny bug was what made me think this would be a useful feature. I've also added the ability to display the MD5 hash value with a new -H option, even if the archive was not originally stored with an MD5 value.
One broken "feature" for a while now has been multi-threading on OS X. I have blogged previously about how OS X will happily compile software that uses unnamed semaphores, yet when you try to run the program it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me that I didn't need semaphores at all and could make do with pthread mutexes, which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS X works nicely. I've had one user report that it scales very well on his 8-way machine.
Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation, so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash checks (both MD5 and CRC), which would otherwise have slowed things down further, are now done only on already-buffered parts of the main file. On a file larger than your available RAM, this gives a major speed-up. Multi-threading now spawns one extra thread as well, to account for the fact that the initial start-up of threads is partially serialised, which means we need more threads available than CPUs. One long-term battle with lrzip, never fully resolved, is how much RAM to make available for each stage of the rzip pre-processing and then for each compression thread. After looking into the internals of the memory-hungry lzma and zpaq, I was able to account more accurately for how much RAM each thread would use, and push up the amount of RAM available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed-up, provided there's enough data to keep all threads busy. Anyway, the final upshot is that although more threads are in use now (which would normally hurt the compression ratio), compression has been kept approximately the same, but is actually faster.
Here's the latest results from my standard 10GB virtual image compression test:
Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756   25.8        5m47s          2m46s
bzip2        2704781700   25.2        16m15s         6m19s
xz           2272322208   21.2        50m58s         3m52s
7z           2242897134   20.9        26m36s         5m41s
lrzip        1372218189   12.8        10m23s         2m53s
lrzip -U     1095735108   10.2        8m44s          2m45s
lrzip -l     1831894161   17.1        4m53s          2m37s
lrzip -lU    1414959433   13.2        4m48s          2m38s
lrzip -zU    1067075961   9.9         69m36s         69m35s
Lots of other internal changes have gone in that are too numerous to cover in depth here (see the Changelog for the short summary), but some user-visible changes have also been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option. The -T option no longer takes an argument and is just on/off. A -k option has been added to keep corrupt/broken files; by default, broken files generated during compression/decompression are now automatically deleted. The -i information option now gives more information, and has a verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:
Detected lrzip version 0.5 file.
../temp/enwik8.lrz:
lrzip version: 0.5 file
Compression: rzip + lzma
Decompressed file size: 100000000
Compressed file size: 26642293
Compression ratio: 3.753
MD5 used for integrity testing
MD5: a1fa5ffddb56f4953e226637dabbb36a
Rzip chunk 1:
Stream: 0 Offset: 25
Block  Comp  Percent  Size
1      lzma  58.1%    867413 / 1493985    Offset: 22687516  Head: 0
Stream: 1 Offset: 25
Block  Comp  Percent  Size
1      lzma  28.8%    5756191 / 20000000  Offset: 75        Head: 5756266
2      lzma  28.4%    5681891 / 20000000  Offset: 5756291   Head: 11438182
3      lzma  28.2%    5630256 / 20000000  Offset: 11438207  Head: 17068463
4      lzma  28.1%    5619003 / 20000000  Offset: 17068488  Head: 23554929
5      lzma  28.5%    3087298 / 10841364  Offset: 23554954  Head: 0
Rzip compression: 92.3% 92335349 / 100000000
Back end compression: 28.9% 26642052 / 92335349
Overall compression: 26.6% 26642052 / 100000000
I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.
That's probably enough babbling. You can get it here once freshmeat updates its links:
lrzip
Monday, 14 February 2011
2.6.37-ck2 minor fixes.
I've put up a small updated -ck version for 2.6.37. There are only two changes: 1. the build fix for it not compiling with CPU_HOTPLUG disabled, and 2. I've dropped the patch called mm-make_swappiness_really_mean_it.patch. This second patch broke a while back and I never noticed because I had swap disabled, sorry. It works better with it dropped. Note that the Ubuntu packages I recently made available already include this change, which I quietly snuck in, but I will make ck2 packages officially available just to avoid confusion ;) If you've built your own 2.6.37-ck1, I recommend upgrading only if you have swap enabled on your machine (which most people still do). Alternatively, if you built it with CPU_HOTPLUG enabled just to make it build, you _can_ upgrade to this version to build with it disabled, but you won't notice a difference with the feature disabled - your kernel will only be slightly smaller. It might be a few nanoseconds faster.
Get it here:
2.6.37-ck2
Friday, 11 February 2011
lrzip-0.552 Random fixes.
It had been a while since lrzip had received attention, and a number of things came up that made me give it some. First, it wouldn't compile on FreeBSD, because FreeBSD doesn't implement mremap. So I wrapped that in a fake mremap, the same way I had managed it for OS X. Then OS X would compile the existing code fine but fail to actually work, saying "unimplemented" when you tried to run it, because OS X doesn't implement unnamed semaphores. I've mentioned this before, but named semaphores are a real pain to work with and unnamed ones are very convenient. So I wrapped all the threading calls on OS X to not actually thread, ignoring all the calls to semaphore functions. It means OS X doesn't benefit from the extra threading added after version 0.530, but at least it actually works now.
Finally, there was an alarming report I received about file corruption on decompression from a user who was compressing a 20GB log. The generated archive would decompress fine with version 0.530, but with any version after that it would randomly and silently corrupt somewhere during decompression, noticeable only after comparing to the original file. Try as I might at home, I couldn't reproduce this bug. So, with the help of this user doing possibly hundreds of tests for me with debugging code, I found a trail of possible causes. Disabling the decompression threading code surprisingly didn't fix it, confirming the bug was elsewhere. After an extensive search I found there were some mmap calls in lrzip that weren't careful about aligning lengths to the page size, so writes to the resulting buffer would have random results. It's surprising that no more corruption was seen, but presumably the particular buffer in question never held that much data, so it would fit in whatever amount happened to be allocated; presumably the larger file is what made the bug easier to trigger. That would certainly explain random-position failures, but it didn't really explain why it only started happening after 0.530.
Anyway, after converting the mmap calls to ordinary malloc calls, decompression would now actually fail with a CRC error. Lrzip does a CRC check on the data generated and compares it to the CRC stored in the archive. However, if a string of zeroes is generated and the stored CRC is also read as zero, it can look ok; I'm assuming this is why it was corrupting silently. The CRC errors at least gave me a trail of possible sources for the error. After much scratching of heads, heavy drinking and sleepless nights, I found that the original rzip code that lrzip came from had a function designed to read a certain amount of data, then return how much it had actually read. The thing is, the function that called it (unzip_literal) didn't check the return value and just assumed it always read all the data asked of it. Hitting this codepath is obviously extremely rare, but there it was. Fixing that resolved the instant CRC error and made it mostly reliable.
Now, I'd love to say it's 100% reliable, but after running the decompression hundreds of times just to be certain, it failed on one occasion. So there's still potential for failure, possibly somewhere else in the code, when decompressing extremely large archives of 20GB or more. Given the number of fixes I put into lrzip, and that there were obvious bugs in the previous version, I've released the cumulative fixes as version 0.552. I've also put a warning in there that all very large decompressed files should be checked with md5sum or equivalent to ensure they are not corrupted.
A handful of other minor fixes went in too, so this is basically a bugfix release only. So if you're using lrzip, I highly recommend upgrading to version 0.552.
Freshmeat link (sometimes takes a while for new release to appear):
lrzip
Now for more debugging.
Thursday, 10 February 2011
Becoming a Ubu nut. -ck packages for Ubuntu
I've been on debian stable for many years as my primary desktop, being quite happy with relatively old software provided most things worked as expected. The only things I ever updated outside of the repositories were firefox, openoffice and only random other small things. However I finally have had enough. KDE 4.4 broke me. No, I'm not interested in using a different desktop. I tried and tried recently and they just never felt right, mainly because I've been on KDE for so long that it works the way I want it to (when it did work!). KDE 4.4 still had some showstopper bugs for my usage that affected me daily, and reporting them upstream has had no effect, because they're all fixed in 4.5 or later, and will not be backported to 4.4 on debian. Now it's not entirely debian's fault since they have to freeze at some point - I do understand that. It's just that kde4.4 was still... meh.
So I bit the bullet and installed (K)Ubuntu 10.10 on my primary desktop, which till now I'd only used on a laptop, but it had been quite respectable. After an uneventful install and reboot, my keyboard and mouse would not work. I tried different usb keyboards and ps/2 ones to no avail. Ironically the install DVD worked fine, and using the rescue mode I tried the first logical thing - I installed a -ck kernel that I had from my previous install. This fixed it. So anyway I'm not going to go on about distros because that's a long hard painful discussion that there is no answer to, and all I can say is there are different distros out there and just use what's comfortable for you. I'm also not going to try and review the distro. There are enough people out there that do this already. I just wish the KDE 4.x composite + nvidia driver combination causing high CPU usage and slowdown issues with kwin and xorg would be fixed :\ Search for kwin high CPU and nvidia and you'll see loads and loads of posts on heaps of forums with different distros and no solution. I'm betting it's nvidia's fault. (Disabling blur makes it use less CPU, but sluggishness kicks in after a while regardless).
So anyway the point of this is that I figured - what the heck, I may as well make some distro kernels for Ubuntu that people keep asking me about. So I've quickly packaged up an amd64 2.6.35.11-ck1 and 2.6.37-ck1 kernel. The older one I recommend for most people, and the newer one for relatively more adventurous souls.
For now they can be found here (no ppa or anything at the moment, just .deb packages):
Ubuntu Packages
Note that if you have a 32 bit Ubuntu install, but a 64 bit capable CPU (as all are in the last few years) you can still force install these kernels and they should work fine on 32 bit userspace. You might have trouble if you install other drivers though. There's a good chance these packages will work on debian as well, but I can't guarantee anything. Please report back if you try that combination out.
Enjoy!
EDIT: I added the backports ppa by kubuntu so that I can install kde 4.6 and all remnants of sluggishness are gone! Thank goodness. It looks like it was kde after all.
Thursday, 6 January 2011
2.6.37-ck1, BFS 0.363 for 2.6.37, Grouping by UID
It looks like 2.6.37 made it out in time before I left for my trip, so here's some goodies to keep you all busy, from the emails I just sent to announce them on lkml:
---
These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but suitable to any workload.
Apply to 2.6.37:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/patch-2.6.37-ck1.bz2
Broken out tarball:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/2.6.37-ck1-broken-out.tar.bz2
Discrete patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/patches/
All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/
Web:
http://kernel.kolivas.org
Code blog when I feel like it:
http://ck-hack.blogspot.com/
Each discrete patch contains a brief description of what it does at the top of
the patch itself.
The most significant change is an updated BFS CPU scheduler, up to version
0.363. See the announce of that patch for the changelog. The rest is a resync.
2.6.37-sched-bfs-363.patch
sched-add-above-background-load-function.patch
mm-make_swappiness_really_mean_it.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
ck1-version.patch
---
The BFS (name shall not be said for PG requirements) CPU scheduler for 2.6.37
is now available.
Since the last release, a lot more work was put into maintaining fine grained
accounting at all times (which should help on 32 bit machines, uniprocessor
and 100Hz configs), minor changes have been made to make CPU offline code more
robust for the purposes of suspend to ram/disk, and some small scalability
improvements were added for SMT CPUs (eg i7). These changes are unlikely to
have dramatically noticeable effects unless you were already experiencing a
problem or poor performance.
A direct link to the patch for 2.6.37 is here:
http://ck.kolivas.org/patches/bfs/2.6.37/2.6.37-sched-bfs-363.patch
All BFS patches here:
http://ck.kolivas.org/patches/bfs
Version 363 has been ported to 2.6.35 and 2.6.32 and is available from that
directory, in light of the long term release nature of these kernels.
On a related note, a small multi-user server feature request was commissioned
for BFS that I was happy to work on, which I'd like to also make publicly
available.
Here is the changelog:
---
Make it possible to proportion CPU resource strictly according to user ID by
grouping all tasks from the one user as one task.
This is done through simply tracking how many tasks from the one UID are
running at any one time and using that data to determine what the virtual
deadline is, offset by proportionately more according to the number of running
tasks. Do this by creating an array of every UID for very quick lookup of the
running value and protect it by the grq lock. This should incur almost
immeasurably small overhead even when enabled. An upper limit of 65535 UIDs
is currently supported.
Make this feature configurable at build time and runtime via Kconfig, and
through the use of sysctls
/proc/sys/kernel/group_by_uid
to enable or disable the feature (default 1 == on), and
/proc/sys/kernel/group_uid_min
to decide the minimum uid to group tasks from (default 1000)
Nice values are still respected, making it possible to allocate different
amounts of CPU to each user.
This feature is most suited to a multi-user shell type server environment and
is NOT recommended for an ordinary desktop.
---
The patch is available for the moment here:
http://ck.kolivas.org/patches/bfs/test/bfs363-group_uids.patch
A reminder that this is NOT a desktop, laptop or embedded device type feature.
The purpose of this feature is to make it impossible for any one user to get
more CPU than any other user on a multi-user login. This is suitable for
multiuser shared GUI/X session or shell type machines, and incurs almost
negligible overhead.
---
I'll be offline shortly and in Japan for a few weeks so I'll be unlikely to respond to any emails in that time.
Enjoy!
---
These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but suitable to any workload.
Apply to 2.6.37:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/patch-2.6.37-ck1.bz2
Broken out tarball:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/2.6.37-ck1-broken-out.tar.bz2
Discrete patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/patches/
All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/
Web:
http://kernel.kolivas.org
Code blog when I feel like it:
http://ck-hack.blogspot.com/
Each discrete patch contains a brief description of what it does at the top of
the patch itself.
The most significant change is an updated BFS CPU scheduler, up to version
0.363. See the announce of that patch for the changelog. The rest is a resync.
2.6.37-sched-bfs-363.patch
sched-add-above-background-load-function.patch
mm-make_swappiness_really_mean_it.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
ck1-version.patch
---
The BFS (name shall not be said for PG requirements) CPU scheduler for 2.6.37
is now available.
Since the last release, a lot more work was put into maintaining fine grained
accounting at all times (which should help on 32 bit machines, uniprocessor
and 100Hz configs), minor changes have been made to make CPU offline code more
robust for the purposes of suspend to ram/disk, and some small scalability
improvements were added for SMT CPUs (eg i7). These changes are unlikely to
have dramatically noticeable effects unless you were already experiencing a
problem or poor performance.
A direct link to the patch for 2.6.37 is here:
http://ck.kolivas.org/patches/bfs/2.6.37/2.6.37-sched-bfs-363.patch
All BFS patches here:
http://ck.kolivas.org/patches/bfs
Version 363 has been ported to 2.6.35 and 2.6.32 and available from that
directory in lieu of the long term release nature of these kernels.
On a related note, a small multi-user server feature request was commissioned
for BFS that I was happy to work on, which I'd like to also make publicly
available.
Here is the changelog:
---
Make it possible to proportion CPU resource strictly according to user ID by
grouping all tasks from the one user as one task.
This is done through simply tracking how many tasks from the one UID are
running at any one time and using that data to determine what the virtual
deadline is, offset by proportionately more according to the number of running
tasks. Do this by creating an array of every UID for very quick lookup of the
running value and protect it by the grq lock. This should incur almost
immeasurably small overhead even when enabled. An upper limit of 65535 UIDs
is currently supported.
Make this feature configurable at build time and runtime via Kconfig, and
through the use of sysctls
/proc/sys/kernel/group_by_uid
to enable or disable the feature (default 1 == on), and
/proc/sys/kernel/group_uid_min
to set the minimum UID to group tasks from (default 1000).
Nice values are still respected, making it possible to allocate different
amounts of CPU to each user.
This feature is most suited to a multi-user shell type server environment and
is NOT recommended for an ordinary desktop.
---
The patch is available for the moment here:
http://ck.kolivas.org/patches/bfs/test/bfs363-group_uids.patch
A reminder that this is NOT a desktop, laptop or embedded device type feature.
The purpose of this feature is to make it impossible for any one user to get
more CPU than any other user on a multi-user login. This is suitable for
multiuser shared GUI/X session or shell type machines, and incurs almost
negligible overhead.
---
I'll be offline shortly and in Japan for a few weeks, so I'll be unlikely to respond to any emails in that time.
Enjoy!
Sunday, 2 January 2011
BFS version 0.363
Welcome to 2011!
The testing on BFS ported to 2.6.37-rc8 has been reassuring, and no real show stopper bugs have shown up. The remaining changes required to make it release ready for 2.6.37 have now been committed, along with some other very minor changes, so I've bumped the version up to 0.363. The main change was implementing the fine grained interrupt accounting which will have very little, if any, impact on regular users. These changes are ONLY suitable for 2.6.37, so they have not been ported back to the BFS I'm maintaining for earlier kernels. The rest of the changes suitable for older kernels have gone into 363 for them.
Here is the changelog as it affects existing BFS 360 users:
Make CPU offlining more robust by simply removing all affinity for processes that no longer have any CPUs they can run on. This allows the machine stop thread to complete offlining CPUs and makes for a little less overhead in hot paths.
Allow SCHED_IDLEPRIO to wake up idle CPUs in try_preempt. This would have caused minor slowdowns for IDLEPRIO tasks only on relatively quiescent systems.
Remove inappropriate likely()s.
Update cpustat for irq - may have been under-reporting interrupt load.
Cosmetic changes.
Bump version to 0.363.
Most of these changes should have no visible behavioural effect to the user, apart from the following:
For those on BFS 360, if you were having warnings or even OOPSes on suspend to ram/disk or wakeup from them, or if you were having trouble suspending or resuming, this change might help.
The other change guarantees that CPUs will be busier on SMP machines when tasks are being run IDLEPRIO, so it will increase throughput, but ONLY if you run tasks IDLEPRIO.
Incremental: 2.6.36-bfs-360-363.patch
For 2.6.36ish:
2.6.36-sched-bfs-363.patch
2.6.35.10:
2.6.35.10-sched-bfs-363.patch
2.6.32.27:
2.6.32.27-sched-bfs-363.patch
Shortly I'll be going to Japan for a few weeks as I do almost every year now, so I'll be offline for a while.
Thursday, 30 December 2010
BFS for 2.6.37-rc8
So we approach yet another "stable" release with 2.6.37 just around the corner now that -pre8 is out (yes I'm old fashioned, it is still a pre-release and not a release candidate in my eyes). I usually use pre8 as the flag to port BFS to the new kernel so that I have it ready in time for the "stable" 3 point release.
I've ported the current BFS release v0.360 to the latest kernel. Of significance in the new kernel are some changes yet again to the way CPU offline code occurs (which happens just before suspend to ram and disk), some new way to improve accounting of CPU attribution during interrupt handling, and improving the reporting of CPU load during idle periods with NOHZ enabled. There are also some other architectural changes that have no cosmetic effect. None of these will have any effect on "performance" for the end user so don't expect any surprises there.
So I've ported BFS with only the changes necessary to get it working on the new kernel. I've not yet added the support for interrupt accounting, and I haven't changed the way total load is calculated on NOHZ configured kernels. The only change of any significance is the way I offline CPUs in the new BFS. I have been fighting with that for a while now since there really is no right way to do it, and it changes so frequently in mainline that I often have trouble keeping up.
For those already on version 0.360 of BFS, the only significant changes can be had now in this patch:
bfs360-test.patch
For the dirty details in the CPU offline code: what happens is that a worker thread runs at ultra high priority on the CPU it's about to offline, and half way through its work it turns off that CPU and then needs to run on another CPU to die itself. Till now, BFS has looked for tasks whose affinity only covered CPUs that were no longer alive and would then be happy to run them anywhere. With this latest bfs360-test.patch it does more of what the mainline kernel does and formally breaks affinity for tasks that no longer have anywhere they're allowed to run. This brings only a negligible reduction in overhead, but the main reason for doing it is to make the offline code more robust.
For those looking for the first BFS release on 2.6.37-rc8, here's the test patch:
2.6.37-rc8-sched-bfs-362.patch.
I've bumped the version number up simply because it includes the test changes above, and has had other architectural changes to be in sync with the mainline kernel. The main thing new in mainline is that there is a new "class" of realtime scheduling used only internally by the CPU offlining code called a stop class. Instead of using a new scheduling class, BFS simply adds one more realtime priority that can be used by kernel threads and is only ever applied to the CPU offline thread (kmigration). I did this to make it add zero overhead to the existing design, yet support the concept that it is a thread that nothing else can preempt.
Almost certainly I will be adding more code to the final version of BFS that I use when 2.6.37 comes out, to support the new interrupt accounting code and the global load reported, but these will only have cosmetic effects. This patch has only been lightly tested and not compile tested for multiple configurations just yet, but seems to be working nicely. For now you can grab it here:
2.6.37-rc8-sched-bfs-362.patch
Happy New Year everyone :)
Friday, 24 December 2010
Queuing theory and BFS
I had pointed out to me on IRC by Damentz the queuing theory video that was linked by a slashdot article.
For those who haven't seen it, the video is here:
Why the other line is likely to move faster
Anyway, the relevance of that video is that BFS uses a single queue, whereas the mainline Linux kernel uses a multiple queue design. The people are tasks, and the checkouts are CPUs. Of course there's a lot more to a CPU scheduler than just the queue design, but I thought this video was very relevant :)
Merry Christmas everyone.
Tuesday, 14 December 2010
BFS version 0.360
There's nothing better to help you work out how something should work on certain hardware than having it yourself. I recently was fortunate enough to get my hands on an i7 based laptop. While the laptop is powerful for a laptop, a 2.67 dual core, quad thread is hardly a particularly powerful i7 (except in a laptop), but it nonetheless has a topology I've not had before. After installing a -ck kernel on it, I went and benchmarked how BFS performed on it, and was a little disappointed. At load==number of CPUs, it performed well, and with a single threaded job it was good too, but performed poorly compared to mainline with half jobs (so 2 jobs on 4 logical threads). I might point out there's nothing quite like having something to compare to to have some kind of yardstick off which to work. It worked both ways in this scenario, with mainline having had BFS as a reference for which regressions/improvements were found, and in this case I found that BFS wasn't up to scratch. So I set out to find and fix it.
BFS doesn't really use the notion of "balancing" to try and move tasks from one CPU to the next based on some complex heuristic algorithm over time, but decides whenever a task wakes up, which CPU anywhere is the most suitable for the task to run on. This affords it very low latency since a task might find somewhere to run much sooner than if it were on a local queue on one CPU. BFS doesn't just decide to find the lowest latency point to run off, though, as that would ping pong tasks all over the place. That might be okay for latency, but is expensive in throughput terms. Probably the single most important factor in improving throughput with CPUs as time goes on is keeping the working set of data in the CPU cache, and the caches have gotten fatter and fatter with time. I've been asked this numerous times before, but it's the reason we can't just run tasks for a few microseconds and switch between all tasks as often as we like so that we get ultra low latencies; it costs us dramatically in throughput. The more often you move a task away from one CPU's cache to another one, the lower the throughput will be (if something else then uses that previous CPU).
BFS uses the simple information of where it last ran to determine which CPU would be the best CPU to run on next. It does this by searching for an idle CPU first, and binding to the CPU "closest" in cache to the last one it ran off. It would choose a CPU that shared the cache before moving to a different CPU (or then a different node in the case of NUMA) through the use of "cache locality" lists. BFS, unlike mainline, will never leave a task off a CPU when there is *any* CPU idle that it is allowed to run on. This is done because any improvement in throughput by trying to keep the task bound to one CPU costs us in the latency of having a task wait on a busy CPU, and latency is everything on a desktop. On a 2 core, 4 thread machine, if you have only 2 tasks running, you want each task to be running on the real cores and not threads of the same core. BFS checks to see when cores' threads are busy and chooses cores that don't have their other threads busy. However, if you have 2 tasks running (let's say a make -j2 build), then at irregularly irregular intervals, you will have other tasks wake up and consume CPU (scheduling noise), making the number of tasks asking for CPU at any one time jump to 3 or more. This ruins the possibility that we only ever use the 2 cores, and throws everything off balance. Since BFS has no notion of history, it's not like those tasks will then go back once the noise has died down. After a task has moved, there is not much point trying to put it back since each next move costs us yet again.
As there was a performance regression with the 2 task / 4 thread CPU case, I started experimenting with allowing tasks to sit idle for a bit and wait in case their previous CPU became available. The attempts got more and more complex and fancier, with more and more special cases I could think of, yet failed to do more than a pitiful improvement in throughput. So I abandoned that in disgust (especially since I was hating how complex this was getting, and it's just not the BFS way).
Then I investigated to see whether my code had unintentionally grouped the wrong threads with each other in BFS, and for a while I was convinced that it had. The reason was that threads 0 and 2 were considered SMT (hyperthread) siblings, as were 1 and 3. I dug deeper for a while and realised that indeed that is how they were enumerated, however silly that sounded to me. Oh well, at least I had excluded that as the problem.
Finally I went back and looked at how the idle CPU was chosen and spotted something. On a CPU with SMT siblings, it is virtually free to move a task from one logical CPU to its SMT sibling CPU. It is also virtually free to move a task from one logical CPU to its shared cache sibling. This I had taken into consideration with BFS, and made the resched best idle code as simple as possible by just lumping them all together, and picking the first one it came across. Now on this i7, all 4 logical CPUs are actually considered shared cache siblings, meaning it would always pick the first non-thread-busy CPU. It seemed okay at the time, and worked for my quad core on my desktop (non hyperthreaded), but I figured it would be best to try sticking to the same actual logical CPU first, then its SMT sibling, before trying a shared cache CPU. I also remembered the "turbo" mode in these i* CPUs, which allows one core to run at higher than the regular clock speed if it's the only one running, so it seemed to make sense that trying to keep tasks on the actual CPU they ran on, rather than a virtually free move might be just as good. I tried it and bingo, an 8% speedup on the make -j2 case, and ever so slightly faster when fully loaded make -j4. That was well worth the effort.
Even if it is free to move tasks from one CPU to a logical counterpart, there is another theoretical advantage of trying to stick to the same CPU. CPU frequency scaling is done on a per-CPU basis, and if one CPU is working harder than all of them, it will ramp up more easily. This should decrease response time under bursts of load (with the ondemand scaling governor), and potentially use less power since only one CPU will speed up. It's actually been shown that you use less power overall if a job is completed at a higher frequency before ramping down again, than doing the job at the lower frequency. This is why the conservative governor doesn't really make sense - it takes longer to respond since there's always a lag for it to speed up, and doesn't really save power. Even if you don't use cpu frequency scaling, all modern CPUs do plenty of their own powersaving internally in much the same manner.
Anyway, that change, along with a handful of other stuff make it worth bumping the BFS version up. All the changes going from 357 are low risk and it should be safe to jump versions now if 357 has been stable for you.
Here's the brief changelog of 357-360:
Don't unnecessarily preempt for a task on the wrong CPU.
Cope with worker threads trying to wake themselves up due to shifting CPUs on suspend by reactivating them, instead of hitting the BUG_ON.
Wrap timer jiffies at 10 seconds instead of 5 minutes since 32 bit load averages don't work until the first timer wrap.
Remove the last_task logic as it wasn't providing any significant performance advantage.
Change the locality logic to try to reschedule on the exact same logical core instead of assuming scheduling on a sibling core or sibling thread is equivalent. This allows CPUs with a "turbo" mode (such as i7) to use that more often by using one CPU more than spreading out, and allows ondemand cpu frequency scaling to ramp up more easily when a task stays on the same CPU. It increases throughput on threaded CPUs when lightly loaded, and may offer both performance and power saving advantages on all SMP topologies with cpu frequency scaling.
Get it here (including incremental) for 2.6.36ish:
http://ck.kolivas.org/patches/bfs/2.6.36/
EDIT: I should explain that the j2 test is simply a regression test for throughput, it is NOT my way of benchmarking interactivity or responsiveness.
Sunday, 12 December 2010
lrzip-0.551, overlapping streams for faster compression
A few bug reports have been trickling in since I last released a version of lrzip, which as I said last time is good and bad. It's helped direct my workflow on this, even though I keep trying to let it settle down. I'd love to leave the code since it mostly works quite well, and there is a common mantra that "release early, release often" is considered good, but in practice that doesn't seem to work. If you release stuff while it's still broken, it will get a bad reputation and people will dismiss it that time around and often not give it a chance again. So if there are show stopper bugs, they need attending to ASAP.
So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.
While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to 9 compression did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all round compromise level for lzma. This also was affecting the window sizes internally used by the lzma library. At level 5, the windows are 16MB in size, and go up to 64MB for level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread would have been needing up to 800MB each. This explained why people were getting lzma failing on window sizes and complaining despite apparently enough ram. While the new level does lose slightly on compression, it does also improve compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".
With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s
And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.
Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks
On version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1 to make decompression more robust for out-of-order storage in the archive. Well unfortunately this doubled the possible number of threads, and this brought a now all-too-familiar problem: running out of ram on big files. In .550 I changed it to only use one thread for stream 0 rather than multiple ones since it's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.
So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both, since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting each block up according to the number of CPUs, but at the end of a stream it waited to get all the blocks back before moving on to the next stream. For this release I moved the threading up earlier in the overall process, allowing the rzip stage to move on to the next stream while the back-end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 of ram), once all CPUs are in use with back-end threads, they should remain in use until the end of the file is reached. This speeds up compression of large files on SMP machines.
While looking for bugs to do with lzma compression failing from running out of ram, I discovered why changing from level 7 compression to level 9 did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down, meaning lrzip now uses the best all-round compromise level for lzma. This was also affecting the window sizes used internally by the lzma library: at level 5 the windows are 16MB in size, going up to 64MB at level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread could have needed up to 800MB. This explained why people were seeing lzma fail on window sizes despite apparently having enough ram. While the new level loses slightly on compression, it also improves compression speed significantly, and should pretty much eliminate lzma failing due to "too big window sizes".
With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s
And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.
Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks
In version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1, to make decompression more robust against out-of-order storage in the archive. Unfortunately this doubled the possible number of threads, which brought a now all-too-familiar problem: running out of ram on big files. In 0.550 I changed it to use only one thread for stream 0 rather than multiple, since there's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.
I put some effort into tackling the dreaded running-out-of-ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and, instead of failing completely, wait for the previous thread to complete its compression before trying again. This serialises work that would otherwise have been done in parallel, so less ram is required since only one thread is likely to be using it at a time. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resort to bzip2 if it never succeeds. This should make running out of memory a non-fatal problem, since lrzip will just try again using less ram, albeit more slowly.
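In miniature, that failure ladder might look like this (a sketch only: all names and the toy level-to-window mapping are hypothetical, and the wait-for-the-previous-thread serialisation is reduced to a comment):

```c
#include <stdlib.h>

/* Hypothetical sketch of the non-fatal failure ladder: when a back-end
 * thread fails (usually memory exhaustion), first wait for the previous
 * thread so only one thread holds ram, then retry at lower lzma levels
 * (smaller windows), and finally fall back to bzip2 so the block is
 * never left unhandled. */
struct block { long mem_budget; };

static int try_lzma(struct block *b, int level)
{
    long window = 1L << (20 + level);          /* toy level->window mapping */
    return (12 * window <= b->mem_budget) ? 0 : -1;
}

static int try_bzip2(struct block *b)
{
    (void)b;
    return 0;   /* bzip2's window is small enough to always fit here */
}

/* Returns the lzma level that succeeded, or 0 for the bzip2 fallback. */
int compress_block_with_fallback(struct block *b, int level)
{
    /* (real code would wait_for_previous_thread() here to free up ram) */
    for (; level >= 1; level--)
        if (try_lzma(b, level) == 0)
            return level;
    try_bzip2(b);
    return 0;
}
```

The point of the ladder is that every rung is cheaper in memory than the one before it, so the worst case is a slower, mixed-compression archive rather than a fatal error.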
On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.
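The sliding mmap idea reduces, in essence, to keeping one small window mapped and remapping it on demand. A toy illustration (not lrzip's actual implementation, which also keeps the larger base mapping alongside the sliding window):

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Toy sliding mmap: instead of mapping a whole huge file, keep one small
 * window mapped and remap it whenever an access falls outside it. */
struct sliding_map {
    int fd;
    long win_size;       /* window size, a multiple of the page size */
    long win_off;        /* file offset of the current window */
    unsigned char *map;  /* NULL until first access */
};

static int slide_to(struct sliding_map *s, long offset)
{
    long page = sysconf(_SC_PAGESIZE);
    long base = (offset / page) * page;  /* mmap offsets must be page-aligned */

    if (s->map)
        munmap(s->map, s->win_size);
    s->map = mmap(NULL, s->win_size, PROT_READ, MAP_PRIVATE, s->fd, base);
    if (s->map == MAP_FAILED) {
        s->map = NULL;
        return -1;
    }
    s->win_off = base;
    return 0;
}

/* Read one byte at an absolute file offset, sliding the window if needed. */
int sliding_get(struct sliding_map *s, long offset)
{
    if (!s->map || offset < s->win_off || offset >= s->win_off + s->win_size) {
        if (slide_to(s, offset) < 0)
            return -1;
    }
    return s->map[offset - s->win_off];
}

/* Demo helper: temporary file of n_pages pages, each filled with its index. */
int make_demo_file(long n_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    if (!f)
        return -1;
    for (long i = 0; i < n_pages * page; i++)
        fputc((int)(i / page), f);
    fflush(f);
    return fileno(f);
}
```

The trade-off is exactly as described above: each out-of-window access costs a munmap/mmap pair, which is a tiny bit slower than one big mapping, but the resident window stays small.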
My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.
I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P
So I was very proud of this release and made it go gold as version 0.550...
About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.
So get it here:
http://freshmeat.net/projects/lrzip
No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.
Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.
EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.
---
Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.
Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A1 the "matches"
Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.
All this used to be done in series.
Version 0.530 made the last chopping-up part run in parallel, since the slow compression back end (lzma or zpaq) was the time-consuming part of compression. However, once streams A0 and A1 had finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before moving on to chunk B.
Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.
---
This means that with the faster compressors (lzo, gzip and bzip2), the rate-limiting step is the production of streams A0 and A1. The pipe dream is to parallelise those too, but they rely on using as much ram as possible to make the rzip stage effective, so it would compromise compression greatly (and be messy as hell to implement).
Tuesday, 7 December 2010
lrzip-0.544 and fudging with memory allocation on 32 bits
It never ceases to amaze me that no matter how much I try and keep the memory allocation paths in check on lrzip, userspace will continue to fail me somehow. I had reports of memory allocation failures on 32 bit machines running lrzip with big files and started investigating again. Even if you have 64GB of ram running on a 32 bit kernel with PAE enabled, you can only allocate up to 2GB maximum with malloc and friends, due to ssize_t being a 32 bit signed long, giving you only 31 bits of positive to work with (which works out to 2GB). I had taken that into account, but it seems that 32 bit userspace really doesn't even like allocating more than 2GB in total for the one application.
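Concretely, the ceiling just described can be expressed as a capping helper along these lines (a sketch; the divide-by-three safety margin is illustrative, not lrzip's exact policy):

```c
#include <stdint.h>

/* On a 32-bit userspace, even with PAE and lots of physical ram, a single
 * allocation cannot exceed what a signed 32-bit size can express (~2GB),
 * and in practice the whole process address space is far more cramped.
 * Cap a requested buffer size accordingly; on 64 bits, pass it through. */
int64_t cap_alloc_size(int64_t wanted)
{
    if (sizeof(void *) == 4) {            /* 32-bit userspace */
        int64_t hard_max = INT32_MAX;     /* the ssize_t ceiling: 2GB - 1 */
        int64_t safe_max = hard_max / 3;  /* leave room for everything else */
        if (wanted > safe_max)
            return safe_max;
    }
    return wanted;
}
```

Anything above the cap then has to be reached some other way, which is where the sliding mmap comes in.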
Since lrzip can use a much smaller allocation now thanks to sliding mmap, I set out to use an even smaller base mmap on 32 bits rather than allocating up to 2GB, relying on sliding mmap for the rest. This fixed the memory allocation errors on 32 bits, but left me with the lzma back end failing for some users out there, reporting "try decreasing window size". Even though there is plenty of ram available now, it occasionally returns with a memory error. Since I didn't write any of the lzma library myself, I really didn't know where to begin looking. I tried decreasing the window size passed to lzma, but it sometimes failed even on windows of only 60MB despite gigabytes of ram spare. So after much battling with it I decided to just let it fail, and then fall back to bzip2 compression if lzma failed to compress that block. This means lrzip will generate a mixed-compression file, which is perfectly fine, and at least it will not leave blocks uncompressed. None of this is a problem on 64 bits, mind you, but there are plenty of people still running 32 bits out there, and there will be for some time to come.
The other issue that arose was that uclibc doesn't appear to fully support the sysconf() function, returning an invalid (negative) value for the amount of ram, so I had to add a fallback to reading /proc for that.
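A hedged sketch of that detection-plus-fallback (not lrzip's exact code): try sysconf(), and parse /proc/meminfo if it returns garbage.

```c
#include <stdio.h>
#include <unistd.h>

/* Total ram via sysconf(), falling back to /proc/meminfo when the C
 * library (e.g. uclibc) returns a bogus value.  Returns -1 on failure. */
long long total_ram_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long psize = sysconf(_SC_PAGESIZE);
    if (pages > 0 && psize > 0)
        return (long long)pages * psize;

    /* Fallback: parse "MemTotal:  NNN kB" from /proc/meminfo. */
    FILE *f = fopen("/proc/meminfo", "r");
    long long kb = -1;
    char line[256];
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "MemTotal: %lld kB", &kb) == 1)
            break;
    fclose(f);
    return kb > 0 ? kb * 1024 : -1;
}
```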
The multithreaded compression (till now) worked by dividing the size of the file by the number of threads, and when a chunk of that size was reached, passing it on to a thread and continuing. The problem with this approach is that the file is rzip-preprocessed before being passed on to a back-end compression thread, so it's usually smaller than this. I modified it to hand data to threads at regular intervals during the rzip hash search instead, thus spawning threads at even timeframes to hopefully use more CPU time on SMP machines. What's interesting, though, is that the chunks compress markedly differently, because they're different parts of the rzip-processed data and can be tiny or massive after that processing. It made for only very modest improvements in compression times. It does have an advantage on decompression, though, because each chunk after decompression is the same size, and it can take quite a while to write, say, a 2.5GB chunk at a time; the next thread can keep working on the following chunk while the first is busy writing.
When I last posted, I was unhappy about the multithreaded decompression component, so I rewrote part of that as well. Now, instead of guessing which stream is likely to be read next, it simply spawns two groups of threads, one group for each stream. The vast majority of the time, archives end up with just one block of stream 0 and lots of blocks of stream 1, so it won't make a big difference, but hopefully it's more robust than it was before. It does mean, however, that you can potentially spawn twice as many threads as you have CPUs, which could be a problem for memory usage.
So what's next for lrzip? Well I keep trying to let it stabilise instead of doing more work on it, but there are more people using it now which means I get more bug reports. That's both good and bad ;) The things I see that still need addressing are more ability to cope with failures to do with memory allocation. Unlike other compression applications which use fixed upper sized compression windows, the main challenge with lrzip is to try and use as much memory as possible without running out of memory. This means that the main challenges aren't to do with the algorithms themselves, but memory usage. Multithreading has just made that even more complex since each thread needs its own memory to work with. I need to put more work into coping with trying to allocate too much ram and then cut back on thread usage, but backtracking like that is a little tricky with the code the way it is at the moment. If you do get memory problems, try setting -p values to less than your CPUs manually.
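The backtracking idea — cut back thread usage when allocation fails, as -p does manually — could be sketched like this (hypothetical helper, not lrzip's code; the toy allocator just simulates running out of ram):

```c
#include <stdlib.h>

/* Toy allocator refusing anything over a fixed budget, to simulate
 * memory exhaustion in the demo below. */
static size_t toy_budget = 1 << 20;
static void *toy_alloc(size_t n) { return n <= toy_budget ? malloc(n) : NULL; }

/* Find how many threads' worth of per-thread buffers can actually be
 * allocated: halve the thread count on failure rather than aborting,
 * mimicking what setting -p below the CPU count does by hand.
 * Returns the workable thread count, or 0 if even one won't fit. */
int pick_thread_count(int want_threads, size_t per_thread_bytes,
                      void *(*alloc)(size_t))
{
    for (int n = want_threads; n >= 1; n /= 2) {
        void *buf = alloc((size_t)n * per_thread_bytes);
        if (buf) {
            free(buf);   /* probe only; real code would keep the buffers */
            return n;
        }
    }
    return 0;
}
```

The awkward part in real code, as noted above, is that by the time an allocation fails, threads are already mid-flight, so backtracking is messier than this probe suggests.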
Also, it still will not work on Mac OS X, with the last good version for that being 0.520. I need a little rest before tackling that :s
Anyway, the new version is out there, and to indicate that it's pretty much just a bug fix and that everyone should upgrade, I've only pushed the minor subversion up to 0.544.
http://freshmeat.net/projects/lrzip
Sunday, 5 December 2010
Automated per session task groups comment.
Against my better judgement I sent an email to lkml. Here's the transcript since it summarises my position.
---
Greets.
I applaud your efforts to continue addressing interactivity and responsiveness but, I know I'm going to regret this, I feel strongly enough to speak up about this change.
On Sun, 5 Dec 2010 10:43:44 Colin Walters wrote:
> On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds wrote:
> > What's your point again? It's a heuristic.
>
> So if it's a heuristic the OS can get wrong,
This is precisely what I see as the flaw in this approach. The whole reason you have CFS now is that we had a scheduler which was pretty good for all the other things in the O(1) scheduler, but needed heuristics to get interactivity right. I put them there. Then I spent the next few years trying to find a way to get rid of them. The reason is precisely what Colin says above. Heuristics get it wrong sometimes. So no matter how smart you think your heuristics are, it is impossible to get it right 100% of the time. If the heuristics make it better 99% of the time, and introduce disastrous corner cases, regressions and exploits 1% of the time, that's unforgivable. That's precisely what we had with the old O(1) scheduler and that's what you got rid of when you put CFS into mainline. The whole reason CFS was better was it was mostly fair and concentrated on ensuring decent latency rather than trying to guess what would be right, so it was predictable and reliable.
So if you introduce heuristics once again into the scheduler to try and improve the desktop by unfairly distributing CPU, you will go back to where you once were: mostly better, but sometimes really badly wrong. No matter how smart you think you can be with heuristics, they cannot be right all the time. And there are regressions with these tty, and now per-session, group patches. Search the forums where desktop users go and you'll see that people are afraid to speak up on lkml, but some users are seeing mplayer and amarok skip under light load when trying them. If you program more intelligence in to work around these regressions, you'll just get yourself deeper and deeper into the same quagmire. The 'quick fix' you seek now is not something you should be defending so vehemently. The "I have a solution now" just doesn't make sense in this light. I for one do not welcome our new heuristic overlords.
If you're serious about really improving the desktop from within the kernel, as you seem to be with this latest change, then make a change that's predictable, gets it right ALL the time, and is robust for the future. Stop working within all the old-fashioned concepts and allow userspace to tell the kernel what it wants, and give the user the power to choose. If you think this is too hard and not doable, or that users are too uninformed or unwilling to modify things themselves, then allow me to propose a relatively simple change that can expedite this.
There are two aspects to getting good desktop behaviour, enough CPU and low latency. 'nice' by your own admission is too crude and doesn't really describe how either of these should really be modified. Furthermore there are 40 levels of it and only about 4 or 5 are ever used. We also know that users don't even bother using it.
What I propose is a new syscall latnice for "latency nice". It need only have 4 levels: 1 for default, 0 for latency-insensitive, 2 for relatively latency-sensitive gui apps, and 3 for exquisitely latency-sensitive uses such as audio. These should not require extra privileges to use, and thus should also not be usable for "exploiting" extra CPU by default. It's simply a matter of working with lower latencies yet a shorter quota (or timeslices), which means throughput on these apps is sacrificed to cache thrashing, but then throughput is not what latency-sensitive applications need. Setting these can then be encouraged within the applications themselves, making this a more long-term change. 'Firefox' could set itself 2, 'Amarok' and 'mplayer' 3, and 'make' - bless its soul - 0, and so on. Keeping the range simple and defined will make it easy for userspace developers to cope with, and for users to fiddle with.
But that would only be the first step. The second step is to take the plunge and accept that we DO want selective unfairness on the desktop, but where WE want it, not where the kernel thinks we might want it. It's not an exploit if my full screen HD video continues to consume 80% of the CPU while make is running - on a desktop. Take a leaf out of other desktop OSs and allow the user to choose say levels 0, 1, or 2 for desktop interactivity with a simple /proc/sys/kernel/interactive tunable, a bit like the "optimise for foreground applications" seen elsewhere. This could then be used to decide whether to use the scheduling hints from latnice to either just ensure low latency but keep the same CPU usage - 0, or actually give progressively more CPU for latniced tasks as the interactive tunable is increased. Then distros can set this on installation and make it part of the many funky GUIs to choose between the different levels. This then takes the user out of the picture almost entirely, yet gives them the power to change it if they so desire.
The actual scheduler changes required to implement this are absurdly simple and doable now, and will not cost in overhead the way cgroups do. It also should cause no regressions when interactive mode is disabled and would have no effect till changes are made elsewhere, or the users use the latnice utility.
Move away from the fragile heuristic tweaks and find a longer term robust solution.
Regards,
Con
--
-ck
P.S. I'm very happy for someone else to do it. Alternatively you could include BFS and I'd code it up for that in my spare time.
---
EDIT:
And just for the sake of it I hacked up what a latnice patch would look like. Of course being unsupported by userspace means there's no point me supporting and promoting this change even on BFS.
http://ck.kolivas.org/patches/bfs/latnice/
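For illustration only, the latnice-to-behaviour trade-off proposed in the email might reduce to a mapping like this (the numbers are purely made up, and this is not what the linked patch does):

```c
/* More latency-sensitive latnice levels get earlier effective deadlines
 * but shorter timeslices, so by default they gain responsiveness without
 * gaining extra CPU.  All values are illustrative. */
struct sched_hint { int timeslice_ms; int deadline_shift; };

struct sched_hint latnice_to_hint(int latnice)
{
    static const struct sched_hint table[4] = {
        { 24, 0 },   /* 0: latency-insensitive (make): long slices        */
        {  6, 0 },   /* 1: default                                        */
        {  3, 1 },   /* 2: gui apps: earlier deadline, shorter slice      */
        {  1, 2 },   /* 3: audio: earliest deadline, shortest slice       */
    };
    if (latnice < 0) latnice = 0;
    if (latnice > 3) latnice = 3;
    return table[latnice];
}
```

The interactive tunable in the proposal would then decide whether these hints only reorder wakeups or also shift actual CPU share.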
Monday, 29 November 2010
lrzip-0.543, random fixes
Yep, there are random fixes in this one.
Oh wait, you probably want more detail than that.
Lrzip splits the data into two streams during its rzip stage, one with the "dictionary" parts it ends up finding multiple copies of, and one with the rest. It turns out that during the multithreaded decompression modification I made the naive assumption that there would be only one chunk of stream 0 data per compressed window. Unfortunately, I can't tell what the data is until I've already read it, and throwing lots of chunks into threads to decompress means I can end up with data in a different order to what the rzip re-expansion stage expects. I was loath to modify the lrzip archive format yet again so soon after changing it, so with the help of the only person I could find who could reproduce a decompression problem, I worked out how to predict which stream the data came from and modified the decompression stage accordingly. I'm still not happy with it because it still feels fragile, and I need more time to investigate to ensure I'm doing the right thing. This is unlikely to hit anyone decompressing anything less than many, many gigabytes in size (the archive is still fine, and the previous non-multithreaded version should decompress it ok).
Much more importantly for common-case usage, I limited the lrzip compression window in 0.542 to only 300MB because it seemed to break with larger windows on 32 bit. Unfortunately, I applied the limit -before- chopping the window up into smaller chunks for multithreading, and for all architectures, not just 32 bit. I've fixed that, which means the default compression window should be significantly larger in 0.543.
Finally, it was clear that the rate-limiting step in multithreaded workloads on extra large files was the rzip stage, and since the compression/decompression can now begin as a separate thread to the rzip component, I made the rzip process run at a lower 'nice' level than the back end threads, which afforded a small speed up too.
I still have more planned for lrzip, but I was hoping to take a small break to do other things now that it's more or less stable. In the meantime, apparently when it's run on a system with uclibc instead of glibc, sysconf() fails to report ram properly, which needs fixing. That should be easy enough to work around, but is a little disappointing from uclibc. The other planned changes involve committing a more robust fix for archives with multiple stream 0 chunks, and more multithreading improvements. Currently one chunk at a time is compressed or decompressed concurrently with the rzip preprocessing/expansion stage, and then all the threads need to complete before moving onto the next chunk. The next logical change is to move the threading even higher up the chain, so multiple chunks can be processed concurrently, keeping all CPUs busy at all times. This last change is unlikely to make a huge difference on default settings, but should speed up zpaq based de/compression even more when a file is large enough (bigger than available ram).
Oh and working on OSX still needs fixing (see previous post about named semaphores).
Monday, 22 November 2010
lrzip-0.542, mmap windows and overcommit
I started out this blog entry with a lot more about the tty patch thingy and then decided against it. Suffice it to say I don't like heuristics being special-cased into the scheduler.
So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because, after fixing it, a very large file took 1/4 of the time it had been taking to compress. Overall, sliding mmap has become a lot faster and more useful.
I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop: the default windows take 20 minutes to compress it, while unlimited sized windows with sliding mmap took 80 minutes. Not bad given how much more compression you end up with.
Now, because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo), I wanted to check what happens with memory allocation on linux, and just how much the virtual memory subsystem will happily give you, since lrzip tests before it allocates a window. With a file-backed mmap, it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. This is a disaster: imagine reading that file linearly into ram as lrzip works on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram used), everything already read becomes a complete waste, and the kernel only pretends to cache the rest. It never has a useful amount in ram at any one time, since it keeps dropping everything behind it. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip worked exactly that way, using whatever mmap would give it.
So I spent days trying various window sizes to find what would work out optimal (with and without swap) and kept coming up with the same answer: a main buffer of ~2/3 of ram gives the best speed. That left me without an answer for what to do with -M mode. After enough playing around I decided to make the sliding mmap mode kick in whenever the compression window is larger than 2/3 of ram, as it speeds things up noticeably compared to just mapping in the whole file and never really caching anything. But I had to pick some upper limit on how big a window -M would work with, and 1.5 times the ram size worked out to be a reasonable compromise between how much it would slow down and how much it would hit swap.
The conclusion, by the way, was that lrzip can hit swap really hard... and it's not as if swap speeds anything up or allows any more ram to be used compared to just disabling it. The only time swap made a difference in testing was in how much anonymous ram could be mmapped (as in plain free ram, versus caching a file that's on disk). The linux vm lets you allocate pretty much everything you have as real ram plus swap. You should have seen what the desktop was like after it finished compressing a huge file - clicking on anything took ages, since pretty much every useful application was sitting in swap instead of ram. I had to test with a real hard drive, since my main desktop now uses an SSD, to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure, be vigilant with backups and so on, since people don't yet trust them.)
Despite choosing 2/3 of ram as the amount to allocate, lrzip still tests that the allocation won't fail, and decreases the amount until it succeeds. It's rather tricky making sure there will be enough ram for all the compression components: enough to cache the file on disk, enough for 2 compression streams possibly as large as the first allocation, and enough to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect most things in ram to be evicted (all for a good cause, of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use sliding mmap for anything bigger than 700MB there, and that's made a big difference to how effectively a 32 bit machine can compress large files too.
Lrzip has been building successfully, on and off, on mac osx for a while now. Every 2nd or 3rd release I break it, since I can't test it myself, and it turns out I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out mac osx just doesn't support them: it builds fine, but when you run it, it says "function not implemented". That's pretty sad... Named semaphores are much clunkier, and according to the manpage, "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure whether that's adhered to, but it's rather awkward if you don't clean up after yourself when your program aborts. At some stage, if I can find the enthusiasm, I might try to get multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or by using them selectively on osx.
All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.
Lrzip project page on freshmeat
EDIT: I keep getting asked what the big deal is with 32 bits and lrzip, since 32 bit machines with PAE should be able to address up to 64GB of ram. That may well be the case, but malloc and mmap in 32 bit userspace take a size_t, which is the width of a long: 4 bytes on 32 bits and 8 bytes on 64 bits, and in practice a single allocation is limited to a signed value of that width. So the most you can address in one malloc with 32 bit userspace is 31 bits' worth, or a bit over 2GB.