Wednesday, 17 November 2010

Create task groups by TTY comment

I've had every man and his dog either drop into IRC or email me asking me what my thoughts are on the grouping tasks by tty layer patch discussed here: Phoronix link and slashdot. I guess people didn't understand my 2.6.36-ck1 announcement clearly enough, so I'll quote it again here:

Those following the development of the patches for interactivity at massive
load, I have COMPLETELY DROPPED them as they introduce regressions at normal
workloads, and I cannot under any circumstances approve changes to improve
behaviour at ridiculous workloads which affect regular ones. I still see
precisely zero point at optimising for absurd workloads. Proving how many
un-niced jobs you can throw at your kernel compiles is not a measure of one's
prowess. It is just a mindless test. 

Remember, I already had developed a hierarchical tree-based penalty patch for BFS and blogged about it here. I can do it in a 10 line patch for BFS, but it introduced regressions, which is why I dropped it (see earlier blog entry here: further-updates-on-hierarchical-tree).

Again, I can't for the life of me see why you'd optimise for make -j64 on a quad core machine. It is one workload, unique to people who compile all the time, but done in a way you wouldn't normally do it anyway. It is not going to magically make anything else better. If for some god-forsaken reason you wanted to do that, you could already do that with nice, or even better, by running it SCHED_IDLEPRIO.

nice -n 19 make -j64 blahblah
or
schedtool -D -e make -j64 blahblah

It's not really that hard folks...

And if you really really really still want the feature for BFS: the patch that does the hierarchical tree based penalty is rolled into a bigger patch (so a lot more than just the 10 lines I mentioned) that can also group threads and enable/disable the features, and it's still here:

bfs357-penalise_fork_depth_account_threads.patch

It is worth noting also that the mainline approach costs you in throughput, whereas this patch is virtually free.

EDIT: I forgot to mention that for YEARS now I've been using my "toolsched" wrapper scripts that do this automatically. See toolsched for the scripts. Make always starts as SCHED_IDLEPRIO for me at home.

lrzip-0.540 and multithreaded decompression.

Unable to put it down, I tackled the last big thing I had planned for lrzip in the short term: multithreaded decompression. In a similar fashion to how I tackled compression (but in reverse, obviously), I modified it to take the file and hand out chunks to as many threads as there are CPUs, letting them all decompress their chunks independently. Then, as soon as the first thread completes, it hands its buffer over to the rzip stage and makes itself available to take on more chunks, and so on.

This technique needs a file that was compressed with enough chunks in the first place. The absolute minimum any file will have is two chunks, due to the 2 streams rzip always uses, but it will obviously scale best with files compressed by the multithreaded version.

So how does it perform? Well much like the compression side, the slower the backend compressor in use, the better the speedup with more CPUs.

Going to my regular benchmark, here are the before and after effects on decompression on quad core:

Options   Singlethread  Multithread
lrzip      4m32s         3m07s
lrzip -M   4m05s         2m50s
lrzip -l   3m12s         2m23s
lrzip -lM  2m57s         2m20s
lrzip -zM  04h08m        72m0s

Note that the -z compress/decompress had slightly braindamaged screen output in v0.530, which also affected this benchmark, so it didn't perform quite as well as it will in v0.540 now that that's been fixed. Clearly, though, it's massively faster.

Summary: Kick arse.

So I'm very pleased with the performance of lrzip on SMP now. With the help of a friend online who had access to a 48 CPU machine (no he wasn't running BFS :\), we tried various files to see how much we could get lrzip to scale. To actually use all the CPUs, we needed a file large enough that had enough work to distribute to most of the CPUs, and then keep them busy for long enough to show the effect of that scaling.

lrzip -z linux-2.6.36.tar
real 1m6.552s
user 17m16.660s

So this ran about 17x faster than it would have run single threaded, but still wasn't large enough.

The kde source fit that reasonably well, at 1967093760 bytes.

lrzip -z kde.tar
real 4m33.285s
user 92m26.650s

This one ran over 20x faster than it would have run single threaded. I don't know what the upper limit will be, but clearly this has been a massive improvement in performance, and brings zpaq into usable speeds with enough cores.

Get it here (too lazy to post on freshmeat right now): lrzip.kolivas.org

Saturday, 13 November 2010

lrzip-0.530 and multithreading speed-ups.

As planned, I finally managed to thread the compression phase of lrzip to parallelise it as much as possible. Originally it fed data into a buffer which rzip acted on, and when that was full it was handed over to the backend compressor, all of which were single threaded except for lzma, which didn't even scale beyond 2 CPUs. The implementation in the latest released version of lrzip, 0.530, does it the way I mentioned in my previous blog post.

First it feeds data into the rzip buffer which preprocesses data until it has enough to pass onto a compression thread. Then it hands the data to a compression thread and continues reading more data and lets rzip work on it while the compression thread is doing the 2nd phase compression in the background. Once rzip has enough data for another thread, it spawns another thread and so on, until there are as many threads as CPUs, and then keeps reading until the first thread is free and reuses it, and so on.

Well the results of this are about as good as I could have hoped for. While the faster lzo compression backend only gains a small speedup, the slower the backend, the bigger the speedup. It becomes impressive once zpaq is in use, where I was able to get a 4x speedup on a quad core. That makes lrzip with zpaq almost as fast as regular xz! However, since zpaq takes just as long to decompress as it does to compress, and I haven't threaded the decompression phase, it ends up taking 4x longer to decompress than it did to compress (grin). So zpaq isn't -quite- at the usable stage just yet, but it may well be in the near future.

So what everyone has been waiting for (all 3 of you), benchmarks! 10GB virtual image file being compressed on a quad core 3GHz from an SSD.

Compression Size        Compress  Decompress
None        10737418240
gzip        2772899756   05m47s    2m46s
bzip2       2704781700   16m15s    6m19s
xz          2272322208   50m58s    3m52s
7z          2242897134   26m36s    5m41s
lrzip       1299228155   16m12s    4m32s
lrzip -M    1079682231   12m03s    4m05s
lrzip -l    1754694010   05m30s    3m12s
lrzip -lM   1414958844   05m15s    2m57s
lrzip -zM   1066902006   71m20s    04h08m

Kick arse.

Get it here, and remember freshmeat may not have updated their download links yet:
lrzip

Next stop, to parallelise the decompression phase. I doubt anything but zpaq will really benefit from this, but it would be great to have a zpaq based compression format that is useably fast.

Thursday, 11 November 2010

lrzip scalability improvements

I've been looking at the benchmarks I've accumulated for lrzip's performance and have been mostly pleased. The speed/size performance, or pure size performance, is very good compared to other compression algorithms, but not much has been done to date to make it scale on SMP. It scales slightly during lzma compression, but even there it's via a clunky threading library that pushes CPU usage to a bit over 2 CPUs for about 50% better performance.

So I've set out to make significant improvements to the scalability of lrzip on SMP, knowing it will slightly adversely affect compression. There are two stages to compression in lrzip: the preprocessing rzip stage and the back end 2nd stage compression. Improving the scalability of the first stage looks highly unlikely (I'll never admit to anything being impossible). However, if I split up the chunks handed to the back end compressor and spread them over multiple threads, it should scale better. So the first step was to do precisely that: grab all the data from rzip, split it into a fixed number of blocks and use every CPU available, provided each block is at least a minimum size to preserve compression efficiency. I've done precisely that and committed it to the git tree:

lrzip repo

Now it would be nice if this made the backend compression 4x faster on quad core, but there are a number of reasons this doesn't happen. Mainly it's because the threads are all competing for cache and memory bandwidth, and (less significantly) the blocks end up compressing to different degrees, so not all threads finish at the same time. Still, the classic 23 minute 10GB benchmark dropped to 18 minutes, which is a fairly significant improvement given that the previous lzma compression was already lightly threaded. The file was only marginally bigger so it's definitely a win.

Next up is to try and break up the rzip stream and hand it to the compression back end before the stream has even finished, to parallelise the rzip phase with the compression phase. This will not likely get to use *every* CPU at the same time (except in the case of zpaq), but will actually be faster than the previous approach. This work is much more invasive than the previous one and is currently underway.

After that I'm going to look at decompression and try and parallelise the back end decompression phase. While decompression is already quite fast, who can ever complain about more speed?

All in all this has been loads of fun to hack on. I never thought I'd say this, but maybe it would be nice to have a sea change and get paid to hack part-time and do medicine only part-time.

Monday, 8 November 2010

lrzip-0.520 and massive kernel compression experiment.

So it's another day and it's time for a new version. Actually version 0.520 is the same as version 0.5.2; it's just that it was pointed out to me that distros (and github and most everything else) don't really like 3 point numbering schemes. So I've gone back to 2 point version numbering, and this is otherwise the same code. I'm not planning any more changes in the immediate future, so hopefully this will calm down for a while and give people a chance to actually use it.

Anyway I had a bit of fun with testing out lrzip at the extremes of what it does well and ended up posting my results to the linux kernel mailing list since it involved linux tarballs. Here is a copy of the email as I sent it.

---

Let's do this backwards. Data first.


tarball of every 3 point linux kernel from 2.6.0 to 2.6.36

9748285440 linux-2.6.0-2.6.36.tar


compressed file

136516013 linux-2.6.0-2.6.36.tar.lrz


Compressed size: 1.4%


Compression time:

00:19:22.086


Decompression time:

00:03:02.564


lrzip.kolivas.org


Now for the introduction

Lrzip is a compression application I've been working on that is based on the
original excellent application rzip by Andrew Tridgell. Rzip worked by having
a preprocessor stage with a massive compression window up to 900MB for long
distance redundancy compaction and then compressing the data with bz2. Lrzip
was a project I began based on rzip which tried to extend the initial
compression window beyond 900MB and to use lzma as the back end for the 2nd
stage compression. The idea was that as file sizes got bigger, and machines
had more ram, it would keep getting more and more useful.

After many iterations on lrzip, I've been able to significantly expand on this
idea and make it address 64 bit sized files, ram sizes, and bigger windows.
Previously the limit was based on available ram and how much of the file being
compressed could be mmapped. The current version (0.5.2) is able to use
unlimited sized compression windows for the preprocessor stage using two
sliding mmap windows I invented to match the hash search algorithm of rzip
which can go beyond the size of the ram available. Basically, the larger the
file and the more redundant information in it, the more compression you can
achieve.

Anyway the relevance of this to kernel folk is that the linux kernel is a
perfect test case for long distance redundancy when multiple kernel trees are
compressed.

So here's the lowdown. All of these results were conducted on a 3GHz quad core
64 bit machine with 8GB ram on a 160GB SSD so hard drive speed is a
(relatively) insignificant part of the times.


Starting with linux kernel 2.6.36

413511680 linux-2.6.36.tar
70277083 linux-2.6.36.tar.bz2
59905365 linux-2.6.36.tar.lrz

Compression:
Compression Ratio: 6.903. Average Compression Speed: 1.582MB/s.
Total time: 00:04:08.896

Decompression:
Average DeCompression Speed: 17.130MB/s
Total time: 00:00:22.557


So trying my best to show off what lrzip is good at, I grabbed every 3 point
2.6 linux kernel release. That's every release from 2.6.0 to 2.6.36 as a
tarball, not as patches. It came to a 9.1GB file. Now each kernel tree is
going to be repeating an -awful- lot of data, but given that they have gotten
progressively larger, and span gigabytes, normal compression doesn't really
offer much. This is where the rzip compaction stage over the full size of the
file comes in handy. The more redundant information there is over large
distances, the better. [insert joke about the linux kernel and redundant
information here]. (-MU means Maximum Unlimited; maximum ram and unlimited
window size).


9748285440 linux-2.6.0-2.6.36.tar

lrzip -MU linux-2.6.0-2.6.36.tar

136516013 linux-2.6.0-2.6.36.tar.lrz

Compression:
Compression Ratio: 71.408. Average Compression Speed: 8.000MB/s.
Total time: 00:19:22.086

That's 9.1GB to 130MB (1.4%) in 20 minutes.

Decompression:
Average DeCompression Speed: 51.359MB/s
Total time: 00:03:02.564


At 130MB, it's small enough for me to even offer the entire 3 point stable
release for download from my piddly little server. So here's the file, if
anyone wanted to confirm its validity, or just to download them all :)

linux-2.6.0-2.6.36.tar.lrz


lrzip also has an lzo and zpaq backend for those who want to extract every
last drop of compression out of it at the cost of a bit more time. zpaq is
EIGHT TIMES slower during the backend compression phase compared to lzma
whereas lzo is almost instant after the rzip phase. Note that this file loses
most of its size during the preprocessor stage so it doesn't end up taking 8x
longer to compress with zpaq:


lrzip -MUz linux-2.6.0-2.6.36.tar

121021214 linux-2.6.0-2.6.36.tar.lrz

Compression:
Compression Ratio: 80.550. Average Compression Speed: 3.041MB/s.
Total time: 00:50:56.537

That's 9.1GB to 115MB (1.2%) in 51 mins.

Decompression time is about 52 minutes (yes it's longer than compression!)


Now, I am NOT proposing lrzip be used as a drop in replacement for the
existing compression formats in use. It does work on stdin/stdout but
inefficiently except on compression from stdin. It does not have any archive
facilities beyond compressing the one file, so it comes with an lrztar and
lrzuntar wrapper to do basic directory archiving. It uses a LOT of ram, and
although it has the ability to use unlimited sized compression windows
unconstrained by ram, the bigger the discrepancy between the file size and the
ram, the slower it will get. (Decompression typically uses 10x less ram than
compression).

Nevertheless, I hope some people with lots of similar files lying around, like
kernel hackers, will find it a useful tool, as will people with
database dumps and the like. I also wonder if features of this compression
approach might be helpful for other projects that deal with huge amounts of
data and have large memory requirements. When I mentioned the sliding mmap
feature online, git using lots of memory during its compression phase was
mentioned, so perhaps there are other uses for the sliding mmap idea as a way
of reducing memory usage. There's some documentation of it in the code in
rzip.c and I wrote briefly about it in my blog: http://ck-hack.blogspot.com

Saturday, 6 November 2010

lrzip-0.5.2 with unlimited size compression windows unconstrained by ram.

After my last blog, all I needed to do was sleep on the "sliding mmap" implementation I had so far to think about how to improve it to make it work in some reasonable time frame instead of 100 to 1000 times slower. Yes, I know lrzip isn't as exciting as BFS to everyone else, but I'm having fun hacking on it.

The hash search in the rzip phase of lrzip moves progressively along the file offset, and the match lengths are up to 65536 bytes long. So I changed my 2 sliding buffers to have the one large one move along with the progressive file offset, and the smaller one to be 65536 bytes long. Well the results were dramatic. Instead of being 100 times slower, my first attempt showed it to be 3 times slower. So I cleaned it up and started benchmarking. Not only was it faster than the earlier implementation, surprisingly, sometimes it's faster to compress with it enabled than not! So it was clear I was onto a winner. I tried a few different memory configuration machines and different file sizes, and I can see that as the amount of redundancy goes up in the original file, and as the file deviates further from the ram size, it slows down more and more, but it's still a usable speed.

Using the classic 10GB virtual image file on an 8GB machine example, here's how it fared with the various options:
Options Size       Compress Decompress
-l      1793312108  5m13s    3m12s
-lM     1413268368  4m18s    2m54s
-lU     1413268368  6m05s    2m54s

Not bad at all really, considering it manages to mmap about 5.5GB and then the other 4.5GB depends on the sliding feature to work.

So I fixed a swag of other things, updated the docs and made all new benchmarks, and decided I couldn't sit on it any longer. At some stage you have to release, so version 0.5.1 is out, with the goodness of unlimited size compression windows, unconstrained by ram.

EDIT: Updated link to 0.5.2
lrzip-0.5.2

Here's some more fun benchmarks:

These are benchmarks performed on a 2.53GHz dual core Intel Core2 with 4GB ram using lrzip v0.5.1. Note that it was running a 32 bit userspace, so only 2GB addressing was possible. However, the benchmark was run with the -U option, allowing the whole file to be treated as one large compression window.

Tarball of 6 consecutive kernel trees, linux-2.6.31 to linux-2.6.36

Compression Size        Percentage Compress Decompress
None        2373713920   100
7z          344088002    14.5       17m26s   1m22s
lrzip -l    223130711    9.4        05m21s   1m01s
lrzip -U    73356070     3.1        08m53s   43s
lrzip -Ul   158851141    6.7        04m31s   35s
lrzip -Uz   62614573     2.6        24m42s   25m30s

Kick arse.

Update: Again I managed to break the Darwin build, this time in 0.5.1. It's a simple missing ; at the end of line 536 in main.c. I'll confirm that it otherwise works on Darwin and do a quick bump to a new version. (edit: done)

EDIT: Just for grins, I tried compressing a 50GB freenet datastore on the 8GB ram machine with just the one window as a test. It succeeded with flying colours in about 2 hours, which is impressive given that the data is of very low compressibility (it's all encrypted) and the window was more than 6x the size of the ram.

Compression Ratio: 1.022. Average Compression Speed: 7.266MB/s.
Total time: 01:59:3563.609

It also decompressed fine:
Total time: 01:02:3566.894

Note that it was on an external USB hard drive which isn't the fastest thing to read/write from and reading and writing 50GB to the same drive probably would take an hour itself, so that's why it takes so long. The compression time frame looks good in light of that :)

Friday, 5 November 2010

Lrzip and sliding mmap

After my last blog I completed work on releasing a new version of lrzip. The most user visible changes would be the new file format which is slightly more compact and a little faster on encoding. Behind the scenes, though, quite a lot of work went into fixing random minor issues, improving the memory allocation code and making the maximum compression windows larger with less chance of failure by falling back as much as possible.

One other major change was to try to implement stdin/stdout properly in lrzip, instead of faking it with temporary files as the last version did. As you may, or more likely may not, be aware, the original rzip that lrzip was based on did not work with stdin/stdout at all. It was one of the most commonly requested features of lrzip, and I originally implemented it by simply creating a temporary file on read-in and working off that, and on compression writing to a temporary file and then dumping it to stdout once complete. I've been thinking for a long time about how I could do it without the temporary files, as there are certain limitations. As far as stdin goes: on compression, the file being compressed needs to be mmappable, and on decompression, the compressed file needs to be seekable. Neither is possible with stdin. As far as stdout goes: the output file needs to be seekable/readable while it's being written, and on compression the header needs to be written at the start of the file. Neither is possible with stdout.

After much fluffing around, the best I could do was to implement stdin without a temporary file. As lrzip works off an mmapped file, I implemented it with an anonymous mapping that reads as much as it can from stdin before working with it. The problem with this is we can't know how big the file will be as it's being read, so the anonymous mapping is made as big as is possible. There is some slight compression lost by doing this as one cannot map any more than how much real ram can be allocated, and the block headers need to be generous in predicting how long the next data block will be when they're written, but these are not major problems.

After releasing version 0.50, it didn't take long before someone spotted a bug to do with building it on OS X. It turns out that OS X doesn't have mremap implemented. So with the help of a few people online who had OS X, we got to fixing the build problem, and I also noticed that due to a wrongly made workaround for OS X not having fmemopen, even the linux builds were getting the fake fmemopen, so I fixed that and committed it to the git repo. Not enough for a new version yet.

Anyway, all of that got me thinking about how to manage the whole compression-window-failure issue, since I'd started implementing things to work around failing to allocate enough ram. Up to version 0.5, a compression window was "chosen" by lrzip based on a 2/3-of-ram rule, or set by the user. This was rather prone to failure for numerous obvious reasons, not least that further ram needs to be allocated when the backend compression phase kicks in.

When I first started playing with lrzip, one thing struck me. The major advantage of the rzip first stage is proportional to the size of the compression window, and that window is limited by how much of a file you can mmap. Now, with linux using virtual addressing and being an aggressive overcommitter, it turns out that can be quite a lot, but it's still limited by how much real ram you have. Also, if you dedicate all your ram to the first stage, you're left with not much ram for the backend compressor, and lzma benefits quite a bit from lots of ram too. So a few years back I half-heartedly mentioned the idea of a "sliding mmap" which could make it look like you have the whole file mmapped, yet did not need your entire ram dedicated to it, as the "window" over the part of the buffer you were currently accessing would simply move over that part of the file, and you'd just have a virtual offset into one massive buffer.

So I set out to build my sliding mmap. After screwing around for many hours just to sort out pointers and variable types and all sorts of other nonsense that I'm not much good with, I finally had something that started taking shape. The first thing I experimented with was one large buffer that would move as per my original concept above. Well, this would be fine if you were mostly reading linearly from a file, but the hash search of a huge compression window does NOT do that. It reads from one end to the other in 2 sort-of streams at a time, so there were literally billions of mmaps/munmaps occurring on a large file (since that's what I was designing it for) and almost no progress was being made. So then I experimented with splitting the buffer into two half-sized ones and moving just the top one. Not really any better. Then I created a mini algorithm for sliding both of those two buffers up and down. Still no better. I tried tossing the mmap out completely and using single reads from disk per byte, either via pread or fread. That was even worse. I tried mallocing just one page and moving that, and it was marginally better. I kept hoping I'd make progress, as I'm nothing if not persistent :P Finally I found what worked best was to keep as much ram mapped as possible, since this is fast, and have a one page sized buffer (4096 bytes) move up and down once we read beyond the base mmapped buffer.

How well does it perform? Well somewhere between 100 and 1000 times slower than a real mmap once reading beyond the initial buffer, but just as fast during that initial buffer. So if you can manage to fit most of a file into ram, this will get you the rest of the way. It takes away one of the major advantages of the rzip preprocessor stage, being that it's very fast, but then again lrzip does also do the ultra slow thing when in zpaq mode. I have ideas for how I might be able to tweak it further, so we'll see :) It was fun to implement, and it's nice to know that I can claim that lrzip has "unlimited sized compression windows" now. I've since committed that to the git repo.

Probably more useful than that feature is that I put a lot more effort into maximising the size of windows by throwing away the "guessing" of how big a window it can use. Now it just allocates as much as it can through testing, and then does the same in the backend compression phase so it won't run out of ram later on. Simply put, the larger the window for the preprocessor stage, the better, and if it's as big as the file, better still, and this does its best to make it as big as possible (with -M).

So here's some numbers, choosing what lrzip is best at; a 10GB virtual image (so lots of redundant data over a big file) being compressed on a machine with 8GB ram on an SSD drive (this is an excerpt from the benchmarks on my site):

Compression Size      Compress   Decompress 
None      10737418240 
gzip      2772899756  05m47.35s  2m46.77s
bzip2     2704781700  20m34.26s  7m51.36s
xz        2272322208  58m26.82s  4m46.15s
7z        2242897134  29m28.15s  6m35.95s
lrzip     1354237684  29m13.40s  6m55.44s
lrzip -M  1079528708  23m44.22s  4m05.46s
lrzip -l  1793312108  05m13.24s  3m12.88s
lrzip -lM 1413268368  04m18.33s  2m54.65s
lrzip -z  1299844906  04h32m14s  04h33m
lrzip -zM 1066902006  04h07m14s  04h08m

-l mode is using lzo as the backend
-M mode uses the maximum possible ram. In this case, lrzip succeeded in fitting the entire 10GB file into one compression window.
I'd love to tell you how well zpaq fared, but it hasn't finished yet since it's so slow :P It used to take around 4 hours to do this benchmark. I'll update this post tomorrow with the results. (edit: done)

Hopefully I can finish cleaning a few more things up and release version 0.5.1 soon with these changes, but I need to go to bed. They're all committed into the git repo already.

lrzip.kolivas.org

UPDATE: I've played with the sliding mmap implementation some more, in line with how lrzip works, and found that if I make the smaller buffer 64k it matches the size of the search function, and if I slide the larger buffer along with the main hash search it makes for a MASSIVE speed improvement over the earlier implementation. It is in fact a usable speed now: I just tested a 5GB file on a laptop that could normally only use a 2GB window, but with unlimited mode it had an effective 5GB window, and it completed in 15 minutes, which is very reasonable! It decompressed in under 4 minutes. The changes have been committed to the git repo.

Sunday, 31 October 2010

On lrzip and packing bytes

As some of you may be aware, I maintain a compression application called lrzip, an evolution of Tridge's rzip, which does a long-distance redundancy pre-processing pass and then compresses the data with bzip2. Lrzip does the compression with a selection of backend compressors, using lzma by default but offering extremely fast with lzo or extremely slow with zpaq. The other thing lrzip does differently is that I updated all the code to work properly on 64 bit machines with super large sizes. The idea behind lrzip is that as machines get more ram and file sizes get bigger, the compression it is capable of achieving will keep getting better, as it uses compression windows limited in size only by available ram. It is not a complex, mature application with lots of features like 7z; it simply offers solid compression at good speeds.

See more about it with benchmarks and other advertising blurbs here:
lrzip.kolivas.org

Anyway when I updated it to 64 bit, my conversion of the code was a little inefficient, simply changing all the offsets in the pre-processor stage from 16 and 32 bit variables to 64 bit ones. This lost some of the efficiency of the compression because suddenly 2 and 4 byte entries were now 8 bytes long. However, since the 8 bytes were mostly zeroes, they would compress quite well since the backend compression pass was done next. Still, it was clear this was inefficient. So I did some experimenting.

The first thing I did was to base the byte width on the file size of the archive, thus allowing very small archives to have even less than 4 byte wide entries, and most archives would likely be 4-5 bytes wide. 5 bytes gives us 40 bits or files up to 1,099,511,627,776 in size. The compression gains were small but very real, and given that with zpaq compression lrzip's overall compression is rather extreme, any gains are worthwhile chasing.

The next thing I tried was packing bits. As most offsets are likely to be a lot less than the full file size, it is possible to make each entry only as many bytes wide as the data itself requires. To do this, I encoded how many bytes wide the data was within the data itself, by adding 3 bits to it. Having only 61 bits of a 64 bit variable left for the data itself is not a problem, since a 61 bit size covers files up to 2,305,843,009,213,693,952 bytes.

Here's a rundown of how the code looked for encode/decode:
static inline void put_vbytes(void *ss, int stream, i64 s)
{
 uchar encoded_bytes;
 int bytes, bits = 0;

 while (s > (1LL << bits))
  bits++;
 /* Add one bit for bit 0, and 3 bits for the encoded number of bytes */
 bits += 4;

 bytes = bits / 8;
 /* Round up to byte sized */
 if (bits % 8)
  bytes++;

 /* Always assume there will be one byte, so we only need 3 bits to count
  * the remaining 7 possible. Left shift s by 3 bits and then encode the
  * 3 bits of remaining bytes into the right-most bits of s that will be
  * written thus making them first */
 encoded_bytes = bytes - 1;
 s <<= 3;
 s |= (i64)encoded_bytes;
 put_vchars(ss, stream, s, bytes);
}


static inline i64 read_vbytes(void *ss, int stream)
{
	uchar encoded_bytes, sb = read_u8(ss, stream);
	i64 s;
	int bytes;

	/* Grab the right most 3 bits. It has encoded how many remaining
	 * bytes this entry is */
	encoded_bytes = sb & 7;
	s = sb;
	for (bytes = 1; bytes <= encoded_bytes; bytes++) {
		int bits = bytes * 8;

		sb = read_u8(ss, stream);
		s |= (i64)sb << bits;
	}
	s >>= 3;
	return s;
}
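
For anyone wanting to sanity-check the scheme in isolation, here is the same encode/decode logic rewritten against a plain in-memory buffer. The `struct buf` helper is my stand-in, not lrzip's stream layer (lrzip's put_vchars/read_u8 do the actual I/O), and it writes least significant byte first as the comments above describe:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t i64;
typedef unsigned char uchar;

/* Toy stand-in for lrzip's stream: a flat byte buffer with a cursor. */
struct buf {
	uchar data[32];
	int pos;
};

static void put_vbytes(struct buf *b, i64 s)
{
	uchar encoded_bytes;
	int bytes, bits = 0;

	while (s > (1LL << bits))
		bits++;
	/* One bit for bit 0, and 3 bits for the encoded number of bytes */
	bits += 4;

	bytes = bits / 8;
	if (bits % 8)
		bytes++;

	/* Tag the low 3 bits with the count of bytes beyond the first,
	 * then write the value out least significant byte first. */
	encoded_bytes = bytes - 1;
	s = (s << 3) | encoded_bytes;
	while (bytes--) {
		b->data[b->pos++] = s & 0xFF;
		s >>= 8;
	}
}

static i64 read_vbytes(struct buf *b)
{
	uchar sb = b->data[b->pos++];
	uchar encoded_bytes = sb & 7;	/* remaining byte count */
	i64 s = sb;
	int bytes;

	for (bytes = 1; bytes <= encoded_bytes; bytes++)
		s |= (i64)b->data[b->pos++] << (bytes * 8);
	return s >> 3;
}
```

Round-tripping a mix of values (zero, small offsets, multi-gigabyte offsets) through these two functions confirms the packing itself is lossless; as described below, it was the backend compressor, not correctness, that made it a wash.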

Anyway, after a lot of fluffing around to actually make this code work (I had made numerous mistakes getting this far), I finally got around to testing it. The results of the pre-processing stage were very promising, with the data getting a lot smaller. Then after the backend compression stage kicked in, the final compressed file size was almost the same as it was with just the 8 byte wide entries already in use in the current lrzip version 0.47 release. I sat there looking at it thinking "well that was a complete waste of time". After thinking about it a little more, I realised that this bit-packing I was doing is basically just another form of compression, so it was making the compression stage less efficient. Incidentally, I believe some file formats and compression formats use a similar form of bit packing.

So I'll be throwing away the bit-packing idea and going back to the fixed width approach, since the final compressed size will be smaller that way. However I wanted to do something useful with the code since I made it, so I figured I'd put it on this blog before deleting it from the actual source :) So it's back to fixed byte width coding, hopefully with a new version of lrzip not too far away.

Friday, 29 October 2010

2.6.36-ck2

So the one line bug in BFS 357 is big enough that it may affect anyone with a laptop containing Intel wireless or similar hardware. How do I know this? Well I hit it on my own laptop, on suspend, after about 30 suspend to ram cycles. Nothing better than a big fat OOPS on your own hardware to make you feel obliged to update your own code. So given that 2.6.36-ck1 is barely one week old, and 2.6.36 is likely to be the "stable" point release for 3 months, I've decided to release 2.6.36-ck2 with just one line of code changed. As per my last blog entry, I've just removed the BUG_ON which would cause an oops when BFS 357 is run on a 2.6.36 kernel. If you're not hitting this bug, there is no point whatsoever in upgrading from ck1 to ck2. For my own internal testing I'm using a WARN_ON_ONCE in place of the BUG_ON, to see if there's something meaningful in the bug that points to a mainline problem, but it's safe to just remove this OOPS for everyone else.

Get it here either as a full patch or split out patches:
2.6.36-ck2


I've also updated the BFS 357 patch on my website with the one line fix, and given that it's still the same BFS otherwise, all I've done is change the filename rather than bump the BFS version number.

Get it here:
2.6.36-sched-bfs-357-1.patch


On an unrelated note, I finally got off my arse and fixed the long-standing lrzip install bug when DESTDIR was set, which was a two line fix reported by about a billion different people. I bumped the version up to 0.47 without any other changes.

lrzip on freshmeat


I've been looking at the lrzip code recently just for a change of pace. One of the things that bugged me about it is that a while back I upgraded it to be truly 64 bit, in that it accepts 64 bit sized files and can make the most of all the ram on 64 bit machines of any size, but that caused regressions in compressed file sizes and speed. What this means long term is that as file sizes get bigger, and machines get more and more ram, the compression of lrzip will get more and more impressive. However the reason it bugs me is that all that 64 bit addressing costs a lot in space. So I'm now working on a scaled bitsize compression format for the next version, which will only use as many bits as the compression window requires. I've seen only modest improvements so far, but they're worth chasing. More on that front if I make progress soon.

Monday, 25 October 2010

Minor BFS 357 Bug(?) on 2.6.36

I received a bug report from someone running BFS on 2.6.36. They hit this BUG_ON in the new code present only in 2.6.36:

+static void try_to_wake_up_local(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ bool success = false;
+
+ BUG_ON(rq != this_rq());

This looks fairly straightforward, and this code path is only used by the new worker code in 2.6.36. However it shouldn't be hit unless something else is calling this function (indirectly via schedule()) wrongly. Anyway, they hit it, it seems, via the iwlwifi code. No idea how, but it's actually harmless to wake up a task on another runqueue in BFS, so simply removing this BUG_ON fixes it.

Here's a patch to apply to BFS if you're running it on 2.6.36 and run into this bug:
bfs357-worker_fix.patch, which just removes the BUG_ON. However it makes me wonder whether this bug exists in mainline too. Only those who hit it can confirm or deny that by running vanilla 2.6.36, and there's no point reporting it upstream when it's so vague.

Thursday, 21 October 2010

2.6.36-ck1

I'll keep it brief by just quoting the email I sent to lkml, just to get this announce out quickly.

These are patches designed to improve system responsiveness and interactivity
with specific emphasis on the desktop, but suitable to any workload.


Apply to 2.6.36:
patch-2.6.36-ck1.bz2

Broken out tarball:
2.6.36-ck1-broken-out.tar.bz2

Discrete patches:
patches

All -ck patches:
patches


Web:
kernel.kolivas.org

Code blog when I feel like it:
ck-hack.blogspot.com

Each discrete patch contains a brief description of what it does at the top of the patch itself.


The most significant change is an updated BFS cpu scheduler to BFS 357 (Magnum). It should pretty much behave like the older one, but is tighter with respect to keeping to its deadlines, and will continue to behave fairly when load is more than 8 * number of CPUs.

The other addition is to decrease the default dirty_ratio.

The rest is a resync only since 2.6.35-ck1.


Patch series:
2.6.36-sched-bfs-357.patch
sched-add-above-background-load-function.patch
mm-make_swappiness_really_mean_it.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
ck1-version.patch


Those following the development of the patches for interactivity at massive load, I have COMPLETELY DROPPED them as they introduce regressions at normal workloads, and I cannot under any circumstances approve changes to improve behaviour at ridiculous workloads which affect regular ones. I still see precisely zero point at optimising for absurd workloads. Proving how many un-niced jobs you can throw at your kernel compiles is not a measure of one's prowess. It is just a mindless test.


Enjoy!

Monday, 18 October 2010

Other schedulers? Illumos?

I've been asked in past interviews what I thought about schedulers from other operating system kernels that I may have studied, and whether I had any interest in hacking on them. Perhaps it's somewhat of a surprise to learn that until recently, I hadn't really looked at any other operating system's scheduler. Most of what I've come up with in scheduling is really pretty simple stuff for managing the most common scheduling policy, part thought up myself and part based on accounts of various approaches to scheduling I've encountered in my travels. So my approach to things is pretty much "well this hack makes sense", based on some pretty basic principles to do with latency, fairness and CPU throughput.

A couple of days ago, I was asked what I thought it would take to port BFS to the opensolaris spinoff thingy Illumos. I don't pretend to even begin to understand the politics behind things so I have zero informed opinion about the whole Sun thingy. I pretty much had no idea how much effort it would take, but knowing BFS is simple code, I figure that once the kernel interfaces are understood, it should be portable to any other operating system. Of course it would be much easier if it were done by someone who already knew the kernel they were working with rather than starting from scratch as I would be, but then, few people are likely to truly understand my intent with BFS unless they've studied it, no matter how simple the overall policy in BFS is.

With any existing modern kernel scheduler, there's a very good chance that an awful lot of effort has gone into making the scheduler scalable on SMP, and one of the very first things to do there is to make the scheduler have multiple runqueues, one for each CPU, or possibly one per physical package, shared amongst cores (much less common). Then, as CPU architectures become more complex, in having non-uniform memory (NUMA), sharing physical packages (multicore), having multiple threads per core (SMT) and combinations thereof, special cases are made for balancing where tasks go and when, to make the most of these, but usually the special casing is only about improving throughput.

BFS in concept is two distinct entities. The first is avoiding all heuristics for determining what is interactive and what is not, using the simplest possible foreground-background virtual deadline to determine who goes next, and a deterministic round robin system for sharing CPU time between all tasks. This is the overall policy for managing SCHED_NORMAL (also known as SCHED_OTHER in the past), and it is responsible for the good interactivity, low latency and responsiveness. (I've mentioned in many articles previously why I believe all heuristics are bad.) The second entity in BFS is to throw out complex separate runqueue designs and special balancing, and have one runqueue to rule them all. This is a step backwards in possible scalability, and depending on the workload, becomes more and more of an issue the more CPUs you add, but is not relevant on any machine a normal human being can afford for a home desktop, even an extremely powerful one (note: whether this is true 10 years from now remains to be seen, though I still think BFS will be fine up to many threads on multicore).
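
The first entity is small enough to sketch in userspace C: each queued task gets a virtual deadline of "now" plus a nice-level ratio times the round robin interval, and selection is simply a scan of the one queue for the earliest deadline. The names, the ratio values and the O(n) array here are illustrative, not BFS's actual code:

```c
#include <assert.h>
#include <stddef.h>

#define RR_INTERVAL_MS 6ULL	/* illustrative round robin quantum */

struct toy_task {
	unsigned long long deadline;
	int queued;
};

/* On wakeup or requeue: a lower ratio earns an earlier deadline, so
 * higher priority tasks get picked sooner, but every task's deadline
 * eventually becomes the earliest, guaranteeing it runs. */
static void set_deadline(struct toy_task *t, unsigned long long now,
			 unsigned int prio_ratio)
{
	t->deadline = now + prio_ratio * RR_INTERVAL_MS;
	t->queued = 1;
}

/* One global queue, O(n) earliest-deadline scan: no balancing needed
 * because every CPU picks from the same list. */
static struct toy_task *earliest_deadline_task(struct toy_task *q, int n)
{
	struct toy_task *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (q[i].queued && (!best || q[i].deadline < best->deadline))
			best = &q[i];
	return best;
}
```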

So to port BFS to another kernel would be porting two different concepts. The first is simply the policy for managing SCHED_OTHER tasks. This is a fairly portable concept and could likely be leveraged into just about any other scheduler without too much effort, provided someone knows the kernel they're working with. The second part, however, would involve carving out large chunks of the existing complex system and replacing them with a rather simple design. As it is, if you remove the CGROUPS code as well as the CFS code that BFS replaces in the mainline linux kernel, you'd end up with almost 20,000 lines less code. This is pretty drastic by anyone's standards since it replaces the scheduler en bloc. The current BFS patch does not actually remove these lines of code, though, as many people requested I make it a configurable patch at build time. So it compiles out all that code instead of removing it.

Now one could make BFS plug into a "pluggable scheduler", and some of you may remember I posted patches for just such a feature for the linux kernel about 4 years ago. However, "pluggable" is a relative concept, depending on just how much of the scheduler you replace, and BFS replaces most of it. The current mainline kernel has pluggable policies, but isn't really pluggable on such a low level that BFS can happily live beside it (and the truth is that most of us don't "choose" a policy when we run our apps, they all run SCHED_OTHER unless we explicitly set them otherwise). The same goes for any other kernel, and the Illumos one is a perfect example since it has 4 different schedulers "plugged" into it, but probably not pluggable in the sense of "replaceable".

So anyway, where was this post going? Today I decided I'd have a look through the opensolaris code. Not because I particularly want to port BFS to it, but because someone else said they want to do so, and out of morbid curiosity of my own to see what it was like.

The summary of my impression was that I was... surprised. Now I don't claim to be any kind of expert on code per se. I most certainly have ideas, but I just hack together my ideas however I can dream up that they work, and I have basically zero traditional teaching, so you should really take whatever I say about someone else's code with a grain of salt. Well, anyway, the code, as I saw it, was neat. Real neat. Extremely neat. In fact, I found it painful to read after a while. It was so neatly laid out that I found myself admiring it. It seems to have been built like an aircraft. It has everything that opens and shuts, has code for just about everything I've ever seen considered on a scheduler, and it's all laid out in clean code and even comments. It also appears to have been coded with an awful lot of effort to ensure it's robust and measurable, with checking and tracing elements at every corner. I started to feel a little embarrassed by what we have as our own kernel. The more I looked at the code, the more it felt like it pretty much did everything the Linux kernel has been trying to do for ages. Not only that, but it's built like an aircraft, whereas ours looks like a garage job with duct tape by comparison.

As an aside, I did google a few terms they used which I hadn't seen before, and I was more than a little disappointed to find patents regarding the names... Sigh.

Now this would be a great time to take my comments out of context without reading on. The problem is that here was a scheduler that did exactly what I hate about what the Linux kernel scheduler is becoming. It's a monstrosity of epic proportions, and as far as an aircraft goes, it's like taking an Airbus A380 on a short joyride if you're running it on a desktop. It looks like a good, nay, great design for a massive airliner. By looking at it alone, I haven't got the foggiest what it might run like on a desktop. Now since I'm full of opinion and rhetoric and don't ever come through with any substance (maybe writing my own scheduler invalidates that?), I'm going to also say I can't even be bothered trying it, for you to confirm your suspicions about me.

So what do I think of it now? It looks like an excellent design for a completely different purpose. It's built like a commercial design for commercial purposes that have very different requirements than what most of us use Linux for, but it does appear to have been done so very well. It looks like a goddamn Star Destroyer, and the Linux kernel (scheduler) suddenly looks like the Millennium Falcon. Real fast, but held together with duct tape, and ready to explode at any minute. Or to put it another way, my wife used to work for a local automotive manufacturer (in Australia), and brought home the latest (American) Cadillac they were playing with at her work about 10 years ago. When I stepped into it, it looked and felt like a lounge room! It didn't feel at all like stepping into a car. It had everything that opens and shuts, and every control you'd imagine when sitting in your lounge room with your dining setting and entertainment system. Then I drove it around the block, and it had the performance, handling and braking of a lounge room too.

Saturday, 16 October 2010

2.6.36-rc8-ck1

So another week passes and my attempt to minimise my workload by syncing up with the apparently last -rc for 2.6.36 was only a mild failure with a new "release candidate" coming out. (Does anyone else still have a problem with Linus calling his pre-releases "release candidates" any more? It still annoys the hell out of me). The reason it was only a mild failure for me is that the patches from 2.6.36-rc7-ck1 pretty much apply cleanly to 2.6.36-rc8.

So I've resynced all the 2.6.36-rc7-ck1 patches, and added a couple of things.

Firstly, I added a tiny patch which decreases the default dirty_ratio in the vm from 20 to 5. Here is the changelog in the patch:
The default dirty ratio is chosen to be a compromise between throughput and
overall system latency. On a desktop, if an application writes to disk a lot,
that application should be the one to slow down rather than the desktop as a
whole. At higher dirty ratio settings, an application could write a lot to
disk and then happily use lots of CPU time after that while the rest of the
system is busy waiting on that naughty application's disk writes to complete
before anything else can happen.

Lower ratios mean that applications that do a lot of disk writes end up
being responsible for their own actions and they're the ones that slow down
rather than the system in general.

This does decrease overall write throughput slightly, but to the benefit of
the latency of the system as a whole.
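
The same knob is runtime-tunable on any kernel, so you can try the patched default without applying the patch:

```shell
# Show the current value (the mainline default is 20):
sysctl vm.dirty_ratio

# Try the patch's default of 5 until next boot (as root):
sysctl -w vm.dirty_ratio=5

# Or persist it across boots by adding this line to /etc/sysctl.conf:
#   vm.dirty_ratio = 5
```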

The only other changes are to fold the build fixes into BFS, fix minor typos in the documentation of the BFS 357 patch, and add the bfs357-penalise_fork_depth.patch and bfs357-group_thread_accounting.patch patches as separate entities, but DISABLED by default. The effect of these patches has been discussed at great length on this blog before. See the tunables in /proc/sys/kernel to enable them. I'm pretty sure these patches will be dropped for 2.6.36-ck1 final due to the handful of regressions seen to date.

As per last time, the patches themselves are sneakily hidden within .lrz archives which means you'll have to suffer the pain of installing my lrzip application to use them. The patches are available in here: 2.6.36 prerelease patches

Monday, 11 October 2010

Further updates on hierarchical tree-based penalty.

First of all, no new patch today, yay \o/ I know it's sometimes hard to keep up but that's the nature of things. I (sort of kind of hope that I can) promise not to release anything much new any time soon with respect to the hierarchical tree-based penalty code.

I experimented with removing the separation of processes from threads, treating them all as equal, and discovered that this led to real lag in some gui applications, as many of them use threads. So it seems the default of penalising fork depth while not penalising threads works best at biasing CPU distribution towards interactivity and responsiveness (which is the default in the current patch). This is rather ironic, as this code evolved out of an initial attempt to control threads' behaviour in massively threaded applications, yet it turns out that being nice to threaded apps works better for the desktop.

I thought up another way of demonstrating the effect this patch has in a measurable way.

Using a dual core machine as an example, and running the "browser benchmark" at http://service.futuremark.com/peacekeeper/index.action allowed me to show the effect of the gold standard load of make versus the almost universal gui app, a browser.

The benchmark runs a number of different browser based workloads, and gives a score in points, where higher is better.

Running the benchmark under various different loads with the feature enabled / disabled gave me the following results:

Disabled:
Baseline: 2437
make -j2: 1642
make -j24: 208
make -j42: Failed

Enabled:
Baseline: 2437
make -j2: 2293
make -j24: 2187
make -j42: 1626

As can be seen, on the dual core machine, a load of 2 normally makes the benchmark run almost precisely 1/3 slower, as would be expected with BFS' fair CPU distribution of 3 processes between 2 CPUs. Enabling this feature leaves the benchmark almost unaffected at this load, and only once the load is more than 20 times higher does it hinder the benchmark to the same degree. At that load with the feature disabled, the browser just spat out 'a script on this page is causing the browser to run slowly' etc. etc. and virtually gave up.
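
The 1/3 figure is just what fair sharing predicts: the two make jobs plus the browser make three CPU-bound entities on two CPUs, so the browser gets 2/3 of a CPU, giving an expected score of roughly 2437 × 2/3 ≈ 1624, close to the measured 1642. A trivial sanity check:

```c
#include <assert.h>

/* Under fair distribution, each of nrunning CPU-bound entities on
 * ncpus CPUs gets ncpus/nrunning of a CPU, scaling a CPU-bound
 * benchmark score by the same fraction. */
static int fair_share_score(int baseline, int ncpus, int nrunning)
{
	return baseline * ncpus / nrunning;
}
```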


In the last few days, most of the reports have been very positive. However, as expected, not everything is rosy. There have been reports of applications such as mplayer stalling and some people have had gnome applets fail to initialise!? How on earth a scheduling decision about who goes first can cause these is... well it's not a mystery. In my experience it's because some assumption has been made in the userspace application that naively expects a certain behaviour; that being that one process will run first. These sorts of bugs, although likely due to the userspace application itself, make changes of this nature in the scheduler as the default impossible, or at least foolish. So as much as I'd like to see this change go into the next -ck release and be the default, I can't.

The patch will still be around to play with and I rather like it on my own desktop so I'm not throwing it out any time soon. Maybe something else will come of it in the future. But now I can relax and just sync up with mainline again when 2.6.36 final comes out.

Friday, 8 October 2010

Hierarchical tree-based penalty, interactivity at massive load, updated.

I've updated the patch for fork_depth based penalty with some minor tweaks. The default fork depth is now allowed to be zero for init, thus making the base system have no fork depth penalty. Userspace reporting of "PRI" to top, ps etc was modified to make more sense when penalty was enabled. Thread group accounting was fixed to only offset by absolute deadline, instead of already penalised by fork_depth deadline. Updated changelog follows:
Make it possible to have interactivity and responsiveness at very high load
levels by making deadlines offset by the fork depth from init. This has a
similar effect to 'nice'ing loads that are fork heavy. 'make' is a perfect
example of this and will, with fork_depth_penalty enabled, be felt as much
at 'make -j24' as it normally would be with just 'make'.

Note that this drastically affects CPU distribution, and also has the
indirect side effect of partitioning CPU entitlement to different users as
well. No assumption as to CPU distribution should be made based on past
behaviour.

This is achieved by separating out forks to new processes vs new threads.
When a new process is detected, its fork depth is inherited from its parent
across fork() and then is incremented by one. That fork_depth is then used
to cause a relative offset of its deadline.

This feature is enabled in this patch by default and can be optionally
disabled.

Threads are kept at the same fork_depth as their parent process, and can
optionally have their CPU entitlement all managed as one process together
by enabling the group_thread_accounting feature. This feature is disabled
by default in this patch, as many desktop applications such as firefox,
amarok, etc are multithreaded. By disabling this feature and enabling the
fork_depth_penalty feature (default) it favours CPU towards desktop
applications.

Extensive testing is required to ensure this does not cause regressions in
common workloads.

There are two sysctls to enable/disable these features.

They are in /proc/sys/kernel/

group_thread_accounting - groups CPU accounting by threads
fork_depth_penalty - penalises according to depth of forking from init
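
The inheritance rule described above is simple enough to sketch in a few lines of userspace C. The toy types and the offset scaling are mine for illustration, not the patch's actual code:

```c
#include <assert.h>

struct toy_task {
	int fork_depth;
};

/* fork(): a new thread stays at its parent's depth; a new process is
 * one level deeper in the tree rooted at init (depth 0). */
static void inherit_fork_depth(struct toy_task *child,
			       const struct toy_task *parent, int is_thread)
{
	child->fork_depth = parent->fork_depth + (is_thread ? 0 : 1);
}

/* The depth then pushes the virtual deadline back relative to
 * shallower tasks, much like an automatic nice. */
static long long penalised_deadline(long long deadline, int fork_depth,
				    long long offset_per_level)
{
	return deadline + (long long)fork_depth * offset_per_level;
}
```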

An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with offsets:
bfs357-penalise_fork_depth_account_threads.patch

EDIT: Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

EDIT2: I notice some people are trying this patch and the earlier released group as entities patch and trying to compare them. Let me make it clear, this patch REPLACES the group as entities patch and does exactly the same thing, only updated.

I am still after feedback on this approach as for my workloads it's only advantageous so I'd love it if more people would report back their experiences, either in the comments here or via email to me at kernel@kolivas.org . Thanks!

Thursday, 7 October 2010

2.6.36-rc7 with -ck1 and BFS 357

Since I was on the tail end of my hack fest and Linus announced 2.6.36-rc7, saying it was likely the last -rc, I figured it was a good opportunity to sync up my patches with mainline. As always, the porting of BFS brought some unexpected surprises where a simple port would probably work, but likely fail long term. So there were lots of little subtle changes that I had to make to BFS. Functionally this is virtually the same as BFS 357 for 2.6.35.7, apart from some minor tweaks to avoid new warnings. There was one teensy change to niffy_diff to also ensure a minimum difference was observed according to ticks, and the minimum difference was decreased from 1us to anything greater than 0 as the niffy clock may well be updated in less than 1us. One nice thing also came about from the update. I managed to remove some code when I realised the nohz_load_balancer I'd been maintaining in the BFS code was simply me blindly porting it a while back and not even realising what it was for. Of course there is no load balancing on BFS since it has a global runqueue which means all CPUs are always in balance, so there's no need for any special case balancing on nohz configs.

For those who want some overview of what was required to port it, there were some subtle changes to the try_to_wake_up code for notifying when workers are going to sleep with workqueues. Some reshuffling of what happens on context switch was ported. Some sched domains code was updated. rlimit code was tweaked. nohz balancing code was dropped. Checking that apparently idle CPUs were actually online was added to cope with changes on forking idle tasks on .36. And random other stuff I can't remember.

It's worth noting that you'll need a beta driver from nvidia if you're evil like me and use their evil binary drivers. See here: nvnews link for their latest drivers.

Anyway here's a directory that contains lrz compressed versions of all the patches, and an lrz compressed all-inclusive -ck1 and bfs357 patch. It's my secret plan that those wishing to try my pre-release patches must also grab lrzip, which I wrote, to access them :)

http://ck.kolivas.org/patches/2.6/2.6.36/

EDIT2: If you enable schedstats, you will need the patch called 2636rc7ck1-fixes.patch in that directory added to prevent build failures.

Wednesday, 6 October 2010

Hierarchical tree-based penalty.

Further investigation at last reveals why the latest code affects program groups as well as threads, which was my original intention only. The number of tasks running was getting inherited across fork, meaning it was really acting as a branch penalty. That is, the more times a task has forked from init (the further along the branching), the greater the penalty it is receiving to its deadline. So it's a hierarchical tree penalty. While there may be a real case for such a feature, it is not program grouping at all! Now I have to think hard about where to go from here. The thread grouping idea is still a valid idea, and will work with this code corrected. However the tree penalty may be worth pursuing as a separate concept since it had such dramatic effects...

Somehow I found myself doing a lot more hacking than I had intended, and that's also why there's so much blogging going on from someone who hates blogs.

EDIT: Enough talk. Here's some code. Patch to apply to a BFS 357 patched kernel, with two knobs in /proc/sys/kernel/
group_thread_accounting - groups CPU accounting by threads (default off)
fork_depth_penalty - penalises according to depth of forking from init (default on)
(Updated version):
bfs357-penalise_fork_depth_account_threads.patch

Update on scheduling automatically by group, interactive at any load.

As an update to my previous post, the testing so far on my group scheduling patch has been beyond all my expectations. There have been no reports of regressions so far, yet the improvements under load conditions are outstanding. After thinking about the code itself some more, I felt one area needed a minor tweak. Currently when tasks are queued for CPU, they're checked to ensure that their deadline is not far in the future due to having been set when many more threads were running at once, and their deadline is capped to the maximum it possibly could be for that nice level. However, I realised that the cap should be based on the maximum the deadline could have been when it was first set, rather than on the current time. So I've made a minor modification to the patch to offset it from the deadline_niffy.

There has also been a lot of confusion about what this means for throughput and CPU distribution on SMP so I've updated the patch changelog to include verbose examples. What follows is the updated changelog:
Group tasks by their group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it starts or children it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more cpu than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have. The same goes for any application with multiple threads or
processes.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads but I believe will be ideal for the desktop user. Think of it as
implementing everything you want out of a more complex group scheduling
policy containing each application as an entity, but without the overhead or
any input or effort on the user's part.

Note that this does not have any effect on throughput either, unlike other
approaches to decreasing latency at load. Increasing jobs up to number of CPUs
will still increase throughput if they're not competing with other
processes for CPU time.

To demonstrate the effect this will have, let's use the simplest example
of a dual core machine, and one fully CPU bound single threaded workload such
as a video encode that encodes 1000 frames per minute, competing with a 'make'
compilation. Let's assume that 'make -j2' completes in one minute, and
'make -j1' completes in 2 minutes.

Before this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 1.7 minutes, 650 frames are encoded per minute

make -j4 and encode:
make finishes in 1.25 minutes, 400 frames are encoded per minute

make -j24 and encode:
make finishes in 1.04 minutes, 40 frames are encoded per minute

make -j128 and encode:
make finishes in 1.01 minutes, 7 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.03 minutes, 30 frames are encoded per minute


After this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j4 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j24 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j128 and encode:
make finishes in 1.08 minutes, 150 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.06 minutes, 60 frames are encoded per minute

It's worth pointing out that this code is quite safe and stable, and provided no issues show up in testing, it will probably go into the next major revision of BFS. I'm thinking a major version number bump, given all the changes that have happened to 357 and this major change to behaviour. Please try it out and report back!

Patches follow, and I've updated the links in the previous blog post.

EDIT: Updated the tests according to real world testing.

EDIT2: Also, it's been suggested that being able to disable this might be helpful so whenever I get to the next version, I'll add a simple on/off sysctl.

EDIT3: It's also worth pointing out that as well as changing the concept of how CPU is distributed, this will bring some unfairness as a side effect. 'make', for example, will always run like a niced task, because it always consists of 2 or more processes (usually 4). This is actually why it gets 'contained' with this approach; it is not that each individual job is contained as such.

EDIT4: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Tuesday, 5 October 2010

Interactive at any load?

A funny thing happened on the way to creating a patch to test out a feature I've long been thinking about for BFS. I'd always thought it was wrong that massively multithreaded applications (java applications are often the biggest example, such as azureus/vuze) get way too much CPU, because threads are treated no differently to processes on linux. So I'd been thinking for a while about a way to group threads together as a single process for the purpose of CPU distribution. Today I finally decided to toy with an approach that adjusts each thread's deadline (since that's how CPU is distributed on BFS) according to how many active threads its thread group has running at any one time.

In the spirit of BFS, I tried to find the cheapest possible way to do this, and decided to use the thread group_parent to store the number of active threads, and simply extend the deadline for each thread (relative to its own nice level). To my surprise, the group_parent was more than just the parent for threads that shared the same pid. The results were better than I could have possibly imagined. The changelog follows to explain:

Group tasks by their thread_group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of RAM, feel like no more load than make -j1 no matter how
many CPUs you have.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
or threads it forks. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads but I believe will be ideal for the desktop user. Think of it as
implementing everything you want out of the complex CGROUPS, but without the
overhead or any input or effort on the user's part.

So anyway, as you can see, this has a profound and dramatic effect. It has the smoothing effect at high loads that the latency_bias patch had, but without any slowdown whatsoever, nor any loss of throughput. For a while there I had forgotten I had left a make -j128 going, its effects were so insignificant. Of course if you start 128 separate makes instead of one make with 128 jobs, it will bring your machine to a halt much like the current scheduling does. It also means that a single cpuloop application (like 'yes > /dev/null') will get a whole CPU to itself on SMP even if you run your make with -j128. I need feedback and extensive testing on this one, but I'm keeping my fingers crossed and being cautiously optimistic. It will be interesting to see its effect on (regular) audio playback as well.


EDIT!: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Interbench results are now available with and without this patch and the results are dramatic to say the least:

group_interbench_testing.txt

Monday, 4 October 2010

Going Magnum with BFS 357

As some of you may be aware, PCLinuxOS uses BFS by default. It is also a 32-bit distro for maximum compatibility with desktop software. So Texstar, who started PCLinuxOS, was keen to try the newer BFS and built a 2.6.32.x kernel with BFS 350 on 32 bits. He was able to reproduce strange slowdowns and stalls, and reported this back to me. After reviewing my code it was clear I had screwed up with 32-bit builds: having moved to 64-bit niffy counters in BFS 350, I had left a whole lot of ints and longs around that would overflow regularly. The ints would overflow even on 64-bit builds! Texstar has since confirmed for me that fixing the 32-bit variables fixed the problem for him (thanks!).

I've committed those 32-bit fixes, added some more sanity checking on the crazy sched_clock interface (using jiffy difference to determine an upper bound), and added some minor macro cleanups. I've bumped the version number up to 0.357 just because it sounds good. Testing on this has been done on 32 bit, uniprocessor, and an older kernel. Hopefully this means good things for android too!

Changelog follows:
I forgot about an awful lot of longs and ints that will overflow on 32 bit now
with u64 deadlines. Fix them.

Add some macro tidiness.

Make sched_clock sanity checking robust and standardised, using jiffy
difference as upper limit, and use nominal 1us when difference cannot be
trusted.

Go magnum.

I've uploaded a full patch for 2.6.35.7 and an incremental from 350 to 357 and will be uploading patches for older kernels shortly. Grab it now and do your worst!

BFS patches

Hopefully I can take a break from hacking for a bit and get back to my billion other pastimes while this version distils for a while... Then again, a new "stable" kernel will probably be out soon.

UPDATE: I've now uploaded patches for previous kernels as far back as 2.6.31.14.