Thursday, 30 December 2010

BFS for 2.6.37-rc8

So we approach yet another "stable" release with 2.6.37 just around the corner now that -pre8 is out (yes I'm old fashioned, it is still a pre-release and not a release candidate in my eyes). I usually use pre8 as the flag to port BFS to the new kernel so that I have it ready in time for the "stable" 3 point release.

I've ported the current BFS release v0.360 to the latest kernel. Of significance in the new kernel are yet more changes to the way CPU offlining occurs (which happens just before suspend to ram and disk), a new way of accounting CPU time attributed to interrupt handling, and improved reporting of CPU load during idle periods with NOHZ enabled. There are also some other architectural changes with no visible effect. None of these will have any effect on "performance" for the end user, so don't expect any surprises there.

So I've ported BFS with only the changes necessary to get it working on the new kernel. I've not yet added the support for interrupt accounting, and I haven't changed the way total load is calculated on NOHZ configured kernels. The only change of any significance is the way I offline CPUs in the new BFS. I have been fighting with that for a while now since there really is no right way to do it, and it changes so frequently in mainline that I often have trouble keeping up.

For those already on version 0.360 of BFS, the only significant changes can be had now in this patch:
bfs360-test.patch

For the dirty details of the CPU offline code: what happens is that a worker thread runs at ultra high priority on the CPU that is about to be offlined, and halfway through its work it turns that CPU off and then needs to run on another CPU to die itself. Until now, BFS has looked for tasks whose affinity no longer included any alive CPU, and would then be happy to run them anywhere. With this latest bfs360-test.patch it does more of what the mainline kernel does and formally breaks affinity for tasks that no longer have anywhere they're allowed to run. This brings only a negligible improvement in overhead, but the main reason for doing it is to make the offline code more robust.
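
As a rough illustration of that affinity-breaking step (the names and data structures here are mine for the sketch, not the actual BFS code), the logic amounts to something like:

```python
def break_affinity(tasks, online_cpus):
    """Illustrative sketch: for any task whose allowed CPU set no
    longer intersects the alive CPUs, formally reset its affinity
    to the full set of online CPUs so it can run anywhere."""
    for task in tasks:
        if not (task["allowed"] & online_cpus):
            task["allowed"] = set(online_cpus)
    return tasks
```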

For those looking for the first BFS release on 2.6.37-rc8, here's the test patch:
2.6.37-rc8-sched-bfs-362.patch

I've bumped the version number up simply because it includes the test changes above, and has had other architectural changes to stay in sync with the mainline kernel. The main new thing in mainline is a new "class" of realtime scheduling, called the stop class, used only internally by the CPU offlining code. Instead of adding a new scheduling class, BFS simply adds one more realtime priority that can be used by kernel threads and is only ever applied to the CPU offline thread (kmigration). I did this so it adds zero overhead to the existing design, while still supporting the concept of a thread that nothing else can preempt.
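
To sketch the idea (an illustration of the concept only, not the patch itself; the kernel convention that a lower priority number means a higher priority is assumed):

```python
# Normal realtime priorities occupy 0..99; the sketch adds one level
# above them all, given only to the CPU offline thread, so nothing
# can preempt it. All names here are illustrative.
MAX_RT_PRIO = 100
STOP_PRIO = -1  # one extra level that outranks every realtime priority

def preempts(waking_prio, running_prio):
    # lower number = higher priority, so STOP_PRIO preempts everything
    return waking_prio < running_prio
```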

Almost certainly I will be adding more code to the final version of BFS that I use when 2.6.37 comes out, to support the new interrupt accounting code and the global load reported, but these will only have cosmetic effects. This patch has only been lightly tested and not compile tested for multiple configurations just yet, but seems to be working nicely. For now you can grab it here:
2.6.37-rc8-sched-bfs-362.patch

Happy New Year everyone :)

Friday, 24 December 2010

Queuing theory and BFS

Damentz pointed me on IRC to the queuing theory video that was linked from a slashdot article.

For those who haven't seen it, the video is here:
Why the other line is likely to move faster

Anyway, the relevance of that video is that BFS uses a single queue, whereas the mainline Linux kernel uses a multiple queue design. The people are tasks, and the checkouts are CPUs. Of course there's a lot more to a CPU scheduler than just the queue design, but I thought this video was very relevant :)
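
To make the analogy concrete, here is a toy comparison of the two queue designs with deterministic, made-up numbers - people are tasks, checkouts are CPUs:

```python
# Toy illustration only, not a scheduler model: with one shared
# queue, no checkout (CPU) sits idle while anyone is waiting; with
# per-CPU queues, an unlucky queue can back up behind a long job
# while another queue empties.
def single_queue_wait(jobs, ncpu):
    free_at = [0.0] * ncpu              # when each CPU next becomes free
    total_wait = 0.0
    for arrive, service in jobs:
        cpu = min(range(ncpu), key=lambda c: free_at[c])
        start = max(arrive, free_at[cpu])
        total_wait += start - arrive
        free_at[cpu] = start + service
    return total_wait

def per_cpu_queue_wait(jobs, ncpu):
    free_at = [0.0] * ncpu
    total_wait = 0.0
    for i, (arrive, service) in enumerate(jobs):
        cpu = i % ncpu                  # tasks stuck on the queue they landed in
        start = max(arrive, free_at[cpu])
        total_wait += start - arrive
        free_at[cpu] = start + service
    return total_wait
```

With four jobs of service times 10, 1, 1, 1 arriving together on 2 CPUs, the short jobs drain through the free CPU in the single-queue case, while round-robin placement strands one of them behind the long job.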

Merry Christmas everyone.

Tuesday, 14 December 2010

BFS version 0.360

There's nothing better to help you work out how something should work on certain hardware than having it yourself. I was recently fortunate enough to get my hands on an i7 based laptop. While powerful for a laptop, a 2.67GHz dual core, quad thread CPU is hardly a particularly powerful i7 (except in a laptop), but it nonetheless has a topology I've not had before. After installing a -ck kernel on it, I benchmarked how BFS performed on it, and was a little disappointed. At load == number of CPUs it performed well, and with a single threaded job it was good too, but it performed poorly compared to mainline at half load (so 2 jobs on 4 logical threads). I might point out there's nothing quite like having something to compare against to give you a yardstick to work from. It worked both ways in this scenario: mainline has had BFS as a reference against which regressions/improvements were found, and in this case I found that BFS wasn't up to scratch. So I set out to find and fix it.

BFS doesn't really use the notion of "balancing" to move tasks from one CPU to the next based on some complex heuristic algorithm over time, but instead decides, whenever a task wakes up, which CPU anywhere is the most suitable for it to run on. This affords it very low latency, since a task might find somewhere to run much sooner than if it sat on a local queue on one CPU. BFS doesn't just pick the lowest-latency place to run, though, as that would ping pong tasks all over the place. That might be okay for latency, but it is expensive in throughput terms. Probably the single most important factor in improving throughput as CPUs advance is keeping the working set of data in the CPU cache, and the caches have gotten fatter and fatter with time. I've been asked about this numerous times before, and it's the reason we can't just run tasks for a few microseconds and switch between all tasks as often as we like to get ultra low latencies; it costs us dramatically in throughput. The more often you move a task away from one CPU's cache to another one, the lower the throughput will be (if something else then uses that previous CPU).

BFS uses the simple information of where a task last ran to determine which CPU would be the best to run on next. It does this by searching for an idle CPU first, binding to the CPU "closest" in cache terms to the last one it ran on. Through the use of "cache locality" lists, it will choose a CPU that shares cache before moving to a different CPU (or a different node in the case of NUMA). BFS, unlike mainline, will never leave a task off a CPU when there is *any* idle CPU it is allowed to run on. This is done because any improvement in throughput from trying to keep the task bound to one CPU costs us the latency of having a task wait on a busy CPU, and latency is everything on a desktop. On a 2 core, 4 thread machine, if you have only 2 tasks running, you want each task to be running on the real cores and not on threads of the same core. BFS checks which cores have busy sibling threads and chooses cores whose other threads are idle. However, if you have 2 tasks running (let's say a make -j2 build), then at irregularly irregular intervals, other tasks will wake up and consume CPU (scheduling noise), making the number of tasks asking for CPU at any one time jump to 3 or more. This ruins the possibility that we only ever use the 2 cores, and throws everything off balance. Since BFS has no notion of history, it's not like those tasks will go back once the noise has died down. After a task has moved, there is not much point trying to put it back, since each move costs us yet again.

As there was a performance regression with the 2 task / 4 thread CPU case, I started experimenting with allowing tasks to sit idle for a bit and wait in case their previous CPU became available. The attempts got more and more complex and fancier, with more and more special cases I could think of, yet failed to do more than a pitiful improvement in throughput. So I abandoned that in disgust (especially since I was hating how complex this was getting, and it's just not the BFS way).

Then I investigated to see whether my code had unintentionally grouped the wrong threads with each other in BFS, and for a while I was convinced that it had. The reason was that threads 0 and 2 were considered SMT (hyperthread) siblings, as were 1 and 3. I dug deeper for a while and realised that indeed that is how they were enumerated, however silly that sounded to me. Oh well, at least I had excluded that as the problem.

Finally I went back and looked at how the idle CPU was chosen, and spotted something. On a CPU with SMT siblings, it is virtually free to move a task from one logical CPU to its SMT sibling. It is also virtually free to move a task from one logical CPU to a sibling that shares its cache. I had taken this into consideration with BFS, and made the resched best idle code as simple as possible by just lumping them all together and picking the first one it came across. Now on this i7, all 4 logical CPUs are considered shared cache siblings, meaning it would always pick the first non-thread-busy CPU. That seemed okay at the time, and worked for the (non hyperthreaded) quad core in my desktop, but I figured it would be best to try sticking to the same logical CPU first, then its SMT sibling, before trying a shared cache CPU. I also remembered the "turbo" mode in these i* CPUs, which allows one core to run at higher than the regular clock speed if it's the only one running, so it seemed to make sense that keeping a task on the actual CPU it last ran on might be just as good as a virtually free move. I tried it and bingo: an 8% speedup in the make -j2 case, and ever so slightly faster with a fully loaded make -j4. That was well worth the effort.
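
The new preference order can be sketched like this (the locality ranks and the helper are illustrative stand-ins, not the actual resched best idle code):

```python
# Illustrative sketch: rank idle CPUs by "distance" from where the
# task last ran - the exact same logical CPU first, then its SMT
# sibling, then a shared-cache sibling, then anything else.
SAME_CPU, SMT_SIBLING, SHARED_CACHE, OTHER = 0, 1, 2, 3

def best_idle_cpu(last_cpu, idle_cpus, locality):
    """locality[last_cpu][cpu] gives the rank above; pick the
    idle CPU with the smallest rank (i.e. the closest one)."""
    return min(idle_cpus, key=lambda cpu: locality[last_cpu][cpu])
```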

Even if it is free to move tasks from one CPU to a logical counterpart, there is another theoretical advantage to sticking to the same CPU. CPU frequency scaling is done on a per-CPU basis, and if one CPU is working harder than the others, it will ramp up more easily. This should decrease response time under bursts of load (with the ondemand scaling governor), and potentially use less power since only one CPU will speed up. It's actually been shown that you use less power overall if a job is completed at a higher frequency before ramping down again than by doing the job at the lower frequency. This is why the conservative governor doesn't really make sense - it takes longer to respond since there's always a lag before it speeds up, and it doesn't really save power. Even if you don't use cpu frequency scaling, all modern CPUs do plenty of their own powersaving internally in much the same manner.
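
A back-of-envelope version of that race-to-idle argument, with completely made-up power figures just to show the shape of the arithmetic:

```python
# Illustration only - the wattages here are invented, not measured.
# Energy over a fixed window = active power while the job runs,
# plus idle power for the rest of the window.
def energy(job_cycles, freq_ghz, active_watts, idle_watts, window_s):
    busy = job_cycles / (freq_ghz * 1e9)    # seconds to finish the job
    return active_watts * busy + idle_watts * (window_s - busy)

# Racing at 2GHz finishes in 1s then idles; crawling at 1GHz is
# busy for the whole 2s window and ends up costing more overall.
fast = energy(2e9, 2.0, 20.0, 1.0, 2.0)
slow = energy(2e9, 1.0, 12.0, 1.0, 2.0)
```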

Anyway, that change, along with a handful of other stuff, makes it worth bumping the BFS version up. All the changes going from 357 are low risk, so it should be safe to jump versions now if 357 has been stable for you.

Here's the brief changelog of 357-360:
Don't unnecessarily preempt for a task on the wrong CPU.

Cope with worker threads trying to wake themselves up due to shifting CPUs on
suspend by reactivating the task instead of hitting the BUG_ON.

Wrap timer jiffies at 10 seconds instead of 5 minutes since 32 bit load
averages don't work until the first timer wrap.

Remove the last_task logic as it wasn't providing any significant performance
advantage.

Change the locality logic to try to reschedule on the exact same logical core
instead of assuming scheduling on a sibling core or sibling thread is
equivalent. This allows CPUs with a "turbo" mode (such as i7) to use that more
often by using one CPU more than spreading out, and allows ondemand cpu
frequency scaling to ramp up more easily when a task stays on the same CPU. It
increases throughput on threaded CPUs when lightly loaded, and may offer both
performance and power saving advantages on all SMP topologies with cpu
frequency scaling.

Get it here (including incremental) for 2.6.36ish:
http://ck.kolivas.org/patches/bfs/2.6.36/

EDIT: I should explain that the j2 test is simply a regression test for throughput, it is NOT my way of benchmarking interactivity or responsiveness.

Sunday, 12 December 2010

lrzip-0.551, overlapping streams for faster compression

A few bug reports have been trickling in since I last released a version of lrzip, which, as I said last time, is both good and bad. It's helped direct my workflow on this, even though I keep trying to let it settle down. I'd love to leave the code alone since it mostly works quite well, and the common mantra is that "release early, release often" is a good thing, but in practice that doesn't seem to work. If you release stuff while it's still broken, it will get a bad reputation, people will dismiss it that time around, and they often won't give it a chance again. So if there are show stopper bugs, they need attending to ASAP.

So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.
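
A rough sketch of that overlap using a thread pool (this illustrates the idea only, and is not lrzip's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def compress_file(streams, ncpu, backend):
    """Illustrative sketch: the rzip stage keeps producing blocks
    for the next stream while back end workers are still busy
    compressing blocks from the previous one."""
    results = []
    with ThreadPoolExecutor(max_workers=ncpu) as pool:
        for stream in streams:          # rzip stage, stream by stream
            for block in stream:        # hand each block to a worker...
                results.append(pool.submit(backend, block))
            # ...and move straight on to the next stream instead of
            # waiting here for this stream's blocks to finish
    return [f.result() for f in results]
```

The key point is that nothing blocks between streams; the pool only drains once the whole file has been fed through.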

While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to level 9 did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all-round compromise level for lzma. This was also affecting the window sizes used internally by the lzma library. At level 5 the windows are 16MB in size, going up to 64MB at level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread could have needed up to 800MB. This explained why people were getting lzma failing on window sizes and complaining despite apparently having enough ram. While the new level does lose slightly on compression, it also improves compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".
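
The memory arithmetic from that comment, written out (the 12x factor is the quoted approximation, not an exact figure):

```python
def lzma_thread_mem_mb(window_mb, factor=12):
    """lzma can need roughly 12x the compression window per thread."""
    return window_mb * factor

level5 = lzma_thread_mem_mb(16)   # level 5: 16MB window -> 192MB per thread
level7 = lzma_thread_mem_mb(64)   # level 7: 64MB window -> 768MB (~800MB) per thread
```

Multiply that by one thread per CPU and it's easy to see how a quad core could exhaust ram at level 7.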

With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s

And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.

Benchmarks can be found here:
http://ck.kolivas.org/apps/lrzip/README.benchmarks

In version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1, to make decompression more robust for out-of-order storage in the archive. Unfortunately this doubled the possible number of threads, and that brought a now all-too-familiar problem: running out of ram on big files. In 0.550 I changed it to use only one thread for stream 0 rather than multiple ones, since there is usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little, but reliability is more important.

I put some effort into tackling the dreaded running out of ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and instead of failing completely, it will then wait for the previous thread to complete its compression before trying again. This serialises work that would have otherwise been in parallel. Thus less ram is required since only one thread is likely to be using ram. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resorting to bzip if it never succeeds. This should make running out of memory a non-fatal problem since it will just try again using less and being slower.
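
The fallback ladder can be sketched like so (every function name here is an illustrative stand-in, not one of lrzip's internals):

```python
def compress_block(block, try_lzma, try_bzip2, wait_for_previous):
    """Illustrative sketch of degrading gracefully instead of dying:
    each step returns None on failure (usually out of memory)."""
    result = try_lzma(block, level=5)
    if result is None:
        wait_for_previous()               # serialise: let the other thread's ram free up
        result = try_lzma(block, level=5) # then retry at the same level
    if result is None:
        result = try_lzma(block, level=2) # smaller window needs far less memory
    if result is None:
        result = try_bzip2(block)         # last resort, so failure is never fatal
    return result
```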

On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.

My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.

I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P

So I was very proud of this release and made it go gold as version 0.550...

About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.

So get it here:
http://freshmeat.net/projects/lrzip

No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.

Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.

EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.

---

Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.

Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A1 the "matches".

Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.

All this used to be done in series.

Version 0.530 made the last chopping-up part be done in parallel, since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However, once streams A0 and A1 had finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before moving onto chunk B.

Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.

---

This means that on faster compression (lzo, gzip and bzip2), the rate limiting step is the stream production of A0 and A1. The pipe dream is to parallelise those, but they rely on using as much ram as possible to make the rzip stage effective so it would compromise compression greatly (and be messy as hell to implement).