Tuesday, 14 December 2010

BFS version 0.360

There's nothing better to help you work out how something should work on certain hardware than having it yourself. I recently was fortunate enough to get my hands on an i7 based laptop. While the laptop is powerful for a laptop, a 2.67 dual core, quad thread is hardly a particularly powerful i7 (except in a laptop), but it nonetheless has a topology I've not had before. After installing a -ck kernel on it, I went and benchmarked how BFS performed on it, and was a little disappointed. At load==number of CPUs, it performed well, and with a single threaded job it was good too, but performed poorly compared to mainline with half jobs (so 2 jobs on 4 logical threads). I might point out there's nothing quite like having something to compare to to have some kind of yardstick off which to work. It worked both ways in this scenario, with mainline having had BFS as a reference for which regressions/improvements were found, and in this case I found that BFS wasn't up to scratch. So I set out to find and fix it.

BFS doesn't really use the notion of "balancing" to try and move tasks from one CPU to the next based on some complex heuristic algorithm over time, but decides whenever a task wakes up, which CPU anywhere is the most suitable for the task to run on. This affords it very low latency since a task might find somewhere to run much sooner than if it were on a local queue on one CPU. BFS doesn't just decide to find the lowest latency point to run off, though, as that would ping pong tasks all over the place. That might be okay for latency, but is expensive in throughput terms. Probably the single most important factor in improving throughput with CPUs as time goes on is keeping the working set of data in the CPU cache, and the caches have gotten fatter and fatter with time. I've been asked this numerous times before, but it's the reason we can't just run tasks for a few microseconds and switch between all tasks as often as we like so that we get ultra low latencies; it costs us dramatically in throughput. The more often you move a task away from one CPU's cache to another one, the lower the throughput will be (if something else then uses that previous CPU).

BFS uses the simple information of where it last ran to determine which CPU would be the best CPU to run on next. It does this by searching for an idle CPU first, and binding to the CPU "closest" in cache to the last one it ran off. It would choose a CPU that shared the cache before moving to a different CPU (or then a different node in the case of NUMA) through the use of "cache locality" lists. BFS, unlike mainline, will never leave a task off a CPU when there is *any* CPU idle that it is allowed to run on. This is done because any improvement in throughput by trying to keep the task bound to one CPU costs us in the latency of having a task wait on a busy CPU, and latency is everything on a desktop. On a 2 core, 4 thread machine, if you have only 2 tasks running, you want each task to be running on the real cores and not threads of the same core. BFS checks to see when cores' threads are busy and chooses cores that don't have their other threads busy. However, if you have 2 tasks running (let's say a make -j2 build), then at irregularly irregular intervals, you will have other tasks wake up and consume CPU (scheduling noise), bringing the number of tasks asking for CPU at any one time jump to 3 or more. This ruins the possibility that we only ever use the 2 cores, and throws everything off balance. Since BFS has no notion of history, it's not like those tasks will then go back once the noise has died down. After a task has moved, there is not much point trying to put it back since each next move costs us yet again.

As there was a performance regression with the 2 task / 4 thread CPU case, I started experimenting with allowing tasks to sit idle for a bit and wait in case their previous CPU became available. The attempts got more and more complex and fancier, with more and more special cases I could think of, yet failed to do more than a pitiful improvement in throughput. So I abandoned that in disgust (especially since I was hating how complex this was getting, and it's just not the BFS way).

Then I investigated to see whether my code had unintentionally grouped the wrong threads with each other in BFS, and for a while I was convinced that it had. The reason was that threads 0 and 2 were considered SMT (hyperthread) siblings, as were 1 and 3. I dug deeper for a while and realised that indeed that is how they were enumerated, however silly that sounded to me. Oh well, at least I had excluded that as the problem.

Finally I went back and looked at how the idle CPU was chosen and spotted something. On a CPU with SMT siblings, it is virtually free to move a task from one logical CPU to its SMT sibling CPU. It is also virtually free to move a task from one logical CPU to its shared cache sibling. This I had taken into consideration with BFS, and made the resched best idle code as simple as possible by just lumping them all together, and picking the first one it came across. Now on this i7, all 4 logical CPUs are actually considered shared cache siblings, meaning it would always pick the first non-thread-busy CPU. It seemed okay at the time, and worked for my quad core on my desktop (non hyperthreaded), but I figured it would be best to try sticking to the same actual logical CPU first, then its SMT sibling, before trying a shared cache CPU. I also remembered the "turbo" mode in these i* CPUs, which allows one core to run at higher than the regular clock speed if it's the only one running, so it seemed to make sense that trying to keep tasks on the actual CPU they ran on, rather than a virtually free move might be just as good. I tried it and bingo, an 8% speedup on the make -j2 case, and ever so slightly faster when fully loaded make -j4. That was well worth the effort.

Even if it is free to move tasks from one CPU to a logical counterpart, there is another theoretical advantage of trying to stick to the same CPU. CPU frequency scaling is done on a per-CPU basis, and if one CPU is working harder than all of them, it will ramp up more easily. This should decrease response time under bursts of load (with the ondemand scaling governor), and potentially use less power since only one CPU will speed up. It's actually been shown that you use less power overall if a job is completed at a higher frequency before ramping down again, than doing the job at the lower frequency. This is why the conservative governor doesn't really make sense - it takes longer to respond since there's always a lag for it to speed up, and doesn't really save power. Even if you don't use cpu frequency scaling, all modern CPUs do plenty of their own powersaving internally in much the same manner.

Anyway, that change, along with a handful of other stuff make it worth bumping the BFS version up. All the changes going from 357 are low risk and it should be safe to jump versions now if 357 has been stable for you.

Here's the brief changelog of 357-360:
Don't unnecessarily preempt for a task on the wrong CPU.

Cope with worker threads trying to wake themselves up due to shifting CPUs on
suspend by reactivating it, instead of hitting the BUG_ON

Wrap timer jiffies at 10 seconds instead of 5 minutes since 32 bit load
averages don't work until the first timer wrap.

Remove the last_task logic as it wasn't providing any significant performance
advantage.

Change the locality logic to try to reschedule on the exact same logical core
instead of assuming scheduling on a sibling core or sibling thread is
equivalent. This allows CPUs with a "turbo" mode (such as i7) to use that more
often by using one CPU more than spreading out, and allows ondemand cpu
frequency scaling to ramp up more easily when a task stays on the same CPU. It
increases throughput on threaded CPUs when lightly loaded, and may offer both
performance and power saving advantages on all SMP topologies with cpu
frequency scaling.

Get it here (including incremental) for 2.6.36ish:
http://ck.kolivas.org/patches/bfs/2.6.36/

EDIT: I should explain that the j2 test is simply a regression test for throughput, it is NOT my way of benchmarking interactivity or responsiveness.

21 comments:

  1. Thanks for the Christmas gift :)

    Been running it for a day now without problems.

    ReplyDelete
  2. Same here, new BFS is running here fine too. Nice to hear, that BFS now supports the CPU burst on modern Intel I7. Had some problems to apply the patches on -ck2, but could be solved with replacing the BFS complete ;)

    Thanks sysitos

    ReplyDelete
  3. I want to continue to use the ck2 patchset. How can I apply the new update to it? I also noticed two new files with a date of 16-dec-2010. What are they?

    ReplyDelete
  4. Same here (i3), BFS is running fine. Thanks for this new version!

    ReplyDelete
  5. The new patches are for those already on different BFS versions. For those on ck2, simply add this patch:
    http://ck.kolivas.org/patches/bfs/2.6.36/2.6.36-ck2-bfs-360.patch

    ReplyDelete
  6. It's also running fine for me on OpenSUSE 11.2 && 11.3

    ReplyDelete
  7. Running fine too on Slackware 13.1 On Phenom II X6.
    Although, i don't notice differences against 357 one. :|

    ReplyDelete
  8. Thanks for the feedback. Yes most people are unlikely to notice any significant difference with the changes. Only those with 4 or more threads or cores might notice a difference and it will only be a subtle effect.

    ReplyDelete
  9. The new hardware is moving toward having the many many virtual cores I suppose. Jeez I just got into having 2 cores myself, and now home PCs have 4 and 8. Thanks for the hard work. :)

    ReplyDelete
  10. Con,
    What tests did you perform to measure interactivity ? I've been wanting to compare the mainline and your patchset, and although your patches make the system feel faster, i haven't got a way to measure "snappiness".

    --Prateek

    ReplyDelete
  11. It's been a big problem that there is hardly anything that meaningfully measures interactivity. That's why I created two benchmarks myself for responsiveness and interactivity respectively: contest and interbench . contest is hopelessly out of date and unmaintained, but interbench still works ok. Both of them suffer from being very difficult to interpret properly and not being sensitive enough to be very discriminatory, though within those limitations they still work.

    ReplyDelete
  12. Hey ck, I am using 2.6.35-22ck-generic kernel on Ubuntu 10.10 64-bit.

    I run a Core i5 M430 CPU (Laptop with 4500MHD). Is there a newer kernel that I can use or a PPA/Repo I can add to always stay current? Sorry quit new to this but what I have got running so far is well worth it. The speed increase is outstanding.

    ReplyDelete
  13. Hi Con, I build kernels for Polish Ubuntu Users and to September BFS worked perfectly ... but then there is regression on Intel processors - 2 cores/4 threads (especially on the Atom 330 -> Asus 1201N). Unfortunately BFS 363 continues to cause regressions in performance. It is especially evident in the games running under Wine and ... immediately recognizable by glxgeras (I know this is not a benchmark but immediately indicates that it will be a problem).

    Greetings

    Tomasz `e X t 7 3` Miś

    ReplyDelete
  14. @e X t 7 3

    That's interesting. From what version did you start seeing regressions, and was everything else in the kernel _exactly_ the same?

    ReplyDelete
  15. Hi Con, I honestly don`t remember the exact version, but it happened after the holidays (mayby since 323 version). I have used Your patches before, and then I used the set ZEN. Unfortunately, recent patches cause the same effect. It consists in the fact that for example, Asus 1201N, after the first boot - after installing the drivers - performance is generally very high, with very high responsiveness, but after the next reboot the machine (power is not off!) the system loses its efficiency dramatically. I'm doing the test - run glxgeras - is instead of previous ca 12400 - 12480 frames I have 11900, than I know that is the problem. From what I have watched it looks as if the scheduler had a problem with the separation of tasks. When running CFS - this one charged first one thread and then gradually more. However, BFS immediately tries to break the task on all threads - at the same time jumps from one to another and back. What is interesting - from what I remember somewhere in the 323 version there were no problems after the "cold" rebooting the machine (power of, wait and boot machine. The biggest problem has been seen in applications running under Wine - here slow performance ranged as high as 20-30 % ! Interestingly, there was no such problem on machines with "pure" cores (without separating the threads) = AMD Athlon X2 or X4, but sine 363 version of BFS noticed that such a decline also appears on these machines!! If you want to look, I will send you my config `s - which I use to build my kernels - my series - the atom and K8.

    Regards ... I hope that the problem can be quickly diagnosed and resolved - for the time I built the kernel with your patch = ck1 and + BFQ ... but instead of BFS, I build with CFS. I admit that I prefer your planner - when combined with the patch (on the back I asked Steven - http://git.zen-kernel.org/zen-stable/commit/?id=56b5ce8cd6b84b7e80e730d5157476bfc009ba7e), which results in (or rather the result), very efficient, economical and responsive kernel.

    ReplyDelete
  16. Thanks for your report. There isn't any way BFS can make the machine slow down only after reboot, so I'm not sure what you're seeing there. As for your slowdown, it would be interesting to know if you've enabled the SMT option in your configuration as that is the only change in BFS, and the atoms are threaded CPUs. It should actually stick to one cpu more since the newer version. Unfortunately I can't place any value on your glxgears report as it's not a meaningful benchmark. So check your config and see if CONFIG_SCHED_SMT is set.

    ReplyDelete
  17. Thanks for the quick response, of course I know that is not a benchmark glxgeras ... I use it only for the rapid diagnosis - the problem is or it is not. Of course, I have included = SMT (Hyperthreading) scheduler support-just because, inter alia, the atom. Thus, suggesting it off!?

    Greetings

    ReplyDelete
  18. It should be on, but it would be interesting to see if it was better with it off on that since I don't have an atom to test it on. There still is no explanation for how it can cause problems on rebooting only.

    ReplyDelete
  19. Therefore, I will build a kernel without SMT and see ... interesting. I now have a lot of work but maybe by the end of the week I can do it ... If you would like to see my kernel (for Atom), please see the link below.

    http://forum.ubuntu.pl/showthread.php?t=140724

    Greetings

    ReplyDelete
  20. We are independently recognized and leaders in manufacturing of electronic equipment like SMT Pick & Place Equipments. We offers our customers end-to-end development and manufacturing solutions

    ReplyDelete
  21. What a great post :) I love reading about development processes! More like this please :)

    Mike

    ReplyDelete