BFS doesn't really use the notion of "balancing" to move tasks from one CPU to the next based on some complex heuristic algorithm over time; instead, whenever a task wakes up, it decides which CPU anywhere is the most suitable for the task to run on. This affords it very low latency, since a task might find somewhere to run much sooner than if it were stuck on a local queue on one CPU. BFS doesn't just pick the lowest latency point to run from, though, as that would ping-pong tasks all over the place. That might be okay for latency, but it's expensive in throughput terms. Probably the single most important factor in improving throughput with CPUs as time goes on is keeping the working set of data in the CPU cache, and the caches have gotten fatter and fatter with time. I've been asked this numerous times before, and it's the reason we can't just run tasks for a few microseconds and switch between all tasks as often as we like to get ultra low latencies; it costs us dramatically in throughput. The more often you move a task away from one CPU's cache to another, the lower the throughput will be (if something else then uses that previous CPU).
BFS uses the simple information of where a task last ran to determine which CPU would be the best to run on next. It does this by searching for an idle CPU first, binding the task to the CPU "closest" in cache to the last one it ran on. It will choose a CPU that shares cache before moving to a different CPU (or then a different node in the case of NUMA) through the use of "cache locality" lists. BFS, unlike mainline, will never leave a task off a CPU when there is *any* CPU idle that it is allowed to run on. This is done because any improvement in throughput from trying to keep the task bound to one CPU costs us the latency of having the task wait on a busy CPU, and latency is everything on a desktop. On a 2 core, 4 thread machine, if you have only 2 tasks running, you want each task to be running on the real cores and not on threads of the same core. BFS checks whether cores' threads are busy and chooses cores that don't have their other threads busy. However, if you have 2 tasks running (let's say a make -j2 build), then at irregularly irregular intervals other tasks will wake up and consume CPU (scheduling noise), making the number of tasks asking for CPU at any one time jump to 3 or more. This ruins the possibility that we only ever use the 2 cores, and throws everything off balance. Since BFS has no notion of history, those tasks won't go back once the noise has died down. After a task has moved, there is not much point trying to put it back, since each further move costs us yet again.
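To make that concrete, here's a minimal sketch in C of the wakeup-time idea as I've described it. This is not the actual BFS code; the function names and the toy single-node, shared-cache topology are made up purely for illustration:

```c
#include <assert.h>

#define NCPUS 4

/* Toy cache distance: 0 = same logical CPU, 1 = shares cache.  On this
 * pretend single-node machine every CPU shares the last level cache. */
int cache_distance(int cpu, int last)
{
    return cpu == last ? 0 : 1;
}

/* On wakeup, return the idle allowed CPU "closest" in cache to where
 * the task last ran, or -1 if no allowed CPU is idle; only in that
 * case would the task queue up and wait on a busy CPU. */
int select_wakeup_cpu(const int idle[NCPUS], const int allowed[NCPUS], int last)
{
    int best = -1, best_dist = NCPUS + 1;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (!idle[cpu] || !allowed[cpu])
            continue;
        if (cache_distance(cpu, last) < best_dist) {
            best = cpu;
            best_dist = cache_distance(cpu, last);
        }
    }
    return best;
}
```

The point of the sketch is the -1 case: only when no allowed CPU is idle does the task wait, which is the "never leave a task off a CPU when any CPU is idle" rule above.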
As there was a performance regression in the 2 task / 4 thread CPU case, I started experimenting with allowing tasks to sit idle for a bit and wait in case their previous CPU became available. The attempts got more and more complex and fancy, with more and more special cases than I could think of, yet achieved no more than a pitiful improvement in throughput. So I abandoned that in disgust (especially since I was hating how complex it was getting, and it's just not the BFS way).
Then I investigated whether my code had unintentionally grouped the wrong threads with each other in BFS, and for a while I was convinced it had. The reason was that threads 0 and 2 were considered SMT (hyperthread) siblings, as were 1 and 3. I dug deeper for a while and realised that that is indeed how they are enumerated, however silly it sounded to me. Oh well, at least I had excluded that as the problem.
Finally I went back and looked at how the idle CPU was chosen and spotted something. On a CPU with SMT siblings, it is virtually free to move a task from one logical CPU to its SMT sibling CPU. It is also virtually free to move a task from one logical CPU to its shared cache sibling. I had taken this into consideration with BFS, and made the resched best idle code as simple as possible by just lumping them all together and picking the first one it came across. Now on this i7, all 4 logical CPUs are actually considered shared cache siblings, meaning it would always pick the first non-thread-busy CPU. That seemed okay at the time, and worked for the quad core on my desktop (non-hyperthreaded), but I figured it would be best to try sticking to the same actual logical CPU first, then its SMT sibling, before trying a shared cache CPU. I also remembered the "turbo" mode in these i* CPUs, which allows one core to run at higher than the regular clock speed if it's the only one running, so it seemed to make sense that keeping tasks on the actual CPU they last ran on, rather than making a virtually free move, might be just as good. I tried it and bingo: an 8% speedup in the make -j2 case, and ever so slightly faster when fully loaded with make -j4. That was well worth the effort.
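As a sketch of the new preference order (illustrative only, not the BFS source; the CPU numbering follows the i7 enumeration above, with 0/2 and 1/3 as SMT sibling pairs and all four sharing cache):

```c
#include <assert.h>

/* Hypothetical rank of how costly it is for a freshly woken task that
 * last ran on 'last' to run on 'cpu' instead; lower is preferred. */

int smt_sibling(int cpu)
{
    return cpu ^ 2;     /* pairs 0<->2 and 1<->3, as enumerated on this i7 */
}

int move_cost_rank(int cpu, int last)
{
    if (cpu == last)
        return 0;       /* same logical CPU: hottest cache, keeps turbo mode */
    if (cpu == smt_sibling(last))
        return 1;       /* SMT sibling: a virtually free move */
    return 2;           /* shared cache sibling: still cheap, but tried last now */
}
```

The old behaviour effectively collapsed ranks 0, 1 and 2 into one; splitting them out is the whole change.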
Even if it is free to move tasks from one CPU to a logical counterpart, there is another theoretical advantage to sticking to the same CPU. CPU frequency scaling is done on a per-CPU basis, and if one CPU is working harder than the others, it will ramp up more easily. This should decrease response time under bursts of load (with the ondemand scaling governor), and potentially use less power since only one CPU will speed up. It's actually been shown that you use less power overall if a job is completed at a higher frequency before ramping down again than by doing the job at a lower frequency. This is why the conservative governor doesn't really make sense: it takes longer to respond, since there's always a lag before it speeds up, and it doesn't really save power. Even if you don't use cpu frequency scaling, all modern CPUs do plenty of their own power saving internally in much the same manner.
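A rough back-of-the-envelope illustration of that race-to-idle argument. The numbers are invented: I assume dynamic power grows linearly with frequency at a fixed voltage, a constant static power while the core is active, and essentially zero power once it idles:

```c
#include <assert.h>

/* Energy to finish a job of 'work_cycles' at 'freq_hz', under the
 * invented model above: dyn_w_per_ghz watts of dynamic power per GHz,
 * plus 'static_w' watts of static power while active, ~0 W when idle. */
double job_energy_joules(double work_cycles, double freq_hz,
                         double dyn_w_per_ghz, double static_w)
{
    double seconds = work_cycles / freq_hz;
    double active_watts = dyn_w_per_ghz * (freq_hz / 1e9) + static_w;

    return active_watts * seconds;
}
```

With, say, a 10^9 cycle job, 1 W/GHz of dynamic power and 5 W static: racing at 2 GHz costs (2 + 5) * 0.5 = 3.5 J, while crawling at 1 GHz costs (1 + 5) * 1 = 6 J, because the static power is paid for twice as long. Under different assumptions (e.g. voltage scaling steeply with frequency) the sums can go the other way, so treat this as an illustration of the principle, not a measurement.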
Anyway, that change, along with a handful of other things, makes it worth bumping the BFS version. All the changes going from 357 are low risk, and it should be safe to jump versions now if 357 has been stable for you.
Here's the brief changelog of 357-360:
- Don't unnecessarily preempt for a task on the wrong CPU.
- Cope with worker threads trying to wake themselves up due to shifting CPUs on suspend by reactivating them, instead of hitting the BUG_ON.
- Wrap timer jiffies at 10 seconds instead of 5 minutes, since 32 bit load averages don't work until the first timer wrap.
- Remove the last_task logic, as it wasn't providing any significant performance advantage.
- Change the locality logic to try to reschedule on the exact same logical core instead of assuming scheduling on a sibling core or sibling thread is equivalent. This allows CPUs with a "turbo" mode (such as the i7) to use it more often by loading one CPU rather than spreading out, and allows ondemand cpu frequency scaling to ramp up more easily when a task stays on the same CPU. It increases throughput on threaded CPUs when lightly loaded, and may offer both performance and power saving advantages on all SMP topologies with cpu frequency scaling.
Get it here (including incremental) for 2.6.36ish:
EDIT: I should explain that the j2 test is simply a regression test for throughput, it is NOT my way of benchmarking interactivity or responsiveness.