TL;DR: Lower latency, better throughput and better ondemand behaviour on SMP.
BFS 400 for 18.104.22.168:
BFS 400 for 22.214.171.124:
2.6.38-ck3 includes BFS 400 (applies to 126.96.36.199):
Ubuntu packages 188.8.131.52-ck2 and 184.108.40.206-ck1 include BFS 400:
It's been a while in the making, and the advantages of this new BFS are substantial enough to warrant a new version. I was going to announce this on lkml with lots of fanfare, but I see there is still some rare desktop hardware that is not happy with ondemand governing. So, my summary is: if you are on a desktop, stick with the performance governor and you'll get significant performance and latency advantages, and if you're on a laptop, you'll find BFS 400 much better than previous iterations with both ondemand and performance governing. There is hard evidence it is better for these.

What follows is a long post that I was planning on submitting to lkml, but I'm very wary of burnout. Burnout for me is not to do with being tired of working on the code. No, not at all. The issue is that coding will start intruding too much on my regular life, which has absolutely nothing to do with the linux kernel and code at all. Once it interferes with my home life, or work, or health, I know it's time to cut back. And that's where I am right now. So instead, here's the post just on the blog:
This is to announce the most major upgrade to the BFS CPU scheduler in a long time.
For those tracking the development on my blog, this is version 0.376 renamed,
with mostly cosmetic changes only.
x264 benchmarks on a quad core Xeon X3360 with ondemand (courtesy of Graysky):
Benchmarks on a 4 thread 2 core i7 M620:
Quad core Q9650:
Prescott 3.2GHz Single core HT:
Six core AMD: This has the all-important LATENCY and throughput benchmarks.
Each benchmark is the latency with the following loads on a 6 core AMD: make
-j1, -j3, -j6, -j9, -j12, and x264 medium and x264 ultrafast (courtesy of Serge Belyshev):
Latency Graphs here:
Sample image from the all-important -j6 run (where load == number of CPUs):
The BFS cpu scheduler, designed for optimum interactivity and responsiveness
while maintaining excellent throughput on commodity hardware, has been
modified to improve its behaviour on SMP machines, especially when used in
combination with scaling CPU frequency governors (like ondemand), and on
multicore hardware that has "turbo" modes. As many distributions are now
setting ondemand as the default governor, even on desktops, and some desktop
GUIs aren't even offering a way to alter the choice of governor any more, it
was clear that I needed to modify BFS to perform optimally with these setups.
Briefly, BFS is a single runqueue design, chosen to make the most of the
fact that single runqueues overall have the lowest average latency compared to
separate runqueue designs.
See this youtube video about queueing theory for an interesting parallel:
"Why the other line is likely to move faster"
The most consistent feature in BFS to date that was found to improve average
service latency whilst maintaining throughput was that no CPU was ever left to
be idle if there was a task that desired CPU time. Through extensive
experimentation I discovered that it was extremely easy for 4 or more tasks to
desire CPU at once on a virtually idle system. Thus on a quad core, the
average service time would always be lower if the tasks are allowed to move to
the first available idle CPU.
One of the disadvantages of this approach, though, is that although average
service latency is kept low by always using the first available CPU in a
relatively idle system, it means that work never clusters on a CPU. So when a
load-dependent CPU frequency governor is in use, no one CPU is ever seen as
being fully loaded except once the load is great enough for all CPUs to be
loaded. This meant that with BFS, CPUs would not be ramped up to their maximum
speed, and further, the "turbo" modes of newer multicore CPUs were never
engaged since they rely on one core being used while the others are kept idle.
Mainline does not exhibit this problem, by virtue of keeping work local
to each CPU and only migrating it elsewhere once some heuristically
determined imbalance is measured and a rebalance is forced.
It became apparent that to improve throughput with scaling governors and turbo
CPUs, I would have to allow CPUs to become idle and let tasks cluster on
fewer CPUs so that those CPUs could speed up. Trying to find the right
compromise between improving throughput and keeping service latency low
while allowing a CPU to go idle would normally require heuristics, which I've
largely avoided to date in BFS. One great concern I have with the mainline
scheduler is the ever-expanding special-case treatment of tasks on different
architectures for different one-benchmark workloads, turning the mainline
balancing system into an unmanageable state machine.
Thus I devised a system whereby tasks that are running on a CPU but get
descheduled involuntarily in preference for another task, instead of sleeping,
get flagged as "sticky". The reasoning is that if a task had not finished,
it was certainly going to continue working once rescheduled. Then I added
a cpu_scaling flag which tells the scheduler that a CPU has been throttled
from its maximum frequency. When one CPU is choosing its next task, if a task
is sticky on another CPU and the current CPU is throttled, it will not take
the sticky task.
This extremely simple technique proved surprisingly good for throughput
without destroying latency. Further investigation revealed the same sticky
system could also be used to determine when to bias against tasks going next
on the local CPU even if the CPU wasn't throttled. So the existing "cache
distance" calculation was removed entirely and tasks that are sticky are just
seen as having soft affinity and biased against. This means that waking tasks
still get very low latency, but sticky tasks are very likely to go back to
their own CPU, while no CPU is left idle if there is work to do, _provided_
the CPU is not throttled.
The existing system for determining which idle CPU to bind to remains
unchanged, for it has proven very effective and complementary to this added
mechanism.
So the entire equivalent of a "balancing" mechanism for BFS consists of:
1. Flagging the last CPU-bound task that was involuntarily descheduled as being
softly affined to its CPU, and biasing against it on other CPUs.
2. Choosing an idle CPU to wake up based on how busy that core is, and how
close it was to the original CPU it last ran on.
It was discovered that the improved throughput meant I was able to further
lower the rr interval, such that it is initially set to 6ms regardless of the
number of CPUs, thus improving latency further on SMP.
The resulting change to BFS means that throughput with the ondemand CPU
frequency governor is now very close to that of the performance governor,
power usage may be decreased (theoretical, not yet proven), and both
throughput and latency are better on the performance governor. I would like
to point out, however, that some CPUs were not really designed with ondemand
governing in mind, such as the "extreme" desktop CPUs by Intel. These tend to
only have 2 frequencies, are slow to change from one frequency to another,
and may lie about which CPUs are throttled and actually throttle all of them.
Given how little power saving they really offer, I suggest anyone with a
desktop CPU test for themselves how well ondemand works and decide whether or
not to build it into the kernel, as a way of preventing your distribution
from using the ondemand governor without your knowledge. I detest the lack of
configurability in newer distributions and GUIs, but that's another story.
In order to quantify the changes in this BFS version, benchmarks were
performed by myself and by some very helpful members of the community.
Thanks to all those who tested along the way and helped it get this far.
The very simple but reliable and repeatable benchmarks of kbuild at
various jobs and x264 were used to determine the effects on throughput, while
the wakeup-latency app by Jussi Laako was run in the background during the
benchmark as a rough guide to what effect the various loads had on latency.
Again I must stress, the purpose of BFS is to be used as a testing ground for
scheduling, to see what can be done when the scheduler is specifically aimed
at commodity hardware for normal desktops and affordable servers. As such, it
should serve as a reference tool to compare against for these workloads, and
I'm very pleased to see its actual uptake by the community, and that mainline
is occasionally reminded of the need to keep in touch with this audience.
To the server admins who have contacted me to tell me of their good
experiences with BFS, thank you, but I must stress again that I am unable to
implement extra features like CGROUPS that some of you have asked for. My
time is limited, and my work on the kernel is completely unpaid, so I can
only direct my attention to areas I personally have interest in.
EDIT: In response to questions from those who haven't been following the progress, yes this BFS is the Brain Fuck Scheduler.