Sunday 10 April 2011

BFS 0.400

TL;DR: Lower latency, better throughput and better ondemand behaviour on SMP.

BFS 400 for

BFS 400 for

2.6.38-ck3 includes BFS 400 (applies to

Ubuntu packages and include BFS 400:
Ubuntu Packages

It's been a while in the making, and the advantages of this new BFS are so substantial that they warrant a new version. I was going to announce this on lkml with lots of fanfare, but I see there is still some rare desktop hardware that still is not happy on ondemand governing. SO, my summary is: If you are on a desktop, stick with the performance governor and you'll get significant performance and latency advantages, and if you're on a laptop, you'll find BFS 400 much better than previous iterations with both ondemand and performance governing. There is hard evidence it is better for these. What follows is a long post that I was planning on submitting to lkml, but I'm very wary of burnout. Burnout for me is not to do with being tired of working on the code. No, not at all. The issue is that coding will start intruding too much on my regular life which has absolutely nothing to do with linux kernel and code at all. Once it interferes with my home life, or work, or health, I know it's time to cut back. And that's where I am right now. So instead, here's the post just on the blog:


This is to announce the most major upgrade in the BFS CPU scheduler in a

For those tracking the development on my blog, this is version 0.376 renamed
with mostly cosmetic changes only.

Benchmarks first:

x264 benchmarks on a quad core Xeon X3360 with ondemand (courtesy of Graysky):

Benchmarks on a 4 thread 2 core i7 M620:

Quad core Q9650:

Prescott 3.2GHz Single core HT:

Six core AMD: This has the all important LATENCY and throughput benchmarks.
Each benchmark is the latency with the following loads on a 6 core AMD, make -
j1, j3, j6, j9, j12 and x264 medium and x264 ultrafast (courtesy of Serge Belyshev):

Latency Graphs here:
Sample image from the all important -j6 (where load == number of CPUs):

The BFS cpu scheduler, designed for optimum interactivity and responsiveness
while maintaining excellent throughput on commodity hardware, has been
modified to improve its behaviour on SMP machines, especially when used in
combination with scaling CPU frequency governors (like ondemand), and on
mulitcore hardware that has "turbo" modes. As many distributions are now
setting ondemand as the default governor, even on desktops, and some desktop
GUIs aren't even offering a way to alter the choice of governor any more, it
was clear that I needed to modify BFS to perform optimally with these

Briefly, BFS is a single runqueue design which is used to make the most of the
fact that single runqueues overall have the lowest average latency compared to
separate runqueue designs.

See this youtube link about queuing theory for an interesting parallel with
supermarket checkouts:

Why the other line is likely to move faster

The most consistent feature in BFS to date that was found to improve average
service latency whilst maintaining throughput was that no CPU was ever left to
be idle if there was a task that desired CPU time. Through extensive
experimentation I discovered that it was extremely easy for 4 or more tasks to
desire CPU at once on a virtually idle system. Thus on a quad core, the
average service time would always be lower if the tasks are allowed to move to
the first available idle CPU.

One of the disadvantages of this approach, though, is that although average
service latency is kept low by always using the first available CPU in a
relatively idle system, it means that work never clusters on a CPU. So when a
load-dependent CPU frequency governor is in use, no one CPU is ever seen as
being fully loaded except once the load is great enough for all CPUs to be
loaded. This meant that with BFS, CPUs would not be ramped up to their maximum
speed, and further, the "turbo" modes of newer multicore CPUs were never
engaged since they rely on one core being used while the others are kept idle.
Mainline does not exhibit this problem by virtue of always keeping work local
to each CPU and forcing work elsewhere once some kind of heuristically
determined imbalance was measured and then a rebalance would be forced.

It became apparent that to improve throughput with scaling governors and turbo
CPUs, that I would have to allow CPUs to become idle and service tasks
clustered to less CPUs in order for them to speed up. Trying to find the right
compromise between improving throughput while keeping the service latency low
by allowing a CPU to go idle would require heuristics which I've avoided
greatly to date on BFS. One great concern I have for the mainline scheduler is
the ever-expanding special case treatment of tasks on different architectures
for different one-benchmark workloads that do better, turning our mainline
balancing system into an unmanageable state machine.

Thus I devised a system whereby tasks that are running on a CPU but get
descheduled involuntarily in preference for another task instead of sleeping,
get flagged as "sticky". The reasoning being that if a task had not finished,
then it certainly was going to continue working once rescheduled. Then I added
a cpu_scaling flag which would tell the scheduler that a CPU had been throttled
from its maximum frequency. When one CPU is choosing its next task, if a task
is sticky on another CPU and the current CPU is throttled, it will not take
the task.

This extremely simple technique proved surprisingly good for throughput
without destroying latency. Further investigation revealed the same sticky
system could also be used to determine when to bias against tasks going next
on the local CPU even if the CPU wasn't throttled. So the existing "cache
distance" calculation was removed entirely and tasks that are sticky are just
seen as having soft affinity and biased against. This means that waking tasks
still get very low latency, but sticky tasks are very likely to go back to
their own CPU, while no CPU is left idle if there is work to do, _provided_
the CPU is not throttled.

The existing system for determining which idle CPU to bind to remains
unchanged for it has proven very effective and complementary to this added

So the entire equivalent of a "balancing" mechanism for BFS consists of:
1. Flagging the last CPU bound task that was involuntarily scheduled as being
softly affined to its CPU, and biasing against it on other CPUs.
2. Choosing an idle CPU to wake up based on how busy that core is, and how
close it was to the original CPU it last ran on.

It was discovered the improved throughput meant I was able to further lower
the rr interval such that it is initially set to 6ms regardless of the number
of CPUs, thus improving latency further on SMP.

The resulting change to BFS means throughput with the ondemand CPU frequency
governor is now very close to that of the performance governor, power usage
may be decreased (theoretical, not yet proven), and better throughput and
latency on the performance governor. I would like to point out, however, that
some CPUs were not really designed with ondemand governing in mind,
specifically such as the "extreme" desktop CPUs by Intel. These tend to only
have 2 frequencies, are slow to change from one frequency to another, and may
lie about which CPUs are throttled and actually throttle all of them. Given
how little power saving they really offer, I suggest anyone with a desktop CPU
test for themselves how well it works with ondemand and decide whether they
want to build it into the kernel or not as a way of preventing your
distribution from using the ondemand governor without your knowledge. I detest
the lack of configurability in newer distributions and GUIs, but that's another

In order to quantify the changes in this BFS version, benchmarks were
performed by myself and the help of some other very helpful members of the
community. Thanks to all those who tested along the way and helped it get this
far. The very simple, but reliable and repeatable benchmarks of kbuild at
various jobs and x264 were used to determine the effects on throughput, while
the wakeup-latency app by Jussi Laako was run in the background during the
benchmark as a rough guide to what effect the various loads had on latency.

Again I must stress, the purpose for BFS is to be used as a testing ground for
scheduling to see what can be done when the scheduler is specifically aimed at
commodity hardware for normal desktops and affordable servers. As such, it
should serve as a reference tool to compare to for these workloads and I'm
very pleased to see the uptake of its use actually by the community, and that
mainline is occasionally reminded of the need to keep in touch with this

To the server admins who have contacted me to tell me of their good
experiences with BFS, thank you, but I must stress again that I am unable to
implement extra features like CGROUPS that some of you have asked for. My
time is limited, and my work on the kernel completely unpaid so I can only
direct my attention to areas I personally have interest in.

EDIT: In response to questions from those who haven't been following the progress, yes this BFS is the Brain Fuck Scheduler.


  1. Rock on, CK! Your efforts on the development of this great scheduler (with v0.400 being no exception) are deeply appreciated around the globe!

  2. Glad to hear that! I had applied bfs363-376-test.patch to 2.6.37.y yesterday, works fine. Going to add the changes of 0.4 as well. Thanks!

  3. It works!
    Really only cosmetic changes bfs400 versus bfs376?

  4. Where the cpu frequency code is patched has changed in a simpler way, and the documentation has been updated, but it basically works the same way so will be indistinguishable from 376.

  5. Bingo, I felt:
    bfs400 little smoother than bfs376!

    Con, you could easily wait a short time to go public for a released to arrive (as usual you mentioned). And adapting bfs400 to cleanly patch the kernel then ....

  6. I suspect that it will patch cleanly on anyway. It's rare that 4 point releases break my patches, except for when the mainline developers get carried away and start adding features to so-called "stable" releases.

  7. @ck - I updated the x264 dataset including a run with v0.400 to be complete. Please use the new data set on your blog post.

  8. Thanks for your efforts!

    And I agree; a hobby should not interfere in a negative way with real life :-) I don't think you have anything to prove to LKML at this point. The sheer amount of people using your work on a daily basis is proof enough.

  9. Hi there. Had a problem while compiling this new version (I use an AUR package with the patches found here)

    Got a whole dump of "no space left on device" but I'm pretty sure I have plenty of space here - 500GB HDD with nothing really demanding. What can it be?

  10. Maybe you've ran out of inodes. You can check inode usage with "df -i".

  11. @anonymous - if you're using the PKGBUILD from the AUR which I maintain, you shouldn't need to be grabbing any files since they're all self contained. As to your no space left of device warning, what RealNC said makes sense.

    Why not compile in /dev/shm if you have enough space? Also, you an use pre-compiled binaries from my custom repo. Click my nick here to get linked to the wiki page that has the needed info.

  12. @RealNC - I cleared pacman's cache and it worked on the second try. But I'll take note of that possibility on the future, thanks.

    @graysky - no, no, I didn't get anything else here. I just pass by to read some more info. My curiousity is bigger than my actual knowledge at the moment. Anyway, all I use is the PKGBUILD from AUR. Worked like a charm in the first try, and it was the one where I had the most fear - first time I get an "external" kernel.

    I don't actually know at the moment such options as compiling in /dev/shm, that's why. Also, thanks for the link, I'll make sure to readay it as well.

    Anyway, everything working here again. Thanks a lot, everyone.

  13. Is there a mailing list for -ck discussions? Managed to completely lockup my desktop with patch-2.6.38-ck3 whereas ck1 was stable.

  14. Hey CK,

    thanks for your work! I was wondering is the configuration:

    dynamic ticks OFF, preemptive kernel ON, timer 1000HZ, (plus I change generic CPU to Athlon64/Opteron/K8)

    still recommended as stated in BFS readme?

  15. @karabaja4, yep pretty much. Here's the quick summary of suggested basic configurations:

  16. Report.
    AMD 6*3400 MHz (overclock)
    10 * game servers
    1) ubuntu Natty 2.6.38-8-server_amd64.deb
    top = 20-23%
    load across multiple cores very uneven. system attempted to load the maximum single-core
    top = 14-16%
    load almost equally distributed across multiple cores

  17. Thanks very much for your report. I get a lot of good reports from people running game servers or multi-user servers with BFS.

  18. Will it have one for 2.6.37? I try to apply it on opensuse 11.4, but some code need to modify. At last, it seems failed... XD

  19. ck, thnx very much for

    I have reduced Hz_1000 to 300 and disabled dynamic tics. This seemes to be an optimal configuration for a 2009 MacMini core2 64bit Gentoo system:
    Looking flash video the MacMini keeps cool at ondemand frequency 1596Mhz with:

    I will see if latency keeps being well when compiling ...

  20. I diff the 2.6.38-sched-bfs-376.patch and There are just few difference. I use it to make one for opensuse 11.4. It will apply 2.6.37-ck2, bfs363-376-test.patch, and the change from diff above(remove the chage for cpufreq_conservative.c, cpufreq_ondemand.c, cpufreq_userspace.c and add for cpufreq.c). I hope it will be fine.

  21. very interesting update to ur scheduler. I have 2 observations....
    the first is that on my i7 960 desktop (which uses turbo,and I don't use scaling) I noticed that for whatever reason if I leave dynamic ticks on it does a better job of revving up the frequency over clocking than when I turn it off (I monitor the frequencies with a program called i7z). Also it doesn't seem to take advantage of the hyperthreaded logical cpus like it use to....
    I also have a slightly off topic question: how come your ubuntu packaged kernels don't set off the preemptible kernel bug with the fglrx driver. Everytime I compile a preemptible kernel fglrx bugs out in my dmesg

  22. @Ralph: Actually you'll see I recommended dynticks ON for a mobile device.

    @chronniff: Interesting observation. It may well be that dynticks allows the unused idle threads to go more idle, thus allowing more turbo mode use. Can you confirm with a throughput measurement, rather than a monitor program, that throughput is better with it on? I may revise my recommendations in light of this data. There is actually no reason it wouldn't use hyperthreaded the way it used to so I'm not sure why you say that.

    As for the preemptible bug question, perhaps it's just the choice of configuration items. Check the config that is installed with the kernel and compare with your own.

  23. OK, everything seems fine. these are the repositories for opensuse.

  24. The bfs configuration is helpful, I hadn't noticed that before. Thanks for pointing it out, ck.

    One small thing in there, though :)


  25. Yes of course, I can pretend that's been up there for ages (I put it up last night). Will correct the typo, thanks.

  26. Just wondering, why disable cpu frequency scaling ?
    Say if i away from keyboard, shouldn't frequency downscaling save power ?

  27. CK - two questions:

    1) Why disable dynamic ticks?
    2) According to the help item for Timer frequency, "the timer interrupt occurs on each processor in an SMP environment leading to NR_CPUS * Hz number of timer interrupts per second."

    As I read that, my quad core chip would have a timer frequency 4x higher than what I select. You recommend a setting of 1000 Hz for desktops. Does this mean I need to select 300 Hz for my case (4x300=1,200 Hz)?

  28. Hey,

    I was trying to figure out what's wrong with that slowed down 2d performance on nvidia cards. bootopper suggested ( to disables and then re-enable HT cores on cpu and it really helped, at least for me. :)

    I'm posting this just for your information. :)

  29. @graysky: Dynamic ticks adds overhead. The main purpose for dynticks is to reduce power consumption on mobile devices, and it doesn't reduce power by any significant amount on an ordinary desktop, so the added overhead and latency of using it normally isn't worth it. As discussed earlier, though, if you have a turbo enabled CPU, it *might* help it to go turbo more. As for the CPU frequency, no, I mean 1000 absolutely, regardless of the number of CPUs.

    @vojta: That's very interesting indeed. I don't have a HT CPU on my desktop, but I'm sure the bugs to do with nvidia slowdown are related.

    @anonymous: CPU frequency scaling is not reliably fast to speed up on desktop CPUs (as opposed to mobile CPUs), and desktop CPUs don't throttle very much speed-wise, and the power savings are miniscule compared to the intrinsic ability of the CPU to idle its unused parts. This is all on a desktop CPU. Mobile CPUs are very different. However, nothing is absolute. The newer CPUs are much better in this regard (last year or so).

  30. I would be more than happy to benchmark the tickless and non tickless kernels with bfs400 on my i7 960 quad with HT and Turbo mode(v1), just let me know what test to use, should I just time an x264 encoding? I should mention that, despite what I was seeing the other day with i7z, that the system does in fact seem to 'feel' a bit snappier with dynamic ticks turned off

    The Hyper Threading thing I metioned is that on previous bfs versions load was always balanced almost perfectly even over all 8 logical cores on my system. But this new release seems to keep all the load of the first four cores (especially the first two), while the remaining logical cores are almost always left idle. After reading your post more thoroughly, I'm guessing this is due to trying to ovverload one cpu while keeping the other threads relatively idle so that the one under load can take advantage of the turbo mode overclocking most effectively since it can hit even higher frequencies if it is just one thread overclocking and a time. The cpu behavior, however, behaves the same as always if I use make -j8 or anything putting the cpu under full load.

    (oh yeah, and the fglrx bug in dmesg was just because I had preemptive kernel debugging enabled in the kernel by accident)

    After a day of real work everyday use I certainly notice the increase in my systems responsiveness. Thanks again for ur hard work, and congrats on yet another significant improvement

  31. Thanks chronniff. While it may "appear" to use threads less, that's simply the way the work will be spread out to the cores. Instead of jumping to any free core, it will try to keep the same running thread on the same core. So if your load is only 3 out of 8 threads, the same 3 cores should appear loaded. That's actually an improvement. As for the benchmark, I need something that's actually single threaded to be run on that machine with and without dynticks if possible. The good old kernel build is simple enough. Boot into init 1 (if possible) and go into a clean kernel directory and do:
    make clean && make allnoconfig && make -j8 && make clean && time make
    with and without dynticks enabled and report the results of the time output please. The make -j8 is just to cache all the files in ram, but the 'make' without jobs will run mostly single threaded, which should invoke turbo mode.


  32. The overhead of dynticks is why I disabled it and at the same time reduced to Hz_300 for better throughput of my MacMini-2009.

    As the discussions make clear there are a lot of possibilities to tune and to get an optimized configuration at the end for your very particular machine ...

  33. here is the output for the kernel build test, I booted in into recovery mode in ubuntu, which i believe is supposed to emulate runlevel 1 since ubuntu uses upstart.
    cpuinfo (first couple lines):

    vendor_id : GenuineIntel
    cpu family : 6
    model : 26
    model name : Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
    stepping : 5
    cpu MHz : 3201.000
    cache size : 8192 KB


    dynamic ticks off:

    real 1m26.390s
    user 1m17.076s
    sys 0m8.056s

    dynamic ticks on (tickless):

    real 1m23.710s
    user 1m13.180s
    sys 0m9.012s

  34. oh, I and I forgot to mention that the cpu is a quadcore with both hyper threading and turbo mode enabled

  35. Looks pretty convincing, thanks! Of course in the ideal world you'd conduct multiple benchmarks average them out, check standard deviation, do a statistical test and blah blah blah, but sometimes things are brutally obvious, like this.

  36. chronniff has better performance with
    dynamic ticks on
    obviously !
    But I don't understand a word:
    Shouldn't dynamic have some overhead ?

  37. @Ralph: There is overhead to dynticks, but if your machine spends any time idle, there are advantages to it as well. On balance, it's probably better to enable it so I'll be changing the faq to suit.

  38. haha, my approach may not have been scientific but I did run the test a second time on each kernel to make sure I was getting consistent data, and each time it was nearly identical. yeah, I think u were right when u said that it enables the threads that aren't in use to stay more idle since they don't have to wakeup as much, and if u look at Intel's in depth explanation of turbo mode, if all threads are active it will overclock all of the cores up to a certain frequency (I think mine is 3.34ghz), but if a single thread is under load while the others stay under a certain frequency, that thread will will be able to overclock even further (I think mine is 3.46).

  39. chronniff, what is the keyword of that cpu capability turbo if you cat /proc/cpuinfo ?

    Are these dyntick effects probably also for
    Intel core2 duo 2009 (I think without turbo) ?

  40. Core2 did not have turbo mode. Dynticks will save you more power (when idle) but is unlikely to have any performance advantage.

  41. yeah, turbo mode is only on the core i7 i5, im not sure about i3, but only on the latest generation of intel, and I know amd has its own turbo in its latest generation, but I don't know anything about it

  42. --- linux-2.6.38.orig/kernel/sched_bfs.c 2011-04-15 00:42:27.000000000 +0200
    +++ linux-2.6.38/kernel/sched_bfs.c 2011-04-15 00:45:15.521436353 +0200
    @@ -1379,8 +1379,8 @@ static void try_preempt(struct task_stru
    if (rq_prio < highest_prio)

    - if (rq_prio > highest_prio || (rq_prio == highest_prio &&
    - deadline_after(rq->rq_deadline, latest_deadline))) {
    + if (rq_prio > highest_prio ||
    + deadline_after(rq->rq_deadline, latest_deadline)) {
    latest_deadline = rq->rq_deadline;
    highest_prio = rq_prio;
    highest_prio_rq = rq;

  43. Well spotted _sid_ , thanks!. (To everyone wondering about this, it's an optimisation, for an unnecessary test, not a bugfix, but it's worth doing).

  44. Hi Con!
    I'm on a Phenom II X4 CPU and I can confirm that 0.400 brings substantial improvements with scaling governors (I'm using conservative).
    I can also confirm what Ralph Ulrich said, 0.400 seems way smoother than 0.376 to me. I'm on and the patch applies cleanly.

  45. --- linux-2.6.35.orig/kernel/sched_bfs.c 2010-08-31 16:32:43.859976211 +0200
    +++ linux-2.6.35/kernel/sched_bfs.c 2010-08-31 16:33:53.943976073 +0200
    @@ -129,7 +129,7 @@ int sched_iso_cpu __read_mostly = 70;
    * The quota handed out to tasks of all priority levels when refilling their
    * time_slice.
    -static inline unsigned long timeslice(void)
    +static inline int timeslice(void)
    return MS_TO_US(rr_interval);

  46. Short question ck -- CONFIG_HZ_1000 is recommended, is CONFIG_HZ_250 preferable for a quad-core? I've asked about multi-core processor behavior wrt hertz before but was told by #kernel that they didn't know. /headscratch

    Thanks for your efforts and code, as always.

  47. Yes, 1000Hz absolutely is recommended for anything that is used as a desktop, no matter how many CPUs you have.

  48. I do agree to enable dynamic tick (tickless) to save more power. I think ordinary desktop user (office apps, internet, movies) could benefit from it. Since office apps and internet doesn't use much cpu power. While internet streaming via flash and playing movie can use GPU instead of CPU.

  49. Hi, as long as we are pasting patches here,
    This one might help:

    --- linux- 2011-04-19 11:47:07.666203441 +0300
    +++ linux-2.6.35/kernel/sched_bfs.c 2011-04-19 11:51:54.250203441 +0300
    @@ -2754,15 +2754,16 @@
    if (idx >= PRIO_LIMIT)
    goto out;
    queue = grq.queue + idx;
    + if (idx < MAX_RT_PRIO) {
    + /* We found an rt task */
    + edt = list_first_entry(queue, struct task_struct, run_list);
    + goto out_take;
    + }
    list_for_each_entry(p, queue, run_list) {
    /* Make sure cpu affinity is ok */
    if (needs_other_cpu(p, cpu))
    - if (idx < MAX_RT_PRIO) {
    - /* We found an rt task */
    - edt = p;
    - goto out_take;
    - }

    * Soft affinity happens here by not scheduling a task with

  50. Thanks, but that breaks affinity so I can't use it.

  51. Oh, sorry, I've missed that.
    I guess then, a better strategy would be splitting the list_for_each_entry loop into two loops: one for the idx < RT and the rest.
    (I don't really know if that will be beneficial at all, that was just something small that bugged me & I thought to share).

    Thanks again for your wonderful work on BFS !