TL;DR: Lower latency, better throughput and better ondemand behaviour on SMP.
BFS 400 for 2.6.38.2:
2.6.38.2-sched-bfs-400.patch
BFS 400 for 2.6.35.12:
2.6.35.12-sched-bfs-400.patch
2.6.38-ck3 includes BFS 400 (applies to 2.6.38.2):
patch-2.6.38-ck3.bz2
Ubuntu packages 2.6.38.2-ck2 and 2.6.35.12-ck1 include BFS 400:
Ubuntu Packages
It's been a while in the making, and the advantages of this new BFS are so substantial that they warrant a new version. I was going to announce this on lkml with lots of fanfare, but I see there is still some rare desktop hardware that is not happy with ondemand governing. So, my summary is: if you are on a desktop, stick with the performance governor and you'll get significant performance and latency advantages, and if you're on a laptop, you'll find BFS 400 much better than previous iterations with both ondemand and performance governing. There is hard evidence it is better for these. What follows is a long post that I was planning on submitting to lkml, but I'm very wary of burnout. Burnout for me is not about being tired of working on the code. No, not at all. The issue is that coding starts intruding too much on my regular life, which has absolutely nothing to do with the linux kernel and code at all. Once it interferes with my home life, or work, or health, I know it's time to cut back. And that's where I am right now. So instead, here's the post just on the blog:
---
This is to announce the most major upgrade in the BFS CPU scheduler in a
while.
For those tracking the development on my blog, this is version 0.376 renamed
with mostly cosmetic changes only.
Benchmarks first:
x264 benchmarks on a quad core Xeon X3360 with ondemand (courtesy of Graysky):
x264encodingfps.png
x264encodingsec.png
compilefilezilla340.png
Benchmarks on a 4 thread 2 core i7 M620:
m620-benchmarks.txt
Quad core Q9650:
q9650-benchmarks.txt
Prescott 3.2GHz Single core HT:
prescott-benchmarks.txt
Six core AMD: this has the all-important LATENCY and throughput benchmarks.
Each benchmark measures the latency under the following loads on a 6 core AMD:
make -j1, -j3, -j6, -j9, -j12, and x264 medium and x264 ultrafast (courtesy of Serge Belyshev):
http://ck.kolivas.org/patches/bfs/benchmark3-results-for-announcement-20110410.txt
Latency Graphs here:
http://ck.kolivas.org/patches/bfs/bfs400-cfs/
Sample image from the all important -j6 (where load == number of CPUs):
The BFS cpu scheduler, designed for optimum interactivity and responsiveness
while maintaining excellent throughput on commodity hardware, has been
modified to improve its behaviour on SMP machines, especially when used in
combination with scaling CPU frequency governors (like ondemand), and on
multicore hardware that has "turbo" modes. As many distributions are now
setting ondemand as the default governor, even on desktops, and some desktop
GUIs aren't even offering a way to alter the choice of governor any more, it
was clear that I needed to modify BFS to perform optimally with these
governors.
Briefly, BFS is a single runqueue design which is used to make the most of the
fact that single runqueues overall have the lowest average latency compared to
separate runqueue designs.
See this youtube link about queuing theory for an interesting parallel with
supermarket checkouts:
Why the other line is likely to move faster
The most consistent feature in BFS to date found to improve average
service latency whilst maintaining throughput was that no CPU was ever left
idle if there was a task that desired CPU time. Through extensive
experimentation I discovered that it was extremely easy for 4 or more tasks to
desire CPU at once on a virtually idle system. Thus on a quad core, the
average service time would always be lower if the tasks are allowed to move to
the first available idle CPU.
One of the disadvantages of this approach, though, is that although average
service latency is kept low by always using the first available CPU in a
relatively idle system, it means that work never clusters on a CPU. So when a
load-dependent CPU frequency governor is in use, no one CPU is ever seen as
being fully loaded except once the load is great enough for all CPUs to be
loaded. This meant that with BFS, CPUs would not be ramped up to their maximum
speed, and further, the "turbo" modes of newer multicore CPUs were never
engaged since they rely on one core being used while the others are kept idle.
Mainline does not exhibit this problem by virtue of always keeping work local
to each CPU, and only migrating work elsewhere once some heuristically
determined imbalance is measured and a rebalance is forced.
It became apparent that to improve throughput with scaling governors and turbo
CPUs, I would have to allow CPUs to become idle and service tasks
clustered on fewer CPUs in order for them to speed up. Trying to find the right
compromise between improving throughput while keeping the service latency low
by allowing a CPU to go idle would require heuristics, which I've largely
avoided to date in BFS. One great concern I have about the mainline scheduler is
the ever-expanding special case treatment of tasks on different architectures
for different one-benchmark workloads that do better, turning our mainline
balancing system into an unmanageable state machine.
Thus I devised a system whereby tasks that are running on a CPU but get
descheduled involuntarily in preference for another task instead of sleeping,
get flagged as "sticky". The reasoning is that if a task has not finished,
then it is almost certainly going to continue working once rescheduled. Then I added
a cpu_scaling flag which would tell the scheduler that a CPU had been throttled
from its maximum frequency. When one CPU is choosing its next task, if a task
is sticky on another CPU and the current CPU is throttled, it will not take
the task.
This extremely simple technique proved surprisingly good for throughput
without destroying latency. Further investigation revealed the same sticky
system could also be used to determine when to bias against tasks going next
on the local CPU even if the CPU wasn't throttled. So the existing "cache
distance" calculation was removed entirely and tasks that are sticky are just
seen as having soft affinity and biased against. This means that waking tasks
still get very low latency, but sticky tasks are very likely to go back to
their own CPU, while no CPU is left idle if there is work to do, _provided_
the CPU is not throttled.
The existing system for determining which idle CPU to bind to remains
unchanged for it has proven very effective and complementary to this added
change.
So the entire equivalent of a "balancing" mechanism for BFS consists of:
1. Flagging the last CPU bound task that was involuntarily descheduled as being
softly affined to its CPU, and biasing against it on other CPUs.
2. Choosing an idle CPU to wake up based on how busy that core is, and how
close it was to the original CPU it last ran on.
The improved throughput meant I was able to further lower
the rr interval, such that it is now initially set to 6ms regardless of the number
of CPUs, thus improving latency further on SMP.
The resulting change to BFS means throughput with the ondemand CPU frequency
governor is now very close to that of the performance governor, power usage
may be decreased (theoretical, not yet proven), and both throughput and
latency on the performance governor are improved. I would like to point out, however, that
some CPUs were not really designed with ondemand governing in mind,
such as the "extreme" desktop CPUs by Intel. These tend to only
have 2 frequencies, are slow to change from one frequency to another, and may
lie about which CPUs are throttled and actually throttle all of them. Given
how little power saving they really offer, I suggest anyone with a desktop CPU
test for themselves how well it works with ondemand, and decide whether or not
to build it into the kernel as a way of preventing their
distribution from using the ondemand governor without their knowledge. I detest
the lack of configurability in newer distributions and GUIs, but that's another
story.
In order to quantify the changes in this BFS version, benchmarks were
performed by myself and by some other very helpful members of the
community. Thanks to all those who tested along the way and helped it get this
far. The very simple, but reliable and repeatable, benchmarks of kbuild at
various jobs and x264 were used to determine the effects on throughput, while
the wakeup-latency app by Jussi Laako was run in the background during the
benchmark as a rough guide to what effect the various loads had on latency.
Again I must stress, the purpose for BFS is to be used as a testing ground for
scheduling to see what can be done when the scheduler is specifically aimed at
commodity hardware for normal desktops and affordable servers. As such, it
should serve as a reference tool to compare against for these workloads, and I'm
very pleased to see its actual uptake by the community, and that
mainline is occasionally reminded of the need to keep in touch with this
hardware.
To the server admins who have contacted me to tell me of their good
experiences with BFS, thank you, but I must stress again that I am unable to
implement extra features like CGROUPS that some of you have asked for. My
time is limited, and my work on the kernel completely unpaid so I can only
direct my attention to areas I personally have interest in.
EDIT: In response to questions from those who haven't been following the progress, yes this BFS is the Brain Fuck Scheduler.
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Sunday, 10 April 2011
Thursday, 7 April 2011
BFS 0.376 test
TL;DR: Fastest BFS yet for SMP.
After extended testing on BFS 0.373, a number of minor issues came up, but the results were very promising. Now I believe I've addressed all the known issues in a newer version. Instead of flagging scaling CPUs by their governor alone, I now flag them as scaling only when they're actually throttled from maximum speed. This improves throughput further with the dynamic scaling governors like ondemand and brings it very close to that of performance under full load. I also found that sticky flagged tasks were not keeping their sticky flags if they were rescheduled back to back. Fixing this gave me even more of a performance boost in all situations. I addressed the oops that can occur on UP, and finally I updated the docs to match the changes in the scheduler design.
So hopefully this will be the last test patch (fingers crossed) before I make it official, because... I'm about >< close to burnout. That's not something I want to experience. Incremental for those on BFS 363 already: bfs363-376-test.patch
Full patch for 2.6.38ish:
2.6.38-sched-bfs-376.patch
Benchmarks as they come to hand...
---
x264 benchmarks, courtesy of Graysky:
Higher is better: boxplotencodethroughput.png
Lower is better: boxplotencodetime.png
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with the exception of the BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
---
Sunday, 3 April 2011
BFS 0.373 test
The BFS 0.372 test patch has proved quite a success. There have been no regressions in performance, with slight improvements even with the performance CPU governor, and better latency all round on SMP now. There were some rare crashes that I had to track down, and I believe I've fixed them all, so I'm releasing another test patch, 0.373, which has addressed them and is otherwise the same as 0.372.
Apply to a 0.363 patched kernel:
bfs363-373-test.patch
Thanks to those who have tested and reported back so far!
UPDATE: Throughput benchmark results on a 6 core AMD courtesy of Serge Belyshev:
benchmark3.results-cfs-bfs363-bfs373.txt
I am planning on posting latency benchmarks soon too.
Friday, 1 April 2011
BFS 0.372 test patch
Another day, another BFS test release. This one builds on the ideas in the 0.371-test3 patch I posted about since they're proving very positive. No April fools here. This looks like it's kicking arse.
Apply to a BFS 0.363 patched kernel such as 2.6.38-ck1:
bfs363-372-test.patch
Changelog from the patch:
---
Add a "sticky" flag for one CPU bound task per runqueue which is used to flag
the last cache warm task per CPU. Use this flag to softly affine the task to
the CPU by not allowing it to move to another CPU when a scaling CPU frequency
governor is in use. This significantly improves throughput at lower loads by
allowing tasks to cluster on CPUs, thereby allowing the scaling governor to
speed up only that CPU. This should also save power. Use the sticky flag to
determine cache distance in earliest_deadline_task and abolish the
cache_distance function entirely. This has proven as, if not more, effective.
Add helpers to the 3 scaling governors to tell the scheduler when a CPU is
scaling.
Replace the frequent use of num_online_cpus() with a grq.noc variable that
is only updated when the number of online cpus changes.
Simplify resched_best_idle by removing the open coded for_each_cpu_mask as
it was not of proven benefit.
Remove warnings in try_to_wakeup_local that are harmless or never hit.
Clear the cpuidle_map bit only when edt doesn't return the idle task.
Abolish the scaled rr_interval by number of CPUs and now just use a fixed
nominal 6ms everywhere. The improved cache warmth of the sticky flag makes
this unnecessary and allows us to lower overall latencies on SMP by doing so.
---
Please test this one thoroughly. It's very stable and now heavily tested, but I won't announce any new "release" till it's been tested for maybe 5 days or more. It appears better in all workloads and with and without cpu frequency governors. Again, only SMP will benefit from this patch, but it should change behaviour in all SMP now, not just with ondemand. However, the scaling governors should show the most improvement.
The best example (with ondemand) was a single threaded cpu bound workload that took 126 seconds to complete on an i7 2 core/4 thread machine that now takes 91.5 seconds. The 2 threaded workload dropped from 66 seconds to 51.5 seconds. Note that this more or less addresses a regression in BFS behaviour with cpu frequency scaling on SMP, but it's also been an opportunity to improve behaviour elsewhere.
BFS 0.371 test3
TL;DR I'd like more testing.
Here's a new lightly tested patch trying another simpler and cheaper approach to improving throughput with scaling CPU frequency governors (like ondemand) without the flaws of the previous approach.
I've enabled the changes at all times, not just when the ondemand governor is run, but again this change only affects SMP users. This test patch is only lightly tested, but I'd appreciate it if people gave it a bit of a run. Apply to a BFS 363 based kernel such as 2.6.38-ck1.
bfs363-371-test3.patch
Too tired to describe what it does right now... Zzzz
Tuesday, 29 March 2011
2.6.38-ck2, BFS 0.370
EDIT EDIT EDIT: This patch causes bizarre regressions and has been backed out. Consider 2.6.38-ck1 and BFS 0.363 the stable releases, SORRY!
After more testing and cleaning up of the test2 patch posted here earlier, I've put it together as a new BFS release with almost trivial changes since that patch. The changes are cosmetic only, apart from the removal of a warning which is hit occasionally and is harmless on BFS.
Just to reiterate, unless you are on an SMP machine (2 or more threads or cores) AND are using a scaling CPU frequency governor (e.g. ondemand), then there will be no significant performance advantage to upgrading to BFS 370 or ck2. For those with that combination, what you can expect to see is an increase in throughput with lightly loaded machines (single threaded apps most affected) and likely an increase in battery life. Overall latency is unlikely to be affected keeping interactivity relatively the same but responsiveness should also increase. If you are unsure of the difference, read this summary I wrote for interbench:
readme.interactivity
When the kernel mirrors sync up, ck2 will be found here:
2.6.38-ck2
It applies with some minor offsets to 2.6.38.2 so you can safely apply it to that kernel if you like.
BFS is available here:
BFS
And Ubuntu packages of 2.6.35.11-ck2 and 2.6.38.2-ck1 which have the new BFS are now available here:
Ubuntu Packages
EDIT: People keep asking me why I've "optimised" only for SMP and ondemand. This is not the case at all. This patch addresses a performance regression that only affects that combination.
EDIT2: SEE ABOVE NOTICE! PATCH CONSIDERED BAD, GO BACK TO 2.6.38-ck1 and BFS 0.363 PLEASE!
Saturday, 26 March 2011
BFS and optimising a global scheduler for per-cpu frequency scaling
TL;DR BFS now faster with cpu frequency scaling
The last update to BFS brought some performance enhancements to threaded CPUs like the i5/i7 etc. One thing that has troubled me for some time with BFS is that CPU frequency scaling, in the form of ondemand and friends, is based on the "local load" on a CPU, and scales one CPU up in the presence of load on that CPU alone, yet BFS has almost no notion of "local load" since it's a global scheduler. While there is some rudimentary code in BFS to say whether one CPU is currently slightly more busy than the total load, the fact that load is more a total-system concept than a per-CPU one in BFS means that CPU frequency scaling is not going to work as well. So I set out to tackle that exact issue with this latest test patch.
bfs363-test2.patch
In this patch, a special deadline selection process is invoked only when a "scaling" type cpu frequency governor is in use, on a per-cpu basis. What this special deadline selection process does is apply a kind of soft affinity for CPU bound tasks (i.e. ones that don't ever sleep) and try to keep them on one CPU over and above the normal amount in BFS, to the point of allowing a CPU to go idle instead of pulling that task away from its last CPU. It does this by using a simple counter for the number of times the task would otherwise have been selected by another CPU, and keeps it under the total number of other CPUs. If it goes over this counter it will allow the task to move. It still moves very easily compared to the mainline scheduler, which has a very strong notion of locality and works the other way around (moving tasks only if things look really badly unbalanced). Freshly waking tasks completely ignore this and just work as per usual BFS, thereby maintaining BFS' very low latency, and if a non-scaling governor is used (currently performance or powersave), BFS reverts to the regular deadline selection. This can slightly increase the latency of CPU bound tasks, but because the CPU they're bound to will now more easily increase its frequency with (for example) the ondemand governor, this latency is offset by faster completion of their work.
All in all it addresses a weakness in the fact that the CPU frequency scaling in mainline is designed around per-CPU scheduling and not global scheduling which BFS does. The most extreme example of this in testing on a heavily threaded multicore multithread CPU (i7) shows a 30% increase in throughput on a single threaded task when run on a 4 thread CPU. That's a massive change. Not only will the workload complete faster, but almost certainly battery usage will be reduced. In practice most workloads are more mixed than this, so don't expect a sudden 30% overall boost in performance. It also has no effect without cpu frequency scaling, and no effect on single CPU machines, but the more cores and threads you have, the greater the benefit. Since some new Linux distributions now no longer even allow you to change the governor and just set it to ondemand by default, this change is something that will be essential. After a bit of testing, and possibly more tweaking, it will constitute the changes for the next BFS release.
Please test and report back how this patch performs for you if you have more than one core and run CPU frequency scaling.
Thursday, 24 March 2011
2.6.38-ck1, BFS 0.363 for 2.6.38
TL;DR This post isn't about lrzip at last.
I've ported BFS and all previous -ck patches to 2.6.38.
2.6.38-ck1 can be grabbed here:
2.6.38-ck1
Ubuntu packages here:
Ubuntu Packages
and BFS for 2.6.38 can be grabbed here:
BFS 2.6.38
Apart from slight architectural changes between the kernel versions, and YET ANOTHER mainline rewrite of the CPU offlining code for suspend to ram/disk (which always causes problems with porting BFS since I have to rewrite my own parts of that code), this is the same BFS v0.363 as per the last release.
There is no "autogrouping by SID" in BFS or -ck. I remain unconvinced of any tangible benefit of such an approach for real world usage, and am wary of its potential for problems and the inability to apportion CPU when you actually want to.
Please report your experiences, but only if they're meaningful. I don't care how your PC performs if you do make -j 4096 unless you happen to have 4096 CPUs :)
Please enjoy! お楽しみ下さい
Thursday, 17 March 2011
2.6.38 and BFS/-ck releases
2.6.38 came out faster than I was expecting, and I've made no effort as of yet to port BFS or -ck to it. It looked like there were still bug reports on the last -rc on the linux kernel mailing list so I wasn't expecting the "stable" release to come out yet. Furthermore, I've been working very hard on the next major release of lrzip which is a massive rewrite so that has completely occupied my time and I am not done with it yet. So there will likely be quite some delay before I can release an official BFS/CK for 2.6.38. I also would like to watch what fallout, if any, there is from the autogrouping code and decide what I should do with respect to that feature. I'm sure some unofficial BFS ports will be available soon enough but obviously I won't be able to say with any confidence whether they can be trusted or not.
Thursday, 6 January 2011
2.6.37-ck1, BFS 0.363 for 2.6.37, Grouping by UID
It looks like 2.6.37 made it out in time before I left for my trip, so here's some goodies to keep you all busy, from the emails I just sent to announce them on lkml:
---
These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but suitable to any workload.
Apply to 2.6.37:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/patch-2.6.37-ck1.bz2
Broken out tarball:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/2.6.37-ck1-broken-out.tar.bz2
Discrete patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.37/2.6.37-ck1/patches/
All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/
Web:
http://kernel.kolivas.org
Code blog when I feel like it:
http://ck-hack.blogspot.com/
Each discrete patch contains a brief description of what it does at the top of
the patch itself.
The most significant change is an updated BFS CPU scheduler, up to version
0.363. See the announce of that patch for the changelog. The rest is a resync.
2.6.37-sched-bfs-363.patch
sched-add-above-background-load-function.patch
mm-make_swappiness_really_mean_it.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
ck1-version.patch
---
The BFS (name shall not be said for PG requirements) CPU scheduler for 2.6.37
is now available.
Since the last release, a lot more work was put into maintaining fine grained
accounting at all times (which should help on 32 bit machines, uniprocessor
and 100Hz configs), minor changes have been made to make CPU offline code more
robust for the purposes of suspend to ram/disk, and some small scalability
improvements were added for SMT CPUs (eg i7). These changes are unlikely to
have dramatically noticeable effects unless you were already experiencing a
problem or poor performance.
A direct link to the patch for 2.6.37 is here:
http://ck.kolivas.org/patches/bfs/2.6.37/2.6.37-sched-bfs-363.patch
All BFS patches here:
http://ck.kolivas.org/patches/bfs
Version 363 has been ported to 2.6.35 and 2.6.32 and is available from that
directory, in light of the long term release nature of these kernels.
On a related note, a small multi-user server feature request was commissioned
for BFS that I was happy to work on, which I'd like to also make publicly
available.
Here is the changelog:
---
Make it possible to proportion CPU resource strictly according to user ID by
grouping all tasks from the one user as one task.
This is done through simply tracking how many tasks from the one UID are
running at any one time and using that data to determine what the virtual
deadline is, offset by proportionately more according to the number of running
tasks. Do this by creating an array of every UID for very quick lookup of the
running value and protect it by the grq lock. This should incur almost
immeasurably small overhead even when enabled. An upper limit of 65535 UIDs
is currently supported.
Make this feature configurable at build time via Kconfig, and at runtime
through the use of sysctls:
/proc/sys/kernel/group_by_uid
to enable or disable the feature (default 1 == on), and
/proc/sys/kernel/group_uid_min
to decide the minimum UID to group tasks from (default 1000).
Nice values are still respected, making it possible to allocate different
amounts of CPU to each user.
This feature is most suited to a multi-user shell type server environment and
is NOT recommended for an ordinary desktop.
---
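The mechanism in the changelog above, a per-UID running count used to push the virtual deadline proportionately further out, can be sketched roughly as follows. This is an illustrative toy model only, not code from the actual patch; the names (uid_running, uid_grouped_offset) are made up:

```c
#include <stdint.h>

#define MAX_UIDS 65536	/* matches the 65535-UID upper limit mentioned above */

/* Illustrative stand-ins for BFS internals; in the real patch the per-UID
 * running counts live in an array protected by the grq lock. */
static int uid_running[MAX_UIDS];
static int group_by_uid = 1;	/* /proc/sys/kernel/group_by_uid */
static int group_uid_min = 1000;	/* /proc/sys/kernel/group_uid_min */

/* Offset a task's virtual deadline proportionately further out according
 * to how many tasks from the same UID are currently running, so each user
 * competes as though they were one task. */
static uint64_t uid_grouped_offset(uint64_t base_offset, unsigned int uid)
{
	int running;

	if (!group_by_uid || uid < (unsigned int)group_uid_min || uid >= MAX_UIDS)
		return base_offset;
	running = uid_running[uid];
	return base_offset * (uint64_t)(running > 1 ? running : 1);
}
```

Since the lookup is a single array index under a lock that is already held, the overhead when enabled is negligible, which matches the claim in the changelog.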
The patch is available for the moment here:
http://ck.kolivas.org/patches/bfs/test/bfs363-group_uids.patch
A reminder that this is NOT a desktop, laptop or embedded device type feature.
The purpose of this feature is to make it impossible for any one user to get
more CPU than any other user on a multi-user login. This is suitable for
multiuser shared GUI/X session or shell type machines, and incurs almost
negligible overhead.
---
I'll be offline shortly and in Japan for a few weeks so I'll be unlikely to respond to any emails in that time.
Enjoy!
Sunday, 2 January 2011
BFS version 0.363
Welcome to 2011!
The testing on BFS ported to 2.6.37-rc8 has been reassuring, and no real show stopper bugs have shown up. The remaining changes required to make it release ready for 2.6.37 have now been committed, along with some other very minor changes, so I've bumped the version up to 0.363. The main change was implementing the fine grained interrupt accounting which will have very little, if any, impact on regular users. These changes are ONLY suitable for 2.6.37, so they have not been ported back to the BFS I'm maintaining for earlier kernels. The rest of the changes suitable for older kernels have gone into 363 for them.
Here is the changelog as it affects existing BFS 360 users:
Make CPU offlining more robust by simply removing all affinity for processes that no longer have any CPUs they can run on. This allows the machine stop thread to complete offlining CPUs and makes for a little less overhead in hot paths.
Allow SCHED_IDLEPRIO to wake up idle CPUs in try_preempt. This would have caused minor slowdowns for IDLEPRIO tasks only on relatively quiescent systems.
Remove inappropriate likely()s.
Update cpustat for irq - may have been under-reporting interrupt load.
Cosmetic changes.
Bump version to 0.363.
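The first changelog entry, breaking affinity for tasks left with no online CPUs, can be sketched with a toy bitmask model. This is not the kernel's cpumask API (which uses struct cpumask and cpumask_* helpers); it just shows the idea:

```c
#define NR_CPUS 8

/* Toy cpumask as a plain bitmask, one bit per CPU. */
typedef unsigned long cpumask_t;

#define CPU_MASK_ALL (((cpumask_t)1 << NR_CPUS) - 1)

/* If a task is no longer allowed on any online CPU (everything in its
 * affinity mask was offlined), break its affinity so it may run anywhere.
 * This lets the machine stop thread finish offlining without the scheduler
 * having to special-case such tasks in hot paths. */
static cpumask_t cpus_allowed_after_offline(cpumask_t allowed, cpumask_t online)
{
	if (!(allowed & online))
		return CPU_MASK_ALL;
	return allowed;
}
```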
Most of these changes should have no visible behavioural effect to the user, apart from the following:
For those on BFS 360, if you were having warnings or even OOPSes on suspend to ram/disk or wakeup from them, or if you were having trouble suspending or resuming, this change might help.
The other change guarantees that CPUs will be busier on SMP machines when tasks are being run IDLEPRIO, so it will increase throughput, but ONLY if you run tasks IDLEPRIO.
Incremental: 2.6.36-bfs-360-363.patch
For 2.6.36ish:
2.6.36-sched-bfs-363.patch
2.6.35.10:
2.6.35.10-sched-bfs-363.patch
2.6.32.27:
2.6.32.27-sched-bfs-363.patch
Shortly I'll be going to Japan for a few weeks as I do almost every year now, so I'll be offline for a while.
Thursday, 30 December 2010
BFS for 2.6.37-rc8
So we approach yet another "stable" release with 2.6.37 just around the corner now that -pre8 is out (yes I'm old fashioned, it is still a pre-release and not a release candidate in my eyes). I usually use pre8 as the flag to port BFS to the new kernel so that I have it ready in time for the "stable" 3 point release.
I've ported the current BFS release v0.360 to the latest kernel. Of significance in the new kernel are some changes yet again to the way CPU offline code occurs (which happens just before suspend to ram and disk), some new way to improve accounting of CPU attribution during interrupt handling, and improving the reporting of CPU load during idle periods with NOHZ enabled. There are also some other architectural changes that have no cosmetic effect. None of these will have any effect on "performance" for the end user so don't expect any surprises there.
So I've ported BFS with only the changes necessary to get it working on the new kernel. I've not yet added the support for interrupt accounting, and I haven't changed the way total load is calculated on NOHZ configured kernels. The only change of any significance is the way I offline CPUs in the new BFS. I have been fighting with that for a while now since there really is no right way to do it, and it changes so frequently in mainline that I often have trouble keeping up.
For those already on version 0.360 of BFS, the only significant changes can be had now in this patch:
bfs360-test.patch
For the dirty details in the CPU offline code, what happens is that a worker thread runs at ultra high priority on the CPU it's about to offline, and half way through its work it turns off that CPU and then needs to run on another CPU to die itself. Till now, BFS has looked for tasks whose affinity no longer included any alive CPUs and would then be happy to run them anywhere. With this latest bfs360-test.patch it does more of what the mainline kernel does and formally breaks affinity for tasks that no longer have anywhere they're allowed to run. It brings only a negligible improvement in overhead, but the main reason for doing it is to make the offline code more robust.
For those looking for the first BFS release on 2.6.37-rc8, here's the test patch:
2.6.37-rc8-sched-bfs-362.patch.
I've bumped the version number up simply because it includes the test changes above, and has had other architectural changes to be in sync with the mainline kernel. The main thing new in mainline is that there is a new "class" of realtime scheduling used only internally by the CPU offlining code called a stop class. Instead of using a new scheduling class, BFS simply adds one more realtime priority that can be used by kernel threads and is only ever applied to the CPU offline thread (kmigration). I did this to make it add zero overhead to the existing design, yet support the concept that it is a thread that nothing else can preempt.
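The "one more realtime priority" idea can be modelled as below. The constant and helpers are invented for illustration (mainline instead uses a separate stop_sched_class); the point is that one extra slot above the realtime range gives the offline thread something nothing else can preempt, at zero cost to the existing design:

```c
/* In Linux, lower prio values mean higher priority; realtime tasks occupy
 * 0..MAX_RT_PRIO-1.  The sketch gives the CPU offline (kmigration) thread
 * one invented priority numerically above even the highest realtime task. */
#define MAX_RT_PRIO 100
#define STOP_PRIO (-1)	/* invented: beats every realtime priority */

static int task_prio_of(int rt_prio, int is_stop_thread)
{
	return is_stop_thread ? STOP_PRIO : rt_prio;
}

/* Nonzero if priority pa should preempt priority pb (lower value wins). */
static int should_preempt(int pa, int pb)
{
	return pa < pb;
}
```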
Almost certainly I will be adding more code to the final version of BFS that I use when 2.6.37 comes out, to support the new interrupt accounting code and the global load reported, but these will only have cosmetic effects. This patch has only been lightly tested and not compile tested for multiple configurations just yet, but seems to be working nicely. For now you can grab it here:
2.6.37-rc8-sched-bfs-362.patch
Happy New Year everyone :)
Friday, 24 December 2010
Queuing theory and BFS
Damentz pointed out to me on IRC the queuing theory video that was linked by a slashdot article.
For those who haven't seen it, the video is here:
Why the other line is likely to move faster
Anyway, the relevance of that video is that BFS uses a single queue, whereas the mainline Linux kernel uses a multiple queue design. The people are tasks, and the checkouts are CPUs. Of course there's a lot more to a CPU scheduler than just the queue design, but I thought this video was very relevant :)
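As a toy illustration of why the pooled single queue wins: give the same batch of jobs to four servers, once through one shared queue (each job goes to whichever server frees up first) and once pre-dealt at random to four private queues, like picking a checkout lane. This is a made-up simulation, not BFS code, and it ignores everything else a real scheduler does:

```c
#include <stdlib.h>

#define NSERVERS 4
#define NJOBS 10000

/* Average wait when every job joins one shared queue and is dispatched to
 * whichever server becomes free first (the BFS-style single queue). */
static double single_queue_avg_wait(const double *svc, int n)
{
	double freetime[NSERVERS] = {0}, total = 0;
	int i, j;

	for (i = 0; i < n; i++) {
		int best = 0;
		for (j = 1; j < NSERVERS; j++)
			if (freetime[j] < freetime[best])
				best = j;
		total += freetime[best];	/* job waits until that server frees up */
		freetime[best] += svc[i];
	}
	return total / n;
}

/* Average wait when each job is dealt to a random private per-server queue
 * up front, analogous to per-CPU runqueues without rebalancing. */
static double multi_queue_avg_wait(const double *svc, int n)
{
	double freetime[NSERVERS] = {0}, total = 0;
	int i;

	for (i = 0; i < n; i++) {
		int q = rand() % NSERVERS;
		total += freetime[q];
		freetime[q] += svc[i];
	}
	return total / n;
}
```

With service times of any realistic spread, the shared queue's average wait comes out lower, because no job ever waits behind a long job while another server sits idle, which is exactly the video's point.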
Merry Christmas everyone.
Tuesday, 14 December 2010
BFS version 0.360
There's nothing better to help you work out how something should work on certain hardware than having it yourself. I recently was fortunate enough to get my hands on an i7 based laptop. While the laptop is powerful for a laptop, a 2.67GHz dual core, quad thread CPU is hardly a particularly powerful i7 (except in a laptop), but it nonetheless has a topology I've not had before. After installing a -ck kernel on it, I went and benchmarked how BFS performed on it, and was a little disappointed. At load==number of CPUs, it performed well, and with a single threaded job it was good too, but it performed poorly compared to mainline with half jobs (so 2 jobs on 4 logical threads). I might point out there's nothing quite like having something to compare against as a yardstick off which to work. It worked both ways in this scenario, with mainline having had BFS as a reference for which regressions/improvements were found, and in this case I found that BFS wasn't up to scratch. So I set out to find and fix it.
BFS doesn't really use the notion of "balancing" to try and move tasks from one CPU to the next based on some complex heuristic algorithm over time, but decides whenever a task wakes up, which CPU anywhere is the most suitable for the task to run on. This affords it very low latency since a task might find somewhere to run much sooner than if it were on a local queue on one CPU. BFS doesn't just decide to find the lowest latency point to run off, though, as that would ping pong tasks all over the place. That might be okay for latency, but is expensive in throughput terms. Probably the single most important factor in improving throughput with CPUs as time goes on is keeping the working set of data in the CPU cache, and the caches have gotten fatter and fatter with time. I've been asked this numerous times before, but it's the reason we can't just run tasks for a few microseconds and switch between all tasks as often as we like so that we get ultra low latencies; it costs us dramatically in throughput. The more often you move a task away from one CPU's cache to another one, the lower the throughput will be (if something else then uses that previous CPU).
BFS uses the simple information of where it last ran to determine which CPU would be the best CPU to run on next. It does this by searching for an idle CPU first, and binding to the CPU "closest" in cache to the last one it ran off. It would choose a CPU that shared the cache before moving to a different CPU (or then a different node in the case of NUMA) through the use of "cache locality" lists. BFS, unlike mainline, will never leave a task off a CPU when there is *any* CPU idle that it is allowed to run on. This is done because any improvement in throughput by trying to keep the task bound to one CPU costs us in the latency of having a task wait on a busy CPU, and latency is everything on a desktop. On a 2 core, 4 thread machine, if you have only 2 tasks running, you want each task to be running on the real cores and not threads of the same core. BFS checks to see when cores' threads are busy and chooses cores that don't have their other threads busy. However, if you have 2 tasks running (let's say a make -j2 build), then at irregularly irregular intervals, you will have other tasks wake up and consume CPU (scheduling noise), bringing the number of tasks asking for CPU at any one time jump to 3 or more. This ruins the possibility that we only ever use the 2 cores, and throws everything off balance. Since BFS has no notion of history, it's not like those tasks will then go back once the noise has died down. After a task has moved, there is not much point trying to put it back since each next move costs us yet again.
As there was a performance regression with the 2 task / 4 thread CPU case, I started experimenting with allowing tasks to sit idle for a bit and wait in case their previous CPU became available. The attempts got more and more complex and fancier, with more and more special cases I could think of, yet failed to do more than a pitiful improvement in throughput. So I abandoned that in disgust (especially since I was hating how complex this was getting, and it's just not the BFS way).
Then I investigated to see whether my code had unintentionally grouped the wrong threads with each other in BFS, and for a while I was convinced that it had. The reason was that threads 0 and 2 were considered SMT (hyperthread) siblings, as were 1 and 3. I dug deeper for a while and realised that indeed that is how they were enumerated, however silly that sounded to me. Oh well, at least I had excluded that as the problem.
Finally I went back and looked at how the idle CPU was chosen and spotted something. On a CPU with SMT siblings, it is virtually free to move a task from one logical CPU to its SMT sibling CPU. It is also virtually free to move a task from one logical CPU to its shared cache sibling. This I had taken into consideration with BFS, and made the resched best idle code as simple as possible by just lumping them all together, and picking the first one it came across. Now on this i7, all 4 logical CPUs are actually considered shared cache siblings, meaning it would always pick the first non-thread-busy CPU. It seemed okay at the time, and worked for my quad core on my desktop (non hyperthreaded), but I figured it would be best to try sticking to the same actual logical CPU first, then its SMT sibling, before trying a shared cache CPU. I also remembered the "turbo" mode in these i* CPUs, which allows one core to run at higher than the regular clock speed if it's the only one running, so it seemed to make sense that trying to keep tasks on the actual CPU they ran on, rather than a virtually free move might be just as good. I tried it and bingo, an 8% speedup on the make -j2 case, and ever so slightly faster when fully loaded make -j4. That was well worth the effort.
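The revised preference order described here (the exact same logical CPU first, then its SMT sibling, then a shared-cache sibling, then anything idle) can be sketched as a simple ranked scan. The enum and function are invented for illustration; BFS's real implementation works off its cache locality lists rather than per-CPU rank arrays:

```c
/* Locality of each candidate CPU relative to where the task last ran;
 * smaller means closer (invented ranks, for illustration only). */
enum locality {
	LOC_SAME_CPU,	/* the exact logical CPU the task last ran on */
	LOC_SMT_SIBLING,	/* hyperthread sibling of that CPU */
	LOC_SHARED_CACHE,	/* different core sharing a cache */
	LOC_OTHER	/* anything else (other package or NUMA node) */
};

/* Pick the idle CPU closest to the task's last CPU, or -1 if none is
 * idle, in which case the task queues rather than waiting for its old
 * CPU to free up. */
static int best_idle_cpu(const enum locality *rank, const int *idle, int ncpus)
{
	int best = -1, i;

	for (i = 0; i < ncpus; i++) {
		if (!idle[i])
			continue;
		if (best < 0 || rank[i] < rank[best])
			best = i;
	}
	return best;
}
```

Preferring LOC_SAME_CPU over the "virtually free" sibling moves is what lets turbo mode and per-CPU ondemand scaling concentrate work on one core, per the 8% make -j2 result above.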
Even if it is free to move tasks from one CPU to a logical counterpart, there is another theoretical advantage of trying to stick to the same CPU. CPU frequency scaling is done on a per-CPU basis, and if one CPU is working harder than all of them, it will ramp up more easily. This should decrease response time under bursts of load (with the ondemand scaling governor), and potentially use less power since only one CPU will speed up. It's actually been shown that you use less power overall if a job is completed at a higher frequency before ramping down again, than doing the job at the lower frequency. This is why the conservative governor doesn't really make sense - it takes longer to respond since there's always a lag for it to speed up, and doesn't really save power. Even if you don't use cpu frequency scaling, all modern CPUs do plenty of their own powersaving internally in much the same manner.
Anyway, that change, along with a handful of other things, makes it worth bumping the BFS version. All the changes going from 357 are low risk, so it should be safe to jump versions now if 357 has been stable for you.
Here's the brief changelog of 357-360:
Don't unnecessarily preempt for a task on the wrong CPU.
Cope with worker threads trying to wake themselves up due to shifting CPUs on suspend by reactivating them, instead of hitting the BUG_ON.
Wrap timer jiffies at 10 seconds instead of 5 minutes, since 32 bit load averages don't work until the first timer wrap.
Remove the last_task logic as it wasn't providing any significant performance advantage.
Change the locality logic to try to reschedule on the exact same logical core instead of assuming scheduling on a sibling core or sibling thread is equivalent. This allows CPUs with a "turbo" mode (such as i7) to use it more often by loading one CPU rather than spreading out, and allows ondemand cpu frequency scaling to ramp up more easily when a task stays on the same CPU. It increases throughput on threaded CPUs when lightly loaded, and may offer both performance and power saving advantages on all SMP topologies with cpu frequency scaling.
Get it here (including incremental) for 2.6.36ish:
http://ck.kolivas.org/patches/bfs/2.6.36/
EDIT: I should explain that the -j2 test is simply a regression test for throughput; it is NOT my way of benchmarking interactivity or responsiveness.
Monday, 25 October 2010
Minor BFS 357 Bug(?) on 2.6.36
I received a bug report from someone running BFS on 2.6.36. They hit this BUG_ON in the new code present only in 2.6.36:
+static void try_to_wake_up_local(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ bool success = false;
+
+ BUG_ON(rq != this_rq());
This looks fairly straightforward, and this code path is only used by the new worker code in 2.6.36. It shouldn't be hit unless something else is wrongly calling this function (indirectly via schedule()). They hit it, it seems, via the iwlwifi code. No idea how, but it's actually harmless to wake up a task on another runqueue in BFS, so simply removing this BUG_ON fixes it.
Here's a patch to apply to BFS if you're running it on 2.6.36 and run into this bug:
bfs357-worker_fix.patch, which just removes the BUG_ON. It makes me wonder whether this bug is also in mainline, but only those who hit it can confirm or deny that by running vanilla 2.6.36, and there's no point reporting it upstream when it's so vague.
Thursday, 7 October 2010
2.6.36-rc7 with -ck1 and BFS 357
Since I was on the tail end of my hack fest and Linus announced 2.6.36-rc7, saying it was likely the last -rc, I figured it was a good opportunity to sync up my patches with mainline. As always, the porting of BFS brought some unexpected surprises where a simple port would probably work, but likely fail long term. So there were lots of little subtle changes that I had to make to BFS. Functionally this is virtually the same as BFS 357 for 2.6.35.7, apart from some minor tweaks to avoid new warnings. There was one teensy change to niffy_diff to also ensure a minimum difference was observed according to ticks, and the minimum difference was decreased from 1us to anything greater than 0 as the niffy clock may well be updated in less than 1us. One nice thing also came about from the update. I managed to remove some code when I realised the nohz_load_balancer I'd been maintaining in the BFS code was simply me blindly porting it a while back and not even realising what it was for. Of course there is no load balancing on BFS since it has a global runqueue which means all CPUs are always in balance, so there's no need for any special case balancing on nohz configs.
For those who want some overview of what was required to port it, there were some subtle changes to the try_to_wake_up code for notifying when workers are going to sleep with workqueues. Some reshuffling of what happens on context switch was ported. Some sched domains code was updated. rlimit code was tweaked. nohz balancing code was dropped. Checking that apparently idle CPUs were actually online was added to cope with changes on forking idle tasks on .36. And random other stuff I can't remember.
It's worth noting that you'll need a beta driver from nvidia if you're evil like me and use their evil binary drivers. See here: nvnews link for their latest drivers.
Anyway here's a directory that contains lrz compressed versions of all the patches, and an lrz compressed all-inclusive -ck1 and bfs357 patch. It's my secret plan that those wishing to try my pre-release patches must also grab lrzip, which I wrote, to access them :)
http://ck.kolivas.org/patches/2.6/2.6.36/
EDIT2: If you enable schedstats, you will need the patch called 2636rc7ck1-fixes.patch in that directory added to prevent build failures.
Monday, 4 October 2010
Going Magnum with BFS 357
As some of you may be aware, PCLinuxOS uses BFS by default. It is also a 32 bit distro for maximum compatibility with desktop software. So Texstar, who started PCLinuxOS, was keen to try the newer BFS and built a 2.6.32.x kernel with BFS 350 in 32 bits. He was able to reproduce strange slowdowns and stalls and reported them back to me. After reviewing my code it was clear I had screwed up the 32 bit builds: having moved to 64 bit niffy counters in BFS 350, I'd left a whole lot of ints and longs around that would overflow regularly. The ints would overflow even on 64 bit builds! Texstar has since confirmed that fixing the 32 bit variables fixed the problem for him (thanks!).
I've committed those 32 bit fixes, added some more sanity checking on the crazy sched_clock interface using jiffy difference to determine upper bound and added some minor macro cleanups. I've bumped the version number up to 0.357 just because it sounds good. Testing on this has been done on 32 bit, uniprocessor, and an older kernel. Hopefully this means good things for android too!
Changelog follows:
I forgot about an awful lot of longs and ints that will overflow on 32 bit now
with u64 deadlines. Fix them.
Add some macro tidiness.
Make sched_clock sanity checking robust and standardised, using jiffy
difference as upper limit, and use nominal 1us when difference cannot be
trusted.
Go magnum.
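The sanity check described in the changelog might look something like the following sketch. The names and the exact policy here (clamping to a jiffy-derived upper bound, falling back to a nominal 1us when the clock appears to have gone backwards) are illustrative assumptions, not the actual BFS code:

```c
#include <assert.h>
#include <stdint.h>

#define HZ              100
#define NSEC_PER_SEC    1000000000ULL
#define JIFFY_NS        (NSEC_PER_SEC / HZ)  /* 10 ms at 100Hz */
#define NOMINAL_DIFF_NS 1000ULL              /* nominal 1us fallback */

/* Clamp an apparent sched_clock delta to what the elapsed jiffies
 * allow, and substitute a nominal 1us when the delta cannot be
 * trusted at all (i.e. the clock appears to have run backwards). */
static uint64_t sane_clock_diff(int64_t raw_diff_ns, uint64_t jiffy_diff)
{
    /* Upper bound derived from ticks; +1 allows a partial jiffy. */
    uint64_t limit = (jiffy_diff + 1) * JIFFY_NS;

    if (raw_diff_ns < 0)
        return NOMINAL_DIFF_NS;
    if ((uint64_t)raw_diff_ns > limit)
        return limit;
    return (uint64_t)raw_diff_ns;
}
```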
I've uploaded a full patch for 2.6.35.7 and an incremental from 350 to 357 and will be uploading patches for older kernels shortly. Grab it now and do your worst!
BFS patches
Hopefully I can take a break from hacking for a bit and get back to my billion other pastimes while this version distils for a while... Then again, a new "stable" kernel will probably be out soon.
UPDATE: I've now uploaded patches for previous kernels as far back as 2.6.31.14.
Sunday, 3 October 2010
Of jiffies, gjiffies, miffies, niffies and BFS 350
So we like to think that our software is perfect if there are no bug reports. Truth be told it's been a very long time since anyone's reported anything on BFS, which could be both good and bad. It means either no one is using it, or no one is having any bugs with it, or no one is inclined to report anything. Since I get about 10,000 downloads a month of it, I'm pretty sure that someone is using it, but I don't really know for certain how much of the other two it is. Anyway the point is that I hadn't really hacked much on it for a while now and had only been porting it with each stable kernel release as they came out, until I started working on 330. So for the next version, I planned to try and commit some more changes along the "hope it fixes android" lines and while I was at it, to try and find a few minor optimisations, and audit the code in the process.
The main change in 350 was moving the deadline system off jiffies. I had already done so in 330 by using a local 64 bit jiffy counter I called gjiffies (for global runqueue jiffies), which was just a wrapper on 64 bit builds, but needed locking and a u64 variable on 32 bit builds. Currently, BFS uses microsecond accounting for timeslices. The reasons are that we don't have hardware that can do a context switch in less than a microsecond (some take up to 30 microseconds), and to avoid overflowing on 32 bits: if I were to use nanoseconds, a 32 bit signed counter would overflow in just 2.1 seconds. So, to improve the resolution of the deadlines and maintain some deadline difference on lower Hz builds (specifically 100Hz), I changed the local jiffy counter introduced in 330 to a monotonically incrementing microsecond counter I called miffies. This worked off the dreaded sched_clock interface, which gets time from the TSC nanosecond timer on each CPU. That timer sounds great in principle, but it often returns bogus values, appearing to go backwards or jump forwards dramatically, it fluctuates with CPU frequency scaling, and the timers are not aligned between CPUs on SMP. Numerous sanity checks exist in the sched_clock_cpu interface to make the number meaningful, but the result is still only useful to a single CPU. I used every CPU to update the monotonic counter, by adding each CPU's difference to the time whenever it was sampled, and scaled it to microseconds. Thus I ended up with a relatively monotonic 64 bit microsecond counter that could be used everywhere. Serge Belyshev quickly pointed out that there wasn't much point in it being microsecond-based since it was 64 bit anyway, and I moved from miffies to niffies, since nanoseconds won't wrap a 64 bit counter for over 500 years. So in 350, all deadline timing is done by comparing times against the monotonic 64 bit nanosecond niffies timer.
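The idea can be sketched roughly like this. Locking and the sched_clock sanity checks are omitted, and `update_niffies`/`last_clock` are hypothetical names, not the BFS source; each CPU's unreliable per-CPU clock contributes only its positive delta since that CPU last looked, so the single global counter never goes backwards:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

static uint64_t niffies;              /* global monotonic ns counter */
static uint64_t last_clock[NR_CPUS];  /* per-CPU last clock sample */

/* Advance the global counter by this CPU's delta since its last
 * sample.  Backwards jumps on a CPU are simply ignored, so the
 * global counter is monotonic even though the per-CPU clocks are
 * not aligned with each other. */
static uint64_t update_niffies(int cpu, uint64_t cpu_clock_ns)
{
    if (cpu_clock_ns > last_clock[cpu])
        niffies += cpu_clock_ns - last_clock[cpu];
    last_clock[cpu] = cpu_clock_ns;
    return niffies;
}
```

Note how a sample from a second CPU with a completely different clock origin still only adds its own forward progress, which is why the result is usable everywhere.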
Anyway, the point of all this, is that deadlines used to be limited by jiffy resolution, which means whatever Hz the build was compiled as. That meant that on 100Hz, it's possible for many tasks to have the same deadline, dramatically decreasing the utility of having the deadlines in the first place. It also means that the deadline differentiation should be valid, even long after the "virtual deadline" has passed (see the discussion on the BFS 330 update).
The next thing I added, to improve management of low Hz builds, was a test for whether a task started in the first or second half of a tick (jiffy), setting a "dither" flag that determines whether we should expire that task early when the tick fires. The point is to minimise the extra time a task can run beyond its allocated timeslice when we're limited by Hz resolution. Say our rr_interval is set to 10 milliseconds and our jiffies are 10 milliseconds long (as is the case at 100Hz): if a task starts just after a tick, it can run for just under 20 milliseconds. With the dither flag, a task that starts in the first half of a tick will be found at the next tick to have less than half a tick of timeslice left, and be expired early. Thus the longest a task can overrun its quota is half a tick instead of a full tick, i.e. 5ms in this example. This is not an expensive operation, but of course it would be much better if BFS wasn't Hz limited this way at all, by using high precision event timers to expire tasks at exactly the right time. There are 2 reasons I haven't done this. First, hpet isn't available on all hardware and carries a slight performance hit. Second, and most important, I've never bothered to figure the timers out and do it properly. That's why I just suggest people raise their Hz.
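As a rough sketch of the half-tick logic at 100Hz (illustrative only, with made-up names):

```c
#include <assert.h>

#define TICK_US 10000  /* one jiffy at 100Hz, in microseconds */

/* Set when the task starts running: did it begin in the first half
 * of the current tick? */
static int dither_flag(unsigned int offset_in_tick_us)
{
    return offset_in_tick_us < TICK_US / 2;
}

/* Decide at the next tick whether to expire the task early.  A
 * dithered task with less than half a tick of timeslice left is
 * expired now, so it overruns by at most half a tick (5ms here)
 * instead of a full tick (10ms). */
static int expire_early(int dither, unsigned int timeslice_left_us)
{
    return dither && timeslice_left_us < TICK_US / 2;
}
```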
Apart from those major changes, I also committed a whole swag of things I had been considering. I use close shifts where possible instead of division, since shifts are cheaper (for example, a right shift by 20 to approximate dividing by about a million). ISO calculations done during a scheduler tick now have their own lock to avoid grabbing the grq lock. I found some mistakes in the test for preemption when tasks of different scheduling policies are running, so SCHED_BATCH behaviour wasn't working properly, and sometimes a CPU other than the lowest priority one was being chosen to test against when a lower priority task may have been running elsewhere; that's been fixed. I decreased the maximum possible rr_interval to 1000 (1 second), since the new changes mean that 5 seconds would overflow, and higher values don't really help anyway. I added a per-CPU last_task variable, which can tell if that task is still cache warm on that CPU even if it has run elsewhere. I dropped the first_time_slice flag, as it was adding overhead without helping: it was possible to retrieve timeslice from children long after they were forked if they had run only very little. I made sure timeslice is split properly between parent and child on fork(), and that a reschedule is forced if the loss of timeslice means the parent has run out. I changed rescheduling to use 100 microseconds as the cutoff for timeslice expiration, since rescheduling with less time available than that will either greatly overrun the quota or reschedule again almost immediately. SCHED_BATCH tasks now refill their timeslice and reset their deadline every time they're rescheduled, which gives them lower weight than before; but since they've been flagged as latency insensitive, they're likely fully CPU bound anyway and will only be rescheduled when they run out of timeslice or are preempted.
I bypass the test for earliest_deadline_task now in schedule() when there is nothing else running and the previous task will simply run again. Numerous other minor cleanups and fixes were committed.
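The fork() timeslice split mentioned above could be sketched like this; `fork_timeslice` and the structure are hypothetical, and only the halving and the 100 microsecond cutoff come from the text:

```c
#include <assert.h>

#define RESCHED_US 100  /* cutoff below which a reschedule is forced */

struct task { unsigned int timeslice_us; };

/* Split the parent's remaining timeslice with the child on fork().
 * Returns 1 if the parent is left with too little to be worth
 * running, so a reschedule must be forced. */
static int fork_timeslice(struct task *parent, struct task *child)
{
    child->timeslice_us = parent->timeslice_us / 2;
    parent->timeslice_us -= child->timeslice_us;
    return parent->timeslice_us < RESCHED_US;
}
```

Splitting rather than granting a fresh slice means a fork-bomb cannot multiply its CPU entitlement.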
So, after that long-winded discussion, what's the advantage of BFS 350 over 318 onwards, and what should you see? Hopefully... nothing really obvious. There may be some lower latencies, and possibly very minor improvements in throughput and behaviour when many different scheduling policies are used at once. So far the feedback from people on 2.6.35.X has been that they haven't noticed a difference. Unfortunately, that is not the case on the older kernels! I probably should never have started backporting these changes to them, since I haven't been testing the backports extensively. Ironically, though, it's those stuck on the older kernels that made me start making these changes in the first place! Furthermore, I now have no idea what's going on in android land. So I'm in a bit of a bind, and I'd love feedback from those on the older kernels, comparing 350 with older versions and with newer kernels. But if you're not adventurous and are on an older kernel and just want stability, stick to BFS 318. It may be that the whole deadline expiry thing is STILL needed, yet I honestly don't know why :(
Oh, I also noticed that despite quite a few people reading this blog, there have been virtually no comments. I know that many many more people read than comment, and that people may not feel qualified to comment, but a simple yes/no, good/bad would be greatly appreciated.
UPDATE: It may well be that the behavioural problems are just 32 bit overflows after all. Texstar of pclinuxos has been testing 32 bit kernels and found them to be problematic. More on this as he helps me sort it out.
Friday, 1 October 2010
BFS in real time
A question I often get is "how does BFS compare to the -rt patchset?" and I also get asked if BFS is compatible with the -rt patchset.
The second question is easy to answer: No. The code that makes up the -rt patchset carves heavily into the core of the CPU scheduler and to make it compatible with BFS would require a complete port.
The first question is a little harder to answer because -rt and BFS are tools for completely different workloads. BFS is a general purpose desktop orientated (yes we do have an extra syllable in English unlike American) CPU scheduler designed to decrease overall latencies to below human perceptible level in all regular workloads. Human perceptible latencies are in the millisecond range, where anything within about the 5ms range will not be noticeable. The -rt patchset is designed to decrease latencies in the microsecond range. This is well below anything that anybody could feel on a desktop. -rt also provides special tools for management of tasks that are running with realtime scheduling policies such as SCHED_FIFO and SCHED_RR. The addition of threaded interrupt handlers, priority inheritance and interrupt priority handling and so on are very specialised tools that are desirable in the realtime world. To use these, one must specially program for the -rt patchset, be using very carefully chosen hardware with known capabilities, run a stripped down userspace that only does precisely what one needs and so on. These are not the domain of a normal desktop user, nor that of any workload a desktop user is likely to ever encounter. If you were doing semi-professional audio recording you might, and then you'd need to understand the inner workings of the software and the -rt patchset to make the most of it. Just patching it in and expecting it to work for you will not really give you any advantage. Most users of the -rt patchset are embedded device manufacturers (and I don't mean Android phones).
A common response I get, then, is "surely if low latencies are good, then extremely low are even better?" Unfortunately, that's not the case, as the cost of running the -rt patchset does incur significant overhead and will not give the user any perceivable advantage, especially if you don't use any of the special features of it.
On the other hand, BFS is designed as a "plug it in and it will work" type of patch for the most part. A greatly undervalued feature of BFS is its ability to handle realtime tasks on SMP machines. BFS uses a global runqueue where the scheduler decides which task is the next most important and chooses the most suitable CPU anywhere to make that task run as soon as possible. It finds the lowest priority, or oldest deadline, task running on any CPU, and kicks it off if something higher priority demands CPU time. While this is the case with any CPU scheduler, the major difference is that a multiple runqueue design (i.e. one queue for every CPU) spends most of its time looking for the highest priority task in each CPU's own runqueue, rather than among all the tasks running on the whole machine. It then only moves tasks around to keep CPUs relatively balanced over time. The multiple queue design is ultimately used because it is more scalable when CPU counts get high.
Using the simplest example possible of a dual core machine, it's easy to demonstrate the disadvantage of the multiple queue design. Suppose 4 tasks are running: two high priority realtime tasks and two regular tasks. In the multiple runqueue design, the two high priority tasks may end up on the same CPU for a period, fighting each other for CPU, while the two regular tasks sit on the other CPU, gorging themselves on CPU time. This scenario only lasts until the CPUs are balanced again (assuming the balancing is correctly biased towards the high priority tasks!), but it can happen over and over, and the more CPUs and the more tasks running, the worse it gets. It means an ultra low priority task, possibly even nice 19, could be getting CPU time on one CPU while a high priority SCHED_FIFO task gets none, because it's on another CPU queued below an even higher priority SCHED_FIFO task. If you are using the mainline kernel with more than one realtime task on an SMP machine, it is imperative that you use affinities to cope with this situation and ensure the lowest possible latencies. And there is no workaround for when there are more realtime tasks than CPU cores.
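The dual core scenario can be made concrete with a toy model; this is purely illustrative and bears no relation to real scheduler code. Higher number means higher priority:

```c
#include <assert.h>

/* Global-queue pick: the best task overall that isn't already
 * running, regardless of which CPU it will land on. */
static int pick_global(const int *prio, const int *running, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (!running[i] && (best < 0 || prio[i] > prio[best]))
            best = i;
    return best;
}

/* Per-CPU pick: the best task in this CPU's own queue only, so a
 * high priority task stuck in the wrong queue is invisible here. */
static int pick_local(const int *prio, const int *cpu_of, int cpu, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (cpu_of[i] == cpu && (best < 0 || prio[i] > prio[best]))
            best = i;
    return best;
}
```

With two realtime tasks (priorities 90 and 80) queued on CPU 0 and two regular tasks (5 and 1) on CPU 1, the per-CPU picks run the priority-90 and priority-5 tasks, leaving the priority-80 realtime task waiting behind its sibling, while a global queue runs the two realtime tasks first.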
When I had the time and inclination to hack for mainline, I created what I called the "SMP Nice" patches which were designed to try and balance things out to prevent all the high priority tasks clustering on one CPU and the low priority tasks on another and so on. A lot more work has gone into that on mainline since then to help alleviate that problem. An interesting footnote is that Peter Williams, who helped work on those patches, reported to me he noticed a few years ago that the priority management on SMP machines on windows was woeful for a while and it was a dead giveaway that they had changed to multiple runqueues.
BFS, with its global runqueue, manages these realtime tasks optimally without any extra effort. The highest priority task will always run first on any CPU it's allowed on. I've had quite a few people report to me their surprise to find that realtime tasks worked better on BFS, usually with audio on jackd, and asked if there was a reason for it. Priority balancing on SMP just works by design on a global runqueue without any extra effort. 'nice' also works better on SMP for exactly the same reason.
The second question is easy to answer: No. The code that makes up the -rt patchset carves heavily into the core of the CPU scheduler and to make it compatible with BFS would require a complete port.
The first question is a little harder to answer because -rt and BFS are tools for completely different workloads. BFS is a general purpose desktop orientated (yes we do have an extra syllable in English unlike American) CPU scheduler designed to decrease overall latencies to below human perceptible level in all regular workloads. Human perceptible latencies are in the millisecond range, where anything within about the 5ms range will not be noticeable. The -rt patchset is designed to decrease latencies in the microsecond range. This is well below anything that anybody could feel on a desktop. -rt also provides special tools for management of tasks that are running with realtime scheduling policies such as SCHED_FIFO and SCHED_RR. The addition of threaded interrupt handlers, priority inheritance and interrupt priority handling and so on are very specialised tools that are desirable in the realtime world. To use these, one must specially program for the -rt patchset, be using very carefully chosen hardware with known capabilities, run a stripped down userspace that only does precisely what one needs and so on. These are not the domain of a normal desktop user, nor that of any workload a desktop user is likely to ever encounter. If you were doing semi-professional audio recording you might, and then you'd need to understand the inner workings of the software and the -rt patchset to make the most of it. Just patching it in and expecting it to work for you will not really give you any advantage. Most users of the -rt patchset are embedded device manufacturers (and I don't mean Android phones).
A common response I get, then, is "surely if low latencies are good, then extremely low are even better?" Unfortunately, that's not the case, as the cost of running the -rt patchset does incur significant overhead and will not give the user any perceivable advantage, especially if you don't use any of the special features of it.
On the other hand, BFS is designed as a "plug it in and it will work" type of patch for the most part. A greatly undervalued feature of BFS is its ability to handle realtime tasks on SMP machines. BFS uses a global runqueue where the scheduler decides which task is the next most important and will choose the most suitable CPU anywhere to make that task run as soon as possible. It finds the lowest-priority or oldest-deadline task running on any CPU, and kicks that off if there is something higher priority that demands CPU time. While this is the case with any CPU scheduler, the major difference is that a multiple-runqueue design (i.e. one runqueue for every CPU) spends most of its time looking for the highest priority task in its own runqueue, rather than among all the tasks running on the whole machine. It then only moves tasks around to keep CPUs relatively balanced over time periods. The multiple-runqueue design is ultimately used because it is more scalable when CPU numbers get high.
Using the simplest example possible of a dual core machine, it's easy to demonstrate the disadvantage of this multiple-queue design. It is possible for, say, 4 tasks to be running - two high priority realtime tasks and two regular tasks. In the multiple runqueue design, the two high priority tasks may end up on the same CPU for a period, fighting each other for CPU, while the two regular tasks are on the other CPU, gorging themselves on CPU time. This scenario only exists for a short period until the CPUs are balanced again (assuming the balancing is done right, biasing for the high priority tasks!), but it can happen over and over again, and the more CPUs, and the more tasks running, the worse this situation gets. It means that an ultra low priority task, possibly even at nice 19, could be getting CPU time on one CPU, while a high priority SCHED_FIFO task is not getting CPU time because it's on another CPU waiting behind an even higher priority SCHED_FIFO task. If you are using the mainline kernel with more than one realtime task on an SMP machine, it is imperative that you use affinities to cope with this situation and ensure you have the lowest possible latencies. However, there is no workaround for when there are more realtime tasks than CPU cores.
When I had the time and inclination to hack on mainline, I created what I called the "SMP Nice" patches, which were designed to try and balance things out to prevent all the high priority tasks clustering on one CPU and the low priority tasks on another, and so on. A lot more work has gone into mainline since then to help alleviate that problem. An interesting footnote is that Peter Williams, who helped work on those patches, reported to me a few years ago that he had noticed priority management on SMP machines on Windows was woeful for a while, and it was a dead giveaway that they had changed to multiple runqueues.
BFS, with its global runqueue, manages these realtime tasks optimally without any extra effort. The highest priority task will always run first on any CPU it's allowed on. I've had quite a few people report to me their surprise to find that realtime tasks worked better on BFS, usually with audio on jackd, and asked if there was a reason for it. Priority balancing on SMP just works by design on a global runqueue without any extra effort. 'nice' also works better on SMP for exactly the same reason.
Wednesday, 29 September 2010
Biasing for latency under load
One of the mantras of BFS is that it has very little in the way of tunables and should require no input on the part of the user to get both good latency and good throughput by default. The only tunable available is the rr_interval found in /proc/sys/kernel/; turning it up will improve throughput at the expense of latency, while turning it down will do the opposite (iso_cpu is also there, but that's more a permission tunable than a scheduler behavioural tunable). To give the bottom line for those who want to tune: I suggest running 100Hz with an rr_interval of 300 if you are doing nothing but CPU intensive slow tasks (video encoding, folding etc), and running 1000Hz with an rr_interval of 2 if you care only for latency at all costs.
I've believed for a long time that it makes no sense to tune for ridiculously high loads on your machine if your primary use is a desktop, and if you get into an overload situation, you should expect a slowdown. Trying to tune for such conditions always ends up costing you in other ways that just aren't worth it, since you spend 99.9999% of your time at "normal loads". What BFS does at high loads is a progressive lengthening of latency proportional to the load while maintaining relatively regular throughput. So if you run, say, a make -j4 on a quad core machine, you shouldn't really notice anything going on, but if you run a make -j64 you should notice a dramatic slowdown, loss of fluid movement of your cursor, and possibly have audio skip. What exactly is the point of doing a make -j64 on a quad core desktop? There is none, apart from as some kind of mindless test. A busy server, however, spends most of its time at loads much greater than 1 per CPU. In that setting, maintaining reasonable latency while ensuring maximal throughput is optimal.
The mainline kernel developers seem intent on continually readdressing latency under load on the desktop as though it were some holy grail. Lately a make -j10 load on a uniprocessor has been used as the benchmark. What they're finding, not surprisingly, is that the lower you aim your latencies, the smoother the desktop will continue to feel at higher loads, and they are trying to find some "optimum" value where latency will still be good without sacrificing throughput too much. Why 10? Why not 100? How about 1000? Why choose some arbitrary upper figure to tune to? Why not just accept that overload is overload, that latency is going to suffer, and not damage throughput trying to contain it?
For my own response to this, here's a patch:
(edit, patch for bfs357)
bfs357-latency_bias.patch
The changelog, also in the patch itself, follows:
Make it possible to maintain low latency as much as possible at high loads by shortening the timeslice the more loaded the machine gets. Do so by adding a tunable, latency_bias, which is disabled by default. Valid values are from 0 to 100, where higher values mean biasing more for latency as load increases. Note that this should still maintain fairness, but will sacrifice throughput, potentially dramatically, to try and keep latencies as low as possible. Hz will still be a limiting factor, so the higher Hz is, the lower the latencies maintainable.
The effect of enabling this tunable will be to ensure that very low CPU usage processes, such as mouse cursor movement, will remain fluid no matter how high the load is. It's possible to have a smooth mouse cursor with massive loads, but the effect on throughput can be up to a 20% loss at ultra high loads. At meaningful loads, a value of one will have minimal impact on throughput and ensure that under the occasional overload condition the machine will still feel fluid.
This is to achieve the converse of what normally happens. You can choose to tune to maintain either latency at high loads (set to 100), or throughput (set to 0 for current behaviour), or some value in between (set to 1 or more). So I'm putting this out there for people to test and report back to see if they think it's worthwhile.