TL;DR: Lower latency, better throughput and better ondemand behaviour on SMP.
BFS 400 for 2.6.38.2:
2.6.38.2-sched-bfs-400.patch
BFS 400 for 2.6.35.12:
2.6.35.12-sched-bfs-400.patch
2.6.38-ck3 includes BFS 400 (applies to 2.6.38.2):
patch-2.6.38-ck3.bz2
Ubuntu packages 2.6.38.2-ck2 and 2.6.35.12-ck1 include BFS 400:
Ubuntu Packages
It's been a while in the making, and the advantages of this new BFS are so substantial that they warrant a new version. I was going to announce this on lkml with lots of fanfare, but I see there is still some rare desktop hardware that still is not happy on ondemand governing. SO, my summary is: If you are on a desktop, stick with the performance governor and you'll get significant performance and latency advantages, and if you're on a laptop, you'll find BFS 400 much better than previous iterations with both ondemand and performance governing. There is hard evidence it is better for these. What follows is a long post that I was planning on submitting to lkml, but I'm very wary of burnout. Burnout for me is not to do with being tired of working on the code. No, not at all. The issue is that coding will start intruding too much on my regular life which has absolutely nothing to do with linux kernel and code at all. Once it interferes with my home life, or work, or health, I know it's time to cut back. And that's where I am right now. So instead, here's the post just on the blog:
---
This is to announce the most major upgrade in the BFS CPU scheduler in a
while.
For those tracking the development on my blog, this is version 0.376 renamed
with mostly cosmetic changes only.
Benchmarks first:
x264 benchmarks on a quad core Xeon X3360 with ondemand (courtesy of Graysky):
x264encodingfps.png
x264encodingsec.png
compilefilezilla340.png
Benchmarks on a 4 thread 2 core i7 M620:
m620-benchmarks.txt
Quad core Q9650:
q9650-benchmarks.txt
Prescott 3.2GHz Single core HT:
prescott-benchmarks.txt
Six core AMD: This has the all important LATENCY and throughput benchmarks.
Each benchmark is the latency with the following loads on a 6 core AMD, make -
j1, j3, j6, j9, j12 and x264 medium and x264 ultrafast (courtesy of Serge Belyshev):
http://ck.kolivas.org/patches/bfs/benchmark3-results-for-announcement-20110410.txt
Latency Graphs here:
http://ck.kolivas.org/patches/bfs/bfs400-cfs/
Sample image from the all important -j6 (where load == number of CPUs):
The BFS cpu scheduler, designed for optimum interactivity and responsiveness
while maintaining excellent throughput on commodity hardware, has been
modified to improve its behaviour on SMP machines, especially when used in
combination with scaling CPU frequency governors (like ondemand), and on
mulitcore hardware that has "turbo" modes. As many distributions are now
setting ondemand as the default governor, even on desktops, and some desktop
GUIs aren't even offering a way to alter the choice of governor any more, it
was clear that I needed to modify BFS to perform optimally with these
governors.
Briefly, BFS is a single runqueue design which is used to make the most of the
fact that single runqueues overall have the lowest average latency compared to
separate runqueue designs.
See this youtube link about queuing theory for an interesting parallel with
supermarket checkouts:
Why the other line is likely to move faster
The most consistent feature in BFS to date that was found to improve average
service latency whilst maintaining throughput was that no CPU was ever left to
be idle if there was a task that desired CPU time. Through extensive
experimentation I discovered that it was extremely easy for 4 or more tasks to
desire CPU at once on a virtually idle system. Thus on a quad core, the
average service time would always be lower if the tasks are allowed to move to
the first available idle CPU.
One of the disadvantages of this approach, though, is that although average
service latency is kept low by always using the first available CPU in a
relatively idle system, it means that work never clusters on a CPU. So when a
load-dependent CPU frequency governor is in use, no one CPU is ever seen as
being fully loaded except once the load is great enough for all CPUs to be
loaded. This meant that with BFS, CPUs would not be ramped up to their maximum
speed, and further, the "turbo" modes of newer multicore CPUs were never
engaged since they rely on one core being used while the others are kept idle.
Mainline does not exhibit this problem by virtue of always keeping work local
to each CPU and forcing work elsewhere once some kind of heuristically
determined imbalance was measured and then a rebalance would be forced.
It became apparent that to improve throughput with scaling governors and turbo
CPUs, that I would have to allow CPUs to become idle and service tasks
clustered to less CPUs in order for them to speed up. Trying to find the right
compromise between improving throughput while keeping the service latency low
by allowing a CPU to go idle would require heuristics which I've avoided
greatly to date on BFS. One great concern I have for the mainline scheduler is
the ever-expanding special case treatment of tasks on different architectures
for different one-benchmark workloads that do better, turning our mainline
balancing system into an unmanageable state machine.
Thus I devised a system whereby tasks that are running on a CPU but get
descheduled involuntarily in preference for another task instead of sleeping,
get flagged as "sticky". The reasoning being that if a task had not finished,
then it certainly was going to continue working once rescheduled. Then I added
a cpu_scaling flag which would tell the scheduler that a CPU had been throttled
from its maximum frequency. When one CPU is choosing its next task, if a task
is sticky on another CPU and the current CPU is throttled, it will not take
the task.
This extremely simple technique proved surprisingly good for throughput
without destroying latency. Further investigation revealed the same sticky
system could also be used to determine when to bias against tasks going next
on the local CPU even if the CPU wasn't throttled. So the existing "cache
distance" calculation was removed entirely and tasks that are sticky are just
seen as having soft affinity and biased against. This means that waking tasks
still get very low latency, but sticky tasks are very likely to go back to
their own CPU, while no CPU is left idle if there is work to do, _provided_
the CPU is not throttled.
The existing system for determining which idle CPU to bind to remains
unchanged for it has proven very effective and complementary to this added
change.
So the entire equivalent of a "balancing" mechanism for BFS consists of:
1. Flagging the last CPU bound task that was involuntarily scheduled as being
softly affined to its CPU, and biasing against it on other CPUs.
2. Choosing an idle CPU to wake up based on how busy that core is, and how
close it was to the original CPU it last ran on.
It was discovered the improved throughput meant I was able to further lower
the rr interval such that it is initially set to 6ms regardless of the number
of CPUs, thus improving latency further on SMP.
The resulting change to BFS means throughput with the ondemand CPU frequency
governor is now very close to that of the performance governor, power usage
may be decreased (theoretical, not yet proven), and better throughput and
latency on the performance governor. I would like to point out, however, that
some CPUs were not really designed with ondemand governing in mind,
specifically such as the "extreme" desktop CPUs by Intel. These tend to only
have 2 frequencies, are slow to change from one frequency to another, and may
lie about which CPUs are throttled and actually throttle all of them. Given
how little power saving they really offer, I suggest anyone with a desktop CPU
test for themselves how well it works with ondemand and decide whether they
want to build it into the kernel or not as a way of preventing your
distribution from using the ondemand governor without your knowledge. I detest
the lack of configurability in newer distributions and GUIs, but that's another
story.
In order to quantify the changes in this BFS version, benchmarks were
performed by myself and the help of some other very helpful members of the
community. Thanks to all those who tested along the way and helped it get this
far. The very simple, but reliable and repeatable benchmarks of kbuild at
various jobs and x264 were used to determine the effects on throughput, while
the wakeup-latency app by Jussi Laako was run in the background during the
benchmark as a rough guide to what effect the various loads had on latency.
Again I must stress, the purpose for BFS is to be used as a testing ground for
scheduling to see what can be done when the scheduler is specifically aimed at
commodity hardware for normal desktops and affordable servers. As such, it
should serve as a reference tool to compare to for these workloads and I'm
very pleased to see the uptake of its use actually by the community, and that
mainline is occasionally reminded of the need to keep in touch with this
hardware.
To the server admins who have contacted me to tell me of their good
experiences with BFS, thank you, but I must stress again that I am unable to
implement extra features like CGROUPS that some of you have asked for. My
time is limited, and my work on the kernel completely unpaid so I can only
direct my attention to areas I personally have interest in.
EDIT: In response to questions from those who haven't been following the progress, yes this BFS is the Brain Fuck Scheduler.
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Sunday, 10 April 2011
Thursday, 7 April 2011
Quick lrzip comparison
So I was building a kernel package for myself and when I was done saw a large directory and thought, what the heck, I'll compress it with lrzip for grins.
These were done on a quad core 3GHz with 8GB ram using nothing but default options. Note that lrzip was done using the lrztar wrapper to make the comparison fair since it's just a single command without temporary files, so it took 3 passes to compress. If I compressed it with temporary files it would be smaller again (i.e. tar cf directory && lrzip)
Of course there is parallel bzip2 which speeds up bzip2's compression but has no effect on compression. And then there is xz which slows down 7z's compression and has no effect on compression. So why is xz becoming the new defacto standard? Probably because it's not aimed squarely at the windows market the way 7z is and feels more politically correct to the linux crowd. Maybe it's because it's called lzma2 that it must be better than lzma since it has a higher version number. I've never seen any performance or compression advantage of any significance with lzma2 versus lzma. Personally I'm relatively unimpressed with xz so I don't really understand it. Even less understandable is why the kernel has both lzma AND xz support in it now.
I wish I could get people more excited about lrzip :)
bzip2 size: 1831647009 time: 12m50s (command "tar cjf") 7z size: 945054166 time: 36m23s (command "7za a") lrzip size: 586087630 time: 17m09s (command "lrztar")
These were done on a quad core 3GHz with 8GB ram using nothing but default options. Note that lrzip was done using the lrztar wrapper to make the comparison fair since it's just a single command without temporary files, so it took 3 passes to compress. If I compressed it with temporary files it would be smaller again (i.e. tar cf directory && lrzip)
Of course there is parallel bzip2 which speeds up bzip2's compression but has no effect on compression. And then there is xz which slows down 7z's compression and has no effect on compression. So why is xz becoming the new defacto standard? Probably because it's not aimed squarely at the windows market the way 7z is and feels more politically correct to the linux crowd. Maybe it's because it's called lzma2 that it must be better than lzma since it has a higher version number. I've never seen any performance or compression advantage of any significance with lzma2 versus lzma. Personally I'm relatively unimpressed with xz so I don't really understand it. Even less understandable is why the kernel has both lzma AND xz support in it now.
I wish I could get people more excited about lrzip :)
BFS 0.376 test
TL;DR: Fastest BFS yet for SMP.
After extended testing on BFS 0.373, a number of minor issues came up, but the results were very promising. Now I believed I've addressed all the known issues with a newer version. Instead of flagging scaling CPUs by their governor alone, I now flag them as scaling only when they're actually throttled from maximum speed. This improves throughput further with the dynamic scaling governors like ondemand and brings it now very close to that of performance under full load. I also found that the sticky flagged tasks were not keeping their sticky flags if they were rescheduled back to back. This gave me even more of a performance boost under all situations. I addressed the oops that can occur on UP, and finally I updated the docs to match the changes in the scheduler design.
So hopefully this will be the last test patch (fingers crossed) before I make it official, because... I'm about >< close to burnout. That's not something I want to experience. Incremental for those on BFS 363 already: bfs363-376-test.patch
Full patch for 2.6.38ish:
2.6.38-sched-bfs-376.patch
Benchmarks as they come to hand...
---
x264 benchmarks Courtesy of Graysky:
Higher is better: boxplotencodethroughput.png
Lower is better: boxplotencodetime.png
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
---
After extended testing on BFS 0.373, a number of minor issues came up, but the results were very promising. Now I believed I've addressed all the known issues with a newer version. Instead of flagging scaling CPUs by their governor alone, I now flag them as scaling only when they're actually throttled from maximum speed. This improves throughput further with the dynamic scaling governors like ondemand and brings it now very close to that of performance under full load. I also found that the sticky flagged tasks were not keeping their sticky flags if they were rescheduled back to back. This gave me even more of a performance boost under all situations. I addressed the oops that can occur on UP, and finally I updated the docs to match the changes in the scheduler design.
So hopefully this will be the last test patch (fingers crossed) before I make it official, because... I'm about >< close to burnout. That's not something I want to experience. Incremental for those on BFS 363 already: bfs363-376-test.patch
Full patch for 2.6.38ish:
2.6.38-sched-bfs-376.patch
Benchmarks as they come to hand...
---
x264 benchmarks Courtesy of Graysky:
Higher is better: boxplotencodethroughput.png
Lower is better: boxplotencodetime.png
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
---
Sunday, 3 April 2011
BFS 0.373 test
The BFS 0.372 test patch has proved quite a success.There have been no regressions in performance, with slight improvements even with the performance CPU governor, and better latency all round on SMP now. There were some rare crashes that I had to track down and I believe I've fixed them all so I'm releasing another test patch, 0.373, which has addressed them and is otherwise the same as 0.372.
Apply to a 0.363 patched kernel:
bfs363-373-test.patch
Thanks to those who have tested and reported back so far!
UPDATE: Throughput benchmark results on a 6 core AMD courtesy of Serge Belyshev:
benchmark3.results-cfs-bfs363-bfs373.txt
I am planning on posting latency benchmarks soon too.
Apply to a 0.363 patched kernel:
bfs363-373-test.patch
Thanks to those who have tested and reported back so far!
UPDATE: Throughput benchmark results on a 6 core AMD courtesy of Serge Belyshev:
benchmark3.results-cfs-bfs363-bfs373.txt
I am planning on posting latency benchmarks soon too.
Subscribe to:
Posts (Atom)