TL;DR: Fastest BFS yet for SMP.
After extended testing on BFS 0.373, a number of minor issues came up, but the results were very promising. I now believe I've addressed all the known issues in a newer version. Instead of flagging scaling CPUs by their governor alone, I now flag them as scaling only when they're actually throttled from maximum speed. This further improves throughput with dynamic scaling governors like ondemand and brings it very close to that of the performance governor under full load. I also found that sticky flagged tasks were not keeping their sticky flags if they were rescheduled back to back. Fixing that gave me even more of a performance boost in all situations. I addressed the oops that can occur on UP, and finally I updated the docs to match the changes in the scheduler design.
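As a rough illustration of the first change, here is a minimal userspace sketch in Python. It is not the BFS kernel code: the function name `is_throttled`, the sample frequencies, and the way "flagging by governor alone" is modelled (anything other than performance counts as scaling) are all assumptions made for the example. The point is only that a CPU counts as scaling when its current frequency is below its maximum, whichever governor is selected.

```python
def is_throttled(cur_khz: int, max_khz: int) -> bool:
    """Return True only if the CPU is currently running below its maximum frequency."""
    if max_khz <= 0:
        raise ValueError("max_khz must be positive")
    return cur_khz < max_khz


# Hypothetical samples: (governor, current kHz, maximum kHz).
samples = [
    ("ondemand", 3_400_000, 3_400_000),     # dynamic governor, but at full speed
    ("ondemand", 2_000_000, 3_400_000),     # dynamic governor, throttled
    ("performance", 3_400_000, 3_400_000),  # always at full speed
]

# Assumed model of the old approach: flag by governor alone.
flag_by_governor = [gov != "performance" for gov, _, _ in samples]
# Newer approach as described: flag only when actually throttled.
flag_by_speed = [is_throttled(cur, maximum) for _, cur, maximum in samples]

print(flag_by_governor)
print(flag_by_speed)
```

In this toy model the two approaches differ only for the first sample: an ondemand CPU already at full speed is no longer treated as a scaling CPU.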
So hopefully this will be the last test patch (fingers crossed) before I make it official, because... I'm about >< close to burnout. That's not something I want to experience.
Incremental for those on BFS 363 already:
bfs363-376-test.patch
Full patch for 2.6.38ish:
2.6.38-sched-bfs-376.patch
Benchmarks as they come to hand...
---
x264 benchmarks Courtesy of Graysky:
Higher is better: boxplotencodethroughput.png
Lower is better: boxplotencodetime.png
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
---
Legendary, thank you CK! Don't burn yourself out :)
Keep up the good work; I've been using your patches since kernel 2.4 and I check for updates every couple of days because they're that damn good.
Ah-hah, this made my morning! I don't know which is worse: my kernel-recompiling addiction or my Android-ROM-flashing addiction...
ReplyDeleteInteractivity on my quad-core Xeon during heavy compiles and encodes has never been better. No more Rhythmbox skipping... whew!
Anyways, thanks from Kenya!
Yes CK.
Thank you for your work even though you're not really that appreciated by the mainstream kernel devs.
bfs + bfq is a good combo here on 2.6.35.11
Great job, CK! BFS v0.376 slightly outperforms v0.363 in my x264 tests. The difference is darn close to statistically significant, by the way. As always, both versions of BFS are faster than the corresponding mainline scheduler, by 2.6% and 3.0% respectively.
Here are the data:
http://img705.imageshack.us/img705/2135/boxplotencodethroughput.png
http://img849.imageshack.us/img849/4756/boxplotencodetime.png
Anyway, keep up the great work!
Excellent, thank you very much everyone for your testing and feedback. It made a massive difference to making sure I tackled all the issues. I'm hoping this release can go gold over the weekend as version 0.4. If I may, graysky, could I post those very pretty graphs on my BFS page?
@CK - Always glad to help. Keep up the great work! Please feel free to repost.
Forgot to add some context to the data.
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
Thank you for your work.
When can we expect Ubuntu packages for 0.376?
I want to put it on an AMD Phenom II X6 1090T to run 10-15 game servers.
Did a few more comparisons of mainline vs. bfs.
In x264 encoding, both bfs versions (0.363 and 0.376) beat the mainline scheduler hands down.
For compiling though, and interestingly enough, when my quad core CPU compiled filezilla-3.4.0, using make -j4, the latest bfs clearly beat both its predecessor and mainline; however, adding the extra thread to mainline (make -j5) brought it statistically in-line with bfs v0.376 for total compile time. Dunno what to make of that.
http://img854.imageshack.us/img854/4042/compilefilezilla340.png
http://img402.imageshack.us/img402/5769/720p60x264encode.png
Thanks. That's typical of mainline's inability to utilise CPUs fully when load==CPUs, unlike BFS.
ReplyDeleteI would like to remind all af you what the problem was: Latency performance
ReplyDelete1.) Latency on the desktop
2.) Energy efficiency for most of us having notebooks
3.) Throughput performance
All of that your tests done looking only at 3. ?
I now have the same Kernel:
2.6.38.2
2.6.38.2 + ck1 + 363
2.6.38.2 + ck1 + 376
And I would like to have a good test tool at hand to see results regarding 1.) for my Intel Core 2 Mac mini!
Ralph, that's a very good comment. SSB, who has performed many of the benchmarks to date, is compiling some meaningful benchmarks in this area and I will post them as they come to hand.
http://ck.kolivas.org/patches/bfs/test/wakeup-latency.c is a latency testing benchmark that he has run under various conditions with these different kernels, with different throughput benchmarks running at the same time. The results will be available soon, but suffice to say they're reassuring. Note also that absolute wake-up latency isn't the whole picture, since it is wakeup-and-achieving-work that matters, but the two are intimately related.
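For a rough idea of what a wake-up latency test measures, here is a minimal Python sketch. It is not the wakeup-latency.c program linked above, whose contents are not reproduced here; the function name `measure_wakeup_latency` and its parameters are made up for illustration. It requests a short sleep and records how much later than requested the process actually resumes.

```python
import time


def measure_wakeup_latency(interval_s: float = 0.001, samples: int = 100) -> list[float]:
    """Sleep repeatedly and return the overshoot of each wake-up, in microseconds."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        # Overshoot: how much longer the sleep took than was requested.
        overshoots.append((elapsed - interval_s) * 1e6)
    return overshoots


if __name__ == "__main__":
    results = sorted(measure_wakeup_latency())
    print(f"median overshoot: {results[len(results) // 2]:.1f} us")
    print(f"worst overshoot:  {results[-1]:.1f} us")
```

Running this while a throughput load (a compile or an encode, say) is active gives a crude view of how quickly a sleeping task gets back onto a CPU. As noted above, that is only part of the picture, since it does not measure how soon the woken task gets useful work done.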