Thursday, 7 April 2011

BFS 0.376 test

TL;DR: Fastest BFS yet for SMP.

After extended testing on BFS 0.373, a number of minor issues came up, but the results were very promising. I now believe I've addressed all the known issues with a newer version. Instead of flagging scaling CPUs by their governor alone, I now flag them as scaling only when they're actually throttled from maximum speed. This further improves throughput with the dynamic scaling governors like ondemand and brings it very close to that of the performance governor under full load. I also found that sticky-flagged tasks were not keeping their sticky flags if they were rescheduled back to back. Fixing this gave me even more of a performance boost in all situations. I addressed the oops that can occur on UP, and finally I updated the docs to match the changes in the scheduler design.
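
To make the change in the scaling flag concrete, here is a minimal sketch in C. It is not code from the BFS patch: the struct, its fields, and both function names are made up for illustration, and the real scheduler tracks this state differently. It only contrasts the two rules described above, flagging by governor alone versus flagging only when the CPU is actually below its maximum speed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU frequency state. These names are illustrative
 * only and are not taken from the BFS patch. */
struct cpu_freq_state {
    unsigned int cur_khz;    /* current clock speed in kHz */
    unsigned int max_khz;    /* maximum clock speed in kHz */
    bool dynamic_governor;   /* true for governors such as ondemand */
};

/* Old rule: any CPU on a dynamic governor counts as scaling,
 * even while it is running flat out. */
bool scaling_by_governor(const struct cpu_freq_state *s)
{
    return s->dynamic_governor;
}

/* New rule: a CPU counts as scaling only while it is actually
 * throttled below its maximum speed. */
bool scaling_by_throttle(const struct cpu_freq_state *s)
{
    assert(s->max_khz > 0);
    return s->cur_khz < s->max_khz;
}
```

Under the old rule, a CPU on ondemand that has already ramped up to full speed is still treated as scaling. Under the new rule it is treated the same as a CPU on the performance governor.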

So hopefully this will be the last test patch (fingers crossed) before I make it official, because... I'm about >< close to burnout. That's not something I want to experience. Incremental for those on BFS 363 already: bfs363-376-test.patch

Full patch for 2.6.38ish:
2.6.38-sched-bfs-376.patch

Benchmarks as they come to hand...

---
x264 benchmarks courtesy of Graysky:
Higher is better: boxplotencodethroughput.png
Lower is better: boxplotencodetime.png
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
---

15 comments:

  1. Legendary, thank you CK! Don't burn yourself out :)

  2. Keep up the good work; I've been using your patches since kernel 2.4 and I check for updates every couple of days because they're that damn good.

  3. Ah-hah, this made my morning! I don't know which is worse: my kernel-recompiling addiction or my Android-ROM-flashing addiction...

    Interactivity on my quad-core Xeon during heavy compiles and encodes has never been better. No more Rhythmbox skipping... whew!

    Anyways, thanks from Kenya!

  4. Yes CK.
    Thank you for your work even though you're not really that appreciated by the mainstream kernel devs.
    bfs + bfq is a good combo here on 2.6.35.11

  5. Great job, CK! BFS v0.376 slightly outperforms v0.363 in my x264 tests. The difference is darn close to a statistically significant margin, by the way. As always, both versions of BFS are faster than the corresponding mainline scheduler, by 2.6 % and 3.0 % respectively.

    Here are the data:

    http://img705.imageshack.us/img705/2135/boxplotencodethroughput.png
    http://img849.imageshack.us/img849/4756/boxplotencodetime.png

    Anyway, keep up the great work!

  6. Excellent, thank you very much everyone for your testing and feedback. It made a massive difference to making sure I tackled all the issues. I'm hoping this release can go gold over the weekend as version 0.4. If I may, graysky, could I post those very pretty graphs on my BFS page?

  7. @CK - Always glad to help. Keep up the great work! Please feel free to repost.

  8. Forgot to add some context to the data.

    CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
    Linux version: Arch x86_64
    x264 version: 0.114.x
    handbrake version: svn3853
    Base kernel version: 2.6.38.2
    CK Patchset: CK1

    Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.

    Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.

  9. Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal

  10. Thank you for your work.
    When can we expect Ubuntu Packages vs 0.376?

  11. I want to put it on an AMD Phenom II X6 1090T running 10-15 game servers.

  12. Did a few more comparisons of mainline vs. bfs.

    In x264 encoding, both bfs versions (0.363 and 0.376) beat the mainline scheduler hands down.
    For compiling though, and interestingly enough, when my quad core CPU compiled filezilla-3.4.0, using make -j4, the latest bfs clearly beat both its predecessor and mainline; however, adding the extra thread to mainline (make -j5) brought it statistically in-line with bfs v0.376 for total compile time. Dunno what to make of that.

    http://img854.imageshack.us/img854/4042/compilefilezilla340.png
    http://img402.imageshack.us/img402/5769/720p60x264encode.png

  13. Thanks. That's typical of mainline's inability to utilise CPUs fully when load==CPUs, unlike BFS.

  14. I would like to remind all of you what the problem was: latency performance

    1.) Latency on the desktop
    2.) Energy efficiency for most of us having notebooks
    3.) Throughput performance

    Are all of your tests looking only at 3.)?

    I now have the same Kernel:
    2.6.38.2
    2.6.38.2 + ck1 + 363
    2.6.38.2 + ck1 + 376

    And I would like to have a fine test tool at hand to see results regarding 1.) for my Intel Core 2 Mac mini!

  15. Ralph, that's a very good comment. SSB, who has performed many of the benchmarks to date, is compiling some meaningful benchmarks in this area and I will post them as they come to hand.
    http://ck.kolivas.org/patches/bfs/test/wakeup-latency.c
    Is a latency testing benchmark that he has run under various conditions with these different kernels and different throughput benchmarks at the same time. The results will be available soon, but suffice to say they're reassuring. Note also that absolute wake-up latency isn't the whole picture, since it will be wakeup-and-achieving-work that matters, but the two are intimately related.
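
For readers wondering what a wake-up latency measurement looks like in principle, here is a rough sketch in C. It is not the linked wakeup-latency.c, whose contents are not shown here, and the function names are invented for this example. It assumes the simplest approach: ask to sleep for a fixed interval, then see how much later than requested the task actually got to run again.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert a timespec to a plain nanosecond count. */
long long ts_to_ns(const struct timespec *ts)
{
    return (long long)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/* Sleep for sleep_ns nanoseconds and return how many nanoseconds
 * longer than requested the sleep took. Returns -1 on error. */
long long wakeup_latency_ns(long long sleep_ns)
{
    struct timespec start, end, req, rem;

    assert(sleep_ns >= 0);
    req.tv_sec = (time_t)(sleep_ns / NSEC_PER_SEC);
    req.tv_nsec = (long)(sleep_ns % NSEC_PER_SEC);

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1;
    /* nanosleep() returns -1 with errno == EINTR if a signal cut the
     * sleep short; rem then holds the unslept time, so retry with it. */
    while (nanosleep(&req, &rem) != 0) {
        if (errno != EINTR)
            return -1;
        req = rem;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1;

    return (ts_to_ns(&end) - ts_to_ns(&start)) - sleep_ns;
}

/* Repeat the measurement and keep the worst (largest) overshoot. */
long long worst_wakeup_latency_ns(int rounds, long long sleep_ns)
{
    long long worst = 0;

    for (int i = 0; i < rounds; i++) {
        long long lat = wakeup_latency_ns(sleep_ns);
        if (lat < 0)
            return -1;
        if (lat > worst)
            worst = lat;
    }
    return worst;
}
```

Calling worst_wakeup_latency_ns(1000, 1000000) while a throughput benchmark runs in the background gives one worst-case figure per kernel. As noted above, this captures only the wake-up itself, not how quickly the woken task then gets useful work done.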
