Monday, 26 September 2011

BFS 0.410 test with skiplists.

Hi all.

TL;DR: Test release for fastest BFS ever.

The skiplists patch has proven to be quite stable, but a couple of minor issues have shown up. First, as I knew, the ondemand behaviour would not quite match the current release, and second, SCHED_IDLEPRIO tasks weren't scheduled properly (they acted like normal tasks). However the code itself seemed quite safe otherwise. So I've tentatively put out a test release of the next version of BFS. The two main changes to the skiplist code are to bring the ondemand governor performance up to par with current BFS, and to fix the behaviour of SCHED_IDLEPRIO tasks.

Quick benchmarks:
BFS 406:
Make -j4: 26.6s
Make -j : 27.8s

BFS 410:
Make -j4: 26.4s
Make -j : 27.1s


Changelog in patch:
Implement skip lists as described here:
http://en.wikipedia.org/wiki/Skip_list
for the main priority queue.

The old queue had: O(1) insertion, O(1) removal, but a lookup involving
both a binary search of a bitmap and O(n) search through a linked list which
is very cache unfriendly as the list gets larger.

This queue is now: O(log n) insertion, O(1) lookup and O(k) lookup in a much
more cache friendly manner.

This should not compromise performance at all at very low loads, but improve
both throughput and latency as loads get higher, as confirmed by benchmarks.


Other changes: Cleanups of variable choices and micro-optimisations.

Full bfs410 patch for 3.0.x:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-410.patch

Incremental bfs410 for those on 3.0.x-bfs406:
http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

Incremental bfs410 for those on 3.0.x-ck1:
http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-410.patch

Note that make headers check will report errors, so you may have to disable this option in kernel hacking if it is enabled. I'm investigating why this is, but it's a harmless warning.

EDIT: Uniprocessor builds will need the following fix on top:
http://ck.kolivas.org/patches/bfs/test/bfs410-upfix.patch

EDIT2: The following fix is required for interactivity issues with the patch
http://ck.kolivas.org/patches/bfs/test/bfs410-offsetfix.patch

21 comments:

  1. It seem taht the "sticky_task" variable is only used for SMP system. In my UP system, I need the following patch:
    --- /tmp/sched_bfs.c 2011-09-26 10:28:11.471160242 +0800
    +++ kernel/sched_bfs.c 2011-09-26 10:26:51.025936067 +0800
    @@ -2960,7 +2960,11 @@
    static inline struct
    task_struct *earliest_deadline_task(struct rq *rq, struct task_struct *idle)
    {
    +#ifdef CONFIG_SMP
    struct task_struct *edt = idle, *rqs = rq->sticky_task;
    +#else
    + struct task_struct *edt = idle;
    +#endif
    skiplist_node *node = grq.node;
    int cpu = cpu_of(rq);

    @@ -2981,7 +2985,7 @@
    edt = slp;
    break;
    }
    -
    +#ifdef CONFIG_SMP
    if (edt->prio >= SCHED_NORMAL && rqs) {
    /* Check the sticky task for this CPU isn't a better choice
    * than the task returned by the skiplist since the sticky
    @@ -2990,6 +2994,7 @@
    rqs->deadline - longest_deadline_diff() < edt->deadline)
    edt = rqs;
    }
    +#endif
    if (edt != idle)
    take_task(rq, edt);
    return edt;

    ReplyDelete
  2. @Con, I am not a programer. This work aimes to get your scheduler BFS skale up to very large machines?

    Didn't you once say something like: It is more realistic to think of a scheduler not be able to scale to all use cases (like CFS pretends to do). With this work you try to falsify this your assumption?

    ReplyDelete
  3. That's a very good question. In fact this is not about scaling BFS up to large machines. Believe it or not, the machine that will benefit the most from this change is the one with the lowest CPU count! The reason is that load is relatively higher the less CPUs you have. Despite the fact that the benchmark is aimed at showing the heavily loaded "make -j" case, this is simply a way of quantifying it somehow. In my experimentation, even if load is often low at around ~1, you can *easily* get bursts of load up to 10-15 it's just that they're so short lived they never show up in your load average.

    ReplyDelete
  4. @Anonymous: Thanks for pointing that out. I've posted a UP fix in the main post which will optimise out anything to do with sticky tasks.

    ReplyDelete
  5. Hi! Any chances for patch to 2.6.39? :-)

    ReplyDelete
  6. CK - I ran the following benchmark on a dual Xeon system: 8 physical cores and 8 hyperthreaded cores = 16 total. My benchmark was running a make -j16 bzImage modules on linux v3.0.4 source code using the Arch Linux .config file. The boxplot shows compile time and the distributions for each of the 10 runs per kernel. As usual, your shit out preforms mainline (identified as "stock" in my plots). On this machine, there isn't a statistically significant difference between bfs v0.406 and bfs v0.410 :(

    When I get some CPU time on my home system (X3360, 4 physical cores and 0 hyperthreaded cores), I will repeat run the experiment on it and see if there is a bigger difference.

    http://s2.postimage.org/9rfmu4s9f/linux.png

    ReplyDelete
  7. Serious regression to report with v0.410 using the ondemand governor on my quad core workstation: when I tried to load virtualbox, it spit back a bunch of errors related to my vdi file. After I forcefully quit vbox, nothing worked properly. For example, grub-mkconfig -o /boot/grub/grub.cfg just froze up. After rebooting into my 3.0.4 kernel with bfs v0.406 everything was fine. I repeated with v0.410 and again, vbox errors :(

    ReplyDelete
  8. Interesting. Thanks for testing. No regression is probably the most important part of this code, and even if the performance is the same at -j16 on 16x, that's fine by me as it will be hard to overload a machine like that in normal use! The vdi regression concerns me more. What were the errors exactly? Was there anything in dmesg/syslog?

    ReplyDelete
  9. Yes! Here is everything in /var/log/everything.log before the first problem:

    http://pastebin.com/LnkfEpLp

    On your quad core, can you try some stuff with ondemand enabled? If so, what happens if you try to use some virtual machines?

    ReplyDelete
  10. Works fine here, but it's kvm I use for virtualisation... I'll see about other vms

    ReplyDelete
  11. CK - well, this time I nuked my vbox modules and recompiled them after booting into my kernel with bfsv0.410 and no errors. Perhaps that was the problem before? I NEVER needed to rebuild modules like this before. Only when a two point kernel is released...

    So, is this a one-off occurrence, the results of solar flares (you read that article?) or what...

    dmesg:
    [14701.248221] vboxdrv: Found 4 processor cores.
    [14701.248300] VBoxDrv: dbg - g_abExecMemory=ffffffffa113eb20
    [14701.248345] vboxdrv: fAsync=0 offMin=0x4e2 offMax=0x1fc7
    [14701.248404] vboxdrv: TSC mode is 'synchronous', kernel timer mode is 'normal'.
    [14701.248408] vboxdrv: Successfully loaded version 4.1.2_OSE (interface 0x00190000).
    [14701.272574] warning: `VirtualBox' uses 32-bit capabilities (legacy support in use)
    [14718.604856] device eth0 entered promiscuous mode

    In other news, I am planning to test v0.406 head-to-head with v0.410 now and will post the results soon.

    BTW, how can I post an image or hyperlink to this blog?

    ReplyDelete
  12. Uh yeah modules often need to be rebuilt with such a dramatic change as the task struct in the cpu scheduler. Pretty sure it just accepts html in the comments section.

    ReplyDelete
  13. This comment has been removed by the author.

    ReplyDelete
  14. CK - I am experiencing a serious lack of responsiveness when I compile on my workstation and browse the web or use any GUI app. The mouse does not scroll smoothly nor does my cursor move smoothly while typing. As soon as I stop the compilation, desktop responsiveness returns to normal.

    Thoughts?

    Link

    ReplyDelete
  15. Ck, I have exactly the some problem of Graysky, you can find my.config in your mail

    ReplyDelete
  16. Ah yes thanks, you'll all need the offset patch. I double compensated for the deadline:
    bfs410-offsetfix.patch

    ReplyDelete
  17. CK - Can you explain why we need the offset patch? Should this be included with v0.410 for general use for for our specific cases?

    ReplyDelete
  18. It addresses a bug in the deadline calculation. It's essential. I'm trying to accumulate fixes before the next test release as this is a major change you can imagine. There is also a suspend/resume regression (like there always is with scheduler changes sigh).

    ReplyDelete
  19. I just wanna thank you for sharing your information and your site or blog this is simple but nice article I've ever seen i like it i learn something today.

    ReplyDelete
  20. It works!!!With offset patch it works very well!Very ffluid Full screen flash videos during compilation without any problem!!
    Thank you ck for your hard work in this fantastic project!!!!

    ReplyDelete
  21. And thank you very much for your patience and testing!

    ReplyDelete