Thursday 22 March 2012

BFS 420 AKA smoking for linux 3.3

Here's the release candidate for the first stable release of BFS for linux 3.3. BFS hasn't received much attention lately, so I felt a little guilty and worked on a few things that had been nagging me:

3.3-sched-bfs-420.patch

It is not clear whether the bug I posted about here is still present; given that it is extremely rare and hard to reproduce, it is impossible to know for sure. A number of minor issues were addressed in the code which hopefully have fixed it (and hardly anyone was affected anyway).

The changes since BFS version 0.416 include a fairly large architectural change just to bring the codebase in sync with 3.3, but none of the changes should be noticeable in any way. One change that may be user-visible is that high resolution IRQ accounting now appears to be on by default on x86 architectures. BFS's system time accounting is wrong without this feature enabled, so this should correct that problem.
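For the curious, the way this feature works is that a fine grained clock is read at interrupt entry and exit and the delta is credited to IRQ time, instead of the old tick based sampling charging whole ticks to whichever task happened to be interrupted. A toy userspace sketch of just that idea, with made up names rather than the kernel's real accounting API:

    /* Toy sketch of high resolution IRQ accounting: sample a fine
     * grained clock around the handler and credit the delta to IRQ
     * time. Illustrative names only, not the kernel's real API. */
    #include <stdio.h>
    #include <time.h>

    static unsigned long long irq_time_ns;  /* total time spent in IRQs */

    static unsigned long long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    static void fake_irq_handler(void)
    {
        unsigned long long entry = now_ns();

        /* ... handler work would happen here ... */

        /* Credited to IRQ time, not to the interrupted task's system
         * time, which is what keeps the accounting honest. */
        irq_time_ns += now_ns() - entry;
    }

    int main(void)
    {
        fake_irq_handler();
        printf("irq time: %llu ns\n", irq_time_ns);
        return 0;
    }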

Other changes:
416-417: A number of ints were changed to bools which, though unlikely to have any performance impact, make the code cleaner, and the compiled code often comes out different. rq_running_iso was converted from a function to a macro to avoid a separate function call, with its attendant overhead, when compiled in. requeue_task within the scheduler tick was moved to being done under lock, which may prevent rare races. test_ret_isorefractory() was optimised. set_rq_task() was not being called on tasks being requeued within schedule(), which could possibly have led to issues if the task ran out of timeslice during that requeue and should have had its deadline offset. The need_resched() check that occurs at the end of schedule() was changed to unlikely(), since it really is that. Moved the scheduler version print function to bfs.c to avoid recompiling the entire kernel if the version number is changed.
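To illustrate a couple of those idioms, here is a standalone toy, not the patch itself, with the struct cut down to the bare minimum. The macro version expands in place with no call overhead, and unlikely() tells the compiler to lay out the branch for the common case:

    #include <stdbool.h>
    #include <stdio.h>

    #define unlikely(x)  __builtin_expect(!!(x), 0)

    struct rq {
        int iso_running;  /* toy stand-in for the real rq state */
    };

    /* As a function, every use potentially pays a call: */
    static bool rq_running_iso_fn(struct rq *rq)
    {
        return rq->iso_running > 0;
    }

    /* As a macro, the test is expanded in place: */
    #define rq_running_iso(rq)  ((rq)->iso_running > 0)

    int main(void)
    {
        struct rq rq = { .iso_running = 0 };

        /* Branch laid out for the common (false) case, as with the
         * need_resched() check at the end of schedule(): */
        if (unlikely(rq_running_iso(&rq)))
            puts("iso task running");
        else
            puts("no iso tasks running");

        (void)rq_running_iso_fn;  /* silence unused warning */
        return 0;
    }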

417-418: Fixed a problem with the accounting resync for linux 3.3.

418-419: There was a small possibility that an unnecessary resched would occur in try_preempt if a task had changed affinity and called try_preempt with its ->cpu still set to the old cpu it could no longer run on, so try_preempt was reworked slightly. Reintroduced the deadline offset based on CPU cache locality for sticky tasks, in a way that is cheaper than how we currently offset the deadline.
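Stripped of the real machinery, the idea of the locality offset is that a task should look less urgent to a CPU that would have to drag it onto a cold cache. A toy sketch with made up weights (the actual offsets in the patch are different; this is just the shape of it):

    #include <stdint.h>
    #include <stdio.h>

    enum cache_distance {
        LOCALITY_SAME_CORE,     /* SMT sibling, cache hot */
        LOCALITY_SHARED_CACHE,  /* same package, cache warm */
        LOCALITY_REMOTE         /* different package, cache cold */
    };

    /* Push the apparent deadline later the colder the cache is, so a
     * CPU with a warm cache for the task tends to win. Shift based
     * penalties keep it cheap in the hot path. */
    static uint64_t locality_deadline(uint64_t deadline, enum cache_distance d)
    {
        switch (d) {
        case LOCALITY_SAME_CORE:
            return deadline;
        case LOCALITY_SHARED_CACHE:
            return deadline + (deadline >> 4);
        case LOCALITY_REMOTE:
        default:
            return deadline + (deadline >> 2);
        }
    }

    int main(void)
    {
        uint64_t dl = 1 << 20;

        printf("same core: %llu, shared cache: %llu, remote: %llu\n",
               (unsigned long long)locality_deadline(dl, LOCALITY_SAME_CORE),
               (unsigned long long)locality_deadline(dl, LOCALITY_SHARED_CACHE),
               (unsigned long long)locality_deadline(dl, LOCALITY_REMOTE));
        return 0;
    }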

419-420: Finally rewrote the earliest_deadline_task code. This has long been one of the hottest code paths in the scheduler, and small changes here that made it look nice would often slow it down. I spent quite a few hours reworking it to use fewer gotos, disassembling the code as I went to make sure it was actually getting smaller with every change. Then I wrote a scheduler-specific version of find_next_bit which could be inlined into this code, avoiding another function call in the hot path. The overall behaviour is unchanged from previous BFS versions, but initial benchmarking shows slight improvements in throughput.
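To give a flavour of the find_next_bit part: BFS only has 103 priority levels, so the whole bitmap fits in a couple of machine words, and a simple word scan using a ctz builtin avoids the generic library call in the hot path. A standalone toy version (my illustration, not the code from the patch):

    #include <stdio.h>

    #define PRIO_LEVELS    103
    #define BITS_PER_LONG  (8 * (int)sizeof(unsigned long))
    #define BITMAP_LONGS   ((PRIO_LEVELS + BITS_PER_LONG - 1) / BITS_PER_LONG)

    /* Find the first set bit, i.e. the best occupied priority level.
     * Small enough to inline into the lookup loop. */
    static inline int first_set_prio(const unsigned long *bm)
    {
        for (int i = 0; i < BITMAP_LONGS; i++)
            if (bm[i])
                return i * BITS_PER_LONG + __builtin_ctzl(bm[i]);
        return PRIO_LEVELS;  /* nothing queued */
    }

    int main(void)
    {
        unsigned long bm[BITMAP_LONGS] = { 0 };

        bm[1] |= 1UL << 5;  /* mark one priority level occupied */
        printf("first busy prio: %d\n", first_set_prio(bm));
        return 0;
    }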

Now I'll leave it open for wider testing to confirm it's all good, and then I have to think about what I should do with the full -ck patchset. As I've said in numerous posts before, I'm no longer sure about the validity of some of the patches in the set given all the changes to the virtual memory subsystem in the mainline kernel.

20 comments:

  1. Just booted and am banging away on it, compiling nice -n 19 kernels, trying to reproduce your PR > 41 bug.

    Replies
    1. I was only 93% happy with your ANOVA plots. It's those 2 outliers that bothered me.

    2. LOL let's see if we can't fix those outliers in BFS 420.

    3. @anonymous - Two things.

      1) I see more outliers on the dual quad system (HT) than I do on my single quad (no HT).

      2) The outliers do not affect the overall. If I remove them, the results are still significant at the 0.05 level.

      surpressed outliers.png

      I will repeat with bfs v0.420 tomorrow and post here.

    4. No. I only suggest that outliers indicate that the scheduler stumbles once in a blue moon. It is not good. It is important to watch and understand them. You seem to suggest that they have something to do with scalability to a large number of CPUs. So, they may or may not be relevant to PR > 41.

  2. Applied bfs-420, without a schedtool command but niced. I have these when compiling chromium with -j2:

    PID   USER     PR  NI  VIRT   RES   SHR   S  %CPU  %MEM     TIME+  COMMAND
    29176 portage  16  19  21376  6004  1236  S  200.0  0.2  5124095h  as
    29155 portage  39  19  66004  37m   7088  R   12.0  1.0   0:00.60  cc1plus
    1445  root      1   0  146m   24m   5196  S    9.8  0.7   6:21.42  X
    29175 portage  40  19  48564  17m   4076  R    2.6  0.5   0:00.13  cc1plus
    1770  ral       4   0  375m   21m   9.9m  S    1.8  0.6   0:12.59  yakuake

    This bfs seems to be running fine!
    Ralph Ulrich

  3. OK... here is the standard make benchmark on 3.3.0 / 3.3.0+bfs v0.418 / 3.3.0+bfs v0.420, run a total of 9 times. There is a statistically significant difference between the two bfs patched kernels compared to the native kernel.

    ANOVA_quad.png

    The 'make benchmark' is compiling linux 3.3.0 via 'make -j4 bzImage' and timing the result, then repeating for a total of 9 runs. This is done via a bash script. The result is for my X3360 workstation (4 cores, no HT). I will repeat on the dual quad tomorrow and post again.

    GREAT JOB CK!

    P.S. I can now post a link to an image, but how can I post the image embedded in my reply? The blogspot server tells me I cannot use an img tag. If anyone knows, please post the html code for me.

    Replies
    1. Thanks again. There is a trend towards better throughput, but not a significant one. It will be interesting to see whether the larger machine shows more of a change.

  4. @graysky

    does this graphic mean that the better performance of bfs is significant?

    /me still wonders
    how a scheduler which delivers better latency to the desktop user is able to also perform better on throughput ...

    Ralph Ulrich

  5. @Ralph - It means that the difference in the median values of the 9 runs (461 ms) IS statistically significant for bfs v0.420 vs cfs, but the differences between v0.418 and v0.420 are not.

    To me this is an artificial endpoint, used only to see if the code has retained its efficiency with regard to this endpoint. This data set says nothing about interactivity. The fact that the two BFS kernels are no different from each other is good news. The fact that both are measurably better than CFS is also good news. What it means for real world computing is not the point of this experiment.

  6. OK. The standard make benchmark on 3.3.0/3.3.0+bfs v0.418/3.3.0+bfs v0.420 on two machines.

    Quad core no HT.png

    There is a statistically significant difference between the two bfs patched kernels compared to the cfs in mainline, but the two bfs kernels do not differentiate themselves from each other with n=9 on this machine.

    NEW DATA ON A DUAL QUAD WITH HT:

    Dual Quad with HT.png

    There is a statistically significant difference between both bfs patched kernels compared to the cfs in mainline, AND, for the first time, a difference between the two bfs versions, with v0.420 differentiating itself from v0.418!

    Again, great job, CK. It seems as though the refinements to the code you outlined scale very well! I don't understand the lack of differentiation on the quad machine; I only ~doubled the n value as you can see (8 vs 19). Perhaps I'll power the statistics with a higher n value on the quad and see if there is a differentiation between the two bfs kernels...

    The 'make benchmark' is compiling linux 3.3.0 via 'make -jn bzImage' and timing the result. The "n" corresponds to the number of cores on the box: in this case, n=4 for the quad and n=16 for the dual quad. This is done via a bash script.
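    In C, the gist of the script is just this (my actual harness is bash; the clean step between runs and running it from a configured kernel tree are details I'm glossing over here):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int main(void)
        {
            /* Run from a configured kernel source tree; -j4 here,
             * substitute -jn for your core count. */
            for (int run = 1; run <= 9; run++) {
                struct timespec t0, t1;

                if (system("make clean >/dev/null 2>&1"))
                    return 1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (system("make -j4 bzImage >/dev/null 2>&1"))
                    return 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("run %d: %.1f s\n", run,
                       (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
            }
            return 0;
        }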

    Replies
    1. Excellent, thanks again. The changes will be bigger on machines with more cores as they are scalability improvements. I suggested that earlier when I said there was a trend, but not a statistically significant difference, on the quad core, and that I expected it would be greater on the 16x machine.

  7. Anything I can do about this compile failure manually? It's a non-SMP setup, btw.

    CC arch/x86/kernel/init_task.o
    arch/x86/kernel/init_task.c:31:8: error: unknown field ‘sticky’ specified in initializer
    arch/x86/kernel/init_task.c:31:8: warning: missing braces around initializer [-Wmissing-braces]
    arch/x86/kernel/init_task.c:31:8: warning: (near initialization for ‘init_task.cpus_allowed’) [-Wmissing-braces]
    make[2]: *** [arch/x86/kernel/init_task.o] Error 1
    make[1]: *** [arch/x86/kernel] Error 2
    make: *** [arch/x86] Error 2

    I would be glad if someone could point me towards how / where to fix it, or help me determine whether it's caused by the openSUSE kernel flavour.

    Many thanks in advance!
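    From the error it looks like the 'sticky' member presumably only exists in SMP builds, while the initializer in init_task.c references it unconditionally, which breaks UP configs. A toy illustration of that shape and the obvious guard (the names mirror the error, but this is not the actual patch; it compiles as a standalone file with or without -DCONFIG_SMP):

        #include <stdbool.h>

        /* A field that only exists under CONFIG_SMP... */
        struct demo_task_struct {
            int prio;
        #ifdef CONFIG_SMP
            bool sticky;  /* not defined in UP builds */
        #endif
        };

        /* ...breaks any unconditional initializer, exactly like the
         * error above. Guarding (or simply deleting) the initializer
         * and relying on zero-initialization fixes the UP build: */
        #ifdef CONFIG_SMP
        #define DEMO_INIT_TASK  { .prio = 0, .sticky = false }
        #else
        #define DEMO_INIT_TASK  { .prio = 0 }
        #endif

        struct demo_task_struct demo_init_task = DEMO_INIT_TASK;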

    Replies
    1. Damn! Many thanks!!!

      (It's compiling beyond that point now. That it was so simple to fix .oO I was just clueless.)

  8. CK - is the fix_init_task.patch going to get incorporated into the stable release or is this just a one-off?

    Replies
    1. The code being deleted by that patch was only cosmetic anyway, so the fix will be part of the stable release. Otherwise, BFS is pretty much complete for 3.3. Coincidentally, I'm working on a cut-down -ck at this moment, removing a few of the more invasive, unproven patches.

  9. I do not buy many of Mike Galbraith's arguments. For example, who cares that the top snapshot shows some fairness imbalance at a given moment, as long as its fluctuations are not too large or too long?

    It is, however, alarming that heterogeneous load is problematic for BFS. It does not look like the homogeneous make -j captures that. A mixture of make -j and x264 encoding would be an interesting case. Please comment.
