Thursday 22 March 2012

BFS 420 AKA smoking for linux 3.3

Here's the release candidate for the first stable release of BFS for linux 3.3. BFS hasn't received much attention lately, so I felt a little guilty and worked on a few things that had been nagging me:

3.3-sched-bfs-420.patch

It is not clear whether the bug I posted about here is still present; given that it is extremely rare and hard to reproduce, it is impossible to know for sure. A number of minor issues were addressed in the code which hopefully have fixed it (and hardly anyone was affected anyway).

The changes since BFS version 0.416 include a fairly large architectural change just to bring the codebase in sync with 3.3, but none of the changes should be noticeable in any way. One change that may be user-visible is that high resolution IRQ accounting now appears to be on by default on x86 architectures. BFS's system time accounting is wrong without this feature enabled, so this should correct that problem.
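For the curious, the way this feature works is that a fine grained clock is read at interrupt entry and exit and the delta is credited to IRQ time, instead of the old tick based sampling charging whole ticks to whichever task happened to be interrupted. A toy userspace sketch of just that idea, with made up names rather than the kernel's real accounting API:

    /* Toy sketch of high resolution IRQ accounting: sample a fine
     * grained clock around the handler and credit the delta to IRQ
     * time. Illustrative names only, not the kernel's real API. */
    #include <stdio.h>
    #include <time.h>

    static unsigned long long irq_time_ns;  /* total time spent in IRQs */

    static unsigned long long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    static void fake_irq_handler(void)
    {
        unsigned long long entry = now_ns();

        /* ... handler work would happen here ... */

        /* Credited to IRQ time, not to the interrupted task's system
         * time, which is what keeps the accounting honest. */
        irq_time_ns += now_ns() - entry;
    }

    int main(void)
    {
        fake_irq_handler();
        printf("irq time: %llu ns\n", irq_time_ns);
        return 0;
    }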

Other changes:
416-417: A number of ints were changed to bools which, though unlikely to have any performance impact, make the code cleaner, and the compiled code often comes out different. rq_running_iso was converted from a function to a macro to avoid a separate function call, with its attendant overhead, when compiled in. requeue_task within the scheduler tick was moved to being done under lock, which may prevent rare races. test_ret_isorefractory() was optimised. set_rq_task() was not being called on tasks being requeued within schedule(), which could possibly have led to issues if the task ran out of timeslice during that requeue and should have had its deadline offset. The need_resched() check that occurs at the end of schedule() was changed to unlikely(), since it really is that. Moved the scheduler version print function to bfs.c to avoid recompiling the entire kernel if the version number is changed.
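To illustrate a couple of those idioms, here is a standalone toy, not the patch itself, with the struct cut down to the bare minimum. The macro version expands in place with no call overhead, and unlikely() tells the compiler to lay out the branch for the common case:

    #include <stdbool.h>
    #include <stdio.h>

    #define unlikely(x)  __builtin_expect(!!(x), 0)

    struct rq {
        int iso_running;  /* toy stand-in for the real rq state */
    };

    /* As a function, every use potentially pays a call: */
    static bool rq_running_iso_fn(struct rq *rq)
    {
        return rq->iso_running > 0;
    }

    /* As a macro, the test is expanded in place: */
    #define rq_running_iso(rq)  ((rq)->iso_running > 0)

    int main(void)
    {
        struct rq rq = { .iso_running = 0 };

        /* Branch laid out for the common (false) case, as with the
         * need_resched() check at the end of schedule(): */
        if (unlikely(rq_running_iso(&rq)))
            puts("iso task running");
        else
            puts("no iso tasks running");

        (void)rq_running_iso_fn;  /* silence unused warning */
        return 0;
    }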

417-418: Fixed a problem with the accounting resync for linux 3.3.

418-419: There was a small possibility that an unnecessary resched would occur in try_preempt if a task had changed affinity and called try_preempt with its ->cpu still set to the old cpu it could no longer run on, so try_preempt was reworked slightly. Reintroduced the deadline offset based on CPU cache locality for sticky tasks, in a way that is cheaper than how we currently offset the deadline.
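Stripped of the real machinery, the idea of the locality offset is that a task should look less urgent to a CPU that would have to drag it onto a cold cache. A toy sketch with made up weights (the actual offsets in the patch are different; this is just the shape of it):

    #include <stdint.h>
    #include <stdio.h>

    enum cache_distance {
        LOCALITY_SAME_CORE,     /* SMT sibling, cache hot */
        LOCALITY_SHARED_CACHE,  /* same package, cache warm */
        LOCALITY_REMOTE         /* different package, cache cold */
    };

    /* Push the apparent deadline later the colder the cache is, so a
     * CPU with a warm cache for the task tends to win. Shift based
     * penalties keep it cheap in the hot path. */
    static uint64_t locality_deadline(uint64_t deadline, enum cache_distance d)
    {
        switch (d) {
        case LOCALITY_SAME_CORE:
            return deadline;
        case LOCALITY_SHARED_CACHE:
            return deadline + (deadline >> 4);
        case LOCALITY_REMOTE:
        default:
            return deadline + (deadline >> 2);
        }
    }

    int main(void)
    {
        uint64_t dl = 1 << 20;

        printf("same core: %llu, shared cache: %llu, remote: %llu\n",
               (unsigned long long)locality_deadline(dl, LOCALITY_SAME_CORE),
               (unsigned long long)locality_deadline(dl, LOCALITY_SHARED_CACHE),
               (unsigned long long)locality_deadline(dl, LOCALITY_REMOTE));
        return 0;
    }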

419-420: Finally rewrote the earliest_deadline_task code. This has long been one of the hottest code paths in the scheduler, and small changes here that made it look nice would often slow it down. I spent quite a few hours reworking it to use fewer gotos, disassembling the code as I went to make sure it was actually getting smaller with every change. Then I wrote a scheduler-specific version of find_next_bit which could be inlined into this code, avoiding another function call in the hot path. The overall behaviour is unchanged from previous BFS versions, but initial benchmarking shows slight improvements in throughput.
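To give a flavour of the find_next_bit part: BFS only has 103 priority levels, so the whole bitmap fits in a couple of machine words, and a simple word scan using a ctz builtin avoids the generic library call in the hot path. A standalone toy version (my illustration, not the code from the patch):

    #include <stdio.h>

    #define PRIO_LEVELS    103
    #define BITS_PER_LONG  (8 * (int)sizeof(unsigned long))
    #define BITMAP_LONGS   ((PRIO_LEVELS + BITS_PER_LONG - 1) / BITS_PER_LONG)

    /* Find the first set bit, i.e. the best occupied priority level.
     * Small enough to inline into the lookup loop. */
    static inline int first_set_prio(const unsigned long *bm)
    {
        for (int i = 0; i < BITMAP_LONGS; i++)
            if (bm[i])
                return i * BITS_PER_LONG + __builtin_ctzl(bm[i]);
        return PRIO_LEVELS;  /* nothing queued */
    }

    int main(void)
    {
        unsigned long bm[BITMAP_LONGS] = { 0 };

        bm[1] |= 1UL << 5;  /* mark one priority level occupied */
        printf("first busy prio: %d\n", first_set_prio(bm));
        return 0;
    }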

Now I'll leave it open for wider testing to confirm it's all good, and then I have to think about what I should do with the full -ck patchset. As I've said in numerous posts before, I'm no longer sure about the validity of some of the patches in the set given all the changes to the virtual memory subsystem in the mainline kernel.

20 comments:

  1. Just booted and am banging away on it, compiling nice -n 19 kernels, trying to reproduce your PR > 41 bug.

    Replies
    1. I was only 93% happy with your ANOVA plots. It's those 2 outliers that bothered me.

    2. LOL let's see if we can't fix those outliers in BFS 420.

    3. @anonymous - Two things.

      1) I see more outliers on the dual quad system (HT) than I do on my single quad (no HT).

      2) The outliers do not affect the overall. If I remove them, the results are still significant at the 0.05 level.

      surpressed outliers.png

      I will repeat with bfs v0.420 tomorrow and post here.

    4. No. I only suggest that outliers indicate that the scheduler stumbles once in a blue moon. It is not good. It is important to watch and understand them. You seem to suggest that they have something to do with scalability to a large number of CPUs. So, they may or may not be relevant to PR > 41.

  2. Applied bfs-420, without a schedtool command but niced. I have these when compiling chromium with -j2:

    PID   USER     PR  NI  VIRT   RES   SHR   S  %CPU  %MEM     TIME+  COMMAND
    29176 portage  16  19  21376  6004  1236  S  200.0  0.2  5124095h  as
    29155 portage  39  19  66004  37m   7088  R   12.0  1.0   0:00.60  cc1plus
    1445  root      1   0  146m   24m   5196  S    9.8  0.7   6:21.42  X
    29175 portage  40  19  48564  17m   4076  R    2.6  0.5   0:00.13  cc1plus
    1770  ral       4   0  375m   21m   9.9m  S    1.8  0.6   0:12.59  yakuake

    This bfs seems to be running fine!
    Ralph Ulrich

  3. OK... here is the standard make benchmark on 3.3.0 / 3.3.0+bfs v0.418 / 3.3.0+bfs v0.420, run a total of 9 times. There is a statistically significant difference between the two bfs patched kernels compared to the native kernel.

    ANOVA_quad.png

    The 'make benchmark' is compiling linux 3.3.0 via 'make -j4 bzImage' and timing the result, then repeating for a total of 9 runs. This is done via a bash script. The result is for my X3360 workstation (4 cores, no HT). I will repeat on the dual quad tomorrow and post again.

    GREAT JOB CK!

    P.S. I can now post a link to an image, but how can I post the image embedded in my reply? The blogspot server tells me I cannot use an img tag. If anyone knows, please post the html code for me.

    Replies
    1. Thanks again. There is a trend towards better throughput, but not a significant one. It will be interesting to see whether the larger machine shows more of a change.

  4. @graysky

    does this graphic mean that the better performance of bfs is significant?

    /me still wonders
    how a scheduler which delivers better latency to the desktop user is able to also perform better on throughput ...

    Ralph Ulrich

  5. @Ralph - It means that the difference in the median values of the 9 runs (461 ms) IS statistically significant for bfs v0.420 vs cfs, but the differences between v0.418 and v0.420 are not.

    To me this is an artificial endpoint, used only to see if the code has retained its efficiency with regard to this endpoint. This data set says nothing about interactivity. The fact that the two BFS kernels are no different from each other is good news. The fact that both are measurably better than CFS is also good news. What it means for real world computing is not the point of this experiment.

  6. OK. The standard make benchmark on 3.3.0/3.3.0+bfs v0.418/3.3.0+bfs v0.420 on two machines.

    Quad core no HT.png

    There is a statistically significant difference between the two bfs patched kernels compared to the cfs in mainline, but the two bfs kernels do not differentiate themselves from each other with n=9 on this machine.

    NEW DATA ON A DUAL QUAD WITH HT:

    Dual Quad with HT.png

    There is a statistically significant difference between both bfs patched kernels compared to the cfs in mainline, AND, for the first time, a difference between the two bfs versions, with v0.420 differentiating itself from v0.418!

    Again, great job, CK. It seems as though the refinements to the code you outlined scale very well! I don't understand the lack of differentiation on the quad machine; I only ~doubled the n value as you can see (8 vs 19). Perhaps I'll power the statistics with a higher n value on the quad and see if there is a differentiation between the two bfs kernels...

    The 'make benchmark' is compiling linux 3.3.0 via 'make -jn bzImage' and timing the result. The "n" corresponds to the number of cores on the box: in this case, n=4 for the quad and n=16 for the dual quad. This is done via a bash script.
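    In C, the gist of the script is just this (my actual harness is bash; the clean step between runs and running it from a configured kernel tree are details I'm glossing over here):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int main(void)
        {
            /* Run from a configured kernel source tree; -j4 here,
             * substitute -jn for your core count. */
            for (int run = 1; run <= 9; run++) {
                struct timespec t0, t1;

                if (system("make clean >/dev/null 2>&1"))
                    return 1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (system("make -j4 bzImage >/dev/null 2>&1"))
                    return 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("run %d: %.1f s\n", run,
                       (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
            }
            return 0;
        }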

    Replies
    1. Excellent, thanks again. The changes will be bigger on machines with more cores as they are scalability improvements. I suggested that earlier when I said there was a trend, but not a statistically significant difference, on the quad core, and that I expected it would be greater on the 16x machine.

  7. Anything I can do about this compile failure manually? It's a non-SMP setup, btw.

    CC arch/x86/kernel/init_task.o
    arch/x86/kernel/init_task.c:31:8: error: unknown field ‘sticky’ specified in initializer
    arch/x86/kernel/init_task.c:31:8: warning: missing braces around initializer [-Wmissing-braces]
    arch/x86/kernel/init_task.c:31:8: warning: (near initialization for ‘init_task.cpus_allowed’) [-Wmissing-braces]
    make[2]: *** [arch/x86/kernel/init_task.o] Error 1
    make[1]: *** [arch/x86/kernel] Error 2
    make: *** [arch/x86] Error 2

    I would be glad if someone could point me towards how / where to fix it, or help me determine whether it's caused by the openSUSE kernel flavour.

    Many thanks in advance!
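    From the error it looks like the 'sticky' member presumably only exists in SMP builds, while the initializer in init_task.c references it unconditionally, which breaks UP configs. A toy illustration of that shape and the obvious guard (the names mirror the error, but this is not the actual patch; it compiles as a standalone file with or without -DCONFIG_SMP):

        #include <stdbool.h>

        /* A field that only exists under CONFIG_SMP... */
        struct demo_task_struct {
            int prio;
        #ifdef CONFIG_SMP
            bool sticky;  /* not defined in UP builds */
        #endif
        };

        /* ...breaks any unconditional initializer, exactly like the
         * error above. Guarding (or simply deleting) the initializer
         * and relying on zero-initialization fixes the UP build: */
        #ifdef CONFIG_SMP
        #define DEMO_INIT_TASK  { .prio = 0, .sticky = false }
        #else
        #define DEMO_INIT_TASK  { .prio = 0 }
        #endif

        struct demo_task_struct demo_init_task = DEMO_INIT_TASK;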

    Replies
    1. Damn! Many thanks!!!

      (It's compiling beyond that point now. That it was so simple to fix .oO I was just clueless.)

  8. CK - is the fix_init_task.patch going to get incorporated into the stable release or is this just a one-off?

    Replies
    1. The code being deleted by that patch was only cosmetic anyway, so the fix will be part of the stable release. Otherwise, BFS is pretty much complete for 3.3. Coincidentally, I'm working on a cut-down -ck at this moment, removing a few of the more invasive, unproven patches.

  9. I do not buy many of Mike Galbraith's arguments. For example, who cares that the top snapshot shows some fairness imbalance at a given moment, as long as its fluctuations are not too large or too long?

    It is, however, alarming that heterogeneous load is problematic for BFS. It does not look like the homogeneous make -j captures that. A mixture of make -j and x264 encoding would be an interesting case. Please comment.
