
Thursday, 5 January 2012

BFS 0.416 for 3.2.0

Well 3.2.0 is finally out.

I've done a quick and dirty port of BFS from 3.1 with mostly trivial changes and a minor change to idle CPU selection (it will now choose the old CPU first, even if the other "thread" is busy on hyperthreaded CPUs). For the most part the changes are trivial ones made only to stay in sync with the new scheduler changes in mainline, so it should be a very safe upgrade. Nonetheless, I'm putting this up for testing here because my Ubuntu laptop seems unhappy starting X, though that seems 3.2-related rather than BFS-related. I've since moved my home desktop to Arch Linux and have been very happy on that distro.

So anyway here's BFS for 3.2.0 for those who want things hot off the press:

3.2-sched-bfs-416.patch

The -ck patch will not be far behind assuming the first few testers report no problems with this BFS patch. To be honest, the Virtual Memory subsystem has changed so much that I'm having trouble predicting how it will behave now and have no way of confirming the VM patches that go into -ck are even helpful any more, so I'm not even sure if I should continue to support them in light of that.

Besides, I've been too busy playing Zelda: Skyward Sword to really care much about anything code related. I've loved every 3D rendition of Zelda since Ocarina of Time, but this latest incarnation is the most amazing game ever...

Friday, 11 November 2011

3.1.0-ck2, BFS 0.415

Blah blah fixes blah blah rollbacks, anyway here's a new -ck and bfs 415. Nothing new here, just bugfixes.

3.1.0-ck2

3.1-sched-bfs-415.patch


Full -ck patchlist:

3.1-sched-bfs-414.patch
sched-add-above-background-load-function.patch
mm-minimal_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
bfs414-noprefetch.patch
bfs414-remove_rqpreempt.patch
bfs414-correct_int_size.patch
bfs414-stickybool.patch
bfs414-dont_use_task.patch
bfs414-clear_deactivated_sticky.patch
bfs414-fix-nonbfs-build.patch
bfs414-415version.patch
ck2-version.patch 
 
 
BFS 415 is now the following patches from -ck2 above:
 
3.1-sched-bfs-414.patch
sched-add-above-background-load-function.patch
cpufreq-bfs_tweaks.patch
bfs414-noprefetch.patch
bfs414-remove_rqpreempt.patch
bfs414-correct_int_size.patch
bfs414-stickybool.patch
bfs414-dont_use_task.patch
bfs414-clear_deactivated_sticky.patch
bfs414-fix-nonbfs-build.patch
bfs414-415version.patch 
 
Enjoy!

Thursday, 3 November 2011

3.1.0-ck1, BFS 0.414

I finally found enough time to sync up BFS with the latest 3.1.0 Linux kernel. In keeping with the changes needed to support new scheduler features in linux-3.1, and the general code churn that happens on every release, the version number is bumped from the last 0.413 release, but from the user perspective it should behave identically. Some pending build fixes and one bug involving displayed hard interrupt usage were also fixed. Overall most users should notice no significant change.

Apply to linux-3.1.x:
3.1-sched-bfs-414.patch

Closely following the sync-up with the latest kernel is an updated -ck kernel, which is almost identical apart from relaxing the swappiness from 0 to 10. Some very low RAM users reported that at 0 the virtual memory subsystem tried so hard not to swap anything that they experienced stalls while the VM twiddled its thumbs, so I've relaxed the value to 10, which avoids these stalls and should still be quite aggressive at avoiding swapping.

Apart from that the -ck patch is a resync from the last 3.0.0-ck1 release:

Directory containing combined ck1 patch and separate patches:
3.1.0-ck1

With the recent kernel.org hacking it has become a lot harder to get a kernel.org account and as I would have great difficulty getting a gpg key personally signed by any other kernel developer in my area it could be a while before I can set up an account there again. So I've decided to just use my own storage once more instead of putting it on kernel.org. Unfortunately I don't have quite a few of the older -ck kernels since I never bothered to upload them there. I foolishly thought that kernel.org storage would be permanent cloud storage for them so I don't even have a copy of many of them. If someone has archived the 2.6 -ck patches, or has an old mirror of kernel.org please forward them on to me.

Enjoy!

EDIT: A few bugs are showing up in this latest BFS release, and I'm currently accumulating fixes for them and putting them in the directory listed above for ck1. There will be a ck2 soon to follow with all the fixes.

Monday, 17 October 2011

BFS 0.413, BFS benchmarks, (and more cksort stuff)

Finally some BFS news, yay \o/

TL;DR: Minor unnoticeable tweaks for BFS.

Incremental:
3.0-bfs-406-413.patch

Incremental for 3.0.x-ck1:
3.0-ck1-bfs-406-413.patch

Full patch for 3.0.x:
3.0-sched-bfs-413.patch

Alas there isn't anything earth shattering here. The skiplist implementation experiment failed to be of any benefit despite my best intentions, so I've abandoned the idea for the time being. Meanwhile, there were a number of minor improvements that I had considered for BFS (and I mean really really minor) and a few cleanups that had been waiting. So here's the changelog for the new BFS for 3.0:


Improve the preempt call by explicitly setting a task that will be used on
lookup in earliest_deadline_task to avoid iterating over all tasks. Set the
effective runqueue priority for further calls to try_preempt based on the
preempting task's priority. Thus on the small chance that something else
tries to preempt the runqueue before the new tasks gets the CPU, it will
compare to the previously successful preempting task. This should avoid a
preempting task from rescheduling a runqueue and then the CPU deciding to
take a different task instead of the preempting one.

Clean up a number of variables from unnecessary longs to ints and ints to
bools.

Microoptimise around try_preempt by setting highest_prio to that of the
task that's trying to preempt, to avoid unnecessary comparisons.

Break sole affinity on cpu offline -after- the cpu has been set to be offline. 

I'd be surprised if anyone noticed anything different with these changes, but then, there have been some surprises from apparently innocuous changes in the past, so I'm keen to get feedback from testers. There is a small chance that the preempt changes can improve throughput by avoiding unnecessary rescheduling, and that the cpu offline changes might help rare scenarios where a machine will fail to suspend to ram.

Meanwhile, graysky who has been maintaining the unofficial ck repos for archlinux has very kindly put together a number of throughput benchmarks comparing BFS to mainline on machines from 2x up to 16x in a very nicely put together PDF which you can download and read here:
 
repo-ck.com/bench/benchmark.pdf


On the cksort front, it's clear to me that sorting algorithms are pretty boring to most people, despite me finding transient interest in them and developing my own. Nevertheless, the cksort algorithm has been further improved. After playing with it some more, and learning from the legendary work of the past, I modified the merge sort that cksort "tips" to so that it includes an insertion sort as well, making the algorithm a blended insertion+merge sort once it tips away from the presort+nodemerge algorithm, which is basically the only unique part of cksort. This capitalises on the fact that insertion sort is pretty much the fastest way to sort up to about 32 entries: the input is split into groups of up to 32 entries, each group is insertion sorted, and then the groups are combined with a non-recursive merge.

This has further improved cksort: it is now faster than qsort() for up to 1500 entries, and equally fast as the library qsort() up to 10000 (uniquely random) entries. Beyond that it is still slower than qsort(), but it has -no- corner cases where it will ever become O(n^2) and will always be O(n log n) bound, unlike qsort().
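The blended scheme above can be sketched in plain C. This is an illustrative toy, not the actual cksort code (which also has the presort+nodemerge front end): insertion sort each run of up to 32 entries, then combine the runs with a non-recursive, bottom-up merge.

```c
#include <stdlib.h>
#include <string.h>

#define RUN 32  /* insertion sort is fastest for runs up to about this size */

/* Insertion sort on a[lo..hi] inclusive. */
static void insertion_sort(int *a, int lo, int hi)
{
	for (int i = lo + 1; i <= hi; i++) {
		int v = a[i], j = i - 1;

		while (j >= lo && a[j] > v) {
			a[j + 1] = a[j];
			j--;
		}
		a[j + 1] = v;
	}
}

/* Merge the sorted halves a[lo..mid] and a[mid+1..hi] via scratch buffer tmp. */
static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
	int i = lo, j = mid + 1, k = lo;

	while (i <= mid && j <= hi)
		tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
	while (i <= mid)
		tmp[k++] = a[i++];
	while (j <= hi)
		tmp[k++] = a[j++];
	memcpy(a + lo, tmp + lo, (hi - lo + 1) * sizeof(int));
}

/* Insertion sort RUN-sized groups, then merge them bottom-up, non-recursively. */
void blended_sort(int *a, int n)
{
	int *tmp = malloc(n * sizeof(int));

	if (!tmp)
		return;
	for (int lo = 0; lo < n; lo += RUN) {
		int hi = lo + RUN - 1;

		if (hi > n - 1)
			hi = n - 1;
		insertion_sort(a, lo, hi);
	}
	for (int width = RUN; width < n; width *= 2) {
		for (int lo = 0; lo + width < n; lo += 2 * width) {
			int mid = lo + width - 1;
			int hi = lo + 2 * width - 1;

			if (hi > n - 1)
				hi = n - 1;
			merge(a, tmp, lo, mid, hi);
		}
	}
	free(tmp);
}
```

Like any mergesort, this stays O(n log n) regardless of input ordering, which is where the "no O(n^2) corner cases" guarantee comes from.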

Version 0.44 of cksort is here:
http://ck.kolivas.org/apps/cksort/cksort.c

And in response to requests for the benchmarking code vs qsort it is here:
http://ck.kolivas.org/apps/cksort/benchsort.c

Now, Aldo Cortesi has a very interesting static visualisation of comparison sort algorithms here:
http://sortvis.org/

After contacting him, he was kind enough to modify his code such that it would take output from something like the cksort to generate a graph from any sorting algorithm. Alas cksort uses a separate array for its work and looking at the array of the final data does not tell the entire story of what's going on, but it does help further to visualise how cksort works.

Here is a view of 20 random variables being sorted and you can see the 2 phases of cksort where it builds up the nodes first, and the 2nd phase where it merges the nodes (click to enlarge):


It's important to note that the 2nd stage graph is fairly arbitrary since the data from the unmerged nodes is not actually written back to the original array until it's merged, but you should be able to make out the final sorted data appearing from the outside in.

Sorting non-random entries is really what makes cksort interesting and here is a (rather complex) visualisation of the 50 mostly-sorted example in the animated form I showed on my blog previously:


Anyway it's been a lot of fun playing with this stuff, and maybe one day I'll come up with something more useful from the algorithm. I played with an in-place version of the code which didn't need a second array and the amount of memory moving required made the algorithm slower and more complex so I soon lost interest in developing that for the time being. I'd like to find a way to make the presort+node merging aspect scale to any size instead of tipping to a different algorithm but the lack of any pattern in large random data sets means you can't really find a pattern in it. I still think there may be something like a heap structure that would help it but haven't found anything that works in practice. Maybe I'll come back to this in the future. Meanwhile it works extremely well for the smaller data sets, partially ordered data, reverse ordered data, and few unique data sets. Interestingly it coincidentally shares a lot in its ideas with Timsort. All I can say after all this is that the work of the legends of the past, and coming up with something like Smoothsort is truly amazing.

Wednesday, 28 September 2011

BFS 0.411 test with skiplists

Thanks to those who've tested BFS 0.410 and found a few bugs. I've updated the test patch to 0.411 hopefully addressing all the regressions, and uploaded some incremental versions as well:

http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-411.patch

http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-411.patch

http://ck.kolivas.org/patches/bfs/test/3.0-bfs410-411.patch

http://ck.kolivas.org/patches/bfs/test/2.6.39-bfs406-411.patch

The bugs fixed in this version were the broken uniprocessor builds, the poor interactivity due to completely miscalculating the deadline, and the suspend/resume regression. Please keep on testing and hopefully I can declare this one a stable release!

EDIT: I have re-tested and confirmed as reported by others, that 411 is indeed slower than 410 and actually slower than the stable 406. This is really disappointing as 410 was only faster by virtue of a bug that would favour throughput (and cause lousy interactivity). I've tried further optimising the code, but alas it appears that for our desktop workloads, skiplists are not the way forward.

EDIT2: Small improvements to be had by adding this patch, but it's still not matching 406 performance for me:
http://ck.kolivas.org/patches/bfs/test/bfs411-412test.patch

Monday, 26 September 2011

BFS 0.410 test with skiplists.

Hi all.

TL;DR: Test release for fastest BFS ever.

The skiplists patch has proven to be quite stable, but a couple of minor issues have shown up. First, as I knew, the ondemand behaviour would not quite match the current release, and second, SCHED_IDLEPRIO tasks weren't scheduled properly (they acted like normal tasks). However the code itself seemed quite safe otherwise. So I've tentatively put out a test release of the next version of BFS. The two main changes to the skiplist code are to bring the ondemand governor performance up to par with current BFS, and to fix the behaviour of SCHED_IDLEPRIO tasks.

Quick benchmarks:
BFS 406:
Make -j4: 26.6s
Make -j : 27.8s

BFS 410:
Make -j4: 26.4s
Make -j : 27.1s


Changelog in patch:
Implement skip lists as described here:
http://en.wikipedia.org/wiki/Skip_list
for the main priority queue.

The old queue had: O(1) insertion, O(1) removal, but a lookup involving
both a binary search of a bitmap and O(n) search through a linked list which
is very cache unfriendly as the list gets larger.

This queue is now: O(log n) insertion, O(1) lookup and O(k) removal in a much
more cache friendly manner.

This should not compromise performance at all at very low loads, but improve
both throughput and latency as loads get higher, as confirmed by benchmarks.


Other changes: Cleanups of variable choices and micro-optimisations.

Full bfs410 patch for 3.0.x:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-410.patch

Incremental bfs410 for those on 3.0.x-bfs406:
http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

Incremental bfs410 for those on 3.0.x-ck1:
http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-410.patch

Note that make headers_check will report errors, so you may have to disable this option under kernel hacking if it is enabled. I'm investigating why, but it's a harmless warning.

EDIT: Uniprocessor builds will need the following fix on top:
http://ck.kolivas.org/patches/bfs/test/bfs410-upfix.patch

EDIT2: The following fix is required for interactivity issues with the patch
http://ck.kolivas.org/patches/bfs/test/bfs410-offsetfix.patch

Saturday, 24 September 2011

BFS and skip lists

Hi all.

Just a quick post to tell you all what I've been up to with BFS.

Some of you may know about skip lists as an alternative to balanced binary search trees. They feature O(log n) insertion, lookup and removal of table entries. Anyway, I've been looking for some time at the O(n) lookup of BFS (which has O(1) insertion and removal) to try and find a solution that didn't cost us at the desktop level, since O(n) for small n is very fast. The problem is of course at higher numbers of n (or server type loads), where it gets linearly slower, and the cache thrashing aspect of scanning linked lists becomes expensive.

So to cut a long story short, I've implemented the first draft of a custom version of skip lists for BFS in place of the O(n) lookup. The insertion remains O(log n), but by sorting all tasks (realtime, iso, normal, batch and idleprio) in a way that they all end up in the same table, I was able to do away with the regular linked lists and the bitmap priority lookup. Then I was able to utilise one of the features of skip lists: the first task on the "bottom" list is always sorted to be the highest priority. This means the lookup is O(1). Then I put an extra pointer into each entry pointing to the previous entry (the original design normally only points to the next entry). Finally I placed a pointer to the entry in the task struct as a way of reverse looking up the entry without any search.

So what I'm left with is an O(log n) insertion, O(1) lookup, and O(k) removal (k is the "height" of the node in question, up to 16, but usually only 1-4).
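Here's a minimal user-space sketch of that structure (illustrative only, not the actual BFS code): key stands in for a task's sort order, the extra prev pointers give the O(k) removal without any search, and the first node on the bottom list is the O(1) lookup.

```c
#include <stdlib.h>

#define MAX_LEVEL 16

/* One queue entry; "key" stands in for the task's sort order (deadline etc.). */
struct sl_node {
	unsigned long key;
	int level;			/* height of this node, 1..MAX_LEVEL */
	struct sl_node *next[MAX_LEVEL];
	struct sl_node *prev[MAX_LEVEL];	/* back pointers: O(k) removal */
};

struct sl_list {
	struct sl_node head;	/* sentinel; head.next[0] is the best entry */
};

/* Coin-flip level, as in the classic skip list design. */
static int random_level(void)
{
	int lvl = 1;

	while (lvl < MAX_LEVEL && (rand() & 1))
		lvl++;
	return lvl;
}

void sl_init(struct sl_list *l)
{
	l->head.level = MAX_LEVEL;
	for (int i = 0; i < MAX_LEVEL; i++) {
		l->head.next[i] = NULL;
		l->head.prev[i] = NULL;
	}
}

/* O(log n) expected: descend from the top level, then splice in. */
void sl_insert(struct sl_list *l, struct sl_node *n)
{
	struct sl_node *update[MAX_LEVEL];
	struct sl_node *x = &l->head;

	for (int i = MAX_LEVEL - 1; i >= 0; i--) {
		while (x->next[i] && x->next[i]->key < n->key)
			x = x->next[i];
		update[i] = x;
	}
	n->level = random_level();
	for (int i = 0; i < n->level; i++) {
		n->next[i] = update[i]->next[i];
		n->prev[i] = update[i];
		if (n->next[i])
			n->next[i]->prev[i] = n;
		update[i]->next[i] = n;
	}
}

/* O(1): the first node on the bottom list always has the lowest key. */
struct sl_node *sl_first(struct sl_list *l)
{
	return l->head.next[0];
}

/* O(k), k = node height: unlink using the stored prev pointers, no search. */
void sl_remove(struct sl_node *n)
{
	for (int i = 0; i < n->level; i++) {
		n->prev[i]->next[i] = n->next[i];
		if (n->next[i])
			n->next[i]->prev[i] = n->prev[i];
	}
}
```

In the scheduler, the task struct would keep a pointer to its sl_node, which is exactly the reverse lookup described above: removal needs no search at all.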

I've implemented the sticky task used for CPU affinity by simply comparing the last sticky task to the first entry returned from the skip list. Alas I have not yet provided a good version of the sticky task being used to improve scaling governor behaviour. This means that this will likely perform worse with the ondemand governor at this stage. On the other hand, the performance governor seems to be working very nicely in my preliminary tests.

Anyway I'd love to post more, but I need to sleep, so here's some code instead (for a BFS patched 3.0.x kernel):
bfs406-skiplists.patch

Full bfs406 + skiplists patch:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-406-skiplists.patch

Try it out, see what you think. It seems to be running safely here but as always, there are no guarantees. All going well, you should notice pretty much no difference on a desktop. If you do any throughput benchmarks comparing before/after, I'd love to hear about them.

EDIT:
Here's the thread about it on lkml: https://lkml.org/lkml/2011/9/23/371

Note the benchmarks showing substantial speedup in the highly contended make -j case that I mention in the thread.

3.0.0:
Elapsed Time 28.7

3.0.0-bfs406:
Elapsed Time 28.5

3.0.0-bfs406-sl:
Elapsed Time 27.0

Tuesday, 16 August 2011

Phoronix revisits BFS

Phoronix once benchmarked BFS in its early days. I guess at the prodding of people who suggested it here (thanks!), they've revisited it. Of course they did a throughput benchmark comparison which is kinda missing the point of BFS, but then again if they find ANY throughput benchmark where BFS is better, that's somewhere between amusing and disappointing. Anyway they did mention in passing that "However, the system with the BFS scheduler was more responsive during the test process." While this at least pays homage to what BFS really is about, it makes me wonder - what were they doing while the benchmarks ran? Benchmarks are meant to run without anything else running on the machine.

Anyway, thanks to Phoronix for conducting these benchmarks and for their logo suggestion.

http://www.phoronix.com/scan.php?page=article&item=bfs_two_years&num=1


Probably more interesting than a pure throughput test is the report of a server admin who was running mainline with 6000-8000 processes, whose system was falling over under that load. He then switched to BFS with no change in usage (it actually increased), which fixed his problems. Here's the graph; see if you can guess when he switched to BFS.


Thursday, 11 August 2011

3.0.0-ck1 and BFS for 3.0.0

Surprise, I'm back and it's all ready.

http://ck.kolivas.org/patches/bfs/3.0.0/3.0-sched-bfs-406.patch

http://www.kernel.org/pub/linux/kernel/people/ck/patches/3.0/3.0.0-ck1/

Note: no packages this time, because current distributions get really confused by the new numbering and may not boot.

Nauru was a very depressing place indeed, really living up to the expectations I had based on what I knew of the history of it. Goddamn we (collectively, the developed world) screwed that country up big...

Monday, 25 July 2011

3.0 BFS delays

Hi all

I haven't blogged much lately because I've been distracted from kernel hacking by bitcoin mining. For some crazy reason I took it upon myself to make mining software that did what I wanted, writing it the way I write kernel code. Anyway since it's unrelated I haven't posted about it here before, but if anyone's interested, the development thread is here:

http://forum.bitcoin.org/index.php?topic=28402.0

Now about the kernel. To be honest I haven't followed the development of 3.0 almost at all, being totally pre-occupied with other things as I've taken time out from work as a sabbatical while I reassess work-life balance, long term career management (even to the point of considering changing line of work - anyone need a c programmer?) and spend time with family, friends and random other personal development things. No, I'm not quitting kernel development any time soon (again).

Anyway the thing is I'm going with Interplast next week to Nauru (of all places) as a volunteer anaesthetist for needy children for 10 days. I'm not sure if I'll find time to port BFS to 3.0 before then, or if I'll be able to do it while I'm actually there (doubtful). So just a heads up that it might be a while before we BF the 3.0 kernel.

Monday, 6 June 2011

Porting BFS to FreeBSD

There is an interesting Google Summer of Code project underway to port BFS to FreeBSD and see how it compares to their mainline CPU schedulers, both the 4BSD scheduler and ULE. The GSoC student porting it has a rudimentary version of the BFS algorithm working based on the 4BSD scheduler and has started testing it. His progress will be worth watching, and the first results he's getting are already interesting.

His blog is here: rudot.blog.com

Sunday, 5 June 2011

2.6.39-ck2, bfs-0.406

These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but suitable to any commodity hardware workload.


Apply to 2.6.39(.x):
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.39/2.6.39-ck2/patch-2.6.39-ck2.bz2

Ubuntu packages (2.6.39-ck1-3 is equivalent to 2.6.39-ck2):
http://ck.kolivas.org/patches/Ubuntu%20Packages/

Broken out tarball:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.39/2.6.39-ck2/2.6.39-ck2-broken-out.tar.bz2

Discrete patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.39/2.6.39-ck2/patches/

All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/

BFS by itself:
http://ck.kolivas.org/patches/bfs/

Web:
http://kernel.kolivas.org

Code blog when I feel like it:
http://ck-hack.blogspot.com/

Each discrete patch contains a brief description of what it does at the top of the patch itself.


The only change from 2.6.39-ck1 is an upgrade to BFS CPU scheduler version 0.406. A bug that would cause hangs due to an incompatibility between the new block plug flushing code and BFS was fixed. For those who tried the "bfs404-test9" patch, this is only trivially different apart from the BFS version change.

Full patchlist:
2.6.39-sched-bfs-406.patch
sched-add-above-background-load-function.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
ck2-version.patch

Please enjoy!
お楽しみください
--
-ck

Friday, 3 June 2011

2.6.39 BFS test 9 - is this the one?

Hopefully this test patch should fix all the problems with BFS 404 on 2.6.39:

bfs404-test9.patch
Ubuntu Packages : grab the 2.6.39-ck1-3 package

Please report back if you haven't already! Thanks to everyone who has tested so far! Your feedback has been absolutely essential on this weird and wonderful bug.

Monday, 30 May 2011

2.6.39 BFS progress

TL;DR: 2.6.39 BFS fixed maybe?

After walking away from the code for a while, annoyed at the bug I couldn't track down, I had another good look at what might be happening. It appears that while the grq lock is dropped in schedule() to perform the block plug flush, a wakeup of the task via try_to_wake_up may be missed entirely, leaving the task deactivated when it should actually keep running. Anyway, first tests from the people in these blog comments are reassuring.

Here is a cleaned up and slightly modified version of the "test8" patch that has so far been stable and shows to have fixed the problem for a handful of people:

Apply to 2.6.39-ck1 or 2.6.39 with BFS 404:
bfs404-recheck_unplugged.patch

In response to requests for packaged versions, I've uploaded a 2.6.39-ck1-2 ubuntu package which includes this change:
Ubuntu Packages

Please test and report back! If this fixes the problem, I'll be releasing it as ck2.

Thursday, 19 May 2011

2.6.39-ck1

These are patches designed to improve system responsiveness and interactivity
with specific emphasis on the desktop, but suitable to any commodity hardware workload.


Apply to 2.6.39:
patch-2.6.39-ck1.bz2

Broken out tarball:
2.6.39-ck1-broken-out.tar.bz2

Discrete patches:
patches

Ubuntu packages:
http://ck.kolivas.org/patches/Ubuntu%20Packages

All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/

BFS by itself:
http://ck.kolivas.org/patches/bfs/

Web:
http://kernel.kolivas.org

Code blog when I feel like it:
http://ck-hack.blogspot.com/

Each discrete patch contains a brief description of what it does at the top of
the patch itself.


The most substantial change since the last public release is a major version upgrade to the BFS CPU scheduler version 0.404.

Full details of the most substantial changes, which went into version 0.400, are in my blog here:
http://ck-hack.blogspot.com/2011/04/bfs-0400.html

This version exhibits better throughput, better latencies, better behaviour with scaling cpu frequency governors (e.g. ondemand), better use of turbo modes in newer CPUs, and addresses a long-standing bug that affected all configurations but was only demonstrable on lower Hz configurations (e.g. 100Hz), where it caused fluctuating performance and latencies. Thus mobile configurations (e.g. Android on 100Hz) also perform better. The default round robin interval on all hardware is now set to 6ms (i.e. tuned primarily for latency). This can be easily modified with the rr_interval sysctl in BFS for special configurations (e.g. increase to 300 for encoding / folding machines).
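For example, on a BFS kernel the interval is exposed under /proc/sys/kernel/rr_interval and can be tweaked like any other sysctl (the value 300 here is the throughput-oriented example from above):

```shell
# Read the current round robin interval in ms (default 6 on BFS):
cat /proc/sys/kernel/rr_interval

# Raise it for a dedicated encoding / folding machine:
sysctl -w kernel.rr_interval=300

# Or make it persistent across reboots:
echo 'kernel.rr_interval = 300' >> /etc/sysctl.conf
```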

Performance of BFS has been tested on lower power single core machines through various configuration SMP hardware, both threaded and multicore, up to 24x AMD. The 24x machine exhibited better throughput on optimally loaded kbuild performance (from make -j1 up to make -j24). Performance beyond this level of load did not match mainline. On folding benchmarks at 24x, BFS was consistently faster for the unbound (no cpu affinity in use) multi-threaded version. On 6x hardware, performance at all levels of load in kbuild and x264 encoding benchmarks was better than mainline in both throughput and latency in the presence of the workloads.

For 6 core results and graphs, see:
benchmarks 20110516
(desktop = 1000Hz + preempt, server = 100Hz + no preempt):

Here are some desktop config highlights:
Throughput at make -j6:

Latency in the presence of x264 ultrafast:

Throughput with x264 ultrafast:


This is not by any means a comprehensive performance analysis, nor is it meant to claim that BFS is better under all workloads and hardware than mainline. They are simply easily demonstrable advantages on some very common workloads on commodity hardware, and constitute a regular part of my regression testing. Thanks to Serge Belyshev for 6x results, statistical analysis and graphs.


Other changes in this patch release include an updated version of lru_cache_add_lru_tail, as the previous version did not work entirely as planned, dropping the dirty ratio to the extreme value of 1 by default in decrease_default_dirty_ratio, and dropping the cpufreq ondemand tweaks, since BFS now detects scaling CPUs internally and works with them.


Full patchlist:

2.6.39-sched-bfs-404.patch
sched-add-above-background-load-function.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
ck1-version.patch


Please enjoy!
お楽しみください
--
-ck

EDIT4: For those having hangs, please try this patch on top of ck1:
bfs404-test6.patch

Monday, 16 May 2011

BFS 0.404 page that really exists

There was one regression going into BFS 0.403, and that was expanding the sticky flag to cover cache warm as well. Not only did it not improve throughput on anything I could measure, it caused latency regressions, so I've backed it out. The only other change going into 404 was fixing a couple of unused variable warnings that were reported by a commenter on this blog. So I consider this patch stable now, and pretty much how it will go into 2.6.39 final when it comes out.

Get it here:
2.6.39-rc7-sched-bfs-404.patch.lrz

Saturday, 14 May 2011

BFS 0.403 test for 2.6.39-rc7

BFS 0.402 test has proven very stable on 2.6.39-rc7, but a minor issue came up with respect to the new accurate IRQ accounting, where some CPU time did not get accounted. So I went in and revised the way it worked to be both cheaper and more accurate. There had also been a problem in the accounting where the total CPU usage did not always add up to 100%. The reason for this was that the small inaccuracies in each respective CPU usage category (user, system, wait etc.) were exacerbated when added together. I've put in a total CPU percentage counter that checks whether the total adds up to 100 and, if not, rounds the values up so they add up to 100%.
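One simple way to do what's described above can be sketched in a few lines of C (a toy, not the kernel code): truncate each percentage, then hand the rounding shortfall to the largest category so the total is exactly 100.

```c
/* Convert raw per-category CPU tick counts into percentages that are
 * guaranteed to sum to exactly 100, by giving any rounding shortfall
 * to the largest category. Illustrative sketch only. */
void ticks_to_percent(const unsigned long *ticks, int n, int *pct)
{
	unsigned long total = 0;
	int sum = 0, biggest = 0;

	for (int i = 0; i < n; i++)
		total += ticks[i];
	if (!total) {
		for (int i = 0; i < n; i++)
			pct[i] = 0;
		return;
	}
	for (int i = 0; i < n; i++) {
		pct[i] = (int)((ticks[i] * 100) / total);	/* truncates */
		sum += pct[i];
		if (ticks[i] > ticks[biggest])
			biggest = i;
	}
	pct[biggest] += 100 - sum;	/* absorb the rounding remainder */
}
```

With, say, three equal categories each truncating to 33%, the first category absorbs the missing 1% so the displayed total is 100% rather than 99%.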

There was also a change I considered doing with the sticky flag that is used to minimise task movement to different CPUs that I've committed to 403 test. Instead of it being a binary on/off flag, I made it a stepped flag going from CACHE_COLD through CACHE_WARM to CACHE_HOT. Basically any task that is knocked off a CPU but is still waiting for more CPU is immediately labelled hot. Only one task is considered hot and previously as soon as a new cache hot task appeared, the sticky flag was cleared. Now, instead of it being cleared, it is set to warm, and only cleared to cold when the task sleeps. Forked child processes are now also labelled cache warm since they share many structures with their parent process. Any task that is cache warm or cache hot is biased against moving to another cpu by offsetting its relative deadline. Any task that is cache hot will not move cpu to a different cpu if that different one is scaled down in speed (as for example when ondemand cpu frequency governor slows it down). Basically this new change should improve throughput more in the overloaded case (when jobs > CPUs), but that's just a generic comment as I haven't benchmarked it yet.
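The stepped flag logic above can be roughly sketched as follows. The CACHE_COLD/WARM/HOT names are from the post; the helper functions, the offset value, and the structure layout are invented here purely for illustration and bear no relation to the real BFS internals.

```c
#include <stddef.h>

enum cache_state { CACHE_COLD, CACHE_WARM, CACHE_HOT };

struct task {
	enum cache_state cache;
	unsigned long deadline;
};

/* A task knocked off a CPU while still wanting CPU becomes the single hot
 * task; the previous hot task steps down to warm rather than straight to
 * cold. */
void mark_preempted(struct task *t, struct task **hot)
{
	if (*hot && *hot != t)
		(*hot)->cache = CACHE_WARM;
	t->cache = CACHE_HOT;
	*hot = t;
}

/* Sleeping is what finally clears the flag back to cold. */
void mark_sleeping(struct task *t, struct task **hot)
{
	t->cache = CACHE_COLD;
	if (*hot == t)
		*hot = NULL;
}

/* Bias against migration: warm or hot tasks have their relative deadline
 * offset when another CPU considers taking them, and a hot task will not
 * move to a CPU that is scaled down in speed at all. */
unsigned long effective_deadline(const struct task *t, int other_cpu_scaled)
{
	unsigned long d = t->deadline;

	if (t->cache != CACHE_COLD)
		d += 1000;		/* arbitrary offset, for illustration */
	if (t->cache == CACHE_HOT && other_cpu_scaled)
		d = ~0UL;		/* hot task refuses a scaled-down CPU */
	return d;
}
```

A forked child would simply start in CACHE_WARM under this scheme, since it shares many structures with its parent.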

Anyway give the new BFS a try. Everything appears to be running nice and stable, and as a bonus, my feel-good-o-meter is reading quite high with the upcoming 2.6.39! The magnitude of changes going into it seemed a lot less than previous kernels and I've had no issues with the -rc7 version so far.

As per previously, I've compressed the patch with lrzip as part of my evil plot to force you all to use it. Get it here:
2.6.39-rc7-sched-bfs-403-test.patch.lrz

Enjoy, and please report back if you try it!

Wednesday, 11 May 2011

BFS 0.402 test2 for 2.6.39-rc7

Well it looks like another stable release is just around the corner, so it's time for me to sync up. Here's the first BFS test release patch for 2.6.39-rc7:

2.6.39-rc7-sched-bfs-402-test2.patch.lrz

Of course I've used my evil powers to compress it with lrzip as a ploy to make you all have to use it again.

I've been using it for a few hours and it seems to be stable enough, but all the usual warnings apply. I also tested it on the most common configurations, but that doesn't mean it will definitely build fine on all configurations.

The only changes in the impending final release of BFS version 0.402 include some changes inspired by the people posting changes here in the forums (Thanks guys!), though not exactly in the form offered, and a resync of the new changes required to support 2.6.39. Specifically there is more high resolution IRQ accounting, and a new syscall "yield_to".

Funnily enough, it was a good 6 years or so ago that I had a discussion with William Lee Irwin III, who suggested such a yield call as a useful programming addition, which of course was discounted by the mainline maintainers back then. Now they suddenly find it's a useful idea after all, since there may well be scenarios where a directed yield is helpful instead of strict locking semantics. Oh well, I guess there is the adage that you should only ever implement a feature when you need it rather than "for when you might need it in the future". The difference now from back then is that the people who wanted it then couldn't push so hard since they weren't kernel hackers themselves. This time it's KVM that desires it, so it's required by another part of the kernel instead of userspace.

So anyway, please test and report back, and enjoy!

Thursday, 21 April 2011

BFS 0.401

I was meant to be on holidays this week, and indeed I've been away from home somewhere warm. While BFS was supposed to be the last thing I cared about, I was fortunate enough to have other people actually find some bugfixes to BFS. First up was _sid_ who found some very small optimisations that I've committed to the new version of BFS. But even more impressively, Serge Belyshev found a long standing bug that would cause bad latencies when Hz values were low, due to the "last_ran" variable not being set. This may well have been causing a significant latency disadvantage to BFS when Hz was 100.


As you can see in this graph, worst case latencies could be 100 times better with this bug fixed. While it will affect all Hz values, it is most significant at low Hz and probably unnoticeable by the time you're on 1000Hz. Those who are on low Hz configurations, especially those on say Android, will notice a dramatic speedup moving to BFS 401.

So get it here (available for 2.6.38.3, 2.6.35.12 and 2.6.32.38):
BFS PATCHES

Again, thanks VERY much to the testers and even more to those contributing bugfixes and code.

Wednesday, 13 April 2011

Scalability of BFS?

So it occurred to me that for some time I've been saying that BFS may scale well only up to about 16 CPUs. That was a fairly generic guess based on the design of BFS, but it appears that these many-thread and multi-core machines quite like BFS in the real-world benchmarks I'm getting back from various people. With the latest changes to BFS, which bumped the version up to 0.400, it should have improved further. I've tried googling for links to do with BFS and scalability, and the biggest machine I've been able to find that benefits from it is a 24 core machine running F@H (Folding@home). Given that this was with an older version of BFS, and that there were actually advantages even at 24 cores, I wonder at what point it stops scaling? Obviously scalability is more than just "running F@H" and will depend entirely on architecture, workload, the definition of scalability, and so on, but... I wanted to ask the community: what's the biggest machine anyone has tried BFS on, and how well did it perform? If someone has access to 16+ cores to try it out I'd be mighty grateful for your results.