Thursday, 26 May 2011

2.6.39-ck1 unstable

As much as I hate to say this, I have to give up on 2.6.39 for now. I just don't have the time or energy to fix this. I'm grateful for all your testing, but it's just going to have to go on hold, and I'll support .38 kernels in the meantime until I have a revelation of some sort, or help from someone else who knows kernel internals.

Thursday, 19 May 2011

2.6.39-ck1

These are patches designed to improve system responsiveness and interactivity,
with specific emphasis on the desktop, but suitable for any commodity hardware workload.


Apply to 2.6.39:
patch-2.6.39-ck1.bz2

Broken out tarball:
2.6.39-ck1-broken-out.tar.bz2

Discrete patches:
patches

Ubuntu packages:
http://ck.kolivas.org/patches/Ubuntu%20Packages

All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/

BFS by itself:
http://ck.kolivas.org/patches/bfs/

Web:
http://kernel.kolivas.org

Code blog when I feel like it:
http://ck-hack.blogspot.com/

Each discrete patch contains a brief description of what it does at the top of
the patch itself.


The most substantial change since the last public release is a major version upgrade to the BFS CPU scheduler version 0.404.

Full details of the most substantial changes, which went into version 0.400, are in my blog here:
http://ck-hack.blogspot.com/2011/04/bfs-0400.html

This version exhibits better throughput, better latencies, better behaviour with scaling cpu frequency governors (e.g. ondemand), and better use of turbo modes in newer CPUs. It also addresses a long-standing bug that affected all configurations but was only demonstrable at lower Hz (i.e. 100Hz), where it caused fluctuating performance and latencies; mobile configurations (e.g. Android at 100Hz) therefore also perform better. The default round robin interval is now set to 6ms on all hardware (i.e. tuned primarily for latency). This can easily be modified with the rr_interval sysctl in BFS for special configurations (e.g. increase it to 300 for encoding / folding machines).
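If you want to play with the tunable, something like this works. A sketch only: I'm assuming the sysctl is exposed at /proc/sys/kernel/rr_interval as in earlier BFS releases (not stated explicitly in this post), and root is needed to write it.

```python
from pathlib import Path

# Assumed sysctl location (kernel.rr_interval); taken from earlier BFS
# releases, not from this post.
RR_INTERVAL = Path("/proc/sys/kernel/rr_interval")

def get_rr_interval(path: Path = RR_INTERVAL) -> int:
    """Read the current round robin interval in milliseconds."""
    return int(path.read_text().strip())

def set_rr_interval(ms: int, path: Path = RR_INTERVAL) -> None:
    """Write a new round robin interval (needs root on a real system)."""
    path.write_text(f"{ms}\n")
```

On a stock 0.404 kernel get_rr_interval() should return the default of 6; an encoding or folding box might use set_rr_interval(300).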

Performance of BFS has been tested from low-power single core machines through variously configured SMP hardware, both threaded and multicore, up to a 24x AMD. The 24x machine exhibited better throughput on optimally loaded kbuild (from make -j1 up to make -j24); beyond that level of load, performance did not match mainline. On folding benchmarks at 24x, BFS was consistently faster for the unbound (no cpu affinity in use) multi-threaded version. On 6x hardware, kbuild and x264 encoding benchmarks were better than mainline at all levels of load, in both throughput and latency measured in the presence of those workloads.

For 6 core results and graphs, see:
benchmarks 20110516
(desktop = 1000Hz + preempt, server = 100Hz + no preempt)

Here are some desktop config highlights:
Throughput at make -j6:

Latency in the presence of x264 ultrafast:

Throughput with x264 ultrafast:


This is not by any means a comprehensive performance analysis, nor is it meant to claim that BFS is better than mainline under all workloads and hardware. These are simply easily demonstrable advantages on some very common workloads on commodity hardware, and they constitute a regular part of my regression testing. Thanks to Serge Belyshev for the 6x results, statistical analysis and graphs.


Other changes in this patch release: an updated version of lru_cache_add_lru_tail (the previous version did not work entirely as planned), dropping the default dirty ratio to the extreme value of 1 in decrease_default_dirty_ratio, and dropping the cpufreq ondemand tweaks, since BFS now detects scaling CPUs internally and works with them.


Full patchlist:

2.6.39-sched-bfs-404.patch
sched-add-above-background-load-function.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
ck1-version.patch


Please enjoy!
お楽しみください
--
-ck

EDIT4: For those having hangs, please try this patch on top of ck1:
bfs404-test6.patch

Monday, 16 May 2011

BFS 0.404 page that really exists

There was one regression going into BFS 0.403, and that was expanding the sticky flag to cache warm as well. Not only did it not improve throughput on anything I could measure, it caused latency regressions, so I've backed it out. The only other change going into 0.404 was fixing a couple of unused variable warnings reported by a commenter on this blog. So I consider this patch stable now, and pretty much how it will go into 2.6.39 final when it comes out.

Get it here:
2.6.39-rc7-sched-bfs-404.patch.lrz

Saturday, 14 May 2011

BFS 0.403 test for 2.6.39-rc7

BFS 0.402 test has proven very stable on 2.6.39-rc7, but a minor issue came up with the new accurate IRQ accounting: some CPU time did not get accounted. So I went in and revised the way it worked to be cheaper and more accurate. There had also been a problem where the total CPU usage did not always add up to 100%; the small inaccuracies of each respective CPU usage (user, system, wait etc.) were compounded when added together. I've put in a total CPU percentage check: if the total doesn't reach 100, the values are rounded up so they add up to 100%.
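The rounding-up idea is simple enough to sketch in userspace terms. Illustrative only, with hypothetical field names; the real accounting is kernel C and differs in detail:

```python
def normalise_cpu_percentages(fields):
    """Round per-field CPU percentages up until the total reaches 100.

    Individual rounding of user/system/wait etc. can leave the total
    short of 100%; hand the missing points to the biggest consumers first.
    """
    out = dict(fields)
    names = sorted(out, key=out.get, reverse=True)
    i = 0
    while sum(out.values()) < 100:
        out[names[i % len(names)]] += 1
        i += 1
    return out
```

So a reading of user 50, system 30, idle 19 (total 99) gets its missing point handed to user, and the displayed total reaches 100%.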

There was also a change I had considered making to the sticky flag, which is used to minimise task movement between CPUs, and I've now committed it to 403 test. Instead of a binary on/off flag, it is now a stepped flag going from CACHE_COLD through CACHE_WARM to CACHE_HOT. Basically, any task that is knocked off a CPU but is still waiting for more CPU is immediately labelled hot. Only one task is considered hot at a time; previously, as soon as a new cache hot task appeared, the old task's sticky flag was cleared. Now, instead of being cleared, it is set to warm, and only cleared to cold when the task sleeps. Forked child processes are now also labelled cache warm, since they share many structures with their parent process. Any task that is cache warm or cache hot is biased against moving to another CPU by offsetting its relative deadline. Any task that is cache hot will not move to a different CPU if that CPU is scaled down in speed (as, for example, when the ondemand cpu frequency governor slows it down). Basically this new change should improve throughput more in the overloaded case (when jobs > CPUs), but that's just a generic comment as I haven't benchmarked it yet.
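As a rough sketch of the stepped flag, the state names are from the description above; the helper functions are hypothetical, purely to illustrate the transitions, and bear no resemblance to the real scheduler code:

```python
# States from the post; everything else is an illustrative sketch.
CACHE_COLD, CACHE_WARM, CACHE_HOT = range(3)

class Task:
    def __init__(self, sticky=CACHE_COLD):
        self.sticky = sticky

def preempted(task, prev_hot=None):
    """A task knocked off a CPU while still wanting CPU becomes hot;
    the previously hot task is demoted to warm rather than cleared."""
    if prev_hot is not None:
        prev_hot.sticky = CACHE_WARM
    task.sticky = CACHE_HOT

def slept(task):
    """Only sleeping clears stickiness back to cold."""
    task.sticky = CACHE_COLD

def forked(parent):
    """Children share many structures with their parent, so start warm."""
    return Task(sticky=CACHE_WARM)
```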

Anyway, give the new BFS a try. Everything appears to be running nice and stable, and as a bonus, my feel-good-o-meter is reading quite high for the upcoming 2.6.39! The magnitude of changes going into it seems a lot smaller than for previous kernels, and I've had no issues with the -rc7 version so far.

As previously, I've compressed the patch with lrzip as part of my evil plot to force you all to use it. Get it here:
2.6.39-rc7-sched-bfs-403-test.patch.lrz

Enjoy, and please report back if you try it!

lrzip-0.606

I broke lrzuntar in version 0.605 when I disabled automatic stdout on lrzip, so this is a tiny bugfix release to make lrzuntar work again.

lrzip on freshmeat

Wednesday, 11 May 2011

BFS 0.402 test2 for 2.6.39-rc7

Well it looks like another stable release is just around the corner, so it's time for me to sync up. Here's the first BFS test release patch for 2.6.39-rc7:

2.6.39-rc7-sched-bfs-402-test2.patch.lrz

Of course I've used my evil powers to compress it with lrzip as a ploy to make you all have to use it again.

I've been using it for a few hours and it seems to be stable enough, but all the usual warnings apply. I also tested it on the most common configurations, but that doesn't mean it will definitely build fine on all configurations.

The only changes in the impending final release of BFS 0.402 are some changes inspired by people posting in the forums here (thanks guys!), though not exactly in the form offered, and a resync of the new changes required to support 2.6.39. Specifically, there is more high resolution IRQ accounting, and a new "yield_to" call.

Funnily enough, it was a good six years or so ago that I had a discussion with William Lee Irwin III, who suggested such a yield call would be a useful programming addition; of course it was discounted by the mainline maintainers back then. Now they suddenly find it's a useful idea after all, since there may well be scenarios where a directed yield is helpful instead of strict locking semantics. Oh well, I guess there is the adage that you should only ever implement a feature at the time you need it, rather than "for when you might need it in the future". The difference now from back then is that the people who wanted it couldn't push so hard, since they weren't kernel hackers themselves. This time it's KVM that desires it, so it's required by another part of the kernel instead of userspace.

So anyway, please test and report back, and enjoy!

Sunday, 8 May 2011

lrzip-0.605

A few minor bugs have shown up since version 0.604 that people have spotted, so I've attended to them and added a new lrzcat symbolic link and released version 0.605. Here's the changelog:

Addition of lrzcat - automatically decompresses .lrz files to STDOUT.
lrzip and lrunzip will no longer automatically output to STDOUT when output is redirected.
Progress output will no longer spam the output unless the percentage has changed.
lrzip now has no lower limit on the file sizes it will happily compress and can work with zero byte sized files.
The percentage counter when getting file info on small files will not show %nan.
The executable bit will not be enabled when compressing via a means that can't preserve the original permissions (e.g. from STDIN).

I'm now getting reports that lrzip is sneaking into many distribution repositories, that the 'file' command is finally getting support for .lrz files, and that rpm is even including support for lrzip compressed files. Great news, and thanks to the packagers and other developers.

Get version 0.605 from here: LRZIP 0.605 on freshmeat

Happy Mother's day if you're celebrating it in your part of the world.

Wednesday, 27 April 2011

lrzip-0.604

A one bug fix release.

Changelog:
Detach threads after creating them on the compression side. Not joining them meant that compressing massive files requiring hundreds of threads would eventually hit the resource limit on the number of threads created, even though the threads themselves would exit.

English:
lrzip will no longer fail with a "resource temporarily unavailable" error when compressing files over 100GB that require hundreds of threads to complete.

Get it here: LRZIP

Friday, 22 April 2011

lrzip-0.603

Trying to polish off version 0.6x of lrzip to be nice and stable and working as planned, I've made a few more updates addressing a few issues that have come up, along with some outside help. Here's the short changelog:

lrzip now detects when output is being redirected without a filename and will automatically output to stdout. Apple builds, which had errors on compressing files larger than 2GB in size, were fixed. lrztar now properly supports -o, -O, and -S. The lrzip configuration file now supports encryption. lrzip will now warn if it's inappropriately passed a directory as an argument directly.

Probably the most fun part of this is the first feature above, the stdout detection, which I use regularly now. Since I store all my kernels and patches as .lrz, I can now do:

lrunzip patch-2.6.38.4.lrz | patch -p1

Also, graysky made some nice graphs and I feel obliged to put them up here:





Of course, with much larger files and more CPUs and RAM the discrepancy becomes much greater in lrzip's favour, but that doesn't change the fact that this is a real world test.

So grab it here:
LRZIP ON FRESHMEAT

As an aside, debian unstable now has 0.602+ in its repo, and the upcoming elite release of slackware also has 0.602.

Thursday, 21 April 2011

BFS 0.401

I was meant to be on holidays this week, and indeed I've been away from home somewhere warm. While BFS was supposed to be the last thing I cared about, I was fortunate enough to have other people actually find some bugfixes for BFS. First up was _sid_, who found some very small optimisations that I've committed to the new version of BFS. But even more impressively, Serge Belyshev found a long-standing bug that would cause bad latencies when Hz values were low, due to the "last_ran" variable not being set. This may well have been causing a significant latency disadvantage for BFS at 100Hz.


As you can see in this graph, worst case latencies could be 100 times better with this bug fixed. While it affects all Hz values, it is most significant at low Hz and probably unnoticeable by the time you're on 1000Hz. Those on low Hz configurations, especially on, say, Android, will notice a dramatic speedup moving to BFS 401.

So get it here (available for 2.6.38.3, 2.6.35.12 and 2.6.32.38):
BFS PATCHES

Again, thanks VERY much to the testers and even more to those contributing bugfixes and code.

Wednesday, 13 April 2011

lrzip 0.602

So the latest version of lrzip seems to be working well in the field, with very few bug reports, which is nice considering the magnitude of the changes that went into 0.600. I let it stew for a while at 0.601 while I shook out any obvious bugs, and am now releasing a new stable version that is mostly a bugfix release. Here's the what's new entry for this version:

Now builds on Cygwin.
Fixed wrong symlinks which broke some package generation.
Imposed limits for 32bit machines with way too much ram for their own good.
Disable md5 generation on Apple for now since it's faulty.
Displays full version with -V.
Checks for pod2man on ./configure.
File permissions are better carried over instead of being only 0600.

The only new "feature" is building on cygwin which was contributed by Тулебаев Салават. Thanks!

Just a reminder for what sort of data lrzip works particularly well on:
linux-2.6.0-2.6.38.tar.lrz

This is a tarball of all 39 stable kernel releases from 2.6.0 to 2.6.38 and is only 160MB.
Decompressed file size: 10618664960
Compressed file size: 168125950
Compression ratio: 63.159
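For the record, the quoted ratio is just decompressed size divided by compressed size:

```python
decompressed = 10_618_664_960   # bytes, from the listing above
compressed = 168_125_950

ratio = decompressed / compressed
print(f"compression ratio: {ratio:.3f}")  # 63.159, matching the quoted figure
```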

Enjoy!
lrzip 0.602 at freshmeat

Scalability of BFS?

So it occurred to me that for some time I've been saying that BFS may scale well only up to about 16 CPUs. That was a fairly generic guess based on the design of BFS, but it appears that these many-threaded and multicore machines quite like BFS on the real-world benchmarks I'm getting back from various people. With the latest changes to BFS, which bumped the version up to 0.400, it should have improved further. I've tried googling for links to do with BFS and scalability, and the biggest machine I've been able to find that benefits from it is a 24 core machine running F@H (Folding@home). Given that this was with an older version of BFS, and that there were actual advantages even at 24 cores, I wonder at what point it stops scaling. Obviously scalability is more than just "running F@H" and will depend entirely on architecture, workload, definition of scalability and so on, but... I wanted to ask the community: what's the biggest machine anyone has tried BFS on, and how well did it perform? If someone has access to 16+ cores to try it out, I'd be mighty grateful for your results.

Sunday, 10 April 2011

BFS 0.400

TL;DR: Lower latency, better throughput and better ondemand behaviour on SMP.

BFS 400 for 2.6.38.2:
2.6.38.2-sched-bfs-400.patch

BFS 400 for 2.6.35.12:
2.6.35.12-sched-bfs-400.patch

2.6.38-ck3 includes BFS 400 (applies to 2.6.38.2):
patch-2.6.38-ck3.bz2

Ubuntu packages 2.6.38.2-ck2 and 2.6.35.12-ck1 include BFS 400:
Ubuntu Packages

It's been a while in the making, and the advantages of this new BFS are substantial enough to warrant a new version. I was going to announce this on lkml with lots of fanfare, but I see there is still some rare desktop hardware that is not happy with ondemand governing. SO, my summary is: if you are on a desktop, stick with the performance governor and you'll get significant performance and latency advantages; if you're on a laptop, you'll find BFS 400 much better than previous iterations with both ondemand and performance governing. There is hard evidence it is better for these. What follows is a long post that I was planning to submit to lkml, but I'm very wary of burnout. Burnout for me is not about being tired of working on the code. No, not at all. The issue is that coding starts intruding too much on my regular life, which has absolutely nothing to do with the linux kernel or code at all. Once it interferes with my home life, or work, or health, I know it's time to cut back. And that's where I am right now. So instead, here's the post, just on the blog:

---

This is to announce the most major upgrade in the BFS CPU scheduler in a
while.


For those tracking the development on my blog, this is version 0.376 renamed
with mostly cosmetic changes only.

Benchmarks first:

x264 benchmarks on a quad core Xeon X3360 with ondemand (courtesy of Graysky):
x264encodingfps.png
x264encodingsec.png
compilefilezilla340.png




Benchmarks on a 4 thread 2 core i7 M620:
m620-benchmarks.txt

Quad core Q9650:
q9650-benchmarks.txt

Prescott 3.2GHz Single core HT:
prescott-benchmarks.txt

Six core AMD: This has the all important LATENCY and throughput benchmarks.
Each benchmark is the latency with the following loads on a 6 core AMD, make -
j1, j3, j6, j9, j12 and x264 medium and x264 ultrafast (courtesy of Serge Belyshev):
http://ck.kolivas.org/patches/bfs/benchmark3-results-for-announcement-20110410.txt

Latency Graphs here:
http://ck.kolivas.org/patches/bfs/bfs400-cfs/
Sample image from the all important -j6 (where load == number of CPUs):




The BFS cpu scheduler, designed for optimum interactivity and responsiveness
while maintaining excellent throughput on commodity hardware, has been
modified to improve its behaviour on SMP machines, especially when used in
combination with scaling CPU frequency governors (like ondemand), and on
multicore hardware that has "turbo" modes. As many distributions are now
setting ondemand as the default governor, even on desktops, and some desktop
GUIs aren't even offering a way to alter the choice of governor any more, it
was clear that I needed to modify BFS to perform optimally with these
governors.

Briefly, BFS is a single runqueue design which is used to make the most of the
fact that single runqueues overall have the lowest average latency compared to
separate runqueue designs.

See this youtube link about queuing theory for an interesting parallel with
supermarket checkouts:

Why the other line is likely to move faster

The most consistent feature in BFS to date that was found to improve average
service latency whilst maintaining throughput was that no CPU was ever left to
be idle if there was a task that desired CPU time. Through extensive
experimentation I discovered that it was extremely easy for 4 or more tasks to
desire CPU at once on a virtually idle system. Thus on a quad core, the
average service time would always be lower if the tasks were allowed to move to
the first available idle CPU.
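A toy model shows the effect. Hypothetical burst lengths only; nothing here resembles the real scheduler code:

```python
import heapq

def avg_wait_first_idle(bursts, cpus):
    """Greedy dispatch: each ready task starts on whichever CPU frees first."""
    free = [0.0] * cpus              # time at which each CPU next goes idle
    heapq.heapify(free)
    waits = []
    for burst in bursts:             # all tasks ready at time 0
        start = heapq.heappop(free)  # first CPU to become available
        waits.append(start)
        heapq.heappush(free, start + burst)
    return sum(waits) / len(waits)

def avg_wait_pinned(bursts, cpus):
    """Same tasks, round-robin pinned to CPUs regardless of idleness."""
    free = [0.0] * cpus
    waits = []
    for i, burst in enumerate(bursts):
        cpu = i % cpus
        waits.append(free[cpu])
        free[cpu] += burst
    return sum(waits) / len(waits)
```

With one long task and three short ones (bursts of 4, 1, 1, 1) on 2 CPUs, chasing the first idle CPU averages 0.75 time units of waiting against 1.25 for pinning, because no short task ever queues behind the long one while a CPU sits idle.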

One of the disadvantages of this approach, though, is that although average
service latency is kept low by always using the first available CPU in a
relatively idle system, it means that work never clusters on a CPU. So when a
load-dependent CPU frequency governor is in use, no one CPU is ever seen as
being fully loaded except once the load is great enough for all CPUs to be
loaded. This meant that with BFS, CPUs would not be ramped up to their maximum
speed, and further, the "turbo" modes of newer multicore CPUs were never
engaged since they rely on one core being used while the others are kept idle.
Mainline does not exhibit this problem by virtue of always keeping work local
to each CPU, only migrating work elsewhere once some heuristically determined
imbalance is measured and a rebalance is forced.

It became apparent that to improve throughput with scaling governors and turbo
CPUs, I would have to allow CPUs to become idle and to service tasks clustered
on fewer CPUs in order for those CPUs to speed up. Trying to find the right
compromise between improving throughput while keeping the service latency low
by allowing a CPU to go idle would require heuristics which I've avoided
greatly to date on BFS. One great concern I have for the mainline scheduler is
the ever-expanding special case treatment of tasks on different architectures
for different one-benchmark workloads that do better, turning our mainline
balancing system into an unmanageable state machine.

Thus I devised a system whereby tasks that are running on a CPU but get
descheduled involuntarily in preference for another task instead of sleeping,
get flagged as "sticky". The reasoning being that if a task had not finished,
then it certainly was going to continue working once rescheduled. Then I added
a cpu_scaling flag which would tell the scheduler that a CPU had been throttled
from its maximum frequency. When one CPU is choosing its next task, if a task
is sticky on another CPU and the current CPU is throttled, it will not take
the task.
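Reduced to its essence, the decision is this (an illustrative one-liner, not the actual C):

```python
def will_take(task_sticky_on_other_cpu: bool, this_cpu_throttled: bool) -> bool:
    """A throttled CPU declines another CPU's sticky task, leaving it
    where its cache is warm so the busy CPU can ramp to full speed."""
    return not (task_sticky_on_other_cpu and this_cpu_throttled)
```

Only the combination of a sticky task elsewhere and a throttled chooser blocks the migration; in every other case the task is taken as before, so no CPU is left idle when there is non-sticky work to do.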

This extremely simple technique proved surprisingly good for throughput
without destroying latency. Further investigation revealed the same sticky
system could also be used to determine when to bias against tasks going next
on the local CPU even if the CPU wasn't throttled. So the existing "cache
distance" calculation was removed entirely and tasks that are sticky are just
seen as having soft affinity and biased against. This means that waking tasks
still get very low latency, but sticky tasks are very likely to go back to
their own CPU, while no CPU is left idle if there is work to do, _provided_
the CPU is not throttled.

The existing system for determining which idle CPU to bind to remains
unchanged for it has proven very effective and complementary to this added
change.

So the entire equivalent of a "balancing" mechanism for BFS consists of:
1. Flagging the last CPU bound task that was involuntarily descheduled as being
softly affined to its CPU, and biasing against it on other CPUs.
2. Choosing an idle CPU to wake up based on how busy that core is, and how
close it was to the original CPU it last ran on.
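Point 1 can be sketched as deadline selection with an offset. Hypothetical structure and numbers; BFS actually works on virtual deadlines inside the kernel:

```python
def pick_next(tasks, this_cpu, sticky_offset=1000):
    """Earliest-deadline-first, but a task softly affined to another CPU
    has its deadline offset, so it tends to wait for its own CPU unless
    nothing more urgent is runnable."""
    def effective(task):
        deadline = task["deadline"]
        if task.get("sticky_cpu") is not None and task["sticky_cpu"] != this_cpu:
            deadline += sticky_offset
        return deadline
    return min(tasks, key=effective)
```

So a task sticky on CPU 1 loses the tie on CPU 0 despite an earlier raw deadline, yet still wins on its own CPU; it is a bias, not a hard affinity.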

It was discovered the improved throughput meant I was able to further lower
the rr interval such that it is initially set to 6ms regardless of the number
of CPUs, thus improving latency further on SMP.

The resulting change to BFS means throughput with the ondemand CPU frequency
governor is now very close to that of the performance governor, power usage
may be decreased (theoretical, not yet proven), and throughput and latency on
the performance governor are better too. I would like to point out, however, that
some CPUs were not really designed with ondemand governing in mind, such as
the "extreme" desktop CPUs by Intel. These tend to only
have 2 frequencies, are slow to change from one frequency to another, and may
lie about which CPUs are throttled and actually throttle all of them. Given
how little power saving they really offer, I suggest anyone with a desktop CPU
test for themselves how well ondemand works, and decide whether to build the
governor into the kernel at all; leaving it out prevents your distribution
from using ondemand without your knowledge. I detest
the lack of configurability in newer distributions and GUIs, but that's another
story.

In order to quantify the changes in this BFS version, benchmarks were
performed by myself and by some very helpful members of the community.
Thanks to all those who tested along the way and helped it get this
far. The very simple, but reliable and repeatable benchmarks of kbuild at
various jobs and x264 were used to determine the effects on throughput, while
the wakeup-latency app by Jussi Laako was run in the background during the
benchmark as a rough guide to what effect the various loads had on latency.

Again I must stress, the purpose for BFS is to be used as a testing ground for
scheduling to see what can be done when the scheduler is specifically aimed at
commodity hardware for normal desktops and affordable servers. As such, it
should serve as a reference tool for comparison on these workloads. I'm
very pleased to see its uptake by the community, and that
mainline is occasionally reminded of the need to keep in touch with this
hardware.

To the server admins who have contacted me to tell me of their good
experiences with BFS, thank you, but I must stress again that I am unable to
implement extra features like CGROUPS that some of you have asked for. My
time is limited, and my work on the kernel completely unpaid so I can only
direct my attention to areas I personally have interest in.

EDIT: In response to questions from those who haven't been following the progress, yes this BFS is the Brain Fuck Scheduler.

Thursday, 7 April 2011

Quick lrzip comparison

So I was building a kernel package for myself and when I was done saw a large directory and thought, what the heck, I'll compress it with lrzip for grins.

bzip2 size: 1831647009 time: 12m50s (command "tar cjf")
7z    size:  945054166 time: 36m23s (command "7za a")
lrzip size:  586087630 time: 17m09s (command "lrztar")

These were done on a quad core 3GHz with 8GB ram using nothing but default options. Note that lrzip was run via the lrztar wrapper to make the comparison fair, since it's just a single command without temporary files, though that means it took 3 passes to compress. If I compressed with a temporary tar file it would be smaller again (i.e. tar cf first, then lrzip the resulting tarball).

Of course there is parallel bzip2, which speeds up bzip2's compression but has no effect on the compression ratio. And then there is xz, which slows down 7z's compression and likewise has no effect on the ratio. So why is xz becoming the new de facto standard? Probably because it's not aimed squarely at the windows market the way 7z is, and feels more politically correct to the linux crowd. Maybe it's that, because it's called lzma2, it must be better than lzma since it has a higher version number. I've never seen any performance or compression advantage of any significance with lzma2 versus lzma. Personally I'm relatively unimpressed with xz, so I don't really understand it. Even less understandable is why the kernel now has both lzma AND xz support in it.

I wish I could get people more excited about lrzip :)

BFS 0.376 test

TL;DR: Fastest BFS yet for SMP.

After extended testing of BFS 0.373, a number of minor issues came up, but the results were very promising. Now I believe I've addressed all the known issues in a newer version. Instead of flagging scaling CPUs by their governor alone, I now flag them as scaling only when they're actually throttled from maximum speed. This improves throughput further with the dynamic scaling governors like ondemand, bringing it very close to that of performance under full load. I also found that sticky flagged tasks were not keeping their sticky flags if they were rescheduled back to back; fixing that gave me even more of a performance boost in all situations. I addressed the oops that could occur on UP, and finally I updated the docs to match the changes in the scheduler design.

So hopefully this will be the last test patch (fingers crossed) before I make it official, because... I'm about >< close to burnout. That's not something I want to experience. Incremental for those on BFS 363 already: bfs363-376-test.patch

Full patch for 2.6.38ish:
2.6.38-sched-bfs-376.patch

Benchmarks as they come to hand...

---
x264 benchmarks Courtesy of Graysky:
Higher is better: boxplotencodethroughput.png
Lower is better: boxplotencodetime.png
CPU: Intel Xeon X3360 @ 8.5x400=3.40 GHz (4 cores/4 threads)
Linux version: Arch x86_64
x264 version: 0.114.x
handbrake version: svn3853
Base kernel version: 2.6.38.2
CK Patchset: CK1
Source video clip: 720p60 (1280x720) MPEG-PS @ 15 Mbps. 62 seconds long.
Run with ondemand multiplier, 5 times per kernel. Kernels use identical configs with exception of BFS version.
Handbrake CLI: --input test.m2ps --output output.mp4 --no-dvdnav --audio none --crop 0:0:0:0 --preset=Normal
---

Sunday, 3 April 2011

BFS 0.373 test

The BFS 0.372 test patch has proved quite a success. There have been no regressions in performance, with slight improvements even with the performance CPU governor, and better latency all round on SMP now. There were some rare crashes that I had to track down, and I believe I've fixed them all, so I'm releasing another test patch, 0.373, which addresses them and is otherwise the same as 0.372.

Apply to a 0.363 patched kernel:
bfs363-373-test.patch

Thanks to those who have tested and reported back so far!

UPDATE: Throughput benchmark results on a 6 core AMD courtesy of Serge Belyshev:
benchmark3.results-cfs-bfs363-bfs373.txt

I am planning on posting latency benchmarks soon too.

Friday, 1 April 2011

BFS 0.372 test patch

Another day, another BFS test release. This one builds on the ideas in the 0.371-test3 patch I posted about since they're proving very positive. No April fools here. This looks like it's kicking arse.

Apply to a BFS 0.363 patched kernel such as 2.6.38-ck1:

bfs363-372-test.patch

Changelog from the patch:
---
Add a "sticky" flag for one CPU bound task per runqueue which is used to flag
the last cache warm task per CPU. Use this flag to softly affine the task to
the CPU by not allowing it to move to another CPU when a scaling CPU frequency
governor is in use. This significantly improves throughput at lower loads by
allowing tasks to cluster on CPUs, thereby allowing the scaling governor to
speed up only that CPU. This should also save power. Use the sticky flag to
determine cache distance in earliest_deadline_task and abolish the
cache_distance function entirely. This has proven as effective, if not more so.

Add helpers to the 3 scaling governors to tell the scheduler when a CPU is
scaling.

Replace the frequent use of num_online_cpus() with a grq.noc variable that
is only updated when the number of online cpus changes.

Simplify resched_best_idle by removing the open coded for_each_cpu_mask as
it was not of proven benefit.

Remove warnings in try_to_wakeup_local that are harmless or never hit.

Clear the cpuidle_map bit only when edt doesn't return the idle task.

Abolish the scaled rr_interval by number of CPUs and now just use a fixed
nominal 6ms everywhere. The improved cache warmth of the sticky flag makes
this unnecessary and allows us to lower overall latencies on SMP by doing so.
---

Please test this one thoroughly. It's very stable and now heavily tested, but I won't announce any new "release" till it's been tested for maybe 5 days or more. It appears better in all workloads and with and without cpu frequency governors. Again, only SMP will benefit from this patch, but it should change behaviour in all SMP now, not just with ondemand. However, the scaling governors should show the most improvement.

The best example (with ondemand) was a single threaded cpu bound workload on an i7 2 core/4 thread machine that went from taking 126 seconds to complete down to 91.5 seconds. The 2 threaded workload dropped from 66 seconds to 51.5 seconds. Note that this more or less addresses a regression in BFS behaviour with cpu frequency scaling on SMP, but it's also been an opportunity to improve behaviour elsewhere.
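In relative terms (just arithmetic on the numbers quoted above):

```python
# Wall time saved as a percentage, from the figures in the post.
one_thread = (126 - 91.5) / 126 * 100   # single threaded workload
two_thread = (66 - 51.5) / 66 * 100     # 2 threaded workload
print(f"{one_thread:.0f}% and {two_thread:.0f}% less wall time")  # 27% and 22%
```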

BFS 0.371 test3

TL;DR I'd like more testing.

Here's a new lightly tested patch trying another simpler and cheaper approach to improving throughput with scaling CPU frequency governors (like ondemand) without the flaws of the previous approach.

I've enabled the changes at all times, not just when the ondemand governor is run, but again this change only affects SMP users. This test patch is only lightly tested, but I'd appreciate it if people gave it a bit of a run. Apply to a BFS 363 based kernel such as 2.6.38-ck1.

bfs363-371-test3.patch

Too tired to describe what it does right now... Zzzz

Tuesday, 29 March 2011

2.6.38-ck1, BFS 0.363

So I screwed up. Sorry!

BFS 370 causes some strange regressions, as per this blog and off-list reports. It appears that F@H, for example, doesn't scale to multiple CPUs under BFS 370. Some latency regressions were also reported here (and elsewhere). So I've decided to pull BFS 370 and 2.6.38-ck2, pending further investigation. There's only so much I can do without lots of people testing, I'm afraid; often I get 1, maybe 2 people testing before a "stable" release, and then about 10,000 downloads once the stable release comes out. So it was with test2/BFS 370. Anyway, the point is: go back to 2.6.38-ck1 or BFS 363 until I figure out what the problem was and decide whether this avenue is worth pursuing.

2.6.38-ck2, BFS 0.370

EDIT EDIT EDIT: This patch causes bizarre regressions and has been backed out. Consider 2.6.38-ck1 and BFS0.363 the stable releases, SORRY!

After more testing and cleaning up of the patch posted here earlier (test2), I've put it together as a new BFS release with almost trivial changes since that test2 patch. The changes are cosmetic only, apart from the removal of a warning that is occasionally hit and is now harmless on BFS.

Just to reiterate: unless you are on an SMP machine (2 or more threads or cores) AND are using a scaling CPU frequency governor (e.g. ondemand), there will be no significant performance advantage to upgrading to BFS 370 or ck2. For those with that combination, what you can expect to see is an increase in throughput on lightly loaded machines (single threaded apps are most affected) and likely an increase in battery life. Overall latency is unlikely to change, keeping interactivity much the same, but responsiveness should also increase. If you are unsure of the difference, read this summary I wrote for interbench:
readme.interactivity

When the kernel mirrors sync up, ck2 will be found here:
2.6.38-ck2
It applies with some minor offsets to 2.6.38.2 so you can safely apply it to that kernel if you like.

BFS is available here:
BFS

And Ubuntu packages of 2.6.35.11-ck2 and 2.6.38.2-ck1 which have the new BFS are now available here:
Ubuntu Packages

EDIT: People keep asking me why I've "optimised" only for SMP and ondemand. This is not the case at all. This patch addresses a performance regression that only affects that combination.

EDIT2: SEE ABOVE NOTICE! PATCH CONSIDERED BAD, GO BACK TO 2.6.38-ck1 and BFS 0.363 PLEASE!