Wednesday, 6 October 2010

Hierarchical tree-based penalty.

Further investigation at last reveals why the latest code affects program groups as well as threads, when threads alone were my original intention. The number of tasks running was being inherited across fork, meaning it was really acting as a branch penalty. That is, the more times a task has forked since init (the further along the branching), the greater the penalty applied to its deadline. So it's a hierarchical tree penalty. While there may be a real case for such a feature, it is not program grouping at all! Now I have to think hard about where to go from here. The thread grouping idea is still valid, and will work with this code corrected. However, the tree penalty may be worth pursuing as a separate concept, since it had such dramatic effects...

Somehow I found myself doing a lot more hacking than I had intended, and that's also why there's so much blogging going on from someone who hates blogs.

EDIT: Enough talk. Here's some code. A patch to apply to a BFS 357 patched kernel, with two knobs in /proc/sys/kernel/:
group_thread_accounting - groups CPU accounting by threads (default off)
fork_depth_penalty - penalises according to depth of forking from init (default on)
(Updated version):
bfs357-penalise_fork_depth_account_threads.patch

Update on scheduling automatically by group, interactive at any load.

As an update to my previous post, the testing so far on my group scheduling patch has been beyond all my expectations. There have been no reports of regressions so far, yet the improvements under load are outstanding. After thinking about the code some more, I felt one area needed a minor tweak. Currently, when tasks are queued for CPU, they're checked to ensure that their deadline is not far in the future due to being set when many more threads were running at once, and if so their deadline is capped to the maximum it could possibly be for that nice level. However, I realised that the cap should be based on the maximum the deadline could have been when it was first set, rather than on the current time. So I've made a minor modification to the patch to offset the cap from deadline_niffy.

There has also been a lot of confusion about what this means for throughput and CPU distribution on SMP so I've updated the patch changelog to include verbose examples. What follows is the updated changelog:
Group tasks by their group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it starts or children it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have. The same goes for any application with multiple threads or
processes.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads it starts. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of a more complex group scheduling
policy containing each application as an entity, but without the overhead or
any input or effort on the user's part.

Note that this does not have any effect on throughput either, unlike other
approaches to decreasing latency under load. Increasing the number of jobs up
to the number of CPUs
will still increase throughput if they're not competing with other
processes for CPU time.

To demonstrate the effect this will have, let's use the simplest example
of a dual core machine, and one fully CPU bound single threaded workload such
as a video encode that encodes 1000 frames per minute, competing with a 'make'
compilation. Let's assume that 'make -j2' completes in one minute, and
'make -j1' completes in 2 minutes.

Before this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 1.7 minutes, 650 frames are encoded per minute

make -j4 and encode:
make finishes in 1.25 minutes, 400 frames are encoded per minute

make -j24 and encode:
make finishes in 1.04 minutes, 40 frames are encoded per minute

make -j128 and encode:
make finishes in 1.01 minutes, 7 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.03 minutes, 30 frames are encoded per minute


After this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j4 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j24 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j128 and encode:
make finishes in 1.08 minutes, 150 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.06 minutes, 60 frames are encoded per minute

It's worth pointing out that this code is quite safe and stable, and provided no issues show up in testing, it will probably go into the next major revision of BFS. I'm thinking of a major version number update, given all the changes that have happened on top of 357 and this major change in behaviour. Please try it out and report back!

Patches follow, and I've updated the links in the previous blog post.

EDIT: Updated the tests according to real world testing.

EDIT2: Also, it's been suggested that being able to disable this might be helpful so whenever I get to the next version, I'll add a simple on/off sysctl.

EDIT3: It's also worth pointing out that as well as changing the concept of how CPU is distributed, this will also bring some unfairness as a side effect. 'make', for example, will always run like a niced task because it is always 2 or more processes (usually 4). This is actually why it gets 'contained' with this approach: the containment applies to make as a whole, not to each of its jobs individually.

EDIT4: Further investigation revealed this patch didn't work quite how I thought it did, but it still offers significant advantages. See the newer posts on hierarchical tree-based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Tuesday, 5 October 2010

Interactive at any load?

A funny thing happened on the way to creating a patch to test out a feature I've long been thinking about for BFS. I'd always thought it was wrong that applications that are massively multithreaded (java applications are often the biggest example, such as azureus/vuze) get way too much CPU because threads are treated no differently from processes on linux. So I'd been thinking for a while about a way to group the threads together as a single process for the purpose of CPU distribution. Today I finally decided to toy with an approach involving adjusting deadlines (since that's how CPU is distributed on BFS) according to how many active threads each thread group has running at any one time.

In the spirit of BFS, I tried to find the cheapest possible way to do this, and decided to use the thread group_parent to store the number of active threads, and simply extend the deadline of each thread (relative to its own nice level). To my surprise, the group_parent turned out to be more than just the parent of threads sharing the same pid. The results were better than I could have possibly imagined. The changelog follows to explain:

Group tasks by their thread_group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads it starts. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of the complex CGROUPS, but without the
overhead or any input or effort on the user's part.

So anyway, as you can see, this has a profound and dramatic effect. It has the smoothing effect at high loads that the latency_bias patch had, but without any slowdown whatsoever, nor any loss of throughput. For a while there I had forgotten I had left a make -j128 going, its effects were so insignificant. Of course if you start 128 makes instead of starting one make with 128 jobs, it will bring your machine to a halt much like the current scheduling does. It also means that a single cpuloop application (like 'yes > /dev/null') will get a whole CPU to itself on SMP even if you run your make with -j128. I need feedback and extensive testing on this one, but I'm keeping my fingers crossed and remaining cautiously optimistic. It will be interesting to see its effect on (regular) audio playback as well.


EDIT!: Further investigation revealed this patch didn't work quite how I thought it did, but it still offers significant advantages. See the newer posts on hierarchical tree-based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Interbench results are now available with and without this patch and the results are dramatic to say the least:

group_interbench_testing.txt

Monday, 4 October 2010

Going Magnum with BFS 357

As some of you may be aware, PCLinuxOS uses BFS by default. It is also a 32-bit distro for maximum compatibility with desktop software. So Texstar, who started PCLinuxOS, was keen to try the newer BFS and built a 2.6.32.x kernel with BFS 350 on 32 bits. He was able to reproduce strange slowdowns and stalls and reported this back to me. After reviewing my code it was clear I had screwed up with 32-bit builds, having moved to 64-bit niffy counters on BFS 350 and left a whole lot of ints and longs around that would overflow regularly. The ints would overflow even on 64-bit builds! Texstar has since confirmed for me that fixing the 32-bit variables fixed the problem for him (thanks!).

I've committed those 32-bit fixes, added some more sanity checking on the crazy sched_clock interface (using the jiffy difference to determine an upper bound), and added some minor macro cleanups. I've bumped the version number up to 0.357 just because it sounds good. Testing has been done on 32 bit, on uniprocessor, and on an older kernel. Hopefully this means good things for android too!

Changelog follows:
I forgot about an awful lot of longs and ints that will overflow on 32 bit now
with u64 deadlines. Fix them.

Add some macro tidiness.

Make sched_clock sanity checking robust and standardised, using jiffy
difference as upper limit, and use nominal 1us when difference cannot be
trusted.

Go magnum.

I've uploaded a full patch for 2.6.35.7 and an incremental from 350 to 357 and will be uploading patches for older kernels shortly. Grab it now and do your worst!

BFS patches

Hopefully I can take a break from hacking for a bit and get back to my billion other pastimes while this version distils for a while... Then again, a new "stable" kernel will probably be out soon.

UPDATE: I've now uploaded patches for previous kernels as far back as 2.6.31.14.