In the spirit of BFS, I tried to find the cheapest possible way to do this, and decided to use the thread group_parent to store the number of active threads, and simply extend the deadline for each thread (relative to its own nice level). To my surprise, the group_parent was more than just the parent for threads that shared the same pid. The results were better than I could have possibly imagined. The changelog follows to explain:
Group tasks by their thread_group leader and distribute their CPU as though
they're one task.
The practical upshot of this is that each application is treated as one task
no matter how many threads it forks.
The significance of this is that massively multithreaded applications such as
java applications do not get any more cpu than if they're were not threaded.
The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have.
Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads. 'nice' is still respected.
For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).
This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.
It requires a change in mindset about how CPU is distributed in different
workloads but I believe will be ideal for the desktop user. Think of it as
implementing everything you want out of the complex CGROUPS, but without the
overhead or any input or effort on the user's part.
So anyway, as you can see this has a profound and dramatic effect. This has the smoothing effect at high loads that the latency_bias patch had, but without any slowdown whatsoever, nor any loss of throughput as well. For a while there I had forgotten I had left a make -j128 going, its effects were so insignificant. Of course if you start 128 makes instead of starting one make with 128 jobs, it will bring your machine to a halt much like the current scheduling does. It also means that a single cpuloop application (like 'yes > /dev/null') will get a whole CPU to itself on SMP even if you run your make with -j128. I need feedback and extensive testing on this one, but I'm keeping my fingers crossed and being cautiously optimistic. It will be interesting to see its effect on (regular) audio playback as well.
EDIT!: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
Here is a patch that can be applied to vanilla 220.127.116.11 which gives you BFS + this change:
Interbench results are now available with and without this patch and the results are dramatic to say the least: