
Monday, 12 December 2016

linux-4.9-ck1, MuQSS version 0.150

Announcing a new -ck release, 4.9-ck1, with a new version of the Multiple Queue Skiplist Scheduler, version 0.150. These are patches designed to improve system responsiveness and interactivity, with specific emphasis on the desktop, but configurable for any workload.

linux-4.9-ck1

-ck1 patches:
http://ck.kolivas.org/patches/4.0/4.9/4.9-ck1/

Git tree:
https://github.com/ckolivas/linux/tree/4.9-ck

Ubuntu 16.04 LTS packages:
http://ck.kolivas.org/patches/4.0/4.9/4.9-ck1/Ubuntu16.04/

MuQSS

Download:
4.9-sched-MuQSS_150.patch

Git tree:
4.9-muqss


MuQSS 0.150 updates

Regarding MuQSS, apart from a resync with linux-4.9, which has numerous hotplug and cpufreq changes (again!), I've cleaned up the patch so that it no longer includes any Hz changes of its own, leaving the Hz choice up to users unless they use the -ck patchset.
Additionally, I've modified sched_yield yet again. Since different (inappropriate) users of sched_yield out there expect different behaviour, I've made it tunable via /proc/sys/kernel/yield_type and changed the default to what I believe should happen. From the documentation I added in Documentation/sysctl/kernel.txt:

yield_type: (MuQSS CPU scheduler only)

This determines what type of yield calls to sched_yield will perform.

 0: No yield.
 1: Yield only to better priority/deadline tasks. (default)
 2: Expire timeslice and recalculate deadline.

Previous versions of MuQSS defaulted to type 2 above. If you find behavioural regressions with any of your workloads try switching it back to 2.
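As a sketch of how you'd switch back, on a MuQSS kernel where the sysctl exists:

```shell
# Check the current yield behaviour (MuQSS kernels only):
cat /proc/sys/kernel/yield_type
# Revert to the old default (expire timeslice and recalculate deadline):
echo 2 | sudo tee /proc/sys/kernel/yield_type
# Equivalently, via sysctl:
sudo sysctl kernel.yield_type=2
```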

4.9-ck1 updates

Apart from resyncing with the latest trees from linux-bfq and wb-buf-throttling:
- Added a new kernel configuration option to enable threaded IRQs, set on by default
- Changed Hz to default to the safe 100 value, removing 128, which caused spurious issues and had no real-world advantage
- Fixed the build with MuQSS disabled (why you would use -ck and do that, I don't know)
- Made hrtimers not be used when we know we're in suspend, which may have caused suspend failures for drivers that did not use the correct freezable vs normal timeouts
- Enabled bfq and set it as the default
- Enabled writeback throttling by default

Enjoy!
お楽しみ下さい
-ck

Tuesday, 22 November 2016

linux-4.8-ck8, MuQSS version 0.144

Here's a new release to go along with and commemorate the 4.8.10 stable release (they're putting out stable releases faster than my development code now).

linux-4.8-ck8 patch:
patch-4.8-ck8.lrz

MuQSS by itself:
4.8-sched-MuQSS_144.patch

There are a small number of updates to MuQSS itself.
Notably, there's an improvement in interactive mode when SMT nice is enabled and/or realtime tasks are running, or CPU affinity is in use. Previously, tasks stuck behind one of those as the highest priority task would transiently be refused scheduling on those CPUs.
The old hacks for CPU frequency changes from BFS have been removed, leaving the tunables to default as per mainline.
The default of 100Hz has been removed, and in its place a new and recommended 128Hz has been implemented - this is just a small micro-optimisation to take advantage of the fast shifts that dividing by 128 has on CPUs compared to dividing by 100, and it is close enough to 100Hz to otherwise behave the same.
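The micro-optimisation is easy to demonstrate: dividing by a power of two costs only a bit shift, whereas dividing by 100 needs a real division. A quick illustration in Python:

```python
# Any unsigned division by 128 is exactly a right shift by 7 bits,
# which compilers emit instead of a slower division instruction.
ticks_per_sec = 128
ns = 1_000_000_000

assert ns // ticks_per_sec == ns >> 7  # same result, cheaper operation

# Division by 100 has no such shortcut; compilers fall back to
# multiply-by-reciprocal tricks or a division instruction.
print(ns // 128, ns // 100)
```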

For the -ck patch only, I've reinstated updated and improved versions of the high resolution timeouts, to improve the behaviour of userspace that is inappropriately Hz-dependent, so that low Hz choices do not hurt latency.
Additionally by request I've added a couple of tunables to adjust the behaviour of the high res timers and timeouts.
/proc/sys/kernel/hrtimer_granularity_us
and
/proc/sys/kernel/hrtimeout_min_us

Both of these are in microseconds and can be set from 1 to 10,000. The first controls how accurate high resolution timers are in the kernel and defaults to 100us (on mainline it is Hz accuracy).
The second is the smallest duration a generic "minimum timeout" request in kernel code can be, and defaults to 1000us (on mainline it is one tick).

I doubt you'll find anything useful by tuning these but feel free to go nuts. Decreasing the second tunable much further risks breaking some driver behaviour.
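For reference, here's how the two tunables might be inspected and adjusted (values in microseconds, within the 1-10,000 range described above):

```shell
# Current values (defaults: 100 and 1000 microseconds):
cat /proc/sys/kernel/hrtimer_granularity_us
cat /proc/sys/kernel/hrtimeout_min_us

# Example: make high-res timers accurate to 50us...
echo 50 | sudo tee /proc/sys/kernel/hrtimer_granularity_us
# ...but be wary of lowering the minimum timeout, which risks driver breakage:
echo 500 | sudo tee /proc/sys/kernel/hrtimeout_min_us
```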

Enjoy!
お楽しみ下さい
-ck

Monday, 24 October 2016

Interbench benchmarks for MuQSS 116

As mentioned in my previous post, I recently upgraded interbench, a benchmark application I invented/wrote to assess perceptible latency in the presence of various loads. The updates make the results meaningful on today's larger-RAM, multicore machines, where the load scales accordingly.

The results for mainline 4.8.4 and 4.8.4-ck4 on a multithreaded hexcore (init 1) can be found here:
 http://ck.kolivas.org/patches/muqss/Benchmarks/20161024/
and are copied below. I do not have swap on this machine so the "memload" test was not performed. This is a 3.6GHz hexcore with 64GB RAM and fast Intel SSDs, so showing any difference on this machine is pleasing. To make it easier to read, I've highlighted the results in colours similar to the throughput benchmarks I posted previously: blue means within 1% of each other, red means significantly worse, and green significantly better.


Load set to 12 processors

Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4 at datestamp 201610242116
Comment: cfs

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.1 +/- 0.1        0.1           100         100
Video      0.0 +/- 0.0        0.1           100         100
X          0.1 +/- 0.1        0.1           100         100
Burn       0.0 +/- 0.0        0.0           100         100
Write      0.1 +/- 0.1        0.1           100         100
Read       0.1 +/- 0.1        0.1           100         100
Ring       0.0 +/- 0.0        0.1           100         100
Compile    0.0 +/- 0.0        0.0           100         100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.1 +/- 0.1        0.1           100         100
X          0.1 +/- 0.1        0.1           100         100
Burn      17.4 +/- 19.5      46.3            87        7.62
Write      0.1 +/- 0.1        0.1           100         100
Read       0.1 +/- 0.1        0.1           100         100
Ring       0.0 +/- 0.0        0.0           100         100
Compile   17.4 +/- 19.1      45.9          89.5        6.07

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.1        1.0           100        99.3
Video     13.4 +/- 25.8      68.0          36.2        27.3
Burn      94.4 +/- 127.0    334.0          12.9        4.37
Write      0.1 +/- 0.4        4.0          97.4        96.4
Read       0.1 +/- 0.7        4.0          96.2        93.8
Ring       0.5 +/- 1.9        9.0          89.3        84.9
Compile   93.3 +/- 127.7    333.0          12.2         4.2

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU
None       0.0 +/- 0.2        2.2           100
Video      7.9 +/- 21.4      69.3          92.7
X          1.4 +/- 1.6        2.7          98.7
Burn     136.5 +/- 145.3    360.8          42.3
Write      1.8 +/- 2.0        4.4          98.2
Read      11.2 +/- 20.3      47.8          89.9
Ring       8.1 +/- 8.1        8.2          92.5
Compile  152.3 +/- 166.8    346.1          39.6
Load set to 12 processors

Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4-ck4+ at datestamp 201610242047
Comment: muqss116-int1

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.0        0.0           100         100
Video      0.0 +/- 0.0        0.0           100         100
X          0.0 +/- 0.0        0.0           100         100
Burn       0.0 +/- 0.0        0.0           100         100
Write      0.0 +/- 0.0        0.1           100         100
Read       0.0 +/- 0.0        0.0           100         100
Ring       0.0 +/- 0.0        0.0           100         100
Compile    0.0 +/- 0.1        0.8           100         100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.0        0.0           100         100
X          0.0 +/- 0.0        0.0           100         100
Burn       3.1 +/- 7.2       17.7           100        81.6
Write      0.0 +/- 0.0        0.5           100         100
Read       0.0 +/- 0.0        0.0           100         100
Ring       0.0 +/- 0.0        0.0           100         100
Compile   10.5 +/- 13.3      19.7           100        37.3

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.1        1.0           100        99.3
Video      3.7 +/- 12.1      56.0            89        82.6
Burn      47.2 +/- 66.5     142.0          16.7        7.58
Write      0.1 +/- 0.5        5.0          97.7        95.7
Read       0.1 +/- 0.7        4.0          95.6        93.5
Ring       0.5 +/- 1.9       12.0          89.8          86
Compile   55.9 +/- 77.6     196.0          18.6        8.12

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU
None       0.0 +/- 0.1        0.5           100
Video      1.2 +/- 1.2        1.8          98.8
X          1.4 +/- 1.6        2.9          98.7
Burn     130.9 +/- 132.1    160.3          43.3
Write      2.4 +/- 2.5        7.0          97.7
Read       3.2 +/- 3.2        3.6          96.9
Ring       5.9 +/- 6.2       10.3          94.4
Compile  146.5 +/- 149.3    209.2          40.6

As you can see, the only times mainline is better, the difference between them is less than 1%, which is within the margin of noise. MuQSS meets more deadlines, gives the benchmarked task more of its desired CPU, and has substantially lower maximum latencies.

I'm reasonably confident that I've maintained the interactivity people have come to expect from BFS in the transition to MuQSS, and the data above supports it.

Enjoy!
お楽しみ下さい
-ck

linux-4.8-ck4, MuQSS CPU scheduler v0.116

Yet another bugfix release for MuQSS and the -ck patchset, with one of the most substantial latency fixes yet. Everyone should upgrade if they're on a previous 4.8 patchset of mine. Sorry about the frequency of these releases, but I just can't allow a known buggy release to be the latest version.

4.8-ck4 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck4/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_116.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_116.patch

I'm hoping this is the release that allows me to not push any more -ck versions out till 4.9 is released since it addresses all remaining issues that I know about.

A lingering bug that had been troubling me for some time was leading to occasional massive latencies. Thanks to some detective work by Serge Belyshev, I was able to narrow it down to a single-line fix which dramatically improves measured worst-case latency. Throughput is virtually unchanged. The flow-on effect to other areas was also apparent, with occasionally unused CPU cycles and weird stalls on some workloads.

sched_yield was reverted to the old BFS mechanism again, which GPU drivers prefer, but it wasn't working previously on MuQSS because of the first bug. The difference is substantial now, and drivers (such as the proprietary nvidia driver) and apps that call it a lot (such as the Folding@home client) behave much better.

The late introduced bugs that got into ck3/muqss115 were reverted.

The results come up quite well now with interbench (my latency-under-load benchmark), which I have recently updated and which should now give sensible values:

https://github.com/ckolivas/interbench

If you're baffled by interbench results, the most important number is % deadlines met, which should be as close to 100% as possible, followed by max latency, which should be as low as possible, for each section. In the near future I'll announce an official new release version.

Pedro, in the comments section of a previous post, was using runqlat from the bcc tools to test latencies as well, but after some investigation it became clear to me that the tool was buggy and did not work properly with BFS/MuQSS either, so I've provided a slightly updated version here which should work properly:

runqlat.py

Enjoy!
お楽しみ下さい
-ck

Tuesday, 18 October 2016

First MuQSS Throughput Benchmarks

The short version graphical summary:



Red = MuQSS 112 interactive off
Purple = MuQSS 112 interactive on
Blue = CFS

The detail:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

I went on a journey looking for meaningful benchmarks to assess the scalability aspect as far as I could on my own 12x machine, and was really quite depressed to see what the benchmark situation on Linux is like. Only old and completely invalid benchmarks seem to still be hanging around on public sites and promoted, like Reaim, aim7, dbench, volanomark, etc., and none of those are useful scalability benchmarks. Even more depressing, the only ones with any reputation are commercial benchmarks costing hundreds of dollars.

This made me wonder out loud just how the heck mainline is even doing scalability improvements if there are precious few valid benchmarks for linux and no one's using them. The most promising ones, like mosbench, need multiple machines and quite a bit of set up to get them going.

I spent a day wading through the phoronix test suite - a site and suite not normally known for meaningful high performance computing discussion and benchmarks - looking for benchmarks that could give meaningful results for multicore scalability assessment and were not too difficult to deploy, and came up with the following collection:

John The Ripper - a CPU bound application that is threaded to the number of CPUs and intermittently drops to one thread making for slightly more interesting behaviour than just a fully CPU bound workload.

7-Zip Compression - a valid real world CPU bound application that is threaded but rarely able to spread out to all CPUs making it an interesting light load benchmark.

ebizzy - This emulates a heavy content delivery server load which scales beyond the number of CPUs and emulates what goes on between a http server and database.

Timed Linux Kernel Compilation - A perennial favourite because it is a real world case and very easy to reproduce. Despite numerous complaints about its validity as a benchmark, it is surprisingly consistent in its results and tests many facets of scalability, though it does not scale to use all CPUs at all times either.

C-Ray - A ray tracing benchmark that uses massive threading per CPU and is completely CPU bound but overloads all CPUs.

Primesieve - A prime number generator that is threaded to the number of CPUs exactly, is fully CPU bound and is cache intensive.

PostgreSQL pgbench - A meaningful database benchmark that is done at 3 different levels - single threaded, normal loaded and heavily contended, each testing different aspects of scalability.

And here is a set of results comparing 4.8.2 mainline (labelled CFS), MuQSS 112 in interactive mode (MuQSS-int1) and MuQSS 112 in non-interactive mode (MuQSS-int0):

http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

It's worth noting that there is quite a bit of variance in these benchmarks and some are bordering on the difference being just noise. However, there is a clear pattern: when load is light, CFS outperforms MuQSS in terms of throughput; when load is heavy, and increasingly so the heavier it gets, MuQSS outperforms CFS, especially in non-interactive mode. As a friend noted, for the workloads where you wouldn't run MuQSS in interactive mode, such as a web server or database, non-interactive mode is a clear performance benefit. So at least on the hardware available to me, a 12x machine, MuQSS scales better than mainline on these workloads as load increases.

The obvious question people will ask is why MuQSS doesn't perform better at light loads, and in fact I have an explanation. The reason is that mainline tends to cling to processes much more, so that if it is hovering at a low number of active processes, they'll all cluster on one CPU or a few CPUs rather than being spread out everywhere. This means the CPU benefits more from the turbo modes virtually all newer CPUs have, but it comes at a cost: latency to tasks is greater because they're competing for CPU time on fewer busy CPUs rather than spreading out to idle cores or threads. It is a design decision in MuQSS, taken from BFS, to always spread out to any idle CPUs if they're available, to minimise latency, and that's one of the reasons for the interactivity and responsiveness of MuQSS. Of course I am still investigating ways of closing that gap further.

Hopefully I can get some more benchmarks from someone with even bigger hardware, and preferably with more than one physical package since that's when things really start getting interesting. All in all I'm very pleased with the performance of MuQSS in terms of scalability on these results, especially assuming I'm able to maintain the interactivity of BFS which were my dual goals.

There is MUCH more to benchmarking than pure CPU throughput - which is almost the only thing these benchmarks are checking - but that's what I'm interested in here. I hope that providing my list of easy-to-use benchmarks and the reasoning behind them can generate interest in some kind of meaningful standard set of benchmarks. I did start out in kernel development originally after writing benchmarks and being a benchmarker :P

To aid that, I'll give simple instructions here for how to approximately imitate the benchmarks and get results like those I've produced above.

Download the phoronix test suite from here:
http://www.phoronix-test-suite.com/

The generic tar.gz is perfectly fine. Then extract it and install the relevant benchmarks like so:

tar xf phoronix-test-suite-6.6.1.tar.gz
cd phoronix-test-suite
./phoronix-test-suite install build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve
./phoronix-test-suite default-run build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve


Now obviously this is not ideal, since you shouldn't run benchmarks in a multiuser session with Xorg and all sorts of other crap running, so I always run benchmarks at init level 1.
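One way to get there, assuming a sysvinit-style setup like the Ubuntu packages above (on systemd machines, rescue.target is the rough equivalent):

```shell
# Drop to single-user mode for benchmarking (sysvinit):
sudo telinit 1
# Rough systemd equivalent:
sudo systemctl isolate rescue.target
```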

Enjoy!
お楽しみ下さい
-ck

Friday, 7 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.108

A new version of the MuQSS CPU scheduler

Incrementals and full patches available for 4.8 and 4.7 respectively here:
http://ck.kolivas.org/patches/muqss/4.0/4.8/


http://ck.kolivas.org/patches/muqss/4.0/4.7/

Yet more minor bugfixes and some important performance enhancements.

This version brings to the table the same locking scheme for waking tasks as mainline, which is advantageous on process-busy workloads with many CPUs. This is important because the main reason for moving to multiple runqueues was to minimise contention on the global runqueue lock in BFS (as mentioned here numerous times before), and this wakeup scheme helps make the most of the multiple discrete runqueue locks.

Note this change is much more significant than those in the last few releases, so new instability is a possibility. Please report any problems or stacktraces!

There was a workload, when I started out, that I debugged with lockstat to get an idea of how much lock contention was going on and how long it lasted. Originally, with the first incarnations of MuQSS, a 14-second benchmark with thousands of tasks on a 12x CPU obtained 3 million locks and had almost 300k contentions, the longest lasting 80us. Now the same workload grabs the lock just 5k times with only 18 contentions in total, the longest lasting 1us.

This clearly demonstrates that the target endpoint for avoiding lock contention has been achieved. It does not translate into performance improvements on ordinary hardware today, because you need ridiculous workloads on many CPUs to even begin deriving an advantage from it. However, as even our phones have now reached 8 logical CPUs, it is only a matter of time before 16 threads appear on commodity hardware - a complaint directed at BFS when it came out 7 years ago, though they still haven't appeared just yet. BFS was shown to be scalable for all workloads up to 16 CPUs, and beyond for certain workloads, but suffered dramatically on others. MuQSS now makes it possible for what was BFS to be useful much further into the future.

Again - MuQSS is aimed primarily at desktop/laptop/mobile device users for the best possible interactivity and responsiveness, and is still very simple in its approach to balancing workloads to CPUs so there are likely to be throughput workloads on mainline that outperform it, though there are almost certainly workloads where the opposite is true.

I've now addressed all planned changes to MuQSS and plan to hopefully only look at bug reports instead of further development from here on for a little while. In my eyes it is now stable enough to replace BFS in the next -ck release barring some unexpected showstopper bug appearing.

EDIT: If you blinked you missed the 107 announcement which was shortly superseded by 108.

EDIT2: Always watch the pending directory for updated pending patches to add.
http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/

Enjoy!
お楽しみ下さい
-ck

Tuesday, 15 December 2015

BFS 467, linux-4.3-ck3

Announcing an updated BFS for linux-4.3 based kernels.

BFS by itself:
4.3-sched-bfs-467.patch

-ck branded linux-4.3-ck3 patches:
4.3-ck3

After my initial enthusiasm about the improved throughput in the previous BFS release, I unfortunately had some reports of regressions in interactive behaviour for the first time in a while, both on this forum and through other channels. So, for the first time in a while, I've released a -ck3 with yet another updated BFS to address this, since I can't stand having a dodgy release out for any extended period.

With this release, what I've done is reinstate an old tunable that used to be in my scheduler patches many years ago:

/proc/sys/kernel/interactive

This has two settings: 1 for on and 0 for off. By default - of course, since this is BFS - it is set to 1. It prioritises latency over throughput in mode 1 and vice versa in mode 0. In addition to addressing the latency issues in the previous kernel, mode 1 now completely turns off all soft affinity scheduling in the kernel for the lowest possible latencies all round, so this may also be the first kernel in a while with an improvement in latency.

Bear in mind none of these changes make any difference on uniprocessor kernels so there is no need for UP users to update unless they need the build fixes that came with BFS466+.

Amusingly enough, linux-4.3.2 wouldn't boot for me for unrelated reasons so I'm using 4.3.0-ck3 myself. So if you can't get 4.3.2 to boot, roll back.

Enjoy!
お楽しみください

Monday, 11 October 2010

Further updates on hierarchical tree-based penalty.

First of all, no new patch today, yay \o/ I know it's sometimes hard to keep up but that's the nature of things. I (sort of kind of hope that I can) promise not to release anything much new any time soon with respect to the hierarchical tree-based penalty code.

I experimented with removing the separation of processes from threads, treating them all as equal, and discovered that this led to real lag in some GUI applications, as many of them use threads. So it seems the default of penalising fork depth while not penalising threads works best at biasing CPU distribution for interactivity and responsiveness (which is the default in the current patch). This is rather ironic, as this code evolved out of an initial attempt to control thread behaviour in massively threaded applications, yet it turns out that being nice to threaded apps works better for the desktop.

I thought up another way of demonstrating the effect this patch has in a measurable way.

Using a dual core machine as an example, running the "browser benchmark" at http://service.futuremark.com/peacekeeper/index.action allowed me to show the effect of the gold-standard load, make, versus the almost universal GUI app, a browser.

The benchmark runs a number of different browser based workloads, and gives a score in points, where higher is better.

Running the benchmark under various different loads with the feature enabled / disabled gave me the following results:

Disabled:
Baseline: 2437
make -j2: 1642
make -j24: 208
make -j42: Failed

Enabled:
Baseline: 2437
make -j2: 2293
make -j24: 2187
make -j42: 1626

As can be seen, on the dual core machine a load of 2 normally makes the benchmark run almost precisely 1/3 slower, as would be expected with BFS's fair CPU distribution of 3 processes between 2 CPUs. Enabling this feature lets the benchmark proceed almost unaffected at this load, and only once the load is more than 20 times higher does it hinder the benchmark to the same degree. At that load with the feature disabled, the browser just spat out 'a script on this page is causing the browser to run slowly' messages and virtually gave up.
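The 1/3 figure follows directly from fair-share arithmetic: 3 CPU-bound tasks sharing 2 CPUs fairly each receive 2/3 of a CPU, so a score proportional to CPU time should drop to 2/3 of baseline, which closely matches the measured result:

```python
cpus, tasks = 2, 3          # benchmark + make -j2 on a dual core
fair_share = cpus / tasks   # fraction of one CPU each task receives

baseline = 2437             # unloaded Peacekeeper score
predicted = baseline * fair_share

print(round(predicted))     # ~1625, versus 1642 measured under make -j2
```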


In the last few days, most of the reports have been very positive. However, as expected, not everything is rosy. There have been reports of applications such as mplayer stalling, and some people have had gnome applets fail to initialise!? How on earth a scheduling decision about who goes first can cause these is... well, it's not a mystery. In my experience it's because some assumption has been made in the userspace application that naively expects a certain behaviour: that one process will run first. These sorts of bugs, although likely the fault of the userspace application itself, make changes of this nature impossible, or at least foolish, as the scheduler default. So as much as I'd like to see this change go into the next -ck release and be the default, I can't.

The patch will still be around to play with and I rather like it on my own desktop so I'm not throwing it out any time soon. Maybe something else will come of it in the future. But now I can relax and just sync up with mainline again when 2.6.36 final comes out.

Friday, 8 October 2010

Hierarchical tree-based penalty, interactivity at massive load, updated.

I've updated the patch for fork_depth based penalty with some minor tweaks. The default fork depth is now allowed to be zero for init, so the base system has no fork depth penalty. Userspace reporting of "PRI" to top, ps etc. was modified to make more sense when the penalty is enabled. Thread group accounting was fixed to offset only by the absolute deadline, instead of by a deadline already penalised by fork_depth. The updated changelog follows:
Make it possible to have interactivity and responsiveness at very high load
levels by making deadlines offset by the fork depth from init. This has a
similar effect to 'nice'ing loads that are fork heavy. 'make' is a perfect
example of this and will, with fork_depth_penalty enabled, be felt as much
at 'make -j24' as it normally would be with just 'make'.

Note that this drastically affects CPU distribution, and also has the
indirect side effect of partitioning CPU entitlement to different users as
well. No assumption as to CPU distribution should be made based on past
behaviour.

This is achieved by separating out forks to new processes vs new threads.
When a new process is detected, its fork depth is inherited from its parent
across fork() and then is incremented by one. That fork_depth is then used
to cause a relative offset of its deadline.

This feature is enabled in this patch by default and can be optionally
disabled.

Threads are kept at the same fork_depth as their parent process, and can
optionally have their CPU entitlement all managed as one process together
by enabling the group_thread_accounting feature. This feature is disabled
by default in this patch, as many desktop applications such as firefox,
amarok, etc are multithreaded. By disabling this feature and enabling the
fork_depth_penalty feature (default) it favours CPU towards desktop
applications.

Extensive testing is required to ensure this does not cause regressions in
common workloads.

There are two sysctls to enable/disable these features.

They are in /proc/sys/kernel/

group_thread_accounting - groups CPU accounting by threads
fork_depth_penalty - penalises according to depth of forking from init
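The inheritance rule in the changelog can be sketched as a toy model (this is illustrative only, not the kernel code; Task, spawn and deadline_offset are hypothetical names):

```python
class Task:
    def __init__(self, fork_depth=0):
        self.fork_depth = fork_depth

    def deadline_offset(self, base_quantum):
        # Deeper forks from init get a relatively later deadline.
        return self.fork_depth * base_quantum

def spawn(parent, thread=False):
    # New processes inherit the parent's depth plus one;
    # threads stay at the same depth as their parent process.
    return Task(parent.fork_depth if thread else parent.fork_depth + 1)

init = Task(0)                         # fork depth zero: no penalty
shell = spawn(init)                    # depth 1
make = spawn(shell)                    # depth 2
compiler = spawn(make)                 # depth 3: 'make -j24' jobs sit here
ui_thread = spawn(shell, thread=True)  # depth 1: threads are not penalised further

print(make.fork_depth, compiler.fork_depth, ui_thread.fork_depth)
```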

An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with offsets:
bfs357-penalise_fork_depth_account_threads.patch

EDIT: Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

EDIT2: I notice some people are trying this patch and the earlier released group as entities patch and trying to compare them. Let me make it clear, this patch REPLACES the group as entities patch and does exactly the same thing, only updated.

I am still after feedback on this approach as for my workloads it's only advantageous so I'd love it if more people would report back their experiences, either in the comments here or via email to me at kernel@kolivas.org . Thanks!

Wednesday, 6 October 2010

Hierarchical tree-based penalty.

Further investigation at last reveals why the latest code affects program groups as well as threads, when threads alone were my original intention. The number of tasks running was being inherited across fork, meaning it was really acting as a branch penalty. That is, the more times a task has forked from init (the further along the branching), the greater the penalty to its deadline. So it's a hierarchical tree penalty. While there may be a real case for such a feature, it is not program grouping at all! Now I have to think hard about where to go from here. The thread grouping idea is still valid, and will work with this code corrected. However, the tree penalty may be worth pursuing as a separate concept, since it had such dramatic effects...

Somehow I found myself doing a lot more hacking than I had intended, and that's also why there's so much blogging going on from someone who hates blogs.

EDIT: Enough talk. Here's some code. Patch to apply to a BFS 357 patched kernel, with two knobs in /proc/sys/kernel/
group_thread_accounting - groups CPU accounting by threads (default off)
fork_depth_penalty - penalises according to depth of forking from init (default on)
(Updated version):
bfs357-penalise_fork_depth_account_threads.patch

Update on scheduling automatically by group, interactive at any load.

As an update to my previous post, the testing so far on my group scheduling patch has exceeded all my expectations. There have been no reports of regressions, yet the improvements under load are outstanding. After thinking about the code some more, I felt one area needed a minor tweak. Currently, when tasks are queued for CPU, they're checked to ensure their deadline is not far in the future due to being set when many more threads were running at once, and their deadline is capped to the maximum it could possibly be for that nice level. However, I realised that the cap should be based on the maximum the deadline could have been when it was first set, rather than on the current time. So I've made a minor modification to the patch to offset it from the deadline_niffy.

There has also been a lot of confusion about what this means for throughput and CPU distribution on SMP so I've updated the patch changelog to include verbose examples. What follows is the updated changelog:
Group tasks by their group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it starts or children it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have. The same goes for any application with multiple threads or
processes.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
or threads it spawns. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of a more complex group scheduling
policy containing each application as an entity, but without the overhead or
any input or effort on the user's part.

Note that this does not have any effect on throughput either, unlike other
approaches to decreasing latency under load. Increasing jobs up to the number of CPUs
will still increase throughput if they're not competing with other
processes for CPU time.

To demonstrate the effect this will have, let's use the simplest example
of a dual core machine, and one fully CPU bound single threaded workload such
as a video encode that encodes 1000 frames per minute, competing with a 'make'
compilation. Let's assume that 'make -j2' completes in one minute, and
'make -j1' completes in 2 minutes.

Before this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 1.7 minutes, 650 frames are encoded per minute

make -j4 and encode:
make finishes in 1.25 minutes, 400 frames are encoded per minute

make -j24 and encode:
make finishes in 1.04 minutes, 40 frames are encoded per minute

make -j128 and encode:
make finishes in 1.01 minutes, 7 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.03 minutes, 30 frames are encoded per minute


After this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j4 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j24 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j128 and encode:
make finishes in 1.08 minutes, 150 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.06 minutes, 60 frames are encoded per minute
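As a rough sanity check, the figures above can be approximated with a naive fair-share model (a sketch, not the scheduler's actual arithmetic): two CPUs, the encode producing 1000 frames/min when it has a CPU to itself. Before the patch every runnable task is an equal entity; after it, each application (make, encode) counts as one. The model matches the measured low-job-count figures; the real numbers at high job counts fall further due to scheduling and cache overhead.

```c
/* Naive back-of-the-envelope model of the 2-CPU example above.
 * grouped == 0 models the pre-patch scheduler (every task an entity),
 * grouped == 1 models the patched one (every application an entity). */
int encode_rate(int make_jobs, int grouped)
{
	int entities = grouped ? 2 : make_jobs + 1;	/* make + encode */

	if (entities <= 2)		/* encode gets a whole CPU to itself */
		return 1000;
	return 2 * 1000 / entities;	/* encode's fair share of the 2 CPUs */
}
```

For example, encode_rate(4, 0) gives 400 frames/min, matching the pre-patch 'make -j4 and encode' line, while any grouped case yields the full 1000.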

It's worth pointing out that this code is quite safe and stable, and provided no issues show up in testing, it will probably go into the next major revision of BFS. I'm thinking of a major version number update, given all the changes that have happened to 357 and this major change in behaviour. Please try it out and report back!

Patches follow, and I've updated the links in the previous blog post.

EDIT: Updated the tests according to real world testing.

EDIT2: Also, it's been suggested that being able to disable this might be helpful so whenever I get to the next version, I'll add a simple on/off sysctl.

EDIT3: It's also worth pointing out that, as well as changing the concept of how CPU is distributed, this will bring some unfairness as a side effect. 'make', for example, will always run like a niced task, since it always consists of 2 or more processes (usually 4). This is actually why it gets 'contained' with this approach: it is the application as a whole that is contained, not each individual job.

EDIT4: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Tuesday, 5 October 2010

Interactive at any load?

A funny thing happened on the way to creating a patch to test out a feature I've long been thinking about for BFS. I'd always thought it was wrong that applications that are massively multithreaded (java applications are often the biggest example, such as azureus/vuze) get way too much CPU because threads are treated no differently to processes on linux. So I'd been thinking for a while for a way to group the threads together as a single process for the purpose of CPU distribution. Today I finally decided to toy with an approach involving adjusting deadline (since that's how CPU is distributed on BFS) according to how many active threads each thread group has running at any one time.

In the spirit of BFS, I tried to find the cheapest possible way to do this, and decided to use the thread group_parent to store the number of active threads, and simply extend each thread's deadline accordingly (relative to its own nice level). To my surprise, the group_parent turned out to be more than just the parent of threads sharing the same pid. The results were better than I could have possibly imagined. The changelog follows to explain:
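A minimal sketch of the deadline adjustment just described, with made-up names (the real BFS deadline is niffies plus a prio_ratios[] scaling of rr_interval): the change amounts to scaling a task's nice-relative deadline offset by the number of threads its group currently has running, so the whole group competes for CPU as if it were a single task.

```c
typedef unsigned long long u64;

/* Illustrative only: prio_offset stands in for the nice-level-dependent
 * deadline offset; group_running is the active-thread count stored on
 * the thread group leader.  With one running thread this degenerates to
 * the normal per-task deadline. */
u64 grouped_deadline(u64 niffy, u64 prio_offset, int group_running)
{
	return niffy + prio_offset * (u64)group_running;
}
```

Pushing each thread's deadline further out in proportion to its siblings means the group collectively gets roughly one task's worth of CPU, which is the whole point.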

Group tasks by their thread_group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
or threads it spawns. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of the complex CGROUPS, but without the
overhead or any input or effort on the user's part.

So anyway, as you can see this has a profound and dramatic effect. It has the smoothing effect at high loads that the latency_bias patch had, but without any slowdown whatsoever and no loss of throughput either. For a while there I had forgotten I had left a make -j128 going, its effects were so insignificant. Of course if you start 128 makes instead of starting one make with 128 jobs, it will bring your machine to a halt much like the current scheduling does. It also means that a single cpuloop application (like 'yes > /dev/null') will get a whole CPU to itself on SMP even if you run your make with -j128. I need feedback and extensive testing on this one, but I'm keeping my fingers crossed and being cautiously optimistic. It will be interesting to see its effect on (regular) audio playback as well.


EDIT!: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Interbench results are now available with and without this patch and the results are dramatic to say the least:

group_interbench_testing.txt