Saturday, 16 October 2010

2.6.36-rc8-ck1

So another week passes, and my attempt to minimise my workload by syncing up with the apparently last -rc for 2.6.36 was only a mild failure, with a new "release candidate" coming out. (Does anyone else have a problem with Linus calling his pre-releases "release candidates"? It still annoys the hell out of me.) The reason it was only a mild failure for me is that the patches from 2.6.36-rc7-ck1 pretty much apply cleanly to 2.6.36-rc8.

So I've resynced all the 2.6.36-rc7-ck1 patches, and added a couple of things.

Firstly, I added a tiny patch which decreases the default vm dirty_ratio from 20 to 5. Here is the changelog from the patch:
The default dirty ratio is chosen to be a compromise between throughput and
overall system latency. On a desktop, if an application writes to disk a lot,
that application should be the one to slow down rather than the desktop as a
whole. At higher dirty ratio settings, an application could write a lot to
disk and then happily use lots of CPU time after that while the rest of the
system is busy waiting on that naughty application's disk writes to complete
before anything else can happen.

Lower ratios mean that applications that do a lot of disk writes end up
being responsible for their own actions and they're the ones that slow down
rather than the system in general.

This does decrease overall write throughput slightly, but to the benefit of
the latency of the system as a whole.
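For those who'd rather not patch, the same default can be set at runtime via sysctl; this is just the standard mainline vm tunable, nothing specific to -ck:

```shell
# Show the current value (the mainline default is 20):
cat /proc/sys/vm/dirty_ratio

# Lower it to 5, matching this patch (needs root):
sysctl -w vm.dirty_ratio=5

# Or make it persistent across reboots:
echo "vm.dirty_ratio = 5" >> /etc/sysctl.conf
```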

The only other changes are to fold the build fixes into BFS, fix minor typos in the documentation of the BFS 357 patch, and add the bfs357-penalise_fork_depth.patch and bfs357-group_thread_accounting.patch patches as separate entities, but DISABLED by default. The effect of these patches has been discussed at great length on this blog before. See the tunables in /proc/sys/kernel to enable them. I'm pretty sure these patches will be dropped for 2.6.36-ck1 final due to the handful of regressions seen to date.

As per last time, the patches themselves are sneakily hidden within .lrz archives, which means you'll have to suffer the pain of installing my lrzip application to use them. The patches are available here: 2.6.36 prerelease patches
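Unpacking and applying goes something like the following; the filenames here are illustrative (check the directory for the real ones), and lrunzip ships as part of lrzip:

```shell
# Decompress the archive (lrunzip comes with the lrzip package):
lrunzip patch-2.6.36-rc8-ck1.lrz

# Apply on top of a clean 2.6.36-rc8 tree:
cd linux-2.6.36-rc8
patch -p1 < ../patch-2.6.36-rc8-ck1
```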

Monday, 11 October 2010

Further updates on hierarchical tree-based penalty.

First of all, no new patch today, yay \o/ I know it's sometimes hard to keep up, but that's the nature of things. I (sort of kind of hope that I can) promise not to release anything much new any time soon with respect to the hierarchical tree-based penalty code.

I experimented with removing the separation of processes from threads and treating them all equally, and discovered that this led to real lag in some GUI applications, as many of them use threads. So it seems the default of penalising fork depth while not penalising threads works best at biasing CPU distribution towards interactivity and responsiveness (which is the default in the current patch). This is rather ironic, as this code evolved out of an initial attempt to control threads' behaviour in massively threaded applications, yet it turns out that being nice to threaded apps works better for the desktop.

I thought up another way of demonstrating the effect this patch has in a measurable way.

Using a dual core machine as an example, running the "browser benchmark" at http://service.futuremark.com/peacekeeper/index.action allowed me to show the effect of the gold standard load, make, versus the almost universal GUI app, a browser.

The benchmark runs a number of different browser based workloads, and gives a score in points, where higher is better.
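The background loads were kernel builds at varying job counts, along these lines (the tree path is illustrative):

```shell
# Start a kernel build as the background load (vary -j as in the results):
cd linux-2.6.36-rc7
make -j24 > /dev/null 2>&1 &

# ...run the Peacekeeper benchmark in the browser, then stop the load:
kill %1
```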

Running the benchmark under various different loads with the feature enabled / disabled gave me the following results:

Disabled:
Baseline: 2437
make -j2: 1642
make -j24: 208
make -j42: Failed

Enabled:
Baseline: 2437
make -j2: 2293
make -j24: 2187
make -j42: 1626

As can be seen, on the dual core machine, a load of 2 normally makes the benchmark run almost precisely 1/3 slower, as would be expected with BFS' fair CPU distribution of 3 processes between 2 CPUs. Enabling this feature leaves the benchmark almost unaffected at this load, and only once the load is more than 20 times higher does it hinder the benchmark to the same degree. At that load with the feature disabled, the browser just spat out 'a script on this page is causing the browser to run slowly' etc. etc. and virtually gave up.
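The fair-share arithmetic checks out against the numbers above: the browser plus make -j2 is 3 runnable tasks sharing 2 CPUs, so the browser should keep 2/3 of its baseline score:

```shell
# 3 tasks fairly sharing 2 CPUs leaves each task 2/3 of a CPU,
# so the expected score is 2437 * 2/3 (measured above: 1642):
awk 'BEGIN { printf "%.0f\n", 2437 * 2 / 3 }'
```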


In the last few days, most of the reports have been very positive. However, as expected, not everything is rosy. There have been reports of applications such as mplayer stalling, and some people have had GNOME applets fail to initialise!? How on earth a scheduling decision about who goes first can cause these is... well, it's not a mystery. In my experience it's because some assumption has been made in the userspace application that naively expects a certain behaviour: that one particular process will run first. These sorts of bugs, although likely due to the userspace applications themselves, make changes of this nature the default in the scheduler impossible, or at least foolish. So as much as I'd like to see this change go into the next -ck release and be the default, I can't.

The patch will still be around to play with and I rather like it on my own desktop so I'm not throwing it out any time soon. Maybe something else will come of it in the future. But now I can relax and just sync up with mainline again when 2.6.36 final comes out.

Friday, 8 October 2010

Hierarchical tree-based penalty, interactivity at massive load, updated.

I've updated the patch for fork_depth based penalty with some minor tweaks. The default fork depth is now allowed to be zero for init, thus making the base system have no fork depth penalty. Userspace reporting of "PRI" to top, ps, etc. was modified to make more sense when the penalty is enabled. Thread group accounting was fixed to offset only by the absolute deadline, instead of by the deadline already penalised by fork_depth. Updated changelog follows:
Make it possible to have interactivity and responsiveness at very high load
levels by making deadlines offset by the fork depth from init. This has a
similar effect to 'nice'ing loads that are fork heavy. 'make' is a perfect
example of this and will, with fork_depth_penalty enabled, be felt as much
at 'make -j24' as it normally would be with just 'make'.

Note that this drastically affects CPU distribution, and also has the
indirect side effect of partitioning CPU entitlement to different users as
well. No assumption as to CPU distribution should be made based on past
behaviour.

This is achieved by separating out forks to new processes vs new threads.
When a new process is detected, its fork depth is inherited from its parent
across fork() and then is incremented by one. That fork_depth is then used
to cause a relative offset of its deadline.

This feature is enabled in this patch by default and can be optionally
disabled.

Threads are kept at the same fork_depth as their parent process, and can
optionally have their CPU entitlement all managed as one process together
by enabling the group_thread_accounting feature. This feature is disabled
by default in this patch, as many desktop applications such as firefox,
amarok, etc are multithreaded. By disabling this feature and enabling the
fork_depth_penalty feature (default) it favours CPU towards desktop
applications.

Extensive testing is required to ensure this does not cause regressions in
common workloads.

There are two sysctls to enable/disable these features.

They are in /proc/sys/kernel/

group_thread_accounting - groups CPU accounting by threads
fork_depth_penalty - penalises according to depth of forking from init
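On a kernel with this patch applied, toggling the two features looks like this (these proc files only exist with BFS 357 plus this patch, and writing them needs root):

```shell
# Enable the fork depth penalty (the patch default):
echo 1 > /proc/sys/kernel/fork_depth_penalty

# Group thread CPU accounting per process (disabled by default):
echo 1 > /proc/sys/kernel/group_thread_accounting

# Verify the current settings:
cat /proc/sys/kernel/fork_depth_penalty
cat /proc/sys/kernel/group_thread_accounting
```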

An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with offsets:
bfs357-penalise_fork_depth_account_threads.patch

EDIT: Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

EDIT2: I notice some people are trying this patch and the earlier released "group as entities" patch and trying to compare them. Let me make it clear: this patch REPLACES the "group as entities" patch and does exactly the same thing, only updated.

I am still after feedback on this approach, as for my workloads it's only been advantageous, so I'd love it if more people would report back their experiences, either in the comments here or via email to me at kernel@kolivas.org . Thanks!

Thursday, 7 October 2010

2.6.36-rc7 with -ck1 and BFS 357

Since I was on the tail end of my hack fest and Linus announced 2.6.36-rc7, saying it was likely the last -rc, I figured it was a good opportunity to sync up my patches with mainline. As always, the porting of BFS brought some unexpected surprises where a simple port would probably work, but likely fail long term. So there were lots of little subtle changes that I had to make to BFS. Functionally this is virtually the same as BFS 357 for 2.6.35.7, apart from some minor tweaks to avoid new warnings. There was one teensy change to niffy_diff to also ensure a minimum difference was observed according to ticks, and the minimum difference was decreased from 1us to anything greater than 0, as the niffy clock may well be updated in less than 1us.

One nice thing also came about from the update. I managed to remove some code when I realised the nohz_load_balancer I'd been maintaining in the BFS code was simply me blindly porting it a while back without even realising what it was for. Of course there is no load balancing on BFS, since it has a global runqueue which means all CPUs are always in balance, so there's no need for any special case balancing on nohz configs.

For those who want some overview of what was required to port it, there were some subtle changes to the try_to_wake_up code for notifying when workers are going to sleep with workqueues. Some reshuffling of what happens on context switch was ported. Some sched domains code was updated. rlimit code was tweaked. nohz balancing code was dropped. Checking that apparently idle CPUs were actually online was added to cope with changes on forking idle tasks on .36. And random other stuff I can't remember.

It's worth noting that you'll need a beta driver from nvidia if you're evil like me and use their evil binary drivers. See here: nvnews link for their latest drivers.

Anyway here's a directory that contains lrz compressed versions of all the patches, and an lrz compressed all-inclusive -ck1 and bfs357 patch. It's my secret plan that those wishing to try my pre-release patches must also grab lrzip, which I wrote, to access them :)

http://ck.kolivas.org/patches/2.6/2.6.36/

EDIT2: If you enable schedstats, you will need the patch called 2636rc7ck1-fixes.patch in that directory added to prevent build failures.