Wednesday, 6 October 2010

Update on scheduling automatically by group, interactive at any load.

As an update to my previous post, the testing so far on my group scheduling patch has been beyond all my expectations. There have been no reports of regressions so far, and the improvements under load are outstanding. After thinking about the code some more, I felt one area needed a minor tweak. Currently, when tasks are queued for CPU, they're checked to ensure their deadline isn't far in the future simply because it was set when many more threads were running at once, and the deadline is then capped to the maximum it could possibly be for that nice level. However, I realised the cap should be based on the maximum the deadline could have been when it was first set, rather than on the current time. So I've made a minor modification to the patch to offset the cap from the deadline_niffy.
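
In code terms, the change amounts to something like the following simplified standalone model. This is only an illustration, not the code from the patch: the struct, the helper and the 6ms figure are made-up stand-ins, and only deadline_niffy corresponds to a name mentioned above.

/*
 * Simplified model of the tweak: cap a queued task's deadline relative
 * to the time the deadline was set (deadline_niffy) rather than
 * relative to the current time.
 */
typedef unsigned long long u64;

struct task {
	int prio_ratio;		/* grows with nice level */
	u64 deadline;		/* virtual deadline, in niffies */
	u64 deadline_niffy;	/* niffy count when the deadline was set */
};

/* Largest deadline offset a task of this nice level could have been
 * given (assumed here to be the rr interval scaled by priority). */
static u64 max_deadline_offset(const struct task *p)
{
	const u64 rr_interval_niffies = 6000000ULL;	/* assumed 6ms */

	return rr_interval_niffies * p->prio_ratio;
}

/* The old behaviour capped against the current time; the tweak caps
 * against the time the deadline was originally set. */
static void cap_deadline(struct task *p)
{
	u64 cap = p->deadline_niffy + max_deadline_offset(p);

	if (p->deadline > cap)
		p->deadline = cap;
}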

There has also been a lot of confusion about what this means for throughput and CPU distribution on SMP, so I've updated the patch changelog to include verbose examples. What follows is the updated changelog:
Group tasks by their group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it starts or children it forks.

The significance of this is that massively multithreaded applications such as
Java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have. The same goes for any application with multiple threads or
processes.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads it starts. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of a more complex group scheduling
policy containing each application as an entity, but without the overhead or
any input or effort on the user's part.

Note that this does not have any effect on throughput either, unlike other
approaches to decreasing latency under load. Increasing the number of jobs up
to the number of CPUs will still increase throughput, provided the jobs are not
competing with other processes for CPU time.

To demonstrate the effect this will have, let's use the simplest example
of a dual core machine, and one fully CPU bound single threaded workload such
as a video encode that encodes 1000 frames per minute, competing with a 'make'
compilation. Let's assume that 'make -j2' completes in one minute, and
'make -j1' completes in 2 minutes.

Before this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 1.7 minutes, 650 frames are encoded per minute

make -j4 and encode:
make finishes in 1.25 minutes, 400 frames are encoded per minute

make -j24 and encode:
make finishes in 1.04 minutes, 40 frames are encoded per minute

make -j128 and encode:
make finishes in 1.01 minutes, 7 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.03 minutes, 30 frames are encoded per minute


After this patch:

make -j1 and no encode:
make finishes in 2 minutes

make -j2 and no encode:
make finishes in 1 minute

make -j128 and no encode:
make finishes in 1 minute

encode no make:
1000 frames are encoded per minute

make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j2 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j4 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j24 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute

make -j128 and encode:
make finishes in 1.08 minutes, 150 frames are encoded per minute

make -j2 and nice +19 encode:
make finishes in 1.06 minutes, 60 frames are encoded per minute
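
To make the arithmetic behind these numbers concrete, here is a tiny standalone C program modelling the idealised fair-share maths for the dual core example above. It deliberately ignores disk I/O, per-job overhead and the drop-off at very high job counts mentioned in the edits and comments below, so it only reproduces the trend rather than the exact measured figures.

#include <stdio.h>

/* CPU fraction the single threaded encode receives while competing
 * with 'make -jN' on a machine with 'ncpus' CPUs, under simple
 * fair-share assumptions. */
static double encode_share(int ncpus, int make_jobs, int grouped)
{
	/* After the patch, make counts as a single entity, so only two
	 * entities share the machine; before it, every compile job
	 * competes with the encode individually. */
	double runnable = grouped ? 2.0 : make_jobs + 1.0;
	double share = ncpus / runnable;

	return share > 1.0 ? 1.0 : share;	/* one thread can't use more than one CPU */
}

int main(void)
{
	const int jobs[] = { 1, 2, 4, 24, 128 };
	unsigned int i;

	for (i = 0; i < sizeof(jobs) / sizeof(jobs[0]); i++)
		printf("make -j%-3d  encode's CPU share  before: %.2f  after: %.2f\n",
		       jobs[i],
		       encode_share(2, jobs[i], 0),
		       encode_share(2, jobs[i], 1));
	return 0;
}

Before the patch the encode's share collapses as the job count rises; after it, the encode keeps a full CPU while make shares what is left.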

It's worth pointing out that this code is quite safe and stable, and provided no issues show up in testing, it will probably go into the next major revision of BFS. I'm thinking of a major version number update, given all the changes that have happened to 357 and this major change in behaviour. Please try it out and report back!

Patches follow, and I've updated the links in the previous blog post.

EDIT: Updated the tests according to real world testing.

EDIT2: Also, it's been suggested that being able to disable this might be helpful, so whenever I get to the next version, I'll add a simple on/off sysctl.
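
Something along the lines of the sketch below; this is purely hypothetical (the knob name is made up and no such switch exists in the posted patch), it just follows the standard ctl_table pattern that existing tunables like rr_interval already use.

#include <linux/sysctl.h>

/* Hypothetical on/off switch sketch; not part of the posted patch. */
int sched_group_accounting = 1;		/* 1 = treat each application as one entity */

static int zero, one = 1;

static struct ctl_table group_sched_table[] = {
	{
		.procname	= "group_thread_accounting",
		.data		= &sched_group_accounting,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},
	{ }
};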

EDIT3: It's also worth pointing out that as well as changing the concept of how CPU is distributed, it will also bring some unfairness as a side effect. 'make', for example, will always run like a niced task due to always being 2 or more processes (usually 4). This is actually why it gets 'contained' with this approach. It is not contained within all the jobs as such.

EDIT4: Further investigation revealed this patch didn't work quite how I thought it did, but it still offers significant advantages. See the newer posts on the hierarchical tree-based penalty.

An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357-patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

6 comments:

  1. -j128 still leads to extreme GUI lag on my dual core. Does the same happen with -j256 on quad?

    I think I do have enough RAM; 6GB of it, and with -j128, 3GB are used, with 2.7GB reported free by "free -m". Swap usage stays at 0.

  2. Btw, I'm not using any of the "2.6.35-ck1" patches currently since they don't apply cleanly to later .35 kernels. Don't know if it's important.

    (Those at http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.35/2.6.35-ck1)

  3. I don't know, because my 8GB of RAM runs out on my quad core with more than about 120 jobs. There are other resources to consider too when thinking about slowdowns. Note that nothing is being done about disk I/O either. I notice no load up to about -j60 on my dual core laptop with 4GB RAM.

  4. Actually there does seem to be an upper limit to how much this newer version can control load. I was most concerned about latency and fairness when generating this version. It does appear to gently drop off as load rises. The older one could seriously starve some threads.

  5. Did some real world testing on a dual core and I found where the inflection point is. The load is completely contained up to -j24 and then starts gently dropping off. I've updated the benchmark results above according to the real world values.

  6. Thank you for BFS! It works like a charm and this new "groups as entity" idea is really intriguing.
