Tuesday 5 October 2010

Interactive at any load?

A funny thing happened on the way to creating a patch to test out a feature I've long been thinking about for BFS. I'd always thought it was wrong that applications that are massively multithreaded (java applications are often the biggest example, such as azureus/vuze) get way too much CPU because threads are treated no differently to processes on linux. So I'd been thinking for a while for a way to group the threads together as a single process for the purpose of CPU distribution. Today I finally decided to toy with an approach involving adjusting deadline (since that's how CPU is distributed on BFS) according to how many active threads each thread group has running at any one time.

In the spirit of BFS, I tried to find the cheapest possible way to do this, and decided to use the thread group_parent to store the number of active threads, and simply extend the deadline for each thread (relative to its own nice level). To my surprise, the group_parent was more than just the parent for threads that shared the same pid. The results were better than I could have possibly imagined. The changelog follows to explain:

Group tasks by their thread_group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it forks.

The significance of this is that massively multithreaded applications such as
java applications do not get any more cpu than if they're were not threaded.

The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of ram, feel like no more load than make -j1 no matter how
many CPUs you have.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads but I believe will be ideal for the desktop user. Think of it as
implementing everything you want out of the complex CGROUPS, but without the
overhead or any input or effort on the user's part.

So anyway, as you can see this has a profound and dramatic effect. This has the smoothing effect at high loads that the latency_bias patch had, but without any slowdown whatsoever, nor any loss of throughput as well. For a while there I had forgotten I had left a make -j128 going, its effects were so insignificant. Of course if you start 128 makes instead of starting one make with 128 jobs, it will bring your machine to a halt much like the current scheduling does. It also means that a single cpuloop application (like 'yes > /dev/null') will get a whole CPU to itself on SMP even if you run your make with -j128. I need feedback and extensive testing on this one, but I'm keeping my fingers crossed and being cautiously optimistic. It will be interesting to see its effect on (regular) audio playback as well.


EDIT!: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch

Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch

Interbench results are now available with and without this patch and the results are dramatic to say the least:

group_interbench_testing.txt

8 comments:

  1. Just tried this out. I'm on dual core, so I tested it with -j64 (inside a tmpfs to minimize hard disk I/O latencies.) There was no noticeable impact on sound playback and desktop performance.

    With -j128 though sound skips and GUI movements become jerky, though still responsive (which is important; even if the GUI starts frameskipping like crazy, it's important to be able to click something and not having to wait 20 seconds for X11 to pick up the event.)

    There's one thing I don't quite understand though; I though the -j option of make creates new processes instead of new threads...

    ReplyDelete
  2. Indeed I was only planning on getting the thread group parent but I managed to use the program group. Bear in mind that the limit for job numbers with make -jX will be the amount of ram you have where you need about 1GB ram per 15 jobs so you'll end up getting ram swappage slowing you down beyond that. I'll be working on a slightly improved version soon.

    ReplyDelete
  3. Ok, I've posted a slightly updated version which ensures that when tasks are requeued that they're not penalised with an ancient deadline by having their deadline re-examined according to the program group's current number of threads. I've also added some more documentation in the code.

    ReplyDelete
  4. These are exciting developments! Great news.

    Galen

    ReplyDelete
  5. What I'm doing now: Building kernel26, qt, kdelibs, kdebase-workspace, kdeplasma-applets-kdenetworkmanager via "yaourt -Sb". All simultaneously.

    top - 16:27:15 up 38 min, 2 users, load average: 8.17, 8.44, 7.42

    http://img3.imagebanana.com/img/7rrjelbq/Bildschirmfoto1.jpg

    Desktop is reacting as if there's no load at all. I love it!

    ReplyDelete
  6. Nice. Thanks, and keep an eye out for newer patches!

    ReplyDelete