There has also been a lot of confusion about what this means for throughput and CPU distribution on SMP, so I've updated the patch changelog to include verbose examples. What follows is the updated changelog:
Group tasks by their group leader and distribute their CPU as though
they're one task.
The practical upshot of this is that each application is treated as one task
no matter how many threads it starts or children it forks.
The significance of this is that massively multithreaded applications such as
Java applications do not get any more CPU than if they were not threaded.
The unexpected side effect is that doing make -j (any number) will, provided
you don't run out of RAM, feel like no more load than make -j1 no matter how
many CPUs you have. The same goes for any application with multiple threads or
processes.
Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children
it forks or threads it starts. 'nice' is still respected.
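To make the mechanism concrete, here's a minimal sketch in kernel-style C of one way the idea could be expressed: stretch each task's virtual deadline in proportion to the number of tasks sharing its group leader, so the group as a whole receives roughly one task's share. This is purely illustrative and is not the code from the patch (see EDIT4 below; the real implementation turned out to work differently). group_scaled_deadline() is an invented name, and the accounting of forked children is omitted for brevity.

#include <linux/sched.h>

/*
 * Illustrative sketch only -- not the actual patch code.  In BFS the
 * runnable task with the earliest virtual deadline runs next, so
 * multiplying the deadline offset by the size of a task's thread
 * group dilutes each member's share until the group as a whole gets
 * about as much CPU as a single task would.
 */
static u64 group_scaled_deadline(struct task_struct *p, u64 now,
				 u64 deadline_offset)
{
	/* get_nr_threads() is a stock kernel helper that counts the
	 * threads in p's thread group; forked children would need
	 * extra accounting that this sketch leaves out. */
	int group_size = get_nr_threads(p->group_leader);

	return now + deadline_offset * group_size;
}

With 'nice' still folded into the deadline offset as usual, niceness keeps working within and across groups, which matches the "'nice' is still respected" note above.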
For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still optimal).
This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that firefox, for example, has many threads and is contained as one
application with this patch.
It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of a more complex group scheduling
policy, containing each application as an entity, but without the overhead or
any input or effort on the user's part.
Note that this does not have any effect on throughput either, unlike other
approaches to decreasing latency at load. Increasing jobs up to the number of
CPUs will still increase throughput if they're not competing with other
processes for CPU time.
To demonstrate the effect this will have, let's use the simplest example
of a dual core machine, and one fully CPU-bound, single-threaded workload such
as a video encode that encodes 1000 frames per minute, competing with a 'make'
compilation. Let's assume that 'make -j2' completes in one minute, and
'make -j1' completes in 2 minutes.
Before this patch:
make -j1 and no encode:
make finishes in 2 minutes
make -j2 and no encode:
make finishes in 1 minute
make -j128 and no encode:
make finishes in 1 minute
encode and no make:
1000 frames are encoded per minute
make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute
make -j2 and encode:
make finishes in 1.7 minutes, 650 frames are encoded per minute
make -j4 and encode:
make finishes in 1.25 minutes, 400 frames are encoded per minute
make -j24 and encode:
make finishes in 1.04 minutes, 40 frames are encoded per minute
make -j128 and encode:
make finishes in 1.01 minutes, 7 frames are encoded per minute
make -j2 and nice +19 encode:
make finishes in 1.03 minutes, 30 frames are encoded per minute
After this patch:
make -j1 and no encode:
make finishes in 2 minutes
make -j2 and no encode:
make finishes in 1 minute
make -j128 and no encode:
make finishes in 1 minute
encode and no make:
1000 frames are encoded per minute
make -j1 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute
make -j2 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute
make -j4 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute
make -j24 and encode:
make finishes in 2 minutes, 1000 frames are encoded per minute
make -j128 and encode:
make finishes in 1.08 minutes, 150 frames are encoded per minute
make -j2 and nice +19 encode:
make finishes in 1.06 minutes, 60 frames are encoded per minute
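For anyone wanting to sanity-check these figures, the idealised numbers fall out of simple fair-share arithmetic (the measured values above differ a little because they come from real-world testing). Before the patch, make -j2 plus the encode means three equally weighted CPU-bound tasks on two CPUs, so each gets 2/3 of a CPU: the encode drops to about 2/3 x 1000 = 667 frames per minute, and make, running on 4/3 of a CPU instead of 2, takes about 2 CPU-minutes / (4/3 CPU) = 1.5 minutes. After the patch, the two make jobs are contained as a single entity, so the encode and make each get one full CPU: the encode stays at 1000 frames per minute and make runs at single-CPU speed, i.e. 2 minutes.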
It's worth pointing out that this code is quite safe and stable, and provided no issues show up in testing, it will probably go into the next major revision of BFS. I'm thinking of a major version number update, given all the changes that have happened to 357 and this major change to behaviour. Please try it out and report back!
Patches follow, and I've updated the links in the previous blog post.
EDIT: Updated the tests according to real-world testing.
EDIT2: Also, it's been suggested that being able to disable this might be helpful so whenever I get to the next version, I'll add a simple on/off sysctl.
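For the curious, a toggle like that would presumably be a one-line check around the group scaling, exposed through the stock sysctl interface. Here's a hypothetical sketch; the name group_thread_accounting and everything around it are invented for illustration and do not exist in the patch above:

#include <linux/sysctl.h>

/* Hypothetical on/off knob; defaults to enabled. */
int sched_group_thread_accounting __read_mostly = 1;

static int zero;
static int one = 1;

static struct ctl_table group_sched_table[] = {
	{
		.procname	= "group_thread_accounting",
		.data		= &sched_group_thread_accounting,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,	/* clamp to 0 or 1 */
		.extra2		= &one,
	},
	{ }
};

The scheduler would then test sched_group_thread_accounting before applying the group scaling, so something like 'sysctl kernel.group_thread_accounting=0' would restore the old per-task behaviour without a reboot.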
EDIT3: It's also worth pointing out that as well as changing the concept of how CPU is distributed, it will also bring some unfairness as a side effect. 'make', for example, will always run like a niced task due to always being 2 or more processes (usually 4). This is actually why it gets 'contained' with this approach; it is the fact that make is always a multi-process group, not the total number of jobs as such, that contains it.
EDIT4: Further investigation revealed this patch didn't work quite how I thought it did, but still offers significant advantages. See the newer posts on hierarchical tree based penalty.
An updated patch for 2.6.36-rc7-ck1 follows, though it should apply to a BFS357 patched kernel with harmless offsets:
bfs357-penalise_fork_depth_account_threads.patch
Here is a patch that can be applied to vanilla 2.6.35.7 which gives you BFS + this change:
2.6.35.7-sched-bfs357+penalise_fork_depth_account_threads.patch
-j128 still leads to extreme GUI lag on my dual core. Does the same happen with -j256 on quad?
I think I do have enough RAM; 6GB of it, and with -j128, 3GB are used, with 2.7GB reported free by "free -m". Swap usage stays at 0.
Btw, I'm not using any of the "2.6.35-ck1" patches currently since they don't apply cleanly to later .35 kernels. Don't know if it's important.
(Those at http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.35/2.6.35-ck1)
I don't know because my quad core with 8GB runs out of RAM with more than about 120 jobs. There are other resources to consider too when thinking about slowdowns. Note that nothing is being done about disk I/O either. I notice no load up to about -j60 on my dual core laptop with 4GB RAM.
Actually there does seem to be an upper limit to how much this newer version can control load. I was most concerned about latency and fairness when generating this version. It does appear to gently drop off as load rises. The older one could seriously starve some threads.
Did some real-world testing on a dual core and I found where the inflection point is. The load is completely contained up to 24 jobs and then starts gently dropping off. I've updated the benchmark results above according to the real-world values.
Thank you for BFS! It works like a charm and this new "groups as entity" idea is really intriguing.