-ck hacking: SMT/Hyperthreading, nice and scheduling policies

Friday, 1 August 2014

SMT/Hyperthreading, nice and scheduling policies

The concept of symmetric multi-threading, which Intel called "Hyperthreading" and introduced into their commodity CPUs first around 2001, is not remotely a new one and goes back a long way before Intel introduced it into the mainstream market. I suspect the introduction of it back then by Intel was them easing the concept of increasing threads and cores for marketing reasons with the imminent walls they'd soon hit with CPU heat and power requirements that would stop the pursuit for higher and higher single CPU frequencies. The idea is that, since a lot of the CPU sits unused even when something is running as fast as it can on part of it, with a bit of extra logic and architecture, you could throw another "virtual core" at some of the unused execution units and behave like 2 (or more) CPUs, putting more of the CPU to good use. These days the vast majority of CPUs sold by Intel have hyperthreading on them, thus doubling the virtual or "logical" cores the CPU has, including even their low power atom offerings.

There have been numerous benchmarks, in-field tests, workloads etc., where people have tried to find whether hyperthreading is better or not. With a bit of knowledge of the workings of hyperthreading, it's pretty easy to know what the answer is, and not surprisingly, it's the frustrating answer of "it depends". And that's the most accurate answer by far, but I'd go further than that and say that if you have any kind of mixed workload, hyperthreading is always going to be better, whereas if you have precisely one workload , then you have to define exactly how it's going to work and whether hyperthreading will be better or not. Which means that in my opinion at least, hyperthreading is advantageous on a desktop, laptop, tablet and even phone since by design they're nothing but mixed workloads. I won't spend much longer on this discussion, but suffice to say that I think about 4 threads (at the moment) is about optimal for most real world desktop(y) workloads.

Imagine for a moment you have a single core CPU which you can run as is, or enable hyperthreading to run as a 2 thread CPU. If you were to run your CPU in single core only mode, then when you run one task at a time it will always use the full power of the CPU, but if you run two tasks, each task runs at 50% the speed and completes in double the time. If you enable hyperthreading, then if you have two mixed workloads that actually use different parts of the CPU, you can actually get effectively (at best) about 140% of the performance of running the CPU in single core mode. This means that instead of the two tasks running at 50% speed when run concurrently, they run at 70% speed. In practice, the actual performance benefit is rarely 40% but it is often on the order of 25%, so each task tends to run about 60% speed instead of 50% speed. Still a nice speedup for "free".

One thing has always troubled me about hyperthreading, though, and that is the way it tends to break priority support in the scheduler. By priority support, I refer to the use of 'nice' and other scheduling policies, such as realtime, sched idleprio etc.

If you have a single core CPU and run a nice 0 task concurrently with a nice +19 task, the nice 0 task will get about 98% of the CPU time and the nice +19 task only about 2%. The scheduler does this by serialising and metering out the time each task gets to spend on the CPU. Now if you enable hyperthreading on that CPU, the scheduler no longer serialises access to the CPU, but gives each of those tasks one logical "core" on the CPU, and you get an overall 25% increase in throughput. However of the total throughput, both the nice 0 and nice +19 task get precisely half. This would be fine if we had two real cores, but they're not, and the performance of both tasks is sacrificed to ~60% to achieve this. Which means that for this contrived but simple example, enabling hyperthreading slows down the overall execution speed of your nice 0 task when you run a nice +19 task much more than on a single core - it runs at 60% speed instead of 98%.

An even more dramatic example is what happens with realtime tasks, which these days most audio backends on linux use (usually through pulseaudio). Running a realtime task concurrently with a SCHED_NORMAL nice 0 task on a single core means the realtime task will get 100% CPU and the nice 0 task will get zero CPU time. Enable hyperthreading and suddenly the realtime task only runs at 60% of its normal speed even with a heavily niced +19 task running in the background.

Enter SMT-nice as I call it. This is not a new idea, and in fact my first iteration of it was for mainline 10(!) years ago. See here: SMT Nice 2.6.4-rc1-mm1

I actually had the patch removed myself from mainline for criticism regarding throughput reasons, though I still argue that worrying about the last percentage points of throughput are not relevant if you break a mechanism as valuable as nice and scheduling policies, but I had lost the energy for defending it which is why I pushed it be removed myself. Note that although throughput overall may be slightly decreased, the throughput of higher priority tasks is not only fairer with respect to low priority tasks, but enhanced because the low priority tasks will have less cache trashing effects.

What this does is it examines all hyperthread "siblings" to see what is running on them, and then decides whether the currently running or next running task should actually have access to the sibling or allow the sibling to go idle completely, allowing a higher priority task to have the actual true core and all its execution units to itself. I'd been meaning to create an equivalent patch for BFS for the longest time but CPUs got faster, cheaper, more cores, I got lazy etc... though I recently found more enthusiasm for hacking.

So here is a reincarnation of the SMT-nice concept for BFS, improved to work across multiple scheduling policies from realtime, iso down to idleprio, and I've made it a compile time option in case people feel they don't wish to sacrifice any throughput:

Patch for BFS449 with pending patches:
bfs449-smtnice-2.patch

And to make life easy, here's an all inclusive ck1+pending+smtnice patch:
3.15-ck1-smtnice2.patch

The TL;DR is: On Intel hyperthreaded CPUs, 'nice', realtime and sched idleprio works better, and background tasks interfere much less with the foreground tasks. Note: This patch does nothing if you don't have a hyperthreaded CPU.

If you wish to do testing to see how this works, try running with and without the patch and running two benchmarks concurrently, one at nice 0 and one at nice +19 (such as 'make -j2' on one kernel and 'nice -19 make -j2' on another kernel on a machine with 2 cores/4 threads) and compare times. Or run some jackd benchmarks of your choice to see what it takes to get xruns etc.

This patch will almost certainly make its way into the next BFS in some form.

EDIT: It seems people have missed the point of this patch. It improves the performance of foreground applications at the expense of background ones. So your desktop/gui/applications will remain fast even if you run folding@home, mprime, seti@home etc., but those background tasks will slow down more. If you don't want it doing that, disable it in your build config.

---
Enjoy!
お楽しみ下さい
-ck

26 comments:

_thalamus1 August 2014 at 23:59
That's very interesting, thanks.

A question though - the AMD bulldozer / piledriver architecture is similar to Intel hyperthreading - could this work benefit that?
ReplyDelete
Replies
Unknown2 August 2014 at 02:01
This comment has been removed by the author.
ReplyDelete
Replies
Anonymous2 August 2014 at 07:59
Great reading as always, thanks!
ReplyDelete
Replies
Anonymous2 August 2014 at 18:55
Kernel 3.15.8-99: experimental nrjQL with better HT/realtime
http://mib.pianetalinux.org/forum/viewtopic.php?f=38&t=4442

I inform you that for OpenMandriva and ROSA linux OS, the so called "nrjQL" kernel flavours are all ck/bfs equipped and enabled by default

Here the kernel wiki with detailed specs:
https://wiki.openmandriva.org/en/Kernel
http://wiki.rosalab.ru/en/index.php/Kernel

thanks a lot for your awesome work
bye, NicCo
ReplyDelete
Replies
Anonymous4 August 2014 at 06:04
I have applied the bfs449-smtnice-2.patch after bfs and without any of the pendings.
It runs smoothly!

Ralph from Hamburg, Germany
ReplyDelete
Replies
kernelOfTruth7 August 2014 at 01:25
haha - posted too fast ^^

now compile-tested only

- will try-boot into it soon, wish me luck :D

for those who are brave (and have data-backups :P )

http://pastebin.com/BkuSYRng

full references:

https://lkml.org/lkml/2014/5/8/228
[tip:sched/core] sched: Rework sched_domain topology definition

https://lkml.org/lkml/2014/5/27/505
[PATCH v2 7/6] sched: rename capacity related flags

https://lkml.org/lkml/2014/6/5/712
[tip:sched/core] sched/idle: Optimize try-to-wake-up IPI
ReplyDelete
Replies
kernelOfTruth7 August 2014 at 08:08
the last change was related to

https://lkml.org/lkml/2014/3/19/34
[PATCH 03/31] arch: Prepare for smp_mb__{before,after}_atomic()
ReplyDelete
Replies
Alfred Chen7 August 2014 at 14:40
Thanks. Will try 3.16 when things are done.

I am still working on multiple queue locking in 3.15.x. The early stage throughput testing shows there is slight improvement again the baseline code on a 4 cores machine. Hopefully there will be a call-for-test in 3.16.
ReplyDelete
Replies
Anonymous8 August 2014 at 04:40
with bfs_3.15-to-3.16_ck1-smtnice2_r5_vanilla:

Hunk #1 succeeded at 40 with fuzz 1 (offset -48 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file kernel/sched/bfs.c
Hunk #1 succeeded at 92 (offset 1 line).
...........
Hunk #8 succeeded at 6593 (offset 142 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file kernel/sched/idle.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file kernel/sched/sched.h
patch unexpectedly ends in middle of line
Hunk #2 succeeded at 788 with fuzz 1.
ReplyDelete
Replies
Anonymous8 August 2014 at 05:40
Btw, sorry, off topic, but does we really need all of the "ifdef CONFIG_SCHED_BFS" and "ifndef CONFIG_SCHED_BFS"? I mean, if i want to use bfs, then i will patch the kernel and if i don´t want to use bfs, i will not patch it. But i don´t know if it is possible to remove them, i am not a kernel hacker.
xxx
ReplyDelete
Replies
kernelOfTruth8 August 2014 at 08:51
bfs_3.15-to-3.16_ck1-smtnice2_r5_vanilla

isn't meant to be used, therefore I created all of the other patches - it's basically the reference implementation of the changes for Con or other devs to easily reproduce what I did & check whether anything was missed

not sure what you dislike about the ifdef's and ifndef's, that's the same way BFS is added to the kernel and other patchsets,

if you don't like the look of it & want to break compatibility with 3.16 and the new locking unification changes of it, feel free to remove it

/scratches head

everything's running fine for me so far - so more people could try it out but have backups available - like everything in the community/opensource: no guarantees
ReplyDelete
Replies
Anonymous8 August 2014 at 13:43
Bad experience here with smtnice enabled. Two systems, dualcore arrandale i7 and quadcore i7 sandy bridge.

If compiling, causes browser windows to not draw, Chromium will just stay a white page until I stop the compile, sometimes it will draw after 30s or so into the compile, but I guess this is counted as a background operation? Windows instantly draw again if compile stops and also it's just fine without the smtnice option selected at compile-time.
ReplyDelete
Replies
Mike8 August 2014 at 21:12
Hi Con,

would only say thanks for the detailed explanation and the new code. (Hoped, that the hyperthreading of my i7 was not only a nice marketing show at all. ;-) )

Regards sysitos
ReplyDelete
Replies
nestorespinoza8 August 2014 at 22:33
g'day mr. kolivas,
I think the patch doesn't behaves well with zfs & spl modules from the daily ubuntu ppa's because I get a lot of kernel hangs and they don't show up when using btrfs as the root filesystem. anyhow, great work.
ReplyDelete
Replies
Anonymous9 August 2014 at 02:39
with the other 3 patches:

patching file lib/Kconfig.debug
(Stripping trailing CRs from patch; use --binary to disable.)
patching file Makefile
patch unexpectedly ends in middle of line

I would like to use the incremental patch, is he completely?
ReplyDelete
Replies
Anonymous5 September 2014 at 23:46
"if you run two tasks, each task runs at 50% the speed and completes in half the time"

I think that should be "[...] and completes in twice the time"
ReplyDelete
Replies

Add comment