Friday, 1 August 2014

SMT/Hyperthreading, nice and scheduling policies

The concept of simultaneous multi-threading, which Intel branded "Hyperthreading" and first introduced into their commodity CPUs around 2001, is not remotely a new one, and it goes back a long way before Intel brought it to the mainstream market. I suspect Intel introduced it back then to ease the market into the idea of increasing thread and core counts, knowing the walls they would soon hit with CPU heat and power requirements that would end the pursuit of ever higher single CPU frequencies. The idea is that, since a lot of the CPU sits unused even when something is running as fast as it can on part of it, a bit of extra logic and architecture lets you throw another "virtual core" at some of the unused execution units and behave like 2 (or more) CPUs, putting more of the CPU to good use. These days the vast majority of CPUs sold by Intel have hyperthreading, thus doubling their virtual or "logical" cores, including even their low power Atom offerings.

There have been numerous benchmarks, in-field tests, workloads etc. where people have tried to determine whether hyperthreading is better or not. With a bit of knowledge of the workings of hyperthreading, it's pretty easy to know what the answer is, and not surprisingly it's the frustrating "it depends". That's the most accurate answer by far, but I'd go further than that and say that if you have any kind of mixed workload, hyperthreading is always going to be better, whereas if you have precisely one workload, you have to define exactly how it's going to work before you can say whether hyperthreading will be better or not. Which means that, in my opinion at least, hyperthreading is advantageous on a desktop, laptop, tablet and even phone, since by design they're nothing but mixed workloads. I won't spend much longer on this discussion, but suffice to say that I think about 4 threads is (at the moment) about optimal for most real world desktop(y) workloads.

Imagine for a moment you have a single core CPU which you can run as is, or enable hyperthreading to run as a 2 thread CPU. If you run your CPU in single core mode, then one task at a time will always use the full power of the CPU, but if you run two tasks, each task runs at 50% speed and completes in double the time. If you enable hyperthreading, then with two mixed workloads that actually use different parts of the CPU you can get effectively (at best) about 140% of the performance of running the CPU in single core mode. This means that instead of the two tasks running at 50% speed when run concurrently, they run at 70% speed. In practice the benefit is rarely 40%, but it is often on the order of 25%, so each task tends to run at about 60% speed instead of 50%. Still a nice speedup for "free".
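
To make the arithmetic explicit, here's a trivial back-of-envelope calculation in C; the 1.40 and 1.25 aggregate factors are simply the best-case and typical figures assumed above:

#include <stdio.h>

int main(void)
{
	const double ht_best = 1.40, ht_typical = 1.25;

	/* Single core, no HT: two tasks are timeshared, each gets half. */
	printf("no HT:      each task at %.1f%% speed\n", 100.0 / 2);
	/* HT best case: ~140% aggregate throughput, split between siblings. */
	printf("HT best:    each task at %.1f%% speed\n", 100.0 * ht_best / 2);
	/* HT typical: ~125% aggregate, i.e. the "about 60%" figure above. */
	printf("HT typical: each task at %.1f%% speed\n", 100.0 * ht_typical / 2);
	return 0;
}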

One thing has always troubled me about hyperthreading, though, and that is the way it tends to break priority support in the scheduler. By priority support, I refer to the use of 'nice' and other scheduling policies, such as realtime, sched idleprio etc.

If you have a single core CPU and run a nice 0 task concurrently with a nice +19 task, the nice 0 task will get about 98% of the CPU time and the nice +19 task only about 2%. The scheduler does this by serialising and metering out the time each task gets to spend on the CPU. Now if you enable hyperthreading on that CPU, the scheduler no longer serialises access to the CPU, but gives each of those tasks one logical "core", and you get an overall 25% increase in throughput. However, of the total throughput, the nice 0 and the nice +19 task each get precisely half. This would be fine if they were two real cores, but they're not, and the performance of both tasks is sacrificed down to ~60% to achieve it. Which means that in this contrived but simple example, enabling hyperthreading slows down the overall execution speed of your nice 0 task far more when you run a nice +19 task than a single core does - it runs at 60% speed instead of 98%.
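
If you want to observe that split for yourself, here's a minimal test harness (purely illustrative, nothing to do with the patch): pin two copies of a busy loop to the same logical CPU, run one of them niced, and compare the iteration counts they report.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	cpu_set_t mask;
	unsigned long long iters = 0;
	time_t end = time(NULL) + 10;	/* run for ~10 seconds */

	/* Pin to logical CPU 0; change this in one copy to CPU 0's
	 * hyperthread sibling to test the hyperthreaded case instead. */
	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);

	while (time(NULL) < end)
		iters++;
	printf("%llu iterations\n", iters);
	return 0;
}

Run it as './busy & nice -19 ./busy' and the two counts should differ by roughly the 98:2 ratio above; with the niced copy pinned to the hyperthread sibling instead, they come out much closer to equal.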

An even more dramatic example is what happens with realtime tasks, which these days most audio backends on Linux use (usually through pulseaudio). Running a realtime task concurrently with a SCHED_NORMAL nice 0 task on a single core means the realtime task will get 100% of the CPU and the nice 0 task zero CPU time. Enable hyperthreading, and suddenly the realtime task runs at only 60% of its normal speed, even with nothing more than a heavily niced +19 task running in the background.
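
For reference, this is roughly how a task gets to be realtime in the first place - a minimal sketch of the sched_setscheduler() call that audio backends make (the priority value of 10 is an arbitrary choice for illustration):

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };
	time_t end = time(NULL) + 10;

	/* Needs root or a suitable RLIMIT_RTPRIO to succeed. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	/* On a single core this busy loop now starves any concurrent
	 * SCHED_NORMAL task completely; on a hyperthread sibling the
	 * SCHED_NORMAL task keeps running and competes for the core's
	 * execution units with this realtime task. */
	while (time(NULL) < end)
		;
	return 0;
}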

Enter SMT-nice as I call it. This is not a new idea, and in fact my first iteration of it was for mainline 10(!) years ago. See here: SMT Nice 2.6.4-rc1-mm1

I actually had the patch removed from mainline myself after criticism regarding throughput, though I still argue that worrying about the last few percentage points of throughput is not relevant if you break a mechanism as valuable as nice and scheduling policies; I had simply lost the energy for defending it, which is why I pushed for it to be removed. Note that although overall throughput may be slightly decreased, the throughput of higher priority tasks is not only fairer with respect to low priority tasks but enhanced, because the low priority tasks will have less cache thrashing effect on them.

What this does is examine all hyperthread "siblings" to see what is running on them, and then decide whether the currently running or next task should actually have access to the sibling, or whether the sibling should go idle completely, allowing a higher priority task to have the actual true core and all its execution units to itself. I'd been meaning to create an equivalent patch for BFS for the longest time, but CPUs got faster and cheaper and grew more cores, I got lazy, etc... though I recently found more enthusiasm for hacking.
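
In rough pseudocode, the decision looks something like the sketch below. To be clear, this is only an illustration of the concept and not code from the patch; the helpers thread_siblings(), rq_running_task() and policy_beats() are invented names standing in for the real sibling mask, runqueue lookup and priority/policy comparison:

/* Illustrative sketch only - the helper names are invented. */
static bool smt_should_run(struct task_struct *p, int cpu)
{
	int sibling;

	/* Check every other logical CPU sharing this physical core. */
	for_each_cpu(sibling, thread_siblings(cpu)) {
		struct task_struct *q = rq_running_task(sibling);

		/* If a sibling is running a task with a better policy or
		 * priority (realtime > iso > normal+nice > idleprio),
		 * leave this logical CPU idle so that task keeps the
		 * whole core to itself. */
		if (q && policy_beats(q, p))
			return false;
	}
	return true;	/* nothing better on the siblings; run p */
}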

So here is a reincarnation of the SMT-nice concept for BFS, improved to work across multiple scheduling policies from realtime and iso down to idleprio, and I've made it a compile time option in case people feel they don't wish to sacrifice any throughput:

Patch for BFS449 with pending patches:
bfs449-smtnice-2.patch

And to make life easy, here's an all inclusive ck1+pending+smtnice patch:
3.15-ck1-smtnice2.patch

The TL;DR is: on Intel hyperthreaded CPUs, 'nice', realtime and sched idleprio all work better, and background tasks interfere much less with foreground tasks. Note: this patch does nothing if you don't have a hyperthreaded CPU.

If you wish to test how this works, try running two benchmarks concurrently with and without the patch, one at nice 0 and one at nice +19 (such as 'make -j2' on one kernel tree and 'nice -19 make -j2' on another, on a machine with 2 cores/4 threads) and compare times. Or run some jackd benchmarks of your choice to see what it takes to get xruns etc.

This patch will almost certainly make its way into the next BFS in some form.

EDIT: It seems people have missed the point of this patch. It improves the performance of foreground applications at the expense of background ones. So your desktop/gui/applications will remain fast even if you run folding@home, mprime, seti@home etc., but those background tasks will slow down more. If you don't want it doing that, disable it in your build config.

---
Enjoy!
お楽しみ下さい
-ck

26 comments:

  1. That's very interesting, thanks.

    A question though - the AMD bulldozer / piledriver architecture is similar to Intel hyperthreading - could this work benefit that?

    Replies
    1. Yes indeed it would, apologies for leaving them out! They call it CMT but it's the same thing. Any SMT capable CPU would, though I've only bothered to make it configurable on x86_64 at the moment.

  2. This comment has been removed by the author.

  3. Great reading as always, thanks!

  4. Kernel 3.15.8-99: experimental nrjQL with better HT/realtime
    http://mib.pianetalinux.org/forum/viewtopic.php?f=38&t=4442


    Just letting you know that for the OpenMandriva and ROSA Linux OSes, the so-called "nrjQL" kernel flavours all ship with ck/BFS applied and enabled by default

    Here are the kernel wikis with detailed specs:
    https://wiki.openmandriva.org/en/Kernel
    http://wiki.rosalab.ru/en/index.php/Kernel

    thanks a lot for your awesome work
    bye, NicCo

    Replies
    1. Hi NicCo,

      your kernel(-config) sounds interesting and the wikis look really good.
      Is there a git tree for the kernel so that I could use it for other distros too, or is your kernel project only for Mandriva/Rosa?
      Are there any plans to integrate the exfat kernel driver?

      Thanks.

      Regards sysitos

    2. Hi Mike,
      sorry, I'm only seeing your msg just now...

      the wiki is currently missing an important chapter, which I'll write soon, about the so-called ABF Kernel Farm

      ABF is our common (Conectiva, OpenMandriva, ROSA, Unity Linux, and different spinoffs like MDK, MagOS, etc.) build server that allows us to work remotely; it builds and hosts all our project sources (srpms) and the resulting binary files (rpms)

      We can split our work into many project folders, then rebuild for different Linux platforms; there is some information about this here:
      http://forum.rosalab.ru/viewtopic.php?f=29&t=1223

      ABF is open, and its contents are visible to anyone who wants to browse, download, etc.; full access requires a subscription

      ExFAT: I haven't looked into it because nobody asked; I only know that we have fuse-exfat & exfat-utils in our repository

      If you want to know more and/or discuss, you can contact me via email (abitrules AT yahoo.it), or into my forum (http://mib.pianetalinux.org/forum), or in the IRC that I often attend (freenode > #openmandriva-cooker)

      Thanks for your interest,
      Regards NicCo

    3. @NicCo,

      thanks so far. Will contact you later.

      Btw, I think ABF is similar to OpenSuse's OBS, where you can also build your own or public packages for different platforms.

      But building a stripped-down version of the kernel with a quick and dirty make and make install is enough for me ;)


      Thanks.
      sysitos

  5. I have applied the bfs449-smtnice-2.patch after BFS and without any of the pending patches.
    It runs smoothly!

    Ralph from Hamburg, Germany

  6. haha - posted too fast ^^

    now compile-tested only

    - will try-boot into it soon, wish me luck :D

    for those who are brave (and have data-backups :P )

    http://pastebin.com/BkuSYRng


    full references:

    https://lkml.org/lkml/2014/5/8/228
    [tip:sched/core] sched: Rework sched_domain topology definition

    https://lkml.org/lkml/2014/5/27/505
    [PATCH v2 7/6] sched: rename capacity related flags

    https://lkml.org/lkml/2014/6/5/712
    [tip:sched/core] sched/idle: Optimize try-to-wake-up IPI

    Replies
    1. 3rd revision:

      http://pastebin.com/buGXRR2w
      (BFS_3.15-to-3.16_ck1-smtnice2_r3.diff)

      change: ifndef'ed the sched_ttwu_pending stuff in kernel/sched/sched.h
      that is only needed by "sched/idle: Optimize try-to-wake-up IPI"

      kernel booted up fine - no problems so far, will do heavy/interactivity testing over the next few hours ...


      Thanks a lot, Con, for these amazing improvements !

      This is the "fair" scheduler we've all been waiting for =)

    2. Could you please post a diff against vanilla 3.16.0?

    3. sure, will do once it's finished,

      found some issues, will attempt to fix

  7. the last change was related to

    https://lkml.org/lkml/2014/3/19/34
    [PATCH 03/31] arch: Prepare for smp_mb__{before,after}_atomic()

  8. Thanks. Will try 3.16 when things are done.

    I am still working on multiple queue locking in 3.15.x. Early-stage throughput testing shows a slight improvement against the baseline code on a 4-core machine. Hopefully there will be a call-for-test in 3.16.

  9. with bfs_3.15-to-3.16_ck1-smtnice2_r5_vanilla:

    Hunk #1 succeeded at 40 with fuzz 1 (offset -48 lines).
    (Stripping trailing CRs from patch; use --binary to disable.)
    patching file kernel/sched/bfs.c
    Hunk #1 succeeded at 92 (offset 1 line).
    ...........
    Hunk #8 succeeded at 6593 (offset 142 lines).
    (Stripping trailing CRs from patch; use --binary to disable.)
    patching file kernel/sched/idle.c
    (Stripping trailing CRs from patch; use --binary to disable.)
    patching file kernel/sched/sched.h
    patch unexpectedly ends in middle of line
    Hunk #2 succeeded at 788 with fuzz 1.

  10. Btw, sorry, off topic, but do we really need all of the "ifdef CONFIG_SCHED_BFS" and "ifndef CONFIG_SCHED_BFS"? I mean, if I want to use BFS, then I will patch the kernel, and if I don't want to use BFS, I will not patch it. But I don't know if it is possible to remove them; I am not a kernel hacker.
    xxx

  11. bfs_3.15-to-3.16_ck1-smtnice2_r5_vanilla

    isn't meant to be used; that's why I created all of the other patches - it's basically the reference implementation of the changes, so Con or other devs can easily reproduce what I did & check whether anything was missed


    not sure what you dislike about the ifdefs and ifndefs - that's the same way BFS and other patchsets are added to the kernel,

    if you don't like the look of them & want to break compatibility with 3.16 and its new locking unification changes, feel free to remove them

    /scratches head


    everything's running fine for me so far - so more people could try it out, but have backups available - like everything in community/open source: no guarantees

  12. Bad experience here with smtnice enabled. Two systems: a dual-core Arrandale i7 and a quad-core Sandy Bridge i7.

    When compiling, browser windows stop drawing; Chromium will just stay a white page until I stop the compile. Sometimes it will draw 30s or so into the compile, but I guess this is counted as a background operation? Windows instantly draw again if the compile stops, and it's just fine without the smtnice option selected at compile time.

    Replies
    1. The difference between nice levels is faulty. I will need to rework it.

    2. Same here; it seems like foreground operations are taking "too much" of the resources and starving background ops.

  13. Hi Con,

    just wanted to say thanks for the detailed explanation and the new code. (I had hoped the hyperthreading of my i7 wasn't just a nice marketing show after all. ;-) )

    Regards sysitos

  14. g'day mr. kolivas,
    I think the patch doesn't behave well with the zfs & spl modules from the daily Ubuntu PPAs, because I get a lot of kernel hangs that don't show up when using btrfs as the root filesystem. Anyhow, great work.

  15. with the other 3 patches:

    patching file lib/Kconfig.debug
    (Stripping trailing CRs from patch; use --binary to disable.)
    patching file Makefile
    patch unexpectedly ends in middle of line

    I would like to use the incremental patch - is it complete?

  16. "if you run two tasks, each task runs at 50% the speed and completes in half the time"

    I think that should be "[...] and completes in twice the time"
