Sunday 10 August 2014

SMT Nice 6

In my last post I discussed the problem with nice levels, scheduling policies and SMT, and my first public patch for BFS to work around the issue, "SMT Nice":


With a bit of extra testing, and feedback from a number of users, a few issues were discovered with the first patch, so I've reworked it. Thanks very much to those who tested it and provided feedback. There were a couple of scheduling points where SMT siblings weren't being examined, and the difference between nice levels was far more aggressive than it was supposed to be.

Here is an updated patch for BFS 449 with all pending patches:

And a convenient all-inclusive patch for 3.15 with ck1+pending+smtnice3456:

EDIT: Added one minor change to not allow kernel threads to deschedule any user tasks on SMT siblings and bumped the patch version up to smtnice4.

EDIT2: Added a change to fix the high power usage bug, bringing the version up to smtnice5.

EDIT3: Fixed a logic error that caused far too many reschedules and left the CPU not fully used with many niced tasks, bringing the version up to smtnice6.



  1. Replies
    1. Life's given me another kick in the guts, but I tried not to bring it up. Suffice to say, the usual delays and more will apply.

    2. this sounds bad
      good luck and health for you and your family

    3. Hi, all. For those who want to try 0449 on 3.16, I have synced it up with 3.16. The following 3 patches are for pure BFS 0449 and all pending patches on 3.16.

      #1 Pure 0449 apply on 3.16

      #2 bfs: 0449 to 3.16 fixes.

      #3 bfs: Apply all 0449 pending patches.

      The rest of the bfs related commits on the linux-3.16.y-gc branch are my bfs improvement patches which are not yet accepted by ck.

      Branch link is

      Have fun with 3.16. :)

    4. @ck
      One remarkable thing I noticed when porting 0449 to 3.16 is the tsk_is_polling define in the original bfs.c
      -#ifndef tsk_is_polling
      -#define tsk_is_polling(t) 0
      which totally disables the tsk_is_polling check and makes resched_task() always call smp_send_reschedule(cpu).

      Is this the intended bfs design?

      When porting, I removed the defines and used the new mainline routines.

    5. thank you ! :)
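
To make the tsk_is_polling question in the replies above concrete, here is a minimal userspace sketch in C. It is a simplified model, not code from bfs.c or mainline: struct task, polling_stub, polling_real and resched_task_model are hypothetical names, and the counter merely stands in for smp_send_reschedule(cpu). It only illustrates why a tsk_is_polling that always evaluates to 0 would mean a reschedule IPI is sent even to a CPU that is already polling for the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's task_struct; only what the model needs. */
struct task {
	bool need_resched;  /* models the need-resched flag on the task */
	bool polling;       /* models an idle task that polls that flag */
	int  ipis_received; /* counts modelled reschedule IPIs */
};

typedef bool (*polling_fn)(const struct task *t);

/* Models the fallback quoted in the thread: "#define tsk_is_polling(t) 0". */
static bool polling_stub(const struct task *t)
{
	(void)t;
	return false;
}

/* Models a check that actually looks at the task's polling state. */
static bool polling_real(const struct task *t)
{
	return t->polling;
}

/* Simplified model of resched_task(): set the flag, then decide on the IPI. */
static void resched_task_model(struct task *t, polling_fn is_polling)
{
	assert(t != NULL && is_polling != NULL);

	if (t->need_resched)
		return;              /* already marked, nothing more to do */
	t->need_resched = true;

	if (!is_polling(t))
		t->ipis_received++;  /* stands in for smp_send_reschedule(cpu) */
}
```

In this model a polling task passed through polling_stub still has its IPI counter incremented, while with polling_real setting the flag alone is enough and no IPI is counted.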

  2. @ ck: Of course, I also wish you the very best !!!

    But I have some questions left:
    - The changes you've made between smtnice-2 & -3 are not really useful: I feel
    that the time it takes to re-activate a foreground process like firefox is too long. (I
    experienced this after heavy? swap usage.)
    - Can you think of, and maybe implement, some userspace knobs in sysfs to let
    us adjust/experiment more with some values?
    - I'm using the worldcommunitygrid client, and although I'm no candidate for
    SMT & SMTnice (my CPU has SMT in each core, but to no benefit), your smtnice-2
    code (the earlier one) significantly increased the client's performance during
    the last days.

    Thank you for sharing your insights,
    Manuel Krause

    1. Firefox doesn't nice its processes so I think you're seeing something totally unrelated and probably placebo effect. The previous patch caused major stalls so was unusable. As I said in my previous blogpost, this patch will speed up foreground processes by slowing down background processes, so if all you care about is total throughput, disable the option.

    2. About: Kernel -server

      For a kernel flavour oriented to -desktop, -laptop, -realtime, now is better

      for a kernel flavour oriented to -server, when configured as you have suggested, see the table below:

      would it be better or preferable to disable this feature, which can be "slowing down background processes", with a proper

      bye, NicCo

    3. For a server, CONFIG_SMT_NICE=n would generally be a good default. However, if you used nice levels on your server, for example to balance a database against an httpd server, this patch would be helpful.

  3. Very minor change added, bumping the version to 4.

    The diff of that change alone is here, though all the links in the main post have been updated to full patches:

    1. If I apply this patch the cpu power state stays almost all the time at C0/C1 at idle. Normal would be 90%+ at C7. Maybe it's related to my system.
      Could someone else check the power states with this patch applied?
      I get the cstates from: and check which kernel functions are probably causing it with 'perf top'.
      Seems to be lapic_next_deadline called from smt_should_schedule.

    2. Ok, I double checked it on my other machine without the smtnice3-4.patch, with just CONFIG_SMT_NICE: the same problem, very high C0/C1 at idle. Strange...

    3. Minor change with major effect, I would say... And, please, don't consider it to be placebo again. ;-) This patch cures the reactivation time to normal interactivity/latency for the issue with firefox I've written about last night.

      Thank you, Manuel

      BTW, the link to the bfs449-smtnice-4.patch throws a 404 Not found.

    4. @Jan Thanks. Can you confirm if the problem still exists if you keep it patched but set CONFIG_SMT_NICE=n ?

      @Manuel Thanks. That's why testing is so important. I discovered it was virtually a bug to not make that change to smtnice4. I will correct the link.

    5. Ok, I just applied the patch and compiled with CONFIG_SMT_NICE=n and I got normal C0 levels so it seems CONFIG_SMT_NICE is the problem here.
      Also the CPU frequency is not going down to the minimum but that "problem" could also exist with CFS. I also read somewhere that most of the power savings are done through the cstates so that doesn't bother me that much but regardless I will do some testing with the vanilla kernel to confirm that.

    6. No that's okay. If you had smt_should_schedule coming up as the hot spot, then that's the likely culprit. It may be forcing reschedules of idle CPUs when there's nothing to do. I'll investigate when time permits. Thanks.

    7. confirmed, I was wondering why my computer recently was getting that hot and blaming it on kwin & composited desktop, and gaming ;)

      compare values of BFS 449:

      *without* smtnice, during rsync (backup)

      2 cores 66%, 1 core 33%, 5 cores 99% idle

      when doing nothing and simply observing with powertop: 8 cores, 99% idle

      *with* smtnice:

      simply observing via powertop: 68-70% idle of all 8 cores (at best)

      "idle" = C7s-HSW, deepest sleep state

      being an anaesthetist already is very mentally & bodily taxing (just found out a few months ago first-hand) - no need to put further pressure on yourself and your health with other things

      take care of yourself & all the time you need, Con

      All the best and thank you so much for creating CK/BFS

    8. Found the bug, and as I suspected, it was just rescheduling when there was nothing to do. I've posted incrementals and updated the patch to smtnice5. Thanks for spotting it!

    9. Thx for fixing that. Now I've found a new bug: if I use ffmpeg to convert a video with CONFIG_SCHED_SMT, the CPU stays 50% idle when it should be at 100% load; nice levels don't seem to matter. If I compile the kernel, everything is normal. On my 4770k (4/8 cores) ffmpeg uses 12 threads to convert the video; maybe it's related to that, I don't know...
      If you need more information just ask.

    10. @Jan: You sure it doesn't spawn threads at different nice levels to the parent process?

      @KOT: No it's not meant to punish CPU hogs in any way unless they're more niced than something else running.

    11. That doesn't mean I won't investigate by the way... the scheduler is subtle and quick to anger when you change it.

    12. @kernelOfTruth: could you please produce a properly formatted patch with diff -Naur or git diff?

    13. sure,


      I wonder why no one mentioned & explained before :/

      using that non-standard format is easier for me to diagnose when things go wrong, therefore using Nrupad as a standard

      mea culpa

    14. In fact, -Naur didn't help, and your patch is still broken :). How do you paste it?

    15. ctrl + a, ctrl + v

      probably not "good enough" :P

      anyway - haven't worked with git (github) for some time

      here's the branch:

      that should be easier to fork and work with

    16. and here the patch for your convenience:

      (adding a .diff to the commit number/address)

    17. I have informed you that your patches can not be integrated

      see Anonymous 9 August 2014 02:39

      /scratches head?

    18. @Anonymous:

      thanks - now I understand what that meant ;)

      that current patch from github works for you, right ?

    19. I'll upload an incremental branch shortly

  4. Best wishes to you and your family!
    Greetings from Serbia

    1. Thanks for your well wishes everyone, code is proving a good distraction after all.

  5. Thanks con, you make linux and learning about the kernel fun for me!

  6. I'd really like to read something from Con about the _additional_ proposed patches by Alfred Chen (not only for 3.16 as in this thread here). He already advertised most/all of his BFS related patches for 3.15 when I tried Con's URWlocks revisitation IIRC.

    So, all said and asked in a friendly tone: Con, what's your opinion? And Alfred Chen: Are your patches ck-smtnice-5-ready already?

    Best regards, Manuel Krause

    1. ****, sorry for forgetting half of my question to Alfred Chen...
      have you changed much from 3.15 to 3.16?

      And does the following

      >Alfred Chen 13 August 2014 22:46
      >One remarkable thing I noticed when porting 0449 to 3.16 is the tsk_is_polling define in the original bfs.c
      >-#ifndef tsk_is_polling
      >-#define tsk_is_polling(t) 0
      >which totally disables the tsk_is_polling check and makes resched_task() always call smp_send_reschedule(cpu).
      >Is this the intended bfs design?
      >When porting, I removed the defines and used the new mainline routines.

      exist & imply problems on 3.15 ?

    2. @Manuel Krause
      smtnice is not yet ported to my linux-3.16.y-gc for two reasons: firstly, smtnice is new and still under testing in this thread; secondly, I don't have smt hardware to test it.

      >have you changed much from 3.15 to 3.16?
      For the 3 commits I have posted here to port 0449 to 3.16, I didn't change any bfs logic besides the tsk_is_polling thing. Most of the work is syncing up mainline changes in core.c, and for 3.16 those are topology level/sd related, with none in the core code.
      The rest of the bfs commits on linux-3.16.y-gc are ported from linux-3.15.y-gc. I have posted and explained them in previous threads, so there is nothing new.

      I do have multiple queue locking code for bfs, but it needs to be rebased to 3.16 and then retested.

      >exist & imply problems on 3.15 ?
      Both versions of the code work on 3.15 and 3.16 for me, so I can't tell if any problems are caused by this.

    3. Without your i915 related revert, not shipped with your current collection, I'm not able to survive a suspend-to-disk with open videos in SMPLAYER. I'm also lacking brain, to port the related patch myself to 3.16.y-gc.

      What I've done successfully is to port/"edit" TuxOnIce to 3.16. Completely unofficial: and pastebin doesn't let it be posted as it is greater than 500kB.*grrr*

      So, I'd say good night,
      Manuel Krause

    4. Why port TOI manually if there's an official one?

    5. @ post-factum: I'm embarrassed, I haven't looked at the end of Nigel's repo. My understandable excuse is: It doesn't provide a proper sort by date...

      @ Alfred Chen: I've now spent several hours adjusting this i915 related revert patch, step-wise, as I'm no developer. The result still doesn't heal the issue but brings back the known behaviour of your 3.15.y-gc patch. E.g. getting after a resume from hibernation:
      [ 218.708017] [drm] stuck on render ring
      [ 218.709466] [drm] GPU HANG: ecode 0:0x00000000, in Xorg [793], reason: Ring hung, action: reset

      But that at least has no impact and reenables me to work on the resumed machine.
      I hope I haven't forgotten something for the complete patch.

      BTW, I'm now using your full patch collection for 3.16.y-gc and it's working fine, so far.

      Best regards, Manuel Krause

    6. @Manuel
      This is a little off topic for bfs. I added the i915 revert patch in 3.15 because the original commit broke my GM45 chipset machines. I found that it was introduced to fix an issue but it turned out that it didn't, yet the maintainer pulled it upstream as a fix. I have no idea about the gpu setup code but had to revert it; luckily this works. That's the story in 3.15. In 3.16, the vanilla kernel works without drm issues here, so I didn't cherry-pick that revert patch from 3.15.
      I don't use TOI; suspend/resume works fine with my machines in 3.16.

    7. @Alfred Chen: Of course, it's very off topic on here. I only wanted to leave a little message. With vanilla 3.16.1 video playback in SMplayer or VLC will not display anything other than black playback content with sound(!) after a resume from hibernation, and it's not related to TOI. The result with 3.15.y-gc is provoking a gpu ring reset (what I meant with "not healing" the issue) on my GM45 graphics and I'm somekind of proud to have managed to replicate this behaviour with my reworked patch without breaking the rest of the kernel ^^. :-)
      I'm a bit annoyed/tired to file another bug in maybe another bugzilla, where the assignees don't seem(!) to fix -- but are able to push 'masses' of new code into the kernel in the meantime. Please, don't read this as flames, maybe I'm just too spoiled by the experience of this -ck related community: Just see the fast efforts that were possible with Con's SMTnice.

      Best regards, and to all, please, keep up your good work,

    8. @Manuel
      Gpu issues are very odd: when it happened on 3.15, my 3 GM45 machines showed different behaviors. I re-tested on 1 of them using 3.16: mpv playback, suspend/resume, then mpv playback again works fine. Or do you mean suspend while playing back?

    9. @Manuel Sorry, I can't remember which email to LKML it was, but we are looking at the same bug on As I found with a Google search today, there are still similar issues reported in 3.16.

  7. changes for 3.15 to 3.16:

  8. I did some more testing of the SMT nice problem with threads (ffmpeg)
    Here is the thread readout from ps:
    top shows this as overall:
    %Cpu(s): 6.9 us, 0.0 sy, 36.2 ni, 53.4 id, 0.0 wa, 3.4 hi, 0.0 si, 0.0 st
    I also played around with SCHED_ISO and the nice levels but nothing changes. It seems like some threads are starving. I will also test this without smtnice in a couple of minutes...

    1. Ok, the thread readout looks pretty much the same:
      But the overall from top looks like this:
      %Cpu(s): 0.8 us, 0.0 sy, 99.0 ni, 0.0 id, 0.0 wa, 0.1 hi, 0.0 si, 0.0 st
      One thing in addition: without smt nice, perf top shows nothing specific to the scheduler in the higher up functions, but if I turn it on it shows smt_should_schedule in the top 5 functions.

    2. Thanks. All those threads are niced. I'm guessing ffmpeg is a combination of niced and un-niced processes and the un-niced ones are making the niced ones go to sleep which is pretty much how smt nice is supposed to work. If it's not time critical then you can run the whole of ffmpeg as sched idleprio so they're all treated equally. If it is, then it's unusual to nice the encoding threads so heavily. At the moment the baseline bias is 100% and decreases with decreasing nice levels. On the next incarnation I'm planning on making it 75% (to account for the increased cpu power overall) and configurable, but the extra code/decision making in a configurable version is what has put me off doing it so far.

      smt_should_schedule needs to be called on every single scheduling decision so that's expected but I guess since it's such a hot spot it needs to be ultra optimised.

      Thanks for your testing so far.

    3. Found a bug. Will post an update shortly.
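
The suggestion above to run the whole of ffmpeg as idleprio could look roughly like the following C sketch. It assumes that BFS's SCHED_IDLEPRIO shares its policy number with mainline's SCHED_IDLE (worth verifying against your patched kernel headers); make_self_idleprio and exec_as_idleprio are hypothetical helper names, not part of any patch. The scheduling policy is kept across exec and inherited by threads created afterwards, so all of the encoder's threads would end up at the same policy.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <unistd.h>

/* Linux policy number; glibc only exposes the name under _GNU_SOURCE. */
#ifndef SCHED_IDLE
#define SCHED_IDLE 5
#endif

/* Switch the calling process to the idle policy.
 * Returns 0 on success, -1 on failure (errno set by the kernel). */
static int make_self_idleprio(void)
{
	struct sched_param param = { .sched_priority = 0 };

	return sched_setscheduler(0, SCHED_IDLE, &param);
}

/* Switch policy, then replace this process with the given command.
 * Only returns (with -1) if the policy change or the exec failed. */
static int exec_as_idleprio(char *const argv[])
{
	if (argv == NULL || argv[0] == NULL)
		return -1;
	if (make_self_idleprio() != 0)
		return -1;
	execvp(argv[0], argv);
	return -1;
}
```

A small main() that forwards its arguments (argv + 1) to exec_as_idleprio would turn this into a wrapper for starting the encoder.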

  9. SMT nice 6 incremental and full patches uploaded. Thanks to all those that keep testing and finding bugs! This one has to be close to release quality now.

    1. Yeah, that did the trick, everything is working nicely now. Thx for your effort.

    2. No, thank you for your tireless testing and quick reporting back which helped this code mature much faster than it otherwise would :)