Monday, 20 February 2017

linux-4.10-ck1, MuQSS version 0.152 for linux-4.10

Announcing a new -ck release, 4.10-ck1, with a new version of the Multiple Queue Skiplist Scheduler, version 0.152. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.

linux-4.10-ck1

-ck1 patches:
http://ck.kolivas.org/patches/4.0/4.10/4.10-ck1/

Git tree:
https://github.com/ckolivas/linux/tree/4.10-ck

Ubuntu 16.10 packages (sorry I'm no longer on 16.04):
http://ck.kolivas.org/patches/4.0/4.9/4.10-ck1/Ubuntu16.10/

MuQSS

Download:
4.10-sched-MuQSS_152.patch

Git tree:
4.10-muqss


MuQSS 0.152 updates

- Removed the rapid ramp-up in schedutil cpufreq which was overactive.
- Bugfixes

4.10-ck1 updates

Apart from resyncing with the latest tree from linux-bfq:
- The wb-buf-throttling patches are now part of mainline and do not need to be added separately
- Minor swap setting tweaks

For those of you trying to build the evil nvidia driver for linux-4.10, this patch will help:
nvidia-375.39-linux-4.10.patch
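
In case it helps, applying it typically goes something like this (a rough sketch; the exact installer filename, patch location and patch strip level depend on what you downloaded):

sh NVIDIA-Linux-x86_64-375.39.run --extract-only    # unpack the installer without running it
cd NVIDIA-Linux-x86_64-375.39
patch -p1 < ../nvidia-375.39-linux-4.10.patch       # patch downloaded from the link above; try -p0 if -p1 does not apply
sudo ./nvidia-installer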

Enjoy!
Please enjoy!
-ck

40 comments:

  1. Nice low latency you got there ;).
    Also thanks for the nvidia patch, couldn't find one before.

  2. @CK I have weird SCHED logs in dmesg

    #3
    [ 0.753442] SCHED: No cpumask for kworker/4:0/36

    ...
    [ 0.131197] TSC deadline timer enabled
    [ 0.131200] smpboot: CPU0: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (family: 0x6, model: 0x3a, stepping: 0x9)
    [ 0.131243] Performance Events: PEBS fmt1+, IvyBridge events, 16-deep LBR, full-width counters, Intel PMU driver.
    [ 0.131263] ... version: 3
    [ 0.131264] ... bit width: 48
    [ 0.131264] ... generic registers: 4
    [ 0.131265] ... value mask: 0000ffffffffffff
    [ 0.131266] ... max period: 00007fffffffffff
    [ 0.131266] ... fixed-purpose events: 3
    [ 0.131266] ... event mask: 000000070000000f
    [ 0.201364] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    [ 0.221236] smp: Bringing up secondary CPUs ...
    [ 0.291257] SCHED: No cpumask for kworker/1:0/18
    [ 0.301260] SCHED: No cpumask for kworker/1:0H/19
    [ 0.301289] x86: Booting SMP configuration:
    [ 0.301290] .... node #0, CPUs: #1
    [ 0.453414] SCHED: No cpumask for kworker/2:0/24
    [ 0.453427] SCHED: No cpumask for kworker/2:0H/25
    [ 0.453441] #2
    [ 0.603439] SCHED: No cpumask for kworker/3:0/30
    [ 0.603452] SCHED: No cpumask for kworker/3:0H/31
    [ 0.603463] #3
    [ 0.753442] SCHED: No cpumask for kworker/4:0/36
    [ 0.753457] SCHED: No cpumask for kworker/4:0H/37
    [ 0.753466] #4
    [ 0.903475] SCHED: No cpumask for kworker/5:0/42
    [ 0.903488] SCHED: No cpumask for kworker/5:0H/43
    [ 0.903499] #5
    [ 1.053497] SCHED: No cpumask for kworker/6:0/48
    [ 1.053509] SCHED: No cpumask for kworker/6:0H/49
    [ 1.053521] #6
    [ 1.203507] SCHED: No cpumask for kworker/7:0/54
    [ 1.203521] SCHED: No cpumask for kworker/7:0H/55
    [ 1.203530] #7
    [ 1.353452] smp: Brought up 1 node, 8 CPUs
    [ 1.353454] smpboot: Total of 8 processors activated (36719.67 BogoMIPS)
    [ 1.360342] MuQSS locality CPU 0 to 1: 2
    [ 1.360343] MuQSS locality CPU 0 to 2: 2
    [ 1.360343] MuQSS locality CPU 0 to 3: 2
    [ 1.360344] MuQSS locality CPU 0 to 4: 1
    [ 1.360344] MuQSS locality CPU 0 to 5: 2
    [ 1.360345] MuQSS locality CPU 0 to 6: 2
    [ 1.360345] MuQSS locality CPU 0 to 7: 2
    [ 1.360346] MuQSS locality CPU 1 to 2: 2
    [ 1.360347] MuQSS locality CPU 1 to 3: 2
    [ 1.360347] MuQSS locality CPU 1 to 4: 2
    [ 1.360347] MuQSS locality CPU 1 to 5: 1
    [ 1.360348] MuQSS locality CPU 1 to 6: 2
    [ 1.360348] MuQSS locality CPU 1 to 7: 2
    [ 1.360349] MuQSS locality CPU 2 to 3: 2
    [ 1.360350] MuQSS locality CPU 2 to 4: 2
    [ 1.360350] MuQSS locality CPU 2 to 5: 2
    [ 1.360350] MuQSS locality CPU 2 to 6: 1
    [ 1.360351] MuQSS locality CPU 2 to 7: 2
    [ 1.360352] MuQSS locality CPU 3 to 4: 2
    [ 1.360352] MuQSS locality CPU 3 to 5: 2
    [ 1.360353] MuQSS locality CPU 3 to 6: 2
    [ 1.360353] MuQSS locality CPU 3 to 7: 1
    [ 1.360354] MuQSS locality CPU 4 to 5: 2
    [ 1.360354] MuQSS locality CPU 4 to 6: 2
    [ 1.360355] MuQSS locality CPU 4 to 7: 2
    [ 1.360355] MuQSS locality CPU 5 to 6: 2
    [ 1.360356] MuQSS locality CPU 5 to 7: 2
    [ 1.360356] MuQSS locality CPU 6 to 7: 2
    [ 1.360570] devtmpfs: initialized
    ...

    Full dmesg: https://pastebin.com/raw/9YbMTik5

    CONFIG: https://github.com/FadeMind/linux410-custom.src/blob/master/linux410/config.x86_64

    Regards

    FadeMind

    Replies
    1. I put them there much like the MuQSS locality messages. They're harmless and for my information.

  3. Thanks Con! I built and ran x64 and i686-UP on Arch last night; working fine.

    Replies
    1. Yes, thanks!
      No problems so far.

  4. This comment has been removed by the author.

  5. Too bad there isn't an nvidia 340.x driver for 4.10 yet.

  6. Have a look at the end of Con's -ck1 announcement for 4.10; he provided a link to an nvidia driver patch.
    Peter

    Replies
    1. Yes, I saw it, but it is for the 375.x driver series. Thanks anyway.

    2. First hit on google search:

      https://devtalk.nvidia.com/default/topic/982052/linux/latest-nvidia-driver-340-101-builds-compiles-properly-but-fails-to-load-has-errors-with-linux-kernel-4-9-resolved-with-patch-/

      search terms: "nvidia 340 4.10 kernel"

  7. As with the 4.9.0 -ck, I am getting huge spikes in several CPU monitors while the system is actually idle process-wise. `top` sees my CPUs at '100% si' almost constantly, `xosview` displays 100% "SYS" spikes at intervals of a second or less, and XFCE4's xfce4-systemload-plugin shows the CPU at 100% constantly. I have CONFIG_HZ=300 set. Any hints on how to get usable CPU load monitoring back?

    Replies
    1. I have the same problem with 4.10-muqss branch.

    2. Con, any hints on how to track this down, and possibly fix this?

    3. It's an accounting error (it's not actually using extra CPU.) Unless you can hack the code and fix it, there's nothing more you can do until I find time to investigate and fix it (which alas won't be any time soon.)

    4. Thanks for the heads-up. I unfortunately cannot fix it myself, but as long as it is on the list, I'm a happy camper. :)

  8. Great work once more on updating MuQSS. Personally I think it's a great scheduler. I've been getting very impressive results from it when combined with the schedutil governor and using yield_type 2, interactive 1 and rr_interval 1.

    Not only is the system incredibly responsive, but performance also seems to be at its best with those settings. Mileage may vary for other people, but I could not be happier.
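
    For anyone wanting to try the same settings, they can be changed at runtime through the MuQSS sysctls, roughly like this (a sketch; it assumes a MuQSS kernel that exposes these tunables under /proc/sys/kernel):

    echo 2 > /proc/sys/kernel/yield_type    # 2 = expire the timeslice whenever sched_yield is called
    echo 1 > /proc/sys/kernel/interactive   # enable interactive mode
    echo 1 > /proc/sys/kernel/rr_interval   # round-robin timeslice in milliseconds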

    Replies
    1. +1 All of the above.
      It is the best, no doubt.

  9. echo 1 > /proc/sys/kernel/rr_interval gives nice low latency for "real-time" audio work.
    Thanks a lot.
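
    To keep that setting across reboots, the usual sysctl drop-in should work too (a sketch, assuming a MuQSS kernel; the file name is just an example):

    # /etc/sysctl.d/90-muqss.conf
    kernel.rr_interval = 1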

    Replies
    1. Astonishing. And this doesn't hurt throughput in any way? In my earlier tests, some years and kernels ago, setting it to 1 not only affected disk I/O negatively, but also graphics and audio, which were no longer "in time" as promptly as expected.

      Are you using the full feature -ck1 or the MuQSS only patch?

      BR, Manuel Krause

    2. Yes, it hurts throughput.

      But when the CPU is fast it takes some "abuse" to reach that point.

      On a slow CPU it might not be as much fun, since it might be choking all the time when the value is too low.

      ck1.

  10. @27 February 2017 at 05:30:

    Actually, I'm running rr_interval 1, interactive 1 and yield_type 2, and whereas one might expect that to hurt throughput, from some testing (both synthetic and real-world) I've actually found that throughput seems to be BETTER than with, for example, rr_interval 6, interactive 0 and yield_type 1 (or 0).

    I suspect this has to do with more and more applications, as well as OS subsystems, becoming increasingly multithreaded, so that the overhead of the extra context switching (yield_type 2 and rr_interval 1) is less than the overhead of threads simply waiting for other threads to complete their tasks.

    Something along those lines anyhow.

    Just to give you an idea -- running a demoscene demo (synthetic metric, obviously) in WINE sees a 12% difference for me between running the highly cooperative mode (yield 2, interval 1, interactive 1) and the highly selfish mode (yield 0, interval 6+, interactive 0). In favour of the cooperative approach.

    Replies
    1. @Anon,

      please specify whether (in your tests) you use the performance / ondemand / powersave governor, and whether you actually use cpufreq or p-state.
      As mentioned in different threads here and there, ondemand vs performance is itself a big win, at least on non p-state capable hardware; if you get 12% out of performance, that's neat and worth a try :)

    2. Schedutil. Been a fan of that one since it was first implemented. Tried it with ondemand as well and even that was a performance degradation. Performance might be on-par with schedutil but I'd hardly wager it being better.

    3. Obviously I meant that the 'Performance' governor might be on-par with the 'schedutil' governor.

  11. If you have an Intel CPU, I would be cautious about schedutil.

    I've tested it with CFS on my Intel 4770k, with both acpi-cpufreq and intel_pstate (by adding intel_pstate=passive to the kernel boot line, a new option in 4.10).

    It is broken: the CPU frequency is always locked at the maximum turbo frequency (3.9GHz in my case), and the performance is bad with acpi-cpufreq (I didn't benchmark intel_pstate).

    I've not tried MuQSS with schedutil.

    Pedro
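
    For reference, the combination described above is set up roughly like this (a sketch; intel_pstate=passive goes on the kernel command line, e.g. in /etc/default/grub, and the governor is switched per CPU via sysfs as root):

    # add to the kernel command line
    intel_pstate=passive

    # then select the governor at runtime (as root)
    for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo schedutil > "$c"
    done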

    Replies
    1. I use "intel_pstate=disable intel_idle.max_cstate=0 idle=poll nohalt" on Intel CPUs for maximum performance.

    2. pstate passive + schedutil scales correctly for me, but the performance is way lower than cpufreq + schedutil.
      I don't know why.

    3. pstate + schedutil + MuQSS = almost always max standard freq for me on Skylake here. Not usable at all.
      pstate or cpufreq are both OK separately.

      Br, Eduardo

  12. Thanks Con for this new release.

    Here are the usual throughput benchmarks on 4.10:
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

    Pedro

    Replies
    1. Is it possible for you to re-run the latency tests (Interbench)? I am wondering if it's even worth running MuQSS, because throughput is probably better with CFS, but I am not sure about the latencies. I've been using both schedulers and I can't find any difference latency-wise; my workload consists of compiling large projects like LLVM/Chromium while programming, and I haven't noticed anything slowing down even with CFS.

    2. Latency-wise CFS is a turtle while MuQSS is a rabbit. HTH.

    3. But then you have the catch: MuQSS aims for lower latency than CFS.

      BR, Manuel Krause

    4. Added the interbench results.

      Pedro

    5. I've done some latency-oriented tests with runqlat and cyclictest, on MuQSS152@100Hz and CFS@300Hz.
      This time I hope I get it right.
      Charts are at the bottom of the sheets.

      The cpu is loaded with a linux kernel build at various -j.
      During the build, runqlat or cyclictest are run with the following command lines:
      'runqlat -m 180 1'
      'cyclictest -q -S -m -D3m -H=40000'

      I also ran cyclictest at +1 nice level, as suggested by some doc I read.
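
      For the record, a +1 nice run would be something like this (assuming plain nice rather than renice or chrt):
      nice -n 1 cyclictest -q -S -m -D3m -H=40000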

      Overall MuQSS shows much higher average latencies under high load, but lower max latencies under runqlat.

      Maybe ck or someone else can comment on these.
      Is it expected? Are the tests not measuring the right thing?

      Pedro

    6. Again it's not testing what you think it's testing. Try changing the yield proc tunable and you'll see the results will change.

    7. Additionally, the functions it hooks into aren't exactly the same, so the results are never going to be directly comparable.

    8. Thanks for replying.

      The thing is, I'm trying to back up the positive comments I read about MuQSS latency with figures, as I don't feel any difference between MuQSS and CFS with my workload.
      I guess it's not that easy.

      I'll try changing the sched_yield setting.

      I had looked at cyclictest source code and didn't see any call to sched_yield, so I thought it was the right tool to compare CFS and MuQSS.
      Well, I just don't understand this scheduling stuff :(

      Pedro

    9. I've done some testing with the yield_type setting.
      It doesn't make a real difference with this workload (kernel build).
      I won't draw conclusions from that.

      Pedro

  13. Maybe http://www.brendangregg.com/blog/2017-03-16/perf-sched.html could help with further tweaking of MuQSS. HTH.
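
    For anyone who wants to try it, the basic workflow from that article looks roughly like this (a sketch; it needs a reasonably recent perf with the sched subcommands):

    perf sched record -- sleep 10    # capture scheduler events for 10 seconds
    perf sched latency               # per-task scheduling latency summary
    perf sched timehist              # per-event scheduling timeline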
