Tuesday, 12 March 2019

linux-5.0-ck1, MuQSS version 0.190 for linux-5.0

Announcing a new -ck release, 5.0-ck1 with the latest version of the Multiple Queue Skiplist Scheduler, version 0.190. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.

-ck1 patches:
Git tree:
MuQSS only:
Git tree:

Web: http://kernel.kolivas.org

This is mostly a resync from 4.20-ck1 with a minor tweak to CPU ordering for slightly better throughput. Note that BFQ and I/O schedulers have nothing to do with MuQSS or any of the -ck code so the changes to I/O schedulers in mainline are of no consequence.



  1. Pulled those from Git before the weekend, and it compiles and works fine so far for me :)

    Thank you CK!

  2. Thank you, feels a little more responsive. Good job!

  3. Did you ever test your patches against AMD CPUs?

    I have the problem that on (pre-Ryzen) AMD CPUs, the clock speed is always on max with schedutil on the MuQSS patchset...?

    1. Seeing the same phenomenon. NO_HZ_IDLE, 100 Hz, MuQSS, schedutil and all 4 cores at max clock 100% of the time.

      With CFS instead of MuQSS, the core clocks behave normally.

      It is almost as if schedutil simply is not functioning with MuQSS, as if 'performance' is active instead.

      Measured with a watch -n1 of cat /proc/cpuinfo over some time.

    2. It's likely schedutil broke. No, I don't have any AMD CPUs. Try another governor like ondemand to see if it's just schedutil I broke or something else.

    3. OP here:
      Max clock problem is only with schedutil.
      On ondemand clocks are reasonable, however I don't reach max. clocks with ondemand under full load.

      I don't use ondemand normally, so can't tell if this is a problem with the MuQSS patchset or in general.

    4. Seeing the same here; other software-based governors functioning as expected, schedutil simply not functioning at all.

      @OP: Regarding ondemand not reaching max clock; unsure what could be going on there. Is 'performance' reaching max clock? Not suggesting that is a viable workaround, mostly just suggesting it to confirm.

      Anyhow... I'd love to be able to donate an AMD CPU to debug this issue but, I have none spare.

      schedutil is so far ahead of its competitors, would be wonderful to see it functioning as expected in conjunction with MuQSS.

    5. Was schedutil working properly last -ck release?

    6. Actually, never even bothered to take notice of that... but, just recompiled 4.20 with MuQSS and... no, schedutil was broken in 4.20 as well. At least, in conjunction with MuQSS. 100% max clock on all cores, 100% of the time.

      Could go back further but this is an aging PC with a relatively small amount of memory so recompiling a half a dozen kernels is going to take a while.

      So, for now...

      - 5.0, dysfunctional
      - 4.20, dysfunctional

    7. So, as per the OP's report of a broken schedutil on 4.19 as well... hmmm...

      I think we may need a more powerful AMD box to debug this. Just to reduce the compile time; takes me 90+ minutes for a single compile.

      Not enough memory to run it off /dev/shm, sadly. Would drastically reduce it.

    8. If you are just building the kernel for quick testing "make localmodconfig" could speed up the process.

    9. Thanks for the suggestion but, looked at the numbers and it's just not going to be enough. Best way to go about this would be to go back all the way to 4.7, the kernel that first saw schedutil being implemented.

      Since it cannot be entirely ruled out at this point that schedutil simply never worked correctly in conjunction with MuQSS.

      And I have neither the time nor the hardware to compile that many kernels to manually bisect this phenomenon.
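The measurement described in this thread (watching /proc/cpuinfo) can also be done through the cpufreq sysfs files, which show the active governor next to the current clock of each core. A minimal sketch, assuming the usual `/sys/devices/system/cpu/cpuN/cpufreq/` layout with `scaling_governor` and `scaling_cur_freq` (reported in kHz); the function name `show_cpufreq` and its optional root argument are made up for illustration:

```shell
# show_cpufreq [ROOT]: print "cpuN governor MHz" for every CPU that exposes
# cpufreq files. ROOT defaults to the real sysfs location; the optional
# argument only exists so the function can be pointed at a test directory.
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip CPUs without readable cpufreq files (or an unmatched glob).
        [ -r "$dir/scaling_governor" ] || continue
        [ -r "$dir/scaling_cur_freq" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        gov=$(cat "$dir/scaling_governor")
        khz=$(cat "$dir/scaling_cur_freq")
        # scaling_cur_freq is in kHz; integer MHz is precise enough here.
        echo "$cpu $gov $((khz / 1000))"
    done
}
```

Running it once per second on an idle system, for example with `while sleep 1; do show_cpufreq; done`, should show whether the cores stay at maximum clock under schedutil.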

  4. OP here:
    I have broken schedutil on 4.19

  5. Thanks Con for maintaining this.

    I've made throughput benchmarks here :

    I've tested ck1 and MuQSS alone, configured with NO_HZ_IDLE and HZ=100.

    Reading one of your comments on the 0.185 announcement, I understand that for low latency 'MuQSS + high res timers' (aka ck1 patchset) is better than MuQSS alone.
    Is that right?

    I also wanted to update the interbench results. Is this benchmark suited to comparing different schedulers?
    From all my attempts to benchmark latency, I remember that it is tricky because of the different system call implementations.


  6. Hey there, I've noticed that the Multi-Core sibling runqueue sharing doesn't seem to detect the cache topology of older Core 2 Quad CPUs. It ends up running 1 runqueue for all 4 cores instead of 2 runqueues, even though the CPU is essentially 2 dual-core dies on one package (2x2).

    This results in lower than expected performance.

    I suspect it is due to the fact that it is UMA, unlike most modern multi-socket/multi-die platforms.

    Is there any way to manually configure CPU locality, so that I can test it against runqueue sharing being off?

    1. You can't force the topology but you can choose the sharing level at boot time.
      rqshare=none / smt / mc / smp / all
      Want to post the debug dmesg output at boot time of MuQSS locality?

    2. sure,

      [ 0.234865] MuQSS locality CPU 0 to 1: 2
      [ 0.234867] MuQSS locality CPU 0 to 2: 2
      [ 0.234868] MuQSS locality CPU 0 to 3: 2
      [ 0.234868] MuQSS locality CPU 1 to 2: 2
      [ 0.234869] MuQSS locality CPU 1 to 3: 2
      [ 0.234869] MuQSS locality CPU 2 to 3: 2
      [ 0.234896] MuQSS runqueue share type MC total runqueues: 1
      [ 2.956091] MuQSS CPU scheduler v0.190 by Con Kolivas.
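When comparing the `rqshare=` levels mentioned above, it helps to pull just the runqueue summary out of the boot log after each reboot. A small sketch that parses lines in the format shown in this dmesg output; the function name `muqss_runqueues` is hypothetical, and the pattern assumes the message keeps the same wording for other sharing levels:

```shell
# muqss_runqueues: read kernel log lines on stdin and print
# "<share type> <total runqueues>" for the MuQSS summary line.
muqss_runqueues() {
    sed -n 's/.*MuQSS runqueue share type \([A-Za-z]*\) total runqueues: \([0-9][0-9]*\).*/\1 \2/p'
}
```

Typical use would be `dmesg | muqss_runqueues`; for the log above it prints `MC 1`.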

  7. Giuseppe Ghibò, 2 April 2019 at 05:02

    Hi. Upstream kernels from around 4.19.8 onwards use the CONFIG_SCHED_SMT option for sched_cpu_activate|deactivate; shouldn't MuQSS.c include the following patch too:


    1. Technically, probably. But point releases are not actively supported by ck, largely due to time constraints.

    2. MuQSS 0.190 is for 5.0, and not a "point release"... Or is this not the case for 5.0 kernel, and just something in > 4.19.8?

      Is this patch tested for current MuQSS and 5.0 kernel?

    3. Well.. to answer myself: No ill effects (yet) :)

    4. The error may have been mine, I interpreted the question in reverse. As in, a request to downstream v0.190 to 4.19, which did indeed see its MuQSS version break due to a kernel point release.

      @OP: v0.190 will compile fine on 5.0.x. So far, none of the 5.0 point releases have broken MuQSS v0.190.

  8. MuQSS fails to apply to 5.0.7:
    patching file kernel/sysctl.c
    Hunk #1 FAILED at 127.
    Hunk #2 succeeded at 297 (offset 1 line).
    Hunk #3 succeeded at 314 (offset 1 line).
    Hunk #4 succeeded at 469 (offset 1 line).
    Hunk #5 succeeded at 1042 (offset 1 line).
    1 out of 5 hunks FAILED -- saving rejects to file kernel/sysctl.c.rej

    1. The patch seems to apply just fine (both 5.0-muqss-190.patch against 5.0.7, and incremental patch-5.0.6-7 against already muqss patched 5.0.6), if you simply increase the fuzz value when patching.

    2. patch -F x

      Where x is the fuzz factor, 2 by default. Try 3 or 4. Do not go too large, as that may cause incorrectly applied hunks. The fuzz factor is the maximum number of context lines patch is allowed to ignore when trying to figure out where a hunk needs to be applied in the file.

    3. Thank you very much.
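Rather than guessing a fuzz value, the smallest one that works can be found without modifying the tree. A rough sketch, assuming GNU patch (for `--dry-run`) and a patch made for `-p1`; the function name `min_fuzz` and the default upper limit of 3 are choices made for this example:

```shell
# min_fuzz PATCHFILE [MAX]: print the smallest fuzz factor (0..MAX, default 3)
# at which PATCHFILE applies cleanly to the tree in the current directory.
# Nothing is modified: every attempt is a --dry-run.
min_fuzz() {
    patchfile="$1"
    max="${2:-3}"
    fuzz=0
    while [ "$fuzz" -le "$max" ]; do
        # -f: never prompt, -s: stay quiet unless there is an error.
        if patch -p1 -f -s --dry-run -F "$fuzz" < "$patchfile" >/dev/null 2>&1; then
            echo "$fuzz"
            return 0
        fi
        fuzz=$((fuzz + 1))
    done
    echo "does not apply with fuzz <= $max" >&2
    return 1
}
```

Run from the top of the kernel source tree, `min_fuzz /path/to/5.0-muqss-190.patch` (path is a placeholder) prints the value to pass to `patch -p1 -F`. Keeping the limit low follows the advice above about large fuzz values misplacing hunks.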

  9. The failed segment looks like this with 5.0.7 kernel:
    diff --git a/kernel/sysctl.c b/kernel/sysctl.c
    index ba4d9e8..37cf9d5 100644
    --- a/kernel/sysctl.c
    +++ b/kernel/sysctl.c
    @@ -128,8 +128,14 @@ static int __maybe_unused two = 2;
    static int __maybe_unused four = 4;
    static unsigned long one_ul = 1;
    static unsigned long long_max = LONG_MAX;
    -static int one_hundred = 100;
    -static int one_thousand = 1000;
    +static int __read_mostly one_hundred = 100;
    +static int __read_mostly one_thousand = 1000;
    +extern int rr_interval;
    +extern int sched_interactive;
    +extern int sched_iso_cpu;
    +extern int sched_yield_type;
    #ifdef CONFIG_PRINTK
    static int ten_thousand = 10000;

    The offsets are just cosmetic, so you don't need to change anything.

  10. I'd like to add that with 5.0.7 the patch might also fail because of the new $(LIBELF_FLAGS) in tools/objtool/Makefile.

    This can be remedied with a short sed command before applying the ck patchset:

    sed -i '/-CFLAGS/ s/$/ \$(LIBELF_FLAGS)/' patch-5.0-ck1

    As for the rejects in kernel/sysctl.c, these can be worked around with a fuzz factor of 3 as pointed out above.

    With these two tweaks I was able to build 5.0.7 with the -ck patches applied.

    1. For several kernel versions now we have seen point releases break MuQSS or at the very least complicate its installation.

      We need volunteers to start keeping the individual MuQSS versions in sync with their respective kernel versions.

    2. So you mean a git repository?

    3. Sure, yes. Just something to help us keep track of what additional patches are necessary to keep MuQSS working in spite of individual kernel point releases doing their best to break it.

      For now, information on how to counter cases of this happening is spread all over the place. Some of it here on ck's blog, some just arbitrarily on random sites to be found with a Google search.

      It'd be better to roll all that information together. If anything, I'd personally recommend maintaining at least 2 versions... (see, kernel.org), the most recent stable and the most recent LTS.

    4. Yeah, a "git-rebased" version ref. to point releases would be very nice.
      I update my kernel patches at https://github.com/SveSop/kernel_cybmod/tree/MuQSS , but I have NO WAY to do any testing other than with the config I personally use. So if a rebased/fixed MuQSS plus the extra patches I use works for my .config, that does not really indicate that a point release has not broken something I don't use (e.g. AMD).

      So, I vote for someone to maintain this in a proper manner :)

    5. I'm doing the same, updating the patches for my specific config. I agree that a universal repo would be nice, though.
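The sed workaround from comment 10 edits the patch file itself: every line of the patch that contains `-CFLAGS` gets ` $(LIBELF_FLAGS)` appended, presumably so that the removed line matches the updated tools/objtool/Makefile. A sketch that previews the effect on stdin instead of editing anything in place; the sample lines are invented for illustration and are not the real Makefile or patch content:

```shell
# Same sed expression as in comment 10, but without -i, so it filters
# stdin and the result can be inspected before touching patch-5.0-ck1.
append_libelf_flags() {
    sed '/-CFLAGS/ s/$/ \$(LIBELF_FLAGS)/'
}

# Hypothetical patch lines: only the one containing "-CFLAGS" is changed.
printf '%s\n' '-CFLAGS += -O2 $(INCLUDES)' '+CFLAGS += -O2 $(INCLUDES)' | append_libelf_flags
```

The single quotes keep the shell from treating `$(LIBELF_FLAGS)` as a command substitution, and `\$` makes sed insert a literal dollar sign.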

  11. [ 1.535455] ------------[ cut here ]------------
    [ 1.535460] Current state: 1
    [ 1.535464] WARNING: CPU: 1 PID: 0 at 0xffffffff8108c865
    [ 1.535466] Modules linked in:
    [ 1.535469] CPU: 1 PID: 0 Comm: MuQSS/1 Not tainted 5.0.7-ck1 #3
    [ 1.535471] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
    [ 1.535474] RIP: 0010:0xffffffff8108c865
    [ 1.535476] Code: 04 77 29 89 f1 ff 24 cd 38 76 a0 81 80 3d 53 1b bd 00 00 75 17 89 c6 48 c7 c7 90 c6 ad 81 c6 05 41 1b bd 00 01 e8 7b ae fa ff <0f> 0b 48 83 c4 08 5b c3 48 8b 47 60 48 85 c0 75 64 83 fe 03 89 73
    [ 1.535480] RSP: 0018:ffff888437c43f50 EFLAGS: 00010082
    [ 1.535482] RAX: 0000000000000010 RBX: ffff888437c504c0 RCX: ffffffff81c1fdb8
    [ 1.535483] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffff81f8fcac
    [ 1.535485] RBP: 7fffffffffffffff R08: 00000000000001f0 R09: 0000000000000000
    [ 1.535487] R10: 0720072007200720 R11: 0720072007200720 R12: 7fffffffffffffff
    [ 1.535489] R13: ffff888437c56900 R14: ffff888437c569f8 R15: ffff888437c56a38
    [ 1.535491] FS: 0000000000000000(0000) GS:ffff888437c40000(0000) knlGS:0000000000000000
    [ 1.535493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1.535494] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006e0
    [ 1.535496] Call Trace:
    [ 1.535498]
    [ 1.535500] 0xffffffff8108e7cb
    [ 1.535502] 0xffffffff810817cb
    [ 1.535503] 0xffffffff81601568
    [ 1.535505] 0xffffffff8160117f
    [ 1.535506]
    [ 1.535507] RIP: 0010:0xffffffff8100f592
    [ 1.535509] Code: 0f ba e0 24 72 11 65 8b 05 bb eb ff 7e fb f4 65 8b 05 b2 eb ff 7e c3 bf 01 00 00 00 e8 17 e0 07 00 65 8b 05 a0 eb ff 7e fb f4 <65> 8b 05 97 eb ff 7e fa 31 ff e8 ff df 07 00 fb c3 66 66 2e 0f 1f
    [ 1.535512] RSP: 0018:ffffc9000007bf00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    [ 1.535515] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
    [ 1.535516] RDX: 000000005b7f1466 RSI: 0000000000000001 RDI: 0000000000000380
    [ 1.535518] RBP: ffffffff81c601a8 R08: 0000000000000000 R09: 0000000000019840
    [ 1.535520] R10: 0000001e3c819be7 R11: 000000007260bc7a R12: 0000000000000000
    [ 1.535522] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [ 1.535524] 0xffffffff8105bd2f
    [ 1.535526] 0xffffffff8105bf5b
    [ 1.535527] 0xffffffff810000d4
    [ 1.535529] ---[ end trace 71fe021b29fa5d1f ]---

    I am having this problem on all my Phenom II systems. It looks like some kind of interrupt problem. I tried enabling the nothreadedirqs option and the fix for broken boot IRQs, but neither had any effect. I also enabled stack traces, but for some reason they are not shown :x

    Does anyone know what this is? I searched around and found this, which may be helpful: https://pastebin.com/y0aXvBNP
    (this is not mine but it looks very much like mine)

    Thanks :)

  12. Here is more information: https://gist.github.com/ManoaNosea/69f698e40661be29df5016143bd81a86

    1. I did get interrupt related panics as well on my machines. In my case, it seemed to be related to the bfq I/O scheduler (the call trace showed its "try_to_wake_up" function on top).

      What I/O scheduler are you using? I have switched to mq-deadline as an experiment and haven't had panics since.

    2. Tried to reproduce the issue again and indeed the instruction pointer is always somewhere in the "bfq_bfqq_expire" function.

      I have some additional stuff in my kernel, though, so you might be facing a different issue. Anyway, check whether you're using BFQ or not.

    3. I personally was kind of expecting this. With the removal of the legacy IO schedulers, BFQ is all but the default (not to mention the single best choice, the other MQ schedulers are simply not fit for desktop use).

      Using BFQ is inevitable, required. Again, all other options are simply unrealistic in terms of latency and bandwidth.

      Sad to see it is happening as I predicted, that MuQSS and BFQ are no longer playing nice. It's unrealistic to expect BFQ to move over. It is being promoted hard as the de-facto standard IO scheduler.

  13. I used Kyber in all of this.

    1. Interesting. I can't reproduce the issue with Kyber so the I/O scheduler doesn't seem to be the main culprit...

    2. I was curious and checked the addresses of your github trace against a 5.0.7-ck kernel with debug symbols. It would seem that RIP is stuck at line 364 of arch/x86/events/intel/ds.c:

      ds->pebs_buffer_base = (unsigned long) cea;

      It apparently arrived there via a call to "reserve_ds_buffers".

      The first call trace has two frames of x2apic_phys_probe(). As for the bottom trace, it seems to look something like this:

      clear_page_presence() ---> ./arch/x86/include/asm/paravirt.h:482
      clear_page_presence() ---> arch/x86/mm/kmmio.c:161

      0xffffffff81078e3b ---> line 30 of arch/x86/kernel/apic/x2apic_phys.c


      0xffffffff8105053d ---> line 330 of ./include/linux/bitmap.h


    3. I am using Kyber for my spinning disk and none for my NVMe, and have not had any issues like this, though.

      I have tried BFQ a bit, but was not really that impressed on NVMe, even though Phoronix had a lengthy article about improved startup times and whatnot.

      However, I have dabbled with a wee bit of patchwork for this that I found here: https://github.com/sirlucjan/kernel-patches/tree/master/5.0

      I weeded out the ones that patched cleanly (and have verified that 2-3 of them are actually implemented upstream as-is): https://github.com/SveSop/kernel_cybmod/tree/MuQSS/blk-patches

      I am not claiming this is a solve-it-all, nor do I want to be held responsible for any data corruption from it. But if you want to experiment, feel free. The patches apply to the 5.0.7 source.

  14. Thanks :) That is a big help :) I will test all of these functions :)

  15. But it looks very strange. These functions point to problems in things like HPET, x2apic and IOMMU (I don't really know the functions; maybe this is not necessarily a kernel problem at all, maybe the hardware itself is bad), yet it shows that MuQSS is affected. This I don't understand. Why would MuQSS be calling these functions?

  16. It is strange because I am not running any virtualization; only python3 was running on the computer :x

    1. Indeed, from the trace I was at first thinking of virtualization as well. It is also strange that some Intel specific stuff was (apparently) executed on a Phenom processor. But then I'm not a kernel dev either, so...

      Was there anything else running apart from python3?

      For your investigations, I'd check the value of your kernel's CONFIG_KALLSYMS and recompile with it set to yes. That way the panic message should contain function names and offsets instead of just addresses. In the end your kernel might be very different from the one I used for the check.
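Checking the option before rebuilding takes one grep on the build config. A minimal sketch, assuming the usual `.config` format where enabled options appear as `CONFIG_FOO=y`; the helper name `kconfig_enabled` is made up for this example:

```shell
# kconfig_enabled OPTION [CONFIG_FILE]: succeed if OPTION is set to "y" in
# the given kernel config file (default: .config in the current directory).
kconfig_enabled() {
    option="$1"
    config="${2:-.config}"
    # Anchor both ends so CONFIG_KALLSYMS does not match CONFIG_KALLSYMS_ALL.
    grep -q "^${option}=y\$" "$config"
}
```

From the kernel source directory, `kconfig_enabled CONFIG_KALLSYMS || echo "traces will show raw addresses"` reports the situation described above. For a running kernel, the same check works on a decompressed copy of /proc/config.gz, which only exists if the kernel was built with CONFIG_IKCONFIG_PROC.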

  17. I disabled HPET, IOMMU and x2apic, but that didn't fix it: https://gist.github.com/ManoaNosea/7ceec21d49d87dc679265a1371c0433e

  18. Thanks, now I know why it printed addresses instead of function names.

  19. [ 1.864822] Current state: 1
    [ 1.864829] WARNING: CPU: 2 PID: 0 at clockevents_switch_state+0x45/0xe0
    [ 1.864830] Modules linked in:
    [ 1.864833] CPU: 2 PID: 0 Comm: MuQSS/2 Not tainted 5.0.7-ck1 #10
    [ 1.864835] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
    [ 1.864839] RIP: 0010:clockevents_switch_state+0x45/0xe0
    [ 1.864841] Code: 04 77 29 89 f1 ff 24 cd 38 75 a0 81 80 3d 53 24 bd 00 00 75 17 89 c6 48 c7 c7 c8 9e b7 81 c6 05 41 24 bd 00 01 e8 db af fa ff <0f> 0b 48 83 c4 08 5b c3 48 8b 47 60 48 85 c0 75 64 83 fe 03 89 73
    [ 1.864844] RSP: 0018:ffff888437c83f50 EFLAGS: 00010082
    [ 1.864846] RAX: 0000000000000010 RBX: ffff888437c904c0 RCX: ffffffff81c1f618
    [ 1.864848] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffff81d41c04
    [ 1.864850] RBP: 7fffffffffffffff R08: 00000000000001e6 R09: 0000000000000000
    [ 1.864852] R10: 0720072007200720 R11: 0720072007200720 R12: 7fffffffffffffff
    [ 1.864854] R13: ffff888437c96900 R14: ffff888437c969f8 R15: ffff888437c96a38
    [ 1.864856] FS: 0000000000000000(0000) GS:ffff888437c80000(0000) knlGS:0000000000000000
    [ 1.864858] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1.864860] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006e0
    [ 1.864861] Call Trace:
    [ 1.864864]
    [ 1.864866] ? tick_program_event+0x4b/0x80
    [ 1.864869] ? hrtimer_interrupt+0x12b/0x220
    [ 1.864872] ? smp_apic_timer_interrupt+0x48/0xa0
    [ 1.864874] ? apic_timer_interrupt+0xf/0x20
    [ 1.864875]
    [ 1.864877] ? amd_e400_idle+0x32/0x60
    [ 1.864880] ? do_idle+0x1cf/0x280
    [ 1.864882] ? cpu_startup_entry+0x1b/0x20
    [ 1.864884] ? secondary_startup_64+0xa4/0xb0

    It looks like an hrtimer problem, but the kernel configuration does not allow disabling that option, so I enabled HPET and booted with clocksource=hpet, but the boot error is the same :x

  20. Could it be a dynticks idle problem?

  21. I disabled dynticks idle and the boot error is gone.

    1. Some further confirmation for you, Liquorix uses MuQSS and disabled tickless idle a long time ago: https://github.com/damentz/liquorix-package/blob/5.0-7/linux-liquorix/debian/config/kernelarch-x86/config-arch-64#L85

      I've settled on 250 Hz since it gives decent timekeeping without the negative effects on power consumption that 1000 Hz causes.

  22. On AMD Ryzen 2400G APU booting muqss with mc (default) results in 25 runqueues (which is more than odd, because this is 4 core 8 thread CPU):
    kernel: MuQSS runqueue share type MC total runqueues: 25

    When using virtualization (KVM), a lot of errors like these appear:
    apr 15 16:44:52 kernel: BUG: using smp_processor_id() in preemptible [00000000] code: CPU 4/KVM/14536
    apr 15 16:44:52 kernel: caller is single_task_running+0xe/0x30
    apr 15 16:44:52 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350M Pro4, BIOS P5.70 03/14/2019
    apr 15 16:44:52 kernel: Call Trace:
    apr 15 16:44:52 kernel: dump_stack+0x65/0x8a
    apr 15 16:44:52 kernel: debug_smp_processor_id+0xe8/0xf0
    apr 15 16:44:52 kernel: single_task_running+0xe/0x30
    apr 15 16:44:52 kernel: kvm_vcpu_block+0x230/0x370 [kvm]
    apr 15 16:44:52 kernel: kvm_arch_vcpu_ioctl_run+0x312/0x1cc0 [kvm]
    apr 15 16:44:52 kernel: kvm_vcpu_ioctl+0x24b/0x630 [kvm]
    apr 15 16:44:52 kernel: ? hrtimer_start_range_ns+0x1ce/0x360
    apr 15 16:44:52 kernel: do_vfs_ioctl+0xa9/0x760
    apr 15 16:44:52 kernel: ? __schedule+0xa5a/0xd90
    apr 15 16:44:52 kernel: ? __fget+0x73/0xa0
    apr 15 16:44:52 kernel: __x64_sys_ioctl+0x6a/0xa0
    apr 15 16:44:52 kernel: do_syscall_64+0x5a/0x110
    apr 15 16:44:52 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
    apr 15 16:44:52 kernel: RIP: 0033:0x7f25c93995d7

    Using standard Ubuntu mainline kernel, there are no errors.

    1. Here's the output of "virsh capabilities" (if that helps at all):


    2. I should have clicked preview before, blogspot ate all my xml :D
      Here's the same in pastebin: https://pastebin.com/8eHJ2sc8

  23. But it gives kernel problems again when running: https://gist.github.com/ManoaNosea/61fff8a1504b869d352b938528a5a4ab

    1. This might be of help for your case:


  24. Thanks. I tested the same kernel with the same .config but without the patch: no problems at all. I think this is a MuQSS problem and that it doesn't work with AMD.

  25. I admire your patchset, as a suggestion to boost single-threaded performance, will the scheduler allow a single thread to hog a core?

    This would boost FPS in games, and web app run times.

    1. If you disable all runqueue sharing you will get more binding to one core, but latency will still dictate that something else will knock it off the core, even with multiple cores. You can't have it both ways. Ultimately latency comes at some expense to throughput unless you have infinite cores.

  26. Another tool to test schedulers in a "gaming" workflow, used to test VariableRefreshRate implementations and for visual research: github.com/kleinerm/VRRTestPlots

  27. Did anyone adapt this to 5.1 yet?