Announcing
a new -ck release, 5.0-ck1, with the latest version of the Multiple
Queue Skiplist Scheduler, version 0.190. These are patches designed to
improve system responsiveness and interactivity with specific emphasis
on the desktop, but configurable for any workload.
linux-5.0-ck1:
-ck1 patches:
Git tree:
MuQSS only:
Download:
Git tree:
Web: http://kernel.kolivas.org
This is mostly a resync from 4.20-ck1, with a minor tweak to CPU ordering for slightly better throughput. Note that BFQ and I/O schedulers have nothing to do with MuQSS or any of the -ck code, so the changes to I/O schedulers in mainline are of no consequence.
Enjoy!
お楽しみ下さい
-ck
... right.
Pulled those before the weekend from Git, and they compile/work fine so far for me :)
Thank you CK!
Thank you, feels a little more responsive. Good job!
Did you ever test your patches against AMD CPUs?
I have the problem that on (pre-Ryzen) AMD CPUs, the clock speed is always at max with schedutil on the MuQSS patchset...?
Seeing the same phenomenon. NO_HZ_IDLE, 100 Hz, MuQSS, schedutil and all 4 cores at max clock 100% of the time.
With CFS instead of MuQSS, the core clocks behave normally.
It is almost as if schedutil simply is not functioning with MuQSS, as if 'performance' is active instead.
Measured by running watch -n1 cat /proc/cpuinfo over some time.
It's likely schedutil broke. No, I don't have any AMD CPUs. Try another governor like ondemand to see if it's just schedutil I broke or something else.
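For reference, switching governors on the fly is just a sysfs write; a sketch assuming the common cpufreq interface (paths can vary by driver):
echo ondemand | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
watch -n1 "grep 'cpu MHz' /proc/cpuinfo"
The watch makes it easy to see whether the clocks start dropping at idle.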
OP here:
The max clock problem is only with schedutil.
With ondemand the clocks are reasonable; however, I don't reach max clocks under full load.
I don't use ondemand normally, so can't tell if this is a problem with the MuQSS patchset or in general.
Seeing the same here; other software-based governors functioning as expected, schedutil simply not functioning at all.
@OP: Regarding ondemand not reaching max clock: unsure what could be going on there. Does 'performance' reach max clock? Not suggesting that as a viable workaround, mostly just suggesting it to confirm.
Anyhow... I'd love to be able to donate an AMD CPU to debug this issue, but I have none spare.
schedutil is so far ahead of its competitors; it would be wonderful to see it functioning as expected in conjunction with MuQSS.
Was schedutil working properly in the last -ck release?
Actually, I never even bothered to take notice of that... but I just recompiled 4.20 with MuQSS and... no, schedutil was broken in 4.20 as well. At least in conjunction with MuQSS. 100% max clock on all cores, 100% of the time.
Could go back further, but this is an aging PC with a relatively small amount of memory, so recompiling half a dozen kernels is going to take a while.
So, for now...
- 5.0, dysfunctional
- 4.20, dysfunctional
So, as per the OP's report of a broken schedutil on 4.19 as well... hmmm...
I think we may need a more powerful AMD box to debug this, just to reduce the compile time; it takes me 90+ minutes for a single compile.
Not enough memory to run it off /dev/shm, sadly. That would drastically reduce it.
If you are just building the kernel for quick testing, "make localmodconfig" could speed up the process.
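Something like this, roughly (a sketch; note that localmodconfig only keeps the modules currently loaded, so have all the hardware you care about plugged in when you run it):
cd linux-5.0
make localmodconfig
make -j$(nproc)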
Thanks for the suggestion, but having looked at the numbers it's just not going to be enough. The best way to go about this would be to go all the way back to 4.7, the kernel that first saw schedutil implemented.
Since it cannot be entirely ruled out at this point that schedutil simply never worked correctly in conjunction with MuQSS.
And I have neither the time nor the hardware to compile that many kernels to manually bisect this phenomenon.
OP here:
I have broken schedutil on 4.19
Thanks Con for maintaining this.
I've made throughput benchmarks here:
https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing
I've tested ck1 and MuQSS alone, configured with NO_HZ_IDLE and HZ=100.
Reading one of your comments on the 0.185 announcement, I understand that for low latency, 'MuQSS + high res timers' (aka the ck1 patchset) is better than MuQSS alone.
Is that right?
I also wanted to update the interbench results. Is this benchmark suited to comparing different schedulers?
From all my attempts to benchmark latency, I remember it is tricky to do due to differing system call implementations.
Pedro
Hey there, I've noticed that the multi-core sibling runqueue sharing doesn't seem to detect the cache topology of older Core 2 Quad CPUs; it ends up running 1 runqueue for all 4 cores, when the CPU is essentially 2 dual-core dies on a package (2x2), instead of running 2 runqueues.
This results in lower than expected performance.
I suspect it is due to the fact that it is UMA, unlike most modern multi-socket/multi-die platforms.
Is there any way to manually configure CPU locality, so that I can test it against runqueue sharing being off?
You can't force the topology but you can choose the sharing level at boot time.
rqshare=none / smt / mc / smp / all
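For example, to give that 2x2 box one runqueue per core you could add rqshare=none to the kernel command line; a sketch for a GRUB-based setup (Debian-style paths assumed):
# in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet rqshare=none"
# then regenerate the config and reboot
sudo update-grub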
Want to post the debug dmesg output at boot time of MuQSS locality?
sure,
[ 0.234865] MuQSS locality CPU 0 to 1: 2
[ 0.234867] MuQSS locality CPU 0 to 2: 2
[ 0.234868] MuQSS locality CPU 0 to 3: 2
[ 0.234868] MuQSS locality CPU 1 to 2: 2
[ 0.234869] MuQSS locality CPU 1 to 3: 2
[ 0.234869] MuQSS locality CPU 2 to 3: 2
[ 0.234896] MuQSS runqueue share type MC total runqueues: 1
[ 2.956091] MuQSS CPU scheduler v0.190 by Con Kolivas.
Hi. Upstream kernels use the CONFIG_SCHED_SMT option around sched_cpu_activate|deactivate from 4.19.8 onward; shouldn't MuQSS.c include the following patch too:
https://pastebin.com/ZB7X0WQe
Technically, probably. But point releases are not actively supported by ck, largely due to time constraints.
MuQSS 0.190 is for 5.0, and not a "point release"... Or is this not the case for the 5.0 kernel, and just something in > 4.19.8?
Is this patch tested for current MuQSS and the 5.0 kernel?
Well... to answer myself: no ill effects (yet) :)
The error may have been mine; I interpreted the question in reverse, i.e. as a request to backport v0.190 to 4.19, which did indeed see its MuQSS version break due to a kernel point release.
@OP: v0.190 will compile fine on 5.0.x. So far, none of the 5.0 point releases have broken MuQSS v0.190.
MuQSS fails with 5.0.7:
patching file kernel/sysctl.c
Hunk #1 FAILED at 127.
Hunk #2 succeeded at 297 (offset 1 line).
Hunk #3 succeeded at 314 (offset 1 line).
Hunk #4 succeeded at 469 (offset 1 line).
Hunk #5 succeeded at 1042 (offset 1 line).
1 out of 5 hunks FAILED -- saving rejects to file kernel/sysctl.c.rej
The patch seems to apply just fine (both 5.0-muqss-190.patch against 5.0.7, and the incremental patch-5.0.6-7 against an already MuQSS-patched 5.0.6) if you simply increase the fuzz value when patching.
OK.
How?
patch -F x
Where x is the fuzz number, 2 by default. Try 3 or 4. Do not go too large, as it may cause incorrectly applied hunks. The fuzz number increases the number of context lines patch is allowed to mismatch when figuring out where a hunk should be applied in the file.
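A concrete invocation might look like this (paths illustrative):
cd linux-5.0.7
patch -p1 -F 3 < ../5.0-muqss-190.patch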
Thank you very much.
The failed segment looks like this with the 5.0.7 kernel:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ba4d9e8..37cf9d5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -128,8 +128,14 @@ static int __maybe_unused two = 2;
static int __maybe_unused four = 4;
static unsigned long one_ul = 1;
static unsigned long long_max = LONG_MAX;
-static int one_hundred = 100;
-static int one_thousand = 1000;
+static int __read_mostly one_hundred = 100;
+static int __read_mostly one_thousand = 1000;
+#ifdef CONFIG_SCHED_MUQSS
+extern int rr_interval;
+extern int sched_interactive;
+extern int sched_iso_cpu;
+extern int sched_yield_type;
+#endif
#ifdef CONFIG_PRINTK
static int ten_thousand = 10000;
#endif
Offsets are just cosmetic, so you don't need to change anything.
Build fails.
I'd like to add that with 5.0.7 the patch might also fail because of the new $(LIBELF_FLAGS) in tools/objtool/Makefile.
This can be remedied with a short sed command before applying the ck patchset:
sed -i '/-CFLAGS/ s/$/ \$(LIBELF_FLAGS)/' patch-5.0-ck1
As for the rejects in kernel/sysctl.c, these can be worked around with a fuzz factor of 3 as pointed out above.
With these two tweaks I was able to build 5.0.7 with the -ck patches applied.
For several kernel versions now we have seen point releases break MuQSS, or at the very least complicate its installation.
We need volunteers to start keeping the individual MuQSS versions in sync with their respective kernel versions.
So you mean a git repository?
Sure, yes. Just something to help us keep track of what additional patches are necessary to keep MuQSS working in spite of individual kernel point releases doing their best to break it.
For now, information on how to counter cases of this happening is spread all over the place. Some of it is here on ck's blog, some just arbitrarily on random sites to be found with a Google search.
It'd be better to roll all that information together. If anything, I'd personally recommend maintaining at least 2 versions (see kernel.org): the most recent stable and the most recent LTS.
Yeah, a git-rebased version referencing point releases would be very nice.
I update my kernel patches at https://github.com/SveSop/kernel_cybmod/tree/MuQSS, but I have NO WAY to do any testing other than with the current config I personally use. So even if a rebased/fixed MuQSS plus the extra patches I use works for my .config, it does not really indicate that a point release has not broken something I don't use (i.e. AMD and the like).
So, I vote for someone to maintain this in a proper manner :)
I'm doing the same, updating the patches for my specific config. I agree that a universal repo would be nice, though.
[ 1.535455] ------------[ cut here ]------------
[ 1.535460] Current state: 1
[ 1.535464] WARNING: CPU: 1 PID: 0 at 0xffffffff8108c865
[ 1.535466] Modules linked in:
[ 1.535469] CPU: 1 PID: 0 Comm: MuQSS/1 Not tainted 5.0.7-ck1 #3
[ 1.535471] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
[ 1.535474] RIP: 0010:0xffffffff8108c865
[ 1.535476] Code: 04 77 29 89 f1 ff 24 cd 38 76 a0 81 80 3d 53 1b bd 00 00 75 17 89 c6 48 c7 c7 90 c6 ad 81 c6 05 41 1b bd 00 01 e8 7b ae fa ff <0f> 0b 48 83 c4 08 5b c3 48 8b 47 60 48 85 c0 75 64 83 fe 03 89 73
[ 1.535480] RSP: 0018:ffff888437c43f50 EFLAGS: 00010082
[ 1.535482] RAX: 0000000000000010 RBX: ffff888437c504c0 RCX: ffffffff81c1fdb8
[ 1.535483] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffff81f8fcac
[ 1.535485] RBP: 7fffffffffffffff R08: 00000000000001f0 R09: 0000000000000000
[ 1.535487] R10: 0720072007200720 R11: 0720072007200720 R12: 7fffffffffffffff
[ 1.535489] R13: ffff888437c56900 R14: ffff888437c569f8 R15: ffff888437c56a38
[ 1.535491] FS: 0000000000000000(0000) GS:ffff888437c40000(0000) knlGS:0000000000000000
[ 1.535493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.535494] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006e0
[ 1.535496] Call Trace:
[ 1.535498]
[ 1.535500] 0xffffffff8108e7cb
[ 1.535502] 0xffffffff810817cb
[ 1.535503] 0xffffffff81601568
[ 1.535505] 0xffffffff8160117f
[ 1.535506]
[ 1.535507] RIP: 0010:0xffffffff8100f592
[ 1.535509] Code: 0f ba e0 24 72 11 65 8b 05 bb eb ff 7e fb f4 65 8b 05 b2 eb ff 7e c3 bf 01 00 00 00 e8 17 e0 07 00 65 8b 05 a0 eb ff 7e fb f4 <65> 8b 05 97 eb ff 7e fa 31 ff e8 ff df 07 00 fb c3 66 66 2e 0f 1f
[ 1.535512] RSP: 0018:ffffc9000007bf00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 1.535515] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
[ 1.535516] RDX: 000000005b7f1466 RSI: 0000000000000001 RDI: 0000000000000380
[ 1.535518] RBP: ffffffff81c601a8 R08: 0000000000000000 R09: 0000000000019840
[ 1.535520] R10: 0000001e3c819be7 R11: 000000007260bc7a R12: 0000000000000000
[ 1.535522] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.535524] 0xffffffff8105bd2f
[ 1.535526] 0xffffffff8105bf5b
[ 1.535527] 0xffffffff810000d4
[ 1.535529] ---[ end trace 71fe021b29fa5d1f ]---
I am having this problem on all my Phenom II systems. It looks like some kind of interrupt problem; I tried enabling the nothreadedirqs option and also enabled the fix for broken boot IRQs option, but neither had any effect on this. I enabled stack traces, but for some reason it doesn't show them :x
Does anyone know what this might be? I searched around and found this, which may be helpful: https://pastebin.com/y0aXvBNP
(this is not mine but it looks very much like mine)
thanks :)
here is more information: https://gist.github.com/ManoaNosea/69f698e40661be29df5016143bd81a86
I did get interrupt-related panics as well on my machines. In my case, it seemed to be related to the bfq I/O scheduler (the call trace showed its "try_to_wake_up" function on top).
What I/O scheduler are you using? I have switched to mq-deadline as an experiment and haven't had panics since.
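If you want to try the same experiment, the I/O scheduler can be changed at runtime through sysfs (sda is just an example device):
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
The first command lists the available schedulers, with the active one in brackets.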
Tried to reproduce the issue again, and indeed the instruction pointer is always somewhere in the "bfq_bfqq_expire" function.
I have some additional stuff in my kernel, though, so you might be facing a different issue. Anyway, check whether you're using BFQ or not.
I personally was kind of expecting this. With the removal of the legacy I/O schedulers, BFQ is all but the default (not to mention the single best choice; the other MQ schedulers are simply not fit for desktop use).
Using BFQ is inevitable, required even. Again, all other options are simply unrealistic in terms of latency and bandwidth.
Sad to see it happening as I predicted: MuQSS and BFQ are no longer playing nice. It's unrealistic to expect BFQ to move over; it is being promoted hard as the de-facto standard I/O scheduler.
I used Kyber in all of this.
Interesting. I can't reproduce the issue with Kyber, so the I/O scheduler doesn't seem to be the main culprit...
I was curious and checked the addresses of your github trace against a 5.0.7-ck kernel with debug symbols. It would seem that RIP is stuck at line 364 of arch/x86/events/intel/ds.c:
ds->pebs_buffer_base = (unsigned long) cea;
It apparently arrived there via a call to "reserve_ds_buffers".
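(For anyone wanting to repeat that kind of lookup: with a vmlinux that has debug info, addr2line resolves raw addresses to function and file:line, including inlined frames, e.g.:
addr2line -e vmlinux -f -i -p 0xffffffff8108c865
The address here is the RIP from the first trace above.)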
The first call trace has two frames of x2apic_phys_probe(). As for the bottom trace, it seems to look something like this:
hpet_alloc()
__mmdrop()
iommu_group_alloc()
copy_reserved_iova()
trace_event_raw_event_task_rename()
trace_event_raw_event_iommu_error()
clear_page_presence() ---> ./arch/x86/include/asm/paravirt.h:482
clear_page_presence() ---> arch/x86/mm/kmmio.c:161
0xffffffff81078e3b ---> line 30 of arch/x86/kernel/apic/x2apic_phys.c
enable_mmiotrace()
0xffffffff8105053d ---> line 330 of ./include/linux/bitmap.h
rdtgroup_locksetup_enter()
usbdev_release()
I am using Kyber for my spinning disk and none for my NVMe, and have not had any issues like this though.
I have tried a bit with BFQ, but am not really that impressed on NVMe, even though Phoronix had a lengthy article about improved startup times and whatnot.
However, I have dabbled with a wee bit of patchwork for this that I found here: https://github.com/sirlucjan/kernel-patches/tree/master/5.0
I weeded out what I found that patched cleanly (and have verified 2-3 of them actually being implemented upstream as-is): https://github.com/SveSop/kernel_cybmod/tree/MuQSS/blk-patches
I am not saying or claiming this is a solve-it-all, and by no means want to stand responsible for any data corruption from this. But if you want to experiment, feel free. The patches apply to the 5.0.7 source.
thanks :) it's a big help :) I will test all these functions :)
but it looks very strange; these functions show problems in things like HPET and x2apic and IOMMU (I don't really know these functions; maybe this is not necessarily a problem with the kernel at all, maybe the hardware itself is bad), but it shows that it affects MuQSS, and I don't understand why... could MuQSS be calling these functions?
it is strange because I am not running any virtualization; only python3 was running on the computer :x
Indeed, from the trace I was at first thinking of virtualization as well. It is also strange that some Intel-specific stuff was (apparently) executed on a Phenom processor. But then I'm not a kernel dev either, so...
Was there anything else running apart from python3?
For your investigations, I'd check the value of your kernel's CONFIG_KALLSYMS and recompile with it set to yes. That way the panic message should contain function names and offsets instead of just addresses. In the end, your kernel might be very different from the one I used for the check.
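A quick way to flip that from the kernel source tree, using the kernel's own scripts/config helper (a sketch):
scripts/config -e KALLSYMS -e KALLSYMS_ALL
make olddefconfig
Then rebuild, and the next trace should come out symbolized.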
I disabled HPET, IOMMU and x2apic, but it didn't fix it: https://gist.github.com/ManoaNosea/7ceec21d49d87dc679265a1371c0433e
ReplyDeletethank now I know whay he didn't gived codes and instead gived addresses
[ 1.864822] Current state: 1
[ 1.864829] WARNING: CPU: 2 PID: 0 at clockevents_switch_state+0x45/0xe0
[ 1.864830] Modules linked in:
[ 1.864833] CPU: 2 PID: 0 Comm: MuQSS/2 Not tainted 5.0.7-ck1 #10
[ 1.864835] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
[ 1.864839] RIP: 0010:clockevents_switch_state+0x45/0xe0
[ 1.864841] Code: 04 77 29 89 f1 ff 24 cd 38 75 a0 81 80 3d 53 24 bd 00 00 75 17 89 c6 48 c7 c7 c8 9e b7 81 c6 05 41 24 bd 00 01 e8 db af fa ff <0f> 0b 48 83 c4 08 5b c3 48 8b 47 60 48 85 c0 75 64 83 fe 03 89 73
[ 1.864844] RSP: 0018:ffff888437c83f50 EFLAGS: 00010082
[ 1.864846] RAX: 0000000000000010 RBX: ffff888437c904c0 RCX: ffffffff81c1f618
[ 1.864848] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffff81d41c04
[ 1.864850] RBP: 7fffffffffffffff R08: 00000000000001e6 R09: 0000000000000000
[ 1.864852] R10: 0720072007200720 R11: 0720072007200720 R12: 7fffffffffffffff
[ 1.864854] R13: ffff888437c96900 R14: ffff888437c969f8 R15: ffff888437c96a38
[ 1.864856] FS: 0000000000000000(0000) GS:ffff888437c80000(0000) knlGS:0000000000000000
[ 1.864858] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.864860] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006e0
[ 1.864861] Call Trace:
[ 1.864864]
[ 1.864866] ? tick_program_event+0x4b/0x80
[ 1.864869] ? hrtimer_interrupt+0x12b/0x220
[ 1.864872] ? smp_apic_timer_interrupt+0x48/0xa0
[ 1.864874] ? apic_timer_interrupt+0xf/0x20
[ 1.864875]
[ 1.864877] ? amd_e400_idle+0x32/0x60
[ 1.864880] ? do_idle+0x1cf/0x280
[ 1.864882] ? cpu_startup_entry+0x1b/0x20
[ 1.864884] ? secondary_startup_64+0xa4/0xb0
it looks like an hrtimer problem, but the kernel options do not allow disabling that, so I enabled HPET and booted with clocksource=hpet, but the boot error is the same :x
could it be a dynticks idle problem?
I disabled dynticks idle and the boot error is gone.
Some further confirmation for you: Liquorix uses MuQSS and disabled tickless idle a long time ago: https://github.com/damentz/liquorix-package/blob/5.0-7/linux-liquorix/debian/config/kernelarch-x86/config-arch-64#L85
I've settled on 250Hz since it gives decent timekeeping without the negative effects on power consumption that 1000Hz causes.
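For anyone wanting to mirror that, something like this against an existing .config should do it (a sketch via scripts/config; option names as of 5.0):
scripts/config -d NO_HZ_IDLE -e HZ_PERIODIC -d HZ_100 -e HZ_250
make olddefconfig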
On an AMD Ryzen 2400G APU, booting MuQSS with mc (the default) results in 25 runqueues (which is more than odd, because this is a 4-core, 8-thread CPU):
kernel: MuQSS runqueue share type MC total runqueues: 25
When using virtualization (KVM), a lot of errors like these appear:
apr 15 16:44:52 kernel: BUG: using smp_processor_id() in preemptible [00000000] code: CPU 4/KVM/14536
apr 15 16:44:52 kernel: caller is single_task_running+0xe/0x30
apr 15 16:44:52 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350M Pro4, BIOS P5.70 03/14/2019
apr 15 16:44:52 kernel: Call Trace:
apr 15 16:44:52 kernel: dump_stack+0x65/0x8a
apr 15 16:44:52 kernel: debug_smp_processor_id+0xe8/0xf0
apr 15 16:44:52 kernel: single_task_running+0xe/0x30
apr 15 16:44:52 kernel: kvm_vcpu_block+0x230/0x370 [kvm]
apr 15 16:44:52 kernel: kvm_arch_vcpu_ioctl_run+0x312/0x1cc0 [kvm]
apr 15 16:44:52 kernel: kvm_vcpu_ioctl+0x24b/0x630 [kvm]
apr 15 16:44:52 kernel: ? hrtimer_start_range_ns+0x1ce/0x360
apr 15 16:44:52 kernel: do_vfs_ioctl+0xa9/0x760
apr 15 16:44:52 kernel: ? __schedule+0xa5a/0xd90
apr 15 16:44:52 kernel: ? __fget+0x73/0xa0
apr 15 16:44:52 kernel: __x64_sys_ioctl+0x6a/0xa0
apr 15 16:44:52 kernel: do_syscall_64+0x5a/0x110
apr 15 16:44:52 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
apr 15 16:44:52 kernel: RIP: 0033:0x7f25c93995d7
Using the standard Ubuntu mainline kernel, there are no errors.
Here's the output of "virsh capabilities" (if that helps at all):
I should have clicked preview before; blogspot ate all my XML :D
Here's the same in pastebin: https://pastebin.com/8eHJ2sc8
but it again gives kernel problems when running: https://gist.github.com/ManoaNosea/61fff8a1504b869d352b938528a5a4ab
This might be of help for your case:
https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt
thanks, I tested the same .config and the same kernel without the patch - no problems at all, so this is a MuQSS problem; I think it doesn't work with AMD
I admire your patchset. As a suggestion to boost single-threaded performance, will the scheduler allow a single thread to hog a core?
This would boost FPS in games and web app run times.
If you disable all runqueue sharing you will get more binding to one core, but latency will still dictate that something else will knock it off the core, even with multiple cores. You can't have it both ways. Ultimately latency comes at some expense to throughput unless you have infinite cores.
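(For what it's worth, you can still pin a hot thread yourself from userspace and get most of the cache-warmth benefit regardless of scheduler; illustration only, the binary name is made up:
taskset -c 3 ./mygame
launches the process affined to CPU 3.)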
DeleteAnother tool to test schedulers in "gaming" workflow. used to test VariableRefreshRate implementations and visual research. github.com/kleinerm/VRRTestPlots
ReplyDeleteDid anyone adapt this to 5.1 yet?