Announcing a new -ck release, 4.19-ck1 with the latest version of the Multiple Queue Skiplist Scheduler, version 0.180. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
In addition to a resync from 4.18-ck1, there are a number of minor accounting fixes, and I've dropped BFQ from being enabled by default. I've been less than impressed with its latency over the last two kernel releases, and recommend people use another I/O scheduler.
EDIT: Apparently patch 0008 has one hunk that is out of place. It still should work fine even if this fails to apply. I don't know why git was happy with that part of the patch...
Thanks so much.
Thank you for your work. I agree about BFQ. Back in the BFS days I also patched the BFQ scheduler in, as it was a joy to use. Nowadays it skips audio during heavy I/O loads just as much as CFS+CFQ :(
The Linux kernel is becoming so bloated and unoptimized, I'm contemplating just moving to the BSD family. At least there I have some notion of determinism in my workloads.
BFQ has only recently gotten very good. If you've been running mainline, the final patches that make BFQ worthwhile probably just got merged in 4.19. Otherwise, if you've been running the Algodev/bfq-mq branch already, then BFQ has been behaving properly for a few months now.
Secondly, the legacy block subsystem caused a lot of problems that BFQ has no control over. The newer 'blk-mq' subsystem seems to avoid the problems the legacy block layer incurred, letting BFQ do its thing.
I doubt the kernel developers debating with Paolo on making his IO scheduler default will read this blog post, but if they do, it'll definitely add pressure against his efforts.
Fortunately, data talks, and Paolo backed up his scheduler with reproducible benchmarks for the skeptical.
mq-deadline, all good.
Well, the problem, damentz, is that blk-mq only really flies on SSDs. I've got an HDD, and all blk-mq schedulers simply stall (not hang, mind you, just a very lengthy stall) during boot, implying an issue with blk-mq itself.
That makes BFQ completely irrelevant at that point. It is supposed to improve throughput and latency for all devices, including but not limited to HDDs. But it does anything but that at this point.
All benchmarks I can perform, all measurements and all "feeling the waters" in normal use point towards CFQ, despite its age, being far, far ahead of BFQ.
For my part, I hope it never gets made default. The idea was sound; the execution is simply flawed. I'd rather wait for 4.20 and the intended Kyber rework, which looks far more promising.
Addendum: not to mention the fact that BFQ has been unstable for months now. And yes, I regularly still test it out just to see how things have been progressing.
I get random kernel panics whenever I use BFQ, and only when I use BFQ. It is the sole remaining variable.
I use mq-deadline with HDDs, no problem.
Boot drives are SSDs in all of my machines though.
And for sure, you have reported all the panics and stalls to the BFQ developers.
Would you be willing to execute one test? It boils down to running one script, which would take about five minutes. The results would tell us the exact latency you are experiencing, with and without BFQ. If you accept, I only ask you to run the test on a 4.20 kernel (currently at rc2, IIRC), to avoid any need for a retry.
> Unknown 17 November 2018 at 20:49
Paolo, fix your profile name to become recognisable ;).
Actually there are "three" BFQ schedulers: bfq, bfq-mq and bfq-sq. The first is included in the upstream kernel and is available *ONLY* when multiqueue is enabled (e.g. by specifying scsi_mod.use_blk_mq=1 on the boot cmdline). The other two are available by adding an external patchset, obtainable from the Algodev site, which includes the latest version of the multiqueue BFQ (bfq-mq, basically the same features as upstream bfq) as well as the version for single-queue mode (bfq-sq), which is still the default for most distros. Without the extra patchset there is no way to use BFQ in single-queue mode on the vanilla upstream kernel.
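As a side note, whichever variants are built in, the active scheduler for a block device can be inspected and switched at runtime through sysfs. A minimal configuration sketch (the device name sda is an assumption; which scheduler names appear depends on what your kernel was built with and whether blk-mq is enabled):

```shell
# List the available schedulers for a device; the active one is shown in brackets,
# e.g.: noop deadline cfq [bfq-sq]   (single queue)  or  mq-deadline kyber [bfq] none  (blk-mq)
cat /sys/block/sda/queue/scheduler

# Switch the device to bfq (needs root; the name must appear in the list above)
echo bfq | sudo tee /sys/block/sda/queue/scheduler
```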
Anyway, I found that when bundled with MuQSS 0.180, BFQ still has good responsiveness, and many tests show it performing better than many other schedulers.
Here is a little test, performed with kernel-joeghi 4.19 on Mageia Linux 6.1, with MuQSS and BFQ enabled. As you can see, under heavy I/O workload gnome-terminal can't even start within 120 seconds on some schedulers, while on BFQ it starts promptly (X = did not start within 120 seconds):
# Workload    bfq-sq    cfq
0r-seq        1.5775    6.4025
10r-seq       1.75      50.645
5r5w-seq      5.67      X

# Workload    bfq-sq    cfq
0r-seq        1.3625    6.655
10r-seq       1.7225    44.1525
5r5w-seq      2.885     X
The situation is not much different on multiqueue, where both bfq and bfq-mq perform with the lowest latency:
# Workload      bfq       bfq-mq    mq-deadline   kyber     none
0r-raw_seq      1.71      1.7975    14.4975       7.7075    6.1525
10r-raw_seq     2.0025    2.0725    X             X         X

# Workload      bfq       bfq-mq    mq-deadline   kyber     none
0r-raw_seq      0.955     1.3125    1.77          2.3875    2.495
10r-raw_seq     0.67      0.6725    X             X         X
The tests were run using the Algodev/S benchmark (available on GitHub), with the following commands for single queue and multiqueue respectively:
$ ./run_main_benchmarks.sh replayed-startup "bfq-sq cfq"
$ ./run_main_benchmarks.sh replayed-startup "bfq-sq cfq noop deadline"
$ ./run_main_benchmarks.sh replayed-startup "bfq bfq-mq mq-deadline kyber none"
I find this exceedingly ironic:
It just confirms everything I outlined earlier; something is just off with recent I/O scheduling development in general, and maybe not with BFQ alone.
Although I have to note that the BFQ panics I experienced were with blk_mq=0. Still, read that particular post, or heck, read the entire thread. If even developers are talking about possibly considering blk_mq unfit for production, that has to count for something.
Regarding needing a patchset to enable BFQ on SQ: of course, that is just silly. Fact is, with modern SSDs (the primary device class for which MQ was implemented in the kernel) a good case could be made for having no traditional I/O scheduler at all.
Not to mention that any gains BFQ might still offer over having no traditional I/O scheduler on such devices are marginal at best.
The largest gain, and therefore the strongest argument in favour of BFQ (irrespective of the queue model), is in fact on SQ devices. But that requires extra patches which may or may not be in sync with mainline.
Honestly... too much of a hassle like that. Either mainline SQ support for BFQ or just forget about it in general. Particularly with blk_mq apparently being far from completely reliable, as evidenced by the recent corruption errors (again, see link).
Note that devices might be different; you might have fast SSDs, NVMe drives and mechanical hard disks all in the same machine. From what I could see, BFQ is the one that on "average" has the best performance and lowest latency on any device, i.e. you don't need to pick a particular scheduler, including "none" or "noop", for SSDs. Sort of an "all-weather" queue scheduler. Regarding performance, there was a newer benchmark on Phoronix which matches the tests posted above; see:
As you can see, it sometimes has better performance than "none".
Nevertheless, being unpopular, the SQ stack is planned to be removed definitively upstream starting from 4.21, because it seems difficult to maintain two stacks; so there won't be any possibility of BFQ-SQ there, because there would be no layer to attach to.
Regarding the BFQ (MQ+SQ) patchset, it's kept in sync with mainline at the Algodev site (currently at 4.20rc, IIRC). What is lacking is maybe an official BFQ-MQ patchset for the "stable" kernel (indeed sirlucjan is maintaining one), like there was in the past up to 4.4; but I think the problem arose with the merging of BFQ upstream, which apparently has been just a subset of the original project (which includes both SQ and MQ schedulers).
Hello, which timer frequency should I set, 100 or 1000?
1000 seems to be more snappy on my machine.
Yeah, same for me.
The advised Hz seems to be 100, though I found it performs well at 250 Hz too. At 1000 the machine becomes too "nervous" for me.
If using MuQSS alone I would recommend 1000. If using MuQSS with the other -ck patches, 100Hz should work well since it has high-resolution timers at 2000Hz.
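For reference, the tick rate is a build-time choice; a sketch of the relevant Kconfig options (option names taken from the mainline Kconfig; exactly one CONFIG_HZ_* should be set):

```
# With the full -ck patchset (high-resolution timers cover fine-grained timing):
CONFIG_HZ_100=y

# With MuQSS alone, a faster tick is preferable:
# CONFIG_HZ_1000=y
```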
BTW, here is the output of the schbench latency benchmark, measured in usec, with kernel 4.19.6+ck (the CPU has 4 cores + 4 HT) at 100Hz:
schbench -t 16 -m 2
Latency percentiles (usec)
On a plain kernel the value of the 99th percentile is much higher. What are the typical values you achieve (using, for instance, twice the number of available cores as the -t argument)?
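The "-t = twice the core count" suggestion can be scripted rather than computed by hand; a small sketch (it only prints the invocation, since schbench being on the PATH is an assumption):

```shell
# Derive the thread count from the number of online CPUs
threads=$(( $(nproc) * 2 ))
# Print the resulting invocation (drop the echo to actually run it)
echo "schbench -t $threads -m 2"
```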
Thank you for 4.19-ck. I can't find anything special to report after a week of usage; zero problems here. It runs totally great.
Great, thanks. As the saying goes, no news is good news.
Running great here, interactivity is excellent, but on a seemingly idle system with NOHZ_FULL, accounting is way off:
Tasks: 367 total, 2 running, 267 sleeping, 0 stopped, 10 zombie
%Cpu0 : 21,5 us, 1,5 sy, 0,0 ni, 76,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 : 9,0 us, 61,5 sy, 0,0 ni, 29,5 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 1,1 us, 52,9 sy, 0,0 ni, 46,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 1,1 us, 54,1 sy, 0,0 ni, 44,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 2,6 us, 50,8 sy, 0,0 ni, 46,6 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 1,0 us, 49,8 sy, 0,0 ni, 49,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 1,0 us, 51,5 sy, 0,0 ni, 47,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu7 : 0,5 us, 60,4 sy, 0,0 ni, 39,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 16172800 total, 9904280 free, 2340432 used, 3928088 buff/cache
KiB Swap: 16777212 total, 16777212 free, 0 used. 12488688 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4576 username 1 0 1367024 138900 86212 S 6,9 0,9 0:06.91 chromium-browse
1899 username 47 0 1245920 102160 57840 S 3,0 0,6 0:28.25 compiz
1370 root 2 0 1398896 131420 115320 S 2,0 0,8 0:16.57 Xorg
2963 username 1 0 670404 41836 30428 S 2,0 0,3 0:08.33 gnome-terminal-
1004 root 1 0 453632 18232 13936 S 1,0 0,1 0:01.83 NetworkManager
1594 username 1 0 210848 9492 6284 S 1,0 0,1 0:01.38 gnome-keyring-d
1678 username 1 0 46380 4740 2852 S 1,0 0,0 0:04.81 dbus-daemon
2068 username 1 0 927696 40208 35484 S 1,0 0,2 0:03.12 clickshare-laun
3059 username 7 0 44920 4076 3232 R 1,0 0,0 0:05.57 top
10788 username 1 0 1291448 96296 75792 S 1,0 0,6 0:00.33 chromium-browse
Can you please look at it, or is NOHZ_FULL outside your interests?
I've always recommended not using it for desktops or mobile devices, so it's a miracle it even boots and runs. I doubt I'll ever get time to rewrite the CPU accounting entirely in order for it to work under nohz_full (which is what it would take).
Ironically, MuQSS time accounting only seems to work well on my systems with a periodic tick. I recently did an investigation to find better defaults for Liquorix, and the only way to feed ondemand (and intel_pstate) idle data properly was to give it a periodic tick of 250hz or higher.
At 100hz, I found that ondemand still thought the cores were under load and would randomly increase core frequency even though that particular core was reporting 1% C0 and nearly 99% C7 in i7z.
250hz was a nice compromise: ondemand would properly leave untouched cores at their lowest frequency for the most part while idle.
And 1000hz, although just a bit more accurate with frequency ramping, prevented any of my cores from staying in C7 for more than 85% of the time while idle. This increased temperatures a full 8-10°C while completely idle on my Latitude E7450 (work laptop).
Leaving my kernel at 250hz is fine for me, but I already received a report on the linux-lqx AUR that 250hz increases underruns for certain audio configurations. You can find it in the AUR comments here: https://aur.archlinux.org/packages/linux-lqx/
You can set the buffer size in the kernel config.
You mean the sampling rate? The problem is that during a sample, without updated load data, ondemand will believe a core that was previously active is still active, even though the core is genuinely idle. I think increasing sampling_rate (sampling less often) will just make ondemand too coarse, and reducing it will make it sample the same bad data more often, exacerbating the problem.
You said underruns.
I suppose you mean buffer underruns.
Increase buffer size?
Ok, it's only for HDA.
I think increasing the buffer increases latency. In this case, the particular comment mentioned sub-2ms latency, probably for realtime audio mixing and redirection through software.
I think there's no way around 1000 Hz then.
Personally I'm running MuQSS with full dynticks (tickless) at 100Hz and it is running just fine. Having said that, it does return some odd reports.
The task manager in use is suggesting all 4 cores are constantly nearly capped. Which is far from the case, obviously.
Important to note, though, is that the kernel does seem to be heading towards a full dynticks state. It has been pushed in that direction for quite a few years now, and the Linux kernel is not the only kernel heading that way.
Not jumping on board there with MuQSS would be incredibly unwise.
Eventually, one can fully expect ticks to end up being removed completely from mainline. It is all but inevitable and in fact, I agree completely with this movement towards that end result.
If for no other reason than that a full dyntick kernel will typically be the more energy-efficient kernel; and with the UN and numerous other organizations making it dreadfully clear that we have probably missed the boat on turning the tide of climate change, it would be irresponsible not to focus on efficiency.
Thanks for your report. What you say would make sense except that full dynticks does not do what you think it does. It disables ticks on active cores in order to increase throughput when tasks are heavily affined to those CPU cores for very specific workloads, at the expense of latency in other areas; it was never made with lowering power consumption on PCs and mobile devices in mind. Idle dynticks is what disables unnecessary ticks to conserve power, and that is fully supported by MuQSS.
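The distinction maps directly onto the kernel's tick-handling Kconfig choice; a sketch of the relevant options (names from the mainline Kconfig):

```
# Idle dynticks: the tick stops only on idle CPUs. This is the power-saving
# mode, and the one MuQSS supports.
CONFIG_NO_HZ_IDLE=y

# Full dynticks: the tick also stops on busy CPUs running a single task,
# a throughput feature for heavily affined workloads, not a power feature.
# CONFIG_NO_HZ_FULL=y
```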
I think you want constant ticks for latency reasons.
@ck: be that as it may, the point remains that kernel development is moving away from ticks in general. I found quite a bit of reference to continued development in that direction while doing some quick research on the subject.
I personally feel that full dynticks should be fully supported by MuQSS. Particularly because in my opinion MuQSS is the way forward as far as CPU scheduling is concerned. Its codebase is simple(-ish), straightforward while still being highly configurable to any workload.
So any effort you could make to ensuring that even full dynticks and not just idle dynticks are supported by MuQSS would be preferable, in my humble opinion.
Just my 2 cents' worth. Just call it a vote of confidence in your work so far.
Yesterday I was testing the performance of a Ryzen7 8c16t CPU and a Ryzen5 APU 4c8t CPU in games using MuQSS, and found that something is not exactly right (at least to me) with the runqueue count.
Can you please look at the locality and runqueues? What bothers me is the Ryzen5, because its runqueues seem weird for that CPU.
Ryzen5 does indeed look weird, with apparently more runqueues than there are CPUs. I wonder if internally the CPU behaves like it has more physical slots but they're unpopulated, so the "extra" runqueues are actually dormant. If it works okay I probably wouldn't worry.
Well, the main thing is that performance on the Ryzen5 2400G with MuQSS is not great compared to standard kernels or PDS; at least the benchmark numbers show that. It works, no doubt about that, even with an odd number of queues :)
As far as I know, on the 2400G all cores are in one physical die (CCX), unlike Ryzen7 which has 2 dies (CCXs).
Hence the question: in MC mode shouldn't the Ryzen5 2400G have 1 runqueue and the Ryzen7 2, and in SMT mode 4 and 8 respectively?
A somewhat belated input: you may want to try a single runqueue regardless of the make or model of the CPU. Particularly for gaming I have personally found that a single runqueue simply performs better than multiple runqueues.
To force this, I personally use rqshare=smp.
I think you might even find that this completely closes whatever gap might have existed before between CFS and PDS, for your particular workload.
Yeah, the problem is that with smp I get 9 runqueues, which is odd. So I cannot enforce just one runqueue with any rqshare parameter.
For the 2400G the runqueues seem out of order; the Ryzen7, however, seems to have one runqueue when using mc.
I don't know whether that's correct or not.
rqshare=smp should never result in 9. It should ALWAYS result in 1, EVEN in, say, a dual-socket board, since with rqshare=smp the queues are shared between actual physical packages. Which is why I elected to use rqshare=smp: so there can be no doubt as to my intention, a single queue.
rqshare=none is the polar opposite of that and will spawn as many queues as there are logical CPUs.
NUMA nodes are never shared, so there are still multiple queues if the CPU architecture registers as NUMA. I plan to make closing that gap optional in the future as well.
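For anyone wanting to try this: rqshare is a kernel boot parameter, so it goes on the kernel command line. A configuration sketch, assuming a GRUB-based distro (the file path and variable name are GRUB conventions; run update-grub and reboot afterwards):

```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet rqshare=smp"

# After rebooting, confirm the parameter took effect:
#   cat /proc/cmdline
# and check the MuQSS runqueue layout reported at boot:
#   dmesg | grep -i -e muqss -e runqueue
```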
4.19.6 is a rocket. Thanks.
The next kernel, 4.19.7, will have this patch, which will break the MuQSS patchset:
... and also this one:
4.19.7 is out.
Waiting for the official fix from Con; in the meanwhile, for the impatient, here are two patches that should make the CK patchset work with 4.19.7:
The first one is a patch of the patch 0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch; the second one needs to be applied in sequence after all the others.
Thank you very much.
I've not seen a point release break MuQSS before. At least not in a good, long while.
Personally I'll wait for Con's input on this, but still, good work on those patches, @anon.
I can't apply the first patch. What am I doing wrong? I am in the "patches" directory from the broken-out ck1:
wget -O 1.patch https://pastebin.com/raw/HFuG7ide
patch -i 1.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file 0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch
Hunk #1 FAILED at 907.
patch unexpectedly ends in middle of line
Hunk #2 succeeded at 931 with fuzz 2.
1 out of 2 hunks FAILED -- saving rejects to file 0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch.rej
I think pastebin mangles the whitespace or something... I followed the patch manually and created one that seems to work fine:
Thank you, and kudos to the Anonymous user who created these two patches! For Arch, I incorporated them into 4.19.8-ck and pushed to the AUR. Please give it a try and post to the AUR or here with feedback.
A workaround for the pastebin problem should be to scroll down a bit and copy the Raw version instead. That should not have been tampered with in any way.
Looks like this patch failed on 4.19.7:
patch -p1 < ../muqss/0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch
patching file Documentation/admin-guide/kernel-parameters.txt
Hunk #1 succeeded at 4005 (offset 4 lines).
patching file Documentation/scheduler/sched-BFS.txt
patching file Documentation/scheduler/sched-MuQSS.txt
patching file Documentation/sysctl/kernel.txt
patching file arch/powerpc/platforms/cell/spufs/sched.c
patching file arch/x86/Kconfig
Hunk #1 FAILED at 1009.
Hunk #2 succeeded at 1033 (offset -10 lines).
1 out of 2 hunks FAILED -- saving rejects to file arch/x86/Kconfig.rej
Probably, as graysky said, pastebin swallowed some blanks. The original fixed/reworked patch that applies to the current kernel 4.19.8 (or the latest 4.19.7) is contained in the Mageia Linux kernel-419-joeghi package. You can extract it directly from their svn repo (svnweb.mageia.org); e.g. these commands work, and the patches apply without errors apart from some hunk offsets:
wget -O linux-4.19.8-419-joeghi4.tar.xz http://binrepo.mageia.org/798a98dc158f294bb0ed10182f9a533f897f7f41
tar -xvf linux-4.19.8-419-joeghi4.tar.xz 4.19.8-419-joeghi4/patches/0001-MultiQueue-Skiplist-Scheduler-version-v0.180_reworked.patch --strip-components=2
tar -xvf linux-4.19.8-419-joeghi4.tar.xz 4.19.8-419-joeghi4/patches/0001-MultiQueue-Skiplist-Scheduler-version-v0.180_fix_sched_smt_present.patch --strip-components=2
tar -xf linux-4.19.tar.xz
xz -cd ../patch-4.19.8.xz | patch -s -p1
patch -p1 < ../0001-MultiQueue-Skiplist-Scheduler-version-v0.180_reworked.patch
patch -p1 < ../0001-MultiQueue-Skiplist-Scheduler-version-v0.180_fix_sched_smt_present.patch
Thank you very much.
Thanks, and can we expect a sync this year?
I've never kept in sync with minor releases, I'm afraid. I'll likely only sync up again with 4.20.
Shame, that. You might want to consider doing something akin to what the Linux devs do: maintainers of MuQSS for each Linux minor version, so they can keep it working when a point release breaks it.
Which is bound to happen sooner or later, as it just did with 4.19.7. And with 4.19 being LTS (IIRC), it'd be better to keep MuQSS synced with it; we'll be seeing quite a few more 4.19 point releases yet.
I hacked together a rebased and patched MuQSS patch from the various info in this thread, here: https://github.com/SveSop/kernel_cybmod/blob/MuQSS/0002_MultiQueue-Skiplist-Scheduler-version-v0.180_v2.patch
Of course I take no responsibility if flames spew out of the back of your computer, nor do I think you can complain to CK if things don't work... but it has worked fine for me so far :)
Thank you very much.