Wednesday, 14 November 2018

linux-4.19-ck1, MuQSS version 0.180 for linux-4.19

Announcing a new -ck release, 4.19-ck1, with the latest version of the Multiple Queue Skiplist Scheduler, version 0.180. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.

linux-4.19ck1:
-ck1 patches:
Git tree:
MuQSS only:
Download:
Git tree:


Web: http://kernel.kolivas.org


In addition to a resync from 4.18-ck1, there are a number of minor accounting fixes, and I've since dropped BFQ being enabled by default. I've been less than impressed with its latency over the last two kernel releases, and recommend people use another I/O scheduler.
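For anyone wanting to check which I/O scheduler a device is currently using, here is a minimal sketch. It assumes the usual sysfs layout, `/sys/block/<dev>/queue/scheduler`, where the kernel marks the active scheduler with square brackets. The function names are made up for illustration.

```shell
#!/bin/sh
# Extract the active scheduler from a sysfs line such as
# "mq-deadline kyber [bfq] none" (the active entry is the bracketed one).
# Prints nothing if the line contains no bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "<device>: <active scheduler>" for every block device that
# exposes a readable scheduler file.
list_active_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Calling `list_active_schedulers` prints one line per device. Switching is done by writing a scheduler name into the same file as root, for example `echo mq-deadline > /sys/block/sda/queue/scheduler`, where `sda` is only a placeholder device name.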

EDIT: Apparently patch 0008 has one hunk that is out of place. It still should work fine even if this fails to apply. I don't know why git was happy with that part of the patch...

Enjoy!
Please enjoy (お楽しみ下さい)
-ck

58 comments:

  1. Thanks so much.

  2. Thank you for your work. I agree about BFQ. Back in the BFS days I also patched the BFQ scheduler in, as it was a joy to use. Nowadays it skips audio during heavy I/O loads just as much as CFS+CFQ :(
    The Linux kernel is becoming so bloated and unoptimized, I'm contemplating just moving to the BSD family. At least there I have some notion of determinism in my workloads.

    Replies
    1. BFQ has only recently gotten very good. If you've been running mainline, the final patches that make BFQ worthwhile probably just got merged in 4.19. Otherwise, if you've been running the Algodev/bfq-mq branch already, then BFQ has been behaving properly for a few months now.

      Secondly, the legacy block subsystem caused a lot of problems that BFQ has no control over. The newer 'blk-mq' subsystem seems to reduce the problems the legacy block layer incurred, letting BFQ do its thing.

      I doubt the kernel developers debating with Paolo on making his IO scheduler default will read this blog post, but if they do, it'll definitely add pressure against his efforts.

      Fortunately, data talks, and Paolo backed up his scheduler with reproducible benchmarks for the skeptical.

    2. mq-deadline, all good.

    3. Well, the problem, damentz... is that blk-mq only really flies well on SSDs. I've got an HDD and any and all blk-mq schedulers simply stall (not hang, mind you, just a very lengthy stall) during boot, implying an issue with blk-mq itself.

      Making BFQ completely irrelevant at that point. It is supposed to improve throughput and latency for all devices, including but not limited to HDDs. But, it does anything but that at this point.

      All benchmarks I can perform, all measurements and all "feeling the waters" in normal use all point towards CFQ, despite its age, clearly being far, far ahead of BFQ.

      For my part, I hope it never gets made default. The idea was sound, the execution is simply flawed. I'd just as soon wait for 4.20 and the intended Kyber rework; it looks far more promising.

    4. Addendum: Not to mention the fact that BFQ has been unstable for months now. And yes, I regularly do still test it out just to see how things have been progressing.

      Getting random kernel panics whenever I use BFQ and only when I use BFQ. It is the sole remaining variable.

    5. I use mq-deadline with HDDs, no problem.
      Boot drives are SSDs in all of my machines though.

    6. And for sure, you have reported all the panics and stalls to BFQ developers.

      Right?

      Right???

    7. Would you be willing to execute one test? It boils down to executing one script, which would take about five minutes. The results would tell us the exact latency you are experiencing, with and without BFQ. If you accept, I only ask you to run that test with a 4.20 kernel (currently rc2 IIRC), to avoid any need for a retry.

    8. > Unknown 17 November 2018 at 20:49

      Paolo, fix your profile name to become recognisable ;).

    9. Actually there are "three" BFQ schedulers: bfq, bfq-mq and bfq-sq. The first one is included in the upstream kernel and is available *ONLY* when multiqueue is enabled (e.g. by specifying scsi_mod.use_blk_mq=1 on the boot command line). The other two become available by adding an external patchset, obtainable from the Algodev site, which includes the latest version of the multiqueue BFQ (bfq-mq, with basically the same features as upstream bfq) as well as bfq-sq, the version for single-queue mode (which is still the default for most distros). Without the extra patchset there is no way to use BFQ in single-queue mode on the vanilla upstream kernel.

      Anyway, I found that when bundled with MuQSS 0.180, BFQ still has good responsiveness, and many tests show it performing better than many other schedulers.

      Here is a little test, performed with kernel-joeghi 4.19 on Mageia Linux 6.1, with MuQSS and BFQ enabled. As you can see, under a heavy I/O workload gnome-terminal can't even start within 120 seconds on some schedulers (shown as X), while it does on BFQ:

      # Workload   bfq-sq   cfq
      0r-seq       1.5775   6.4025
      10r-seq      1.75     50.645
      5r5w-seq     5.67     X

      # Workload   bfq-sq   cfq
      0r-seq       1.3625   6.655
      10r-seq      1.7225   44.1525
      5r5w-seq     2.885    X

      The situation is not much different on multiqueue, where both bfq and bfq-mq perform with the lowest latency:

      # Workload    bfq      bfq-mq   mq-deadline   kyber    none
      0r-raw_seq    1.71     1.7975   14.4975       7.7075   6.1525
      10r-raw_seq   2.0025   2.0725   X             X        X

      # Workload    bfq      bfq-mq   mq-deadline   kyber    none
      0r-raw_seq    0.955    1.3125   1.77          2.3875   2.495
      10r-raw_seq   0.67     0.6725   X             X        X

      The tests were run using the Algodev/S benchmark (available on GitHub), running the following command for single queue:

      $ ./run_main_benchmarks.sh replayed-startup "bfq-sq cfq"

      or

      $ ./run_main_benchmarks.sh replayed-startup "bfq-sq cfq noop deadline"

      and

      $ ./run_main_benchmarks.sh replayed-startup "bfq bfq-mq mq-deadline kyber none"

      for multiqueue.
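To put the figures in the tables above side by side, the ratio between two columns can be computed with a few lines of awk. This is only a sketch: the function name `slowdown` is made up, and it assumes the X entries mean that no number was recorded, so it prints "n/a" for them rather than guessing a value.

```shell
#!/bin/sh
# Print how many times larger "other" is than "base", to two decimals.
# Prints "n/a" if either value is not a plain number (such as "X")
# or if base is zero.
slowdown() {
    awk -v base="$1" -v other="$2" 'BEGIN {
        num = "^[0-9]+(\\.[0-9]+)?$"
        if (base !~ num || other !~ num || base + 0 == 0) {
            print "n/a"
            exit
        }
        printf "%.2f\n", other / base
    }'
}
```

For the first single-queue table, `slowdown 1.5775 6.4025` prints 4.06 for 0r-seq and `slowdown 1.75 50.645` prints 28.94 for 10r-seq, while `slowdown 5.67 X` prints n/a.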

    10. I find this exceedingly ironic:

      https://patchwork.kernel.org/patch/10712695/#22362095

      It just confirms everything I outlined earlier: that something is just off with recent IO scheduling development in general. And maybe not with just BFQ alone.

      Although I have to note that the BFQ panics I did experience were with blk_mq=0. Still, read that particular post, or heck, read the entire thread. If even developers are talking about possibly considering blk_mq unfit for production, that has to count for something.

      Regarding needing a patch set to enable BFQ on SQ -- Of course, that is just silly. Fact is, with modern SSDs (the penultimate device class for which MQ was implemented into the kernel) a good case could be made for having no traditional IO scheduler at all.

      Not to mention the fact that any gains that might still exist for BFQ over not having a traditional IO scheduler on such devices are marginal at best.

      The largest gain and therefore the strongest argument in favor of BFQ (irrespective of the queue model) is in fact on SQ devices. But that requires more patches, which may or may not be in sync with mainline.

      Honestly... too much of a hassle like that. Either mainline SQ support for BFQ or just forget about it in general. Particularly with blk_mq apparently being far from completely reliable as evidenced from the recent corruption errors (again, see link).

    11. Note that devices might be different: you might have fast SSDs, NVMe, mechanical hard disks, all in the same machine. From what I could see, BFQ is the one that on "average" has the best performance and low latency on any device, i.e. you don't need to pick some particular scheduler, including "none" or "noop" for SSDs. A sort of "all-weather" queue scheduler. Regarding performance, there was a newer benchmark on Phoronix which matches the tests posted above, see:

      https://www.phoronix.com/scan.php?page=article&item=linux-420-io&num=3

      https://openbenchmarking.org/result/1812038-SK-LINUX420I24&obr_sor=y&obr_rro=y

      As you can see, BFQ sometimes performs better than "None".

      Nevertheless, being unpopular, the SQ stack is planned to be removed for good upstream starting from 4.21, because it seems difficult to maintain two stacks. So there won't be any possibility of having BFQ-SQ there, because there wouldn't be a layer to attach to.

      Regarding the BFQ (MQ+SQ) patchset, it's in sync with mainline at the Algodev site (currently at 4.20-rc IIRC). What is lacking is maybe an official BFQ+MQ patchset (indeed sirlucjan is maintaining one) for the "stable" kernel, like there was in the past up to 4.4, but I think the problem came from the merging of BFQ upstream, which apparently has been just a subset of the original project (which includes both SQ and MQ schedulers).

  3. This comment has been removed by the author.

  4. Hello, which timer frequency should I set: 100 or 1000?

    Replies
    1. 1000 seems to be more snappy on my machine.

    2. The advised Hz seems to be 100, though I found it also performs well at 250 Hz. At 1000 the machine becomes too "nervous" for me.

    3. If using MuQSS alone I would recommend 1000. If using MuQSS with the other -ck patches, 100Hz should work well since it has highres timers at 2000Hz.
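To see which tick rate a kernel was actually built with, the value of `CONFIG_HZ` can be read from its configuration. A minimal sketch follows. The function name is made up, and where the config text comes from depends on the distribution (`/proc/config.gz` exists only if the kernel was built with `CONFIG_IKCONFIG_PROC`).

```shell
#!/bin/sh
# Read kernel config text on stdin and print the numeric value of CONFIG_HZ.
# Lines such as CONFIG_HZ_100=y are deliberately not matched.
config_hz() {
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p'
}
```

Typical invocations would be `zcat /proc/config.gz | config_hz` or `config_hz < "/boot/config-$(uname -r)"`, depending on which of the two the system provides.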

    4. BTW, here is the output of the schbench benchmark for latency, measured in usec with kernel 4.19.6+ck (CPU has 4 core+4HT) at 100Hz:

      schbench -t 16 -m 2
      Latency percentiles (usec)
      50.0000th: 875
      75.0000th: 6328
      90.0000th: 11152
      95.0000th: 13296
      *99.0000th: 16480
      99.5000th: 17504
      99.9000th: 18848
      min=0, max=22622

      On a plain kernel the value of the 99th percentile is much higher. What are the typical values you achieve (using, for instance, double the total number of available cores as the argument for -t)?
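For anyone wanting to post comparable numbers, here is a small sketch that derives the `-t` value as suggested above and pulls the 99th percentile out of the output. Both function names are made up, and the parsing assumes output lines shaped like the ones quoted in this comment (`*99.0000th: 16480`).

```shell
#!/bin/sh
# Twice the given number of logical CPUs, for use as schbench's -t argument.
schbench_threads() {
    echo $(( $1 * 2 ))
}

# Read schbench output on stdin and print the 99.0000th percentile value.
# The leading "*" that schbench puts on that line is optional here.
p99_usec() {
    sed -n 's/^[[:space:]]*\*\{0,1\}99\.0000th:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}
```

With 8 logical CPUs, `schbench_threads 8` prints 16, matching the run above. A full invocation could look like `schbench -t "$(schbench_threads "$(nproc)")" -m 2`, with the output piped or saved for `p99_usec`.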

  5. Thank you for 4.19-ck. I can't find anything special to report after a week of usage, zero problems here. It runs totally great.

    Replies
    1. Great thanks. As the saying goes - no news is good news.

    2. Yeah, thanks.
      Running great.

  6. Running great here, interactivity is excellent, but on a seemingly idle system with NOHZ_FULL, accounting is way off:
    Tasks: 367 total, 2 running, 267 sleeping, 0 stopped, 10 zombie
    %Cpu0 : 21,5 us, 1,5 sy, 0,0 ni, 76,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu1 : 9,0 us, 61,5 sy, 0,0 ni, 29,5 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu2 : 1,1 us, 52,9 sy, 0,0 ni, 46,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu3 : 1,1 us, 54,1 sy, 0,0 ni, 44,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu4 : 2,6 us, 50,8 sy, 0,0 ni, 46,6 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu5 : 1,0 us, 49,8 sy, 0,0 ni, 49,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu6 : 1,0 us, 51,5 sy, 0,0 ni, 47,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    %Cpu7 : 0,5 us, 60,4 sy, 0,0 ni, 39,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
    KiB Mem : 16172800 total, 9904280 free, 2340432 used, 3928088 buff/cache
    KiB Swap: 16777212 total, 16777212 free, 0 used. 12488688 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    4576 username 1 0 1367024 138900 86212 S 6,9 0,9 0:06.91 chromium-browse
    1899 username 47 0 1245920 102160 57840 S 3,0 0,6 0:28.25 compiz
    1370 root 2 0 1398896 131420 115320 S 2,0 0,8 0:16.57 Xorg
    2963 username 1 0 670404 41836 30428 S 2,0 0,3 0:08.33 gnome-terminal-
    1004 root 1 0 453632 18232 13936 S 1,0 0,1 0:01.83 NetworkManager
    1594 username 1 0 210848 9492 6284 S 1,0 0,1 0:01.38 gnome-keyring-d
    1678 username 1 0 46380 4740 2852 S 1,0 0,0 0:04.81 dbus-daemon
    2068 username 1 0 927696 40208 35484 S 1,0 0,2 0:03.12 clickshare-laun
    3059 username 7 0 44920 4076 3232 R 1,0 0,0 0:05.57 top
    10788 username 1 0 1291448 96296 75792 S 1,0 0,6 0:00.33 chromium-browse

    Could you please look at it, or is NOHZ_FULL outside your interests?

    Thanks.

    Replies
    1. I've always recommended not using it for desktops or mobile devices, so it's a miracle it even boots and runs. I doubt I'll ever get time to rewrite the CPU accounting entirely in order for it to work under nohz_full (which is what it would take.)

    2. Ironically, MuQSS time accounting only seems to work well on my systems with a periodic tick. I recently did an investigation to find better defaults for Liquorix, and the only way to feed ondemand (and intel_pstate) idle data properly was to give it a periodic tick of 250hz or higher.

      At 100hz, I found that ondemand still thought the cores were under load and would randomly increase core frequency even though that particular core was reporting 1% C0 and nearly 99% C7 in i7z.

      250hz was a nice compromise - ondemand would properly leave untouched cores at their lowest frequency for the most part while idle.

      And 1000hz, although just a bit more accurate with frequency ramping, prevented any of my cores from staying in C7 for more than 85% of their time while idle. This increased temperatures a full 8-10°C while completely idle on my Latitude E7450 (work laptop).

      Leaving my kernel at 250hz is fine for me, but I already received a report on the linux-lqx AUR that 250hz increases underruns for certain audio configurations. You can find it in the AUR comments here: https://aur.archlinux.org/packages/linux-lqx/

    3. You can set buffer size in the kernel config.

    4. You mean sampling rate? The problem is that during a sample, without updated load data, ondemand will believe a core that was previously active is still active, even though the core is genuinely idle. I think increasing sampling_rate (sampling less often) will just make ondemand too coarse, and reducing it will make it sample the same bad data more often, exacerbating the problem.

    5. You said underruns.
      I suppose you mean buffer underruns.
      Increase buffer size?
      CONFIG_SND_HDA_PREALLOC_SIZE
      Ok, it's only for HDA.

    6. I think increasing buffer increases latency. In this case, the particular comment mentioned sub 2ms latency, probably for realtime audio mixing and redirection through software.

    7. I think there's no way around 1000 Hz then.

    8. Personally running MuQSS with full dynticks (tickless) at 100Hz and it is running just fine. Having said that though, it does return some odd reports.

      The task manager in use is suggesting all 4 cores are constantly nearly capped. Which is far from the case, obviously.

      Important to note there though is that the kernel does seem to head towards a full dynticks state. It is being pushed and has been pushed for quite a few years now. And the Linux kernel is not the only kernel that is heading that way.

      Not jumping on board there with MuQSS would be incredibly unwise.

      Eventually, one can fully expect ticks to end up being removed completely from mainline. It is all but inevitable and in fact, I agree completely with this movement towards that end result.

      If for no other reason than that a full dynticks kernel will typically be the more energy-efficient kernel. With the UN and numerous other organizations making it dreadfully clear that we have probably missed the boat on turning the tide of climate change, it would be irresponsible not to focus on efficiency.

    9. Thanks for your report. What you say should make sense except for the fact that full dynticks does not do what you think it does. It disables ticks on active cores in order to increase throughput when tasks are heavily affined to those CPU cores for very specific workloads, at the expense of latency in other areas; it was never made with lowering power consumption in PC and mobile devices in mind. Idle dynticks is what disables unnecessary ticks to conserve power, and is fully supported by MuQSS.
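Which of the tick modes discussed here a given kernel was built with can be read from its configuration. The sketch below assumes the mainline Kconfig symbols `CONFIG_HZ_PERIODIC`, `CONFIG_NO_HZ_IDLE` and `CONFIG_NO_HZ_FULL`. The function name is made up for illustration.

```shell
#!/bin/sh
# Read kernel config text on stdin and name the tick mode it selects.
# Prints "unknown" if none of the three symbols is set to y.
tick_mode() {
    awk '
        /^CONFIG_NO_HZ_FULL=y$/  { mode = "full dynticks" }
        /^CONFIG_NO_HZ_IDLE=y$/  { mode = "idle dynticks" }
        /^CONFIG_HZ_PERIODIC=y$/ { mode = "periodic" }
        END { print (mode ? mode : "unknown") }
    '
}
```

On a system that ships its config under /boot, `tick_mode < "/boot/config-$(uname -r)"` would print one of the three mode names.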

    10. I think you want constant ticks for latency reasons.

    11. @ck -- Be that as it may, the point remains that there is a movement in the development of the kernel to move away from ticks in general. I found quite a bit of reference to continued development in that direction while doing some quick research on the subject.

      I personally feel that full dynticks should be fully supported by MuQSS. Particularly because in my opinion MuQSS is the way forward as far as CPU scheduling is concerned. Its codebase is simple(-ish), straightforward while still being highly configurable to any workload.

      So any effort you could make to ensuring that even full dynticks and not just idle dynticks are supported by MuQSS would be preferable, in my humble opinion.

      Just my 2 cents' worth. Just call it a vote of confidence in your work so far.

  7. Yesterday I was testing the performance of a Ryzen7 8c16t CPU and a Ryzen5 4c8t APU in games using MuQSS and found that something is not exactly right (at least to me) with the runqueue count.
    Can you please look at the locality and runqueues? What bothers me is the Ryzen5, because they seem to be weird for that CPU.

    Ryzen7: https://pastebin.com/xuqPQijm
    Ryzen5: https://pastebin.com/dwWWVhC6

    Thanks.

    Replies
    1. Ryzen5 does indeed look weird with apparently more runqueues than there are CPUs. I wonder if internally the CPU behaves like it has more physical slots but they're unpopulated so the "extra" runqueues are actually dormant. If it works okay I probably wouldn't worry.

    2. Well, the main thing is that performance on the Ryzen5 2400G with MuQSS is not great compared to standard kernels or PDS. At least the benchmark numbers show that. It works, no doubt about that, even with an odd number of queues :)
      As far as I know, on the 2400G all cores are in one physical die (CCX), unlike the Ryzen7, which has 2 dies (CCX).
      That raises the question: shouldn't the Ryzen5 2400G have 1 runqueue in MC mode and the Ryzen7 2, and in SMT mode 4 and 8 respectively?
      Thanks.

    3. A somewhat belated input -- you may want to try a single runqueue regardless of the make or model of the CPU. Particularly for gaming purposes I have personally found that a single runqueue simply does perform better than multiple runqueues.

      To force this, I personally use rqshare=smp.

      I think you might even find that this completely closes whatever gap might have existed before between CFS and PDS, for your particular workload.

    4. Yeah, the problem is that with smp I get 9 runqueues, which is odd. So I cannot enforce just one runqueue with any rqshare parameter.
      For the 2400G the runqueues seem to be out of order, but the Ryzen7, however, seems to have one runqueue when using mc.
      I don't know whether that's correct or not.

    5. rqshare=smp should never result in 9. It should ALWAYS result in 1, EVEN in, say, a dual socket board. Since the queues are shared between actual physical packages with rqshare=smp. Which is why I elected to use rqshare=smp. So there can be no doubt as to what my intention is, a single queue.

      rqshare=none is the polar opposite of that and will spawn as many queues as there are logical cores.
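When comparing results like these, it helps to first confirm that the `rqshare` setting really reached the kernel. The sketch below only inspects the boot command line. It does not report how many runqueues MuQSS ended up creating, and the function name is made up for illustration.

```shell
#!/bin/sh
# Print the value of a "name=value" parameter found in a kernel command
# line string, or nothing if the parameter is absent.
cmdline_value() {
    name=$1
    # Deliberately unquoted so the command line is split into words.
    for word in $2; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}
```

On a running system, `cmdline_value rqshare "$(cat /proc/cmdline)"` prints the active value (for example smp) or nothing if the parameter was not passed. The same helper works for other boot parameters mentioned in this thread, such as scsi_mod.use_blk_mq.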

    6. Numa nodes are never shared, so there are still multiple queues if the CPU architecture registers as Numa. I plan to close that gap as optional in the future as well.

  8. 4.19.6 is a rocket. Thanks.

  9. Next kernel 4.19.7 will have this patch, which will break MuQSS patchset:

    https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-4.19/schedsmt_Make_sched_smt_present_track_topology.patch?id=91d67d0fbce47ce01db382d037c27f85ccff6ef3

  10. ... and also this one:

    https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-4.19/schedsmt_Expose_sched_smt_present_static_key.patch

    Replies
    1. 4.19.7 is out.

    2. While waiting for the official fix from Con, here are two patches for the impatient that should work with the CK patchset on 4.19.7:

      https://pastebin.com/HFuG7ide

      https://pastebin.com/v618N2xd

      The first one is a patch to the patch 0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch. The second one instead needs to be applied in sequence after all the others.

    3. Thank you very much.

    4. Not seen a point release break MuQSS before. At least not in a good, long while.

      Personally I'll wait for Con's input on this but still, good work on those patches, @anon.

    5. I can't apply the first patch. What am I doing wrong? I am in "patches" from the brokenout ck1:

      wget -O 1.patch https://pastebin.com/raw/HFuG7ide
      patch -i 1.patch
      (Stripping trailing CRs from patch; use --binary to disable.)
      patching file 0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch
      Hunk #1 FAILED at 907.
      patch unexpectedly ends in middle of line
      Hunk #2 succeeded at 931 with fuzz 2.
      1 out of 2 hunks FAILED -- saving rejects to file 0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch.rej

    6. I think pastebin fucks with the whitespaces or something... I followed the patch manually and created one that seems to work fine:

      https://gist.githubusercontent.com/graysky2/dc820c1b41c5eeb63d2de7a2e72499f4/raw/3d2b3be40df7e538db4ef9a69b438e00e5575e2d/unfuck-pre.patch

      Thank you and kudos to the Anonymous user who created these two patches! For Arch, I incorporated these into 4.19.8-ck and pushed to the AUR. Please give it a try and post to the AUR or here with feedback.

    7. A workaround to the pastebin problem should be to scroll down a bit and copy the Raw version instead. That should not have been tampered with in any way.
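The patch output quoted earlier in this thread shows two symptoms of a mangled download: trailing CRs and a file that ends in the middle of a line. A minimal sketch for normalising both is given below, with a made-up function name. It cannot repair tabs or spaces altered inside the hunks, which may be what actually made hunk #1 fail.

```shell
#!/bin/sh
# Read a patch on stdin, strip a trailing carriage return from every line,
# and make sure the last line ends with a newline.
normalize_patch() {
    awk '{ sub(/\r$/, ""); print }'
}
```

One possible use is `wget -O - https://pastebin.com/raw/HFuG7ide | normalize_patch > 1.patch`, followed by `patch --dry-run -i 1.patch` to see whether it would apply before touching any files.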

  11. Looks like this patch failed on 4.19.7

    Replies
    1. patch -p1 < ../muqss/0001-MultiQueue-Skiplist-Scheduler-version-v0.180.patch
      patching file Documentation/admin-guide/kernel-parameters.txt
      Hunk #1 succeeded at 4005 (offset 4 lines).
      patching file Documentation/scheduler/sched-BFS.txt
      patching file Documentation/scheduler/sched-MuQSS.txt
      patching file Documentation/sysctl/kernel.txt
      patching file arch/powerpc/platforms/cell/spufs/sched.c
      patching file arch/x86/Kconfig
      Hunk #1 FAILED at 1009.
      Hunk #2 succeeded at 1033 (offset -10 lines).
      1 out of 2 hunks FAILED -- saving rejects to file arch/x86/Kconfig.rej

    2. Probably, as graysky said, pastebin swallowed some whitespace. The original reworked patch and fix, which temporarily apply to the current kernel 4.19.8 (or 4.19.7), are contained in the Mageia Linux kernel-419-joeghi package. You can extract them directly from their svn repo (svnweb.mageia.org); e.g. these commands work and apply without errors, apart from some hunk offsets:

      wget -O linux-4.19.8-419-joeghi4.tar.xz http://binrepo.mageia.org/798a98dc158f294bb0ed10182f9a533f897f7f41
      tar -xvf linux-4.19.8-419-joeghi4.tar.xz 4.19.8-419-joeghi4/patches/0001-MultiQueue-Skiplist-Scheduler-version-v0.180_reworked.patch --strip-components=2
      tar -xvf linux-4.19.8-419-joeghi4.tar.xz 4.19.8-419-joeghi4/patches/0001-MultiQueue-Skiplist-Scheduler-version-v0.180_fix_sched_smt_present.patch --strip-components=2
      wget http://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.19.tar.xz
      wget http://cdn.kernel.org/pub/linux/kernel/v4.x/patch-4.19.8.xz
      tar -xf linux-4.19.tar.xz
      cd linux-4.19
      xz -cd ../patch-4.19.8.xz | patch -s -p1
      patch -p1 < ../0001-MultiQueue-Skiplist-Scheduler-version-v0.180_reworked.patch
      patch -p1 < ../0001-MultiQueue-Skiplist-Scheduler-version-v0.180_fix_sched_smt_present.patch

    3. Thank you very much.

  12. This comment has been removed by the author.
