Friday, 24 November 2017

Runqueue sharing experiments with MuQSS.

For a while now I've been wanting to experiment with what happens when, instead of having either a global runqueue - the way BFS did - or per-CPU runqueues - the way MuQSS currently does - we make runqueues shared according to CPU architecture topology.

Given that Simultaneous MultiThreaded - SMT - siblings (hyperthreads) are actually on the one physical core and share virtually all resources, it is almost free, at least at the hardware level, for processes or threads to bounce between the two (or more) siblings. This obviously doesn't take into account the fact that the kernel itself has many unique structures for each logical CPU, so sharing there is not really free. Additionally it is interesting to see what happens if we extend that thinking to CPUs that only share cache, such as MultiCore - MC - siblings. Virtually all of today's modern CPUs are a combination of one or both of these shared types.

At least theoretically, there could be significant advantages to decreasing the number of runqueues: less of the overhead they incur, and decreased latency from guaranteeing that each scheduling decision has access to more processes per CPU. From the throughput side, the decreased overhead would also be helpful, at the potential expense of slightly more spinlock contention - the more shared runqueues, the more contention - but if the amount of sharing is kept small it should be negligible. From the actual sharing side, given the lack of a formal balancing system in MuQSS, sharing between the logical CPUs that are cheapest to switch/balance to should automatically improve throughput for certain workloads. Additionally, with SMT sharing, if light workloads can be bound to just two threads on the same core, there could be better CPU speed consolidation and substantial power saving advantages.

To that end, I've created experimental code for MuQSS that does exactly this in a configurable way: you can configure the scheduler to share runqueues between SMT siblings or between MC siblings. Only the runqueue locks and the process skip lists are actually shared; the rest of the runqueue structures at this stage are still discrete per logical CPU.
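To illustrate the grouping idea only, here is a minimal user-space sketch in C - the names and the table-driven topology are hypothetical, not the actual kernel code - of how each logical CPU could be pointed at a shared runqueue "leader" depending on the chosen sharing level:

```c
/* Hypothetical user-space model of runqueue sharing: each logical CPU
 * is assigned a "leader" CPU whose runqueue lock and skip list it
 * shares. These names are illustrative only, not MuQSS symbols. */
enum rqshare { RQSHARE_NONE, RQSHARE_SMT, RQSHARE_MC };

struct cpu_topo {
	int core_id; /* physical core this logical CPU sits on */
	int llc_id;  /* last-level-cache (multicore) grouping */
};

/* The lowest-numbered CPU sharing the chosen topology level owns the
 * shared runqueue; scanning upward from 0 finds it directly. */
static int rq_leader(const struct cpu_topo *t, int cpu, enum rqshare level)
{
	for (int i = 0; i < cpu; i++) {
		if (level == RQSHARE_SMT && t[i].core_id == t[cpu].core_id)
			return i;
		if (level == RQSHARE_MC && t[i].llc_id == t[cpu].llc_id)
			return i;
	}
	return cpu; /* RQSHARE_NONE, or no earlier sibling: own runqueue */
}
```

The real patch of course works from the kernel's own topology information rather than a hand-built table; this model is only meant to show how MC sharing forms strictly larger groups than SMT sharing.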

Here is a git tree based on 4.14 and the current 0.162 version of MuQSS:
4.14-muqss-rqshare

And for those who use traditional patches, here is a patch that can be applied on top of a muqss-162 patched kernel:
0001-Implement-the-ability-to-share-runqueues-when-CPUs-a.patch

While this is so far only a proof of concept, some throughput workloads seem to benefit when sharing is kept to SMT siblings - specifically, when there is only enough work for the real cores, there is a demonstrable improvement - and latency is more consistently kept within bounds. But it's not all improvement, with some workloads showing slightly lower throughput. When sharing is extended to MC siblings, the results are mixed, and they change dramatically depending on how many cores you have: some workloads benefit a lot, while others suffer a lot. Worst-case latency improves the more sharing is done, but in its current rudimentary form there is very little to keep tasks bound to one CPU, and with the highly variable CPU frequencies of today's CPUs - and the need to bind tasks to one CPU for an extended period to allow it to throttle up - throughput suffers when loads are light. Conversely, results seem to improve quite a lot at heavy loads.

Either way, this is pretty much an "untuned" addition to MuQSS. For my testing at least, I think SMT sibling sharing is advantageous, and I have been running it successfully for a while now.

Regardless, if you're looking for something to experiment with - MuQSS being more or less stable these days - it should be worth giving this patch a try and seeing what you find in terms of throughput and/or latency. As with all experimental patches, I cannot guarantee the stability of the code, though I am using it on my own desktop. Note that CPU load reporting is likely to be off. Make sure to report back any results you have!

Enjoy!
Please enjoy!

62 comments:

  1. Interesting changes.

    Should be trivial to add support for this (configurable with USE flags, disabled by default) on gentoo. If even a single person requests (IRC or otherwise) I'll set some time aside to do some more rigorous testing / QA.

    Fallback to the EAPI-6 gentoo-style /etc/portage/patches/ method should be fine for testing. It's the method I'll be using personally if I decide to give it a go (assuming nobody specifically asks for this to be included)

    Replies
    1. Oops! Seems google is being a massive doofus and listing my name instead of kuzetsa (for pinging on IRC: kuzetsa on freenode or OFTC will be the quickest route to get my attention since I'm generally awful about keeping up with the ck-hack blog. @kuzetsa is also fine if you're a tweet-bird I guess.)

  2. Gets stuck booting the kernel when running with RQSHARE_MC on my AMD Phenom X6.

    Setting it to RQSHARE_NONE boots just fine.

    Seems to typically stop right around PCI initialization or after the following line:
    NOHZ: local_softirq_pending 02

    Replies
    1. Thanks very much for reporting this back. That's a real pest then and I'm not sure why it is. Do you have config hotplug enabled? Setting it to rqshare none is the equivalent of not using the patch. Do you know how many shared cores there are on your CPU?

    2. # CONFIG_MEMORY_HOTPLUG is not set
      CONFIG_HOTPLUG_CPU=y
      # CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
      # CONFIG_DEBUG_HOTPLUG_CPU0 is not set
      CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
      CONFIG_ACPI_HOTPLUG_CPU=y
      CONFIG_ACPI_HOTPLUG_IOAPIC=y
      # CONFIG_HOTPLUG_PCI is not set
      # CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set

      All 6 cores are on the same die and I believe they share the L3 cache, but otherwise I don't think they share any other resources.

    3. Thanks. Nothing suspicious in that config. Can you get the debug output at bootup of the MuQSS locality listing on a working config? Should look something like this:
      [ 0.333198] MuQSS locality CPU 0 to 1: 2
      [ 0.333199] MuQSS locality CPU 0 to 2: 1
      [ 0.333199] MuQSS locality CPU 0 to 3: 2
      [ 0.333200] MuQSS locality CPU 1 to 2: 2
      [ 0.333200] MuQSS locality CPU 1 to 3: 1
      [ 0.333200] MuQSS locality CPU 2 to 3: 2

    4. I have the same issue with an AMD Phenom II X4 965

  3. Hello.
    Thank you for your work.
    I have a processor from a testing desktop: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
    With a heavy load on the processor, the DE is smoother with RQSHARE_SMT than with RQSHARE_MC.

    Replies
    1. A core i5 does not have HT, so it does not take advantage of RQSHARE_SMT right?

    2. Most i5s do have hyperthreading.

    3. https://ark.intel.com/en/products/72164/Intel-Core-i5-3230M-Processor-3M-Cache-up-to-3_20-GHz-rPGA

    4. Nice filtering tool here (in this case, showing Intel Core CPU's w/ HT): https://ark.intel.com/Search/FeatureFilter?productType=processors&HyperThreading=true&FamilyText=Intel%C2%AE%20Core%E2%84%A2%20Processors

    5. after applying this patch:
      http://ck.kolivas.org/patches/muqss/4.0/4.14/Experimental/0002-Calculate-rq-nr_running-discretely-since-skip-lists-.patch
      with the same PC and kernel configuration, the interactivity and smoothness of the DE under heavy processor load became even more prominent.

    6. Hi, here are my findings.

      Ryzen 1700 on Linux 4.14.3 + MuQSS 162 + both patches here with default RQSHARE_SMT set.
      I have a comparison of 2 kernels (PDS & MuQSS) for Diablo 3 (via wine of course). PDS ~95 FPS, MuQSS RQSHARE_SMT ~60 FPS. Results have to be pretty reliable as they are measured in the same place when my char spawns into the game.
      DE, browsing and the rest of actions feel about the same as the default kernel I use (PDS), can not say worse or better. This applies to i7 6700HQ & Ryzen 1700.

      BR, Eduardo

    7. Thanks for that. Unfortunately it's not much use to me unless you're simply comparing MuQSS with and without the SMT runqueue sharing patch; comparing it to a completely different scheduler doesn't help me. Additionally, games via wine are most sensitive to yield settings and not so much to schedulers. I'm not even sure Ryzen has multithreading, so rqshare_smt wouldn't do anything in that case.

    8. Ahh, I thought the comparison could help. No problem, I'll compile vanilla MuQSS and try that as a comparison to rqshare. On Ryzen, compilation takes little time :)

      If it matters, this is what I have in rc.local:
      echo 0 > /proc/sys/kernel/yield_type || true
      echo ondemand | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor || true
      echo 100 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor || true
      echo 60 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold || true

      Ryzen surely has SMT; I have an 8 core 16 thread CPU, which is composed of 2 CCXes (core complexes).
      What types of tests would be best for you (compilation, native games, DE responsiveness, ...)? I'll try to help as I can.

      BR, Eduardo

    9. Did more tests on Ryzen.
      Kernel compilation is comparable (PDS, MuQSS, MuQSS RQSHARE_SMT), couple of seconds difference with -j16.
      Diablo 3 MuQSS vs MuQSS RQSHARE_SMT vs MuQSS RQSHARE_MC is day and night: MuQSS ~105 FPS, MuQSS RQSHARE_SMT ~60 FPS (as reported earlier), MuQSS RQSHARE_MC ~125 FPS. Nice boost with MC.
      Unigine Valley, RQSHARE_SMT vs RQSHARE_MC is 6% difference, RQSHARE_MC wins.
      MuQSS locality info: https://pastebin.com/g0W3Wt45

      BR, Eduardo

    10. Thanks, that's very interesting data for that one most unusual workload (and a little unexpected).

    11. Actually, I'm not too surprised by SMT sharing performing worse than MC sharing. The strength of SMT siblings might also be their weakness: shared resources.

      There could be some resource contention going on.

    12. Well, the biggest difference is in Diablo 3, and there is a reason I always include Diablo 3 in scheduler tests.

      Sorry for the story, but I want to share the background :)
      When I was testing VRQ (PDS) there was a strange scheduling issue: a constant stutter which was not present in CFS or BFS/MuQSS. This stutter did not show up in any other 3D application, wine or not.
      So I imagine that Diablo 3 is an extra heavily multi-threaded application (but I might be wrong), and since then I always test D3 and sometimes I get surprising results :)

    13. I was in fact speaking of the D3 results specifically; they do not surprise me one bit. I'd wager that, for example, Kerbal Space Program would show similar results.

      And in something more native - Minecraft - I'd expect the effect to be even more pronounced. Especially a modded Minecraft might show a very clear advantage with MC sharing, since a properly modded Minecraft is a multithreading behemoth.

    14. Don't assume anything. Just because something is multithreaded does NOT mean it will instantly be better with shared MC runqueues. The cost of switching threads to other cores could greatly outweigh the advantages for just sharing caches. Additionally with CPU speed being continually adjusted by load, with processes never staying on one core, it may never increase the cpu frequency on any of the CPUs, thus leading to lots of threads running at low frequencies. CPU binding of processes is in general a good thing for throughput. Latency is a completely different issue though. Additionally programs with lousy threading and poor locking primitives may have exactly the opposite behaviour.

    15. Well, if anyone would care to spin an Ubuntu kernel package of this for me, I'd be happy to put my wager to the test. In fact, I'll even add a different kind of workload: .NET Task-based multithreading, since that is what my password manager uses.

    16. http://fidaj.noip.me/linux-headers-4.14.6-zen-muqss-rqshare+_4.14.6-zen-muqss-rqshare+-10.00.Custom_amd64.deb
      http://fidaj.noip.me/linux-image-4.14.6-zen-muqss-rqshare+_4.14.6-zen-muqss-rqshare+-10.00.Custom_amd64.deb

    17. Here are 4 kernels: PDS, MuQSS, MuQSS+SMT, MuQSS+MC
      Kernel config is standard Ubuntu, except schedulers and BFQ.

    18. Yes, and the link :) You may want to check readme.1st.
      https://drive.google.com/drive/folders/1tiOwW4Nk0JIM6TknNKn4sAAF-f3ogZTM?usp=sharing

    19. https://pastebin.com/tiwgKke8

      Just a few very random and completely unrelated workloads. Comparing RQ sharing vs not sharing it. Suffice it to say, there were only 2 striking results on my rather measly CPU.

      First of all, as I expected, Minecraft did indeed perform noticeably better with MC sharing. Secondly, for unknown reasons, so did Basemark Web 3.0.

      Those 2 results were well outside of expected error bounds/statistical noise while all other results were within a normal error margin.

      I was not able to test SMT, as this PC is just 4 cores, no HT of any kind. Which is another key thing we do not want to lose track of. Even if all CPUs these days are in fact multicored, not all of them are hyperthreaded.

      In fact, not even all Ryzen (to name but one very modern family) are hyperthreaded. So, at first glance and from a fairly randomized set of workloads, I would personally prefer MC sharing over SMT sharing, if any at all.

  4. I get a "scheduling while atomic" error with a kvm windows guest running:
    https://paste.ubuntu.com/26054261/

    I have RQSHARE_MC set.

    Replies
    1. Thanks, that looks related to the use of the yield_to call. I'll investigate.

    2. yield_type is 0 if that helps.

    3. Now I get a kernel panic with kvm running:
      https://paste.ubuntu.com/26088466/

    4. For now I disabled the yield_to function in MuQSS; that fixed the kernel panic.

  5. This comment has been removed by the author.

  6. What I'd be interested to see tested is how all of MuQSS knobs (yield_type, interactive, rr_interval) affect this new knob.

    To me it seems like this new idea probably fares best in a configuration that is as cooperative as is possible.

    Personally I run MuQSS (without this new knob; I'm not too comfortable spinning my own kernel, so I use Liquorix instead) like this: interactive 1, yield_type 2, rr_interval 1 - and I find it functions incredibly well like that.

    Throughput is hardly any less for what I do with the machine (a very unpredictable workload; it's a PC I use for basically anything and everything) but responsiveness is just spot on.

  7. Con,

    Thanks for all of your work. I've been following SCHED_BFS and now SCHED_MUQSS for years, and I need some help.

    I have a specific use case where I have a "real time" process that uses roughly 75% CPU on a single thread. It must receive and respond to UDP packets every 1.5ms, performing calculations on the data therein. If 15ms goes by without any communication (both directions), a fault occurs.

    This process runs at SCHED_RR @ Prio=90. There are other RR processes in the system with Prio's ranging from 50 to 80, but this one process has the highest Prio.

    I've been using SCHED_BFS for the last several years without any issues--up to kernel 4.5. However I have not been able to get this to work right with SCHED_MUQSS on the latest kernels:

    On my PCs with only 2 threads (1 core), turning on CONFIG_SMT_NICE slows down the rest of the GUI (Xorg interface). The "real time" process claims the whole core, preventing anything else from running while it's running. So I have CONFIG_SMT_NICE disabled in order to let lower-priority RR or normal processes run. The "real time" process can still complete its workload without issues; top just reports a CPU usage 5% to 10% higher.

    However randomly, 3-5 times per hour, the "real time" process will suddenly report no UDP packets are received. In fact, it tries to send UDP packets during this time and select() returns a timeout on receiving data back. (Switching back to kernel 4.5 + SCHED_BFS results in no errors.)

    I've tried SCHED_MUQSS on every kernel since 4.8 and essentially have the same problem, so I initially suspect the scheduler. It's also interesting to note that this project has never worked right with the stock Linux scheduler--only SCHED_BFS so far. But lately I'm wondering if there's another driver (or the e1000e driver itself) hanging things up--but I don't know how to go about debugging this yet.

    My settings: I'm using a CONFIG_PREEMPT kernel, with rr_interval = 1 and everything else default. Back in the 4.8 kernel days I have tried sched_interactive = 0, but that didn't seem to help. I will try your new experimental patch next week.

    Do you have any other suggestions for me?

    Replies
    1. The problem you're describing actually does sound like a possible driver issue. Architecturally, BFS and MuQSS are not that far removed from one another.

      Personally, I would have expected MuQSS to perform as well on this particular workload as BFS did. The fact that you're experiencing an issue beyond a certain kernel version is a telltale sign, in my humble opinion.

      Anyhow, just my 2 cents' worth - is it really paramount that you keep updating the kernel version? This workload of yours sounds important, and its executing properly matters to you. I would suggest just sticking with the original configuration which you knew worked - 4.5 + BFS.

    2. Yes, unfortunately newer PCs with Skylake and above require kernels > 4.5 to run the X Interface (GUI) without glitches or errors.

      But thanks for your input. At this point I'm going to focus my attention on it being a driver issue rather than a scheduler issue.

    3. Maybe you can try with PDS, a scheduler derived from BFS. You can see more info here:
      cchalpha.blogspot.com

      Pedro

  8. Thanks Con.
    I've done the usual throughput tests.

    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

    RQSHARE_SMT is indeed improving the results on most of the tests.

    Pedro

  9. Regarding the stability -- I just performed a do-release-upgrade (Ubuntu 17.04 > 17.10) while running the MC version of the patch and it performed just fine.

    In fact, I've been running it for most of the day and not a single hiccup. No applications hanging, no kernel panics, no lost network packets. Seems fine, so far.

  10. Hi. Happy New Year.
    I have recently started receiving the following messages:
    https://pastebin.com/bVPbJdyc
    Are they related to this topic?

  11. And what should be done, taking recent events into account?
    https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

    Replies
    1. This is a good point; the Meltdown fix does enforce certain behaviour which may in fact end up not playing nicely in conjunction with MuQSS.

      The interaction between MuQSS and KPTI is worth investigating.

  12. First of all thank you very much.
    Impressive latency or better: a lack thereof.
    Love this.

    Replies
    1. ^Update:
      Wow!
      Multicore sibling sharing is even more impressive.
      Xeon quadcore.

    2. Did some more testing.
      Even at full load it is extremely smooth and responsive.
      The low latency is all you could wish for.
      And it has a nice balance of low latency and throughput.
      Usually throughput suffers when going low latency but this one does it nicely.

      Please develop this further.
      I think it has a lot of potential.

  13. Agreed, and if at all possible, prioritize MC over SMT. I'm getting amazing performance over base MuQSS with MC MuQSS, and not all CPUs are SMT-enabled.

  14. Agreed, if possible prioritize MC over SMT.
    :)

    Replies
    1. For those of you preferring MC sharing, how many cores do your CPUs have?

    2. 4, HT disabled.

    3. 4. There are very few 2-core CPUs remaining in the field, I think, and 6/8-core CPUs almost always are HT capable.

      If you personally are still unsure, ck, then is it at all possible to make it a configurable option like rr_interval, yield_type and interactive?

    4. That is approximately the behaviour I was expecting. However, making it runtime configurable is considerably more difficult than compile time (but of course nothing's impossible.) I was thinking of one that chooses according to the topology at boot time: any HT, and use SMT sharing; up to 4 cores, and use MC sharing instead. However, everyone's workloads are different, so nothing is perfect everywhere.
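      That boot-time heuristic is simple enough to sketch in plain C (hypothetical names, not actual MuQSS code - it only shows the decision logic described above):

```c
/* Sketch of the boot-time default described above: any hyperthreading
 * means SMT sharing; otherwise small multicore CPUs get MC sharing.
 * Hypothetical names, not actual MuQSS code. */
enum rqshare { RQSHARE_NONE, RQSHARE_SMT, RQSHARE_MC };

static enum rqshare default_rqshare(int threads_per_core, int num_cores)
{
	if (threads_per_core > 1)
		return RQSHARE_SMT;  /* any HT: share between SMT siblings */
	if (num_cores <= 4)
		return RQSHARE_MC;   /* up to 4 cores: share at the MC level */
	return RQSHARE_NONE;         /* many cores, no HT: discrete runqueues */
}
```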

    5. Therefore please make it user adjustable, if possible.
      I cannot live without it anymore.
      :)

    6. Well, boot time is another thing I personally was thinking about as well, yeah. As you suggested, ck, if SMT is available just use it. And if not, use MC.

      Thing is, from personal testing, I have not seen any workload where NOT sharing at all is measurably better than MC sharing.

      I have to agree with the other anon in this little line, MuQSS with MC/SMT sharing is just amazing really. MuQSS already is great for desktop use but MC sharing just takes it to a whole new level of being happy with the Linux kernel.

    7. Another thought just now -- Configurable... at boot time. Like a regular kernel boot option (apparmor, quiet, splash, etc, etc, etc).

      So you would not have to implement topology detection during boot time and people would still have the ability to change it as they see fit. But without the hassle of figuring out how to do that during runtime.

    8. ^"Boot time" is a good idea I think.
      Also I agree with the "happiness".

      Just one question, ck:
      Why are you doing this?

    9. 10 Cores with HT enabled here.
      I used SMT sharing for a while and now I'm trying MC.
      Kernel compiling is a bit faster.
      schedtool results are all over the place; I tried to ftrace what is causing the latency, but I just see idle calls taking a long time.

      schedtool -n -20 -F -p99 -e ./cyclictest --latency=0 -t -n -p95 -m -N

      https://paste.ubuntu.com/26367870/

      The max times are just me moving my mouse around - really strange; most likely interactive=1 related.

      But the desktop experience is really nice: no dropped frames in mpv, and everything in chrome is really good.
      Compile-time switching between MC/SMT is OK for me.

    10. Thanks for the feedback. Why am I doing this? Because I'm _sure_ that for latency it's better to share some runqueues instead of having one per logical CPU, and I'm reasonably sure it is for throughput too, but the right amount of sharing for throughput is not clear. As for cyclictest, I've had this discussion before, but it's not clear it's measuring the correct function points in MuQSS, since it's based on functions in the mainline scheduler and is likely giving bogus results.

    11. Thanks.
      Understood.
      I wonder about the performance once this is perfected.
      And yes, cyclictest is not reliable.
      I also had it displaying interesting numbers far from reality.

    12. Task-switching/Multitasking seems to have also improved a lot.

  15. @10 HT cores:

    Compile-time switching is OK of course, but it would require people keeping twice the package count up-to-date for those of us who have neither the inclination nor the hardware to compile the kernel ourselves.

    And I'm sure that neither xanmod nor liquorix are going to bother keeping both options (MC or SMT) updated in their packages. Some method of incorporating both and configuring it either at runtime or at boot time would be highly preferable.

    ReplyDelete