For a while now I've been wanting to experiment with what happens when, instead of having either a global runqueue (the way BFS did) or per-CPU runqueues (the way MuQSS currently does), we share runqueues according to CPU topology.
Given that Simultaneous MultiThreaded (SMT) siblings (hyperthreads) sit on the one physical core and share virtually all resources, it is almost free, at least at the hardware level, for processes or threads to bounce between the two (or more) siblings. This obviously doesn't take into account the fact that the kernel itself keeps many unique structures for each logical CPU, so sharing there is not really free. Additionally, it is interesting to see what happens if we extend that thinking to CPUs that only share cache, such as MultiCore (MC) siblings. Today's modern CPUs are virtually all a combination of one or both of these shared types.
At least theoretically, there could be significant advantages to decreasing the number of runqueues: less overhead, and decreased latency from guaranteeing each scheduling decision access to more processes on a per-CPU basis. From the throughput side, the decreased overhead would also help, at the potential expense of slightly more spinlock contention - the more shared runqueues, the more contention - but if the amount of sharing is kept small it should be negligible. From the actual sharing side, given the lack of a formal balancing system in MuQSS, sharing the logical CPUs that are cheapest to switch/balance to should automatically improve throughput for certain workloads. Additionally, with SMT sharing, if light workloads can be bound to just the two threads on the same core, there could be better CPU speed consolidation and substantial power saving advantages.
To that end, I've created experimental code for MuQSS that does this exact thing in a configurable way. You can configure the scheduler to share by SMT siblings or MC siblings. Only the runqueue locks and the process skip lists are actually shared. The rest of the runqueue structures at this stage are all still discrete per logical CPU.
Here is a git tree based on 4.14 and the current 0.162 version of MuQSS:
And for those who use traditional patches, here is a patch that can be applied on top of a muqss-162 patched kernel:
While this is so far only a proof of concept, some throughput workloads seem to benefit when sharing is kept to SMT siblings - specifically, when there is only enough work for the real cores, there is a demonstrable improvement. Latency is more consistently kept within bound levels, but it's not all improvement, with some workloads showing slightly lower throughput. When sharing is extended to MC siblings the results are mixed, and they change dramatically depending on how many cores you have: some workloads benefit a lot, while others suffer a lot. Worst case latency improves the more sharing is done, but in its current rudimentary form there is very little to keep tasks bound to one CPU. With the highly variable CPU frequencies of today's CPUs, and the need to bind tasks to one CPU for an extended period to allow it to throttle up, throughput suffers when loads are light. Conversely, results seem to improve quite a lot at heavy loads.
Either way, this is pretty much an "untuned" addition to MuQSS, and for my testing at least, I think the SMT siblings sharing is advantageous and I have been running it successfully for a while now.
Regardless, if you're looking for something to experiment with, as MuQSS is more or less stable these days, it should be worth giving this patch a try and seeing what you find in terms of throughput and/or latency. As with all experimental patches, I cannot guarantee the stability of the code, though I am using it on my own desktop. Note that CPU load reporting is likely to be off. Make sure to report back any results you have!
It should be trivial to add support for this (configurable with USE flags, disabled by default) on Gentoo. If even a single person requests it (IRC or otherwise) I'll set some time aside to do some more rigorous testing / QA.
Falling back to the EAPI-6 Gentoo-style /etc/portage/patches/ method should be fine for testing. It's the method I'll be using personally if I decide to give it a go (assuming nobody specifically asks for this to be included).
Oops! Seems Google is being a massive doofus and listing my name instead of kuzetsa. (For pinging on IRC: kuzetsa on freenode or OFTC will be the quickest route to get my attention, since I'm generally awful about keeping up with the ck-hack blog. @kuzetsa is also fine if you're a tweet-bird, I guess.)
Gets stuck booting the kernel when running with RQSHARE_MC on my AMD Phenom X6.
Setting it to RQSHARE_NONE boots just fine.
Seems to typically stop right around PCI initialization or after the following line:
NOHZ: local_softirq_pending 02
Thanks very much for reporting this back. That's a real pest then and I'm not sure why it is. Do you have CPU hotplug enabled in your config? Setting it to RQSHARE_NONE is the equivalent of not using the patch. Do you know how many shared cores there are on your CPU?
# CONFIG_MEMORY_HOTPLUG is not set
# CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_HOTPLUG_PCI is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
All 6 cores are on the same die and I believe they share the L3 cache, but otherwise I don't think they share any other resources.
Thanks. Nothing suspicious in that config. Can you get the debug output at bootup of the MuQSS locality listing on a working config? Should look something like this:
[ 0.333198] MuQSS locality CPU 0 to 1: 2
[ 0.333199] MuQSS locality CPU 0 to 2: 1
[ 0.333199] MuQSS locality CPU 0 to 3: 2
[ 0.333200] MuQSS locality CPU 1 to 2: 2
[ 0.333200] MuQSS locality CPU 1 to 3: 1
[ 0.333200] MuQSS locality CPU 2 to 3: 2
I have the same with an AMD Phenom II X4 965.
Similar issue encountered: MuQSS 0.170, Core i7 2760QM.
Setting nothreadirqs boots just fine with any rqshare=smp|smt|none setting, whether SMT/HT is enabled in the BIOS or not.
threadirqs locks up boot if HT is disabled in BIOS and rqshare=smp; with rqshare=none, threadirqs does not lock up boot. The conclusion is that threadirqs conflicts with any rqshare=smp|smt|mc setting.
I've encountered a similar issue with a NUMA machine (dual-socket Xeon X5670), but didn't track down for sure whether it is related to threadirqs. I suspect it is, because boot lockups were observed there too. Otherwise the following setup boots just fine with NUMA awareness: rqshare=smp with a properly shared runqueue per socket, HT disabled in BIOS, threadirqs enabled. But a boot lockup was observed with HT enabled in BIOS, rqshare=smp and threadirqs.
Still, I'm not that concerned about latency with the NUMA workhorse, but I like the idea that MuQSS would scale with NUMA machines ;)
Thanks very much, that is very useful extra information.
Thank you for your work.
I have this processor in a testing desktop: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Under heavy processor load, the DE is smoother with RQSHARE_SMT than with RQSHARE_MC.
Thanks for that.
A Core i5 does not have HT, so it does not take advantage of RQSHARE_SMT, right?
Most i5s do have hyperthreading.
Nice filtering tool here (in this case, showing Intel Core CPU's w/ HT): https://ark.intel.com/Search/FeatureFilter?productType=processors&HyperThreading=true&FamilyText=Intel%C2%AE%20Core%E2%84%A2%20Processors
After applying this patch:
With the same configuration of PCs and kernels, the interactivity and smoothness of the DE under heavy processor load became even more prominent.
Hi, here are my findings.
Ryzen 1700 on Linux 4.14.3 + MuQSS 162 + both patches here with default RQSHARE_SMT set.
I have a comparison of 2 kernels (PDS & MuQSS) for Diablo 3 (via wine of course). PDS ~95 FPS, MuQSS RQSHARE_SMT ~60 FPS. The results should be pretty reliable as they are measured in the same place, when my char spawns into the game.
DE, browsing and the rest of actions feel about the same as the default kernel I use (PDS), can not say worse or better. This applies to i7 6700HQ & Ryzen 1700.
Thanks for that. Unfortunately it's not much use to me unless you're simply comparing MuQSS with and without the SMT runqueue sharing patch; comparing it to a completely different scheduler doesn't help me. Additionally, games via wine are most sensitive to yield settings and not so much to schedulers. I'm not even sure Ryzen has multithreading, so RQSHARE_SMT wouldn't do anything in that case.
Ahh, I thought the comparison could help. No problem, I'll compile vanilla MuQSS and try that as a comparison to rqshare. On Ryzen, compilation takes little time :)
If it matters, this is what I have in rc.local:
echo 0 > /proc/sys/kernel/yield_type || true
echo ondemand | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor || true
echo 100 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor || true
echo 60 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold || true
Ryzen surely has SMT; I have an 8 core 16 thread CPU, which is composed of 2 CCXes (core complexes on the die).
What types of tests would be best for you (compilation, native games, DE responsiveness, ...)? I'll try to help as I can.
Did more tests on Ryzen.
Kernel compilation is comparable (PDS, MuQSS, MuQSS RQSHARE_SMT), with a couple of seconds difference at -j16.
Diablo 3 MuQSS vs MuQSS RQSHARE_SMT vs MuQSS RQSHARE_MC is day and night: MuQSS ~105 FPS, MuQSS RQSHARE_SMT ~60 FPS (as reported earlier), MuQSS RQSHARE_MC ~125 FPS. Nice boost with MC.
Unigine Valley, RQSHARE_SMT vs RQSHARE_MC is 6% difference, RQSHARE_MC wins.
MuQSS locality info: https://pastebin.com/g0W3Wt45
Thanks, that's very interesting data for that one most unusual workload (and a little unexpected.)
Actually, I'm not too surprised by SMT sharing performing worse than MC sharing. The strength of SMT siblings might also be their weakness: shared resources.
There could be some resource contention going on.
Well, the biggest difference is Diablo3 and there is a reason I always put Diablo3 in scheduler tests.
Sorry for story, but I want to share background :)
When I was testing VRQ (PDS) there was a strange scheduling issue: a constant stutter which was not present in CFS or BFS/MuQSS. This stutter did not show up in any other 3D application, wine or not.
So I imagine that Diablo3 is an extra-heavily multi-threaded application (but I might be wrong), so since then I always test D3 and sometimes I get surprising results :)
I was in fact speaking of the D3 results specifically; they do not surprise me one bit. I'd wager that, for example, Kerbal Space Program would show similar results.
And in something more native, Minecraft, I'd expect the effect to be even more pronounced. Especially a modded Minecraft might show a very clear advantage for MC sharing, since a properly modded Minecraft is a multithreading behemoth.
Don't assume anything. Just because something is multithreaded does NOT mean it will instantly be better with shared MC runqueues. The cost of switching threads to other cores could greatly outweigh the advantages for just sharing caches. Additionally with CPU speed being continually adjusted by load, with processes never staying on one core, it may never increase the cpu frequency on any of the CPUs, thus leading to lots of threads running at low frequencies. CPU binding of processes is in general a good thing for throughput. Latency is a completely different issue though. Additionally programs with lousy threading and poor locking primitives may have exactly the opposite behaviour.
Well, if anyone would care to spin me an Ubuntu-kernel package of this for me, I'd be happy to put my wager to the test. In fact, I'll even add it in a different kind of workload; .NET Task-based multithreading. Since that is what my password manager uses.
Here are 4 kernels: PDS, MuQSS, MuQSS+SMT, MuQSS+MC
Kernel config is standard Ubuntu, except schedulers and BFQ.
Yes, and the link :) You may want to check readme.1st.
Just a few very random and completely unrelated workloads. Comparing RQ sharing vs not sharing it. Suffice it to say, there were only 2 striking results on my rather measly CPU.
First of all, as I expected, Minecraft did indeed perform noticeably better with MC sharing. Secondly, for unknown reasons, so did Basemark Web 3.0.
Those 2 results were well outside of expected error bounds/statistical noise while all other results were within a normal error margin.
I was not able to test SMT, as this PC has just 4 cores, no HT of any kind. Which is another key thing we do not want to lose track of: even if all CPUs these days are in fact multicore, not all of them are hyperthreaded.
In fact, not even all Ryzen (to name but one very modern family) are hyperthreaded. So, at first glance and from a fairly randomized set of workloads, I would personally prefer MC sharing over SMT sharing, if any at all.
I get a "scheduling while atomic" error with a KVM Windows guest running:
I have RQSHARE_MC set.
Thanks, that looks related to the use of the yield_to call. I'll investigate.
yield_type is 0 if that helps.
Now I get a kernel panic with KVM running:
For now I disabled the yield_to function in MuQSS, which fixed the kernel panic.
What I'd be interested to see tested is how all of MuQSS's knobs (yield_type, interactive, rr_interval) affect this new knob.
To me it seems like this new idea probably fares best in a configuration that is as cooperative as possible.
Personally I run MuQSS (without this new knob; not too comfortable spinning my own kernel, I use Liquorix instead) like this: interactive 1, yield_type 2, rr_interval 1, and I find it to function incredibly well like that.
Throughput is hardly any less for what I do with the machine (a very unpredictable workload; it's a PC I use for basically anything and everything) but responsiveness is just spot on.
Thanks for all of your work. I've been following SCHED_BFS and now SCHED_MUQSS for years, and I need some help.
I have a specific use case where I have a "real time" process that uses roughly 75% CPU on a single thread. It must receive and respond to UDP packets every 1.5ms, performing calculations on the data therein. If 15ms goes by without any communication (both directions), a fault occurs.
This process runs at SCHED_RR @ Prio=90. There are other RR processes in the system with Prio's ranging from 50 to 80, but this one process has the highest Prio.
I've been using SCHED_BFS for the last several years without any issues--up to kernel 4.5. However I have not been able to get this to work right with SCHED_MUQSS on the latest kernels:
On my PCs with only 2 threads (1 core), turning on CONFIG_SMT_NICE slows down the rest of the GUI (Xorg interface). The "real time" process claims the whole core, preventing anything else from running while it's running. So I have CONFIG_SMT_NICE disabled in order to let lower RR priority or normal processes run. The "real time" process can still complete its workload without issues; top just reports a CPU usage 5% to 10% higher.
However randomly, 3-5 times per hour, the "real time" process will suddenly report no UDP packets are received. In fact, it tries to send UDP packets during this time and select() returns a timeout on receiving data back. (Switching back to kernel 4.5 + SCHED_BFS results in no errors.)
I've tried SCHED_MUQSS on every kernel since 4.8 and essentially have the same problem, so I initially suspected the scheduler. It's also interesting to note that this project has never worked right with the stock Linux scheduler, only SCHED_BFS so far. But lately I'm wondering if there's another driver (or the e1000e driver itself) hanging things up; I just don't know how to go about debugging this yet.
My settings: I'm using a CONFIG_PREEMPT kernel, with rr_interval = 1 and everything else default. Back in the 4.8 kernel days I have tried sched_interactive = 0, but that didn't seem to help. I will try your new experimental patch next week.
Do you have any other suggestions for me?
The problem you're describing actually does sound like a possible driver issue. Architecturally, BFS and MuQSS are not that far removed from one another.
Personally, I would have expected MuQSS to perform as well on this particular workload as BFS did. The fact that you're experiencing an issue beyond a certain kernel version is a telltale sign, in my humble opinion.
Anyhow, just my 2 cents' worth: is it really paramount that you keep updating the kernel version? This workload of yours sounds important; its executing properly matters to you. I would suggest just sticking with the original configuration you knew worked: 4.5 + BFS.
Yes, unfortunately newer PCs with Skylake and above require kernels > 4.5 to run the X Interface (GUI) without glitches or errors.
But thanks for your input. At this point I'm going to focus my attention on it being a driver issue rather than a scheduler issue.
Maybe you can try with PDS, a scheduler derived from BFS. You can see more info here:
I've done the usual throughput tests.
RQSHARE_SMT is indeed improving the results on most of the tests.
Regarding the stability: I just performed a do-release-upgrade (Ubuntu 17.04 > 17.10) while running the MC version of the patch and it performed just fine.
In fact, I've been running it for most of the day without a single hiccup. No applications hanging, no kernel panics, no lost network packets. Seems fine, so far.
Hi. Happy New Year.
I have recently started receiving the following messages:
Are they related to this topic?
And what should be done, taking recent events into account?
This is a good point; the Meltdown fix does enforce certain behaviour which may in fact end up not playing nicely in conjunction with MuQSS.
The interaction between MuQSS and KPTI is worth investigating.
First of all thank you very much.
Impressive latency or better: a lack thereof.
Multicore siblings is even more impressive.
Did some more testing.
Even at full load it is extremely smooth and responsive.
The low latency is all you could wish for.
And it has a nice balance of low latency and throughput.
Usually throughput suffers when going low latency but this one does it nicely.
Please develop this further.
I think it has a lot of potential.
Agreed, and if at all possible, prioritize MC over SMT. I'm getting amazing performance over base MuQSS with MC MuQSS, and not all CPUs are SMT-enabled.
Agreed, if possible prioritize MC over SMT.
For those of you preferring MC sharing, how many cores do your CPUs have?
4, HT disabled.
4. There are very few 2 core CPUs remaining in the field, I think. And 6/8 core CPUs almost always are HT capable.
If you personally are still unsure, ck, then is it at all possible to make it a configurable option like rr_interval, yield_type and interactive?
That is approximately the behaviour I was expecting. However, making it runtime configurable is considerably more difficult than compile time (but of course nothing's impossible.) I was thinking of an option that chose according to the topology at boot time: any HT and use SMT sharing; up to 4 cores and use MC sharing instead. However everyone's workloads will be different, so nothing is perfect everywhere.
Therefore please make it user adjustable, if possible.
I cannot live without it anymore.
Well, boot time is another thing I personally was thinking about as well, yeah. As you suggested, ck, if SMT is available just use it. And if not, use MC.
Thing is, from personal testing, I have not seen any workload where NOT sharing at all is measurably better than MC sharing.
I have to agree with the other anon in this little line, MuQSS with MC/SMT sharing is just amazing really. MuQSS already is great for desktop use but MC sharing just takes it to a whole new level of being happy with the Linux kernel.
Another thought just now -- Configurable... at boot time. Like a regular kernel boot option (apparmor, quiet, splash, etc, etc, etc).
So you would not have to implement topology detection during boot time and people would still have the ability to change it as they see fit. But without the hassle of figuring out how to do that during runtime.
^"Boot time" is a good idea I think.
Also I agree with the "happiness".
Just one question, ck:
Why are you doing this?
10 cores with HT enabled here.
I used SMT sharing for a while and now I'm trying MC.
Kernel compiling is a bit faster.
schedtool results are all over the place; I tried to ftrace what is causing the latency but I just see idle calls taking a long time.
schedtool -n -20 -F -p99 -e ./cyclictest --latency=0 -t -n -p95 -m -N
The max times are just me moving my mouse around; really strange, most likely interactive=1 related.
But the desktop experience is really nice: no dropped frames in mpv, and everything in chrome is really good.
Compile-time switching between MC/SMT is OK for me.
Thanks for the feedback. Why am I doing this? Because I'm _sure_ it's better to share some runqueues instead of having one per logical CPU for latency and reasonably sure it is for throughput too, but the right amount of sharing for throughput is not clear. As for cyclictest, I've had this discussion before but it's not clear it's measuring the correct function points in muqss because it's based on functions in the mainline scheduler and is likely giving bogus results.
I wonder about the performance once this is perfected.
And yes, cyclictest is not reliable.
I also had it displaying interesting numbers far from reality.
Task-switching/Multitasking seems to have also improved a lot.
@10 HT cores:
Compile-time switching is OK of course, but it would require keeping twice the package count up to date for those of us who have neither the inclination nor the hardware to compile the kernel ourselves.
And I'm sure that neither xanmod nor liquorix are going to bother keeping both options (MC or SMT) updated in their packages. Some method of incorporating both and configuring it either at runtime or during boot time would be highly preferable.
I'm getting really high idle CPU usage on the CK kernel, whether I compile myself or use one of the precompiled ones from graysky's repo. I've put details on the Arch forum here - https://bbs.archlinux.org/viewtopic.php?pid=1764327#p1764327
All other kernels (stock, Zen, self-compiled, etc) are fine.
Yep, probably just an accounting anomaly as always. It doesn't mean it's using high CPU at all, but I don't have the time and equipment to fix the accounting on all CPUs I'm afraid :(
I originally suspected that, but in my second post you can see that it's keeping the CPU awake so it's not clocking down - https://bbs.archlinux.org/viewtopic.php?pid=1764717#p1764717
Okay, is this with any runqueue sharing or default MuQSS?
Whichever is the default. I'm not using any CK or scheduler specific kernel options.
Okay, try disabling interactive mode:
echo 0 > /proc/sys/kernel/interactive
See if that helps.
No change in behaviour after changing that, unfortunately.
Looks to be mostly nvidia driver related in fact. Could be due to the threaded IRQs which are on by default. Try booting with the extra kernel parameter nothreadirqs.
Actually, it looks like your mainline config has threadirqs too? If that's the case then I have no idea what to suggest. My laptop doesn't use much CPU at all at idle, but it has the inbuilt Intel GPU.
Booted with nothreadirqs and also unloaded all of the nvidia modules just in case. Random services / threads are still popping up using high amounts of CPU for no apparent reason resulting in the CPU not clocking down.
In that case it looks like whatever is leading to your accounting errors is also feeding into the cpu frequency handler. What governor is in use and what CPU?
I have a nvidia quadro.
Apart from cpu accounting seeming to be a little off, no problems so far.
It's an i7-5775c using the default powersave governor.
If I may offer a remark -- It is somewhat odd that even while using powersave, CPUs are not throttling down. Powersave is supposed to be quite aggressive in its downthrottling; it's the powersaving governor after all.
No real reason to assume the following but might it be some odd, unexpected, interaction with MuQSS? Might I suggest using the schedutil governor? For me, MuQSS behaves as expected with schedutil.
You're mixing up the powersave governor - which runs at the lowest speed 100% of the time - with the powersave setting on the intel_pstate governor. The powersave name with intel_pstate is confusing, as it's really a dynamic ondemand setting. The choices with the p-state governor are powersave and performance only, where performance is maximum frequency all the time.
What CK said. "powersave" on these processors is basically "ondemand".
I've had to switch back to the normal kernel for now as no amount of fiddling resulted in normal behaviour.
Actually, no, that is not what I was mixing up. I was mixing up 2 individual governors; nothing related to Intel p state. I was, somehow, getting powersave mixed up with conservative.
Any news on the 4.15 sync?
I gave it a try myself; the kernel boots but does not work correctly.
Still probably a week away.
Have there been any developments regarding boot-time vs compile-time vs runtime configuration?
Trying your new gift to us all (boot-time configuration) as we speak. Will report any problems.
That code is all in preparation for the 4.15 release which is not far away now ;)