-ck hacking: Runqueue sharing experiments with MuQSS.

Friday, 24 November 2017

For a while now I've wanted to experiment with what happens when, instead of having either a single global runqueue (the way BFS did) or per-CPU runqueues (the way MuQSS currently does), runqueues are shared according to CPU topology.

Given that Simultaneous MultiThreading (SMT) siblings, i.e. hyperthreads, are actually on the one physical core and share virtually all resources, it is almost free, at least at the hardware level, for processes or threads to bounce between the two (or more) siblings. This obviously doesn't take into account the fact that the kernel itself has many unique structures for each logical CPU, so sharing there is not really free. Additionally, it is interesting to see what happens if we extend that thinking to CPUs that only share cache, such as MultiCore (MC) siblings. Virtually all of today's CPUs combine one or both of these kinds of sharing.
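To see which logical CPUs fall into these groups on a given Linux machine, the sysfs topology files can be read. The following is a minimal sketch and not part of the MuQSS patch: the helper names `parse_cpulist` and `sibling_groups` are made up for illustration, and it assumes the usual `thread_siblings_list` and `core_siblings_list` files exist. Note that `core_siblings_list` groups CPUs by physical package, which on many but not all CPUs coincides with sharing a last-level cache.

```python
import glob


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def sibling_groups(filename):
    """Return the distinct sibling groups found in sysfs for one topology file.

    filename is "thread_siblings_list" (SMT siblings) or
    "core_siblings_list" (logical CPUs in the same physical package).
    """
    groups = set()
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/" + filename
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                groups.add(tuple(parse_cpulist(f.read())))
        except OSError:
            continue  # file unreadable (e.g. CPU went offline): skip it
    return sorted(groups)


if __name__ == "__main__":
    print("SMT sibling groups:", sibling_groups("thread_siblings_list"))
    print("Package sibling groups:", sibling_groups("core_siblings_list"))
```

On a machine without these files (or a non-Linux system) both calls simply return an empty list.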

At least theoretically, there could be significant advantages to decreasing the number of runqueues: less overhead, and lower latency from guaranteeing that each scheduling decision on a CPU has access to more processes. On the throughput side, the decreased overhead would also help, at the potential expense of slightly more spinlock contention - the more CPUs share a runqueue, the more contention, but if the amount of sharing is kept small it should be negligible. From the actual sharing side, given the lack of a formal balancing system in MuQSS, sharing between the logical CPUs that are cheapest to switch/balance to should automatically improve throughput for certain workloads. Additionally, with SMT sharing, if light workloads can be bound to just two threads on the same core, there could be better CPU speed consolidation and substantial power saving advantages.
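The latency argument can be illustrated with a toy model. This is a deliberately simplified sketch, not MuQSS code: a heap stands in for the skip list, tasks are ordered by a plain numeric deadline, and the model ignores any cross-queue lookups or task stealing that a real scheduler performs. The names `RunQueue` and `build_queues` are hypothetical.

```python
import heapq


class RunQueue:
    """A queue of (deadline, task) pairs; a heap stands in for the skip list."""

    def __init__(self):
        self._heap = []

    def enqueue(self, deadline, task):
        heapq.heappush(self._heap, (deadline, task))

    def pick_next(self):
        """Pop the queued task with the earliest deadline, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]


def build_queues(cpu_to_rq):
    """Create one RunQueue per distinct queue id and map each CPU to its queue."""
    queues = {rq_id: RunQueue() for rq_id in set(cpu_to_rq.values())}
    return {cpu: queues[rq_id] for cpu, rq_id in cpu_to_rq.items()}


# Two SMT siblings, CPU 0 and CPU 1: first with private queues, then sharing one.
private = build_queues({0: 0, 1: 1})
shared = build_queues({0: 0, 1: 0})

for queues in (private, shared):
    # Both tasks are queued via CPU 1; "urgent" has the earlier deadline.
    queues[1].enqueue(5, "urgent")
    queues[1].enqueue(9, "batch")

# CPU 0 now makes a scheduling decision.
print(private[0].pick_next())  # None: CPU 0's own queue is empty
print(shared[0].pick_next())   # urgent: CPU 0 sees its sibling's tasks too
```

In the shared case every scheduling decision on either sibling draws from the same ordered set of tasks, at the cost of both siblings contending for the one lock that would protect that queue.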

To that end, I've created experimental code for MuQSS that does this exact thing in a configurable way. You can configure the scheduler to share by SMT siblings or MC siblings. Only the runqueue locks and the process skip lists are actually shared. The rest of the runqueue structures at this stage are all still discrete per logical CPU.
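As a rough illustration of what the configurable sharing amounts to, the sketch below maps each logical CPU to the CPU that "owns" its shared queue under each mode. It is an assumption-laden model rather than the patch's implementation: the `Cpu` tuple, the `assign_runqueues` helper, and the string values of the mode constants are invented here, and only the `RQSHARE_NONE`, `RQSHARE_SMT` and `RQSHARE_MC` names mirror the config options mentioned in the comments below.

```python
from collections import namedtuple

# Hypothetical description of one logical CPU: its id, the physical core it
# belongs to, and the group of cores it shares cache with.
Cpu = namedtuple("Cpu", "cpu core cache_group")

# Mode names mirror the config options; the values are arbitrary.
RQSHARE_NONE, RQSHARE_SMT, RQSHARE_MC = "none", "smt", "mc"


def assign_runqueues(cpus, mode):
    """Map each logical CPU to the lowest-numbered CPU sharing its queue."""
    def key(c):
        if mode == RQSHARE_SMT:
            return ("core", c.core)          # share between hyperthreads
        if mode == RQSHARE_MC:
            return ("cache", c.cache_group)  # share between cache siblings
        return ("cpu", c.cpu)                # no sharing: one queue per CPU

    leader = {}
    for c in sorted(cpus, key=lambda c: c.cpu):
        leader.setdefault(key(c), c.cpu)
    return {c.cpu: leader[key(c)] for c in cpus}


# Example: 2 cores x 2 threads, all sharing one cache.
topology = [Cpu(0, 0, 0), Cpu(1, 1, 0), Cpu(2, 0, 0), Cpu(3, 1, 0)]
for mode in (RQSHARE_NONE, RQSHARE_SMT, RQSHARE_MC):
    print(mode, assign_runqueues(topology, mode))
```

For this example topology, SMT sharing halves the number of queues (CPUs 0 and 2 share one, CPUs 1 and 3 the other), while MC sharing collapses everything into a single queue, which is effectively the old BFS arrangement on a single-cache machine.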

Here is a git tree based on 4.14 and the current 0.162 version of MuQSS:
4.14-muqss-rqshare

And for those who use traditional patches, here is a patch that can be applied on top of a muqss-162 patched kernel:
0001-Implement-the-ability-to-share-runqueues-when-CPUs-a.patch

While this is so far only a proof of concept, some throughput workloads seem to benefit when sharing is kept to SMT siblings - specifically, when there is only enough work for the real cores, there is a demonstrable improvement. Latency is more consistently kept within bounds. But it's not all improvement, with some workloads showing slightly lower throughput. When sharing is extended to MC siblings, the results are mixed and change dramatically depending on how many cores you have. Some workloads benefit a lot, while others suffer a lot. Worst case latency improves the more sharing is done, but in its current rudimentary form there is very little to keep tasks bound to one CPU. Given the highly variable CPU frequencies of today's CPUs, and the need to keep a task on one CPU for an extended period to allow that CPU to throttle up, throughput suffers when loads are light. Conversely, throughput seems to improve quite a lot at heavy loads.

Either way, this is pretty much an "untuned" addition to MuQSS, and for my testing at least, I think the SMT siblings sharing is advantageous and have been running it successfully for a while now.

Regardless, if you're looking for something to experiment with, as MuQSS is more or less stable these days, it should be worth giving this patch a try and seeing what you find in terms of throughput and/or latency. As with all experimental patches, I cannot guarantee the stability of the code, though I am using it on my desktop myself. Note that CPU load reporting is likely to be off. Make sure to report back any results you have!

Enjoy!
Please enjoy.

88 comments:

Unknown, 24 November 2017 at 16:09
Interesting changes.

Should be trivial to add support for this (configurable with USE flags, disabled by default) on Gentoo. If even a single person requests it (IRC or otherwise) I'll set some time aside to do some more rigorous testing / QA.

Fallback to the EAPI-6 gentoo-style /etc/portage/patches/ method should be fine for testing. It's the method I'll be using personally if I decide to give it a go (assuming nobody specifically asks for this to be included)

Anonymous, 25 November 2017 at 11:59
Gets stuck booting the kernel when running with RQSHARE_MC on my AMD Phenom X6.

Setting it to RQSHARE_NONE boots just fine.

Seems to typically stop right around PCI initialization or after the following line:
NOHZ: local_softirq_pending 02

fidaj, 25 November 2017 at 23:52
Hello.
Thank you for your work.
The processor in my testing desktop is an Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz.
Under heavy processor load, I get a smoother DE with the RQSHARE_SMT option than with RQSHARE_MC.

Anonymous, 27 November 2017 at 10:10
I get a "scheduling while atomic" error with a KVM Windows guest running:
https://paste.ubuntu.com/26054261/

I have RQSHARE_MC set.

Anonymous, 3 December 2017 at 23:25
What I'd be interested to see tested is how all of MuQSS knobs (yield_type, interactive, rr_interval) affect this new knob.

To me it seems like this new idea probably fares best in a configuration that is as cooperative as is possible.

Personally I run MuQSS (without this new knob; not too comfortable spinning my own kernel, I use Liquorix instead) like this: interactive 1, yield_type 2, rr_interval 1, and I find it to function incredibly well like that.

Throughput is hardly any less for what I do with the machine (a very unpredictable workload; it's a PC I use for basically anything and everything) but responsiveness is just spot on.

klynastor, 4 December 2017 at 03:46
Con,

Thanks for all of your work. I've been following SCHED_BFS and now SCHED_MUQSS for years, and I need some help.

I have a specific use case where I have a "real time" process that uses roughly 75% CPU on a single thread. It must receive and respond to UDP packets every 1.5ms, performing calculations on the data therein. If 15ms goes by without any communication (both directions), a fault occurs.

This process runs at SCHED_RR with priority 90. There are other RR processes in the system with priorities ranging from 50 to 80, but this one process has the highest priority.

I've been using SCHED_BFS for the last several years without any issues--up to kernel 4.5. However I have not been able to get this to work right with SCHED_MUQSS on the latest kernels:

On my PCs with only 2 threads (1 core), turning on CONFIG_SMT_NICE slows down the rest of the GUI (Xorg interface). The "real time" process claims the whole core, preventing anything else from running while it's running. So I have CONFIG_SMT_NICE disabled in order to let lower-priority RR or normal processes run. The "real time" process can still complete its workload without issues; top just reports CPU usage that is 5% to 10% higher.

However, randomly, 3-5 times per hour, the "real time" process will suddenly report that no UDP packets are being received. In fact, it tries to send UDP packets during this time and select() returns a timeout on receiving data back. (Switching back to kernel 4.5 + SCHED_BFS results in no errors.)

I've tried SCHED_MUQSS on every kernel since 4.8 and essentially have the same problem, so I initially suspect the scheduler. It's also interesting to note that this project has never worked right with the stock Linux scheduler--only SCHED_BFS so far. But lately I'm wondering if there's another driver (or the e1000e driver itself) hanging things up--but I don't know how to go about debugging this yet.

My settings: I'm using a CONFIG_PREEMPT kernel, with rr_interval = 1 and everything else default. Back in the 4.8 kernel days I tried sched_interactive = 0, but that didn't seem to help. I will try your new experimental patch next week.

Do you have any other suggestions for me?

Anonymous, 12 December 2017 at 03:11
Thanks Con.
I've done the usual throughput tests.

https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

RQSHARE_SMT is indeed improving the results on most of the tests.

Pedro

Anonymous, 16 December 2017 at 15:01
Regarding the stability -- I just performed a do-release-upgrade (Ubuntu 17.04 > 17.10) while running the MC version of the patch and it performed just fine.

In fact, I've been running it for most of the day without a single hiccup. No applications hanging up, no kernel panics, no lost network packets. Seems fine, so far.

fidaj, 4 January 2018 at 18:51
Hi. Happy New Year.
I have recently started receiving the following messages:
https://pastebin.com/bVPbJdyc
Are they related to this topic?

fidaj, 6 January 2018 at 00:56
And what should be done, taking recent events into account?
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

Anonymous, 10 January 2018 at 00:53
First of all thank you very much.
Impressive latency or better: a lack thereof.
Love this.

Anonymous, 11 January 2018 at 11:53
Agreed and if at all possible, prioritize MC over SMT. Getting amazing performance over base MuQSS with MC MuQSS and not all CPUs are SMT enabled.

Anonymous, 11 January 2018 at 18:46
Agreed, if possible prioritize MC over SMT.
:)

Anonymous, 12 January 2018 at 09:48
@10 HT cores:

Compile time switching is OK of course, but it would require packagers to keep twice the package count up to date for those of us who have neither the inclination nor the hardware to compile the kernel ourselves.

And I'm sure that neither xanmod nor liquorix are going to bother keeping both options (MC or SMT) updated in their packages. Some method of incorporating both and configuring it either at runtime or at boot time would be highly preferable.

Unknown, 1 February 2018 at 09:43
I'm getting really high idle CPU usage on the CK kernel, whether I compile myself or use one of the precompiled ones from graysky's repo. I've put details on the Arch forum here - https://bbs.archlinux.org/viewtopic.php?pid=1764327#p1764327

All other kernels (stock, Zen, self-compiled, etc) are fine.

Anonymous, 13 February 2018 at 23:27
Any news on the 4.15 sync?
I gave it a try myself; the kernel boots but does not work correctly.

Anonymous, 15 February 2018 at 09:38
Have there been any developments regarding boottime vs compile time vs runtime configuration?

Anonymous, 18 February 2018 at 07:56
Trying your new gift to us all (boottime configuration) as we speak. Will report any problems.