For a while now I've wanted to experiment with what happens when, instead of having either a global runqueue (the way BFS did) or per-CPU runqueues (the way MuQSS currently does), we make runqueues shared according to CPU topology.
Given that Simultaneous MultiThreading (SMT) siblings, i.e. hyperthreads, actually sit on the one physical core and share virtually all resources, it is almost free, at least at the hardware level, for processes or threads to bounce between the two (or more) siblings. This obviously doesn't take into account the fact that the kernel itself has many unique structures for each logical CPU, so sharing there is not really free. It is also interesting to see what happens if we extend that thinking to CPUs that only share cache, such as MultiCore (MC) siblings. Virtually all of today's CPUs use one or both of these kinds of sharing.
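If you want to see which logical CPUs fall into each group on your own machine, Linux exposes sibling lists in sysfs under /sys/devices/system/cpu/cpuN/topology/, in files such as thread_siblings_list (SMT siblings) and core_siblings_list (CPUs in the same package, which only roughly approximates the MC grouping discussed here). The following is a minimal userspace sketch, not part of MuQSS; the function names parse_cpulist and read_siblings are made up for illustration, and it assumes no more than 64 logical CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style CPU list such as "0,4" or "0-3,8-11" into a bitmask.
 * Returns 0 on success, -1 on malformed input or a CPU number >= 64.
 * (Hypothetical helper, for illustration only.)
 */
int parse_cpulist(const char *s, uint64_t *mask)
{
    uint64_t m = 0;
    const char *p = s;

    assert(s != NULL && mask != NULL);

    while (*p && *p != '\n') {
        char *end;
        unsigned long lo, hi;

        if (*p < '0' || *p > '9')
            return -1;
        lo = strtoul(p, &end, 10);
        hi = lo;
        p = end;

        if (*p == '-') {                /* a range such as "0-3" */
            p++;
            if (*p < '0' || *p > '9')
                return -1;
            hi = strtoul(p, &end, 10);
            p = end;
        }
        if (hi < lo || hi >= 64)
            return -1;
        for (unsigned long c = lo; c <= hi; c++)
            m |= UINT64_C(1) << c;

        if (*p == ',') {                /* another entry must follow */
            p++;
            if (*p == '\0' || *p == '\n')
                return -1;
        } else if (*p && *p != '\n') {
            return -1;
        }
    }
    *mask = m;
    return 0;
}

/*
 * Read one topology file for a CPU, e.g. file = "thread_siblings_list".
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
int read_siblings(int cpu, const char *file, uint64_t *mask)
{
    char path[128], buf[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, file);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_cpulist(buf, mask);
}
```

Calling read_siblings(0, "thread_siblings_list", &mask) on a hyperthreaded machine should give a mask with two or more bits set; on a machine without SMT it should contain only CPU 0 itself.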
At least theoretically, there could be significant advantages to decreasing the number of runqueues, both from the reduced overhead and from the decreased latency we'd get by guaranteeing each CPU access to more processes with each scheduling decision. From the throughput side, the decreased overhead would also be helpful, at the potential expense of slightly more spinlock contention: the more CPUs that share a runqueue, the more contention, but if the amount of sharing is kept small it should be negligible. From the actual sharing side, given the lack of a formal balancing system in MuQSS, sharing between the logical CPUs that are cheapest to switch/balance to should automatically improve throughput for certain workloads. Additionally, with SMT sharing, if light workloads can be bound to just two threads on the same core, there could be better CPU speed consolidation and substantial power saving advantages.
To that end, I've created experimental code for MuQSS that does this exact thing in a configurable way. You can configure the scheduler to share by SMT siblings or MC siblings. Only the runqueue locks and the process skip lists are actually shared. The rest of the runqueue structures at this stage are all still discrete per logical CPU.
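To picture that split between shared and per-CPU state, here is a rough userspace model. It is only a sketch under my own assumptions and is not the code in the patch: the struct and function names (shared_queue, rq, setup_sharing) are hypothetical, a plain int stands in for the spinlock, a counter stands in for the skip list, and the lowest-numbered CPU in each sibling mask is arbitrarily picked as the group leader.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64

/* Hypothetical stand-in for what siblings share: one lock, one task queue. */
struct shared_queue {
    int lock;       /* placeholder for the runqueue spinlock */
    int nr_queued;  /* placeholder for the skip list of runnable tasks */
};

/* Hypothetical per-logical-CPU runqueue; all but *shared stays private. */
struct rq {
    int cpu;
    uint64_t clock;               /* example of state that remains per CPU */
    struct shared_queue *shared;  /* points at the group leader's queue */
};

/* Index of the lowest set bit in mask, or -1 if the mask is empty. */
static int lowest_cpu(uint64_t mask)
{
    for (int i = 0; i < MAX_CPUS; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return -1;
}

/*
 * Point every CPU's runqueue at the shared queue of its group leader.
 * siblings[cpu] is the bitmask of CPUs (including cpu itself) that should
 * share with it, e.g. its SMT or MC siblings.
 */
void setup_sharing(struct rq *rqs, struct shared_queue *queues,
                   const uint64_t *siblings, int ncpus)
{
    assert(ncpus > 0 && ncpus <= MAX_CPUS);

    for (int cpu = 0; cpu < ncpus; cpu++) {
        int leader = lowest_cpu(siblings[cpu]);

        if (leader < 0 || leader >= ncpus)
            leader = cpu;  /* no usable sibling info: keep a private queue */

        rqs[cpu].cpu = cpu;
        rqs[cpu].clock = 0;
        rqs[cpu].shared = &queues[leader];
    }
}
```

With four logical CPUs where 0/2 and 1/3 are hyperthread pairs, CPUs 0 and 2 end up pointing at the same shared_queue, so a task queued by one is visible to the other, while each CPU keeps its own clock and other private fields.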
Here is a git tree based on 4.14 and the current 0.162 version of MuQSS:
And for those who use traditional patches, here is a patch that can be applied on top of a muqss-162 patched kernel:
While this is only a proof of concept so far, there are some throughput workloads that seem to benefit when sharing is kept to SMT siblings. Specifically, when there is only enough work for the real cores, there is a demonstrable improvement. Latency is more consistently kept within bounds. But it's not all improvement, with some workloads showing slightly lower throughput. When sharing is moved to MC siblings, the results are mixed, and they change dramatically depending on how many cores you have. Some workloads benefit a lot, while others suffer a lot. Worst case latency improves the more sharing is done, but in its current rudimentary form there is very little to keep tasks bound to one CPU. Given the highly variable frequencies of today's CPUs, and the need to keep a task on one CPU for an extended period to allow that CPU to throttle up, throughput suffers when loads are light. Conversely, throughput seems to improve quite a lot at heavy loads.
Either way, this is pretty much an "untuned" addition to MuQSS. In my testing at least, I think SMT sibling sharing is advantageous, and I have been running it successfully for a while now.
Regardless, if you're looking for something to experiment with, as MuQSS is more or less stable these days, it should be worth giving this patch a try to see what you find in terms of throughput and/or latency. As with all experimental patches, I cannot guarantee the stability of the code, though I am using it on my desktop myself. Note that CPU load reporting is likely to be off. Make sure to report back any results you have!