Wednesday, 27 June 2018

linux-4.17-ck1, MuQSS version 0.172 for linux-4.17

Announcing a new -ck release, 4.17-ck1, with the latest version of the Multiple Queue Skiplist Scheduler, version 0.172. These are patches designed to improve system responsiveness and interactivity, with specific emphasis on the desktop, but configurable for any workload.
linux-4.17-ck1:
 -ck1 patches:
 Git tree:
MuQSS only:
 Download:
 Git tree:


Web: http://kernel.kolivas.org


This is just a resync of the 4.16 MuQSS and -ck patches with linux-4.17.


Enjoy!
お楽しみ下さい
-ck

Tuesday, 1 May 2018

linux-4.16-ck1, MuQSS version 0.171 for linux-4.16

Announcing a new -ck release, 4.16-ck1, with the latest version of the Multiple Queue Skiplist Scheduler, version 0.171. These are patches designed to improve system responsiveness and interactivity, with specific emphasis on the desktop, but configurable for any workload.
linux-4.16-ck1:
 -ck1 patches:
 Git tree:
MuQSS only:
 Download:
 Git tree:


Web: http://kernel.kolivas.org


This is mostly just a resync of the 4.15 MuQSS and -ck patches with linux-4.16. The only significant difference is that threaded IRQs are now disabled in the default config, as forcing them seems to be associated with boot failures when used in concert with runqueue sharing. I still include the patch in -ck that stops build warnings from making the kernel build fail, and I've added a single patch to aid building an evil out-of-tree driver that many of us use.


Enjoy!
お楽しみ下さい
-ck

Sunday, 18 February 2018

linux-4.15-ck1, MuQSS version 0.170 for linux-4.15

Announcing a new -ck release, 4.15-ck1, with the latest version of the Multiple Queue Skiplist Scheduler, version 0.170. These are patches designed to improve system responsiveness and interactivity, with specific emphasis on the desktop, but configurable for any workload.

linux-4.15-ck1:
 -ck1 patches:
 Git tree:
MuQSS only:
 Download:
 Git tree:


Web: http://kernel.kolivas.org


The major change in this release is the addition of a much more mature version of the experimental runqueue sharing code I posted on this blog earlier. After further experimentation and lots of feedback from users, I decided to make multicore (MC) based sharing the default instead of multithread (SMT). The numbers show better throughput, and it should provide more consistently low latency than previous versions of MuQSS. For those who found that interactivity on MuQSS never quite matched that of BFS before it, this version should now equal it.

In addition, the runqueue sharing code in this release allows sharing at the SMP level too, so you can share runqueues across all physical CPUs if latency is your primary concern, even though that will likely lead to worse throughput. I have not made it possible to share between NUMA nodes, because the cost of shifting tasks across nodes is usually substantial; it may even worsen latency, and it will definitely worsen throughput.

I've also made it possible to configure runqueue sharing at boot time with the boot parameter rqshare. Set it to one of none, smt, mc or smp by appending the following to your kernel command line:
 rqshare=mc
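
As a concrete example of wiring that up (this assumes a GRUB-based distro; the file location and update command vary by distribution, and "quiet splash" is just a placeholder for whatever options you already use), you would add it to the command line in /etc/default/grub and regenerate grub.cfg:

 GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rqshare=mc"
 sudo update-grub    # or: grub-mkconfig -o /boot/grub/grub.cfg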

Documentation for the runqueue sharing code above has been added to the MuQSS patch.

A number of minor bugs were discovered and have been fixed, which has also made booting more robust.

The -ck tree is mostly just a resync of previous patches, with the addition of a patch that disables a -Werror compiler flag in the build tools which had suddenly made it impossible to build the kernel with newer GCCs on some distros.


Enjoy!
お楽しみ下さい
-ck

Friday, 24 November 2017

Runqueue sharing experiments with MuQSS.

For a while now I've been wanting to experiment with what happens when, instead of having either a global runqueue (the way BFS did) or per-CPU runqueues (the way MuQSS currently does), we share runqueues according to CPU architecture topology.

Since Simultaneous MultiThreaded (SMT) siblings (hyperthreads) sit on the same physical core and share virtually all of its resources, it is almost free, at least at the hardware level, for processes or threads to bounce between the two (or more) siblings. This obviously doesn't take into account the fact that the kernel itself has many unique structures for each logical CPU, so sharing there is not really free. It is also interesting to see what happens if we extend that thinking to CPUs that only share cache, such as MultiCore (MC) siblings. Virtually all of today's CPUs are a combination of one or both of these sharing types.
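
If you want to see how your own machine's logical CPUs group into SMT and MC siblings, the kernel already exports the topology through sysfs and lscpu (standard interfaces, shown here purely for convenience):

 cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # SMT siblings of cpu0
 cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list     # CPUs in the same physical package
 lscpu --extended                                                  # per-CPU core/socket overview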

At least theoretically, there could be significant advantages to decreasing the number of runqueues: less runqueue overhead, and lower latency from each scheduling decision being able to choose from more processes per CPU. On the throughput side the decreased overhead would also help, at the potential expense of slightly more spinlock contention - the more CPUs share a runqueue, the more contention - though if the amount of sharing is kept small it should be negligible. On the sharing side, given the lack of a formal balancing system in MuQSS, sharing between the logical CPUs that are cheapest to switch or balance to should automatically improve throughput for certain workloads. Additionally, with SMT sharing, if light workloads can be bound to just the two threads of one core, there could be better CPU speed consolidation and substantial power saving advantages.

To that end, I've created experimental code for MuQSS that does this exact thing in a configurable way. You can configure the scheduler to share by SMT siblings or MC siblings. Only the runqueue locks and the process skip lists are actually shared. The rest of the runqueue structures at this stage are all still discrete per logical CPU.
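
To illustrate the shape of that arrangement (this is only a toy sketch in userspace C, not the actual MuQSS code; all names here are invented for illustration), each logical CPU keeps its own runqueue structure, but siblings in the same sharing group point their lock and skip list at the group leader's copies:

#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4  /* toy example: 2 cores with 2 SMT threads each */

struct skiplist;   /* stand-in for the per-runqueue task skip list */

/* Per-logical-CPU runqueue: most fields stay private to the CPU, but the
 * lock and skip list are pointers so siblings can share one instance. */
struct toy_rq {
	pthread_spinlock_t *lock;   /* shared with siblings when grouping is enabled */
	struct skiplist *sl;        /* shared task skip list */
	unsigned long nr_running;   /* still discrete per logical CPU */
	int cpu;
};

static struct toy_rq rqs[NR_CPUS];
static pthread_spinlock_t group_locks[NR_CPUS];
static struct skiplist *group_sls[NR_CPUS];

/* First CPU of the group this CPU shares a runqueue with.  In this toy,
 * CPUs 0,1 are SMT siblings of core 0 and CPUs 2,3 of core 1. */
static int rq_leader(int cpu, int cpus_per_group)
{
	return cpu - (cpu % cpus_per_group);
}

static void setup_shared_runqueues(int cpus_per_group)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int leader = rq_leader(cpu, cpus_per_group);

		if (cpu == leader) {
			/* The leader owns the real lock and skip list... */
			pthread_spin_init(&group_locks[leader], PTHREAD_PROCESS_PRIVATE);
			group_sls[leader] = NULL;  /* skip list creation elided */
		}
		/* ...and every sibling simply points at the leader's copies. */
		rqs[cpu].lock = &group_locks[leader];
		rqs[cpu].sl = group_sls[leader];
		rqs[cpu].nr_running = 0;
		rqs[cpu].cpu = cpu;
	}
}

int main(void)
{
	setup_shared_runqueues(2);  /* 2 = group SMT siblings in this toy */
	printf("cpu1 shares cpu0's runqueue lock: %s\n",
	       rqs[1].lock == rqs[0].lock ? "yes" : "no");
	return 0;
}

With cpus_per_group set to 2 the toy mimics SMT-style grouping; setting it to 4 would have all four logical CPUs share one lock, analogous to MC sharing on a two-core, four-thread chip.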

Here is a git tree based on 4.14 and the current 0.162 version of MuQSS:
4.14-muqss-rqshare

And for those who use traditional patches, here is a patch that can be applied on top of a muqss-162 patched kernel:
0001-Implement-the-ability-to-share-runqueues-when-CPUs-a.patch
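
If you go the traditional-patch route, a typical sequence would look something like this (assuming a 4.14 tree that already has the muqss-162 patch applied; the paths are illustrative):

 cd linux-4.14
 patch -p1 < ../0001-Implement-the-ability-to-share-runqueues-when-CPUs-a.patch
 make oldconfig
 make -j$(nproc)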

While this is so far only a proof of concept, some throughput workloads seem to benefit when sharing is kept to SMT siblings - specifically, when there is only enough work for the real cores there is a demonstrable improvement. Latency is more consistently kept within bounded levels. It's not all improvement, though, with some workloads showing slightly lower throughput. When sharing is extended to MC siblings the results are mixed, and they change dramatically depending on how many cores you have: some workloads benefit a lot, while others suffer a lot. Worst-case latency improves the more sharing is done, but in its current rudimentary form there is very little to keep tasks bound to one CPU, and with the highly variable frequencies of today's CPUs - which need tasks bound to them for an extended period before they ramp up - throughput suffers when loads are light. Conversely, it seems to improve quite a lot under heavy loads.

Either way, this is pretty much an "untuned" addition to MuQSS, but in my testing at least I find the SMT sibling sharing advantageous, and I have been running it successfully for a while now.

Regardless, if you're looking for something to experiment with - MuQSS being more or less stable these days - it should be worth giving this patch a try and seeing what you find in terms of throughput and/or latency. As with all experimental patches, I cannot guarantee the stability of the code, though I am using it on my own desktop. Note that CPU load reporting is likely to be off. Make sure to report back any results you have!

Enjoy!
お楽しみください