For a while now I've been wanting to experiment with what happens when, instead of having either a global runqueue - the way BFS did, or per CPU runqueues - the way MuQSS currently does, we made runqueues shared depending on CPU architecture topology.
Given that Simultaneous MultiThreading (SMT) siblings (hyperthreads) sit on the one physical core and share virtually all resources, it is almost free, at least at the hardware level, for processes or threads to bounce between the two (or more) siblings. This obviously doesn't take into account the fact that the kernel itself has many unique structures for each logical CPU, so sharing there is not really free. Additionally, it is interesting to see what happens if we extend that thinking to CPUs that only share cache, such as MultiCore (MC) siblings. Virtually all of today's CPUs use one or both of these kinds of sharing.
At least theoretically, there could be significant advantages to decreasing the number of runqueues for the overhead effects they have, and the decreased latency we'd get from guaranteeing access to more processes on a per-CPU basis with each scheduling decision. From the throughput side, the decreased overhead would also be helpful at the potential expense of slightly more spinlock contention - the more shared runqueues, the more contention, but if the amount of sharing is kept small it should be negligible. From the actual sharing side, given the lack of a formal balancing system in MuQSS, sharing the logical CPUs that are cheapest to switch/balance to should automatically improve throughput for certain workloads. Additionally, with SMT sharing, if light workloads can be bound to just two threads on the same core, there could be better cpu speed consolidation and substantial power saving advantages.
To that end, I've created experimental code for MuQSS that does this exact thing in a configurable way. You can configure the scheduler to share by SMT siblings or MC siblings. Only the runqueue locks and the process skip lists are actually shared. The rest of the runqueue structures at this stage are all still discrete per logical CPU.
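As a rough illustration of the two sharing modes, here is a minimal Python sketch that groups logical CPUs the way a shared runqueue might. It is not the MuQSS code: the function names are made up for this example, and it assumes the standard Linux sysfs files thread_siblings_list (SMT siblings) and core_siblings_list (CPUs in the same package, used here only as an approximation of MC siblings that share cache).

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,4" or "0-3,8-11" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_cpus(siblings_by_cpu):
    """Collapse per-CPU sibling lists into distinct groups.

    siblings_by_cpu maps a CPU number to its sibling list string.
    Each returned tuple is one hypothetical shared runqueue.
    """
    groups = {tuple(parse_cpu_list(s)) for s in siblings_by_cpu.values()}
    return sorted(groups)


def read_siblings(kind="thread_siblings_list", root="/sys/devices/system/cpu"):
    """Read one sibling list per logical CPU from sysfs (Linux only)."""
    result = {}
    for path in Path(root).glob("cpu[0-9]*/topology/" + kind):
        cpu = int(path.parent.parent.name[3:])  # "cpu12" -> 12
        result[cpu] = path.read_text()
    return result
```

On a Linux machine, group_cpus(read_siblings("thread_siblings_list")) lists the SMT groups and group_cpus(read_siblings("core_siblings_list")) the package-wide groups. The fewer and larger the groups, the fewer runqueues there are and the more CPUs contend for each lock.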
Here is a git tree based on 4.14 and the current 0.162 version of MuQSS:
4.14-muqss-rqshare
And for those who use traditional patches, here is a patch that can be applied on top of a muqss-162 patched kernel:
0001-Implement-the-ability-to-share-runqueues-when-CPUs-a.patch
While this is so far only a proof of concept, there are some throughput workloads that seem to benefit when sharing is kept to SMT siblings - specifically when there is only enough work for real cores, there is a demonstrable improvement. Latency is more consistently kept within bounds. But it's not all improvement, with some workloads showing slightly lower throughput. When sharing is moved to MC siblings, the results are mixed, and they change dramatically depending on how many cores you have. Some workloads benefit a lot, while others suffer a lot. Worst case latency improves the more sharing that is done, but in its current rudimentary form there is very little to keep tasks bound to one CPU. With the highly variable CPU frequencies of today's CPUs, and the need to bind tasks to one CPU for an extended period to allow that CPU to throttle up, throughput suffers when loads are light. Conversely, throughput seems to improve quite a lot at heavy loads.
Either way, this is pretty much an "untuned" addition to MuQSS, and for my testing at least, I think the SMT siblings sharing is advantageous and have been running it successfully for a while now.
Regardless, if you're looking for something to experiment with, as MuQSS is more or less stable these days, it should be worth giving this patch a try and see what you find in terms of throughput and/or latency. As with all experimental patches, I cannot guarantee the stability of the code, though I am using it on my desktop myself. Note that CPU load reporting is likely to be off. Make sure to report back any results you have!
Enjoy!
お楽しみください
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Showing posts with label MuQSS. Show all posts
Friday, 24 November 2017
Monday, 20 November 2017
linux-4.14-ck1, MuQSS version 0.162 for linux-4.14
Announcing a new -ck release, 4.14-ck1 with the latest version of the Multiple Queue Skiplist Scheduler, version 0.162. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
linux-4.14-ck1:
-ck1 patches:
Git tree:
MuQSS only:
Download:
Git tree:
Apart from minor cleanups and syncing with the current kernel, I have removed the patch from -ck1 that made BFQ the default, as it did not work reliably. I have instead changed BFQ's default config setting to on, along with the blk-mq SCSI default, which is required to make it work. It will not default to BFQ at boot time.
Enjoy!
お楽しみ下さい
-ck
Monday, 16 October 2017
linux-4.13-ck1, MuQSS version 0.161 for linux-4.13
Announcing a new -ck release, 4.13-ck1 with the latest version of the Multiple Queue Skiplist Scheduler, version 0.161. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
linux-4.13-ck1:
-ck1 patches:
Git tree:
MuQSS only:
Download:
Git tree:
This version is no more than a resync from 4.12-ck2
Enjoy!
お楽しみ下さい
-ck
Tuesday, 15 August 2017
linux-4.12-ck2, MuQSS version 0.160 for linux-4.12
Announcing a new -ck release, 4.12-ck2 with the latest version of the Multiple Queue Skiplist Scheduler, version 0.160. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
linux-4.12-ck2
-ck2 patches:
Git tree:
MuQSS
Download:
Git tree:
Sorry about the delay. I skipped announcing 4.12-ck1 as there was a lingering bug report from pf (thanks for the extensive report!) and a config problem in it that rendered it unbootable without extra config options.
MuQSS 0.160 updates
- Fixed race leading to crash on use of sched_setaffinity.
4.12-ck2 updates
- BFQ is now in mainline so it is no longer part of the patchset.
- BFQ now enabled by default along with scsi multiqueue to enable booting with it by default.
- Enable setting new kyber I/O scheduler as default as well (I recommend people use BFQ though.)
- Removed the mandatory swap_full() flag in the swap sucks patch after reports saying it was unhelpful.
Enjoy!
お楽しみ下さい
-ck
Friday, 26 May 2017
linux-4.11-ck2, MuQSS version 0.156 for linux-4.11
Announcing a new -ck release, 4.11-ck2 with the latest version of the Multiple Queue Skiplist Scheduler, version 0.156. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
linux-4.11-ck2
-ck1 patches:
Git tree:
MuQSS
Download:
Git tree:
MuQSS 0.156 updates
- Fixed failed UP builds.
- Remove the last traces of the global run queue data, moving nr_running, nr_uninterruptible and nr_switches to each runqueue. Calculate nr_running accurately at the end of each context switch only once, reusing the variable in place of rq_load. (May improve reported load accuracy.)
4.11-ck2 updates
- Make full preempt default on all arches.
- Revert inappropriately reverted part of vmsplit patch.
Enjoy!
お楽しみ下さい
-ck
I seem to have unintentionally deleted the -ck1 post, sorry about that.
Monday, 20 February 2017
linux-4.10-ck1, MuQSS version 0.152 for linux-4.10
Announcing a new -ck release, 4.10-ck1 with a new version of the Multiple Queue Skiplist Scheduler, version 0.152. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
linux-4.10-ck1
-ck1 patches:
http://ck.kolivas.org/patches/4.0/4.10/4.10-ck1/
Git tree:
https://github.com/ckolivas/linux/tree/4.10-ck
Ubuntu 16.10 packages (sorry I'm no longer on 16.04):
http://ck.kolivas.org/patches/4.0/4.9/4.10-ck1/Ubuntu16.10/
MuQSS
Download:
4.10-sched-MuQSS_152.patch
Git tree:
4.10-muqss
MuQSS 0.152 updates
- Removed the rapid ramp-up in schedutil cpufreq which was overactive.
- Bugfixes
4.10-ck1 updates
Apart from resyncing with the latest tree from linux-bfq:
- The wb-buf-throttling patches are now part of mainline and do not need to be added separately
- Minor swap setting tweaks
For those of you trying to build the evil nvidia driver for linux-4.10, this patch will help:
nvidia-375.39-linux-4.10.patch
Enjoy!
お楽しみ下さい
-ck
Monday, 12 December 2016
linux-4.9-ck1, MuQSS version 0.150
Announcing a new -ck release, 4.9-ck1 with a new version of the Multiple Queue Skiplist Scheduler, version 0.150. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.
linux-4.9-ck1
-ck1 patches:
http://ck.kolivas.org/patches/4.0/4.9/4.9-ck1/
Git tree:
https://github.com/ckolivas/linux/tree/4.9-ck
Ubuntu 16.04 LTS packages:
http://ck.kolivas.org/patches/4.0/4.9/4.9-ck1/Ubuntu16.04/
MuQSS
Download:
4.9-sched-MuQSS_150.patch
Git tree:
4.9-muqss
MuQSS 0.150 updates
Regarding MuQSS, apart from a resync to linux-4.9, which has numerous hotplug and cpufreq changes (again!), I've cleaned up the patch to not include any Hz changes of its own, leaving Hz changes up to users to choose, unless they use the -ck patchset.
Additionally, I've modified sched_yield yet again. Since expected behaviour is different for different (inappropriate) users out there of sched_yield, I've made it tunable in /proc/sys/kernel/yield_type and changed the default to what I believe should happen. From the documentation I added in Documentation/sysctl/kernel.txt:
yield_type: (MuQSS CPU scheduler only)
This determines what type of yield calls to sched_yield will perform.
0: No yield.
1: Yield only to better priority/deadline tasks. (default)
2: Expire timeslice and recalculate deadline.
Previous versions of MuQSS defaulted to type 2 above. If you find behavioural regressions with any of your workloads try switching it back to 2.
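For anyone wanting to script the switch back to type 2, here is a minimal Python sketch. The /proc path and the three values come from the documentation quoted above; the YieldType names and both helper functions are invented for this example. Writing needs root and a MuQSS kernel.

```python
from enum import IntEnum

YIELD_TYPE_PATH = "/proc/sys/kernel/yield_type"  # path quoted in the post


class YieldType(IntEnum):
    """The three documented settings; the member names are illustrative."""
    NO_YIELD = 0          # 0: No yield.
    YIELD_TO_BETTER = 1   # 1: Yield only to better priority/deadline tasks.
    EXPIRE_TIMESLICE = 2  # 2: Expire timeslice and recalculate deadline.


def read_yield_type(path=YIELD_TYPE_PATH):
    """Return the current setting as a YieldType."""
    with open(path) as handle:
        return YieldType(int(handle.read().strip()))


def write_yield_type(value, path=YIELD_TYPE_PATH):
    """Validate the value, then write it; raises ValueError outside 0-2."""
    mode = YieldType(value)
    with open(path, "w") as handle:
        handle.write(f"{int(mode)}\n")
    return mode
```

Calling write_yield_type(2) restores the behaviour of earlier MuQSS versions, and read_yield_type() confirms what the kernel accepted.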
4.9-ck1 updates
Apart from resyncing with the latest trees from linux-bfq and wb-buf-throttling:
- Added a new kernel configuration option to enable threaded IRQs and set it by default
- Changed Hz to default to the safe 100 value, removing 128 which caused spurious issues and had no real world advantage.
- Fixed a build for muqss disabled (why would you use -ck and do that I don't know)
- Made hrtimers not be used if we know we're in suspend which may have caused suspend failures for drivers that did not use correct freezable vs normal timeouts
- Enabled bfq and set it to default
- Enabled writeback throttling by default
Enjoy!
お楽しみ下さい
-ck
Labels:
-ck,
4.9,
interactivity,
kernel,
linux,
MuQSS,
sched_yield,
scheduler
Tuesday, 22 November 2016
linux-4.8-ck8, MuQSS version 0.144
Here's a new release to go along with and commemorate the 4.8.10 stable release (they're releasing stable releases faster than my development code now.)
linux-4.8-ck8 patch:
patch-4.8-ck8.lrz
MuQSS by itself:
4.8-sched-MuQSS_144.patch
There are a small number of updates to MuQSS itself.
Notably there's an improvement in interactive mode when SMT nice is enabled and/or realtime tasks are running, or there are users of CPU affinity. Tasks previously would not schedule on CPUs when they were stuck behind those as the highest priority task and it would refuse to schedule them transiently.
The old hacks for CPU frequency changes from BFS have been removed, leaving the tunables to default as per mainline.
The default of 100Hz has been removed, but in its place a new and recommended 128Hz has been implemented - this is just a silly microoptimisation to take advantage of the fast shifts that /128 has on CPUs compared to /100, and is close enough to 100Hz to behave otherwise the same.
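The point about shifts is simply that 128 is a power of two, so dividing by it is a right shift by 7 bits, whereas 100 has no such shortcut. A minimal Python sketch of that arithmetic follows; the function names are illustrative and nothing here is taken from the kernel source.

```python
def div128(value):
    """Divide a non-negative integer by 128 with a right shift (128 == 2**7)."""
    return value >> 7


def tick_period_ms(hz):
    """Length of one scheduler tick in milliseconds for a given Hz."""
    return 1000.0 / hz


# The shift gives the same answer as integer division for every value tried.
assert all(div128(v) == v // 128 for v in range(100000))
```

Here tick_period_ms(128) is 7.8125 and tick_period_ms(100) is 10.0, which is the sense in which 128Hz is close enough to 100Hz to behave the same.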
For the -ck patch only I've reinstated updated and improved versions of the high resolution timeouts to improve behaviour of userspace that is inappropriately Hz dependent allowing low Hz choices to not affect latency.
Additionally by request I've added a couple of tunables to adjust the behaviour of the high res timers and timeouts.
/proc/sys/kernel/hrtimer_granularity_us
and
/proc/sys/kernel/hrtimeout_min_us
Both of these are in microseconds and can be set from 1-10,000. The first is how accurate high res timers will be in the kernel and is set to 100us by default (on mainline it is Hz accuracy).
The second is how small to make a request for a "minimum timeout" generically in all kernel code. It is set to 1000us by default (on mainline it is one tick).
I doubt you'll find anything useful by tuning these but feel free to go nuts. Decreasing the second tunable much further risks breaking some driver behaviour.
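If you do want to go nuts, a small wrapper that refuses out-of-range values may save some typos. This is a minimal sketch: the two file names, the 1-10,000 range and the defaults are taken from the text above, while the helper names are made up here. Writing needs root and a -ck kernel that carries these tunables.

```python
from pathlib import Path

# Tunable name -> default in microseconds, as quoted in the post.
TUNABLES = {
    "hrtimer_granularity_us": 100,
    "hrtimeout_min_us": 1000,
}
MIN_US, MAX_US = 1, 10000  # accepted range quoted in the post


def check_value(name, value_us):
    """Return the string to write, or raise if the name or value is invalid."""
    if name not in TUNABLES:
        raise KeyError(f"unknown tunable: {name}")
    if not MIN_US <= value_us <= MAX_US:
        raise ValueError(f"{name} must be between {MIN_US} and {MAX_US} us")
    return str(value_us)


def set_tunable(name, value_us, root="/proc/sys/kernel"):
    """Validate and write one tunable under /proc/sys/kernel."""
    Path(root, name).write_text(check_value(name, value_us))


def reset_defaults(root="/proc/sys/kernel"):
    """Put both tunables back to the defaults listed above."""
    for name, default_us in TUNABLES.items():
        set_tunable(name, default_us, root)
```

For example, set_tunable("hrtimer_granularity_us", 50) halves the default granularity, and reset_defaults() undoes any experiments.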
Enjoy!
お楽しみ下さい
-ck
Labels:
-ck,
4.8,
cpufreq,
hyperthreading,
interactivity,
kernel,
latency,
linux,
MuQSS,
real-time,
scheduler,
sleep
Saturday, 12 November 2016
linux-4.8-ck7, MuQSS version 0.140
Another week has passed, another stable linux release, and to follow, another -ck and MuQSS release.
linux-4.8-ck7 patch:
patch-4.8-ck7.lrz
Split out patches:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck7/patches/
MuQSS by itself for 4.8:
4.8-sched-MuQSS_140.patch
MuQSS by itself for 4.7:
4.7-sched-MuQSS_140.patch
This release marks a change towards conservative changes only.
I've rolled back the extensive timer changes outside the main scheduler code. There are too many assumptions made about timeouts in the kernel code that are potentially problematic in the real world, and there is code that is poorly prepared for freezer usage (suspend to ram) that breaks. Additionally, not a single user reported a workload that they noticed benefited from the lower latency accurate timeouts. Finally, the added overhead is demonstrable in throughput benchmarks, and when doing comparisons with mainline it is doing MuQSS a disservice to mix in other code that it's not actually responsible for.
There are also a small number of bugfixes for warnings/crashes in the updated MuQSS that showed up after the last release as people are using it on more and varied hardware in the wild now. These may have positive effects on other less defined issues in the wild too.
The -ck release also includes an updated version of BFQ. Along with this updated version, I would like to issue a warning regarding BFQ. I have heard rumour that a number of users have reported filesystem corruption with the combination of BTRFS and BFQ. If you are using this filesystem, I urge you to not compile in BFQ at all, or at the very least not make it default to BFQ, using it selectively on devices you are running a different filesystem (I still recommend people use ext4.) I would like to encourage users who have run into this problem to report it to the BFQ maintainer.
I've cleaned up the patches in the -ck tarball once again to include only the changes in combined related patches. This will ease the burden of porting to the next major linux kernel release and allow users to easily select which patches they wish to use themselves.
As always, make sure to give me your feedback, bug reports, warnings, and bitcoin.
Enjoy!
お楽しみ下さい
-ck
Saturday, 5 November 2016
linux-4.8-ck6, MuQSS version 0.135
4.8-ck6 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck6/
MuQSS by itself for 4.8:
4.8-sched-MuQSS_135.patch
MuQSS by itself for 4.7:
4.7-sched-MuQSS_135.patch
Git tree:
https://github.com/ckolivas/linux
A week has passed since the last major update to BFS and -ck was posted, allowing me to concentrate on receiving and responding to any bug reports. As it turns out, there were very few apart from the recurring local_softirq_pending warning/stalls. This is nice because it means MuQSS is mostly ~stable now. Mainline has even had more "stable" releases in the same time as MuQSS for 4.8, moving to 4.8.6 in the interim.
In this version I've added aggressive handling of pending softirqs in the hope the warnings and stalls all go away. The true reason the softirqs are being dropped still escapes me, but it is likely related to the fact that MuQSS does a lot of lockless rescheduling across CPUs to decrease overhead, which does not give the guarantees that locking would.
Additionally, I've added a number of APIs to the kernel to do specified millisecond schedule timeouts which use the highres timers which are mandatory now for MuQSS. The reason for doing this is there are many timeouts in the kernel that specify values below 10ms and the timer resolution at 100Hz only guarantees timeouts under 20ms.
I've also done a code sweep across the entire kernel looking for timeout calls under 50ms and used the new interface in their place. Additionally, there are numerous places in the kernel where schedule_timeout(1) is used where a "minimum timeout" is expected, yet this is entirely Hz dependent, again being up to 20ms in duration. I've replaced all these with a 1ms timeout, emulating what would happen on a 1000Hz kernel, but without the overhead of running the higher Hz kernel. I'm not entirely sure this will equate to any real world improvements but the fact it's used in things like audio drivers worries me that it might.
Finally I've replaced the standard msleep call from userspace to use highres timers, in case there are userspace applications that expect msleep to actually give some kind of sleep that resembles what's asked of it, instead of something Hz limited, and in case this is leading to slowdowns in userspace due to assumptions on the userspace coders' part. Calls to msleep() from userspace now give 100us accuracy at 100Hz instead of 20ms.
All these timing changes add overhead since they're trying to emulate the timing accuracy of running at 1000Hz but in a latency-focused scheduler I believe they're appropriate, and they do not incur the overhead that actually changing Hz would incur. Additionally they add accuracy to timers and timeouts that 1000Hz does not afford.
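To see where the 20ms figure comes from, here is a deliberately simplified Python model of a tick-based sleep. It assumes the request is rounded up to whole ticks and that one extra tick is added because the sleep can start anywhere inside the current tick. This is an approximation for illustration only, not the kernel's actual code, and the function name is made up.

```python
import math


def tick_sleep_worst_case(requested_ms, hz):
    """Return (ticks, worst_case_ms) for a tick-based sleep.

    Simplified model: round the request up to whole ticks, then add one
    tick of slack for starting part-way through the current tick.
    """
    tick_ms = 1000.0 / hz
    ticks = math.ceil(requested_ms / tick_ms) + 1
    return ticks, ticks * tick_ms
```

Under this model a 1ms request at 100Hz can take 2 ticks, or 20ms, while the same request at 1000Hz is bounded by 2ms. High resolution timers sidestep the rounding altogether, which is how a 100Hz kernel can still offer 100us accuracy.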
In the -ck tarball of broken-out patches, I've kept these timer changes separate to allow the muqss scheduler to be applied by itself should they prove problematic, and they will make merging with future kernels easier.
Enjoy!
お楽しみください
-ck
Saturday, 29 October 2016
linux-4.8-ck5, MuQSS version 0.120
Announcing a new version of MuQSS and a -ck release to go with it in concert with mainline releasing 4.8.5
4.8-ck5 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck5/
MuQSS by itself for 4.8:
4.8-sched-MuQSS_120.patch
MuQSS by itself for 4.7:
4.7-sched-MuQSS_120.patch
Git tree:
https://github.com/ckolivas/linux
This is a fairly substantial update to MuQSS which includes bugfixes for the previous version, performance enhancements, new features, and completed documentation. This will likely be the first publicly announced version on LKML.
EDIT: Announce here: LKML
New features:
- MuQSS is now a tickless scheduler. That means it can maintain its guaranteed low latency even in a build configured with a low Hz tick rate. To that end, it is now defaulting to 100Hz, and it is recommended to use this as the default choice for it leads to more throughput and power savings as well.
- Improved performance for single threaded workloads with CPU frequency scaling.
- Full NoHZ now supported. This disables ticks on busy CPUs instead of just idle ones. Unlike mainline, MuQSS can do this virtually all the time, regardless of how many tasks are currently running. However this option is for very specific use cases (compute servers running specific workloads) and not for regular desktops or servers.
- Numerous other configuration options that were previously disabled from mainline are now allowed again (though not recommended for regular users.)
- Completed documentation can now be found in Documentation/scheduler/sched-MuQSS.txt
Bugfixes:
- Fix for the various stalls some people were still experiencing, along with the softirq pending warnings.
- Fix for some loss of CPU for heavily sched_yielding tasks.
- Fix for the BFQ warning (-ck only)
Enjoy!
お楽しみ下さい
-ck
Monday, 24 October 2016
Interbench benchmarks for MuQSS 116
As mentioned in my previous post, I recently upgraded interbench which is a benchmark application I invented/wrote to assess perceptible latency in the setting of various loads. The updates were to make the results meaningful on today's larger ram/multicore machines where the load scales accordingly.
The results for mainline 4.8.4 and 4.8.4-ck4 on a multithreaded hexcore (init 1) can be found here:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161024/
and are copied below. I do not have swap on this machine so the "memload" was not performed. This is a 3.6GHz hexcore with 64GB ram and fast Intel SSDs so to show any difference on this is nice. To make it easier, I've highlighted it in colours similar to the throughput benchmarks I posted previously. Blue means within 1% of each other, red means significantly worse and green significantly better.
Load set to 12 processors
Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4 at datestamp 201610242116
Comment: cfs

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.1 +/- 0.1           0.1           100             100
Video     0.0 +/- 0.0           0.1           100             100
X         0.1 +/- 0.1           0.1           100             100
Burn      0.0 +/- 0.0           0.0           100             100
Write     0.1 +/- 0.1           0.1           100             100
Read      0.1 +/- 0.1           0.1           100             100
Ring      0.0 +/- 0.0           0.1           100             100
Compile   0.0 +/- 0.0           0.0           100             100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.1 +/- 0.1           0.1           100             100
X         0.1 +/- 0.1           0.1           100             100
Burn      17.4 +/- 19.5         46.3          87              7.62
Write     0.1 +/- 0.1           0.1           100             100
Read      0.1 +/- 0.1           0.1           100             100
Ring      0.0 +/- 0.0           0.0           100             100
Compile   17.4 +/- 19.1         45.9          89.5            6.07

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.0 +/- 0.1           1.0           100             99.3
Video     13.4 +/- 25.8         68.0          36.2            27.3
Burn      94.4 +/- 127.0        334.0         12.9            4.37
Write     0.1 +/- 0.4           4.0           97.4            96.4
Read      0.1 +/- 0.7           4.0           96.2            93.8
Ring      0.5 +/- 1.9           9.0           89.3            84.9
Compile   93.3 +/- 127.7        333.0         12.2            4.2

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU
None      0.0 +/- 0.2           2.2           100
Video     7.9 +/- 21.4          69.3          92.7
X         1.4 +/- 1.6           2.7           98.7
Burn      136.5 +/- 145.3       360.8         42.3
Write     1.8 +/- 2.0           4.4           98.2
Read      11.2 +/- 20.3         47.8          89.9
Ring      8.1 +/- 8.1           8.2           92.5
Compile   152.3 +/- 166.8       346.1         39.6
Load set to 12 processors
Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4-ck4+ at datestamp 201610242047
Comment: muqss116-int1

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.0 +/- 0.0           0.0           100             100
Video     0.0 +/- 0.0           0.0           100             100
X         0.0 +/- 0.0           0.0           100             100
Burn      0.0 +/- 0.0           0.0           100             100
Write     0.0 +/- 0.0           0.1           100             100
Read      0.0 +/- 0.0           0.0           100             100
Ring      0.0 +/- 0.0           0.0           100             100
Compile   0.0 +/- 0.1           0.8           100             100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.0 +/- 0.0           0.0           100             100
X         0.0 +/- 0.0           0.0           100             100
Burn      3.1 +/- 7.2           17.7          100             81.6
Write     0.0 +/- 0.0           0.5           100             100
Read      0.0 +/- 0.0           0.0           100             100
Ring      0.0 +/- 0.0           0.0           100             100
Compile   10.5 +/- 13.3         19.7          100             37.3

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.0 +/- 0.1           1.0           100             99.3
Video     3.7 +/- 12.1          56.0          89              82.6
Burn      47.2 +/- 66.5         142.0         16.7            7.58
Write     0.1 +/- 0.5           5.0           97.7            95.7
Read      0.1 +/- 0.7           4.0           95.6            93.5
Ring      0.5 +/- 1.9           12.0          89.8            86
Compile   55.9 +/- 77.6         196.0         18.6            8.12

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU
None      0.0 +/- 0.1           0.5           100
Video     1.2 +/- 1.2           1.8           98.8
X         1.4 +/- 1.6           2.9           98.7
Burn      130.9 +/- 132.1       160.3         43.3
Write     2.4 +/- 2.5           7.0           97.7
Read      3.2 +/- 3.2           3.6           96.9
Ring      5.9 +/- 6.2           10.3          94.4
Compile   146.5 +/- 149.3       209.2         40.6
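For anyone wanting to compare their own interbench runs in the same way, here is a minimal Python sketch that splits a result row into named fields. The field names are invented for this example, and the two sample rows are the X benchmark under Burn load copied from the cfs and muqss116-int1 results above.

```python
def parse_row(line):
    """Split one interbench result row into named fields.

    Rows look like "Burn 94.4 +/- 127.0 334.0 12.9 4.37". The Gaming
    benchmark has no deadlines column, so that field is optional.
    """
    parts = line.split()
    row = {
        "load": parts[0],
        "latency_ms": float(parts[1]),
        "sd_ms": float(parts[3]),  # parts[2] is the literal "+/-"
        "max_latency_ms": float(parts[4]),
        "desired_cpu_pct": float(parts[5]),
    }
    if len(parts) > 6:
        row["deadlines_met_pct"] = float(parts[6])
    return row


# X benchmark, Burn load, from the two result sets above.
cfs = parse_row("Burn 94.4 +/- 127.0 334.0 12.9 4.37")
muqss = parse_row("Burn 47.2 +/- 66.5 142.0 16.7 7.58")

# How many times larger the mainline worst-case latency is for this row.
max_latency_ratio = cfs["max_latency_ms"] / muqss["max_latency_ms"]
```

For this row the mainline max latency is a little over twice that of MuQSS.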
As you can see, in the few cases where mainline comes out ahead, the difference is less than 1%, which is within the margin of noise. MuQSS meets more deadlines, gives the benchmarked task more of its desired CPU and has substantially lower max latencies.
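To put numbers on the heavier loads, here is a small sketch that compares a handful of rows copied from the two result sets above (Video and X under Burn and Compile load). The choice of rows and the helper name `compare` are mine, purely for illustration; only the figures come from the results.

```python
# Rows copied from the interbench results above.
# (benchmark, load) -> (max latency in ms, % desired CPU, % deadlines met)
cfs = {
    ("Video", "Burn"): (46.3, 87.0, 7.62),
    ("Video", "Compile"): (45.9, 89.5, 6.07),
    ("X", "Burn"): (334.0, 12.9, 4.37),
    ("X", "Compile"): (333.0, 12.2, 4.2),
}
muqss = {
    ("Video", "Burn"): (17.7, 100.0, 81.6),
    ("Video", "Compile"): (19.7, 100.0, 37.3),
    ("X", "Burn"): (142.0, 16.7, 7.58),
    ("X", "Compile"): (196.0, 18.6, 8.12),
}


def compare(cfs_rows, muqss_rows):
    """Return MuQSS-minus-CFS deltas for each (benchmark, load) pair."""
    deltas = {}
    for key, (c_max, c_cpu, c_dl) in cfs_rows.items():
        m_max, m_cpu, m_dl = muqss_rows[key]
        deltas[key] = (
            round(m_max - c_max, 2),  # negative means lower max latency
            round(m_cpu - c_cpu, 2),  # positive means more desired CPU
            round(m_dl - c_dl, 2),    # positive means more deadlines met
        )
    return deltas


deltas = compare(cfs, muqss)
for (bench, load), (d_max, d_cpu, d_dl) in deltas.items():
    print(f"{bench:5} under {load:7}: max latency {d_max:+.1f} ms, "
          f"desired CPU {d_cpu:+.1f}%, deadlines met {d_dl:+.2f}%")
```

On these rows every max latency delta is negative and every deadlines-met delta is positive, which is the pattern described above.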
I'm reasonably confident that I've been able to maintain the interactivity people have come to expect from BFS in the transition to MuQSS now and have the data to support it above.
Enjoy!
お楽しみ下さい
-ck
Labels: -ck, 4.8, benchmark, bfs, interactivity, interbench, MuQSS, scheduler
linux-4.8-ck4, MuQSS CPU scheduler v0.116
Yet another bugfix release for MuQSS and the -ck patchset, with one of the most substantial latency fixes yet. Everyone on a previous 4.8 patchset of mine should upgrade. Sorry about the frequency of these releases, but I just can't allow a known buggy release to be the latest version.
4.8-ck4 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck4/
MuQSS by itself for 4.8:
4.8-sched-MuQSS_116.patch
MuQSS by itself for 4.7:
4.7-sched-MuQSS_116.patch
I'm hoping this is the release that allows me to not push any more -ck versions out till 4.9 is released since it addresses all remaining issues that I know about.
A lingering bug that had been troubling me for some time was leading to occasional massive latencies. Thanks to some detective work by Serge Belyshev, I was able to narrow it down to a single-line fix which dramatically improves measured worst case latency. Throughput is virtually unchanged. The bug's flow-on effects were also apparent in other areas, with sometimes unused CPU cycles and weird stalls on some workloads.
Sched_yield was reverted to the old BFS mechanism again, which GPU drivers prefer; it wasn't working previously on MuQSS because of the first bug. The difference is substantial now, and drivers (such as the nvidia proprietary one) and apps that use it a lot (such as the Folding@home client) behave much better.
The late changes that introduced bugs into ck3/muqss115 were reverted.
The results come up quite well now with interbench (my latency under load benchmark) which I have recently updated and should now give sensible values:
https://github.com/ckolivas/interbench
If you're baffled by interbench results, the most important number is %deadlines met which should be as close to 100% as possible followed by max latency which should be as low as possible for each section. In the near future I'll announce an official new release version.
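If you'd rather sort the rows than eyeball them, a minimal sketch along these lines can help. It assumes each result row has the shape shown in the tables (load, mean latency, "+/-", SD, max latency, % desired CPU and, except in the Gaming section, % deadlines met). The names `parse_row` and `worst_first` are hypothetical helpers of mine, not part of interbench.

```python
import re

# One result row looks like: "Burn 17.4 +/- 19.5 46.3 87 7.62"
# The Gaming section has no deadlines column, so that field is optional.
ROW = re.compile(
    r"^(?P<load>\w+)\s+(?P<mean>[\d.]+)\s+\+/-\s+(?P<sd>[\d.]+)\s+"
    r"(?P<max>[\d.]+)\s+(?P<cpu>[\d.]+)(?:\s+(?P<deadlines>[\d.]+))?$"
)


def parse_row(line):
    """Turn one interbench result row into a dict of floats."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"not an interbench result row: {line!r}")
    row = {"load": match["load"]}
    for field in ("mean", "sd", "max", "cpu", "deadlines"):
        value = match[field]
        row[field] = float(value) if value is not None else None
    return row


def worst_first(rows):
    """Sort by lowest % deadlines met, then by highest max latency."""
    def key(row):
        deadlines = row["deadlines"]
        if deadlines is None:  # no deadlines column: rank on max latency only
            deadlines = float("inf")
        return (deadlines, -row["max"])
    return sorted(rows, key=key)


# Three rows from the MuQSS Video section of the results.
lines = [
    "None 0.0 +/- 0.0 0.0 100 100",
    "Burn 3.1 +/- 7.2 17.7 100 81.6",
    "Compile 10.5 +/- 13.3 19.7 100 37.3",
]
rows = [parse_row(line) for line in lines]
for row in worst_first(rows):
    print(row["load"], row["deadlines"], row["max"])
```

The ordering mirrors the priority given above: % deadlines met first, max latency second.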
Pedro in the comments section was previously using runqlat from bcc tools to test latencies as well, but after some investigation it became clear to me that the tool was buggy and did not work properly with bfs/muqss, so I've provided a slightly updated version here which should work properly:
runqlat.py
Enjoy!
お楽しみ下さい
-ck
Saturday, 22 October 2016
linux-4.8-ck3, MuQSS version 0.115
This is mainly a bugfix release for those who had boot failures, TOI patched failures, and warnings. Otherwise it only has minor changes.
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck3/
MuQSS version 0.115 by itself:
4.8-sched-MuQSS_115.patch
Git tree includes branches for MuQSS and -ck:
https://github.com/ckolivas/linux
EDIT: There is a regression in this release as well and you need to either grab the latest 4.8-ck git tree or add the two patches here:
http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/
Sorry, when enough other problems get fixed I'll release another version pretty soon too.
Enjoy!
お楽しみ下さい
-ck
Friday, 21 October 2016
linux-4.8-ck2, MuQSS version 0.114
Announcing an updated version, and the first -ck release with MuQSS as the scheduler, officially retiring BFS from further development in line with the diminished rate of bug reports with MuQSS. It is clear that the little attention BFS had received over the years, apart from rushed synchronisation with mainline, had caused a number of bugs to creep in. MuQSS is basically a rewritten evolution of the same code, so it makes no sense to maintain both.
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck2/
MuQSS version 0.114 by itself:
4.8-sched-MuQSS_114.patch
Git tree includes branches for MuQSS and -ck:
https://github.com/ckolivas/linux
In addition to the most up to date version of MuQSS replacing BFS, this is the first release with BFQ included. It is configurable and is set by default in -ck though it is entirely optional.
The MuQSS changes since 112 are as follows:
- Added cacheline alignment to atomic variables courtesy of Holger Hoffstätte
- Fixed PPC build courtesy of Serge Belyshev.
- Implemented wake lists for separate CPU packages.
- Send hotplug threads to CPUs even if they're not alive yet since they'll be enabling them.
- Build fixes for uniprocessor.
- A substantial revamp of the sub-tick process accounting, decreasing the number of variables used, simplifying the code, and increasing the resolution to nanosecond accounting. Now even tasks that run for less than 100us will not escape visible accounting.
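As a rough illustration of why sub-tick resolution matters, the sketch below compares two accounting schemes for a hypothetical task that wakes shortly after every tick, runs for 90us and sleeps again. Tick sampling charges a whole tick to whatever is on the CPU when the tick fires; exact accounting charges the nanoseconds actually run. The 1ms tick, the burst pattern and all function names are assumptions for illustration, not MuQSS code.

```python
TICK_NS = 1_000_000  # assumed 1 ms scheduler tick


def run_intervals(burst_ns, period_ns, offset_ns, total_ns):
    """Return (start, end) intervals for a task running burst_ns every period_ns."""
    intervals = []
    start = offset_ns
    while start < total_ns:
        intervals.append((start, min(start + burst_ns, total_ns)))
        start += period_ns
    return intervals


def tick_sampled_ns(intervals, total_ns):
    """Charge one full tick each time the task is on the CPU when a tick fires."""
    charged = 0
    for tick in range(TICK_NS, total_ns + 1, TICK_NS):
        if any(start <= tick < end for start, end in intervals):
            charged += TICK_NS
    return charged


def exact_ns(intervals):
    """Charge exactly the nanoseconds the task spent running."""
    return sum(end - start for start, end in intervals)


total = 100 * TICK_NS  # simulate 100 ms
# The task wakes 100us after each tick and runs for 90us,
# so it is never on the CPU at the moment a tick fires.
intervals = run_intervals(90_000, TICK_NS, 100_000, total)
sampled = tick_sampled_ns(intervals, total)
exact = exact_ns(intervals)
print(f"tick-sampled: {sampled / 1e6:.1f} ms, exact: {exact / 1e6:.1f} ms")
```

In this toy model the task uses 9ms of CPU over 100ms yet tick sampling charges it nothing, which is the kind of escape that nanosecond accounting closes.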
This release should bring slightly better performance, more so on multi-cpu machines, and fairer accounting/latency.
Enjoy!
お楽しみ下さい
-ck
Tuesday, 18 October 2016
First MuQSS Throughput Benchmarks
The short version graphical summary:
Red = MuQSS 112 interactive off
Purple = MuQSS 112 interactive on
Blue = CFS
The detail:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/
I went on a journey looking for meaningful benchmarks to conduct to assess the scalability aspect as far as I could on my own 12x machine and was really quite depressed to see what the benchmark situation on linux is like. Only the old and completely invalid benchmarks, like Reaim, aim7, dbench, volanomark, etc., still seem to be hanging around and promoted on public sites, and none of those are useful scalability benchmarks. Even more depressing was that the only ones with any reputation are actually commercial benchmarks costing hundreds of dollars.
This made me wonder out loud just how the heck mainline is even doing scalability improvements if there are precious few valid benchmarks for linux and no one's using them. The most promising ones, like mosbench, need multiple machines and quite a bit of set up to get them going.
I spent a day wading through the phoronix test suite - a site and its suite not normally known for meaningful high performance computing discussion and benchmarks - looking for benchmarks that could be used for meaningful results for multicore scalability assessment and were not too difficult to deploy and came up with the following collection:
John The Ripper - a CPU bound application that is threaded to the number of CPUs and intermittently drops to one thread making for slightly more interesting behaviour than just a fully CPU bound workload.
7-Zip Compression - a valid real world CPU bound application that is threaded but rarely able to spread out to all CPUs making it an interesting light load benchmark.
ebizzy - This emulates a heavy content delivery server load which scales beyond the number of CPUs and emulates what goes on between a http server and database.
Timed Linux Kernel Compilation - A perennial favourite because it is a real world case and very easy to reproduce. Despite numerous complaints about its validity as a benchmark, it is surprisingly consistent in its results and tests many facets of scalability, though it does not scale to use all CPUs at all times either.
C-Ray - A ray tracing benchmark that uses massive threading per CPU and is completely CPU bound but overloads all CPUs.
Primesieve - A prime number generator that is threaded to the number of CPUs exactly, is fully CPU bound and is cache intensive.
PostgreSQL pgbench - A meaningful database benchmark that is done at 3 different levels - single threaded, normal loaded and heavily contended, each testing different aspects of scalability.
And here is a set of results comparing 4.8.2 mainline (labelled CFS), MuQSS 112 in interactive mode (MuQSS-int1) and MuQSS 112 in non-interactive mode (MuQSS-int0):
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/
It's worth noting that there is quite a bit of variance in these benchmarks and some are bordering on the difference being just noise. However there is a clear pattern here - when the load is light, in terms of throughput, CFS outperforms MuQSS. When load is heavy, the heavier it gets, MuQSS outperforms CFS, especially in non-interactive mode. As a friend noted, for the workloads where you wouldn't be running MuQSS in interactive mode, such as a web server, database etc, non-interactive mode is of clear performance benefit. So at least on the hardware I had available to me, on a 12x machine, MuQSS is scaling better than mainline on these workloads as load increases.
The obvious question people will ask is why MuQSS doesn't perform better at light loads, and in fact I have an explanation. The reason is that mainline tends to cling to processes much more so that if it is hovering at low numbers of active processes, they'll all cluster on one CPU or fewer CPUs than being spread out everywhere. This means the CPU benefits more from the turbo modes virtually all newer CPUs have, but it comes at a cost. The latency to tasks is greater because they're competing for CPU time on fewer busy CPUs rather than spreading out to idle cores or threads. It is a design decision in MuQSS, as taken from BFS, to always spread out to any idle CPUs if they're available, to minimise latency, and that's one of the reasons for the interactivity and responsiveness of MuQSS. Of course I am still investigating ways of closing that gap further.
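The trade-off can be sketched with a toy placement model. `place_spread` sends each waking task to an idle CPU whenever one exists, approximating the design decision described above, while `place_cluster` simply leaves the task on its previous CPU as a deliberately crude stand-in for a scheduler that clings to processes. Neither function is real scheduler code, and the model (a per-CPU count of queued tasks) is an assumption made only for illustration.

```python
def place_spread(queued, prev_cpu):
    """Prefer any idle CPU (lowest index first); otherwise the least loaded one."""
    for cpu, count in enumerate(queued):
        if count == 0:
            return cpu
    return min(range(len(queued)), key=lambda cpu: queued[cpu])


def place_cluster(queued, prev_cpu):
    """Keep the task where it last ran, regardless of idle CPUs elsewhere."""
    return prev_cpu


def simulate(place, n_cpus, wakeups):
    """Place each waking task (given by its previous CPU) and return queue depths."""
    queued = [0] * n_cpus
    for prev_cpu in wakeups:
        queued[place(queued, prev_cpu)] += 1
    return queued


# Four tasks that all last ran on CPU 0 wake up on a 4 CPU machine.
wakeups = [0, 0, 0, 0]
spread = simulate(place_spread, 4, wakeups)
cluster = simulate(place_cluster, 4, wakeups)

for name, queued in (("spread", spread), ("cluster", cluster)):
    busy = sum(1 for count in queued if count)
    print(f"{name:7}: queues={queued}, busy CPUs={busy}, deepest queue={max(queued)}")
```

Clustering keeps fewer CPUs busy, which is where the turbo benefit comes from, but the deepest queue grows, so tasks wait behind each other. Spreading keeps every queue shallow at the cost of waking more CPUs.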
Hopefully I can get some more benchmarks from someone with even bigger hardware, and preferably with more than one physical package since that's when things really start getting interesting. All in all I'm very pleased with the performance of MuQSS in terms of scalability on these results, especially assuming I'm able to maintain the interactivity of BFS; those were my dual goals.
There is MUCH more to benchmarking than pure throughput of CPU - which is almost the only thing these benchmarks are checking - but that's what I'm interested in here. I hope that providing my list of easy to use benchmarks and the reasoning behind them can generate interest in some kind of meaningful standard set of benchmarks. I did start out in kernel development originally after writing and being a benchmarker :P
To aid that, I'll give simple instructions here for how to roughly imitate the benchmarks and get results like I've produced above.
Download the phoronix test suite from here:
http://www.phoronix-test-suite.com/
The generic tar.gz is perfectly fine. Then extract it and install the relevant benchmarks like so:
tar xf phoronix-test-suite-6.6.1.tar.gz
cd phoronix-test-suite
./phoronix-test-suite install build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve
./phoronix-test-suite default-run build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve
Now obviously this is not ideal since you shouldn't run benchmarks on a multiuser login with Xorg and all sorts of other crap running so I actually always run benchmarks at init level 1.
Enjoy!
お楽しみ下さい
-ck
Labels: benchmark, bfs, interactivity, kernel, latency, linux, MuQSS, scalability, scheduler
Monday, 17 October 2016
MuQSS - The Multiple Queue Skiplist Scheduler v0.112
Here's an updated version of MuQSS.
For 4.8.*:
4.8-sched-MuQSS_112.patch
For 4.7.*:
4.7-sched-MuQSS_112.patch
Git tree here as 4.7-muqss or 4.8-muqss branches:
https://github.com/ckolivas/linux
It's getting close now to the point where it can replace BFS in -ck releases. Thanks to the many people testing and reporting back, some other misbehaviours were discovered and their associated fixes have been committed.
In particular,
- Balancing across CPUs was not looking at higher and lower scheduling policies correctly (SCHED_ISO, SCHED_IDLEPRIO and realtime policies)
- A serious stall/hang could happen with tasks using sched_yield (such as f@h client and numerous GPU drivers)
- Some minor accounting issues on new tasks with affinity set were fixed
- Overhead was further decreased on task selection
- Spurious preemption on CPUs where the preempted task had already gone is now avoided
- Spurious wakeups on CPUs that were assumed to be idle but no longer are, are now avoided
- A potential race in suspending to ram was fixed
- Old unused code from BFS was removed, along with unnecessary intermediate variables.
- Clean ups
- Some work towards actually documenting MuQSS in Documentation/scheduler/sched-MuQSS.txt was done, though incomplete.
Enjoy!
お楽しみ下さい
-ck
Tuesday, 11 October 2016
MuQSS - The Multiple Queue Skiplist Scheduler v0.111
Lots of bugfixes, lots of improvements, build fixes, you name it.
For 4.8:
4.8-sched-MuQSS_111.patch
For 4.7:
4.7-sched-MuQSS_111.patch
And in a complete departure from BFS, a git tree (which suits constant development like this, unlike BFS's stable release massive ports):
https://github.com/ckolivas/linux
Look in the pending/ directory to see all the patches that went into this or read the git changelog. In particular numerous warnings were fixed, throughput improved compared to 108, SCHED_ISO was rewritten for multiple queues, potential races/crashes were addressed, and build fixes for different configurations were committed.
I haven't been able to track down the bizarre latency issues reported by runqlat, and when I try to reproduce them myself I get nonsense latency values greater than the history of the earth, so I suspect an interface bug with BPF reporting values. It doesn't seem to affect actual latency in any way.
EDIT: Updated to version 0.111 which has a fix for suspend/resume.
Enjoy!
お楽しみ下さい
-ck
Friday, 7 October 2016
MuQSS - The Multiple Queue Skiplist Scheduler v0.108
A new version of the MuQSS CPU scheduler
Incrementals and full patches available for 4.8 and 4.7 respectively here:
http://ck.kolivas.org/patches/muqss/4.0/4.8/
http://ck.kolivas.org/patches/muqss/4.0/4.7/
Yet more minor bugfixes and some important performance enhancements.
This version brings to the table the same locking scheme for trying to wake tasks up as mainline which is advantageous on process busy workloads and many CPUs. This is important because the main reason for moving to multiple runqueues was to minimise lock contention for the global runqueue lock that is in BFS (as mentioned here numerous times before) and this wake up scheme helps make the most of the multiple discrete runqueue locks.
Note this change is much more significant than the last releases so new instability is a possibility. Please report any problems or stacktraces!
There was a workload when I started out that I used lockstat to debug to get an idea of how much lock contention was going on and how long it lasted. Originally with the first incarnations of MuQSS on a 14 second benchmark with thousands of tasks on a 12x CPU it obtained 3 million locks and had almost 300k contentions with the longest contention lasting 80us. Now the same workload grabs the lock just 5k times with only 18 contentions in total and the longest lasted 1us.
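For a sense of scale, the lockstat figures above work out as follows. This is only arithmetic on the rounded numbers quoted in the paragraph ("3 million", "almost 300k", "5k"), so treat the results as approximate. The function name is mine.

```python
def contention_rate(acquisitions, contentions):
    """Fraction of lock acquisitions that had to wait for the lock."""
    return contentions / acquisitions


# First incarnations of MuQSS: ~3 million acquisitions, almost 300k contentions.
early = contention_rate(3_000_000, 300_000)
# Current version on the same workload: ~5k acquisitions, 18 contentions.
now = contention_rate(5_000, 18)

print(f"early: {early:.2%} of acquisitions contended, longest wait 80us")
print(f"now:   {now:.2%} of acquisitions contended, longest wait 1us")
print(f"acquisitions reduced by a factor of {3_000_000 // 5_000}")
```

So the lock is not only contended far less often when taken, it is also taken hundreds of times less often in the first place.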
This clearly demonstrates that the target endpoint for avoiding lock contention has been achieved. It does not translate into performance improvements on ordinary hardware today because you need ridiculous workloads on many CPUs to even begin deriving advantage from it. However, as even our phones have now reached 8 logical CPUs, it will only be a matter of time before 16 threads appear on commodity hardware. That was a complaint directed at BFS when it came out 7 years ago, though such machines still haven't appeared just yet. BFS was shown to be scalable for all workloads up to 16 CPUs, and beyond for certain workloads, but suffered dramatically for others. MuQSS now makes it possible for what was BFS to be useful much further into the future.
Again - MuQSS is aimed primarily at desktop/laptop/mobile device users for the best possible interactivity and responsiveness, and is still very simple in its approach to balancing workloads to CPUs so there are likely to be throughput workloads on mainline that outperform it, though there are almost certainly workloads where the opposite is true.
I've now addressed all planned changes to MuQSS and plan to hopefully only look at bug reports instead of further development from here on for a little while. In my eyes it is now stable enough to replace BFS in the next -ck release barring some unexpected showstopper bug appearing.
EDIT: If you blinked you missed the 107 announcement which was shortly superseded by 108.
EDIT2: Always watch the pending directory for updated pending patches to add.
http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/
Enjoy!
お楽しみ下さい
-ck
Labels: 4.8, bfs, interactivity, kernel, latency, linux, MuQSS, scalability, scheduler
Wednesday, 5 October 2016
MuQSS - The Multiple Queue Skiplist Scheduler v0.106
Another day and time for yet another release.
There are 0.106 versions and incrementals available for linux-4.7:
http://ck.kolivas.org/patches/muqss/4.0/4.7/
and linux-4.8:
http://ck.kolivas.org/patches/muqss/4.0/4.8
Two large remaining races that could lead to warnings, stalls, or in the worst case, crashes, have been fixed in this version.
Additionally, the multiple-runqueue locking has been significantly optimised to take only the runqueue locks needed, hold them only for as long as they're needed, and drop them as soon as possible, which should bring lock contention levels down even further. This is a performance enhancement, more so in non-interactive mode, though it will only start being demonstrable if you're lucky enough to have many CPUs.
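The general idea of locking only what is needed can be sketched in userspace. The example below is not MuQSS code: the kernel uses spinlocks in C, whereas this uses Python's `threading.Lock`, and `RunQueue`, `lock_pair` and `migrate` are hypothetical names. It shows one common way to hold just the two runqueue locks involved in moving a task, always taking them in CPU order so that two CPUs migrating in opposite directions cannot deadlock, and releasing both as soon as the move is done.

```python
import threading
from contextlib import contextmanager


class RunQueue:
    """Toy per-CPU runqueue: a lock plus a list of queued task names."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.tasks = []


@contextmanager
def lock_pair(rq_a, rq_b):
    """Hold only the locks of the two runqueues involved, taken in CPU order."""
    if rq_a is rq_b:
        with rq_a.lock:
            yield
        return
    first, second = sorted((rq_a, rq_b), key=lambda rq: rq.cpu)
    with first.lock:
        with second.lock:
            yield


def migrate(task, src, dst):
    """Move a task between runqueues; no other runqueue lock is touched."""
    with lock_pair(src, dst):
        src.tasks.remove(task)
        dst.tasks.append(task)
    # Both locks have been dropped by the time we get here.


rqs = [RunQueue(cpu) for cpu in range(4)]
rqs[0].tasks.append("task-a")
migrate("task-a", rqs[0], rqs[3])
print([rq.tasks for rq in rqs])
```

The runqueues for CPUs 1 and 2 are never locked during the move, so work on those CPUs proceeds without contention.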
This version addresses all the known bugs and warnings I've received to date so hopefully I can have a little rest and let people out there actually give it a go. What should you expect if you use this instead of BFS? If I've done this correctly, you will notice absolutely no difference since the idea was to preserve the interactivity and responsiveness of BFS and make it scalable to more CPUs than most people can afford.
Keep the feedback coming, thanks.
Enjoy!
お楽しみ下さい
-ck
