Saturday, 5 November 2016

linux-4.8-ck6, MuQSS version 0.135

Announcing a new version of MuQSS and a -ck release

4.8-ck6 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck6/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_135.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_135.patch

Git tree:
https://github.com/ckolivas/linux


A week has passed since the last major update to MuQSS and -ck was posted, allowing me to concentrate on receiving and responding to any bug reports. As it turns out, there were very few apart from the recurring local_softirq_pending warnings/stalls. This is nice because it means MuQSS is mostly stable now. Mainline has even had more "stable" releases in the same time as MuQSS for 4.8, moving to 4.8.6 in the interim.


In this version I've added aggressive handling of pending softirqs in the hope that the warnings and stalls all go away. The true reason the handling of softirqs is being dropped still escapes me, but it is likely related to the fact that MuQSS does a lot of lockless rescheduling across CPUs to decrease overhead, and this does not give the guarantees that locking would.

Additionally, I've added a number of APIs to the kernel for schedule timeouts specified in milliseconds. These use the highres timers, which are now mandatory for MuQSS. The reason for doing this is that many timeouts in the kernel specify values below 10ms, while the timer resolution at 100Hz only guarantees timeouts under 20ms.

I've also done a code sweep across the entire kernel looking for timeout calls under 50ms and switched them to the new interface. Additionally, there are numerous places in the kernel where schedule_timeout(1) is used and a "minimum timeout" is expected, yet this is entirely Hz dependent, again being up to 20ms in duration. I've replaced all of these with a 1ms timeout, emulating what would happen on a 1000Hz kernel, but without the overhead of running the higher Hz kernel. I'm not entirely sure this will equate to any real world improvements, but the fact it's used in things like audio drivers worries me that it might.
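The "up to 20ms" figure can be reproduced with a little arithmetic. The following is only a rough sketch under a simplified model, not the kernel's actual code: it assumes a tick-based timeout is rounded up to whole jiffies and can lose up to one further jiffy, because the timer may be armed just after a tick. The function name is made up for illustration.

```python
import math

def worst_case_tick_timeout_ms(requested_ms, hz):
    """Upper bound, in ms, for a jiffy-based timeout under a simplified model."""
    jiffy_ms = 1000.0 / hz
    # Round the request up to whole jiffies, never fewer than one.
    jiffies = max(1, math.ceil(requested_ms / jiffy_ms))
    # Assumption: allow one extra jiffy, since a timer armed just after a
    # tick cannot count the remainder of the current jiffy.
    return (jiffies + 1) * jiffy_ms

for hz in (100, 250, 1000):
    print("%4d Hz: 1ms request may take up to %.1f ms"
          % (hz, worst_case_tick_timeout_ms(1, hz)))
```

Under this model a 1ms request is bounded by 20ms at 100Hz but only 2ms at 1000Hz, which is the gap the highres timeouts are meant to close without raising Hz.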

Finally, I've replaced the standard msleep call from userspace to use highres timers. Some userspace applications may expect msleep to give a sleep that actually resembles what was asked of it instead of something Hz limited, and that assumption on the userspace coders' part could be leading to slowdowns in userspace. Calls to msleep() from userspace now give 100us accuracy at 100Hz instead of 20ms.
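If you want to see how closely short sleeps are honoured on your own machine, something like the sketch below may help. Note the assumptions: it times Python's time.sleep(), which goes through whatever sleep primitive the interpreter uses rather than msleep() itself, so it is only an indirect observation and not a test of the patch. The function name and sample count are arbitrary.

```python
import time

def measure_sleep_overshoot(requested_s=0.001, samples=20):
    """Return (actual - requested) sleep durations, in seconds."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(requested_s)
        elapsed = time.perf_counter() - start
        overshoots.append(elapsed - requested_s)
    return overshoots

if __name__ == "__main__":
    results = measure_sleep_overshoot()
    print("worst overshoot: %.3f ms" % (max(results) * 1000))
    print("mean overshoot:  %.3f ms" % (sum(results) / len(results) * 1000))
```

Run it on an otherwise idle machine; background load inflates the overshoot regardless of timer resolution.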


All these timing changes add overhead since they're trying to emulate the timing accuracy of running at 1000Hz but in a latency-focused scheduler I believe they're appropriate, and they do not incur the overhead that actually changing Hz would incur. Additionally they add accuracy to timers and timeouts that 1000Hz does not afford.

In the -ck tarball of broken-out patches, I've kept these timer changes separate to allow the muqss scheduler to be applied by itself should they prove problematic, and they will make merging with future kernels easier.


Enjoy!
お楽しみください
-ck

Saturday, 29 October 2016

linux-4.8-ck5, MuQSS version 0.120

Announcing a new version of MuQSS and a -ck release to go with it in concert with mainline releasing 4.8.5



4.8-ck5 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck5/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_120.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_120.patch


Git tree:
https://github.com/ckolivas/linux



This is a fairly substantial update to MuQSS which includes bugfixes for the previous version, performance enhancements, new features, and completed documentation. This will likely be the first publicly announced version on LKML.

EDIT: Announce here: LKML

New features:
- MuQSS is now a tickless scheduler. That means it can maintain its guaranteed low latency even in a build configured with a low Hz tick rate. To that end, it now defaults to 100Hz, which is the recommended choice since it also leads to more throughput and power savings.
- Improved performance for single threaded workloads with CPU frequency scaling.
- Full NoHZ now supported. This disables ticks on busy CPUs instead of just idle ones. Unlike mainline, MuQSS can do this virtually all the time, regardless of how many tasks are currently running. However this option is for very specific use cases (compute servers running specific workloads) and not for regular desktops or servers.
- Numerous other configuration options that were previously disabled from mainline are now allowed again (though not recommended for regular users.)
- Completed documentation can now be found in Documentation/scheduler/sched-MuQSS.txt

Bugfixes:
- Fix for the various stalls some people were still experiencing, along with the softirq pending warnings.
- Fix for some loss of CPU for heavily sched_yielding tasks.
- Fix for the BFQ warning (-ck only)

Enjoy!
お楽しみ下さい
-ck

Monday, 24 October 2016

Interbench benchmarks for MuQSS 116

As mentioned in my previous post, I recently upgraded interbench which is a benchmark application I invented/wrote to assess perceptible latency in the setting of various loads. The updates were to make the results meaningful on today's larger ram/multicore machines where the load scales accordingly.

The results for mainline 4.8.4 and 4.8.4-ck4 on a multithreaded hexcore (init 1) can be found here:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161024/
and are copied below. I do not have swap on this machine so the "memload" was not performed. This is a 3.6GHz hexcore with 64GB ram and fast Intel SSDs so to show any difference on this is nice. To make it easier, I've highlighted it in colours similar to the throughput benchmarks I posted previously. Blue means within 1% of each other, red means significantly worse and green significantly better.


Load set to 12 processors

Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4 at datestamp 201610242116
Comment: cfs

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.1 +/- 0.1        0.1           100         100
Video      0.0 +/- 0.0        0.1           100         100
X          0.1 +/- 0.1        0.1           100         100
Burn       0.0 +/- 0.0        0.0           100         100
Write      0.1 +/- 0.1        0.1           100         100
Read       0.1 +/- 0.1        0.1           100         100
Ring       0.0 +/- 0.0        0.1           100         100
Compile    0.0 +/- 0.0        0.0           100         100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.1 +/- 0.1        0.1           100         100
X          0.1 +/- 0.1        0.1           100         100
Burn      17.4 +/- 19.5      46.3            87        7.62
Write      0.1 +/- 0.1        0.1           100         100
Read       0.1 +/- 0.1        0.1           100         100
Ring       0.0 +/- 0.0        0.0           100         100
Compile   17.4 +/- 19.1      45.9          89.5        6.07

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.1        1.0           100        99.3
Video     13.4 +/- 25.8      68.0          36.2        27.3
Burn      94.4 +/- 127.0    334.0          12.9        4.37
Write      0.1 +/- 0.4        4.0          97.4        96.4
Read       0.1 +/- 0.7        4.0          96.2        93.8
Ring       0.5 +/- 1.9        9.0          89.3        84.9
Compile   93.3 +/- 127.7    333.0          12.2         4.2

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU
None       0.0 +/- 0.2        2.2           100
Video      7.9 +/- 21.4      69.3          92.7
X          1.4 +/- 1.6        2.7          98.7
Burn     136.5 +/- 145.3    360.8          42.3
Write      1.8 +/- 2.0        4.4          98.2
Read      11.2 +/- 20.3      47.8          89.9
Ring       8.1 +/- 8.1        8.2          92.5
Compile  152.3 +/- 166.8    346.1          39.6

Load set to 12 processors

Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4-ck4+ at datestamp 201610242047
Comment: muqss116-int1

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.0        0.0           100         100
Video      0.0 +/- 0.0        0.0           100         100
X          0.0 +/- 0.0        0.0           100         100
Burn       0.0 +/- 0.0        0.0           100         100
Write      0.0 +/- 0.0        0.1           100         100
Read       0.0 +/- 0.0        0.0           100         100
Ring       0.0 +/- 0.0        0.0           100         100
Compile    0.0 +/- 0.1        0.8           100         100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.0        0.0           100         100
X          0.0 +/- 0.0        0.0           100         100
Burn       3.1 +/- 7.2       17.7           100        81.6
Write      0.0 +/- 0.0        0.5           100         100
Read       0.0 +/- 0.0        0.0           100         100
Ring       0.0 +/- 0.0        0.0           100         100
Compile   10.5 +/- 13.3      19.7           100        37.3

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU  % Deadlines Met
None       0.0 +/- 0.1        1.0           100        99.3
Video      3.7 +/- 12.1      56.0            89        82.6
Burn      47.2 +/- 66.5     142.0          16.7        7.58
Write      0.1 +/- 0.5        5.0          97.7        95.7
Read       0.1 +/- 0.7        4.0          95.6        93.5
Ring       0.5 +/- 1.9       12.0          89.8          86
Compile   55.9 +/- 77.6     196.0          18.6        8.12

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms)  Max Latency   % Desired CPU
None       0.0 +/- 0.1        0.5           100
Video      1.2 +/- 1.2        1.8          98.8
X          1.4 +/- 1.6        2.9          98.7
Burn     130.9 +/- 132.1    160.3          43.3
Write      2.4 +/- 2.5        7.0          97.7
Read       3.2 +/- 3.2        3.6          96.9
Ring       5.9 +/- 6.2       10.3          94.4
Compile  146.5 +/- 149.3    209.2          40.6

As you can see, in the only cases where mainline is better, the difference between them is less than 1%, which is within the margin of noise. MuQSS meets more deadlines, gives the benchmarked task more of its desired CPU and has substantially lower max latencies.

I'm reasonably confident that I've been able to maintain the interactivity people have come to expect from BFS in the transition to MuQSS now and have the data to support it above.

Enjoy!
お楽しみ下さい
-ck

linux-4.8-ck4, MuQSS CPU scheduler v0.116

Yet another bugfix release for MuQSS and the -ck patchset with one of the most substantial latency fixes yet. Everyone should upgrade if they're on a previous 4.8 patchset of mine. Sorry about the frequency of these releases but I just can't allow a known buggy release to be the latest version.

4.8-ck4 patchset:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck4/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_116.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_116.patch

I'm hoping this is the release that allows me to not push any more -ck versions out till 4.9 is released since it addresses all remaining issues that I know about.

A lingering bug that has been troubling me for some time was leading to occasional massive latencies and thanks to some detective work by Serge Belyshev I was able to narrow it down to a single line fix which dramatically improves worst case latency when measured. Throughput is virtually unchanged. The flow-on effect to other areas was also apparent with sometimes unused CPU cycles and weird stalls on some workloads.

Sched_yield was reverted to the old BFS mechanism again, which GPU drivers prefer, but it wasn't working previously on MuQSS because of the first bug. The difference is substantial now, and drivers (such as nvidia proprietary) and apps that use it a lot (such as the Folding@home client) behave much better.

The bugs that were introduced late into ck3/muqss115 were reverted.

The results come up quite well now with interbench (my latency under load benchmark) which I have recently updated and should now give sensible values:

https://github.com/ckolivas/interbench

If you're baffled by interbench results, the most important number is %deadlines met which should be as close to 100% as possible followed by max latency which should be as low as possible for each section. In the near future I'll announce an official new release version.
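As a rough illustration of that reading order, here is a small sketch that parses interbench result rows and picks the better of two using % deadlines met first and max latency second. The column layout is inferred from the published output and the helper names are my own invention, not part of interbench. The two sample rows are the Video-under-Burn lines from the 4.8.4 and 4.8.4-ck4 results.

```python
def parse_interbench_row(line):
    """Parse one interbench result row into a dict."""
    fields = line.split()
    # Layout: Load  Latency  "+/-"  SD  MaxLatency  %DesiredCPU  [%DeadlinesMet]
    row = {
        "load": fields[0],
        "latency_ms": float(fields[1]),
        "sd_ms": float(fields[3]),
        "max_latency_ms": float(fields[4]),
        "desired_cpu_pct": float(fields[5]),
    }
    # The Gaming section has no deadlines column.
    row["deadlines_met_pct"] = float(fields[6]) if len(fields) > 6 else None
    return row

def better_row(a, b):
    """Prefer more deadlines met, then lower max latency."""
    def key(r):
        met = r["deadlines_met_pct"]
        return (-(met if met is not None else 0.0), r["max_latency_ms"])
    return a if key(a) <= key(b) else b

# Video under Burn load, copied from the published results.
cfs = parse_interbench_row("Burn      17.4 +/- 19.5      46.3            87        7.62")
muqss = parse_interbench_row("Burn       3.1 +/- 7.2       17.7           100        81.6")
winner = better_row(cfs, muqss)
print("better row: %(deadlines_met_pct)s%% deadlines met, max %(max_latency_ms)s ms" % winner)
```

For sections without a deadlines column the comparison falls back to max latency alone.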

Pedro in the comments section previously was using runqlat from bcc tools to test latencies as well, but after some investigation it became clear to me that the tool was buggy and did not work properly with BFS/MuQSS, so I've provided a slightly updated version here which should work properly:

runqlat.py

Enjoy!
お楽しみ下さい
-ck