Here's a new release to go along with and commemorate the 4.8.10 stable release (they're putting out stable kernels faster than my development code now.)
linux-4.8-ck8 patch:
patch-4.8-ck8.lrz
MuQSS by itself:
4.8-sched-MuQSS_144.patch
There are a small number of updates to MuQSS itself.
Notably there's an improvement in interactive mode when SMT nice is enabled and/or realtime tasks are running, or there are users of CPU affinity. Previously, when tasks were stuck behind one of those as the highest priority task on a CPU, the scheduler would transiently refuse to schedule them there.
The old hacks for CPU frequency changes from BFS have been removed, leaving the tunables to default as per mainline.
The default of 100Hz has been removed; in its place a new and recommended 128Hz has been implemented. This is just a silly microoptimisation to take advantage of the fact that dividing by 128 uses fast shifts on CPUs, unlike dividing by 100, and it is close enough to 100Hz to otherwise behave the same.
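For the curious, the shift trick is easy to verify: dividing by any power of two is just a right shift, so a 128Hz tick divides with a single shift instruction. A minimal shell illustration of the same arithmetic:

```shell
# Dividing by 128 is identical to shifting right by 7 bits; dividing
# by 100 has no such shortcut and needs a genuine division.
ns=1280000
echo "$(( ns / 128 ))"   # -> 10000
echo "$(( ns >> 7 ))"    # -> 10000, same result, but a single shift
```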
For the -ck patch only, I've reinstated updated and improved versions of the high resolution timeouts, which improve the behaviour of userspace that is inappropriately Hz dependent, allowing low Hz choices to not affect latency.
Additionally by request I've added a couple of tunables to adjust the behaviour of the high res timers and timeouts.
/proc/sys/kernel/hrtimer_granularity_us
and
/proc/sys/kernel/hrtimeout_min_us
Both of these are in microseconds and can be set from 1-10,000. The first is how accurate high res timers will be in the kernel and is set to 100us by default (on mainline it is Hz accuracy).
The second is the smallest duration a generic "minimum timeout" request in kernel code can be. It is set to 1000us by default (on mainline it is one tick).
I doubt you'll find anything useful by tuning these but feel free to go nuts. Decreasing the second tunable much further risks breaking some driver behaviour.
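As a sketch of how you'd poke at them (the helper name here is mine, and these files only exist on kernels carrying the -ck hrtimeout patches), the tunables behave like any other sysctl under /proc:

```shell
# Print a -ck timer tunable if this kernel has it. The helper name is
# illustrative, not part of the patch.
show_tunable() {
    f="/proc/sys/kernel/$1"
    if [ -r "$f" ]; then
        echo "$1 = $(cat "$f") us"
    else
        echo "$1: not present on this kernel"
    fi
}

show_tunable hrtimer_granularity_us
show_tunable hrtimeout_min_us

# As root, values from 1-10000 can be written, e.g.:
#   echo 50 > /proc/sys/kernel/hrtimer_granularity_us
```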
Enjoy!
お楽しみ下さい
-ck
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Showing posts with label real-time. Show all posts
Tuesday, 22 November 2016
Monday, 17 October 2016
MuQSS - The Multiple Queue Skiplist Scheduler v0.112
Here's an updated version of MuQSS.
For 4.8.*:
4.8-sched-MuQSS_112.patch
For 4.7.*:
4.7-sched-MuQSS_112.patch
Git tree here as 4.7-muqss or 4.8-muqss branches:
https://github.com/ckolivas/linux
It's getting close now to the point where it can replace BFS in -ck releases. Thanks to the many people testing and reporting back, some other misbehaviours were discovered and their associated fixes have been committed.
In particular,
- Balancing across CPUs was not looking at higher and lower scheduling policies correctly (SCHED_ISO, SCHED_IDLEPRIO and realtime policies)
- A serious stall/hang could happen with tasks using sched_yield (such as f@h client and numerous GPU drivers)
- Some minor accounting issues on new tasks with affinity set were fixed
- Overhead was further decreased on task selection
- Spurious preemption on CPUs where the preempted task has already gone is now avoided
- Spurious wakeups of CPUs that were assumed idle but no longer are, are avoided
- A potential race in suspending to ram was fixed
- Old unused code from BFS was removed, along with unnecessary intermediate variables.
- Clean ups
- Some work towards actually documenting MuQSS in Documentation/scheduler/sched-MuQSS.txt was done, though it remains incomplete.
Enjoy!
お楽しみ下さい
-ck
Friday, 1 August 2014
SMT/Hyperthreading, nice and scheduling policies
The concept of symmetric multi-threading, which Intel called "Hyperthreading" and first introduced into their commodity CPUs around 2001, is not remotely a new one and goes back a long way before Intel brought it to the mainstream market. I suspect Intel's introduction of it back then was a way of easing the market into the concept of increasing threads and cores, given the imminent walls of CPU heat and power requirements that would soon stop the pursuit of higher and higher single CPU frequencies. The idea is that, since a lot of the CPU sits unused even when something is running as fast as it can on part of it, with a bit of extra logic and architecture you could throw another "virtual core" at some of the unused execution units and behave like 2 (or more) CPUs, putting more of the CPU to good use. These days the vast majority of CPUs sold by Intel have hyperthreading on them, thus doubling the virtual or "logical" cores the CPU has, including even their low power Atom offerings.
There have been numerous benchmarks, in-field tests, workloads etc., where people have tried to find whether hyperthreading is better or not. With a bit of knowledge of the workings of hyperthreading, it's pretty easy to know what the answer is, and not surprisingly, it's the frustrating answer of "it depends". And that's the most accurate answer by far, but I'd go further than that and say that if you have any kind of mixed workload, hyperthreading is always going to be better, whereas if you have precisely one workload, then you have to define exactly how it's going to work and whether hyperthreading will be better or not. Which means that in my opinion at least, hyperthreading is advantageous on a desktop, laptop, tablet and even phone, since by design they're nothing but mixed workloads. I won't spend much longer on this discussion, but suffice to say that I think about 4 threads (at the moment) is about optimal for most real world desktop(y) workloads.
Imagine for a moment you have a single core CPU which you can run as is, or enable hyperthreading to run as a 2 thread CPU. If you were to run your CPU in single core only mode, then when you run one task at a time it will always use the full power of the CPU, but if you run two tasks, each task runs at 50% the speed and completes in double the time. If you enable hyperthreading, then if you have two mixed workloads that actually use different parts of the CPU, you can actually get effectively (at best) about 140% of the performance of running the CPU in single core mode. This means that instead of the two tasks running at 50% speed when run concurrently, they run at 70% speed. In practice, the actual performance benefit is rarely 40% but it is often on the order of 25%, so each task tends to run about 60% speed instead of 50% speed. Still a nice speedup for "free".
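The arithmetic above is worth spelling out (my numbers, following the estimates in the text): the combined throughput of the core, split evenly across two equal tasks, gives each task's effective speed.

```shell
# Per-task effective speed when two equal tasks share one core:
# no HT = 100% combined throughput, best-case HT ~140%, typical ~125%.
for combined in 100 140 125; do
    echo "combined ${combined}% -> each task runs at ~$(( combined / 2 ))% speed"
done
```

So best-case hyperthreading lifts each task from 50% to 70% of solo speed, and the more typical 25% gain lifts them to roughly 60%.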
One thing has always troubled me about hyperthreading, though, and that is the way it tends to break priority support in the scheduler. By priority support, I refer to the use of 'nice' and other scheduling policies, such as realtime, sched idleprio etc.
If you have a single core CPU and run a nice 0 task concurrently with a nice +19 task, the nice 0 task will get about 98% of the CPU time and the nice +19 task only about 2%. The scheduler does this by serialising and metering out the time each task gets to spend on the CPU. Now if you enable hyperthreading on that CPU, the scheduler no longer serialises access to the CPU, but gives each of those tasks one logical "core" on the CPU, and you get an overall 25% increase in throughput. However of the total throughput, both the nice 0 and nice +19 task get precisely half. This would be fine if we had two real cores, but they're not, and the performance of both tasks is sacrificed to ~60% to achieve this. Which means that for this contrived but simple example, enabling hyperthreading slows down the overall execution speed of your nice 0 task when you run a nice +19 task much more than on a single core - it runs at 60% speed instead of 98%.
An even more dramatic example is what happens with realtime tasks, which these days most audio backends on linux use (usually through pulseaudio). Running a realtime task concurrently with a SCHED_NORMAL nice 0 task on a single core means the realtime task will get 100% CPU and the nice 0 task will get zero CPU time. Enable hyperthreading and suddenly the realtime task only runs at 60% of its normal speed even with a heavily niced +19 task running in the background.
Enter SMT-nice as I call it. This is not a new idea, and in fact my first iteration of it was for mainline 10(!) years ago. See here: SMT Nice 2.6.4-rc1-mm1
I actually had the patch removed from mainline myself after criticism over throughput, though I still argue that worrying about the last percentage points of throughput is not relevant if you break a mechanism as valuable as nice and scheduling policies; I had simply lost the energy for defending it, which is why I pushed for its removal. Note that although overall throughput may be slightly decreased, the throughput of higher priority tasks is not only fairer with respect to low priority tasks, but enhanced, because the low priority tasks will have fewer cache thrashing effects.
What this does is examine all hyperthread "siblings" to see what is running on them, and then decide whether the currently running or next running task should actually have access to the sibling, or whether the sibling should go idle completely, allowing a higher priority task to have the actual true core and all its execution units to itself. I'd been meaning to create an equivalent patch for BFS for the longest time, but CPUs got faster and cheaper with more cores, I got lazy, etc... though I recently found more enthusiasm for hacking.
So here is a reincarnation of the SMT-nice concept for BFS, improved to work across multiple scheduling policies from realtime, iso down to idleprio, and I've made it a compile time option in case people feel they don't wish to sacrifice any throughput:
Patch for BFS449 with pending patches:
bfs449-smtnice-2.patch
And to make life easy, here's an all inclusive ck1+pending+smtnice patch:
3.15-ck1-smtnice2.patch
The TL;DR is: On Intel hyperthreaded CPUs, 'nice', realtime and sched idleprio work better, and background tasks interfere much less with foreground tasks. Note: This patch does nothing if you don't have a hyperthreaded CPU.
If you wish to do testing to see how this works, try running with and without the patch and running two benchmarks concurrently, one at nice 0 and one at nice +19 (such as 'make -j2' on one kernel and 'nice -19 make -j2' on another kernel on a machine with 2 cores/4 threads) and compare times. Or run some jackd benchmarks of your choice to see what it takes to get xruns etc.
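A self-contained version of that test might look like the following, with a portable busy loop standing in for the kernel builds (substitute your real 'make -j2' runs; the loop count is arbitrary):

```shell
# Run the same CPU-bound job at nice 0 and nice +19 concurrently and
# compare wall-clock times. With SMT-nice, the nice 0 job should finish
# much closer to its solo time on a hyperthreaded machine.
busy='i=0; while [ "$i" -lt 200000 ]; do i=$((i+1)); done; echo done'
time sh -c "$busy" &
time nice -n 19 sh -c "$busy" &
wait
```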
This patch will almost certainly make its way into the next BFS in some form.
EDIT: It seems people have missed the point of this patch. It improves the performance of foreground applications at the expense of background ones. So your desktop/gui/applications will remain fast even if you run folding@home, mprime, seti@home etc., but those background tasks will slow down more. If you don't want it doing that, disable it in your build config.
---
Enjoy!
お楽しみ下さい
-ck
Friday, 1 October 2010
BFS in real time
A question I often get is "how does BFS compare to the -rt patchset?" and I also get asked if BFS is compatible with the -rt patchset.
The second question is easy to answer: No. The code that makes up the -rt patchset carves heavily into the core of the CPU scheduler and to make it compatible with BFS would require a complete port.
The first question is a little harder to answer because -rt and BFS are tools for completely different workloads. BFS is a general purpose desktop orientated (yes we do have an extra syllable in English unlike American) CPU scheduler designed to decrease overall latencies to below human perceptible level in all regular workloads. Human perceptible latencies are in the millisecond range, where anything within about the 5ms range will not be noticeable. The -rt patchset is designed to decrease latencies in the microsecond range. This is well below anything that anybody could feel on a desktop. -rt also provides special tools for management of tasks that are running with realtime scheduling policies such as SCHED_FIFO and SCHED_RR. The addition of threaded interrupt handlers, priority inheritance and interrupt priority handling and so on are very specialised tools that are desirable in the realtime world. To use these, one must specially program for the -rt patchset, be using very carefully chosen hardware with known capabilities, run a stripped down userspace that only does precisely what one needs and so on. These are not the domain of a normal desktop user, nor that of any workload a desktop user is likely to ever encounter. If you were doing semi-professional audio recording you might, and then you'd need to understand the inner workings of the software and the -rt patchset to make the most of it. Just patching it in and expecting it to work for you will not really give you any advantage. Most users of the -rt patchset are embedded device manufacturers (and I don't mean Android phones).
A common response I get, then, is "surely if low latencies are good, then extremely low latencies are even better?" Unfortunately, that's not the case: running the -rt patchset incurs significant overhead and will not give the user any perceivable advantage, especially if you don't use any of its special features.
On the other hand, BFS is designed as a "plug it in and it will work" type of patch for the most part. A greatly undervalued feature of BFS is its ability to handle realtime tasks on SMP machines. BFS uses a global runqueue where the scheduler decides which task is the next most important and will choose the most suitable CPU anywhere to make the next task run as soon as possible. It finds the lowest priority, or oldest deadline, task running on any CPU, and kicks that off if there is something higher priority that demands CPU time. While this is the case with any CPU scheduler, the major difference is that a multiple runqueue design (i.e. one queue for every CPU) spends most of its time looking for the highest priority task in its own runqueue, rather than among all the tasks running on the whole machine. It then only moves tasks around to keep CPUs relatively balanced over time periods. The multiple queue design is ultimately used because it is more scalable when CPU numbers get high.
Using the simplest example possible of a dual core machine, it's easy to demonstrate the disadvantage of this multiple queue design. It is possible for say, 4 tasks to be running - two high priority realtime tasks and two regular tasks. In the multiple runqueue design, the two high priority tasks may end up on the same CPU for a period fighting each other for CPU, while the two regular tasks are on the other CPU, gorging themselves on CPU time. This scenario only exists for a short period until the CPUs are balanced again (assuming the balancing is done right, biasing for the high priority tasks!), but it can happen over and over again, and the more CPUs, and the more tasks running, the worse this situation gets. It means that an ultra low priority task, possibly even at nice 19, could be getting CPU time on one CPU, while a high priority SCHED_FIFO task is not getting CPU time because it's on another CPU waiting behind an even higher priority SCHED_FIFO task. If you are using the mainline kernel with more than one realtime task on an SMP machine, then it is imperative you use affinities to cope with this situation to ensure you have the lowest possible latencies. However there is no workaround for when there are more realtime tasks than CPU cores.
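On mainline, such pinning is done with taskset and chrt from util-linux. A sketch (the realtime lines are commented out because they need root, and the task names are placeholders of mine):

```shell
# Show this shell's current CPU affinity, then run a command pinned to
# CPU 0. Both work unprivileged on Linux with util-linux installed.
taskset -cp $$
taskset -c 0 sh -c 'echo "ran pinned to CPU 0"'

# With root, each realtime task could be given its own core so two
# SCHED_FIFO tasks never queue behind each other, e.g.:
#   chrt --fifo 50 taskset -c 0 ./rt_task_a &
#   chrt --fifo 50 taskset -c 1 ./rt_task_b &
```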
When I had the time and inclination to hack for mainline, I created what I called the "SMP Nice" patches, which were designed to try to balance things out to prevent all the high priority tasks clustering on one CPU and the low priority tasks on another and so on. A lot more work has gone into that on mainline since then to help alleviate that problem. An interesting footnote is that Peter Williams, who helped work on those patches, reported to me he noticed a few years ago that the priority management on SMP machines on Windows was woeful for a while, and it was a dead giveaway that they had changed to multiple runqueues.
BFS, with its global runqueue, manages these realtime tasks optimally without any extra effort. The highest priority task will always run first on any CPU it's allowed on. I've had quite a few people report to me their surprise to find that realtime tasks worked better on BFS, usually with audio on jackd, and asked if there was a reason for it. Priority balancing on SMP just works by design on a global runqueue without any extra effort. 'nice' also works better on SMP for exactly the same reason.