Friday, 1 October 2010

BFS in real time

A question I often get is "how does BFS compare to the -rt patchset?" and I also get asked if BFS is compatible with the -rt patchset.

The second question is easy to answer: No. The code that makes up the -rt patchset carves heavily into the core of the CPU scheduler and to make it compatible with BFS would require a complete port.

The first question is a little harder to answer because -rt and BFS are tools for completely different workloads. BFS is a general purpose desktop orientated (yes we do have an extra syllable in English unlike American) CPU scheduler designed to decrease overall latencies to below human perceptible level in all regular workloads. Human perceptible latencies are in the millisecond range, where anything within about the 5ms range will not be noticeable. The -rt patchset is designed to decrease latencies in the microsecond range. This is well below anything that anybody could feel on a desktop. -rt also provides special tools for management of tasks that are running with realtime scheduling policies such as SCHED_FIFO and SCHED_RR. The addition of threaded interrupt handlers, priority inheritance and interrupt priority handling and so on are very specialised tools that are desirable in the realtime world. To use these, one must specially program for the -rt patchset, be using very carefully chosen hardware with known capabilities, run a stripped down userspace that only does precisely what one needs and so on. These are not the domain of a normal desktop user, nor that of any workload a desktop user is likely to ever encounter. If you were doing semi-professional audio recording you might, and then you'd need to understand the inner workings of the software and the -rt patchset to make the most of it. Just patching it in and expecting it to work for you will not really give you any advantage. Most users of the -rt patchset are embedded device manufacturers (and I don't mean Android phones).

A common response I get, then, is "surely if low latencies are good, then extremely low are even better?" Unfortunately, that's not the case, as the cost of running the -rt patchset does incur significant overhead and will not give the user any perceivable advantage, especially if you don't use any of the special features of it.

On the other hand, BFS is designed as a "plug it in and it will work" type of patch for the most part. A greatly undervalued feature of BFS is its ability to handle realtime tasks on SMP machines. BFS uses a global runqueue where the scheduler decides what task is the next most important and will choose the most suitable CPU anywhere to make the next task run as soon as possible. It finds the lowest priority, or oldest deadline task running on any CPU, and kicks that off if there is something higher priority that demands CPU time. While this is the case with any CPU scheduler, the major difference is that multiple runqueue designs (i.e. one for every CPU) spends most of its time looking for the highest priority task in its own runqueue, rather than all the tasks running on the whole machine. It then only moves tasks around to keep CPUs relatively balanced over time periods. The multiple queue design is ultimately used because it is more scalable when CPU numbers get high.

Using the simplest example possible of a dual core machine, it's easy to demonstrate the disadvantage of this multiple queue design. It is possible for say, 4 tasks to be running - two high priority realtime tasks and two regular tasks. In the multiple runqueue design, the two high priority tasks may end up on the same CPU for a period fighting each other for CPU, while the two regular tasks are on the other CPU, gorging themselves of CPU time. This scenario only exists for a short period until the CPUs are balanced again (assuming the balancing is done right biasing for the high priority tasks!), but it can happen over and over again, and the more CPUs, and the more tasks running, the worse this situation gets. It means that an ultra low priority task, possibly even nice 19 could be getting CPU time on one CPU, while a high priority SCHED_FIFO task is not getting CPU time because it's on another CPU waiting below an even higher priority SCHED_FIFO task. If you are using the mainline kernel with more than one realtime task on an SMP machine, then it is imperative you use affinities to cope with this situation to ensure you have the lowest possible latencies. However there is no workaround for when there are more realtime tasks than CPU cores.

When I had the time and inclination to hack for mainline, I created what I called the "SMP Nice" patches which were designed to try and balance things out to prevent all the high priority tasks clustering on one CPU and the low priority tasks on another and so on. A lot more work has gone into that on mainline since then to help alleviate that problem. An interesting footnote is that Peter Williams, who helped work on those patches, reported to me he noticed a few years ago that the priority management on SMP machines on windows was woeful for a while and it was a dead giveaway that they had changed to multiple runqueues.

BFS, with its global runqueue, manages these realtime tasks optimally without any extra effort. The highest priority task will always run first on any CPU it's allowed on. I've had quite a few people report to me their surprise to find that realtime tasks worked better on BFS, usually with audio on jackd, and asked if there was a reason for it. Priority balancing on SMP just works by design on a global runqueue without any extra effort. 'nice' also works better on SMP for exactly the same reason.

No comments:

Post a Comment