So we like to think that our software is perfect if there are no bug reports. Truth be told it's been a very long time since anyone's reported anything on BFS, which could be both good and bad. It means either no one is using it, or no one is having any bugs with it, or no one is inclined to report anything. Since I get about 10,000 downloads a month of it, I'm pretty sure that someone is using it, but I don't really know for certain how much of the other two it is. Anyway the point is that I hadn't really hacked much on it for a while now and had only been porting it with each stable kernel release as they came out, until I started working on 330. So for the next version, I planned to try and commit some more changes along the "hope it fixes android" lines and while I was at it, to try and find a few minor optimisations, and audit the code in the process.
The main change in 350 was moving the deadline system off jiffies. I had already done so in 330 by using a local 64 bit jiffy counter I called gjiffies (for global runqueue jiffies), which was just a wrapper on 64 bit builds, but needed locking and a u64 variable on 32 bit builds. Currently, BFS uses microsecond accounting for timeslices. The reason for doing so is that we don't have hardware that can do a context switch in less than a microsecond (some take up to 30 microseconds), and to avoid overflowing on 32 bits; if I were to use nanoseconds, a 32 bit signed counter would overflow in just 2.1 seconds. So anyway, to try and improve the resolution of the deadlines and maintain some deadline difference on lower Hz builds (specifically 100Hz), I changed the local jiffy counter I introduced in 330 to a monotonically incrementing microsecond counter I called miffies. This worked off the dreaded sched_clock interface we have that gets time from the TSC nanosecond timer on each CPU. This timer sounds great in principle, but often returns bogus values: it can appear to go backwards or jump forwards dramatically, fluctuates with CPU frequency scaling, and the timers are not aligned between CPUs on SMP. Numerous sanity checks exist in the sched_clock_cpu interface to make the number meaningful, but the result is still only useful within a single CPU. I had every CPU update the monotonic counter by adding its own difference to the time whenever it was called, scaled to microseconds. Thus I ended up with a relatively monotonic microsecond counter in 64 bits that could be used everywhere. Serge Belyshev quickly pointed out that there wasn't much point in it being microsecond resolution since it was 64 bit anyway, and I moved from miffies to niffies, since nanoseconds won't wrap in 64 bits for over 500 years. So in 350, all timing of deadlines is now done by comparing times to the monotonic 64 bit nanosecond niffies timer.
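The per-CPU delta accumulation can be sketched in miniature. This is a user-space illustration of the idea only, not BFS code: the names, the fixed CPU count, and the absence of the locking the real scheduler needs are all simplifications.

```c
#include <stdint.h>

static uint64_t niffies;         /* global monotonic nanosecond counter */
static uint64_t last_clock[4];   /* last per-CPU clock reading seen */

/* Fold one CPU's clock reading into the global counter.  Only positive
 * deltas are added, so the counter never goes backwards even when the
 * per-CPU clocks disagree or return bogus values. */
void update_niffies(int cpu, uint64_t clock_now)
{
    if (clock_now > last_clock[cpu])
        niffies += clock_now - last_clock[cpu];
    /* a backwards or repeated reading contributes nothing */
    last_clock[cpu] = clock_now;
}

uint64_t read_niffies(void)
{
    return niffies;
}
```

The real code also has to scale and sanity check the sched_clock values, but the shape of the monotonicity guarantee is the same.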
Anyway, the point of all this is that deadlines used to be limited by jiffy resolution, meaning whatever Hz the kernel was built with. On 100Hz, it was possible for many tasks to have the same deadline, dramatically decreasing the utility of having deadlines in the first place. It also means the deadline differentiation remains valid even long after the "virtual deadline" has passed (see the discussion on the BFS 330 update).
The next thing I added to improve management of low Hz builds was to test whether a task was starting in the first or second half of a tick (jiffy), and then use a "dither" flag to determine whether we should expire that task early once the tick fires. The reason for this is to minimise the extra time a task can get running beyond its allocated timeslice when we're limited by Hz resolution. So if we have, say, our rr_interval set to 10 milliseconds and our jiffies are 10 milliseconds long (as is the case at 100Hz), then it's possible, if a task starts just after a tick, for it to run for just under 20 milliseconds. With the dither flag, since it starts in the first half of that tick, when it gets to the next tick it will determine that it has less than half a tick left, and expire early. Thus, the longest a task can overrun its quota is only half a tick instead of one tick, meaning 5 ms in this example. This is not an expensive operation, but of course it would be much better if BFS wasn't Hz limited in this way at all, by using high precision event timers to expire tasks at the right time. However, there are two reasons I haven't done this. First, HPET isn't available on all hardware and does have a slight performance hit. But second, and most important, I've never bothered to figure them out and do it properly. That's why I just suggest people raise their Hz.
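The dither decision is simple enough to sketch. The names and the hard-coded 100Hz tick below are illustrative, not the actual BFS code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constant: a 100Hz build, so one jiffy is 10000us. */
#define US_PER_TICK 10000

/* When a task starts running, note whether it started in the first
 * half of the current tick: if so, flag it for dithering. */
bool start_dither(uint64_t us_into_tick)
{
    return us_into_tick < US_PER_TICK / 2;
}

/* At the next tick: a dithering task with less than half a tick of
 * timeslice left is expired early, capping the possible overrun at
 * half a tick rather than a whole one. */
bool expire_early(bool dither, int64_t time_slice_us)
{
    if (time_slice_us <= 0)
        return true;    /* out of quota regardless */
    return dither && time_slice_us < US_PER_TICK / 2;
}
```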
Apart from those major changes, I also committed a whole swag of things I had been considering. I use close shifts where possible instead of division (right shift by 20 instead of divide by 1000000, for example) since shifts are cheaper. ISO calculations, which are done during a scheduler tick, now have their own lock to avoid grabbing the grq lock. I found some mistakes in the test for preemption when there are tasks of different scheduling policies: SCHED_BATCH behaviour wasn't working properly, and sometimes a CPU other than the lowest priority one was being chosen to test against when there may have been a lower priority task running elsewhere. That's been fixed. I decreased the maximum rr_interval possible to 1000 (1 second), since the new changes mean 5 seconds would overflow, and higher values don't really help anyway. I added a local last_task variable per CPU, which can tell if a task is still cache warm on that CPU even if it has run elsewhere since. I dropped the first_time_slice flag, as it was adding overhead and not helping: it was possible to retrieve timeslice from children long after they were forked if they had run only very little. I made sure timeslice is split properly between parent and child on fork(), and that a reschedule is forced if the relative loss of timeslice means the parent runs out. I changed rescheduling to use 100 microseconds as the cutoff for expiration of timeslices, since a task rescheduled with less time available than this will either greatly overrun its quota or reschedule very quickly. SCHED_BATCH tasks now refill their timeslice and reset their deadline every time they're rescheduled, which makes them lower weight than previously; but since they've been flagged as latency insensitive, they're likely to be fully CPU bound anyway and will only be rescheduled when they run out of timeslice or are preempted.
I bypass the test for earliest_deadline_task now in schedule() when there is nothing else running and the previous task will simply run again. Numerous other minor cleanups and fixes were committed.
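The shift-for-divide trick above relies on a power of two sitting close to a power of ten: 2^20 = 1048576 is within about 5% of one million, so a right shift by 20 approximates dividing nanoseconds by a million without a 64 bit division. A minimal sketch of the kind of helper this enables (my illustration, not the BFS source):

```c
#include <stdint.h>

/* Approximate ns -> ms conversion: 2^20 = 1048576 is within ~5% of
 * 1000000, and a shift is far cheaper than a 64 bit divide on most
 * hardware.  Fine for scheduler bookkeeping where the exact value
 * doesn't matter, only the relative ordering. */
static inline uint64_t approx_ns_to_ms(uint64_t ns)
{
    return ns >> 20;
}
```

Note the systematic ~5% underestimate: one full second of nanoseconds comes out as 953 rather than 1000, which is harmless when every value is scaled the same way.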
So, after that long-winded discussion, what's the advantage of BFS 350 over 318 onwards, and what should you see? Hopefully... nothing really obvious. There may be some lower latencies, and possibly very minor improvements in throughput and behaviour when many different scheduling policies are used at once. So far the feedback from people on 2.6.35.X has been that they haven't noticed a difference. Unfortunately, that is not the case for those older kernels! I probably should never have started backporting these changes to the older kernels, since I haven't been extensively testing the backports. Ironically, though, it's those stuck on those older kernels that made me start making these changes in the first place! Furthermore, I now have no idea what's going on in android land. So I'm in a bit of a bind, and I'd love feedback from those on the older kernels, and comparisons of 350 against older BFS versions on newer kernels too. But if you're not adventurous and are on an older kernel and just want stable, stick to BFS 318. It may be that the whole deadline expiry thing is STILL needed, yet I honestly don't know why :(
Oh, I also noticed that despite quite a few people reading this blog, there have been virtually no comments. I know that many many more people read than comment, and that people may not feel qualified to comment, but a simple yes/no, good/bad would be greatly appreciated.
UPDATE: It may well be that the behavioural problems are just 32 bit overflows after all. Texstar of pclinuxos has been testing 32 bit kernels and found them to be problematic. More on this as he helps me sort it out.
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Sunday, 3 October 2010
Friday, 1 October 2010
BFS in real time
A question I often get is "how does BFS compare to the -rt patchset?" and I also get asked if BFS is compatible with the -rt patchset.
The second question is easy to answer: No. The code that makes up the -rt patchset carves heavily into the core of the CPU scheduler and to make it compatible with BFS would require a complete port.
The first question is a little harder to answer because -rt and BFS are tools for completely different workloads. BFS is a general purpose desktop orientated (yes, we do have an extra syllable in English unlike American) CPU scheduler designed to decrease overall latencies to below the human perceptible level in all regular workloads. Human perceptible latencies are in the millisecond range, where anything within about 5ms will not be noticeable. The -rt patchset is designed to decrease latencies in the microsecond range, well below anything that anybody could feel on a desktop. -rt also provides special tools for managing tasks that run with realtime scheduling policies such as SCHED_FIFO and SCHED_RR. The addition of threaded interrupt handlers, priority inheritance, interrupt priority handling and so on are very specialised tools that are desirable in the realtime world. To use these, one must program specially for the -rt patchset, use very carefully chosen hardware with known capabilities, run a stripped down userspace that does only precisely what one needs, and so on. These are not the domain of a normal desktop user, nor of any workload a desktop user is likely to ever encounter. If you were doing semi-professional audio recording you might, and then you'd need to understand the inner workings of the software and the -rt patchset to make the most of it. Just patching it in and expecting it to work for you will not really give you any advantage. Most users of the -rt patchset are embedded device manufacturers (and I don't mean Android phones).
A common response I get, then, is "surely if low latencies are good, then extremely low ones are even better?" Unfortunately, that's not the case: running the -rt patchset incurs significant overhead and will not give the user any perceivable advantage, especially if you don't use any of its special features.
On the other hand, BFS is designed as a "plug it in and it will work" type of patch for the most part. A greatly undervalued feature of BFS is its ability to handle realtime tasks on SMP machines. BFS uses a global runqueue where the scheduler decides which task is the next most important and will choose the most suitable CPU anywhere to make that task run as soon as possible. It finds the lowest priority, or oldest deadline, task running on any CPU, and kicks that off if there is something higher priority that demands CPU time. While this is the case with any CPU scheduler, the major difference is that a multiple runqueue design (i.e. one runqueue for every CPU) spends most of its time looking for the highest priority task in its own runqueue, rather than among all the tasks running on the whole machine. It then only moves tasks around to keep CPUs relatively balanced over time. The multiple queue design is ultimately used because it is more scalable as CPU counts get high.
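The global preemption test described above can be sketched like this; the struct and function names are illustrative only, not BFS code:

```c
#include <stdint.h>

/* What each CPU is currently running, as the global runqueue sees it. */
struct running {
    int prio;            /* lower number = higher priority */
    uint64_t deadline;   /* virtual deadline stamp; larger = later */
};

/* Scan every CPU's currently running task and return the index of the
 * best preemption target: the lowest priority task anywhere, breaking
 * ties by the latest (oldest-relative) deadline. */
int pick_preempt_target(const struct running *cpus, int ncpus)
{
    int target = 0;
    for (int cpu = 1; cpu < ncpus; cpu++) {
        if (cpus[cpu].prio > cpus[target].prio ||
            (cpus[cpu].prio == cpus[target].prio &&
             cpus[cpu].deadline > cpus[target].deadline))
            target = cpu;
    }
    return target;
}
```

Because the scan covers every CPU, a waking high priority task always displaces the globally least important running task, which is exactly what per-CPU runqueues cannot guarantee without balancing.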
Using the simplest example possible, a dual core machine, it's easy to demonstrate the disadvantage of this multiple queue design. Say 4 tasks are running: two high priority realtime tasks and two regular tasks. In the multiple runqueue design, the two high priority tasks may end up on the same CPU for a period, fighting each other for CPU, while the two regular tasks are on the other CPU, gorging themselves on CPU time. This scenario only exists for a short period until the CPUs are balanced again (assuming the balancing is done right, biasing for the high priority tasks!), but it can happen over and over again, and the more CPUs and the more tasks running, the worse this situation gets. It means that an ultra low priority task, possibly even nice 19, could be getting CPU time on one CPU while a high priority SCHED_FIFO task is not, because it's on another CPU waiting behind an even higher priority SCHED_FIFO task. If you are using the mainline kernel with more than one realtime task on an SMP machine, it is imperative you use affinities to cope with this situation and ensure the lowest possible latencies. However, there is no workaround for when there are more realtime tasks than CPU cores.
When I had the time and inclination to hack on mainline, I created what I called the "SMP Nice" patches, which were designed to try and balance things out to prevent all the high priority tasks clustering on one CPU and the low priority tasks on another, and so on. A lot more work has gone into that on mainline since then to help alleviate the problem. An interesting footnote is that Peter Williams, who helped work on those patches, told me he noticed a few years ago that priority management on SMP machines on Windows was woeful for a while, and it was a dead giveaway that they had changed to multiple runqueues.
BFS, with its global runqueue, manages these realtime tasks optimally without any extra effort. The highest priority task will always run first on any CPU it's allowed on. I've had quite a few people report to me their surprise to find that realtime tasks worked better on BFS, usually with audio on jackd, and asked if there was a reason for it. Priority balancing on SMP just works by design on a global runqueue without any extra effort. 'nice' also works better on SMP for exactly the same reason.
Wednesday, 29 September 2010
Biasing for latency under load
One of the mantras of BFS is that it has very little in the way of tunables and should require no input on the part of the user to get both good latency and good throughput by default. The only tunable available is the rr_interval found in /proc/sys/kernel/ and turning it up will improve throughput at the expense of latency, while turning it down will do the opposite (iso_cpu is also there, but that's more a permission tunable than a scheduler behavioural tunable). Giving you the bottom line for those who want to tune, I suggest running 100Hz with an rr_interval of 300 if you are doing nothing but cpu intensive slow tasks (video encoding, folding etc) and running 1000Hz with an rr_interval of 2 if you care only for latency at all costs.
I've believed for a long time that it makes no sense to tune for ridiculously high loads if your machine is primarily a desktop; if you get into an overload situation, you should expect a slowdown. Trying to tune for such conditions always ends up costing you in other ways that just aren't worth it, since you spend 99.9999% of your time at "normal loads". What BFS does at high loads is progressively lengthen latency in proportion to the load while maintaining relatively regular throughput. So if you run, say, a make -j4 on a quad core machine, you shouldn't really notice anything going on, but if you run a make -j64 you should notice a dramatic slowdown, loss of fluid cursor movement, and possibly audio skipping. What exactly is the point of doing a make -j64 on a quad core desktop? There is none, apart from as some kind of mindless test. However, a busy server spends most of its time at loads much greater than 1 per CPU. In that setting, maintaining reasonable latency while ensuring maximal throughput is optimal.
The mainline kernel seems intent on continually readdressing latency under load on the desktop as though it's some holy grail. Lately a make -j10 load on a uniprocessor has been used as the benchmark. What they're finding, not surprisingly, is that the lower you aim your latencies, the smoother the desktop continues to feel at higher loads, and they're trying to find some "optimum" value where latency is still good without sacrificing too much throughput. Why 10? Why not 100? How about 1000? Why choose some arbitrary upper figure to tune to? Why not just accept that overload is overload, that latency is going to suffer, and not damage throughput trying to contain it?
For my own response to this, here's a patch:
(edit, patch for bfs357)
bfs357-latency_bias.patch
The changelog, also in the patch itself, follows:
Make it possible to maintain low latency as much as possible at high loads by shortening timeslices the more loaded the machine gets. Do so by adding a tunable, latency_bias, which is disabled by default. Valid values are from 0 to 100, where higher values mean biasing more for latency as load increases. Note that this should still maintain fairness, but will sacrifice throughput, potentially dramatically, to try and keep latencies as low as possible. Hz will still be a limiting factor, so the higher Hz is, the lower the latencies maintainable.
The effect of enabling this tunable will be to ensure that very low CPU usage processes, such as mouse cursor movement, remain fluid no matter how high the load is. It's possible to have a smooth mouse cursor under massive loads, but the effect on throughput can be up to a 20% loss at ultra high loads. At meaningful loads, a value of 1 will have minimal impact on throughput while ensuring that under the occasional overload condition the machine still feels fluid.
This is to achieve the converse of what normally happens. You can choose to tune to maintain either latency at high loads (set to 100), or throughput (set to 0 for current behaviour), or some value in between (set to 1 or more). So I'm putting this out there for people to test and report back to see if they think it's worthwhile.
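For illustration only, here is one plausible shape such a load-scaled timeslice could take. To be clear, the formula below is a sketch made up for this post, not the code in bfs357-latency_bias.patch:

```c
#include <stdint.h>

/* Hypothetical sketch: shrink the timeslice as load per CPU rises
 * above 1, with latency_bias (0-100) controlling how aggressively.
 * bias 0 keeps current behaviour; bias 100 halves the slice at a
 * load of 2 per CPU, thirds it at 3, and so on. */
uint64_t biased_timeslice(uint64_t rr_interval_us, int latency_bias,
                          int load, int ncpus)
{
    int load_per_cpu = load / ncpus;

    if (latency_bias == 0 || load_per_cpu <= 1)
        return rr_interval_us;           /* no bias, or not overloaded */

    uint64_t slice = rr_interval_us * 100 /
                     (100 + (uint64_t)latency_bias * (load_per_cpu - 1));
    return slice ? slice : 1;            /* never hand out a zero slice */
}
```

Shorter slices under overload mean more frequent context switches, which is precisely where the throughput loss comes from; any real implementation would still be bounded below by Hz resolution, as the changelog notes.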
Do androids dream of electric sheep with BFS 330?
An interesting thing happened on the way to creating the latest BFS.
Prior to BFS versions 330 I had this note in the BFS code:
* Once deadlines are expired (jiffies has passed it) tasks are chosen in FIFO
* order. Note that very few tasks will be FIFO for very long because they
* only end up that way if they sleep for long or if there are enough fully
* cpu bound tasks to push the load to ~8 higher than the number of CPUs for
* nice 0.
Now BFS works off a very simple principle for dealing with all cpu distribution and latency management: That each task sets a deadline once and it gets reset when that task runs out of timeslice. If it sleeps, its deadline looks relatively older and older and therefore when it wakes up again, it will be considered very late for its deadline and get the lowest possible latency. This is how low latency is guaranteed for interactive tasks and so on. Note the deadline is really just a relative time and not a true "deadline" per se.
Anyway, the point of bringing that up is that there really is no reason to "expire" deadlines as the comment above says. If the deadlines are older, meaning the tasks have slept the longest, those tasks should also get the lowest latency on wakeup. There are two ways deadlines can "expire": one is that a task has simply slept for long enough (say, firefox sitting at a static page), and the other is that the load is so high (8 * CPUs) that all the tasks have "late" deadlines. In neither scenario is there any point expiring deadlines, since the deadline is just a relative time used for comparison to see who needs to go next.
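The whole principle fits in a few lines of illustrative C (names and offsets are mine, not the BFS source):

```c
#include <stdint.h>
#include <stdbool.h>

/* A deadline is stamped once, from the current clock plus a priority
 * scaled offset, and is thereafter only ever *compared* against other
 * deadlines - never used as an absolute alarm. */
uint64_t stamp_deadline(uint64_t now_ns, uint64_t prio_offset_ns)
{
    return now_ns + prio_offset_ns;
}

/* A task that slept keeps its old stamp, so on wakeup its deadline
 * compares as long past and it wins selection immediately. */
bool runs_first(uint64_t deadline_a, uint64_t deadline_b)
{
    return deadline_a < deadline_b;   /* earlier (older) deadline wins */
}
```

Since only the comparison matters, an "expired" deadline is just a very old one, and the ordering between tasks stays meaningful however far past the stamps are.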
However, I originally added the code to expire deadlines because of jiffy wrap; that is, on 32 bit machines the jiffy counter exceeds 32 bits and returns to zero. This happens at a rate dependent on what Hz you build your kernel with, and would normally happen after ~50 days at 1000Hz. However, the initial jiffies value at boot is deliberately set so that the counter wraps 5 minutes after boot, to make it easy to test for jiffy wrap errors. The kernel, however, has time_before and time_after wrappers which should make these errors go away unless the two times being compared are more than 2^31 ticks apart (so at least 25 days). I used those wrappers a while ago and tried to get rid of expiration, and suddenly had reports of strange lag on uniprocessor low power machines - specifically more common on android. It was a clear binary test of with/without expiration, and people confirmed it caused/fixed the problem. So I left it in, since the wrap itself was unlikely to cause a real world problem.
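Those wrappers boil down to comparing wrapping counters through a signed cast of their difference. A minimal standalone version of the same trick (the kernel's actual macros live in jiffies.h and include type-checking):

```c
#include <stdint.h>
#include <stdbool.h>

/* Wrap-safe "is a before b?" for 32 bit tick counters: the unsigned
 * subtraction wraps modulo 2^32, and reinterpreting the result as
 * signed gives the right ordering as long as the two values are less
 * than 2^31 ticks apart. */
static inline bool jtime_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

This is why wrap only bites when comparisons span more than 2^31 ticks, the ~25 days at 1000Hz mentioned above.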
Now a little while ago, Nenolod of rapidxen was playing around with BFS (318) on massive loads of 100 or more and found that it maintained fairness up to a point and then suddenly broke down on fairness and became virtually unresponsive. He asked me about it and I brought up the 8 * CPUs issue and then he did some more testing and found that it broke down precisely as predicted once load got to that level and above. So I gave him a quick patch that disabled the task expiration. It took out this code:
/*
 * Look for tasks with old deadlines and pick them in FIFO
 * order, taking the first one found.
 */
if (time_is_before_jiffies(dl)) {
        edt = p;
        goto out_take;
}
The high load problem was solved, with fairness and latency (proportional to load) more or less being stable. However, this would bring back that lag whose cause I never really isolated. So, assuming it was related to timer wrap, I changed all the deadlines to be compared against a local 64 bit jiffy counter used within the scheduler only. This was the major change going from v323 to 330 (along with random other less significant changes), and I quickly backported it all the way to 2.6.31 in the hope the android folk would test it.
Now, since Froyo, the android folk stopped using BFS because they said they did a blind comparison and couldn't tell BFS apart from mainline. Unfortunately I have no idea what version of BFS they used, and there was a buggy version of BFS out for about 6 months(!) on uniprocessor, a bug I never picked up due to lack of feedback :s. So basically, when I say I hoped the android folk would test it, I mean the random people who put together their own kernel ROMs rather than actual Android developers. It took me a while trawling the random forums before I finally found a couple of people who were patching BFS into android ROMs, and I tried to message them. Cutting a longer story short, a brief test of 330 on android was a flop. Lag was back.
Now it's very hard to try and isolate a problem you can't reproduce locally, and try I have. The other thing android does is use 100Hz for power saving reasons (note that 300Hz with scaling and nohz features is likely indistinguishable from 100Hz in power consumption, so I don't see the point of such low resolution timers these days). The best I could do was build a 32 bit 100Hz uniprocessor kernel and try it on an aging P4 I have at home, and I don't really see the problem. I really have to wonder whether it has anything to do with the expiration as such, or whether it's related to limitations of the TSC high resolution timer on other hardware, or what. It's not like Android is seeing loads of 8 per CPU or more, and that would have the opposite effect anyway.
It really worries me looking at some of these alternative Android ROM kernels, as the people putting them together don't appear to know the code they're merging very well, and are using different kernel version numbers and so on, and the scheduler undergoes MASSIVE flux in mainline, so it's non-trivial to port code from one kernel version to another. Short of maintaining my own kernel tree for various 'droids, I don't know how much I can rely on reports. I don't have that kind of time unfortunately. Anyway, that's the story up to 330. Lots more code went into 350. More on that later.
Prior to BFS version 330, I had this note in the BFS code:
* Once deadlines are expired (jiffies has passed it) tasks are chosen in FIFO
* order. Note that very few tasks will be FIFO for very long because they
* only end up that way if they sleep for long or if there are enough fully
* cpu bound tasks to push the load to ~8 higher than the number of CPUs for
* nice 0.
Now BFS works off a very simple principle for dealing with all cpu distribution and latency management: each task sets a deadline once, and it gets reset when that task runs out of timeslice. If it sleeps, its deadline looks relatively older and older, so when it wakes up again it will be considered very late for its deadline and get the lowest possible latency. This is how low latency is guaranteed for interactive tasks and so on. Note the deadline is really just a relative time and not a true "deadline" per se.
Anyway the point of bringing that up is that there really is no reason to "expire" deadlines as the comment above says. If the deadlines are older, meaning those tasks have slept the longest, they should also get the lowest latency on wakeup. There are two ways deadlines can expire: one is that the task has simply slept for long enough (say firefox is sitting at a static page), and the other is that the load is so high (8 * number of CPUs) that all the tasks end up with "late" deadlines. In neither of these scenarios is there any point to expiring deadlines, since a deadline is just a relative time used for comparison to see who needs to go next.
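To make that concrete, here's a minimal sketch of the selection idea (hypothetical names and a trivially simplified struct, not the actual BFS code): the oldest deadline stamp simply wins, so a long sleeper is picked immediately on wakeup without any separate expiry pass.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical simplified runqueue entry; BFS's real task struct differs. */
struct task {
	uint64_t deadline;	/* stamp set when the task was granted a timeslice */
	struct task *next;
};

/*
 * Pick the runnable task with the oldest (smallest) deadline stamp.
 * A task that slept a long time still carries its old stamp, so it
 * compares as "very late" and wins selection as soon as it wakes --
 * which is why no separate expiry step is needed.
 */
static struct task *pick_earliest(struct task *head)
{
	struct task *edt = head;
	struct task *p;

	for (p = head; p; p = p->next) {
		if (p->deadline < edt->deadline)
			edt = p;
	}
	return edt;
}
```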
However, I originally added the code to expire deadlines because of jiffy wrap; that is, on 32 bit machines the jiffy counter overflows 32 bits and returns to zero. This happens at a rate dependent on what HZ you build your kernel with, and would normally happen after ~50 days at 1000Hz. However, the initial jiffy value on starting a kernel is set to -5 minutes so that it wraps 5 minutes after boot, to make it easy to catch jiffy wrap errors. The kernel does have time_before and time_after wrappers which should make these errors go away unless you have > 2^31 difference in time (so at least ~25 days at 1000Hz). I used those wrappers a while ago and tried to get rid of expiration, and suddenly had reports of strange lag on uniprocessor low power machines - specifically more common on android. It was a clear binary test of with/without expiration and people confirmed it caused/fixed the problem. So I left it in, since it was unlikely to cause a real world problem.
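For the curious, the trick those wrappers rely on can be shown in a few lines (this is a sketch of the same signed-subtraction idea, not the kernel's macro itself): subtract and cast to a signed type, and two tick values compare correctly even across the 32 bit wrap, as long as they are less than 2^31 ticks apart.

```c
#include <stdint.h>

/*
 * Wrap-safe comparison in the style of the kernel's time_before():
 * the unsigned subtraction wraps modulo 2^32, and the signed cast
 * makes "a is earlier than b" come out right provided the two values
 * are less than 2^31 ticks apart (~25 days at 1000Hz).
 */
static int jiffies_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```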
Now a little while ago, Nenolod of rapidxen was playing around with BFS (318) on massive loads of 100 or more and found that it maintained fairness up to a point, then suddenly broke down and became virtually unresponsive. He asked me about it and I brought up the 8 * CPUs issue; he did some more testing and found that it broke down precisely as predicted once load got to that level and above. So I gave him a quick patch that disabled the deadline expiration. It took out this code:
/*
 * Look for tasks with old deadlines and pick them in FIFO
 * order, taking the first one found.
 */
if (time_is_before_jiffies(dl)) {
        edt = p;
        goto out_take;
}
The high load problem was solved, with fairness and latency (proportional to load) more or less stable. However this would bring back that lag whose cause I never really isolated. So, assuming it was related to timer wrap, I changed all the deadlines to be compared against a local 64 bit jiffy counter within the scheduler only. This was the major change going from v323 to 330 (along with other less significant changes) and I quickly backported it all the way to 2.6.31 in the hope the android folk would test it.
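The idea behind such a local counter can be sketched like so (a toy illustration in the spirit of the gjiffies counter described above, not the actual scheduler code; the real thing also needs locking on 32 bit builds, omitted here): each update adds the unsigned difference since the last sample, so 32 bit wraps are absorbed as long as updates arrive at least once per wrap period.

```c
#include <stdint.h>

/*
 * Extend a wrapping 32 bit tick counter into a monotonic 64 bit one.
 * The unsigned subtraction wraps modulo 2^32, so the delta is correct
 * even when jiffies_now has wrapped past last_jiffies, provided this
 * is called at least once per wrap period (~50 days at 1000Hz).
 */
static uint64_t gjiffies;
static uint32_t last_jiffies;

static void update_gjiffies(uint32_t jiffies_now)
{
	gjiffies += (uint32_t)(jiffies_now - last_jiffies);
	last_jiffies = jiffies_now;
}
```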
Now since Froyo, the android folk stopped using BFS because they said they did a blind comparison and couldn't tell BFS apart from mainline. Unfortunately I have no idea what version of BFS they used, and there was a buggy version of BFS out for about 6 months(!) on uniprocessor whose bug I never picked up due to lack of feedback :s. So when I say I hope the android folk would test it, I mean the random people who put together their own kernel roms rather than actual Android developers. It took me a while trawling the random forums, but I finally found a couple of people who were patching BFS into android roms and tried to message them. Cutting a longer story short, a brief test of 330 on android was a flop. Lag was back.
Now it's very hard to try and isolate a problem you can't reproduce locally, and try I have. The other thing android does is use 100Hz for power saving reasons (note that 300Hz with scaling and nohz features is likely indistinguishable from 100Hz in power consumption, so I don't see the point of such low resolution timers these days). The best I could do was build a 32 bit 100Hz uniprocessor kernel and try it on an aging P4 I have at home, and I don't really see the problem. I really had to wonder if it had anything to do with the expiration as such, or whether it's related to limitations of the TSC high resolution timer on other hardware, or what. It's not like Android is seeing loads of 8 or more, and that would have the opposite effect anyway.
It really worries me looking at some of these alternative Android rom kernels as the people putting them together don't appear to really know the code they're merging very well, and are using different kernel version numbers and so on, and the scheduler undergoes MASSIVE flux in mainline so it's non-trivial to port code from one kernel version to another. Short of maintaining my own kernel tree for various 'droids, I don't know how much I can rely on reports. I don't have that kind of time unfortunately. Anyway that's the story up to 330. Lots more code went into 350. More on that later.