Wednesday 29 September 2010

Do androids dream of electric sheep with BFS 330?

An interesting thing happened on the way to creating the latest BFS.

Prior to BFS versions 330 I had this note in the BFS code:
* Once deadlines are expired (jiffies has passed it) tasks are chosen in FIFO
* order. Note that very few tasks will be FIFO for very long because they
* only end up that way if they sleep for long or if if there are enough fully
* cpu bound tasks to push the load to ~8 higher than the number of CPUs for
* nice 0.

Now BFS works off a very simple principle for dealing with all cpu distribution and latency management: That each task sets a deadline once and it gets reset when that task runs out of timeslice. If it sleeps, its deadline looks relatively older and older and therefore when it wakes up again, it will be considered very late for its deadline and get the lowest possible latency. This is how low latency is guaranteed for interactive tasks and so on. Note the deadline is really just a relative time and not a true "deadline" per se.

Anyway the point of bringing that up is that there really is no reason to "expire" deadlines as the comment says above. If the deadlines are older, meaning they have slept the longest, they should also get the lowest latency on wakeup. There are two ways deadlines can expire, one is that the tasks have simply slept for long enough (say firefox is sitting at a static page) and the other is that the load is so high (8 * CPUs) and then all the tasks will have "late" deadlines. In neither of these scenarios should there be any point to expiring deadlines since it's just a relative time that is used for comparison to see who needs to go next.

However, I originally added the code to expire deadlines because of jiffy wrap; that is on 32 bit machines the timer would get to greater than 32bits and return to zero. This happens at a rate dependant on what HZ you build your kernel to, and would normally happen after ~50 days at 1000Hz. However the default time on starting a kernel is set to -5 minutes so that it wraps 5 minutes after boot, to make it easy to test for jiffy wrap errors. The kernel, however, has a time_before and time_after wrapper which should make these errors go away unless you have > 2^31 difference in time (so at least 25 days). I used those wrappers a while ago and tried to get rid of expiration and suddenly had reports of strange lag on uniprocessor low power machines - specifically more common on android. It was a clear binary test of with/without expiration and people confirmed it caused/fixed the problem. So I left it in, since it was unlikely to cause a real world problem.

Now a little while ago, Nenolod of rapidxen was playing around with BFS (318) on massive loads of 100 or more and found that it maintained fairness up to a point and then suddenly broke down on fairness and became virtually unresponsive. He asked me about it and I brought up the 8 * CPUs issue and then he did some more testing and found that it broke down precisely as predicted once load got to that level and above. So I gave him a quick patch that disabled the task expiration. It took out this code:

/*
* Look for tasks with old deadlines and pick them in FIFO
* order, taking the first one found.
*/
if (time_is_before_jiffies(dl)) {
edt = p;
goto out_take;
}

The high load problem was solved with fairness and (proportional to load) latency more or less being stable. However this would bring back that lag that I never really isolated the cause of. So, assuming it was related to timer wrap I changed all the deadlines to use be compared against a local 64 bit jiffy counter within the scheduler only. This was the major change going from v323 to 330 (along with random other less significant changes) and I quickly backported it all the way to 2.6.31 in the hope the android folk would test it.

Now since Froyo, the android folk stopped using BFS because they said they did a blind comparison and couldn't tell BFS apart from mainline. Unfortunately I have no idea what version of BFS they used and there was a buggy version of BFS out for about 6 months(!) on uniprocessor that I never picked up the bug due to lack of feedback :s. So basically, when I say I hope the android folk would test it, it means the random people who put together their own kernel roms rather than actual Android developers. It took me a while to trawl the random forums before I found a couple of people who were patching in BFS into android roms but finally found some and tried to message them. Cutting a longer story short, a brief test of 330 on android was a flop. Lag was back.

Now it's very hard to try and isolate a problem you can't reproduce locally, and try I have. The other thing that android does is uses 100Hz for power saving reasons (note that 300Hz with scaling and nohz features is likely indistinguishable from 100Hz in power consumption so I don't see the point of such low resolution timers these days). The best I could do is build a 32 bit 100Hz uniprocessor kernel and try it on an aging P4 that I have at home and I don't really see the problem. I really had to wonder if it had anything to do with the real expiration as such, or whether it's related to limitations of the TSC high resolution timer on other hardware or what. It's not like Android is seeing loads of 8 or more, and that would have the opposite effect anyway.

It really worries me looking at some of these alternative Android rom kernels as the people putting them together don't appear to really know the code they're merging very well, and are using different kernel version numbers and so on, and the scheduler undergoes MASSIVE flux in mainline so it's non-trivial to port code from one kernel version to another. Short of maintaining my own kernel tree for various 'droids, I don't know how much I can rely on reports. I don't have that kind of time unfortunately. Anyway that's the story up to 330. Lots more code went into 350. More on that later.

1 comment: