So the one-line bug in BFS 357 is big enough that it may affect anyone with a laptop containing Intel wireless, or similar hardware. How do I know this? Well, I hit it on my own laptop, on suspend after about 30 suspend-to-ram cycles. Nothing better than a big fat OOPS on your own hardware to make you feel obliged to update your own code. So given that 2.6.36-ck1 is barely one week old, and 2.6.36 is likely to be the "stable" 3-point release for 3 months, I've decided to release 2.6.36-ck2 with just one line of code changed. As per my last blog entry, I've simply removed the one BUG_ON which would cause an oops in BFS 357 when it's run on a 2.6.36 kernel. If you're not hitting this bug, there is no point whatsoever in upgrading from ck1 to ck2. For my own internal testing I'm using a WARN_ON_ONCE in place of the BUG_ON, to see if there's something meaningful in the bug that points to a mainline problem, but it's safe for everyone else to just remove the OOPS.
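For those unfamiliar with the warn-once idea behind the WARN_ON_ONCE swap, here's a rough userspace sketch of the pattern, purely illustrative and not the kernel's actual macro: report the first time a condition fires, then stay silent, instead of killing the task the way BUG_ON does. The macro and function names here are made up for illustration.

```c
#include <stdio.h>

/* Userspace analogue of the kernel's warn-once pattern: print a
 * warning the first time the condition is true at this call site,
 * then stay silent, instead of crashing the way BUG_ON would.
 * Evaluates to whether the condition fired. (Uses a GNU C
 * statement expression, as kernel code does.) */
#define warn_on_once(cond) ({                                   \
        static int warned;                                      \
        int fired = !!(cond);                                   \
        if (fired && !warned) {                                 \
                warned = 1;                                     \
                fprintf(stderr, "warning hit once: %s\n", #cond); \
        }                                                       \
        fired;                                                  \
})

/* Count how often the condition fires; the warning prints only once. */
int count_warn_fires(const int *vals, int n)
{
        int fires = 0, i;

        for (i = 0; i < n; i++)
                if (warn_on_once(vals[i] != 0))
                        fires++;
        return fires;
}
```

The point of the pattern for debugging here is that a single diagnostic line survives in the logs without taking the machine down, which is exactly what you want when chasing a rare suspend/resume path.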
Get it here either as a full patch or split out patches:
2.6.36-ck2
I've also updated the BFS 357 patch on my website with the one line fix, and given that it's still the same BFS otherwise, all I've done is change the filename rather than bump the BFS version number.
Get it here:
2.6.36-sched-bfs-357-1.patch
On an unrelated note, I finally got off my arse and fixed the long-standing lrzip install bug that occurred when DESTDIR was set. It was a two-line fix, for a bug reported by about a billion different people. I bumped the version up to 0.47 without any other changes.
lrzip on freshmeat
I've been looking at the lrzip code recently, just for a change of pace. One of the things that bugged me about it is that a while back I upgraded it to be truly 64 bit, in that it accepts 64-bit sized files and can make the most of all the ram on any size of 64-bit machine, but that caused regressions in compressed file sizes and speed. What this means long term is that as file sizes get bigger, and machines get more and more ram, the compression of lrzip will get more and more impressive. However, the reason it bugs me is that all that 64-bit addressing costs a lot of space. So I'm now working on a scaled bit-size compression format for the next version, which will only use as many bits as the compression window needs. I've seen only modest improvements so far, but they're worth chasing. More on that front if I make progress soon.
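To illustrate the scaled bit-size idea, here is a minimal sketch, with hypothetical function names that are not lrzip's actual format: an offset into a compression window of size w never needs more than ceil(log2(w)) bits, so fixed 64-bit fields waste most of their width on small windows.

```c
#include <stdint.h>

/* Minimum number of bits needed to represent max_value.
 * Hypothetical helper for illustration; not lrzip's real code. */
unsigned bits_for(uint64_t max_value)
{
        unsigned bits = 1;

        while (max_value >>= 1)
                bits++;
        return bits;
}

/* Bits saved per encoded offset versus a fixed 64-bit field,
 * for a window of window_size bytes (offsets 0..window_size-1). */
unsigned bits_saved(uint64_t window_size)
{
        return 64 - bits_for(window_size - 1);
}
```

For a 1MB window an offset needs only 20 bits, saving 44 bits per offset over a fixed 64-bit field, which is the kind of overhead the post is describing; the actual on-disk format would of course differ.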
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Friday, 29 October 2010
Monday, 25 October 2010
Minor BFS 357 Bug(?) on 2.6.36
I received a bug report from someone running BFS on 2.6.36. They hit this BUG_ON in the new code present only in 2.6.36:
+static void try_to_wake_up_local(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ bool success = false;
+
+ BUG_ON(rq != this_rq());
This looks fairly straightforward, and this code path is only used by the new worker code in 2.6.36. However, it shouldn't be hit unless something else is wrongly calling this function (indirectly via schedule()). It seems they hit it via the iwlwifi code. No idea how, but it's actually harmless to wake up a task on another runqueue in BFS, so simply removing this BUG_ON fixes it.
Here's a patch to apply to BFS if you're running it on 2.6.36 and run into this bug:
bfs357-worker_fix.patch, which just removes the BUG_ON. It makes me wonder, however, whether this bug exists in mainline. Only those who hit it can confirm or rule that out by running vanilla 2.6.36, and there's no point reporting it upstream when it's so vague.
Thursday, 21 October 2010
2.6.36-ck1
I'll keep it brief by quoting the email I sent to lkml, just to get this announcement out quickly.
These are patches designed to improve system responsiveness and interactivity
with specific emphasis on the desktop, but suitable to any workload.
Apply to 2.6.36:
patch-2.6.36-ck1.bz2
Broken out tarball:
2.6.36-ck1-broken-out.tar.bz2
Discrete patches:
patches
All -ck patches:
patches
Web:
kernel.kolivas.org
Code blog when I feel like it:
ck-hack.blogspot.com
Each discrete patch contains a brief description of what it does at the top of the patch itself.
The most significant change is an updated BFS cpu scheduler to BFS 357 (Magnum). It should pretty much behave like the older one, but is tighter with respect to keeping to its deadlines, and will continue to behave fairly when load is more than 8 * number of CPUs.
The other addition is to decrease the default dirty_ratio.
The rest is a resync only since 2.6.35-ck1.
Patch series:
2.6.36-sched-bfs-357.patch
sched-add-above-background-load-function.patch
mm-make_swappiness_really_mean_it.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
ck1-version.patch
Those following the development of the patches for interactivity at massive load: I have COMPLETELY DROPPED them, as they introduce regressions at normal workloads, and I cannot under any circumstances approve changes that improve behaviour at ridiculous workloads while affecting regular ones. I still see precisely zero point in optimising for absurd workloads. Proving how many un-niced jobs you can throw at your kernel compiles is not a measure of one's prowess. It is just a mindless test.
Enjoy!
Monday, 18 October 2010
Other schedulers? Illumos?
I've been asked in past interviews what I've thought about schedulers from other operating system kernels that I may have studied, and whether I had any interest in hacking on them. Perhaps it's somewhat of a surprise to learn that until recently, I hadn't really looked at any other operating system's scheduler. Most of what I've thought up to do with scheduling is really pretty simple stuff to manage the most common scheduling policy, part of it thought up myself and part based on accounts of various approaches to scheduling encountered in my travels. So my approach to things is pretty much "well this hack makes sense", based on some pretty basic principles to do with latency, fairness and CPU throughput.
A couple of days ago, I was asked what I thought it would take to port BFS to the opensolaris spinoff thingy Illumos. I don't pretend to even begin to understand the politics behind things, so I have zero informed opinion about the whole Sun thingy. I pretty much had no idea how much effort it would take, but knowing BFS is simple code, I figure that once the kernel interfaces are understood, it should be portable to any other operating system. Of course it would be much easier if it was done by someone who already knew the kernel they were working with, rather than starting from scratch as I would be; but then, few people are likely to truly understand my intent with BFS unless they've studied it, no matter how simple the overall policy in BFS is.
With any existing modern kernel scheduler, there's a very good chance that an awful lot of effort has gone into making the scheduler scalable on SMP, and one of the very first things to do there is to give the scheduler multiple runqueues, one per CPU, or possibly one per physical package, shared amongst cores (much less common). Then, as CPU architectures become more complex, with non-uniform memory (NUMA), shared physical packages (multicore), multiple threads per core (SMT) and combinations thereof, special cases are made for balancing where tasks go and when, to make the most of these, but usually the special casing is only about improving throughput.
BFS is in concept two distinct entities. The first is avoiding all heuristics for determining what is interactive and what is not, instead using the simplest possible foreground-background virtual deadline to determine who goes next, with a deterministic round-robin system for sharing CPU time between all tasks. This is the overall policy for managing SCHED_NORMAL, also known in the past as SCHED_OTHER, and it is responsible for the good interactivity, low latency and responsiveness. (I've explained in many previous articles why I believe all heuristics are bad.) The second entity in BFS is to throw out complex separate runqueue designs and special balancing, and have one runqueue to rule them all. This is a step backwards in possible scalability, and depending on the workload it becomes more and more of an issue the more CPUs you add, but it is not relevant on any machine a normal human being can afford for a home desktop, even an extremely powerful one (note: whether this is still true 10 years from now remains to be seen, as I still think BFS will be fine up to many threads on multicore).
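The two ideas above can be sketched in a drastically simplified userspace toy: one global runqueue, and the next task to run is simply whichever queued task has the earliest virtual deadline. The struct and function names here are illustrative only, not BFS's actual data structures.

```c
#include <stddef.h>

/* Toy illustration of BFS's core policy: a single shared runqueue,
 * scanned for the task with the earliest virtual deadline.
 * Names are made up for illustration; this is not BFS code. */
struct toy_task {
        int id;
        unsigned long long virtual_deadline;
};

/* O(n) scan of the single runqueue. BFS accepts this linear cost
 * in exchange for having no per-CPU queues to balance between. */
struct toy_task *pick_next(struct toy_task *rq, size_t nr_queued)
{
        struct toy_task *best = NULL;
        size_t i;

        for (i = 0; i < nr_queued; i++)
                if (!best || rq[i].virtual_deadline < best->virtual_deadline)
                        best = &rq[i];
        return best;
}
```

The linear scan is exactly why the single-runqueue design trades scalability at very high CPU counts for simplicity and global fairness on desktop-sized machines.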
So to port BFS to another kernel would be porting two different concepts. The first is simply the policy for managing SCHED_OTHER tasks. This is a fairly portable concept and likely can be leveraged into just about any other scheduler without too much effort, provided someone knows the kernel they're working with. The second part, however, would involve carving out large chunks of the existing complex system and replacing it with a rather simple design. As it is, if you remove the CGROUPS code as well as the CFS code that BFS replaces in the mainline linux kernel, you'd end up with almost 20,000 lines less code. This is pretty drastic by anyone's standards, since it replaces the scheduler en bloc. The current BFS patch does not actually remove these lines of code, though, as many people requested I make it a configurable patch at build time. So it basically compiles out all that code instead of removing it.
Now one could make BFS plug into a "pluggable scheduler", and some of you may remember I posted patches for just such a feature for the linux kernel about 4 years ago. However, "pluggable" is a relative concept, depending on just how much of the scheduler you replace, and BFS replaces most of it. The current mainline kernel has pluggable policies, but isn't really pluggable on such a low level that BFS can happily live beside it (and the truth is that most of us don't "choose" a policy when we run our apps, they all run SCHED_OTHER unless we explicitly set them otherwise). The same goes for any other kernel, and the Illumos one is a perfect example since it has 4 different schedulers "plugged" into it, but probably not pluggable in the sense of "replaceable".
So anyway, where was this post going? Today I decided I'd have a look through the opensolaris code. Not because I particularly want to port BFS to it, but because someone else said they want to do so, and out of morbid curiosity of my own to see what it was like.
The summary of my impression was that I was... surprised. Now I don't claim to be any kind of expert on code per se. I most certainly have ideas, but I just hack together my ideas however I can dream up that they work, and I have basically zero formal training, so you should really take whatever I say about someone else's code with a grain of salt. Well, anyway, the code, as I saw it, was neat. Real neat. Extremely neat. In fact, I found it painful to read after a while. It was so neatly laid out that I found myself admiring it. It seems to have been built like an aircraft. It has everything that opens and shuts, has code for just about everything I've ever seen considered in a scheduler, and it's all neatly laid out in clean code and even comments. It also appears to have been coded with an awful lot of effort to ensure it's robust and measurable, with checking and tracing elements at every corner. I started to feel a little embarrassed by what we have as our own kernel. The more I looked at the code, the more it felt like it pretty much did everything the Linux kernel has been trying to do for ages. Not only that, but it's built like an aircraft, whereas ours looks like a garage job with duct tape by comparison.
As an aside, I did google a few terms they used which I hadn't seen before, and I was more than a little disappointed to find patents regarding the names... Sigh.
Now this would be a great time to take my comments out of context without reading on. The problem is that here was a scheduler that did exactly what I hate about what the Linux kernel scheduler is becoming. It's a monstrosity of epic proportions, and as far as an aircraft goes, it's like taking an Airbus A380 on a short joyride if you're running it on a desktop. It looks like a good, nay, great design for a massive airliner. By looking at it alone, I haven't got the foggiest what it might run like on a desktop. Now since I'm full of opinion and rhetoric and don't ever come through with any substance (maybe writing my own scheduler invalidates that?), I'm going to also say I can't even be bothered trying it, for you to confirm your suspicions about me.
So what do I think of it now? It looks like an excellent design for a completely different purpose. It's built like a commercial design for commercial purposes that have very different requirements than what most of us use Linux for, but it does appear to have been done so very well. It looks like a goddamn Star Destroyer, and the Linux kernel (scheduler) suddenly looks like the Millennium Falcon. Real fast, but held together with duct tape, and ready to explode at any minute. Or to put it another way, my wife used to work for a local automotive manufacturer (in Australia), and brought home the latest (American) Cadillac they were playing with at her work about 10 years ago. When I stepped into it, it looked and felt like a lounge room! It didn't feel at all like stepping into a car. It had everything that opens and shuts, and every control you'd imagine when sitting in your lounge room with your dining setting and entertainment system. Then I drove it around the block, and it had the performance, handling and braking of a lounge room too.