Announcing a resync and update of the BFS CPU scheduler for linux-3.11
BFS by itself:
3.11-sched-bfs-441.patch
Full -ck1 patchset including separate patches:
3.11-ck1
Apart from the usual resync to keep up with the mainline churn, there are a few additions from BFS 0.440. A number of changes dealing with wake lists as done by mainline were added that were missing from the previous code. There is a good chance that these were responsible for a large proportion of the suspend/resume issues people were having with BFS post linux 3.8. Of course I can't guarantee that all issues have been resolved, but it has been far more stable in my testing so far.
The other significant change is to check for throttled CPUs when choosing an idle CPU to move a process to, which should impact the behaviour and possibly throughput when using a scaling CPU governor, such as ondemand.
Those of you still using the evil proprietary Nvidia binary driver (as I still do) will encounter some issues and will need to use a patched pre-release driver from them if you build it yourself, until they release a new driver.
That is all for now.
Enjoy!
お楽しみください
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Showing posts with label ondemand. Show all posts
Showing posts with label ondemand. Show all posts
Monday, 9 September 2013
Friday, 1 April 2011
BFS 0.372 test patch
Another day, another BFS test release. This one builds on the ideas in the 0.371-test3 patch I posted about since they're proving very positive. No April fools here. This looks like it's kicking arse.
Apply to a BFS 0.363 patched kernel such as 2.6.38-ck1:
bfs363-372-test.patch
Changelog from the patch:
---
Add a "sticky" flag for one CPU bound task per runqueue which is used to flag
the last cache warm task per CPU. Use this flag to softly affine the task to
the CPU by not allowing it to move to another CPU when a scaling CPU frequency
governor is in use. This significantly improves throughput at lower loads by
allowing tasks to cluster on CPUs, thereby allowing the scaling governor to
speed up only that CPU. This should aslo save power. Use the sticky flag to
determine cache distance in earliest_deadline_task task and abolish the
cache_distance function entirely. This is proven as, if not more, effective.
Add helpers to the 3 scaling governors to tell the scheduler when a CPU is
scaling.
Replace the frequent use of num_online_cpus() with a grq.noc variable that
is only updated when the number of online cpus changes.
Simplify resched_best_idle by removing the open coded for_each_cpu_mask as
it was not of proven benefit.
Remove warnings in try_to_wakeup_local that are harmless or never hit.
Clear the cpuidle_map bit only when edt doesn't return the idle task.
Abolish the scaled rr_interval by number of CPUs and now just use a fixed
nominal 6ms everywhere. The improved cache warmth of the sticky flag makes
this unnecessary and allows us to lower overall latencies on SMP by doing so.
---
Please test this one thoroughly. It's very stable and now heavily tested, but I won't announce any new "release" till it's been tested for maybe 5 days or more. It appears better in all workloads and with and without cpu frequency governors. Again, only SMP will benefit from this patch, but it should change behaviour in all SMP now, not just with ondemand. However, the scaling governors should show the most improvement.
The best example (with ondemand) was a single threaded cpu bound workload that took 126 seconds to complete on an i7 2 core/4 thread machine that now takes 91.5 seconds. The 2 threaded workload dropped from 66 seconds to 51.5 seconds. Note that this more or less addresses a regression in BFS behaviour with cpu frequency scaling on SMP, but it's also been an opportunity to improve behaviour elsewhere.
Apply to a BFS 0.363 patched kernel such as 2.6.38-ck1:
bfs363-372-test.patch
Changelog from the patch:
---
Add a "sticky" flag for one CPU bound task per runqueue which is used to flag
the last cache warm task per CPU. Use this flag to softly affine the task to
the CPU by not allowing it to move to another CPU when a scaling CPU frequency
governor is in use. This significantly improves throughput at lower loads by
allowing tasks to cluster on CPUs, thereby allowing the scaling governor to
speed up only that CPU. This should aslo save power. Use the sticky flag to
determine cache distance in earliest_deadline_task task and abolish the
cache_distance function entirely. This is proven as, if not more, effective.
Add helpers to the 3 scaling governors to tell the scheduler when a CPU is
scaling.
Replace the frequent use of num_online_cpus() with a grq.noc variable that
is only updated when the number of online cpus changes.
Simplify resched_best_idle by removing the open coded for_each_cpu_mask as
it was not of proven benefit.
Remove warnings in try_to_wakeup_local that are harmless or never hit.
Clear the cpuidle_map bit only when edt doesn't return the idle task.
Abolish the scaled rr_interval by number of CPUs and now just use a fixed
nominal 6ms everywhere. The improved cache warmth of the sticky flag makes
this unnecessary and allows us to lower overall latencies on SMP by doing so.
---
Please test this one thoroughly. It's very stable and now heavily tested, but I won't announce any new "release" till it's been tested for maybe 5 days or more. It appears better in all workloads and with and without cpu frequency governors. Again, only SMP will benefit from this patch, but it should change behaviour in all SMP now, not just with ondemand. However, the scaling governors should show the most improvement.
The best example (with ondemand) was a single threaded cpu bound workload that took 126 seconds to complete on an i7 2 core/4 thread machine that now takes 91.5 seconds. The 2 threaded workload dropped from 66 seconds to 51.5 seconds. Note that this more or less addresses a regression in BFS behaviour with cpu frequency scaling on SMP, but it's also been an opportunity to improve behaviour elsewhere.
Saturday, 26 March 2011
BFS and optimising a global scheduler for per-cpu frequency scaling
TL;DR BFS now faster with cpu frequency scaling
The last update to BFS brought some performance enhancements to threaded CPUs like the i5/i7 etc. One thing that has troubled me for some time with BFS has been that CPU frequency scaling in the form of ondemand and friends, is based on the "local load" on a CPU, and scales one CPU up in the presence of load on that CPU alone, yet BFS has almost no notion of "local load" since it's a global scheduler. While there is some rudimentary code in BFS to say whether one CPU is currently slightly more busy than the total load, the fact that load is more a total system concept rather than per-CPU in BFS means that the CPU frequency scaling is not going to work as well. So I set out to try and tackle that exact issue with this latest test patch.
bfs363-test2.patch
In this patch, a special deadline selection process is invoked only when a "scaling" type cpu frequency governor is invoked on a per-cpu basis. What this special deadline selection process does is has a kind of soft affinity for CPU bound tasks (i.e. ones that don't ever sleep) and tries to keep them on one CPU over and above the normal amount in BFS, to the point of allowing a CPU to go idle instead of pulling that task away from its last CPU. It does this by using a simple counter for the number of times the task would otherwise have been selected by another CPU and keeps it under the total number of other CPUs. If it goes over this counter it will allow it to move. It still moves very very easily compared to the mainline scheduler which has a very strong notion of locality, and works the other way around (moving tasks only if things look really badly unbalanced). Freshly waking tasks also completely ignore this and just work as per usual BFS, thereby maintaining BFS' very low latency, and if a non-scaling governor is used (currently performance or powersave), BFS reverts to the regular deadline selection. This can increase the latency slightly of CPU bound tasks, but because the CPU they're bound to will now more easily increase its frequency with (for example) the ondemand governor, this latency is offset by faster completion of its work.
All in all it addresses a weakness in the fact that the CPU frequency scaling in mainline is designed around per-CPU scheduling and not global scheduling which BFS does. The most extreme example of this in testing on a heavily threaded multicore multithread CPU (i7) shows a 30% increase in throughput on a single threaded task when run on a 4 thread CPU. That's a massive change. Not only will the workload complete faster, but almost certainly battery usage will be reduced. In practice most workloads are more mixed than this, so don't expect a sudden 30% overall boost in performance. It also has no effect without cpu frequency scaling, and no effect on single CPU machines, but the more cores and threads you have, the greater the benefit. Since some new Linux distributions now no longer even allow you to change the governor and just set it to ondemand by default, this change is something that will be essential. After a bit of testing, and possibly more tweaking, it will constitute the changes for the next BFS release.
Please test and report back how this patch performs for you if you have more than one core and run CPU frequency scaling.
The last update to BFS brought some performance enhancements to threaded CPUs like the i5/i7 etc. One thing that has troubled me for some time with BFS has been that CPU frequency scaling in the form of ondemand and friends, is based on the "local load" on a CPU, and scales one CPU up in the presence of load on that CPU alone, yet BFS has almost no notion of "local load" since it's a global scheduler. While there is some rudimentary code in BFS to say whether one CPU is currently slightly more busy than the total load, the fact that load is more a total system concept rather than per-CPU in BFS means that the CPU frequency scaling is not going to work as well. So I set out to try and tackle that exact issue with this latest test patch.
bfs363-test2.patch
In this patch, a special deadline selection process is invoked only when a "scaling" type cpu frequency governor is invoked on a per-cpu basis. What this special deadline selection process does is has a kind of soft affinity for CPU bound tasks (i.e. ones that don't ever sleep) and tries to keep them on one CPU over and above the normal amount in BFS, to the point of allowing a CPU to go idle instead of pulling that task away from its last CPU. It does this by using a simple counter for the number of times the task would otherwise have been selected by another CPU and keeps it under the total number of other CPUs. If it goes over this counter it will allow it to move. It still moves very very easily compared to the mainline scheduler which has a very strong notion of locality, and works the other way around (moving tasks only if things look really badly unbalanced). Freshly waking tasks also completely ignore this and just work as per usual BFS, thereby maintaining BFS' very low latency, and if a non-scaling governor is used (currently performance or powersave), BFS reverts to the regular deadline selection. This can increase the latency slightly of CPU bound tasks, but because the CPU they're bound to will now more easily increase its frequency with (for example) the ondemand governor, this latency is offset by faster completion of its work.
All in all it addresses a weakness in the fact that the CPU frequency scaling in mainline is designed around per-CPU scheduling and not global scheduling which BFS does. The most extreme example of this in testing on a heavily threaded multicore multithread CPU (i7) shows a 30% increase in throughput on a single threaded task when run on a 4 thread CPU. That's a massive change. Not only will the workload complete faster, but almost certainly battery usage will be reduced. In practice most workloads are more mixed than this, so don't expect a sudden 30% overall boost in performance. It also has no effect without cpu frequency scaling, and no effect on single CPU machines, but the more cores and threads you have, the greater the benefit. Since some new Linux distributions now no longer even allow you to change the governor and just set it to ondemand by default, this change is something that will be essential. After a bit of testing, and possibly more tweaking, it will constitute the changes for the next BFS release.
Please test and report back how this patch performs for you if you have more than one core and run CPU frequency scaling.
Subscribe to:
Posts (Atom)