A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Thursday, 26 May 2011
2.6.39-ck1 unstable
As much as I hate to say this, I have to give up on 2.6.39 for now. I just don't have the time or energy to fix this. I'm grateful for all your testing, but it will have to go on hold, and I'll support the .38 kernels in the meantime until I have a revelation of some sort, or help from someone who also knows kernel internals.
Thursday, 19 May 2011
2.6.39-ck1
These are patches designed to improve system responsiveness and interactivity,
with specific emphasis on the desktop, but suitable for any commodity hardware workload.
Apply to 2.6.39:
patch-2.6.39-ck1.bz2
Broken out tarball:
2.6.39-ck1-broken-out.tar.bz2
Discrete patches:
patches
Ubuntu packages:
http://ck.kolivas.org/patches/Ubuntu%20Packages
All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/
BFS by itself:
http://ck.kolivas.org/patches/bfs/
Web:
http://kernel.kolivas.org
Code blog when I feel like it:
http://ck-hack.blogspot.com/
Each discrete patch contains a brief description of what it does at the top of
the patch itself.
The most substantial change since the last public release is a major version upgrade to the BFS CPU scheduler version 0.404.
Full details of the most substantial changes, which went into version 0.400, are in my blog here:
http://ck-hack.blogspot.com/2011/04/bfs-0400.html
This version exhibits better throughput, better latencies, better behaviour with scaling CPU frequency governors (e.g. ondemand), and better use of turbo modes on newer CPUs. It also addresses a long-standing bug that affected all configurations but was only demonstrable on lower Hz configurations (e.g. 100Hz), where it caused fluctuating performance and latencies; mobile configurations (e.g. Android at 100Hz) therefore also perform better. The default round robin interval on all hardware is now set to 6ms (i.e. tuned primarily for latency). This can easily be modified with the rr_interval sysctl in BFS for special configurations (e.g. increased to 300 for encoding / folding machines).
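As a rough sketch of the tuning mentioned above, assuming BFS exposes the interval through the standard sysctl interface as kernel.rr_interval (values illustrative, pick what suits your workload):

```shell
# Check the current round robin interval (default 6ms, tuned for latency)
sysctl kernel.rr_interval

# Raise it for a throughput-oriented encoding / folding machine
sysctl -w kernel.rr_interval=300

# Persist the setting across reboots
echo "kernel.rr_interval = 300" >> /etc/sysctl.conf
```

A larger interval trades scheduling latency for fewer context switches, which suits long-running batch jobs rather than interactive desktops.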
Performance of BFS has been tested on hardware ranging from low-power single-core machines through variously configured SMP hardware, both threaded and multicore, up to 24x AMD. The 24x machine exhibited better kbuild throughput at optimal load (from make -j1 up to make -j24); beyond that level of load, performance did not match mainline. On folding benchmarks at 24x, BFS was consistently faster for the unbound (no CPU affinity in use) multi-threaded version. On 6x hardware, kbuild and x264 encoding benchmarks were better than mainline at all levels of load, in both throughput and latency in the presence of the workloads.
For 6 core results and graphs, see:
benchmarks 20110516
(desktop = 1000Hz + preempt, server = 100Hz + no preempt):
Here are some desktop config highlights:
Throughput at make -j6:
Latency in the presence of x264 ultrafast:
Throughput with x264 ultrafast:
This is not by any means a comprehensive performance analysis, nor is it meant to claim that BFS is better than mainline under all workloads and hardware. These are simply easily demonstrable advantages on some very common workloads on commodity hardware, and they constitute a regular part of my regression testing. Thanks to Serge Belyshev for the 6x results, statistical analysis and graphs.
Other changes in this patch release include an updated version of lru_cache_add_lru_tail (the previous version did not work entirely as planned), dropping the default dirty ratio to the extreme value of 1 in decrease_default_dirty_ratio, and dropping the cpufreq ondemand tweaks, since BFS now detects scaling CPUs internally and works with them.
Full patchlist:
2.6.39-sched-bfs-404.patch
sched-add-above-background-load-function.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
ck1-version.patch
Please enjoy! (お楽しみください)
--
-ck
EDIT4: For those having hangs, please try this patch on top of ck1:
bfs404-test6.patch
Monday, 16 May 2011
BFS 0.404 page that really exists
There was one regression going into BFS 0.403, and that was expanding the sticky flag to cache warm as well. Not only did it not improve throughput on anything I could measure, it caused latency regressions, so I've backed it out. The only other change going into 0.404 was fixing a couple of unused variable warnings reported by a commenter on this blog. So I consider this patch stable now, and pretty much how it will go into 2.6.39 final when it comes out.
Get it here:
2.6.39-rc7-sched-bfs-404.patch.lrz
Saturday, 14 May 2011
BFS 0.403 test for 2.6.39-rc7
BFS 0.402 test has proven very stable on 2.6.39-rc7, but a minor issue came up with the new accurate IRQ accounting: some CPU time did not get accounted. So I went in and revised the way it works to be cheaper and more accurate. There was also a problem where the total CPU usage did not always add up to 100%: the small inaccuracies in each respective CPU usage category (user, system, wait, etc.) were exacerbated when added together. I've put in a total CPU percentage counter that checks the total adds up to 100 and, if not, rounds the values up so they add up to 100%.
There was also a change I had been considering for the sticky flag, which is used to minimise task movement between CPUs, and I've committed it to 403 test. Instead of a binary on/off flag, it is now a stepped flag going from CACHE_COLD through CACHE_WARM to CACHE_HOT. Any task that is knocked off a CPU but is still waiting for more CPU time is immediately labelled hot. Only one task is considered hot at a time; previously, as soon as a new cache-hot task appeared, the sticky flag was cleared. Now, instead of being cleared, it is set to warm, and only cleared to cold when the task sleeps. Forked child processes are now also labelled cache warm, since they share many structures with their parent process. Any task that is cache warm or cache hot is biased against moving to another CPU by having its relative deadline offset. A cache-hot task will also not move to a different CPU if that CPU is scaled down in speed (as, for example, when the ondemand cpufreq governor slows it down). This new change should improve throughput more in the overloaded case (jobs > CPUs), but that's just a generic comment as I haven't benchmarked it yet.
Anyway give the new BFS a try. Everything appears to be running nice and stable, and as a bonus, my feel-good-o-meter is reading quite high with the upcoming 2.6.39! The magnitude of changes going into it seemed a lot less than previous kernels and I've had no issues with the -rc7 version so far.
As per previously, I've compressed the patch with lrzip as part of my evil plot to force you all to use it. Get it here:
2.6.39-rc7-sched-bfs-403-test.patch.lrz
Enjoy, and please report back if you try it!