New -ck version for the latest mainline linux kernel, 3.3.0:
3.3-ck1
Changes since 3.2.0-ck1:
New BFS version 0.420 AKA smoking as discussed here:
bfs-420-aka-smoking-for-linux-33
Includes one build bugfix for UP compared to that first release candidate.
Other changes:
These patches have been dropped:
-mm-background_scan.patch
-mm-lru_cache_add_lru_tail-2.patch
The Virtual Memory subsystem has changed so much that it's hard to know whether these patches still do what they were originally intended to do, or whether they are helpful any more. In the absence of any way to test their validity, it seems safer to just drop them.
The rest of the patchset is just a resync.
Enjoy!
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Saturday, 24 March 2012
Thursday, 22 March 2012
BFS 420 AKA smoking for linux 3.3
Here's the release candidate for the first stable release of BFS for linux 3.3. BFS hadn't received much attention lately so I felt a little guilty and worked on a few things that had been nagging me:
3.3-sched-bfs-420.patch
It is not clear whether the bug I posted about here is still present or not; given that it is extremely rare and hard to reproduce, it is impossible to know for sure. A number of minor issues were addressed in the code which hopefully have fixed it (and hardly anyone was affected anyway).
The changes since BFS version 0.416 include a fairly large architectural change just to bring the codebase in sync with 3.3, but none of the changes should be noticeable in any way. One change that may be user-visible is that high resolution IRQ accounting now appears to be on by default on x86 architectures. System time accounting in BFS is wrong without this feature enabled, so this should correct that problem.
Other changes:
416-417: A number of ints were changed to bools, which, though unlikely to have any performance impact, makes the code cleaner, and the compiled code often comes out different. rq_running_iso was converted from a function to a macro to avoid a separate function call, with its attendant overhead, when compiled in. requeue_task within the scheduler tick was moved under lock, which may prevent rare races. test_ret_isorefractory() was optimised. set_rq_task() was not being called on tasks being requeued within schedule(), which could have led to issues if the task ran out of timeslice during that requeue and should have had its deadline offset. The need_resched() check at the end of schedule() was changed to unlikely(), since it really is that. The scheduler version print function was moved to bfs.c to avoid recompiling the entire kernel whenever the version number changes.
417-418: Fixed a problem with the accounting resync for linux 3.3.
418-419: There was a small possibility that an unnecessary resched would occur in try_preempt if a task had changed affinity and called try_preempt with its ->cpu still set to the old CPU it could no longer run on, so try_preempt was reworked slightly. The deadline offset based on CPU cache locality for sticky tasks was reintroduced, in a way that is cheaper than how the deadline is currently offset elsewhere.
419-420: Finally rewrote the earliest_deadline_task code. This has long been one of the hottest code paths in the scheduler, and small changes that made it look nicer would often slow it down. I spent quite a few hours reworking it to use fewer gotos, disassembling the code to make sure it was actually getting smaller with every change. Then I wrote a scheduler-specific version of find_next_bit which could be inlined into this code, avoiding another function call in the hot path. The overall behaviour is unchanged from previous BFS versions, but initial benchmarking confirms slight improvements in throughput.
Now I'll leave it open for wider testing to confirm it's all good, and then I have to think about what I should do with the full -ck patchset. As I've said in numerous posts before, I'm no longer sure about the validity of some of the patches in the set, given all the changes to the virtual memory subsystem in the mainline kernel.
Tuesday, 20 March 2012
BFS issues and linux 3.3
Linux 3.3 is out, but I'm not releasing BFS for it at this stage. The reason is that a regression has been reported as showing up in BFS and it's proving hard to track down.
The issue involves a slowdown under load when the load is niced. The problem is that the slowdown does not occur all the time; I've only seen it in the wild once and have not been able to repeat it since. I've audited the code and have not yet found the culprit, but when it does happen it is very obvious, with the mouse stalling for seconds. The 'top' output is usually a giveaway that something has gone wrong: the 'PR' column should normally report values of 1-41 for a nice 19 load, but when the bug hits it will show much higher values, in the 42-81 range, which should not happen. Unfortunately, the best hint for me would be to find in which version of BFS it was introduced and look for the change responsible, and since I can't even reproduce the problem most of the time, I can't do this regression testing.
So I'm appealing to the BFS users out there, to see if anyone who hits this problem more regularly has the time to try older versions of BFS. By older versions of BFS, I don't mean the same version of BFS on older kernels, but the first version of BFS that was available for each kernel. Reproducing it requires running something continuously as a nice'd load, where the load is equal to the number of CPUs; for example, 'nice -19 make -j4' continuously in a kernel tree on a quad core machine. I'm hoping that someone out there is able to reproduce it and can do the regression testing. Thanks in advance. In the meantime, I'll keep auditing code and comparing new to old versions in the hope that something stands out.
EDIT: An alternative approach was to try moving to 3.3, making minor fixes along the way, to see if the problem persists. Consider this patch a pre-release for now (CPU accounting appears all screwy still):
EDIT2: Fixed CPU accounting, bumped version to 418:
3.3-sched-bfs-418.patch
Sunday, 18 March 2012
lrzip-0.612
Just for good measure, here's yet another lrzip release.
lrzip 0.612 on freecode
This time the main update is a new zpaq library back end, replacing the ageing zpipe code. There are a number of advantages to using the libzpaq library over the old zpipe code.
First, the old code required a FILE type stream, as it was written with stdio in mind, so it was the only compression back end that required some lesser-known but handy, (virtually) linux-only memory features like fmemopen, open_memstream and friends. These were not portable to OS X and others, so they were emulated on those platforms through the incredibly clunky use of temporary files on disk. Using the new library has killed off the need for these features, making the code more portable.
Second, the code is significantly faster, since it is the latest full C++ version of the zpaq code. Unfortunately it also means this part of lrzip takes a LOT longer to compile now, but that's not a big deal since you usually only compile it once ;)
Third, it supports 3 different compression levels, the highest of which is higher than lrzip previously supported. As lrzip uses 9 levels of compression, I've mapped the 3 zpaq levels to -L 1-3, 4-7 and 8-9, since -L 7 is the default and that provides the "mid level" compression from zpaq.
Finally, the beauty of the zpaq compression algorithm is that the reference decoder can decompress zpaq-compressed data of any profile. This means you can use the latest version of lrzip with compression -L 9 (max profile), yet it is still backward compatible with older 0.6x versions of lrzip, not requiring a minor version and file format bump. The release archive I provide of lrzip-0.612.tar.lrz is compressed with itself using the new max profile. Even though there is significantly more code than ever in the lrzip release tarball, it has actually shrunk for the first time in a while.
All that talk is boring, though, so let's throw around some benchmark results, which are much more fun.
From the original readme benchmarks, I had compressed the linux 2.6.37 tarball, so I used that again for comparison. Tests were performed on a quad core 3GHz Intel Core 2.
Compression  Size       Percentage  Compress  Decompress
None         430612480  100
7z           63636839   14.8        2m28s     0m6.6s
xz           63291156   14.7        4m02s     0m8.7s
lrzip        64561485   14.9        1m12s     0m4.3s
lrzip -z     51588423   12.0        2m02s     2m08s
lrzip -l     137515997  31.9        0m14s     0m2.7s
lrzip -g     86142459   20.0        0m17s     0m3.0s
lrzip -b     72103197   16.7        0m21s     0m6.5s
bzip2        74060625   17.2        0m48s     0m12.8s
gzip         94512561   21.9        0m17s     0m4.0s
As you can see, the improvements in speed of the rzip stage have made all the compression back ends pretty snappy, and most fun of all, lrzip -z on this workload is even faster on compression than the multithreaded 7z, and significantly smaller. Alas, the major disadvantage of zpaq remains that it takes about as long to decompress as it does to compress. However, with the trend towards more CPU cores as time goes on, one could argue that zpaq compression, as used within lrzip, is getting to a speed where it can be in regular use instead of just research/experimental use, especially when files are small like the lrzip distributed tarball I provide.
I also repeated my old classic 10GB virtual image benchmarks:
Compression  Size         Percentage  Compress Time  Decompress Time
None         10737418240  100.0
gzip         2772899756   25.8        05m47s         2m46s
bzip2        2704781700   25.2        16m15s         6m19s
xz           2272322208   21.2        50m58s         3m52s
7z           2242897134   20.9        26m36s         5m41s
lrzip        1372218189   12.8        10m23s         2m53s
lrzip -U     1095735108   10.2        08m44s         2m45s
lrzip -l     1831894161   17.1        04m53s         2m37s
lrzip -lU    1414959433   13.2        04m48s         2m38s
lrzip -zU    1067169419   9.9         39m32s         39m46s
Using the "U"nlimited "z"paq options, it is actually faster than xz now. Note that about 30% of this image is blank space, but that's a not-uncommon type of virtual image; if it were full of data, the difference would be less. Anyway, I think it's fair to say that zpaq is worth watching in the future. Edit: I've sent Matt Mahoney (the zpaq author) the latest lrzip benchmarks, including how it performs on the large file benchmark, and he's updated his site: http://mattmahoney.net/dc/text.html I think it's performing pretty well for a general compression utility.