Tuesday, 31 July 2012

BFS and -ck delays for linux-3.5.0

Once again I find myself writing a post saying there will be delays with the resync of BFS and -ck for the new linux kernel. This time the reason is a rather unexpected development. As you may have read on this blog last year, I was invited to interview with Google for a job as a software engineer, and in the end was turned down due to lack of adequate breadth of knowledge. This was probably for the best anyway, since I have a full time unrelated career and the jump would have been too great. Anyway, a small company noticed the OpenCL and bitcoin work I had done on cgminer and asked if I was interested in writing some software for them. The work involves writing OpenCL frameworks so they can provide distributed computing capability to clients. They were quite happy to forego the regular interview details, or pretty much anything else normally involved in employing someone, and before long we were talking contracts instead. Since the work itself actually looked like a lot of fun, I decided to take the opportunity.

Anyway, long story short, I'm doing a little bit of contract work for them and my kernel work will take a slightly lower priority in the meantime. I'm not abandoning it, but the next release will be delayed some more. Apologies for any inconvenience this causes in the interim.

Sunday, 8 July 2012

lrzip-0.613

lrzip 0.612 has been out in the wild for a while now and the good news is that there have been very few bug reports in that time. After letting enough issues accumulate in my inbox, I've created a pure-bugfix maintenance release in version 0.613:

long-range-zip 0.613

One bug of note was that the md5 calculation was wrong on files with compressed blocks greater than 4GB in size. This was very suspicious for a 32-bit overflow error. Indeed, Serge Belyshev did some excellent detective work and found the culprit in the glibc implementation of md5, which lrzip uses. This only affects the md5 library components, not the md5sum command line utility, which uses a different rolling algorithm, so glibc userspace never hit it. The bug in question was amusing in the way it shows one of the many naive ways we dealt with 32-bit limitations in the past: it assumed anything larger than a 32-bit chunk was just 2^31 + (chunk size modulo 2^31), which means it could never work with a chunk larger than 2^32. The fix has been pushed upstream and is now incorporated into lrzip.

Another bug, reported on this blog by a commenter, was the creation of corrupt very small archives (less than 64 bytes). This has been fixed by disabling the back end compression when the chunk is less than 64 bytes and just using the rzip first stage.

A lot of the other work in this release was just getting it to compile on osx. Numerous issues showed up as always, and I didn't have access to an osx machine to fix them for the previous release. This time I used my wife's laptop ;) . One of the issues, for example, was that osx didn't see itself as #ifdef unix, which I thought was a little amusing. Another unexpected surprise was that the default osx filesystem is not case sensitive, which caused a conflict between lrzip.h and Lrzip.h. Alas I have no other BSDs to try compiling on, so I'm not sure if they're fixed with this release.

Interestingly, I still have to disable md5 calculation on the osx build. The md5 is calculated the same on compression and decompression within lrzip, but it disagrees with the result returned by the ports version of md5! This defeats the whole purpose of including md5, since the point is to have a command line result to compare against. I'm guessing there's an endianness dispute in there somewhere and I haven't ever tracked it down, since osx has done an endian flip in the past. lrzip still uses crc32 checking of each block internally, so it's not as though there is no integrity checking.

Finally, what would a release be without some new benchmarks? Nothing performance-wise has changed in lrzip since the last version, but I now have access to a machine with a 12-thread CPU and 32GB of ram, so I did some quick benchmarks with the classic 10GB virtual image I've been using till now.


Compression  Size           Percentage  Compress Time  Decompress Time
None         10737418240     100.0
gzip          2772899756      25.8       3m56s          2m15s
pbzip2        2705814394      25.2       1m41s          1m46s
lrzip         1095337763      10.2       2m54s          2m21s
Note that with enough ram and CPU, lrzip is actually faster than gzip (which compresses in place) on compression and comparable on decompression, despite a huge increase in compression ratio. pbzip2 is faster than both, but its compression is almost no better than gzip's.

Tuesday, 3 July 2012

BFS 424, linux-3.4-ck3

As seen on this blog previously, a bug showed up in 3.4-ck2/BFS 423 to do with unplugged I/O management that would lead to severe stalls/hangs. I'm releasing BFS 424 officially and upgrading 3.4-ck2 to 3.4-ck3, incorporating just this one change.

BFS 424:
3.4-sched-bfs-424.patch

3.4-ck3:
3.4-ck3/


Others on -ck2 can simply apply the incremental patch to be up to date.
3.4bfs423-424.patch

Enjoy!
お楽しみください

Sunday, 1 July 2012

BFS 424 test

A couple of bug reports mostly related to disk I/O seem to have cropped up with BFS 423/3.4-ck2. The likely culprit seems to be the plugged I/O management within schedule() that I modified going from BFS 420 to BFS 423, when I adopted mainline's approach to managing the plugged I/O. It appears that the mechanism I had put in place for BFS was the correct one, and mainline's approach does not work (for BFS) so I've backed out that change and increased the version number. Here is the test patch:

bfs423-424.patch

Those with issues of any sort related to BFS 423 or ck2, please test this patch on top of the previous BFS patched kernel. Thanks!

Monday, 11 June 2012

Upgradeable rwlocks and BFS

I've been experimenting with improving the locking scheme in BFS and had a few ideas. I'm particularly attached to the global runqueue that makes BFS what it is, and obviously one queue with one lock protecting all its data structures, accessed by multiple CPUs, will end up having quite significant scalability limits with many CPUs. Much like a lot of scalability work, it tends to have the opposite effect on smaller hardware (i.e. the more scalable you make something, the more it will tend to harm the smaller hardware). Fortunately most of the scalability issues with BFS are pretty much irrelevant until you have more than 16 logical CPUs, but we've now reached the point where 8 logical CPUs is not unusual for a standard PC. Whether we actually need so many logical CPUs and can use them appropriately is an argument for a different time and place, but they're here. So I've always had at the back of my mind how I might go about making BFS more scalable in the long term, and locking is one obvious area.

Unfortunately, time and realistic limitations mean I'm unlikely to ever do anything on a grand scale or be able to support it. However, I had ideas for how to change the locking long-term, though it would require lots of incremental steps. After all that rambling, this post is about the first of those steps. Like some of the experimental steps of the past (such as skip lists), there is absolutely no guarantee that these are worth pursuing.

I've implemented a variant of the read/write locks used in the kernel, better suited to replacing the spinlock that protects all the global runqueue data. The problem with read/write locks is that they favour readers over writers, and you cannot upgrade or downgrade them. Once you have grabbed one, if you drop the lock and then try to grab the other variant (e.g. going from read to write), the data is no longer guaranteed to be the same. What I've put together (and I'm not remotely suggesting this is my original idea; I'm sure it's been done elsewhere) are upgradeable read/write locks where you can take either the read lock, the write lock, or an upgradeable variant. Upgradeable locks can be upgraded or downgraded to write or read locks, and write locks can be downgraded to read locks, but read locks cannot be changed. These URW locks unfortunately carry more overhead than either spinlocks or normal rwlocks, since they're actually just combinations of spinlocks and rwlocks. However, the overhead of the locks themselves may be worthwhile if it allows you to convert otherwise locked data into sections of parallel read code.

So here is a patch that implements them and can be put into pretty much any recent linux kernel version:
urw-locks.patch

Note that the patch does nothing by itself, but here is a patch to add on top of it for BFS 423 that modifies the global runqueue to use the URW locks and implements some degree of read/write separation that did not exist with the regular spinlock:
bfs423-grq_urwlocks.patch

It's rather early code but I've given it fairly thorough testing at home and it at least seems to be working as desired. On the simplest of preliminary testing I'm unable to demonstrate any throughput advantage on my quad core hardware, but reassuringly I also find no disadvantage. Whether this translates to some advantage on 8x or 16x is something I don't have the hardware to test for myself (hint to readers).

Note that even if this does not prove to be any significant throughput gain, then provided it is not harmful, I hope to eventually use it as a stepping stone to a grander plan I have for the locking and scalability. I don't like vapourware, and since I haven't finalised exactly how I would implement the details, there's nothing more to say for now. Then there's the time issue... there never seems to be enough, but I only ever hack for fun so it's no problem really.

P.S. Don't let this long post make you not notice that BFS 423 and 3.4-ck2 are also out in the previous post.

bfs 0.423, 3.4-ck2

A couple of issues showed up with BFS 0.422, one being the "0 load" bug and the other being a build issue on non-hotplug releases. So here is BFS 0.423 and 3.4-ck2 (which is just ck1 with the BFS update) which should fix those:

3.4-sched-bfs-423.patch

3.4-ck2/

and the increment only:

3.4bfs422-423.patch

Enjoy!
お楽しみください

Saturday, 2 June 2012

BFS 0.422, 3.4.0-ck1

Announcing the release of BFS for 3.4, along with the complete -ck1 patch.

BFS alone:
3.4-sched-bfs-422.patch

Full 3.4-ck1 patches:
3.4-ck1

Alas I was unable to keep the 420 number for BFS due to a number of minor changes. I also incremented the number beyond the unofficial 421 patch posted to lkml so there would be no confusion. The only changes are some trivial display accounting fixes, along with forcing SLUB in the config by default as other slab allocators crash with BFS (you should all be using SLUB anyway). The rest of the BFS changes are a resync with the new code going into linux 3.4, along with more merging of code from mainline into BFS where suitable. Note that I have adopted the mainline approach to dealing with unplugged I/O. Previously I had spent a lot of time making it work with BFS, for those who remember that period of instability, so hopefully the mainline approach will work seamlessly now (since mainline ended up having the same bug, but it was harder to reproduce).

3.4-ck1 is just a resync of the remainder of the patches from 3.3-ck1.

Enjoy!
お楽しみください

EDIT: If you build on SMP without enabling CPU hotplug you will need this patch on top for BFS to build:
bfs422-nohotplug_fix.patch

Saturday, 24 March 2012

3.3.0-ck1

New -ck version for the latest mainline linux kernel, 3.3.0:


3.3-ck1

Changes since 3.2.0-ck1:

New BFS version 0.420 AKA smoking as discussed here:
bfs-420-aka-smoking-for-linux-33

Includes one build bugfix for UP compared to that first release candidate.

Other changes:
These patches have been dropped:
-mm-background_scan.patch
-mm-lru_cache_add_lru_tail-2.patch


The Virtual Memory subsystem has changed so much that it's hard to know whether these patches do what they originally intended, or whether they are helpful any more. In the absence of being able to test their validity, it seems safer to just drop them.

The rest of the patchset is just a resync.

Enjoy!
お楽しみ下さい

Thursday, 22 March 2012

BFS 420 AKA smoking for linux 3.3

Here's the release candidate for the first stable release of BFS for linux 3.3. BFS hadn't received much attention lately so I felt a little guilty and worked on a few things that had been nagging me:

3.3-sched-bfs-420.patch

It is not clear whether the bug I posted about here is still present; given that it is extremely rare and hard to reproduce, it is impossible to know for sure. A number of minor issues were addressed in the code which hopefully have fixed it (and hardly anyone was affected anyway).

The changes since BFS version 0.416 include a fairly large architectural change just to bring the codebase in sync with 3.3, but none of the changes should be noticeable in any way. One change that may be user-visible is that high resolution IRQ accounting now appears to be on by default on x86 architectures. There is an issue where system time accounting is wrong without this feature enabled in BFS, so this should correct that problem.

Other changes:
416-417: A number of ints were changed to bool which, though unlikely to have any performance impact, makes the code cleaner, and the compiled code does often come out different. rq_running_iso was converted from a function to a macro to avoid a separate function call, with its attendant overhead, when compiled in. requeue_task within the scheduler tick was moved to being done under lock, which may prevent rare races. test_ret_isorefractory() was optimised. set_rq_task() was not being called on tasks being requeued within schedule(), which could possibly have led to issues if the task ran out of timeslice during that requeue and should have had its deadline offset. The need_resched() check that occurs at the end of schedule() was changed to unlikely() since it really is that. Moved the scheduler version print function to bfs.c to avoid recompiling the entire kernel if the version number is changed.

417-418: Fixed a problem with the accounting resync for linux 3.3.

418-419: There was a small possibility that an unnecessary resched would occur in try_preempt if a task had changed affinity and called try_preempt with its ->cpu still set to the old cpu it could no longer run on, so try_preempt was reworked slightly. Reintroduced the deadline offset based on CPU cache locality for sticky tasks, in a way that is cheaper than how we previously offset the deadline.

419-420: Finally rewrote the earliest_deadline_task code. This has long been one of the hottest code paths in the scheduler, and small changes here that made it look nicer would often slow it down. I spent quite a few hours reworking it to use fewer gotos, disassembling the code to make sure it was actually getting smaller with every change. Then I wrote a scheduler-specific version of find_next_bit which could be inlined into this code to avoid another function call in the hot path. The overall behaviour is unchanged from previous BFS versions, but initial benchmarking confirms slight improvements in throughput.

Now I'll leave it open for wider testing to confirm it's all good, and then I have to think about what I should do with the full -ck patchset. As I've said in numerous posts before, I'm no longer sure about the validity of some of the patches in the set given all the changes to the virtual memory subsystem in the mainline kernel.

Tuesday, 20 March 2012

BFS issues and linux 3.3

Linux 3.3 is out, but I'm not releasing BFS for it at this stage. The reason is that a regression has been reported as showing up in BFS and it's proving hard to track down.

The issue involves a slowdown under load when the load is niced. The problem is that the slowdown does not occur all the time; I've only seen it in the wild once and have not been able to repeat it since. I've audited the code and have not yet found the culprit, but when it does happen, it is very obvious, with the mouse stalling for seconds. The 'top' output is usually a giveaway that something has gone wrong, because the 'PR' column should normally report values of 1-41 for a nice 19 load. When it happens, it will show much higher values, in the 42-81 range, which should not happen. Unfortunately, the best hint for me would be to find which version of BFS introduced it and look for the change responsible, and since I can't even reproduce the problem most of the time, I can't do this regression testing.

So I'm appealing to the BFS users out there: if anyone has this problem more regularly and has the time, please try older versions of BFS. By older versions I don't mean the same version of BFS on older kernels, but the first version of BFS that was available for each kernel. Running something continuously as a nice'd load is required to reproduce it, where the load is equal to the number of CPUs; for example, 'nice -19 make -j4' run continuously in a kernel tree on a quad core machine. I'm hoping that someone out there is able to reproduce it and can do the regression testing. Thanks in advance. In the meantime, I'll keep auditing code and comparing new to old versions in the hope something stands out.

EDIT: An alternative approach was to try moving to 3.3 and make minor fixes along the way to see if the problem persists. Consider this patch a pre-release for now (CPU accounting appears all screwy still):

EDIT2: Fixed CPU accounting, bumped version to 418:
3.3-sched-bfs-418.patch

Sunday, 18 March 2012

lrzip-0.612

Just for good measure, here's yet another lrzip release.

lrzip 0.612 on freecode

This time the main update is a new zpaq library back end instead of using the ageing zpipe code. There are a number of advantages to using the libzpaq library over the old zpipe code.

First, the old code required a FILE type stream as it was written with STDIO in mind, so it was the only compression back end that required some lesser known but handy, yet (virtually) linux-only memory features like fmemopen, open_memstream and friends. These were not portable to osx and others, so they were emulated on those platforms through the incredibly clunky use of temporary files on disk. Using the new library has killed off the need for these features, making the code more portable.

Second, the code is significantly faster since it is the latest full C++ version of the zpaq code. Unfortunately it also means it takes a LOT longer to compile this part of lrzip now, but that's not a big deal since you usually only compile it once ;)

Third, it supports 3 different compression levels, one of which is higher than the one previously supported in lrzip. As lrzip uses 9 levels of compression, I've mapped the 3 zpaq levels to -L 1-3, 4-7 and 8-9; since -L 7 is the default, that provides the "mid level" compression from zpaq.

Finally, the beauty of the zpaq compression algorithm is the reference decoder can decompress any zpaq compressed data of any profile. This means you are able to use the latest version of lrzip with compression -L 9 (max profile), yet it is still backwardly compatible with older 0.6x versions of lrzip, not requiring an updated minor version and file format. The release archive I provide of lrzip-0.612.tar.lrz is self compressed with the new max profile. Even though there is significantly more code than ever in the lrzip release tarball, it has actually shrunk for the first time in a while.

All that talk is boring, though, so let's throw around some benchmark results, which are much more fun.

From the original readme benchmarks, I had compressed the linux 2.6.37 tarball, so I used that again for comparison. Tests were performed on a quad core 3GHz Intel Core 2.


Compression Size  Percentage Compress Decompress
None     430612480 100
7z        63636839 14.8  2m28s  0m6.6s
xz        63291156 14.7  4m02s  0m8.7s
lrzip     64561485 14.9  1m12s  0m4.3s
lrzip -z  51588423 12.0  2m02s  2m08s
lrzip -l 137515997 31.9  0m14s  0m2.7s
lrzip -g  86142459 20.0  0m17s  0m3.0s
lrzip -b  72103197 16.7  0m21s  0m6.5s
bzip2     74060625 17.2  0m48s  0m12.8s
gzip      94512561 21.9  0m17s  0m4.0s

As you can see, the improvements in speed of the rzip stage have made all the compression back ends pretty snappy, and most fun of all is that lrzip -z on this workload is even faster on compression than the multithreaded 7z, with significantly smaller output. Alas, the major disadvantage of zpaq remains that it takes about as long to decompress as to compress. However, with the trend towards more CPU cores as time goes on, one could argue that zpaq compression, as used within lrzip, is reaching a speed where it can see regular use instead of just research/experimental use, especially on small files like the lrzip distributed tarball.

I also repeated my old classic 10GB virtual image benchmarks



Compression Size  Percentage Compress Time Decompress Time
None     10737418240 100.0
gzip      2772899756  25.8  05m47s  2m46s
bzip2     2704781700  25.2  16m15s  6m19s
xz        2272322208  21.2  50m58s  3m52s
7z        2242897134  20.9  26m36s  5m41s
lrzip     1372218189  12.8  10m23s  2m53s
lrzip -U  1095735108  10.2  08m44s  2m45s
lrzip -l  1831894161  17.1  04m53s  2m37s
lrzip -lU 1414959433  13.2  04m48s  2m38s
lrzip -zU 1067169419   9.9  39m32s  39m46s

Using the "U"nlimited "z"paq options, it is actually faster than xz now. Note that about 30% of this image is blank space, but that's a not-uncommon type of virtual image; if it were full of data, the difference would be less. Anyway, I think it's fair to say that zpaq is worth watching in the future. Edit: I've sent Matt Mahoney (the zpaq author) the latest lrzip results on the large file benchmark and he's updated his site: http://mattmahoney.net/dc/text.html I think it's performing pretty well for a general compression utility.

Sunday, 11 March 2012

lrzip-0.611

It usually doesn't take long to find bugs in a significant new release, and that was the case with lrzip 0.610. Since I'm loath to leave a buggy release lying around, I've released a new version hot on the heels of the old one.

lrzip 0.611 on freecode

http://lrzip.kolivas.org

lrzcat and lrzuntar were broken in the last release and have now been fixed. lrzuntar also had the nasty habit of overwriting existing directories without warning, so I've modified the code so it will not overwrite them without the -f option.

With the reports of slowdowns in the last release almost certainly due to incorporating the liblrzip library support, since nothing else changed, I figured it was time to do some hot spot profiling. So I pulled out oprofile and found where most of the time was spent in the code during the rzip stage. Then I went in and carefully rewrote small areas of the hottest functions as though they were critical code paths in the CPU scheduler, and managed to find a few small speed improvements. Most of the improvements won't be noticeable unless you're running one of the faster compression modes like lzo, or none at all with -n, but it is faster. I also moved the checksum routines (crc32 and md5) into separate threads, as they now use a significant amount of CPU time of their own during the rzip phase; this should speed things up slightly too.

Then I went to the decompression side of things and did some profiling on the runzip stage and got a surprise. Most of the time during decompression is simply spent on the md5 update code. If I disabled the md5 update code, it was much faster during the runzip stage. After arguing with myself for a day, I figured it was still better to have integrity checking enabled and consider the addition of a fast mode for decompression since that will actually be almost lightning quick.

Interestingly, when profiling decompression, I was using the test option (-t), which does not write anything to disk, and things changed quite dramatically when I switched to actual decompression to disk. It took four times longer to decompress a 1GB archive to disk than it did to just test decompression in ram.

Now this would all seem obvious if you consider how long it takes to write something to disk, but that was not the explanation at all. In fact, virtually none of the data has been written to disk by the time decompression is complete; it is all just sitting in "dirty ram" as writeback. This did not change whether I used a -ck kernel with a low dirty ratio setting or the mainline default of 10. After some more prodding around, I discovered that a simple write() of a 1GB buffer took over 7 seconds on modern hardware. This is only 140MB/s for a copy from memory to memory! It should be 20 times faster than that. I even took the write function out of the equation by doing an mmap on the file and then a memcpy from the buffer to the mmapped ram, and it took the same amount of time, so the slowdown was not in the write() function call itself.

After speaking to a few people and googling, it appears I'm not alone in this finding, and it may only be happening in recent linux kernels. At this point I gave up trying to find what was causing the slow decompression in lrzip, since it seemed unrelated to the lrzip code, and concentrated on getting this release out. I wonder if this is related to the writeback code that I was so looking forward to in 3.2ish. However, others report the problem exists as far back as 2.6.38, whereas it's not there in 2.6.35. I'll wait and see.

Anyway, it may be possible to get lrzip integration into libarchive, and therefore into a number of package managers and decompression programs that use that library. The GPL licence of lrzip may be a sticking point, though, and the authors of lzo and the original rzip have not responded to queries about the possibility of making the library LGPL, which would make it easier to incorporate into the BSD licensed libarchive. So for now, it's stuck on GPL version 2.

Enjoy.

Thursday, 8 March 2012

lrzip-0.610

I haven't been idle all this time. In all honesty, I've spent a lot of time playing with my bitcoin mining code in cgminer, and after a couple of hardware failures with my mining rig I was finally able to turn my attention back to lrzip, which has been fairly quiet in the bug reporting department since the last stable release.

Here is version 0.610 (when freecode updates):
http://freecode.com/projects/long-range-zip

Or directly:
http://lrzip.kolivas.org

The major feature addition in this latest version is largely thanks to the work of Michael Blumenkrantz, in the form of a liblrzip library! This means you can now link lrzip compression and decompression into any other application. There is a range of APIs available, from the simplest gzip-equivalent lrzip_compress and lrzip_decompress functions to ones that expose all the low-level knobs and dials in lrzip. There are also a couple of demo examples of using these functions included in the source code, but in their simplest form they should be very easy to use. I look forward to seeing package managers, archivers and maybe even git and so on gain lrzip support in the future ;) Being a large chunk of new code, it is not entirely impossible that bugs or issues with using the library may show up.

WHATS-NEW summary:
The new liblrzip library allows you to add lrzip compression and decompression
to other applications with either simple lrzip_compress and lrzip_decompress
functions or fine control over all the options with low level functions.
Faster rzip stage when files are large enough to require the sliding mmap
feature (usually >1/3 of ram) and in unlimited mode.
A bug where multiple files being compressed or decompressed from the one
command line could have gotten corrupted was fixed.
Modification date of the decompressed file is now set to that of the lrzip
archive (support for storing the original file's date would require modifying
the archive format again).
Compilation warning fixes.
Make lrztar work with directories with spaces in their names.


The full changelog follows:
* Implement complete set of liblrzip libraries, documentation and example uses
with support for simple lrzip_compress() and lrzip_decompress() or complete
fine-grained control over all compression and decompression options.
* Use as much of the low buffer as possible with a single memcopy before going
fine grained byte by byte.
* Preserve the compressed time on decompression where suitable.
* Store a copy of the control struct to be reused on subsequent files to prevent
variables being modified in the control struct on the first file that corrupt
compression/decompression of the 2nd file.
* Explicitly select C99 to avoid certain warnings.
* Generic modifications to silence -Wextra warnings.
* Fix typos.
* Use an array of parameters in lrztar to allow working with directories with
spaces in their names.

Enjoy.

Monday, 16 January 2012

3.2-ck1

These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but suitable to any commodity hardware workload.

This is an upgrade to BFS 416, with trivial bugfixes and a resync from 3.1.0-ck1. I've changed the versioning to not specify a 3-point release, since the patch usually applies to successive 3-point releases for some time.

Apply to 3.2.x:
patch-3.2-ck1.bz2


Broken out tarball:
3.2-ck1-broken-out.tar.bz2

Discrete patches:
patches


BFS by itself:
bfs


Web:
kernel.kolivas.org


Code blog when I feel like it:
ck-hack.blogspot.com


Each discrete patch usually contains a brief description of what it does at
the top of the patch itself.

Full patchlist:

3.2-sched-bfs-416.patch
mm-minimal_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-2.patch
mm-decrease_default_dirty_ratio-1.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
ck1-version.patch

Enjoy!
お楽しみください

--ck

P.S. What this really means is I finished playing zelda.

Monday, 9 January 2012

Towards Transparent CPU Scheduling

Of BFS related interest is an excellent thesis by Joseph T. Meehean entitled "Towards Transparent CPU Scheduling". Of particular note is the virtually deterministic nature of BFS, especially in fairness and latency. While this of course interests me greatly because of the extensive testing of the BFS CPU scheduler, many aspects of both the current CFS CPU scheduler and the older O(1) CPU scheduler are discussed that anyone working on issues of predictability, scalability, fairness and latency should read.

http://research.cs.wisc.edu/wind/Publications/meehean-thesis11.html


Abstract:

In this thesis we propose using the scientific method to develop a deeper understanding of CPU schedulers; we use this approach to explain and understand the sometimes erratic behavior of CPU schedulers. This approach begins with introducing controlled workloads into commodity operating systems and observing the CPU scheduler's behavior. From these observations we are able to infer the underlying CPU scheduling policy and create models that predict scheduling behavior.
We have made two advances in the area of applying scientific analysis to CPU schedulers. The first, CPU Futures, is a combination of predictive scheduling models embedded into the CPU scheduler and user-space controller that steers applications using feedback from these models. We have developed these predictive models for two different Linux schedulers (CFS and O(1)), based on two different scheduling paradigms (timesharing and proportional-share). Using three different case studies, we demonstrate that applications can use our predictive models to reduce interference from low-importance applications by over 70%, reduce web server starvation by an order of magnitude, and enforce scheduling policies that contradict the CPU scheduler's.
Harmony, our second contribution, is a framework and set of experiments for extracting multiprocessor scheduling policy from commodity operating systems. We used this tool to extract and analyze the policies of three Linux schedulers: O(1), CFS, and BFS. These schedulers often implement strikingly different policies. At the high level, the O(1) scheduler carefully selects processes for migration and strongly values processor affinity. In contrast, CFS continuously searches for a better balance and, as a result, selects processes for migration at random. BFS strongly values fairness and often disregards processor affinity.

Thursday, 5 January 2012

BFS 0.416 for 3.2.0

Well 3.2.0 is finally out.

I've done a quick and dirty port of BFS from 3.1 with mostly trivial changes and a minor change to idle CPU selection (it will now choose the old CPU first even if the other "thread" is busy on hyperthreaded CPUs). For the most part the changes are trivial and only keep in sync with the new scheduler changes in mainline, so it should be a very safe upgrade. Nonetheless, I'm putting this up for testing here because my Ubuntu laptop seems unhappy starting X, though that seems 3.2 related rather than BFS related. I've since moved my home desktop to Arch Linux and have been very happy on that distro.

So anyway here's BFS for 3.2.0 for those who want things hot off the press:

3.2-sched-bfs-416.patch

The -ck patch will not be far behind, assuming the first few testers report no problems with this BFS patch. To be honest, the virtual memory subsystem has changed so much that I'm having trouble predicting how it will behave now, and I have no way of confirming that the VM patches that go into -ck are even helpful any more, so I'm not sure I should continue to support them.

Besides, I've been too busy playing Zelda - Skyward Sword to really care much about anything code related. I've loved every 3D rendition of Zelda since Ocarina of Time, but this latest incarnation is the most amazing game ever...

Friday, 11 November 2011

3.1.0-ck2, BFS 0.415

Blah blah fixes blah blah rollbacks, anyway here's a new -ck and bfs 415. Nothing new here, just bugfixes.

3.1.0-ck2

3.1-sched-bfs-415.patch


Full -ck patchlist:

3.1-sched-bfs-414.patch
sched-add-above-background-load-function.patch
mm-minimal_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
cpufreq-bfs_tweaks.patch
bfs414-noprefetch.patch
bfs414-remove_rqpreempt.patch
bfs414-correct_int_size.patch
bfs414-stickybool.patch
bfs414-dont_use_task.patch
bfs414-clear_deactivated_sticky.patch
bfs414-fix-nonbfs-build.patch
bfs414-415version.patch
ck2-version.patch 
 
 
BFS 415 is now the following patches from -ck2 above:
 
3.1-sched-bfs-414.patch
sched-add-above-background-load-function.patch
cpufreq-bfs_tweaks.patch
bfs414-noprefetch.patch
bfs414-remove_rqpreempt.patch
bfs414-correct_int_size.patch
bfs414-stickybool.patch
bfs414-dont_use_task.patch
bfs414-clear_deactivated_sticky.patch
bfs414-fix-nonbfs-build.patch
bfs414-415version.patch 
 
Enjoy!

Thursday, 3 November 2011

3.1.0-ck1, BFS 0.414

I finally found enough time to sync up BFS with the latest 3.1.0 linux kernel. In keeping with the changes needed to support new scheduler features in linux-3.1, and the general code churn that happens on every release, the version number has been bumped from the last 0.413 release, but from the user perspective it should behave identically. Some pending build fixes and one bug involving displayed hard interrupt usage have also been incorporated. Overall most users should notice no significant change.

Apply to linux-3.1.x:
3.1-sched-bfs-414.patch

Closely following the sync with the latest kernel is an updated -ck patch which is almost identical, apart from relaxing swappiness from 0 to 10. Some users with very little RAM reported that at 0 the virtual memory subsystem tried so hard not to swap anything that they experienced stalls while the VM twiddled its thumbs, so I've relaxed the value to 10, which avoids those stalls and should still be quite aggressive at avoiding swapping.
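For reference, the same value can be set on any kernel via the vm.swappiness sysctl; a minimal /etc/sysctl.conf fragment matching the new -ck default looks like this:

```
# Match the -ck default: strongly prefer dropping cache over swapping,
# without the stall-prone extreme of 0.
vm.swappiness = 10
```

The same can be applied at runtime with `sysctl -w vm.swappiness=10` if you want to test it before making it permanent.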

Apart from that the -ck patch is a resync from the last 3.0.0-ck1 release:

Directory containing combined ck1 patch and separate patches:
3.1.0-ck1

With the recent kernel.org compromise it has become a lot harder to get a kernel.org account, and as I would have great difficulty getting a gpg key personally signed by any other kernel developer in my area, it could be a while before I can set up an account there again. So I've decided to just use my own storage once more instead of kernel.org. Unfortunately I don't have quite a few of the older -ck kernels, since I never bothered to keep copies elsewhere; I foolishly thought kernel.org would be permanent cloud storage for them, so I don't even have a copy of many of them. If someone has archived the 2.6 -ck patches, or has an old mirror of kernel.org, please forward them on to me.

Enjoy!

EDIT: A few bugs are showing up in this latest BFS release, and I'm currently accumulating fixes for them and putting them in the directory listed above for ck1. A ck2 will soon follow with all the fixes.

Monday, 17 October 2011

BFS 0.413, BFS benchmarks, (and more cksort stuff)

Finally some BFS news, yay \o/

TL;DR: Minor unnoticeable tweaks for BFS.

Incremental:
3.0-bfs-406-413.patch

Incremental for 3.0.x-ck1:
3.0-ck1-bfs-406-413.patch

Full patch for 3.0.x:
3.0-sched-bfs-413.patch

Alas there isn't anything earth shattering here. The skiplist implementation experiment failed to be of any benefit despite my best intentions, so I've abandoned the idea for the time being. Meanwhile, there were a number of minor improvements that I had considered for BFS (and I mean really really minor) and a few cleanups that had been waiting. So here's the changelog for the new BFS for 3.0:


Improve the preempt call by explicitly setting a task that will be used on
lookup in earliest_deadline_task, to avoid iterating over all tasks. Set the
effective runqueue priority for further calls to try_preempt based on the
preempting task's priority. Thus, on the small chance that something else
tries to preempt the runqueue before the new task gets the CPU, it will be
compared against the previously successful preempting task. This should
prevent a preempting task from rescheduling a runqueue only to have the CPU
take a different task instead of the preempting one.

Clean up a number of variables from unnecessary longs to ints and ints to
bools.

Microoptimise around try_preempt by setting highest_prio to that of the
task that's trying to preempt, to avoid unnecessary comparisons.

Break sole affinity on cpu offline -after- the cpu has been set to be offline. 

I'd be surprised if anyone noticed anything different with these changes, but then, there have been some surprises from apparently innocuous changes in the past, so I'm keen to get feedback from testers. There is a small chance that the preempt changes can improve throughput by avoiding unnecessary rescheduling, and that the cpu offline changes might help rare scenarios where a machine will fail to suspend to ram.

Meanwhile, graysky, who maintains the unofficial -ck repos for Arch Linux, has very kindly put together a number of throughput benchmarks comparing BFS to mainline on machines from 2x up to 16x, in a very nicely presented PDF which you can download and read here:
 
repo-ck.com/bench/benchmark.pdf


On the cksort front, it's clear to me that sorting algorithms are pretty boring to most people, despite my transient interest in them and developing my own. Nevertheless, the cksort algorithm has been further improved. After playing with it some more, and learning from the legendary work of the past, I modified the merge sort that cksort "tips" to so that it includes an insertion sort as well, making the algorithm a blended insertion+merge sort once it tips away from the presort+nodemerge algorithm, which is basically the only unique part of cksort. This capitalises on the fact that insertion sort is pretty much the fastest way to sort up to about 32 entries: the merge phase splits the data into groups of up to 32 entries, insertion sorts each group, and then merges the groups with a non-recursive merge sort.

This has further improved cksort: it is now faster than qsort for up to 1500 entries, and equally fast with the library qsort() up to 10000 (uniquely random) entries. Beyond that it is still slower than qsort, but it has -no- corner cases where it will ever become O(n^2) and will always be O(n log n) bound, unlike qsort.

Version 0.44 of cksort is here:
http://ck.kolivas.org/apps/cksort/cksort.c

And in response to requests for the benchmarking code vs qsort it is here:
http://ck.kolivas.org/apps/cksort/benchsort.c

Now Aldo Cortesi has a very interesting static visualisation of comparison sort algorithms here:
http://sortvis.org/

After I contacted him, he was kind enough to modify his code so that it can take output from something like cksort and generate a graph for any sorting algorithm. Alas, cksort uses a separate array for its work, and looking at the final array alone does not tell the entire story of what's going on, but it does help further to visualise how cksort works.

Here is a view of 20 random variables being sorted, showing the two phases of cksort: the first, where it builds up the nodes, and the second, where it merges the nodes:


It's important to note that the second-stage graph is fairly arbitrary, since the data from the unmerged nodes is not actually written back to the original array until it's merged, but you should be able to make out the final sorted data appearing from the outside in.

Sorting non-random entries is really what makes cksort interesting, and here is a (rather complex) visualisation of the 50 mostly-sorted example, in the animated form I showed on my blog previously:


Anyway, it's been a lot of fun playing with this stuff, and maybe one day I'll come up with something more useful from the algorithm. I played with an in-place version of the code which didn't need a second array, but the amount of memory moving required made the algorithm slower and more complex, so I soon lost interest in developing that for the time being. I'd also like to find a way to make the presort+node merging aspect scale to any size instead of tipping to a different algorithm, but large random data sets simply have no pattern to exploit. I still suspect something like a heap structure could help, but I haven't found anything that works in practice. Maybe I'll come back to this in the future.

Meanwhile, cksort works extremely well for smaller data sets, partially ordered data, reverse ordered data, and data with few unique values. Interestingly, it coincidentally shares a lot of its ideas with Timsort. All I can say after all this is that the work of the legends of the past, coming up with something like Smoothsort, is truly amazing.