Monday, 26 September 2011

BFS 0.410 test with skiplists.

Hi all.

TL;DR: Test release for fastest BFS ever.

The skiplists patch has proven to be quite stable, but a couple of minor issues have shown up. First, as I expected, the ondemand governor behaviour did not quite match the current release, and second, SCHED_IDLEPRIO tasks weren't scheduled properly (they acted like normal tasks). Otherwise the code itself seemed quite safe. So I've tentatively put out a test release of the next version of BFS. The two main changes to the skiplist code are to bring the ondemand governor performance up to par with current BFS, and to fix the behaviour of SCHED_IDLEPRIO tasks.

Quick benchmarks:
BFS 406:
Make -j4: 26.6s
Make -j : 27.8s

BFS 410:
Make -j4: 26.4s
Make -j : 27.1s


Changelog in patch:
Implement skip lists as described here:
http://en.wikipedia.org/wiki/Skip_list
for the main priority queue.

The old queue had O(1) insertion and O(1) removal, but a lookup involving
both a binary search of a bitmap and an O(n) search through a linked list,
which is very cache unfriendly as the list gets larger.

This queue now has O(log n) insertion, O(1) lookup and O(k) removal, done
in a much more cache friendly manner.

This should not compromise performance at all at very low loads, but improve
both throughput and latency as loads get higher, as confirmed by benchmarks.


Other changes: Cleanups of variable choices and micro-optimisations.

Full bfs410 patch for 3.0.x:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-410.patch

Incremental bfs410 for those on 3.0.x-bfs406:
http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

Incremental bfs410 for those on 3.0.x-ck1:
http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-410.patch

Note that make headers_check will report errors, so you may have to disable this option under "Kernel hacking" if it is enabled. I'm investigating why, but it's a harmless warning.

EDIT: Uniprocessor builds will need the following fix on top:
http://ck.kolivas.org/patches/bfs/test/bfs410-upfix.patch

EDIT2: The following fix is required for interactivity issues with the patch:
http://ck.kolivas.org/patches/bfs/test/bfs410-offsetfix.patch

Saturday, 24 September 2011

BFS and skip lists

Hi all.

Just a quick post to tell you all what I've been up to with BFS.

Some of you may know about skip lists as an alternative to balanced binary search trees. They feature O(log n) insertion, lookup and removal of table entries. Anyway, I've been looking for some time at the O(n) lookup of BFS (which has O(1) insertion and removal) to try and find a solution that didn't cost us at the desktop level, since O(n) for small n is very fast. The problem is of course at higher values of n (i.e. server type loads), where lookup gets linearly slower and the cache thrashing from scanning linked lists becomes expensive.

So to cut a long story short, I've implemented a first draft of a custom version of skip lists for BFS in place of the O(n) lookup. Insertion remains O(log n), but by sorting all tasks (realtime, iso, normal, batch and idleprio) in a way that puts them all on the same table, I was able to do away with the regular linked lists and the bitmap priority lookup. Then I was able to use one of the features of skip lists: the first task on the "bottom" list is always sorted to be the highest priority, which makes the lookup O(1). Next I put an extra pointer into each entry to the previous entry (the original design only points to the next entry). Finally I placed a pointer to the entry in the task struct as a way of reverse looking up the entry without any search.

So what I'm left with is an O(log n) insertion, O(1) lookup, and O(k) removal (k is the "height" of the node in question, up to 16, but usually only 1-4).

I've implemented the sticky task used for CPU affinity by simply comparing the last sticky task to the first entry returned from the skip list. Alas I have not yet provided a good version of the sticky task being used to improve scaling governor behaviour. This means that this will likely perform worse with the ondemand governor at this stage. On the other hand, the performance governor seems to be working very nicely in my preliminary tests.

Anyway I'd love to post more, but I need to sleep, so here's some code instead (for a BFS patched 3.0.x kernel):
bfs406-skiplists.patch

Full bfs406 + skiplists patch:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-406-skiplists.patch

Try it out, see what you think. It seems to be running safely here but as always, there are no guarantees. All going well, you should notice pretty much no difference on a desktop. If you do any throughput benchmarks comparing before/after, I'd love to hear about them.

EDIT:
Here's the thread about it on lkml: https://lkml.org/lkml/2011/9/23/371

Note the benchmarks showing substantial speedup in the highly contended make -j case that I mention in the thread.

3.0.0:
Elapsed Time 28.7

3.0.0-bfs406:
Elapsed Time 28.5

3.0.0-bfs406-sl:
Elapsed Time 27.0

Friday, 16 September 2011

lrzip-0.607

It's been a while since lrzip received any attention while I got distracted by cgminer. However a bug report about an archive that could not be decompressed drew my attention. After a run of debugging and testing (thanks Entil!) I tracked down the bug. The nice thing was that the archives were actually all intact; the problem could only occur on decompression, and it was extremely rare for an archive to be made in a way that triggered it. So I fixed the bug, updated the lzma library to the latest, added an lrzip bash completion script and a couple of other minor things, and have released version 0.607.

lrzip on freshmeat
http://lrzip.kolivas.org
http://github.com/ckolivas/lrzip

Meanwhile Michael Blumenkrantz has been working on a liblrzip library for lrzip! That means once we finalise the format, you'll be able to compile lrzip support into your applications, using functions like:

lrzip_compress(void *dest, unsigned long *dest_len, const void *source, unsigned long source_len)

This work is currently in the liblrzip branch on the github tree.

Tuesday, 16 August 2011

Phoronix revisits BFS

Phoronix once benchmarked BFS in its early days. I guess at the prodding of people who suggested it here (thanks!), they've revisited it. Of course they did a throughput benchmark comparison, which rather misses the point of BFS; then again, if they find ANY throughput benchmark where BFS is better, that's somewhere between amusing and disappointing. They did mention in passing that "However, the system with the BFS scheduler was more responsive during the test process." While this at least pays homage to what BFS is really about, it makes me wonder - what were they doing while the benchmarks ran? Benchmarks are meant to run without anything else running on the machine.

Anyway, thanks to Phoronix for conducting these benchmarks, and for their logo suggestion.

http://www.phoronix.com/scan.php?page=article&item=bfs_two_years&num=1


Probably more interesting than a pure throughput test is the report of a server admin who was running mainline with 6000-8000 processes, and whose system was falling over under that load. He then switched to BFS with no change in usage (it actually increased), which fixed his problems. Here's the graph; see if you can guess when he switched to BFS.