Tuesday, 11 October 2011

cksort and O(n^2) tipping point

I've spent some time modelling the behaviour of the last cksort I posted here to investigate where and why it becomes O(n^2). It turns out the prescan which creates the nodes ends up scanning a large amount of the data repeatedly when the data is purely random. As mentioned here previously, with up to ~1500 unique random entries it is faster than qsort, and then it gets slower. When there is already some sorted data or repetition, it remains faster even when the number of entries becomes very large. So I tried to model where the tipping point was, to determine when to change the sort mode, and it turns out that if the prescan is unable to generate a node which is a decent proportion of the original data, it is likely to only generate ever smaller nodes on each pass. With a bit of trial and error and some weak modelling, I found that it would stay ahead of a classic O(n log n) sort when it was able to reliably sort the data into 32 nodes or fewer. There are occasions where it is faster with more nodes, but it is always faster with fewer. So I integrated a classic merge sort into the sort algorithm and made it switch to classic merge sort should it fall below this threshold. This does not significantly increase the total number of compares/swaps, and it means the algorithm is always O(n log n) bound.

So the latest version of cksort now has some nice properties, being O(n) in certain ideal scenarios, being very adaptive to patterns in either direction or lots of repeated data, yet always being bound by O(n log n) in complexity.

Just about any O(n log n) algorithm could be used beyond the tipping point, but given that cksort intrinsically uses a merge, it seemed natural to extend that to a classic merge sort as well. I experimented with using a heap sort once the data was pre-sorted, since it almost becomes a heap structure naturally after the pre-sort, but two things became clear. First, heap sort is actually slower than merge sort on average. Second, and more importantly, the pre-sort phase was the O(n^2) component of the original sort, so it made no sense to complete that phase at all.

(EDIT updated version) Version 0.43:
http://ck.kolivas.org/apps/cksort/cksort.c

10k samples compared to library qsort (I checked the actual code and it's a highly optimised combination of quicksort and insertion sort). Again bear in mind there is little to no optimisation in the cksort code. Time is in microseconds:

fewunique:
qsort:  452
cksort: 114

random:
qsort:  573
cksort: 883

sorted:
qsort:  449
cksort: 34

reversed:
qsort:  451
cksort: 41

mostly sorted:
qsort:  459
cksort: 58

As you can see above, random data is still slower on cksort (but a heck of a lot faster than with earlier implementations), and it seems to stay at about half the speed of qsort no matter how many samples are used, which is good because it means it is bound by O(n log n) complexity. 10,000 samples were chosen because that is a size where the polynomial nature of an algorithm would be obvious. The reason it's half the speed of qsort is that an unoptimised merge sort does on average twice as much work as an optimised quicksort. I could reasonably switch to a quicksort at the tipping point instead, but there are still corner cases where quicksort becomes O(n^2), which mergesort never suffers from, and mergesort is very parallelisable if desired. I was tempted to thread the mergesort component to improve the benchmark numbers, but it would have skewed the results inappropriately.

As a generic sort there are still two issues with cksort. One is the extra space required, and at O(n) auxiliary space it is about as bad as it gets. An in-place algorithm could certainly be developed but would be tricky and require a lot more swaps during the presort phase. The tipping algorithm could be switched to a heap sort instead which works in place. The second disadvantage is that it is not stable (i.e. does not keep original order of same value entries). I'm not sure that's easy to improve on.

Again, I don't expect this will dramatically change the way anyone sorts anything but it's certainly been interesting to play with. I'll keep playing some more :)

EDIT: In response to requests for a range of numbers, here's a set of them for different types of data and number of samples, 10-100 million samples. Random (2^32), fewunique1(4 different values), fewunique2(25 different values), fewunique3(100 different values and > tipping point), mostly sorted(every 5th and 10th value swapped), already sorted and reverse sorted.




EDIT: In response to requests, this is the benchmark code based on 0.43 which is an improvement on this original post. Just edit the TOSORT value to try different sizes.

Sunday, 9 October 2011

cksort evolution

I've been tinkering with the cksort algorithm to try and improve its performance and worst case behaviour, while working on the existing design. Spending some time visualising what happens with the 3 bucket model I described, I experimented with using the high/low buckets as a pre-pass instead of being the actual sorting part. This turns a random array of data like this:


into what looks almost like a heap array in the following diagram:


Looking carefully at this, you can see that the range (lowest to highest values) of each "node" as I'm calling them, gets smaller as you progress from the bottom to the top. Also each node is a discrete sorted entity, and by combining the increasing and decreasing values into one node it halves the number of nodes generated, while being able to benefit from any existing partially arranged data in either direction.

After lots of tinkering at this point, I settled on using a merge from one node to the next, and used the narrower range of each successive node to my advantage to avoid re-scanning areas that are known to be in their final positions.

All this has done wonders to the performance of cksort in all areas, not only speeding up the sorts it was already fast at, but dramatically speeding up the random data example (20x faster) and fixing the corner case of the sawtooth overlapping data. A reminder where the inspiration for these sample data sets and graphs came from: http://sorting-algorithms.com

So here are the latest versions of the 50 value graphs that improved further:
Random


Mostly sorted


Few unique

The code as is is faster than qsort up to 1500 uniquely random variables and then gets slower beyond that. It is faster than qsort for pretty much any number of variables for data that has any significant existing pattern/sorting or repeated data.

Again, I'm doing this purely as an intellectual exercise, but there may be some valid usage for such an algorithm. As for my comment about making the sort parallelisable, the merge component between nodes could easily be parallelised, beginning the merging before the presort component is completed, as each node comes in.

Current version (0.3) is available here:  http://ck.kolivas.org/apps/cksort/cksort.c

Probably the next step is to find the upper limit on number of nodes for when it turns O(n^2) and then use the intrinsic heap like shape of it to actually generate a heap structure and use a classic heap sort.

Wednesday, 5 October 2011

Anatomy of a software engineer reject

As most of you would be aware, I'm a doctor, a specialist in anaesthesia, and only moonlight with code in my spare time. The only reason I have time to moonlight is that I take great care with my work/life balance, making sure I have enough time to tinker with the things I enjoy. Around August each year it must be recruitment time at the big software companies, because each year for the last few years a couple of different companies have approached me about considering a career with them. Of course, with a stable career, income, work-life balance, proximity to work and so on, you can imagine I would never have considered changing careers, especially after the investment I put into specialising. The fact is I have pretty much enjoyed my career and the intellectual stimulation it provided. But intellectual stimulation fades with time because, as one of my teachers once told me, "if it's still exciting in 10 years time it's because you don't know what you're doing".

Anyway this year was different because I had decided I needed an extended break from work for personal development reasons and figured I'd go through the interview process and see where things ended up. I knew full well my knowledge base was sparse to say the least, knowing great details about minute things and missing vast amounts of other general information and experience. Sure I explained that to the recruiters and in the past they've baulked at finding that out, or wanted to put me through the process and send me to $faraway_location. This time the recruiter was positive and was considering all options that might actually suit me, including the tempting carrot of suggesting working remotely (but not confirming it), so I went through the interview routine with said $software_company expecting my lack of general knowledge would be my downfall. This might help explain my sudden interest in algorithmic complexity and sorting algorithms...

And indeed I did get rejected. This is, of course, not surprising in the slightest. Now I have to go away and decide whether I want to spend lots of spare time studying all that general stuff to move house and home, family and work security to work for much less income full time in another state or country on general software. Or I could stick to working part-time as an anaesthetist and hack on whatever I like in my spare time and keep all those other things. Change always has something that attracts us in perverse ways...

Tuesday, 4 October 2011

Visualising cksort

As part of the development of my first draft of the sorting algorithm, I spent a lot of time visualising existing sorting algorithms at the excellent website http://www.sorting-algorithms.com/ . Looking at the different algorithms allowed me to think about how I might develop my own sort to try and improve performance in some of the common cases listed there, and do well in areas where most do badly. Until now I was unable to generate pretty graphs like those ones, but with Serge Belyshev's help I managed to get some rudimentary versions (without the nice arrows and different shading that the website above has) to see how swaps occur.

The following graphs visualise only the writes to the main array, and an update occurs only whenever a value is changed in the main array. This tends to correlate quite well with the overall number of comparisons as well so the time taken to complete in these examples should give you an idea of where cksort performs.

I would have liked to visualise completely sorted data too, but since cksort moves no data in that case, there is nothing to show.

These graphs correspond to 50 entries of values 0-50, so the equivalent of clicking the 50 button on sorting algorithms above. Putting them in the same order as the site above here they are (note I do NOT have the same data so you cannot compare them directly, though the ballpark performance scales should be comparable):

EDIT: NOTE These graphs link samples from the latest version of  cksort which performs better than this original post described.

Random data

Nearly sorted




Reversed



Few Unique

Part of the reason I wanted to visualise these sorts was to demonstrate how cksort worked visually, but another reason was for me to examine it and see where its weakness is and how I might improve it. Interestingly, in my testing, the worst data for cksort was a sawtooth pattern as I mentioned in my last post. Furthermore, if I "heapify" the unsorted data on every pass, it gets much slower. Looking at the random data example above, it seems that with multiple passes, it intrinsically tends to heapify the data anyway, which is counterproductive. So assuming I can improve on it, this is the area I need to concentrate on. Perhaps after a few passes if the number of clustered "highs" drops to minimal, it may be better to switch to a heap sort for the remaining data. At least it's clear in the above examples where cksort does perform well!

Sunday, 2 October 2011

cksort follow up.

I've spent a few days playing with cksort to try and improve it with behaviour that I was planning on, and to try and categorise exactly what sort of sort (heh) it is and to analyse its algorithmic complexity. I know people haven't found this as interesting as my usual stuff, but I feel like blogging it anyway, and maybe some use will come of this code.

cksort v0.21:
http://ck.kolivas.org/apps/cksort/cksort.c
To describe the sort, it's an extension of a selection sort using external storage to substantially improve the speed of sorting certain forms of data and the worst case scenario.

I'll use this sample of 20 for simplicity:


 63 98 22 30 84 73 23 51 10 57 04 73 74 11 16 27 65 22 25 25

It starts by scanning over the data linearly, sorting data into 3 buckets:
The first bucket contains the first entry, and then collects any entries equal to or greater than the last entry put into it.


63 98

The second bucket contains the first item rejected by the 1st bucket and then collects any entries equal to or smaller than the last entry put into it.


22 10 04


The last bucket stores any data not falling into either of the previous categories (unsorted).

30 84 73 23 51 57 73 74 11 16 27 65 22 25 25

To minimise the amount of extra storage required, only one extra buffer equal in size to the original data is used, and the data from buckets 1 and 2 are placed at the beginning and end of the same buffer, while the unsorted data gets compacted in the original storage.

The highest and lowest values in the unsorted data are stored for comparison.

Then the main data is reconstructed as follows:

Any values from bucket 2 below the lowest value from unsorted are put at the beginning of the main buffer and considered in their final position and will not be scanned again.
Then all unsorted values are next in line.
Then the remainder of bucket 2 data is placed in the main buffer in ascending order.
Then bucket 1 is placed in ascending order, and any that are larger than the highest value from unsorted are considered in their final position.

 04 10 73 23 51 57 73 74 11 16 27 65 22 25 25 30 84 22 63 98

Then the scan is repeated in ever decreasing sections of the middle until none are left.

04 10 11 57 16 27 65 22 25 25 30 22 63 51 23 73 73 74 84 98
04 10 11 16 22 25 25 30 22 63 51 23 27 57 65 73 73 74 84 98
04 10 11 16 51 23 27 57 22 22 25 25 30 63 65 73 73 74 84 98
04 10 11 16 22 22 23 30 27 25 25 51 57 63 65 73 73 74 84 98
04 10 11 16 22 22 23 25 25 27 30 51 57 63 65 73 73 74 84 98


Trying to categorise this was difficult for a while but fortunately I discussed it with a few friends online and we were able to come up with something.


So in this overcrowded world of >9000 sorts, there needs to be some advantage or else it's a useless addition. Now I'm not claiming this is going to be something special but I'm doing this to fulfil some intellectual curiosity. Now please, I'm not that good with the algorithmic complexity details, so if I've described something wrong, please feel free to correct me (because no one ever does that on the internet!)

The advantages are with how it manages partially sorted data, and repeats of data. A set of fully sorted data in ascending order takes one pass, and after the data is placed into the backup array, it can abort out without any further swaps, so the comparisons and swaps equal n with one pass. O(n).

When data is already sorted in reverse order, the 2nd bucket picks it all up and places it in correct order. This amounts to one comparison and two swaps per entry, O(n). This is an area where very few sorting algorithms do well.

When there are a limited number of values the entries can take, the number of passes cksort makes is bound by half that number of values, no matter how many entries there are. e.g. if you have 1 million entries of values 0-100, it will only ever take up to 50 passes, needing ~25 million comparisons. EDIT: Since each pass is linear, this means it's actually O(n*k) where k is the number of unique values.

When data is completely random, the absolute number of operations is well below n^2, but the algorithmic complexity is still O(n^2). For example, 100 unique random entries take on average 1800 compares, which is a lot less than 10,000, but the count does scale polynomially as n increases.

The exact worst case scenario is actually well defined: a sawtooth pattern of input values with lots of crossover but no repeat values (e.g. 100, 97, 98, 95, 96...). This is also O(n^2); 100 unique values take 2424 compares.

How to describe average I find quite hard, because real data usually has either repeats or has sections of partially ordered variables. So depending on how random the data is, it will tend from O(n)->O(n^2). Strictly speaking, most documented "averages" are based on random data.

So, finally how do the benchmarks stack up compared to the industry standard qsort? It's important to note that no effort was really made to optimise the code but the coarse scales should give you an idea of performance. Note that qsort is often thought of as being quicksort, but the implementation is entirely library dependent. This one may be a heap sort given it takes the same time no matter what is fed to it. A quicksort would easily have hit O(n^2) with some of these examples:


Time is in microseconds on a 3GHz core 2 with work bound to one CPU:

All are 10,000 entries:
Data already sorted 0 - 10000
qsort: 659
cksort: 40

Data sorted in reverse order 10000 - 0
qsort: 646
cksort: 61

Data sorted interleaved rising and falling: (0, 100, 1, 99, 2, 98...)
qsort: 651
cksort: 96

Data random with repeats every 2nd: (e.g. 54, 54, 99, 99, 33, 33...)
qsort: 643
cksort: 37

Data with 2 rising values and then a random: (e.g.: 1, 2, 54, 4, 5, 65...)
qsort: 653
cksort: 102

Random data, bound to 100 different values:
qsort: 642
cksort: 781

And now to real suckage, purely random unique data:
qsort: 658
cksort: 51770

For completeness I tried to find the crossover in n for when truly random data performed the same and it was ~475 entries.
qsort: 153
cksort: 155

So it's been a fun experiment and there's no doubt that with certain data types it could actually be quite useful, such as small data sets, partially ordered data sets, or numerous repeats, but the worst case scenario is clearly unacceptable for a general purpose algorithm if truly random data is possible. In summary, its major advantage is its extreme adaptability (adaptive algorithm), but the other disadvantages are that it is not stable, does not operate in place, and may have many swaps. cksort, like BFS, is a tool that can be useful for specific tasks. As I saw with BFS, the right algorithmic complexity in the right place is better than something that just "sounds better".

Now I wonder if I'll be inspired to try and design an O(n log n) sorting algorithm that can be parallelised...

EDIT: In response to questions about the license in the code, the GPL v3 license in the sample code is just for the sample code, not the algorithm.

EDIT2: It's worth noting that I'm still experimenting with this algorithm and hopefully I'll be able to improve it further.

Friday, 30 September 2011

cksort

I've been playing with sorting algorithms lately and wondered if the landscape of a billion out there had room for one more. I tried implementing many of the better algorithms in code and found it very hard to actually generate working code for anything beyond the most simple of selection sorts. So I set out to create my own sorting algorithm that sort of kind of worked half decently but wasn't too hard to implement. Now I'm sure some of the more complex algorithms do significantly better, and this does have some disadvantages, like the space requirements and lots of writes, but its best performance is very good and its average is decent. I have yet to find any particular data it performs badly on. However I'm not very good with the whole mathematical analysis side so I figured I'd post it here and hopefully get comments from people who are good there.

I don't really know how to describe it, but it's sort of a  selection sort that generates two sorted lists at the end, and part of the last list is the completely sorted component.


The original design did not have the bidirectional component, just putting the "high" list at the end, but this performed poorly when data is in the reverse order, whereas this one performs just as well with data already mostly sorted in either direction.

Note that almost no effort has gone into code optimisation, just to make sure it works and to calculate how many iterations and swaps it does and so on.

http://ck.kolivas.org/apps/cksort/cksort.c


EDIT: Note v0.21. This is significantly faster than the original one I posted, working bidirectionally and requiring fewer comparisons and swaps in both the average and worst cases.

/* Copyright 2011, Con Kolivas
 *
 * cksort v0.21
 *
 * Gnu Public License V3.
 *
 * cksort: A very simple sort algorithm based on selection sort but sorts
 * successive chunks of entries and scans less on each pass based on
 * knowledge of how much has been successfully sorted, and does a low and high
 * directional search to avoid worst case of reversed data. Uses an extra array
 * equal in size to the original data to do this.
 * Adaptive, not stable, best performance O(n), worst better than O(n^2) but
 * unknown. The average is unknown but seems to be ~ O(n^(3/2)) for random data.
 * It becomes faster the more sorted the original data is - even if it's
 * backwards, and the smaller the spread of data.
 *
 * gcc -o cksort -Wall -O2 cksort.c
 */

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define TOSORT 50

typedef int32_t s32;

s32 rands[TOSORT], a[TOSORT];

#define s32min -2147483647
#define s32max 2147483647

void show(void)
{
    int i;

    for (i = 0; i < TOSORT; i++)
        printf("%02d ", rands[i]);
    printf("\n");
}

int total_n, total_passes, swaps, fromsort, tosort;

int sort(void)
{
    int i, j, ai = 0, bi = 0, unsorted_end;
    s32 max = s32min, minmax = s32min;
    s32 min = s32max, maxmin = s32max;

    j = fromsort;
    /* Do a sequential sort, finding ever increasing values and
     * store them in a[] */
    for (i = fromsort; i < tosort; i++) {
        total_n++;
        if (rands[i] >= max) {
            /* Store the grouped maximums at the bottom of the a
             * array */
            max = a[ai++] = rands[i];
            if (max < maxmin)
                maxmin = max;
            swaps++;
        } else if (rands[i] <= min) {
            /* Store the grouped minimums at the top of the a
             * array */
            min = a[TOSORT - 1 - bi++] = rands[i];
            swaps++;
        } else {
            /* Group any unsorted values at the low end */
            rands[j++] = rands[i];
            swaps++;
            /* Find the largest value beyond which all data will
             * be sorted on the next pass */
            if (rands[i] > minmax)
                minmax = rands[i];
            if (rands[i] < maxmin)
                maxmin = rands[i];
        }
    }

    /* The only entries left are all sorted already */
    if (ai == tosort - fromsort)
        return 0;

    if (maxmin == s32max)
        maxmin = s32min;
    if (minmax == s32min)
        minmax = s32max;

    unsorted_end = j;
    /* Place the sorted group of low values immediately after the unsorted
     * group. This avoids small values being picked up as maximums on the
     * next pass. */
    for (i = TOSORT - bi; i < TOSORT; i++) {
        swaps++;
        /* Everything less than maxmin will be the lowest sorted values
         * and can be put in their final positions */
        if (a[i] < maxmin) {
            if (fromsort < unsorted_end) {
                s32 swap = rands[fromsort];

                swaps++;
                rands[j] = swap;
                unsorted_end++;
            }
            rands[fromsort++] = a[i];
        } else
            rands[j] = a[i];
        j++;
    }

    /* Place the sorted group of high values after the low values. After
     * the minmax value, these are now in their final positions. */
    for (i = 0; i < ai; i++) {
        swaps++;
        rands[j] = a[i];
        /* Everything greater than minmax will be the highest sorted
         * values and can be put in their final positions */
        if (a[i] >= minmax) {
            tosort = j;
            minmax = s32max;
        }
        j++;
    }
    total_passes++;
    return tosort - fromsort;
}

int main(void)
{
    struct timeval tv;
    int i;

    gettimeofday(&tv, NULL);
    srand((unsigned int)tv.tv_sec);
    for (i = 0; i < TOSORT; i++)
        rands[i] = rand() % 100;

    show();
    tosort = TOSORT;

    while (sort())
        show();

    show();

    printf("%d passes, %d comparisons, %d swaps\n", total_passes, total_n, swaps);

    return 0;
}


Sample output from 50 random variables 0-100:
07 70 66 93 18 71 59 82 30 17 25 15 24 87 66 13 29 10 05 19 34 20 75 53 33 88 79 19 99 40 94 59 62 60 52 80 32 11 14 62 81 40 30 05 79 96 70 09 59 75
 

[snip]

05 05 07 09 10 11 13 14 15 17 18 19 19 20 24 25 29 30 30 32 33 34 40 40 52 53 59 59 59 60 62 62 66 66 70 70 71 75 75 79 79 80 81 82 87 88 93 94 96 99
13 passes, 360 comparisons, 503 swaps



EDIT: Further testing confirms it's likely to be O(n log n) worst case performance which is exceedingly rare with any real data set. I still haven't accurately categorised the average performance but it's a lot better even with random data, and of course it depends very much on what sort of data it sorts. Any existing order is seized upon and it benefits from it, even if there are for example interleaved increasing and decreasing data sets. I have not found evidence that I have simply "reinvented" an existing algorithm.

Wednesday, 28 September 2011

BFS 0.411 test with skiplists

Thanks to those who've tested BFS 0.410 and found a few bugs. I've updated the test patch to 0.411 hopefully addressing all the regressions, and uploaded some incremental versions as well:

http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-411.patch

http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-411.patch

http://ck.kolivas.org/patches/bfs/test/3.0-bfs410-411.patch

http://ck.kolivas.org/patches/bfs/test/2.6.39-bfs406-411.patch

The bugs fixed in this version were to fix the Uniprocessor builds, fix the poor interactivity due to completely miscalculating deadline, and the suspend/resume regression. Please keep on testing and hopefully I can declare this one a stable release!

EDIT: I have re-tested and confirmed as reported by others, that 411 is indeed slower than 410 and actually slower than the stable 406. This is really disappointing as 410 was only faster by virtue of a bug that would favour throughput (and cause lousy interactivity). I've tried further optimising the code, but alas it appears that for our desktop workloads, skiplists are not the way forward.

EDIT2: Small improvements to be had by adding this patch, but it's still not matching 406 performance for me:
http://ck.kolivas.org/patches/bfs/test/bfs411-412test.patch

Monday, 26 September 2011

lrzip-0.608

Right on the heels of the lrzip 0.607 bugfix release is yet another release, this time with a performance boost!

http://lrzip.kolivas.org

TL;DR: Faster compression on lrzip

When I converted lrzip to work with stdin/stdout and with infinitely sized mmap windows, I needed to do an awful lot of layering of the access to the data internally to decide whether to access the data directly in the original mmapped region of ram, whether to mmap a new region of ram, or to use the moving smaller upper mmapped ram on just the portion being accessed. While this made it possible to implement features that were originally considered virtually impossible, it unfortunately also slowed it down substantially during the rzip dictionary search stage.

For example, a simple memcpy had to turn into this convoluted function:


        for (i = 0; i < n; i++) {
            memcpy(sinfo->s[stream].buf + sinfo->s[stream].buflen + i,
                   get_sb(control, p + i), 1);
        }

As you can see, it had to check it byte by byte in the get_sb function to decide where the memory was. I had tried my best to make the get_sb() function as small as possible since I knew this was a hot path, and it had ended up looking like this:

    i64 low_end = sb.offset_low + sb.size_low;

    if (unlikely(sb.offset_search > low_end))
        remap_low_sb(control);
    if (p >= sb.offset_low && p < low_end)
        return (sb.buf_low + p - sb.offset_low);
    if (p >= sb.offset_high && p < (sb.offset_high + sb.size_high))
        return (sb.buf_high + (p - sb.offset_high));
    /* p is not within the low or high buffer range */
    remap_high_sb(control, p);
    return (sb.buf_high + (p - sb.offset_high));

Not too bad, but clearly this is going to cost a heck of a lot more than just doing a simple memcpy!

Anyway it had concerned me that the more "features" I added to lrzip, the slower it got. Of course this allows one to have unlimited compression windows making for unparalleled compression ratios of massive files with long range redundancy, but most of the time people won't be using this feature and that code is just going to cost them adversely :\

A little bit of playing with profiling recently showed me that the situation was even worse than I had imagined. During the rzip phase, get_sb accounted for 37% of the consumed CPU cycles! Anyway, a simple set of proxy functions seemed the simplest answer, where it would use different versions of memcpy and get_sb depending on whether the file fits in ram or not.

static uchar *sliding_get_sb(rzip_control *control, i64 p)
{
    if (p >= sb.offset_low && p < sb.offset_low + sb.size_low)
        return (sb.buf_low + p - sb.offset_low);
    if (p >= sb.offset_high && p < (sb.offset_high + sb.size_high))
        return (sb.buf_high + (p - sb.offset_high));
    /* p is not within the low or high buffer range */
    remap_high_sb(control, p);
    return (sb.buf_high + (p - sb.offset_high));
}

static uchar *single_get_sb(rzip_control *control, i64 p)
{
    return (sb.buf_low + p);
}

/* We use a pointer to the function we actually want to use and only enable
 * the sliding mmap version if we need sliding mmap functionality as this is
 * a hot function during the rzip phase */
static uchar *(*get_sb)(rzip_control *control, i64 p);

static void sliding_mcpy(rzip_control *control, unsigned char *buf, i64 offset, i64 len)
{
    i64 i;

    for (i = 0; i < len; i++)
        memcpy(buf + i, sliding_get_sb(control, offset + i), 1);
}

static void single_mcpy(rzip_control *control, unsigned char *buf, i64 offset, i64 len)
{
    memcpy(buf, sb.buf_low + offset, len);
}

/* Since the sliding get_sb only allows us to access one byte at a time, we
 * do the same as we did with get_sb with the memcpy since one memcpy is much
 * faster than numerous memcpys 1 byte at a time */
static void (*do_mcpy)(rzip_control *control, unsigned char *buf, i64 offset, i64 len);


Voila, 30% speedup for the rzip stage on small files. This does not bring us back to the previous performance, but close to it. Unfortunately the extra level of indirection actually slowed it down even more on larger files. Then I discovered I could move one compare out of the get_sb function and recover that last bit of speed. So overall, large file rzip stage is actually slightly faster too now.

Finally I decided to not install the bash completion script by default since some distributions include a bash completion package and they conflict on install. So instead I've kept it in the tarball, but don't install it by default. A couple of other trivial changes, and I figured I should just release a stable faster release, so here's lrzip 0.608 (when freshmeat syncs up)

http://freshmeat.net/projects/long-range-zip

BFS 0.410 test with skiplists.

Hi all.

TL;DR: Test release for fastest BFS ever.

The skiplists patch has proven to be quite stable, but a couple of minor issues have shown up. First, as I knew, the ondemand behaviour would not quite match the current release, and second, SCHED_IDLEPRIO tasks weren't scheduled properly (they acted like normal tasks). However the code itself seemed quite safe otherwise. So I've tentatively put out a test release of the next version of BFS. The two main changes to the skiplist code are to bring the ondemand governor performance up to par with current BFS, and to fix the behaviour of SCHED_IDLEPRIO tasks.

Quick benchmarks:
BFS 406:
Make -j4: 26.6s
Make -j: 27.8s

BFS 410:
Make -j4: 26.4s
Make -j: 27.1s


Changelog in patch:
Implement skip lists as described here:
http://en.wikipedia.org/wiki/Skip_list
for the main priority queue.

The old queue had: O(1) insertion, O(1) removal, but a lookup involving
both a binary search of a bitmap and O(n) search through a linked list which
is very cache unfriendly as the list gets larger.

This queue is now: O(log n) insertion, O(1) lookup and O(k) removal, in a much
more cache friendly manner.

This should not compromise performance at all at very low loads, but improve
both throughput and latency as loads get higher, as confirmed by benchmarks.


Other changes: Cleanups of variable choices and micro-optimisations.

Full bfs410 patch for 3.0.x:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-410.patch

Incremental bfs410 for those on 3.0.x-bfs406:
http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

Incremental bfs410 for those on 3.0.x-ck1:
http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-410.patch

Note that make headers_check will report errors, so you may have to disable that option under Kernel hacking if it is enabled. I'm investigating why this is, but it's a harmless warning.

EDIT: Uniprocessor builds will need the following fix on top:
http://ck.kolivas.org/patches/bfs/test/bfs410-upfix.patch

EDIT2: The following fix is required for interactivity issues with the patch
http://ck.kolivas.org/patches/bfs/test/bfs410-offsetfix.patch

Saturday, 24 September 2011

BFS and skip lists

Hi all.

Just a quick post to tell you all what I've been up to with BFS.

Some of you may know about skip lists as an alternative to balanced binary search trees. They feature O(log n) insertion, lookup and removal of table entries. Anyway, I've been looking for some time at the O(n) lookup of BFS (insertion and removal are O(1)) to try and find a solution that didn't cost us at the desktop level, since O(n) with small n is very fast. The problem is of course at higher n (or server type loads), where it gets linearly slower, and the cache thrashing aspect of scanning linked lists becomes expensive.

So to cut a long story short, I've implemented the first draft of a custom version of skip lists for BFS in place of the O(n) lookup. The insertion remains O(log n), but by sorting all tasks (realtime, iso, normal, batch and idleprio) in a way that they all end up in the same table, I was able to do away with the regular linked lists and the bitmap priority lookup. Then I was able to utilise one of the features of skip lists: the first entry on the "bottom" list is always sorted to be the highest priority. This means the lookup is O(1). Then I put an extra pointer into each entry to the previous entry (the original design only points to the next entry). Finally I placed a pointer to the entry in the task struct as a way of reverse looking up the entry without any search.

So what I'm left with is an O(log n) insertion, O(1) lookup, and O(k) removal (k is the "height" of the node in question, up to 16, but usually only 1-4).
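To make the three complexities concrete, here's a minimal self-contained sketch of a doubly linked skip list along the lines described above (illustrative names only, not the actual BFS code; the task struct would hold the pointer that sl_insert returns, which is what makes removal search-free):

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 16	/* matches the "height up to 16" described above */

struct slnode {
	int key;			/* sorted priority key */
	int level;			/* height of this node, 1..MAX_LEVEL */
	struct slnode *next[MAX_LEVEL];
	struct slnode *prev[MAX_LEVEL];	/* the extra back pointers */
};

struct skiplist {
	struct slnode head;		/* sentinel; head.next[0] is the best entry */
};

static void sl_init(struct skiplist *sl)
{
	int i;

	sl->head.level = MAX_LEVEL;
	for (i = 0; i < MAX_LEVEL; i++)
		sl->head.next[i] = NULL;
}

static int random_level(void)
{
	int level = 1;

	while (level < MAX_LEVEL && (rand() & 1))
		level++;
	return level;
}

/* O(log n) insertion keeping keys sorted */
static struct slnode *sl_insert(struct skiplist *sl, int key)
{
	struct slnode *update[MAX_LEVEL], *p = &sl->head;
	struct slnode *node = calloc(1, sizeof(*node));
	int i;

	for (i = MAX_LEVEL - 1; i >= 0; i--) {
		while (p->next[i] && p->next[i]->key < key)
			p = p->next[i];
		update[i] = p;
	}
	node->key = key;
	node->level = random_level();
	for (i = 0; i < node->level; i++) {
		node->next[i] = update[i]->next[i];
		node->prev[i] = update[i];
		if (node->next[i])
			node->next[i]->prev[i] = node;
		update[i]->next[i] = node;
	}
	return node;	/* caller (think: task struct) keeps this pointer */
}

/* O(1) lookup: the first node on the bottom list is always the best */
static struct slnode *sl_first(struct skiplist *sl)
{
	return sl->head.next[0];
}

/* O(k) removal via the stored node pointer - no search needed */
static void sl_remove(struct slnode *node)
{
	int i;

	for (i = 0; i < node->level; i++) {
		node->prev[i]->next[i] = node->next[i];
		if (node->next[i])
			node->next[i]->prev[i] = node->prev[i];
	}
	free(node);
}
```

Without the prev[] pointers, unlinking a node would need another O(log n) descent to find its predecessors; with them, removal only touches the k levels the node actually occupies.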

I've implemented the sticky task used for CPU affinity by simply comparing the last sticky task to the first entry returned from the skip list. Alas I have not yet provided a good version of the sticky task being used to improve scaling governor behaviour. This means that this will likely perform worse with the ondemand governor at this stage. On the other hand, the performance governor seems to be working very nicely in my preliminary tests.

Anyway I'd love to post more, but I need to sleep, so here's some code instead (for a BFS patched 3.0.x kernel):
bfs406-skiplists.patch

Full bfs406 + skiplists patch:
http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-406-skiplists.patch

Try it out, see what you think. It seems to be running safely here but as always, there are no guarantees. All going well, you should notice pretty much no difference on a desktop. If you do any throughput benchmarks comparing before/after, I'd love to hear about them.

EDIT:
Here's the thread about it on lkml: https://lkml.org/lkml/2011/9/23/371

Note the benchmarks showing substantial speedup in the highly contended make -j case that I mention in the thread.

3.0.0:
Elapsed Time 28.7

3.0.0-bfs406:
Elapsed Time 28.5

3.0.0-bfs406-sl:
Elapsed Time 27.0

Friday, 16 September 2011

lrzip-0.607

It's been a while since lrzip received any attention while I was distracted by cgminer. However a bug report about an archive that could not be decompressed drew my attention. After a run of debugging and testing (thanks Entil!) I tracked down the bug. The nice thing was that the archives were all actually intact; the problem was purely in decompression, and it was extremely rare for an archive to be made in this way. So I fixed the bug, updated the lzma library to the latest version, added an lrzip bash completion script and a couple of other minor things, and have released version 0.607.

lrzip on freshmeat
http://lrzip.kolivas.org
http://github.com/ckolivas/lrzip

Meanwhile Michael Blumenkrantz has been working on a liblrzip library for lrzip! That means once we finalise the format, you'll be able to compile lrzip support into your applications, using functions like:

lrzip_compress(void *dest, unsigned long *dest_len, const void *source, unsigned long source_len)

This work is currently in the liblrzip branch on the github tree.
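The signature above suggests a zlib-style calling convention: the caller passes a destination buffer and its capacity, and gets the produced size back. Purely as a hypothetical illustration of that convention (the final liblrzip API may well differ, and this stand-in does no actual compression, it just copies the input):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the proposed liblrzip entry point. It only
 * copies the input so the calling convention can be demonstrated; the
 * real function would write compressed data and set *dest_len to the
 * compressed size. */
static int lrzip_compress(void *dest, unsigned long *dest_len,
			  const void *source, unsigned long source_len)
{
	if (*dest_len < source_len)
		return -1;		/* destination buffer too small */
	memcpy(dest, source, source_len);
	*dest_len = source_len;
	return 0;			/* success, zlib-style */
}
```

On entry *dest_len holds the buffer capacity; on success it is updated to the number of bytes written, mirroring how zlib's compress() handles destLen.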

Tuesday, 16 August 2011

Phoronix revisits BFS

Phoronix once benchmarked BFS in its early days. I guess at the prodding of people who suggested it here (thanks!), they've revisited it. Of course they did a throughput benchmark comparison which is kinda missing the point of BFS, but then again if they find ANY throughput benchmark where BFS is better, that's somewhere between amusing and disappointing. Anyway they did mention in passing that "However, the system with the BFS scheduler was more responsive during the test process." While this at least pays homage to what BFS really is about, it makes me wonder - what were they doing while the benchmarks ran? Benchmarks are meant to run without anything else running on the machine.

Anyway, thanks to phoronix for conducting these benchmarks and their logo suggestion.

http://www.phoronix.com/scan.php?page=article&item=bfs_two_years&num=1


Probably more interesting as a test of something besides pure throughput is the report of a server admin who was running mainline with 6000-8000 processes, whose system was falling over under that load. He then switched to BFS with no change in the usage (actually increasing), which fixed his problems. Here's the graph; see if you can guess when he switched to BFS.


Thursday, 11 August 2011

3.0.0-ck1 and BFS for 3.0.0

Surprise, I'm back and it's all ready.

http://ck.kolivas.org/patches/bfs/3.0.0/3.0-sched-bfs-406.patch

http://www.kernel.org/pub/linux/kernel/people/ck/patches/3.0/3.0.0-ck1/

Note no packages this time because the current distributions get really confused thanks to the new numbering and may not boot.

Nauru was a very depressing place indeed, really living up to the expectations I had based on what I knew of the history of it. Goddamn we (collectively, the developed world) screwed that country up big...

Monday, 25 July 2011

3.0 BFS delays

Hi all

I haven't blogged much lately because I've been distracted from kernel hacking by bitcoin mining. For some crazy reason I took it upon myself to make mining software that did what I wanted, writing it the way I write kernel code. Anyway since it's unrelated I haven't posted about it here before, but if anyone's interested, the development thread is here:

http://forum.bitcoin.org/index.php?topic=28402.0

Now about the kernel. To be honest I haven't followed the development of 3.0 almost at all, being totally pre-occupied with other things as I've taken time out from work as a sabbatical while I reassess work-life balance, long term career management (even to the point of considering changing line of work - anyone need a c programmer?) and spend time with family, friends and random other personal development things. No, I'm not quitting kernel development any time soon (again).

Anyway the thing is I'm going with Interplast next week to Nauru (of all places) as a volunteer anaesthetist for needy children for 10 days. I'm not sure if I'll find time to port BFS to 3.0 before then, or if I'll be able to do it while I'm actually there (doubtful). So just a heads up that it might be a while before we BF the 3.0 kernel.

Tuesday, 7 June 2011

Bitcoin donations

To be honest I only barely understand all the concepts behind bitcoin but it seems like a great idea. Since people have suggested real monetary donations in the form of paypal before, and I have a real career that earns me more money than kernel hacking likely ever will, I've refused. However, I'm quite happy to accept bitcoin donations. So here's my bitcoin address:

1PXcnF7vpW7rUnmaT3sbrAvzNrHLnaQSmu

EDIT:
For those who have not heard of bitcoin, it is an online currency not underwritten by any country but by mathematical computations. It's an interesting concept where the currency itself is slowly increasing in value as the rate at which "coins" are "generated" decreases with time. You can convert real money into bitcoins, and you can currently purchase a few things with them; the number of services and products you can buy with them is increasing steadily. One bitcoin is worth about 18 USD at the moment. You can actually "mine" for coins by doing work in the form of CPU or GPU computation, which is a perfectly valid way of generating "money", but generating these coins gets harder as time goes on. Currently, mining for coins with CPU power is unlikely to generate any coins directly unless you join a pool, as it would take 50 years to generate them with CPU power alone. In a pool such as http://ozco.in (which I use), a modern CPU will generate about .01 coins a day. However, GPUs (and more so ATI ones) are much better at this mining. See http://bitcoin.org to learn more about bitcoins.

Monday, 6 June 2011

Porting BFS to FreeBSD

There is an interesting Google Summer of Code project underway to port BFS to FreeBSD and see how it compares to their mainline CPU schedulers, both the V4 scheduler and ULE. The GSoC student porting it has a rudimentary version of the BFS algorithm working based on the V4 scheduler and has started testing it. His progress will be worth watching, and the first results he's getting are already interesting.

His blog is here: rudot.blog.com

Sunday, 5 June 2011

2.6.39-ck2, bfs-0.406

These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but suitable for any commodity hardware workload.


Apply to 2.6.39(.x):
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.39/2.6.39-ck2/patch-2.6.39-ck2.bz2

Ubuntu packages (2.6.39-ck1-3 is equivalent to 2.6.39-ck2):
http://ck.kolivas.org/patches/Ubuntu%20Packages/

Broken out tarball:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.39/2.6.39-ck2/2.6.39-ck2-broken-out.tar.bz2

Discrete patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/2.6/2.6.39/2.6.39-ck2/patches/

All -ck patches:
http://www.kernel.org/pub/linux/kernel/people/ck/patches/

BFS by itself:
http://ck.kolivas.org/patches/bfs/

Web:
http://kernel.kolivas.org

Code blog when I feel like it:
http://ck-hack.blogspot.com/

Each discrete patch contains a brief description of what it does at the top of the patch itself.


The only change from 2.6.39-ck1 is an upgrade to BFS CPU scheduler version 0.406. A bug that would cause hangs due to an incompatibility with the new block plug flushing code and BFS was fixed. For those who tried the "bfs404-test9" patch, this is only trivially different apart from the bfs version change.

Full patchlist:
2.6.39-sched-bfs-406.patch
sched-add-above-background-load-function.patch
mm-zero_swappiness.patch
mm-enable_swaptoken_only_when_swap_full.patch
mm-drop_swap_cache_aggressively.patch
mm-kswapd_inherit_prio-1.patch
mm-background_scan.patch
mm-idleprio_prio-1.patch
mm-lru_cache_add_lru_tail-1.patch
mm-decrease_default_dirty_ratio.patch
kconfig-expose_vmsplit_option.patch
hz-default_1000.patch
hz-no_default_250.patch
hz-raise_max.patch
preempt-desktop-tune.patch
ck2-version.patch

Please enjoy!
お楽しみください
--
-ck

Saturday, 4 June 2011

lrzip tarball of all 40 linux-2.6 kernels

With the 2.6 linux kernel now officially finished, I'm providing a tarball of all 40 of the three point kernel releases as an lrzip tarball. This is a convenient way of getting all the releases in a relatively low bandwidth form, and previous archives of this nature have had a few downloads, so I figured I'd complete the archive for those who want to download it, and use it as an ad for lrzip:

163.9MB:
linux-2.6.0-2.6.39.tar.lrz

10.3GB:
linux-2.6.0-2.6.39.tar

The original file is a tarball of all three point linux kernel release tarballs from 2.6.0 to 2.6.39, coming to a grand total of 10.3GB.

Of course this is a plug for lrzip on something it is particularly good at compressing.

lrzip can be obtained here:
http://lrzip.kolivas.org

About this archive: It was compressed with lrzip version 0.606 on a quad core 3GHz core 2 on a relatively slow external USB2 hard drive with the following options:

lrzip -UL 9 linux-2.6.0-2.6.39.tar
Total time: 00:56:19.88

It would have compressed a lot faster without the -L 9 option, but given this is the "final" archive of 2.6, I figured I'd push it a bit further. Lrzip can compress it even further with zpaq as an option, but it makes decompression much slower so I'd personally find the archive less useful.

Of course someone will ask how it compares to xz, so for completeness:

xz -9 linux-2.6.0-2.6.39.tar
Total time: 2:05:32.218

11067473920 linux-2.6.0-2.6.39.tar
1535618848 linux-2.6.0-2.6.39.tar.xz 13.8%
171879382 linux-2.6.0-2.6.39.tar.lrz 1.6%

EDIT:
I figured I'd do the rest as well.

Here is a tarball of all 161 stable three point releases from linux-1.0 to linux-2.6.39:

211MB:
linux-1.0-2.6.39.tar.lrz

19617064960 linux-1.0-2.6.39.tar
221368298 linux-1.0-2.6.39.tar.lrz 1.1%

As compressing this workload is mostly I/O bound, I performed it on an SSD drive this time:

linux-1.0-2.6.39.tar - Compression Ratio: 88.617. Average Compression Speed: 8.026MB/s.
Total time: 00:38:50.76

Friday, 3 June 2011

2.6.39 BFS test 9 - is this the one?

Hopefully this test patch should fix all the problems with BFS 404 on 2.6.39:

bfs404-test9.patch
Ubuntu Packages : grab the 2.6.39-ck1-3 package

Please report back if you haven't already! Thanks to everyone who has tested so far! Your feedback has been absolutely essential on this weird and wonderful bug.

Monday, 30 May 2011

2.6.39 BFS progress

TL;DR: 2.6.39 BFS fixed maybe?

After walking away from the code for a while, annoyed at the bug I couldn't track down, I had another good look at what might be happening. It appears that while the grq lock is dropped in schedule() to perform the block plug flush, a wakeup of the task via try_to_wake_up may be missed entirely, leaving the task deactivated when it should actually keep running. Anyway, the first tests from the people in these blog comments are reassuring.

Here is a cleaned up and slightly modified version of the "test8" patch that has so far been stable and shows to have fixed the problem for a handful of people:

Apply to 2.6.39-ck1 or 2.6.39 with BFS 404:
bfs404-recheck_unplugged.patch

In response to requests for packaged versions, I've uploaded a 2.6.39-ck1-2 ubuntu package which includes this change:
Ubuntu Packages

Please test and report back! If this fixes the problem, I'll be releasing it as ck2.