Sunday, 2 October 2011

cksort follow up.

I've spent a few days playing with cksort to try and improve it with behaviour that I was planning on, and to try and categorise exactly what sort of sort (heh) it is and to analyse its algorithmic complexity. I know people haven't found this as interesting as my usual stuff, but I feel like blogging it anyway, and maybe some use will come of this code.

cksort v0.21:
http://ck.kolivas.org/apps/cksort/cksort.c
To describe the sort, it's an extension of a selection sort using external storage to substantially improve the speed of sorting certain forms of data and the worst case scenario.

I'll use this sample of 20 for simplicity:


 63 98 22 30 84 73 23 51 10 57 04 73 74 11 16 27 65 22 25 25

It starts by scanning over the data linearly, sorting data into 3 buckets:
The first bucket contains the first entry, and then collects any entries equal to or greater than the last entry put into it.


63 98

The second bucket contains the first item rejected by the 1st bucket and then collects any entries equal to or smaller than the last entry put into it.


22 10 04


The last bucket stores any data not falling into either of the previous categories (unsorted).

30 84 73 23 51 57 73 74 11 16 27 65

To minimise the amount of extra storage required, only one extra buffer equal in size to the original data is used, and the data from buckets 1 and 2 are placed at the beginning and end of the same buffer, while the unsorted data gets compacted in the original storage.

The highest and lowest values in the unsorted are store for comparison.

Then the main data is reconstructed as follows:

Any values from bucket 2 below the lowest value from unsorted are put at the beginning of the main buffer and considered in their final position and will not be scanned again.
Then all unsorted values are next in line.
Then the remainder of bucket 2 data is placed in the main buffer in ascending order.
Then bucket 1 is placed in ascending order, and any that are larger than the highest value from unsorted are considered in their final position.

 04 10 73 23 51 57 73 74 11 16 27 65 22 25 25 30 84 22 63 98

Then the scan is repeated in ever decreasing sections of the middle until none are left.

04 10 11 57 16 27 65 22 25 25 30 22 63 51 23 73 73 74 84 98
04 10 11 16 22 25 25 30 22 63 51 23 27 57 65 73 73 74 84 98
04 10 11 16 51 23 27 57 22 22 25 25 30 63 65 73 73 74 84 98
04 10 11 16 22 22 23 30 27 25 25 51 57 63 65 73 73 74 84 98
04 10 11 16 22 22 23 25 25 27 30 51 57 63 65 73 73 74 84 98


Trying to categorise this was difficult for a while but fortunately I discussed it with a few friends online and we were able to come up with something.


So in this overcrowded world of >9000 sorts, there needs to be some advantage or else it's a useless addition. Now I'm not claiming this is going to be something special but I'm doing this to fulfil some intellectual curiosity. Now please, I'm not that good with the algorithmic complexity details, so if I've described something wrong, please feel free to correct me (because no one ever does that on the internet!)

The advantages are with how it manages partially sorted data, and repeats of data. A set of fully sorted data in ascending order takes one pass, and after the data is placed into the backup array, it can abort out without any further swaps, so the comparisons and swaps equal n with one pass. O(n).

When data is already sorted in reverse order, the 2nd bucket picks it all up and places it in correct order. This is one set of comparisons per n, and 2 swaps per entry, O(n). This is somewhere very few sort algorithms do well.

When there are a limited number of values the variables can take, the number of passes cksort spends is bound by half that number of variables, no matter how many entries there are. e.g. if you have 1 million entries of values 0-100, it will only ever take up to 50 passes, needing ~25 million comparisons. EDIT: This means it's actually O(k) where k is the number of unique entries.

When data is completely random, there is some kind of function describing its relationship, and it is significantly less than (absolutely) n^2 but the algorithmic complexity is O(n^2). For example, 100 unique random entries take on average 1800 compares which is a lot less than 10,000 but does scale in polynomial fashion as the number of n increases.

The exact worst case scenario is actually well defined as a sawtooth pattern of input values with lots of crossover but no repeat values (e.g. 100, 97, 98, 95, 96...) and is also O(n^2) and 100 unique values takes 2424 compares.

How to describe average I find quite hard, because real data usually has either repeats or has sections of partially ordered variables. So depending on how random the data is, it will tend from O(n)->O(n^2). Strictly speaking, most documented "averages" are based on random data.

So, finally how do the benchmarks stack up compared to the industry standard qsort? It's important to note that no effort was really made to optimise the code but the coarse scales should give you an idea of performance. Note that qsort is often thought of as being quicksort, but the implementation is entirely library dependent. This one may be a heap sort given it takes the same time no matter what is fed to it. A quicksort would easily have hit O(n^2) with some of these examples:


Time is in microseconds on a 3GHz core 2 with work bound to one CPU:

All are 10,000 entries:
Data already sorted 0 - 10000
qsort: 659
cksort: 40

Data sorted in reverse order 10000 - 0
qsort: 646
cksort: 61

Data sorted interleaved rising and falling: (0, 100, 1, 99, 2, 98...)
qsort: 651
cksort: 96

Data random with repeats every 2nd: (e.g. 54, 54, 99, 99, 33, 33...)
qsort: 643
cksort: 37

Data with 2 rising values and then a random: (e.g.: 1, 2, 54, 4, 5, 65...)
qsort: 653
cksort: 102

Random data, bound to 100 different values:
qsort: 642
cksort: 781

And now to real suckage, purely random unique data:
qsort: 658
cksort: 51770

For completeness I tried to find the crossover in n for when truly random data performed the same and it was ~475 entries.
qsort: 153
cksort: 155

So it's been a fun experiment and there's no doubt that with certain data types it could actually be quite useful, such as small data sets, partially ordered data sets, or numerous repeats, but the worst case scenario is clearly unacceptable for a general purpose algorithm if truly random data is possible. In summary, its major advantage is its extreme adaptability (adaptive algorithm), but the other disadvantages are that it is not stable, does not operate in place, and may have many swaps. cksort, like BFS, is a tool that can be useful for specific tasks. As I saw with BFS, the right algorithmic complexity in the right place is better than something that just "sounds better".

Now I wonder if I'll be inspired to try and design an O(n log n) sorting algorithm that can be parallelised...

EDIT: In response to questions about the license in the code, the GPL v3 license in the sample code is just for the sample code, not the algorithm.

EDIT2: It's worth noting that I'm still experimenting with this algorithm and hopefully I'll be able to improve it further.

Friday, 30 September 2011

cksort

I've been playing with sorting algorithms lately and wondered if the landscape of a billion out there had room for one more. I tried implementing many of the better algorithms in code and found it very hard to actually generate working code for anything beyond the most simple of selection sorts. So I set out to create my own sorting algorithm that sort of kind of worked half decently but wasn't too hard to implement. Now I'm sure some of the more complex algorithms do significantly better, and this does have some disadvantages, like the space requirements and lots of writes, but its best performance is very good and its average is decent. I have yet to find any particular data it performs badly on. However I'm not very good with the whole mathematical analysis side so I figured I'd post it here and hopefully get comments from people who are good there.

I don't really know how to describe it, but it's sort of a  selection sort that generates two sorted lists at the end, and part of the last list is the completely sorted component.


The original design did not have the bidirectional component, just putting the "high" list at the end, but this performed poorly when data is in the reverse order, whereas this one performs just as well with data already mostly sorted in either direction.

Note that almost no effort has gone into code optimisation, just to make sure it works and to calculate how many iterations and swaps it does and so on.

http://ck.kolivas.org/apps/cksort/cksort.c


EDIT: Note v0.21. This is significantly faster than the original one I posted working bidirectionally requiring less comparisons and swaps on average and worst case.

/* Copyright 2011, Con Kolivas
 *
 * cksort v0.21
 *
 * Gnu Public License V3.
 *
 * cksort: A very simple sort algorithm based on selection sort but sorts
 * successive chunks of entries and scans less on each pass based on
 * knowledge of how much has been successfully sorted, and does a low and high
 * directional search to avoid worst case of reversed data. Uses an extra array
 * equal in size to the original data to do this.
 * Adaptive, not stable, best performance O(n), worst better than O(n^2) but
 * unknown. The average is unknown but seems to be ~ O(n^(3/2)) for random data.
 * It becomes faster the more sorted the original data is - even if it's
 * backwards, and the smaller the spread of data.
 *
 * gcc -o cksort -Wall -O2 cksort.c
 */

#include
#include
#include

#define TOSORT 50

typedef int32_t s32;

s32 rands[TOSORT], a[TOSORT];

#define s32min -2147483647
#define s32max 2147483647

void show(void)
{
    int i;

    for (i = 0; i < TOSORT; i++)
        printf("%02d ", rands[i]);
    printf("\n");
}

int total_n, total_passes, swaps, fromsort, tosort;

int sort(void)
{
    int i, j, ai = 0, bi = 0, unsorted_end;
    s32 max = s32min, minmax = s32min;
    s32 min = s32max, maxmin = s32max;

    j = fromsort;
    /* Do a sequential sort, finding ever increasing values and
     * store them in a[] */
    for (i = fromsort; i < tosort; i++) {
        total_n++;
        if (rands[i] >= max) {
            /* Store the grouped maximums at the bottom of the a
             * array */
            max = a[ai++] = rands[i];
            if (max < maxmin)
                maxmin = max;
            swaps++;
        } else if (rands[i] <= min) {
            /* Store the grouped minimums at the top of the a
             * array */
            min = a[TOSORT - 1 - bi++] = rands[i];
            swaps++;
        } else {
            /* Group any unsorted values at the low end */
            rands[j++] = rands[i];
            swaps++;
            /* Find the largest value beyond which all data will
             * be sorted on the next pass */
            if (rands[i] > minmax)
                minmax = rands[i];
            if (rands[i] < maxmin)
                maxmin = rands[i];
        }
    }

    /* The only entries left are all sorted already */
    if (ai == tosort - fromsort)
        return 0;

    if (maxmin == s32max)
        maxmin = s32min;
    if (minmax == s32min)
        minmax = s32max;

    unsorted_end = j;
    /* Place the sorted group of low values immediately after the unsorted
     * group. This avoids small values being picked up as maximums on the
     * next pass. */
    for (i = TOSORT - bi; i < TOSORT; i++) {
        swaps++;
        /* Everything less than maxmin will be the lowest sorted values
         * and can be put in their final positions */
        if (a[i] < maxmin) {
            if (fromsort < unsorted_end) {
                s32 swap = rands[fromsort];

                swaps++;
                rands[j] = swap;
                unsorted_end++;
            }
            rands[fromsort++] = a[i];
        } else
            rands[j] = a[i];
        j++;
    }

    /* Place the sorted group of high values after the low values. After
     * the minmax value, these are now in their final positions. */
    for (i = 0; i < ai; i++) {
        swaps++;
        rands[j] = a[i];
        /* Everything greater than minmax will be the highest sorted
         * values and can be put in their final positions */
        if (a[i] >= minmax) {
            tosort = j;
            minmax = s32max;
        }
        j++;
    }
    total_passes++;
    return tosort - fromsort;
}

int main(void)
{
    struct timeval tv;
    int i;

    gettimeofday(&tv, NULL);
    srand((unsigned int)tv.tv_sec);
    for (i = 0; i < TOSORT; i++)
        rands[i] = rand() % 100;

    show();
    tosort = TOSORT;

    while (sort())
        show();

    show();

    printf("%d passes, %d comparisons, %d swaps\n", total_passes, total_n, swaps);

    return 0;
}


Sample output from 50 random variables 0-100:
07 70 66 93 18 71 59 82 30 17 25 15 24 87 66 13 29 10 05 19 34 20 75 53 33 88 79 19 99 40 94 59 62 60 52 80 32 11 14 62 81 40 30 05 79 96 70 09 59 75
 

[snip]

05 05 07 09 10 11 13 14 15 17 18 19 19 20 24 25 29 30 30 32 33 34 40 40 52 53 59 59 59 60 62 62 66 66 70 70 71 75 75 79 79 80 81 82 87 88 93 94 96 99
13 passes, 360 comparisons, 503 swaps



EDIT: Further testing confirms it's likely to be O(n log n) worst case performance which is exceedingly rare with any real data set. I still haven't accurately categorised the average performance but it's a lot better even with random data, and of course it depends very much on what sort of data it sorts. Any existing order is seized upon and it benefits from it, even if there are for example interleaved increasing and decreasing data sets. I have not found evidence that I have simply "reinvented" an existing algorithm.

Wednesday, 28 September 2011

BFS 0.411 test with skiplists

Thanks to those who've tested BFS 0.410 and found a few bugs. I've updated the test patch to 0.411 hopefully addressing all the regressions, and uploaded some incremental versions as well:

http://ck.kolivas.org/patches/bfs/test/3.0-sched-bfs-411.patch

http://ck.kolivas.org/patches/bfs/test/3.0-bfs406-410.patch

http://ck.kolivas.org/patches/bfs/test/ck1-bfs406-411.patch

http://ck.kolivas.org/patches/bfs/test/3.0-bfs410-411.patch

http://ck.kolivas.org/patches/bfs/test/2.6.39-bfs406-411.patch

The bugs fixed in this version were to fix the Uniprocessor builds, fix the poor interactivity due to completely miscalculating deadline, and the suspend/resume regression. Please keep on testing and hopefully I can declare this one a stable release!

EDIT: I have re-tested and confirmed as reported by others, that 411 is indeed slower than 410 and actually slower than the stable 406. This is really disappointing as 410 was only faster by virtue of a bug that would favour throughput (and cause lousy interactivity). I've tried further optimising the code, but alas it appears that for our desktop workloads, skiplists are not the way forward.

EDIT2: Small improvements to be had by adding this patch, but it's still not matching 406 performance for me:
http://ck.kolivas.org/patches/bfs/test/bfs411-412test.patch

Monday, 26 September 2011

lrzip-0.608

Right on the heels of the lrzip 0.607 bugfix release is yet another release, this time with a performance boost!

http://lrzip.kolivas.org

TL;DR: Faster compression on lrzip

When I converted lrzip to work with stdin/stdout and with infinitely sized mmap windows, I needed to do an awful lot of layering of the access to the data internally to decide whether to access the data directly in the original mmapped region of ram, whether to mmap a new region of ram, or to use the moving smaller upper mmapped ram on just the portion being accessed. While this made it possible to implement features that were originally considered virtually impossible, it unfortunately also slowed it down substantially during the rzip dictionary search stage.

For example, a simple memcpy had to turn into this convoluted function:


        for (i = 0; i < n; i++) {
            memcpy(sinfo->s[stream].buf + sinfo->s[stream].buflen + i,
                   get_sb(control, p + i), 1);
        }

As you can see, it had to check it byte by byte in the get_sb function to decide where the memory was. I had tried my best to make the get_sb() function as small as possible since I knew this was a hot path, and it had ended up looking like this:

    i64 low_end = sb.offset_low + sb.size_low;

    if (unlikely(sb.offset_search > low_end))
        remap_low_sb(control);
    if (p >= sb.offset_low && p < low_end)
        return (sb.buf_low + p - sb.offset_low);
    if (p >= sb.offset_high && p < (sb.offset_high + sb.size_high))
        return (sb.buf_high + (p - sb.offset_high));
    /* p is not within the low or high buffer range */
    remap_high_sb(control, p);
    return (sb.buf_high + (p - sb.offset_high));

Not too bad, but clearly this is going to cost a heck of a lot more than just doing a simple memcpy!

Anyway it had concerned me that the more "features" I added to lrzip, the slower it got. Of course this allows one to have unlimited compression windows making for unparalleled compression ratios of massive files with long range redundancy, but most of the time people won't be using this feature and that code is just going to cost them adversely :\

A little bit of playing with profiling recently showed me that the situation was even worse than I had imagined. During the rzip phase, get_sb accounted for 37% of the consumed CPU cycles! Anyway, a simple set of proxy functions seemed the simplest answer, where it would use different versions of memcpy and get_sb depending on whether the file fits in ram or not.

static uchar *sliding_get_sb(rzip_control *control, i64 p)
{
    if (p >= sb.offset_low && p < sb.offset_low + sb.size_low)
        return (sb.buf_low + p - sb.offset_low);
    if (p >= sb.offset_high && p < (sb.offset_high + sb.size_high))
        return (sb.buf_high + (p - sb.offset_high));
    /* p is not within the low or high buffer range */
    remap_high_sb(control, p);
    return (sb.buf_high + (p - sb.offset_high));
}

static uchar *single_get_sb(rzip_control *control, i64 p)
{
    return (sb.buf_low + p);
}

/* We use a pointer to the function we actually want to use and only enable
 * the sliding mmap version if we need sliding mmap functionality as this is
 * a hot function during the rzip phase */
static uchar *(*get_sb)(rzip_control *control, i64 p);

static void sliding_mcpy(rzip_control *control, unsigned char *buf, i64 offset, i64 len)
{
    i64 i;

    for (i = 0; i < len; i++)
        memcpy(buf + i, sliding_get_sb(control, offset + i), 1);
}

static void single_mcpy(rzip_control *control, unsigned char *buf, i64 offset, i64 len)
{
    memcpy(buf, sb.buf_low + offset, len);
}

/* Since the sliding get_sb only allows us to access one byte at a time, we
 * do the same as we did with get_sb with the memcpy since one memcpy is much
 * faster than numerous memcpys 1 byte at a time */
static void (*do_mcpy)(rzip_control *control, unsigned char *buf, i64 offset, i64 len);


Voila, 30% speedup for the rzip stage on small files. This does not bring us back to the previous performance, but close to it. Unfortunately the extra level of indirection actually slowed it down even more on larger files. Then I discovered I could move one compare out of the get_sb function and recover that last bit of speed. So overall, large file rzip stage is actually slightly faster too now.

Finally I decided to not install the bash completion script by default since some distributions include a bash completion package and they conflict on install. So instead I've kept it in the tarball, but don't install it by default. A couple of other trivial changes, and I figured I should just release a stable faster release, so here's lrzip 0.608 (when freshmeat syncs up)

http://freshmeat.net/projects/long-range-zip