Sunday, 2 October 2011

cksort follow up.

I've spent a few days playing with cksort to try and improve it with behaviour that I was planning on, and to try and categorise exactly what sort of sort (heh) it is and to analyse its algorithmic complexity. I know people haven't found this as interesting as my usual stuff, but I feel like blogging it anyway, and maybe some use will come of this code.

cksort v0.21:
To describe the sort, it's an extension of a selection sort using external storage to substantially improve the speed of sorting certain forms of data and the worst case scenario.

I'll use this sample of 20 for simplicity:

 63 98 22 30 84 73 23 51 10 57 04 73 74 11 16 27 65 22 25 25

It starts by scanning over the data linearly, sorting data into 3 buckets:
The first bucket contains the first entry, and then collects any entries equal to or greater than the last entry put into it.

63 98

The second bucket contains the first item rejected by the 1st bucket and then collects any entries equal to or smaller than the last entry put into it.

22 10 04

The last bucket stores any data not falling into either of the previous categories (unsorted).

30 84 73 23 51 57 73 74 11 16 27 65

To minimise the amount of extra storage required, only one extra buffer equal in size to the original data is used, and the data from buckets 1 and 2 are placed at the beginning and end of the same buffer, while the unsorted data gets compacted in the original storage.

The highest and lowest values in the unsorted are store for comparison.

Then the main data is reconstructed as follows:

Any values from bucket 2 below the lowest value from unsorted are put at the beginning of the main buffer and considered in their final position and will not be scanned again.
Then all unsorted values are next in line.
Then the remainder of bucket 2 data is placed in the main buffer in ascending order.
Then bucket 1 is placed in ascending order, and any that are larger than the highest value from unsorted are considered in their final position.

 04 10 73 23 51 57 73 74 11 16 27 65 22 25 25 30 84 22 63 98

Then the scan is repeated in ever decreasing sections of the middle until none are left.

04 10 11 57 16 27 65 22 25 25 30 22 63 51 23 73 73 74 84 98
04 10 11 16 22 25 25 30 22 63 51 23 27 57 65 73 73 74 84 98
04 10 11 16 51 23 27 57 22 22 25 25 30 63 65 73 73 74 84 98
04 10 11 16 22 22 23 30 27 25 25 51 57 63 65 73 73 74 84 98
04 10 11 16 22 22 23 25 25 27 30 51 57 63 65 73 73 74 84 98

Trying to categorise this was difficult for a while but fortunately I discussed it with a few friends online and we were able to come up with something.

So in this overcrowded world of >9000 sorts, there needs to be some advantage or else it's a useless addition. Now I'm not claiming this is going to be something special but I'm doing this to fulfil some intellectual curiosity. Now please, I'm not that good with the algorithmic complexity details, so if I've described something wrong, please feel free to correct me (because no one ever does that on the internet!)

The advantages are with how it manages partially sorted data, and repeats of data. A set of fully sorted data in ascending order takes one pass, and after the data is placed into the backup array, it can abort out without any further swaps, so the comparisons and swaps equal n with one pass. O(n).

When data is already sorted in reverse order, the 2nd bucket picks it all up and places it in correct order. This is one set of comparisons per n, and 2 swaps per entry, O(n). This is somewhere very few sort algorithms do well.

When there are a limited number of values the variables can take, the number of passes cksort spends is bound by half that number of variables, no matter how many entries there are. e.g. if you have 1 million entries of values 0-100, it will only ever take up to 50 passes, needing ~25 million comparisons. EDIT: This means it's actually O(k) where k is the number of unique entries.

When data is completely random, there is some kind of function describing its relationship, and it is significantly less than (absolutely) n^2 but the algorithmic complexity is O(n^2). For example, 100 unique random entries take on average 1800 compares which is a lot less than 10,000 but does scale in polynomial fashion as the number of n increases.

The exact worst case scenario is actually well defined as a sawtooth pattern of input values with lots of crossover but no repeat values (e.g. 100, 97, 98, 95, 96...) and is also O(n^2) and 100 unique values takes 2424 compares.

How to describe average I find quite hard, because real data usually has either repeats or has sections of partially ordered variables. So depending on how random the data is, it will tend from O(n)->O(n^2). Strictly speaking, most documented "averages" are based on random data.

So, finally how do the benchmarks stack up compared to the industry standard qsort? It's important to note that no effort was really made to optimise the code but the coarse scales should give you an idea of performance. Note that qsort is often thought of as being quicksort, but the implementation is entirely library dependent. This one may be a heap sort given it takes the same time no matter what is fed to it. A quicksort would easily have hit O(n^2) with some of these examples:

Time is in microseconds on a 3GHz core 2 with work bound to one CPU:

All are 10,000 entries:
Data already sorted 0 - 10000
qsort: 659
cksort: 40

Data sorted in reverse order 10000 - 0
qsort: 646
cksort: 61

Data sorted interleaved rising and falling: (0, 100, 1, 99, 2, 98...)
qsort: 651
cksort: 96

Data random with repeats every 2nd: (e.g. 54, 54, 99, 99, 33, 33...)
qsort: 643
cksort: 37

Data with 2 rising values and then a random: (e.g.: 1, 2, 54, 4, 5, 65...)
qsort: 653
cksort: 102

Random data, bound to 100 different values:
qsort: 642
cksort: 781

And now to real suckage, purely random unique data:
qsort: 658
cksort: 51770

For completeness I tried to find the crossover in n for when truly random data performed the same and it was ~475 entries.
qsort: 153
cksort: 155

So it's been a fun experiment and there's no doubt that with certain data types it could actually be quite useful, such as small data sets, partially ordered data sets, or numerous repeats, but the worst case scenario is clearly unacceptable for a general purpose algorithm if truly random data is possible. In summary, its major advantage is its extreme adaptability (adaptive algorithm), but the other disadvantages are that it is not stable, does not operate in place, and may have many swaps. cksort, like BFS, is a tool that can be useful for specific tasks. As I saw with BFS, the right algorithmic complexity in the right place is better than something that just "sounds better".

Now I wonder if I'll be inspired to try and design an O(n log n) sorting algorithm that can be parallelised...

EDIT: In response to questions about the license in the code, the GPL v3 license in the sample code is just for the sample code, not the algorithm.

EDIT2: It's worth noting that I'm still experimenting with this algorithm and hopefully I'll be able to improve it further.


  1. sorry for my OT :(
    I need patch-3.0.0-ck1.bz2 but is down, where can i download it?

  2. I've placed them in here for now (needs lrzip):

  3. The "numerous repeats" advantage may make this useful in DBMS (which keeps statistics) to sort non-indexed column with limited number of distinct values.