
Monday, 17 October 2011

BFS 0.413, BFS benchmarks, (and more cksort stuff)

Finally some BFS news, yay \o/

TL;DR: Minor unnoticeable tweaks for BFS.

Incremental:
3.0-bfs-406-413.patch

Incremental for 3.0.x-ck1:
3.0-ck1-bfs-406-413.patch

Full patch for 3.0.x:
3.0-sched-bfs-413.patch

Alas there isn't anything earth-shattering here. The skiplist implementation experiment failed to be of any benefit despite my best intentions, so I've abandoned the idea for the time being. Meanwhile, there were a number of minor improvements that I had considered for BFS (and I mean really, really minor) and a few cleanups that had been waiting. So here's the changelog for the new BFS for 3.0:


Improve the preempt call by explicitly setting a task that will be used on
lookup in earliest_deadline_task to avoid iterating over all tasks. Set the
effective runqueue priority for further calls to try_preempt based on the
preempting task's priority. Thus on the small chance that something else
tries to preempt the runqueue before the new task gets the CPU, it will
compare against the previously successful preempting task. This should
prevent a preempting task from rescheduling a runqueue only for the CPU
to then take a different task instead of the preempting one.

Clean up a number of variables from unnecessary longs to ints and ints to
bools.

Microoptimise around try_preempt by setting highest_prio to that of the
task that's trying to preempt to avoid unnecessary comparisons.

Break sole affinity on cpu offline -after- the cpu has been set to be offline. 

I'd be surprised if anyone noticed anything different with these changes, but then, there have been some surprises from apparently innocuous changes in the past, so I'm keen to get feedback from testers. There is a small chance that the preempt changes can improve throughput by avoiding unnecessary rescheduling, and that the cpu offline changes might help rare scenarios where a machine will fail to suspend to ram.

Meanwhile, graysky, who maintains the unofficial ck repos for Arch Linux, has very kindly put together a number of throughput benchmarks comparing BFS to mainline on machines from 2x up to 16x, in a nicely presented PDF which you can download and read here:
 
repo-ck.com/bench/benchmark.pdf


On the cksort front, it's clear to me that sorting algorithms are pretty boring to most people, despite my transient interest in them and developing my own. Nevertheless, the cksort algorithm has been further improved. After playing with it some more, and learning from the legendary work of the past, I modified the merge sort that cksort "tips" to so that it includes an insertion sort as well, making the algorithm a blended insertion+merge sort once it tips away from the presort+node-merge algorithm, which is basically the only unique part of cksort. This capitalises on the fact that insertion sort is pretty much the fastest way to sort up to about 32 entries: the data is split into groups of up to 32 entries, insertion sort is run on each group, and then a non-recursive merge sort combines the groups.

This has further improved cksort to now be faster than qsort for up to 1500 entries, and now equally fast with library qsort() up to 10000 (uniquely random) entries. Beyond that it is still slower than qsort, but it has -no- corner cases where it will ever become O(n^2) and will always be O(n log n) bound, unlike qsort.

Version 0.44 of cksort is here:
http://ck.kolivas.org/apps/cksort/cksort.c

And in response to requests for the benchmarking code vs qsort it is here:
http://ck.kolivas.org/apps/cksort/benchsort.c

Now Aldo Cortesi has a very interesting static view of comparison sort algorithms here:
http://sortvis.org/

When I contacted him, he was kind enough to modify his code so that it would take output from something like cksort to generate a graph for any sorting algorithm. Alas, cksort uses a separate array for its work, and looking only at the final data array does not tell the entire story of what's going on, but it does help further visualise how cksort works.

Here is a view of 20 random variables being sorted and you can see the 2 phases of cksort where it builds up the nodes first, and the 2nd phase where it merges the nodes (click to enlarge):


It's important to note that the 2nd stage graph is fairly arbitrary since the data from the unmerged nodes is not actually written back to the original array until it's merged, but you should be able to make out the final sorted data appearing from the outside in.

Sorting non-random entries is really what makes cksort interesting and here is a (rather complex) visualisation of the 50 mostly-sorted example in the animated form I showed on my blog previously:


Anyway, it's been a lot of fun playing with this stuff, and maybe one day I'll come up with something more useful from the algorithm. I played with an in-place version of the code which didn't need a second array, but the amount of memory moving required made the algorithm slower and more complex, so I soon lost interest in developing that for the time being. I'd like to find a way to make the presort+node-merging aspect scale to any size instead of tipping to a different algorithm, but large random data sets simply have no pattern to exploit. I still think there may be something like a heap structure that would help, but I haven't found anything that works in practice. Maybe I'll come back to this in the future. Meanwhile it works extremely well for smaller data sets, partially ordered data, reverse ordered data, and few-unique data sets. Interestingly, it coincidentally shares a lot of its ideas with Timsort. All I can say after all this is that the work of the legends of the past, coming up with something like Smoothsort, is truly amazing.

Tuesday, 11 October 2011

cksort and O(n^2) tipping point

I've spent some time modelling the behaviour of the last cksort I posted here to investigate where and why it becomes O(n^2). It turns out the prescan which creates the nodes ends up scanning a large amount of the data repeatedly when the data is purely random. As mentioned here previously, when there are up to ~1500 unique random entries, it is faster than qsort, and beyond that it gets slower. When there is already some sorted data or repetition, it remains faster even when the number of entries becomes very large. So I tried to model where the tipping point was, to determine when to change the sort mode, and it turns out that if it is unable to generate a node which is a decent proportion of the original data, it is likely to only generate yet smaller nodes on each pass. With the help of a bit of trial and error and weak modelling, I found that it would stay ahead of a classic O(n log n) sort when it was able to reliably sort the data into 32 nodes or fewer. There are occasions where it is faster with more nodes, but it is always faster with fewer. So I integrated a classic merge sort into the algorithm and made it switch to classic merge sort should it fail to meet this threshold. This does not significantly increase the total number of compares/swaps, and means the algorithm is always O(n log n) bound.

So the latest version of cksort now has some nice properties, being O(n) in certain ideal scenarios, being very adaptive to patterns in either direction or lots of repeated data, yet always being bound by O(n log n) in complexity.

Just about any O(n log n) algorithm could be used beyond the tipping point, but given the intrinsic nature of cksort uses a merge, it seemed intuitive to extend that to using classic merge as well. I experimented with using a heap sort once the data was pre-sorted since it almost became a heap structure naturally after the pre-sort, but 2 things became clear: First, the heap sort is actually slower than merge sort on average. And second, and more importantly, the pre-sort phase was actually the O(n^2) component in the original sort so it didn't make sense to complete this phase.

(EDIT updated version) Version 0.43:
http://ck.kolivas.org/apps/cksort/cksort.c

10k samples compared to library qsort (I checked the actual code and it's a highly optimised combination of quicksort and insertion sort). Again bear in mind there is little to no optimisation in the cksort code. Time is in microseconds:

fewunique:
qsort:  452
cksort: 114

random:
qsort:  573
cksort: 883

sorted:
qsort:  449
cksort: 34

reversed:
qsort:  451
cksort: 41

mostly sorted:
qsort:  459
cksort: 58

As you can see above, random data is still slower on cksort (but a heck of a lot faster than earlier implementations), and it seems to stay at about half the speed of qsort no matter how many samples, which is good because it means it is bound by O(n log n) complexity. 10,000 samples was chosen because it accurately represents a size where the polynomial nature of an algorithm would be obvious. The reason it's half the speed of qsort is that an unoptimised merge sort does on average twice as much work as an optimised quicksort. I could reasonably just switch to a quicksort at the tipping point, but there are still some corner cases where quicksort becomes O(n^2), which mergesort never suffers from, and mergesort is very parallelisable if desired. I was tempted to thread the mergesort component to improve the benchmark numbers, but it would have skewed the results inappropriately.

As a generic sort there are still two issues with cksort. One is the extra space required, and at O(n) auxiliary space it is about as bad as it gets. An in-place algorithm could certainly be developed but would be tricky and require a lot more swaps during the presort phase. The tipping algorithm could be switched to a heap sort instead which works in place. The second disadvantage is that it is not stable (i.e. does not keep original order of same value entries). I'm not sure that's easy to improve on.

Again, I don't expect this will dramatically change the way anyone sorts anything but it's certainly been interesting to play with. I'll keep playing some more :)

EDIT: In response to requests for a range of numbers, here's a set of them for different types of data and numbers of samples, 10 to 100 million samples: random (2^32 values), fewunique1 (4 different values), fewunique2 (25 different values), fewunique3 (100 different values, above the tipping point), mostly sorted (every 5th and 10th value swapped), already sorted, and reverse sorted.




EDIT: In response to requests, this is the benchmark code based on 0.43 which is an improvement on this original post. Just edit the TOSORT value to try different sizes.

Tuesday, 4 October 2011

Visualising cksort

As part of the development of my first draft of the sorting algorithm, I spent a lot of time visualising existing sorting algorithms at the excellent website http://www.sorting-algorithms.com/. Looking at the different algorithms allowed me to think about how I might develop my own sort to try and improve performance in some of the common cases listed there, and do well in areas where most do badly. Until now I was unable to generate pretty graphs like those ones, but with Serge Belyshev's help I managed to get some rudimentary versions (without the nice arrows and different shading that the website above has) to see how swaps occur.

The following graphs visualise only the writes to the main array, and an update occurs only whenever a value is changed in the main array. This tends to correlate quite well with the overall number of comparisons as well so the time taken to complete in these examples should give you an idea of where cksort performs.

I would have liked to have visualised completely sorted data, but since no moves of data occur in cksort, there is nothing to show.

These graphs correspond to 50 entries of values 0-50, the equivalent of clicking the 50 button on the sorting algorithms site above. Putting them in the same order as that site, here they are (note I do NOT have the same data, so you cannot compare them directly, though the ballpark performance scales should be comparable):

EDIT: NOTE These graphs link samples from the latest version of cksort, which performs better than this original post described.

Random data

Nearly sorted




Reversed



Few Unique

Part of the reason I wanted to visualise these sorts was to demonstrate how cksort worked visually, but another reason was for me to examine it and see where its weakness is and how I might improve it. Interestingly, in my testing, the worst data for cksort was a sawtooth pattern as I mentioned in my last post. Furthermore, if I "heapify" the unsorted data on every pass, it gets much slower. Looking at the random data example above, it seems that with multiple passes, it intrinsically tends to heapify the data anyway, which is counterproductive. So assuming I can improve on it, this is the area I need to concentrate on. Perhaps after a few passes if the number of clustered "highs" drops to minimal, it may be better to switch to a heap sort for the remaining data. At least it's clear in the above examples where cksort does perform well!

Sunday, 2 October 2011

cksort follow up.

I've spent a few days playing with cksort to try and improve it with behaviour that I was planning on, and to try and categorise exactly what sort of sort (heh) it is and to analyse its algorithmic complexity. I know people haven't found this as interesting as my usual stuff, but I feel like blogging it anyway, and maybe some use will come of this code.

cksort v0.21:
http://ck.kolivas.org/apps/cksort/cksort.c
To describe the sort, it's an extension of a selection sort using external storage to substantially improve the speed of sorting certain forms of data and the worst case scenario.

I'll use this sample of 20 for simplicity:


 63 98 22 30 84 73 23 51 10 57 04 73 74 11 16 27 65 22 25 25

It starts by scanning over the data linearly, sorting data into 3 buckets:
The first bucket contains the first entry, and then collects any entries equal to or greater than the last entry put into it.


63 98

The second bucket contains the first item rejected by the 1st bucket and then collects any entries equal to or smaller than the last entry put into it.


22 10 04


The last bucket stores any data not falling into either of the previous categories (unsorted).

30 84 73 23 51 57 73 74 11 16 27 65 22 25 25

To minimise the amount of extra storage required, only one extra buffer equal in size to the original data is used, and the data from buckets 1 and 2 are placed at the beginning and end of the same buffer, while the unsorted data gets compacted in the original storage.

The highest and lowest values in the unsorted data are stored for comparison.

Then the main data is reconstructed as follows:

Any values from bucket 2 below the lowest value from unsorted are put at the beginning of the main buffer and considered in their final position and will not be scanned again.
Then all unsorted values are next in line.
Then the remainder of bucket 2 data is placed in the main buffer in ascending order.
Then bucket 1 is placed in ascending order, and any that are larger than the highest value from unsorted are considered in their final position.

 04 10 73 23 51 57 73 74 11 16 27 65 22 25 25 30 84 22 63 98

Then the scan is repeated in ever decreasing sections of the middle until none are left.

04 10 11 57 16 27 65 22 25 25 30 22 63 51 23 73 73 74 84 98
04 10 11 16 22 25 25 30 22 63 51 23 27 57 65 73 73 74 84 98
04 10 11 16 51 23 27 57 22 22 25 25 30 63 65 73 73 74 84 98
04 10 11 16 22 22 23 30 27 25 25 51 57 63 65 73 73 74 84 98
04 10 11 16 22 22 23 25 25 27 30 51 57 63 65 73 73 74 84 98


Trying to categorise this was difficult for a while but fortunately I discussed it with a few friends online and we were able to come up with something.


So in this overcrowded world of >9000 sorts, there needs to be some advantage or else it's a useless addition. Now I'm not claiming this is going to be something special but I'm doing this to fulfil some intellectual curiosity. Now please, I'm not that good with the algorithmic complexity details, so if I've described something wrong, please feel free to correct me (because no one ever does that on the internet!)

The advantages are with how it manages partially sorted data, and repeats of data. A set of fully sorted data in ascending order takes one pass, and after the data is placed into the backup array, it can abort out without any further swaps, so the comparisons and swaps equal n with one pass. O(n).

When data is already sorted in reverse order, the 2nd bucket picks it all up and places it in correct order. This is one comparison per entry and two swaps per entry: O(n). This is an area where very few sorting algorithms do well.

When there is a limited number of values the entries can take, the number of passes cksort makes is bound by half that number of values, no matter how many entries there are. E.g. if you have 1 million entries of values 0-100, it will only ever take up to 50 passes, needing ~25 million comparisons. EDIT: This means it's actually O(n*k) where k is the number of unique values.

When data is completely random, there is some function describing the relationship, and the absolute number of operations is significantly less than n^2, but the algorithmic complexity is O(n^2). For example, 100 unique random entries take on average 1800 compares, which is a lot less than 10,000, but the count does scale in polynomial fashion as n increases.

The exact worst case scenario is actually well defined: a sawtooth pattern of input values with lots of crossover but no repeat values (e.g. 100, 97, 98, 95, 96...). It is also O(n^2); 100 unique values take 2424 compares.

How to describe average I find quite hard, because real data usually has either repeats or has sections of partially ordered variables. So depending on how random the data is, it will tend from O(n)->O(n^2). Strictly speaking, most documented "averages" are based on random data.

So, finally how do the benchmarks stack up compared to the industry standard qsort? It's important to note that no effort was really made to optimise the code but the coarse scales should give you an idea of performance. Note that qsort is often thought of as being quicksort, but the implementation is entirely library dependent. This one may be a heap sort given it takes the same time no matter what is fed to it. A quicksort would easily have hit O(n^2) with some of these examples:


Time is in microseconds on a 3GHz core 2 with work bound to one CPU:

All are 10,000 entries:
Data already sorted 0 - 10000
qsort: 659
cksort: 40

Data sorted in reverse order 10000 - 0
qsort: 646
cksort: 61

Data sorted interleaved rising and falling: (0, 100, 1, 99, 2, 98...)
qsort: 651
cksort: 96

Data random with repeats every 2nd: (e.g. 54, 54, 99, 99, 33, 33...)
qsort: 643
cksort: 37

Data with 2 rising values and then a random: (e.g.: 1, 2, 54, 4, 5, 65...)
qsort: 653
cksort: 102

Random data, bound to 100 different values:
qsort: 642
cksort: 781

And now to real suckage, purely random unique data:
qsort: 658
cksort: 51770

For completeness I tried to find the crossover in n for when truly random data performed the same and it was ~475 entries.
qsort: 153
cksort: 155

So it's been a fun experiment and there's no doubt that with certain data types it could actually be quite useful, such as small data sets, partially ordered data sets, or numerous repeats, but the worst case scenario is clearly unacceptable for a general purpose algorithm if truly random data is possible. In summary, its major advantage is its extreme adaptability (adaptive algorithm), but the other disadvantages are that it is not stable, does not operate in place, and may have many swaps. cksort, like BFS, is a tool that can be useful for specific tasks. As I saw with BFS, the right algorithmic complexity in the right place is better than something that just "sounds better".

Now I wonder if I'll be inspired to try and design an O(n log n) sorting algorithm that can be parallelised...

EDIT: In response to questions about the license in the code, the GPL v3 license in the sample code is just for the sample code, not the algorithm.

EDIT2: It's worth noting that I'm still experimenting with this algorithm and hopefully I'll be able to improve it further.

Friday, 30 September 2011

cksort

I've been playing with sorting algorithms lately and wondered if the landscape of a billion out there had room for one more. I tried implementing many of the better algorithms in code and found it very hard to actually generate working code for anything beyond the most simple of selection sorts. So I set out to create my own sorting algorithm that sort of kind of worked half decently but wasn't too hard to implement. Now I'm sure some of the more complex algorithms do significantly better, and this does have some disadvantages, like the space requirements and lots of writes, but its best performance is very good and its average is decent. I have yet to find any particular data it performs badly on. However I'm not very good with the whole mathematical analysis side so I figured I'd post it here and hopefully get comments from people who are good there.

I don't really know how to describe it, but it's sort of a selection sort that generates two sorted lists at the end, and part of the last list is the completely sorted component.


The original design did not have the bidirectional component, just putting the "high" list at the end, but this performed poorly when data is in the reverse order, whereas this one performs just as well with data already mostly sorted in either direction.

Note that almost no effort has gone into code optimisation, just to make sure it works and to calculate how many iterations and swaps it does and so on.

http://ck.kolivas.org/apps/cksort/cksort.c


EDIT: Note v0.21. This is significantly faster than the original one I posted, working bidirectionally and requiring fewer comparisons and swaps on average and in the worst case.

/* Copyright 2011, Con Kolivas
 *
 * cksort v0.21
 *
 * GNU General Public License v3.
 *
 * cksort: A very simple sort algorithm based on selection sort but sorts
 * successive chunks of entries and scans less on each pass based on
 * knowledge of how much has been successfully sorted, and does a low and high
 * directional search to avoid worst case of reversed data. Uses an extra array
 * equal in size to the original data to do this.
 * Adaptive, not stable, best performance O(n), worst better than O(n^2) but
 * unknown. The average is unknown but seems to be ~ O(n^(3/2)) for random data.
 * It becomes faster the more sorted the original data is - even if it's
 * backwards, and the smaller the spread of data.
 *
 * gcc -o cksort -Wall -O2 cksort.c
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

#define TOSORT 50

typedef int32_t s32;

s32 rands[TOSORT], a[TOSORT];

#define s32min -2147483647
#define s32max 2147483647

void show(void)
{
    int i;

    for (i = 0; i < TOSORT; i++)
        printf("%02d ", rands[i]);
    printf("\n");
}

int total_n, total_passes, swaps, fromsort, tosort;

int sort(void)
{
    int i, j, ai = 0, bi = 0, unsorted_end;
    s32 max = s32min, minmax = s32min;
    s32 min = s32max, maxmin = s32max;

    j = fromsort;
    /* Do a sequential sort, finding ever increasing values and
     * store them in a[] */
    for (i = fromsort; i < tosort; i++) {
        total_n++;
        if (rands[i] >= max) {
            /* Store the grouped maximums at the bottom of the a
             * array */
            max = a[ai++] = rands[i];
            if (max < maxmin)
                maxmin = max;
            swaps++;
        } else if (rands[i] <= min) {
            /* Store the grouped minimums at the top of the a
             * array */
            min = a[TOSORT - 1 - bi++] = rands[i];
            swaps++;
        } else {
            /* Group any unsorted values at the low end */
            rands[j++] = rands[i];
            swaps++;
            /* Find the largest value beyond which all data will
             * be sorted on the next pass */
            if (rands[i] > minmax)
                minmax = rands[i];
            if (rands[i] < maxmin)
                maxmin = rands[i];
        }
    }

    /* The only entries left are all sorted already */
    if (ai == tosort - fromsort)
        return 0;

    if (maxmin == s32max)
        maxmin = s32min;
    if (minmax == s32min)
        minmax = s32max;

    unsorted_end = j;
    /* Place the sorted group of low values immediately after the unsorted
     * group. This avoids small values being picked up as maximums on the
     * next pass. */
    for (i = TOSORT - bi; i < TOSORT; i++) {
        swaps++;
        /* Everything less than maxmin will be the lowest sorted values
         * and can be put in their final positions */
        if (a[i] < maxmin) {
            if (fromsort < unsorted_end) {
                s32 swap = rands[fromsort];

                swaps++;
                rands[j] = swap;
                unsorted_end++;
            }
            rands[fromsort++] = a[i];
        } else
            rands[j] = a[i];
        j++;
    }

    /* Place the sorted group of high values after the low values. After
     * the minmax value, these are now in their final positions. */
    for (i = 0; i < ai; i++) {
        swaps++;
        rands[j] = a[i];
        /* Everything greater than minmax will be the highest sorted
         * values and can be put in their final positions */
        if (a[i] >= minmax) {
            tosort = j;
            minmax = s32max;
        }
        j++;
    }
    total_passes++;
    return tosort - fromsort;
}

int main(void)
{
    struct timeval tv;
    int i;

    gettimeofday(&tv, NULL);
    srand((unsigned int)tv.tv_sec);
    for (i = 0; i < TOSORT; i++)
        rands[i] = rand() % 100;

    show();
    tosort = TOSORT;

    while (sort())
        show();

    show();

    printf("%d passes, %d comparisons, %d swaps\n", total_passes, total_n, swaps);

    return 0;
}


Sample output from 50 random variables 0-100:
07 70 66 93 18 71 59 82 30 17 25 15 24 87 66 13 29 10 05 19 34 20 75 53 33 88 79 19 99 40 94 59 62 60 52 80 32 11 14 62 81 40 30 05 79 96 70 09 59 75
 

[snip]

05 05 07 09 10 11 13 14 15 17 18 19 19 20 24 25 29 30 30 32 33 34 40 40 52 53 59 59 59 60 62 62 66 66 70 70 71 75 75 79 79 80 81 82 87 88 93 94 96 99
13 passes, 360 comparisons, 503 swaps



EDIT: Further testing confirms it's likely to be O(n log n) worst case performance which is exceedingly rare with any real data set. I still haven't accurately categorised the average performance but it's a lot better even with random data, and of course it depends very much on what sort of data it sorts. Any existing order is seized upon and it benefits from it, even if there are for example interleaved increasing and decreasing data sets. I have not found evidence that I have simply "reinvented" an existing algorithm.