So the latest version of cksort now has some nice properties, being O(n) in certain ideal scenarios, being very adaptive to patterns in either direction or lots of repeated data, yet always being bound by O(n log n) in complexity.
Just about any O(n log n) algorithm could be used beyond the tipping point, but given the intrinsic nature of cksort uses a merge, it seemed intuitive to extend that to using classic merge as well. I experimented with using a heap sort once the data was pre-sorted since it almost became a heap structure naturally after the pre-sort, but 2 things became clear: First, the heap sort is actually slower than merge sort on average. And second, and more importantly, the pre-sort phase was actually the O(n^2) component in the original sort so it didn't make sense to complete this phase.
(EDIT updated version) Version 0.43:
10k samples compared to library qsort (I checked the actual code and it's a highly optimised combination of quicksort and insertion sort). Again bear in mind there is little to no optimisation in the cksort code. Time is in microseconds:
As you can see above, the random is still slower on cksort (but a heck of a lot faster than earlier implementations), and it seems to stay at about half the speed of qsort no matter how many samples which is good because it means it is bound by O(n log n) complexity. 10,000 samples was chosen because it accurately represented a size where the polynomial nature of an algorithm would be obvious. The reason it's half the speed of qsort is that an unoptimised merge sort does on average twice as much work as an optimised quicksort. I could reasonably just switch to a quicksort at the tipping point, but there are still some corner cases for quicksort where it becomes O(n^2) which mergesort never suffers from, and mergesort is very parallelisable if desired. I was tempted to thread the mergesort component to improve the benchmark numbers, but it would have skewed the results inappropriately.
As a generic sort there are still two issues with cksort. One is the extra space required, and at O(n) auxiliary space it is about as bad as it gets. An in-place algorithm could certainly be developed but would be tricky and require a lot more swaps during the presort phase. The tipping algorithm could be switched to a heap sort instead which works in place. The second disadvantage is that it is not stable (i.e. does not keep original order of same value entries). I'm not sure that's easy to improve on.
Again, I don't expect this will dramatically change the way anyone sorts anything but it's certainly been interesting to play with. I'll keep playing some more :)
EDIT: In response to requests for a range of numbers, here's a set of them for different types of data and number of samples, 10-100 million samples. Random (2^32), fewunique1(4 different values), fewunique2(25 different values), fewunique3(100 different values and > tipping point), mostly sorted(every 5th and 10th value swapped), already sorted and reverse sorted.
EDIT: In response to requests, this is the benchmark code based on 0.43 which is an improvement on this original post. Just edit the TOSORT value to try different sizes.