## Tuesday 4 October 2011

### Visualising cksort

As part of the development of my first draft for the sorting algorithm, I spent a lot of time visualising existing sorting algorithms at the excellent website http://www.sorting-algorithms.com/ . Looking at the different algorithms allowed me to think about how I might develop my own sort to try and improve performance in some of the common cases listed there and do well in areas most do badly. Until now I was unable to generate pretty graphs like those ones, but with Serge Belyshev's help I manage to get some rudimentary versions (without the nice arrows and different shading that the website above has) to see how swaps occur.

The following graphs visualise only the writes to the main array, and an update occurs only whenever a value is changed in the main array. This tends to correlate quite well with the overall number of comparisons as well so the time taken to complete in these examples should give you an idea of where cksort performs.

I would have liked to have visualised completely sorted data, but since no moves of data occur in cksort, there is nothing to show.

These graphs correspond to 50 entries of values 0-50, so the equivalent of clicking the 50 button on sorting algorithms above. Putting them in the same order as the site above here they are (note I do NOT have the same data so you cannot compare them directly, though the ballpark performance scales should be comparable):

EDIT: NOTE These graphs link samples from the latest version of  cksort which performs better than this original post described.

Random data

Nearly sorted

Reversed

Few Unique

Part of the reason I wanted to visualise these sorts was to demonstrate how cksort worked visually, but another reason was for me to examine it and see where its weakness is and how I might improve it. Interestingly, in my testing, the worst data for cksort was a sawtooth pattern as I mentioned in my last post. Furthermore, if I "heapify" the unsorted data on every pass, it gets much slower. Looking at the random data example above, it seems that with multiple passes, it intrinsically tends to heapify the data anyway, which is counterproductive. So assuming I can improve on it, this is the area I need to concentrate on. Perhaps after a few passes if the number of clustered "highs" drops to minimal, it may be better to switch to a heap sort for the remaining data. At least it's clear in the above examples where cksort does perform well!

1. What's your point? Are you trying to beat qsort in some corner case, or two, or three? Sure that's possible. The more you know about your distribution, the faster you can sort it too. That's no-brainer - that's the lack of entropy.

Quoting from Wikipedia:
"quicksort performs only about 39% worse than the ideal number of comparisons"

Beat that, my friend!

Again the beauty of you BFS is that it does not optimize the corners. Why you decided to go in the opposite direction with cksort? No one cases about corner cases of qsort - is what I am saying.

2. LOL I love the way people reply with stuff like "What's your point" in that smug internet way. The introduction did say it was purely an intellectual exercise.

3. What if you were to implement a way to quickly estimate how badly unsorted the data is and if it's not too badly unsorted, don't bother sorting it. Would that help speed things up? If so, this is my thinking so far.

==========
You would calculate the average of the values of all positions, which I'll call full_average. You would then calculate the average of positions 1-10, which I'll call average_set_a, the average of positions 11-20, which I'll call average_set_b, the average of positions 21-30, which I'll call average_set_c, the average of positions 31-40, which I'll call average_set_d, and the average of positions 41-50, which I'll call average_set_e. Now using a sorting algorithm to find the difference of lowest value and highest value of the average_set_'s will give a value for an estimate of the amount of deviation in the dataset, which I'll call deviation_amount. Yes, it's a sorting algorithm to decide whether something should be sorted or not, but my thinking is that sorting a set of five is faster than sorting a set of 50. Now, if average_set_a is close to 5% of deviation_amount, and if average_set_b is close to 15% of average_set_b and so forth for the other average_set_'s, then the dataset is not likely too badly unsorted.
==========

I'm not sure if this would even end up being faster than just sorting the full dataset every time, and I can still think up some datasets where this still won't work at all. Anyway, I've got to take a break from all this thinking now.

On second thought, I think this is turning into complete mess of total nonsense. Maybe you'll at least be able to take something out of this that would work.