## 2009-03-05

### Small Sorts

I have seen in more than one occasion sorting problems of small sizes being solved with general-purpose sorting routines. The case should be analogous to FFTs of small size, where the cases for 1-point, 2-point and 4-point transforms are special, straight-line code segments; but I haven't found the equivalent special-casing of small-size sorts. One reason for that could be that the structure of FFTs is very regular, and specializing the recursion for small base cases is relatively straightforward. Another reason might be the fact that FFT code tends to be rather specialized, and written once, while practically everybody with a formal education in CS is expected to know how to write a general-purpose sort routine.

If this is the case, let this post serve as a container for cut-and-paste-ready code.

## The Case for Three

There are 3! = 6 permutation of three elements, and 2² ≤ 3! < 2³, which means that an exhaustive comparison tree for three elements must make between two and three comparisons to completely sort them. The easiest way to see that this is the case is by inspecting the comparison tree of order three:

Note that sorting three elements immediately gives us the minimum, maximum and median of the set, as the latter is uniquely determined by the former. The corresponding code is:

```let sort3 (a, b, c) =
if a <= b then
if b <= c then (a, b, c) else
if a <= c then (a, c, b) else
(c, a, b) else
if a <= c then (b, a, c) else
if b <= c then (b, c, a) else
(c, b, a)
```

As the sets are fixed in size, it is sufficient to verify every permutation to prove (in the strictest sense of the word) that the code is correct. There are several algorithms for generating permutations, but for these sizes, the following is sufficient:

```let rec permute =
let remove e = List.filter ((<>) e) in
let perms_with a = List.map (fun y -> a :: y) % permute % remove a
in function
| [] -> [[]]
| l  -> List.concat (List.map (fun a -> perms_with a l) l)
```

With that, verification of the code is straightforward:

```let p3 = List.map (function [a;b;c] -> a,b,c) (permute [1;2;3]) in
List.for_all ((=) (1, 2, 3)) (List.map sort3 p3)
```

## The Case for Four

There are 4! = 24 permutations of four elements, and 24 < 4! < 25. The decision tree has become decidedly bushier. Now the tasks of sorting four elements and finding the extrema among these four are different, in which the first requires between four and five comparisons to sort, but the second can do the job with four comparisons in every case, as you can see by collapsing terminal nodes with equal extrema in the following comparison tree:

Whereas anybody with five minutes to spare could derive the decision tree to sort three elements, the code corresponding to this tree is a prime candidate for cut-and-paste:

```let sort4 (a, b, c, d) =
if a <= b then
if c <= d then
if a <= c then
if b <= d then
if b <= c then (a, b, c, d) else
(a, c, b, d) else
(a, c, d, b) else
if b <= d then   (c, a, b, d) else
if a <= d then (c, a, d, b) else
(c, d, a, b) else
if a <= d then
if b <= c then
if b <= d then (a, b, d, c) else
(a, d, b, c) else
(a, d, c, b) else
if b <= c then   (d, a, b, c) else
if a <= c then (d, a, c, b) else
(d, c, a, b) else
if c <= d then
if b <= c then
if a <= d then
if a <= c then (b, a, c, d) else
(b, c, a, d) else
(b, c, d, a) else
if a <= d then   (c, b, a, d) else
if b <= d then (c, b, d, a) else
(c, d, b, a) else
if b <= d then
if a <= c then
if a <= d then (b, a, d, c) else
(b, d, a, c) else
(b, d, c, a) else
if a <= c then   (d, b, a, c) else
if b <= c then (d, b, c, a) else
(d, c, b, a)
```

Verification can proceed as before, by checking exhaustively correctness over all 4! permutations:

```let p4 = List.map (function [a;b;c;d] -> a,b,c) (permute [1;2;3;4]) in
List.for_all ((=) (1, 2, 3, 4)) (List.map sort4 p4)
```

## The Case for Five

The fact that five comparisons suffice to sort four elements is not entirely straightforward to derive, as poring over 5.3.1 of The Art of Computer Programming shows. Knuth shows that it is still less clear that seven comparisons suffice to sort five elements, as 26 < 5! = 120 < 27, as the upper bound is rather tight. The recipe given is very high level, and although I'm confident it can be used to build a code generator, it is not immediately obvious to me how to convert that intuition into actual code.

This is a good cut-off point to resort to a general-purpose sort; Jon Bentley argues that such small cases are best served by straightforward insertion sort. I went a step further and unrolled a 5-element insertion sort, which has a worst-case complexity of 10 comparisons. The downside to this code is that it is imperative, and destructive: it sorts by swapping the elements in an array:

```let inssort5 v =
let a = v.(1) in
if v.(0) > a then begin
v.(1) <- v.(0);
v.(0) <- a
end;
let a = v.(2) in
if v.(1) > a then begin
v.(2) <- v.(1);
if v.(0) > a then begin
v.(1) <- v.(0);
v.(0) <- a
end else
v.(1) <- a
end;
let a = v.(3) in
if v.(2) > a then begin
v.(3) <- v.(2);
if v.(1) > a then begin
v.(2) <- v.(1);
if v.(0) > a then begin
v.(1) <- v.(0);
v.(0) <- a
end else
v.(1) <- a
end else
v.(2) <- a
end;
let a = v.(4) in
if v.(3) > a then begin
v.(4) <- v.(3);
if v.(2) > a then begin
v.(3) <- v.(2);
if v.(1) > a then begin
v.(2) <- v.(1);
if v.(0) > a then begin
v.(1) <- v.(0);
v.(0) <- a
end else
v.(1) <- a
end else
v.(2) <- a
end else
v.(3) <- a
end
```

Sorting a 5-tuple requires an adapter:

```let sort5 (a, b, c, d, e) =
let v = [| a; b; c; d; e |] in inssort5 v;
v.(0), v.(1), v.(2), v.(3), v.(4)
```

Which can, again, be verified by exhaustively testing it against all 120 permutations of 5 elements:

```let p5 = List.map (function [a;b;c;d;e] -> a,b,c,d,e) (permute [1;2;3;4;5])
in List.for_all ((=) (1, 2, 3, 4, 5)) (List.map sort5 p5)
```

Edit: Indeed, it is possible to directly translate Knuth's instructions for merge-inserting 5 elements into a 7-comparison straight-line sort. As in the text, it is convenient to first rank a, b, c and d so that, say, a < b < d and c < d, then insert e among a, b, d, and finally inserting c. After the ranking is found, merging e is the same in all cases modulo renaming, so it can be considered a sub-routine. This is the code:

```let sort5 (a, b, c, d, e) =
let merge5 a b c d e =
(* abd cd e *)
if e <= b then
if a <= e then (* aebd cd *)
if c <= e then
if c <= a then
(c, a, e, b, d)
else
(a, c, e, b, d)
else
if c <= b then
(a, e, c, b, d)
else
(a, e, b, c, d)
else (* eabd cd *)
if c <= a then
if c <= e then
(c, e, a, b, d)
else
(e, c, a, b, d)
else
if c <= b then
(e, a, c, b, d)
else
(e, a, b, c, d)
else
if e <= d then (* abed cd *)
if c <= b then
if c <= a then
(c, a, b, e, d)
else
(a, c, b, e, d)
else
if c <= e then
(a, b, c, e, d)
else
(a, b, e, c, d)
else (* abde cd *)
if c <= b then
if c <= a then
(c, a, b, d, e)
else
(a, c, b, d, e)
else
(a, b, c, d, e)
in
if a <= b then
if c <= d then
if b <= d then (* abd cd *)
merge5 a b c d e
else (* cdb ab *)
merge5 c d a b e
else
if b <= c then (* abc dc *)
merge5 a b d c e
else (* dcb ab *)
merge5 d c a b e
else
if c <= d then
if a <= d then (* bad cd *)
merge5 b a c d e
else (* cda ba *)
merge5 c d b a e
else
if a <= c then (* bac dc *)
merge5 b a d c e
else (* dca ba *)
merge5 d c b a e
```

The comments show the ranking implied by the result of the comparisons, so that `bac dc` means b < a < cd < c, for instance. Note also that the very last case of `merge5` makes use of transitivity to coalesce two cases into one, for a total of (7⋅14 + 6)/15 = 6.9333 comparisons on average.

Anonymous said...

One can concisely code up sorting networks using continuations.

let sort2 a b k =
if a <= b then k a b else k b a

let sort3 a b c k =
sort2 a b (fun a b ->
sort2 b c (fun b c ->
sort2 a b (fun a b ->
k a b c)))

let sort4 a b c d k =
sort2 b d (fun b d ->
sort2 a c (fun a c ->
sort2 a b (fun a b ->
sort2 c d (fun c d ->
sort2 b c (fun b c ->
k a b c d)))))

Of course, this could be done without continuations, but by inlining all continuations parameters, one can obtain some nice straightline code.

Unknown said...

Very cool trick! It is both clear and subtle at the same time.

anthonyfairchild said...

Hi,

Great post! I recently wrote a lisp macro to generate fast sorters for small numbers of items. Here is my code for comparison.

This is the macro code:

http://paste.lisp.org/display/75398

This is a generated 3 item sorter:

http://paste.lisp.org/display/75399

This is a generated 5 item sorter:

http://paste.lisp.org/display/75400

- Anthony

me said...

Hard coding small sorts is a stupid idea. There's a lot of code and not a lot of work, and it's not really a critical base case for anything like in the FFT. CPUs have an instruction cache. You should strive to be efficient with very little code. Shell sort is the obvious choice. Use the Sedgewick-Incerpi sequence 1, 3, 7, 21, 48, 112, ...

Paul Biggar said...

I briefly investigated this as part of my research on sorting and branch predictors (see http://www.cs.tcd.ie/~pbiggar, particularly section 6.1.3 of the tech report).

In particular, I needed a very fast sort for arrays of size 4 or 8, in order to prepare them for a much larger mergesort. I found that an insertion sort performed better than a hard coded sort, even when the hard coded sort was optimal. I believe this is due to the very good branch prediction behaviour of insertion sort, compared with the very poor behaviour of the hard coded sort.

Some factors need to be considered of course. I was working in a case where the cost of comparisons was very low (a simple < operator in C), and the cost of branch mispredictions was very high (a Pentium 4). Your high level (Ocaml?) implemention no doubt has different tradeoffs. However, I dont think you'll achieve great performance improvements over an insertion sort here, especially if you skip unrolling the loop, and have an implementation which is close to hardware.

Trying to achieve increased performance by trying to reduce the number of comparisons is only going to work for a very expensive comparison function. I think you should better state your assumptions, and benchmark to achieve concrete results.

Otherwise, your post was very interesting.

Unknown said...

@Anthony: Thank you for the pointers!

@Roman: I appreciate your remarks. Please read Paul Biggar's comments to see what a use for base sorts would be. As to your assertion about the suitability of Shellsort for small sorts, compared to Insertion Sort, it requires some sort of backup data.

@Paul: Thank you for the insight and pointers. I'm in an interchange with Christophe Troestler, whose code prompted this post, and he remarked that these minimal comparison routines have times dominated by parameter marshalling.

As for motivation and assumptions, my only concern was with the shape of minimal-comparison sortings, regardless of performance. I'll follow up with a post analyzing these sorters's performance vis-à-vis insertion sort.

ChriS said...

You might be interesting in the following paper “A variant of the Ford–Johnson algorithm that is more space efficient”,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.1540
doi:10.1016/j.ipl.2006.11.017