2009-03-07

Simple Future

A future is a computation that encapsulates the notion of asynchronously computing a value. Let me deconstruct the buzzwords:

  • By "computation" I mean a mechanism that, eventually, reduces to or contains a value
  • By "value" I mean a primitive object that cannot be reduced further, or rather a computation that is trivial in the sense that it returns itself when run
  • By "asynchronous" I mean that the context of the computation doesn't have to wait for it to complete before continuing

By applying Moggi's insight, you can view futures as a monad 'a future encapsulating values of type 'a. There are two important operations distinguishing futures from any other monad: the first attempts to "redeem" or unwrap the encapsulated value, blocking if its computation hasn't yet finished:

val redeem : 'a future -> 'a

The second constructs a future from a function by running it asynchronously in a separate thread:

val delay : ('a -> 'b) -> ('a -> 'b future)

(I use extra parentheses to emphasize the functorial aspects of computation). It is important that futures be composable, in the sense that there must exist an operation that combines a list of futures into a single future:

val select : 'a future list -> 'a future

with the strong condition that the returned future is the first in the list to complete, so that it can be redeemed without waiting. This operation is sufficiently powerful to construct an efficient polling operation on a future (that is, testing in constant time whether a future is immediately redeemable or not) as a derived operation. Of course, a very inefficient selection operation can also be built based on polling, so that the two primitives are semantically inter-definable.
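To make the other direction concrete, here is a sketch (not part of the interface) of a very inefficient select built by busy-waiting on poll, the operation derived further below; Thread.yield from the threads library is only there to avoid monopolizing the processor:

(* A sketch: semantically adequate but wasteful selection built on polling.
   Assumes poll : 'a future -> bool; spins until some future is redeemable. *)
let rec select_by_polling fs =
  try List.find poll fs
  with Not_found -> Thread.yield (); select_by_polling fs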

Given that futures form a monad, they come equipped with the usual operations:

val unit : 'a -> 'a future
val bind : ('a -> 'b future) -> ('a future -> 'b future)
val fmap : ('a -> 'b) -> ('a future -> 'b future)

(If the type of bind seems puzzling, it's that I prefer working with Kleisli triples. Again, I've parenthesized the declarations in order to highlight the categorial aspects of the operations on futures). The key point is that these operations must "run in the future" and not block the calling thread. Then, poll can be written as:

let poll f = redeem (select [
  fmap (fun _ -> true) f;
  unit false
])

The first element is a future that replaces the result of running f with true. The second future immediately returns false. If the first is already completed, select returns it, by the strong guarantee of deterministic (left-to-right) parallelism, even though the second is also ready; poll then redeems true. Since futures are immutable, f is still available for redeeming. If, on the other hand, the first is not yet completed, select returns the second future, and poll returns false on the spot.

So, let's build a minimal interface for a future. First let's start with a monad:

module type MONAD = sig
  type 'a t
  val unit : 'a -> 'a t
  val bind : ('a -> 'b t) -> ('a t -> 'b t)
  val fmap : ('a -> 'b) -> ('a t -> 'b t)
end
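For reference, fmap can be derived generically from unit and bind; a sketch, written as a functor over the signature above:

module DerivedFmap (M : MONAD) = struct
  (* Every monad is a functor: map f over the contents and re-wrap with unit. *)
  let fmap f = M.bind (fun x -> M.unit (f x))
end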

Of course fmap is derivable in a monad because every monad is a (categorial) functor, but it is simple to include it in the implementation, and it's often more efficient to use a specialized version of it. Now, the futures themselves:

module type FUTURE = sig
  include MONAD
  val redeem : 'a t -> 'a
  val delay  : ('a -> 'b) -> ('a -> 'b t)
  val select : 'a t list -> 'a t
end

To derive poll I can use a functor:

module Future (F : FUTURE) = struct
  include F
  let poll f = redeem (select [
    fmap (fun _ -> true) f;
    unit false
  ])
end

With this specification, the minimal semantics for futures is to perform completely synchronous computations; a valid, although not very interesting starting point.

module NullFuture = Future(struct
  type 'a t = 'a
  let unit x = x
  let bind f x = f x
  let fmap f x = f x
  let redeem x = x
  let delay f x = f x
  let select = function [] -> failwith "select" | x :: _ -> x
end)
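A quick sanity check of this degenerate semantics (a sketch; since delay computes on the spot, every future is immediately redeemable):

let () =
  let f = NullFuture.delay (fun x -> x * x) 7 in
  assert (NullFuture.redeem f = 49);
  assert (NullFuture.poll f)    (* always true: nothing runs "later" *)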

For a truly multi-threaded experience, I must call in the services of the Thread module. The problem with OCaml's threads is twofold. First, threads don't return a value, but just run a function unattended until completion, discarding the result. This is remedied easily enough by providing the thread with a reference in which to deliver its result. The other, more serious problem is that there's no built-in way to wait on more than one thread.

Both drawbacks can be overcome by associating the references with a condition variable and waiting for the first thread that signals it, in effect creating a synchronizing slot. I mostly follow Bartosz Milewski's description of MVars, but I depart from his implementation in that slots are more restricted. First, the interface is simple enough:

module Slot : sig
  type 'a t
  val create : unit -> 'a t
  val put : 'a -> 'a t -> unit
  val get : 'a t -> 'a
end = struct

The idea is that get waits for a value to be present, but put has an effect only if the slot is empty. To wait on a value, I use a Condition.t that must always be paired with a Mutex.t:

  type 'a t = { m : Mutex.t; c : Condition.t; mutable ref : 'a option }

Creating a slot is simple enough:

  let create () = { m = Mutex.create (); c = Condition.create (); ref = None }

In order to ensure that all accesses to a slot are synchronized, and that the lock is properly released even in the event of failure, I use a wrapper:

  let locked f s =
    Mutex.lock s.m;
    try let x = f s in Mutex.unlock s.m; x
    with e -> Mutex.unlock s.m; raise e

Putting a value in the slot succeeds only if the slot is empty; otherwise, it has no effect:

  let put x =
    locked (fun s ->
      if s.ref = None then begin
        s.ref <- Some x; Condition.signal s.c
      end)

Getting from an empty slot blocks until another thread puts a value into it; there is no wait involved when it is full:

  let get s =
    let rec go s = match s.ref with
    | None   -> Condition.wait s.c s.m; go s
    | Some x -> x
    in locked go s
end

The net effect is that a slot becomes immutable once set. This is key to ensure that select has fair semantics and to avoid race conditions.
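As a tiny illustration of the intended use (a sketch; the producer thread and the 0.1-second delay are only for demonstration):

let () =
  let s = Slot.create () in
  ignore (Thread.create (fun () -> Thread.delay 0.1; Slot.put 42 s) ());
  Printf.printf "got %d\n" (Slot.get s)   (* blocks until the producer puts *)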

Now futures must wait on the slot to be filled with the result of the computation, which must proceed in a separate thread if it is to be truly asynchronous:

module AsyncFuture = Future(struct
  type 'a t = { t : Thread.t; s : 'a Slot.t }

Redeeming a future is simply waiting on its slot to become full:

  let redeem m = Slot.get m.s

In order to delay a computation by turning it into a future, spawn a thread that runs it and deposits its value in the corresponding slot:

  let delay f x =
    let s = Slot.create () in
    { t = Thread.create (fun x -> Slot.put (f x) s) x; s = s }

That's it. Now selecting from a set of futures is a bit more involved. Of course, it makes no sense to treat an empty list of futures as anything but an error. Otherwise, I turn every future into a delayed computation that runs it to completion and signals termination by setting a slot with a reference to itself. The semantics of slots ensures that the first future to complete is the one present as the slot's value:

  let select = function
  | [] -> failwith "select"
  | l  ->
    let s = Slot.create () in
    List.iter (fun m ->
      ignore (Thread.create (fun m -> ignore (redeem m); Slot.put m s) m))
      l;
    Slot.get s

As a slight optimization I directly create Threads instead of delaying the thunks, because the results are completely uninteresting.

Finally, the monadic semantics is very simple to provide with these primitives:

  let unit x =
    let s = Slot.create () in Slot.put x s;
    { t = Thread.self (); s = s }

  let bind f m = delay (fun m -> redeem (f (redeem m))) m
  let fmap f m = delay (fun m -> f (redeem m)) m
end)

I had originally written unit as delay id; but since the value is already present, it barely needs a slot, much less a thread, so I use the current thread as a placeholder. If the slot could ever be waited upon while empty, this could lead to deadlock; as it is already full, there is no danger of the current thread becoming blocked because of it.
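To see the asynchronous semantics in action, here is a quick usage sketch (the sleeps stand in for real work; like the rest of the code, it needs the threads library):

let () =
  let slow = AsyncFuture.delay (fun x -> Thread.delay 1.0; x) "slow"
  and fast = AsyncFuture.delay (fun x -> Thread.delay 0.1; x) "fast" in
  (* both run concurrently; select yields whichever completes first *)
  assert (AsyncFuture.redeem (AsyncFuture.select [slow; fast]) = "fast");
  assert (AsyncFuture.redeem slow = "slow")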

This implementation is low-level enough that it can be used as a guide for implementing composable futures in other languages. In fact, Apocalisp has a series of articles presenting this very design applied to Java. Of course, this is far more basic, in that there is no provision for a strategy for running threads, which are in fact paired one-to-one with thunks. Building a work queue, a thread pool to run it, and layering a future library on that is left as an exercise for the reader.

2009-03-05

Small Sorts

I have seen on more than one occasion sorting problems of small sizes being solved with general-purpose sorting routines. The case should be analogous to FFTs of small size, where the cases for 1-point, 2-point and 4-point transforms are special, straight-line code segments; but I haven't found the equivalent special-casing of small-size sorts. One reason for that could be that the structure of FFTs is very regular, and specializing the recursion for small base cases is relatively straightforward. Another reason might be the fact that FFT code tends to be rather specialized, and written once, while practically everybody with a formal education in CS is expected to know how to write a general-purpose sort routine.

If this is the case, let this post serve as a container for cut-and-paste-ready code.

The Case for Three

There are 3! = 6 permutations of three elements, and 2² ≤ 3! < 2³, which means that an exhaustive comparison tree for three elements must make between two and three comparisons to completely sort them. The easiest way to see that this is the case is by inspecting the comparison tree of order three:

Comparison tree of order three

Note that sorting three elements immediately gives us the minimum, maximum and median of the set, as the latter is uniquely determined by the former. The corresponding code is:

let sort3 (a, b, c) =
  if a <= b then
    if b <= c then (a, b, c) else
    if a <= c then (a, c, b) else
                   (c, a, b) else
    if a <= c then (b, a, c) else
    if b <= c then (b, c, a) else
                   (c, b, a)

As the sets are fixed in size, it is sufficient to verify every permutation to prove (in the strictest sense of the word) that the code is correct. There are several algorithms for generating permutations, but for these sizes, the following is sufficient:

let (%) f g x = f (g x)   (* right-to-left composition, used by permute below *)

let rec permute =
  let remove e = List.filter ((<>) e) in
  let perms_with a = List.map (fun y -> a :: y) % permute % remove a
  in function
  | [] -> [[]]
  | l  -> List.concat (List.map (fun a -> perms_with a l) l)

With that, verification of the code is straightforward:

let p3 = List.map (function [a;b;c] -> a,b,c) (permute [1;2;3]) in
List.for_all ((=) (1, 2, 3)) (List.map sort3 p3)

The Case for Four

There are 4! = 24 permutations of four elements, and 2⁴ < 4! < 2⁵. The decision tree has become decidedly bushier. The tasks of sorting four elements and of finding the extrema among them are now different: the first requires between four and five comparisons, but the second can do the job with four comparisons in every case, as you can see by collapsing terminal nodes with equal extrema in the following comparison tree:

Comparison tree of order four
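(As an aside, the four-comparison bound for the extrema can be realized directly, without the full tree, by comparing within each pair and then across the pairs; a sketch:)

(* Minimum and maximum of four elements in exactly four comparisons. *)
let minmax4 (a, b, c, d) =
  let lo1, hi1 = if a <= b then a, b else b, a in
  let lo2, hi2 = if c <= d then c, d else d, c in
  (if lo1 <= lo2 then lo1 else lo2), (if hi1 <= hi2 then hi2 else hi1)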

Whereas anybody with five minutes to spare could derive the decision tree to sort three elements, the code corresponding to this tree is a prime candidate for cut-and-paste:

let sort4 (a, b, c, d) =
  if a <= b then
    if c <= d then
      if a <= c then
        if b <= d then
          if b <= c then (a, b, c, d) else
                         (a, c, b, d) else
                         (a, c, d, b) else
        if b <= d then   (c, a, b, d) else
          if a <= d then (c, a, d, b) else
                         (c, d, a, b) else
      if a <= d then
        if b <= c then
          if b <= d then (a, b, d, c) else
                         (a, d, b, c) else
                         (a, d, c, b) else
        if b <= c then   (d, a, b, c) else
          if a <= c then (d, a, c, b) else
                         (d, c, a, b) else
    if c <= d then
      if b <= c then
        if a <= d then
          if a <= c then (b, a, c, d) else
                         (b, c, a, d) else
                         (b, c, d, a) else
        if a <= d then   (c, b, a, d) else
          if b <= d then (c, b, d, a) else
                         (c, d, b, a) else
      if b <= d then
        if a <= c then
          if a <= d then (b, a, d, c) else
                         (b, d, a, c) else
                         (b, d, c, a) else
        if a <= c then   (d, b, a, c) else
          if b <= c then (d, b, c, a) else
                         (d, c, b, a)

Verification can proceed as before, by exhaustively checking correctness over all 4! permutations:

let p4 = List.map (function [a;b;c;d] -> a,b,c,d) (permute [1;2;3;4]) in
List.for_all ((=) (1, 2, 3, 4)) (List.map sort4 p4)

The Case for Five

The fact that five comparisons suffice to sort four elements is not entirely straightforward to derive, as poring over section 5.3.1 of The Art of Computer Programming shows. It is even less clear that seven comparisons suffice to sort five elements, since 2⁶ < 5! = 120 < 2⁷ and the upper bound is rather tight. The recipe Knuth gives is very high-level, and although I'm confident it can be used to build a code generator, it is not immediately obvious to me how to convert that intuition into actual code.

This is a good cut-off point to resort to a general-purpose sort; Jon Bentley argues that such small cases are best served by straightforward insertion sort. I went a step further and unrolled a 5-element insertion sort, which has a worst-case cost of 10 comparisons. The downside to this code is that it is imperative, and destructive: it sorts by shifting the elements of an array in place:

let inssort5 v =
  let a = v.(1) in
  if v.(0) > a then begin
    v.(1) <- v.(0);
    v.(0) <- a
  end;
  let a = v.(2) in
  if v.(1) > a then begin
    v.(2) <- v.(1);
    if v.(0) > a then begin
      v.(1) <- v.(0);
      v.(0) <- a
    end else
      v.(1) <- a
  end;
  let a = v.(3) in
  if v.(2) > a then begin
    v.(3) <- v.(2);
    if v.(1) > a then begin
      v.(2) <- v.(1);
      if v.(0) > a then begin
        v.(1) <- v.(0);
        v.(0) <- a
      end else
        v.(1) <- a
    end else
      v.(2) <- a
  end;
  let a = v.(4) in
  if v.(3) > a then begin
    v.(4) <- v.(3);
    if v.(2) > a then begin
      v.(3) <- v.(2);
      if v.(1) > a then begin
        v.(2) <- v.(1);
        if v.(0) > a then begin
          v.(1) <- v.(0);
          v.(0) <- a
        end else
          v.(1) <- a
      end else
        v.(2) <- a
    end else
      v.(3) <- a
  end

Sorting a 5-tuple requires an adapter:

let sort5 (a, b, c, d, e) =
  let v = [| a; b; c; d; e |] in inssort5 v;
  v.(0), v.(1), v.(2), v.(3), v.(4)

This can, again, be verified by exhaustively testing it against all 120 permutations of 5 elements:

let p5 = List.map (function [a;b;c;d;e] -> a,b,c,d,e) (permute [1;2;3;4;5])
in List.for_all ((=) (1, 2, 3, 4, 5)) (List.map sort5 p5)

Edit: Indeed, it is possible to directly translate Knuth's instructions for merge-inserting 5 elements into a 7-comparison straight-line sort. As in the text, it is convenient to first rank a, b, c and d so that, say, a < b < d and c < d, then insert e among a, b, d, and finally insert c. After the ranking is found, merging e is the same in all cases modulo renaming, so it can be considered a subroutine. This is the code:

let sort5 (a, b, c, d, e) =
  let merge5 a b c d e =
    (* abd cd e *)
    if e <= b then
      if a <= e then (* aebd cd *)
        if c <= e then
          if c <= a then
            (c, a, e, b, d)
          else
            (a, c, e, b, d)
        else
          if c <= b then
            (a, e, c, b, d)
          else
            (a, e, b, c, d)
      else (* eabd cd *)
        if c <= a then
          if c <= e then
            (c, e, a, b, d)
          else
            (e, c, a, b, d)
        else
          if c <= b then
            (e, a, c, b, d)
          else
            (e, a, b, c, d)
    else
      if e <= d then (* abed cd *)
        if c <= b then
          if c <= a then
            (c, a, b, e, d)
          else
            (a, c, b, e, d)
        else
          if c <= e then
            (a, b, c, e, d)
          else
            (a, b, e, c, d)
      else (* abde cd *)
        if c <= b then
          if c <= a then
            (c, a, b, d, e)
          else
            (a, c, b, d, e)
        else
          (a, b, c, d, e)
  in
  if a <= b then
    if c <= d then
      if b <= d then (* abd cd *)
        merge5 a b c d e
      else (* cdb ab *)
        merge5 c d a b e
    else
      if b <= c then (* abc dc *)
        merge5 a b d c e
      else (* dcb ab *)
        merge5 d c a b e
  else
    if c <= d then
      if a <= d then (* bad cd *)
        merge5 b a c d e
      else (* cda ba *)
        merge5 c d b a e
    else
      if a <= c then (* bac dc *)
        merge5 b a d c e
      else (* dca ba *)
        merge5 d c b a e

The comments show the ranking implied by the result of the comparisons, so that bac dc means b < a < c and d < c, for instance. Note also that the very last case of merge5 makes use of transitivity to coalesce two cases into one, for a total of (7⋅14 + 6)/15 = 6.9333 comparisons on average.
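As a closing check, this merge-insertion version passes the same exhaustive test as before (a sketch identical in form to the earlier ones, reusing permute):

let p5 = List.map (function [a;b;c;d;e] -> a,b,c,d,e) (permute [1;2;3;4;5])
in List.for_all ((=) (1, 2, 3, 4, 5)) (List.map sort5 p5)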