## 2012-08-06

### Merge Right

The so-called master-transaction update is one of the, if not the defining algorithms of the discipline formerly known as "data processing". Given two sorted files of records with increasing keys, the process applies each record in the transaction file to each record of the the master file and outputs the result, if any, to the updated master file in one pass over each input. The same algorithm can compute the union, intersection or difference of sorted sequences. For instance, the union of two sets represented as sorted lists of unique elements is:

```let union       =
let rec go l r = match l, r with
| [], xs | xs, []  -> xs
| x :: xs, y :: ys ->
match compare x y with
| -1 -> x :: go       xs (y :: ys)
|  0 -> x :: go       xs       ys
|  1 -> y :: go (x :: xs)      ys
|  _ -> assert false
in go
```

Intersection is:

```let inter       =
let rec go l r = match l, r with
| [], _  | _, []   -> []
| x :: xs, y :: ys ->
match compare x y with
| -1 ->      go       xs (y :: ys)
|  0 -> x :: go       xs       ys
|  1 ->      go (x :: xs)      ys
|  _ -> assert false
in go
```

while difference is:

```let diff       =
let rec go l r = match l, r with
| [], _  | _, []   -> l
| x :: xs, y :: ys ->
match compare x y with
| -1 -> x :: go       xs (y :: ys)
|  0 ->      go       xs       ys
|  1 ->      go (x :: xs)      ys
|  _ -> assert false
in go
```

And so on. The three functions use the same underlying merge schemata; what varies is the operation to perform in each of the five possible cases:

1. The left sequence is nil
2. The right sequence is nil
3. The next element in the left sequence is less than the next element in the right sequence
4. The next element in the left sequence is equal to the next element in the right sequence
5. The next element in the left sequence is greater than the next element in the right sequence

The question is, then, how many set operations can the merge algorithm implement? These five cases partition both input sequences in disjoint sets such that the output sequence is the natural merge of zero or more of them. For example, processing the sets `{ 1, 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9 }` results in the following steps:

LNRNLTEQGTArguments
1 `{ 1, 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9, 10 }`
2`{ 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9, 10 }`
3 `{ 3, 4, 6, 7, 8 } ⋈ { 3, 5, 6, 8, 9, 10 }`
4 `{ 4, 6, 7, 8 } ⋈ { 5, 6, 8, 9, 10 }`
5`{ 6, 7, 8 } ⋈ { 5, 6, 8, 9, 10 }`
6 `{ 6, 7, 8 } ⋈ { 6, 8, 9, 10 }`
7 `{ 7, 8 } ⋈ { 8, 9, 10 }`
8 `{ 8 } ⋈ { 9, 10 }`
9,10`∅ ⋈ { 9, 10 }`

Abstracting away the operations to perform in each of these five cases we have the following schema:

```let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
let rec go l r = match l, r with
| [], ys -> ln ys
| xs, [] -> rn xs
| x :: xs, y :: ys ->
match compare x y with
| -1 -> lt x (go       xs (y :: ys))
|  0 -> eq x (go       xs       ys )
|  1 -> gt y (go (x :: xs)      ys )
|  _ -> assert false
in go
```

Both `ln` and `rn` must decide what to do with the remaining list and so have type `α list → α list`, while `lt`, `eq` and `gt` must decide what to do with the element in consideration and so have type `α → α list → α list`; thus the type of `merge` is `(α list → α list) → (α list → α list) → (α → α list → α list) → (α → α list → α list) → (α → α list → α list) → α list → α list → α list`. The operations on the remainder either pass it unchanged or return `nil`, while the operations on the next element either add it to the output sequence or not:

```let self   xs =      xs
and null   _  =      []
and cons x xs = x :: xs
and tail _ xs =      xs
```

(some of these have well-known names in functional programming, but here I choose to use these neat, four-letter ones.) With the proviso that the output sequence must be increasing these four functions exhaust the possibilities by parametricity; otherwise, duplications and rearrangements would satisfy the parametric signature. Now union, intersection and difference are simply:

```let union l r = merge self self  cons cons cons l r
and inter l r = merge null null  tail cons tail l r
and diff  l r = merge null self  cons tail tail l r
```

It is obvious that the question I posed above is answered as 25 = 32 possible set operations obtainable by varying each of the five operations. The next question is, then, what are these 32 operations? Let me characterize each of the five sets `ln`, `rn`, `lt`, `eq` and `gt`. The easiest one is `eq`, as it obviously is the intersection of both sets:

```eq(A, B) = A ∩ B
```

By substitution in `merge` it is possible to show that `ln(A, B) = rn(B, A)` and vice versa; hence just one set expression suffices. The merge ends with `rn` for every element in A that is greater than every element in B, as the latter were included in the comparison sets; and conversely for `ln`. Hence:

```ln(A, B) = B / A ≡ ⟨S y : B : ⟨∀ x : A : y > x⟩⟩
rn(A, B) = A / B ≡ ⟨S x : A : ⟨∀ y : B : x > y⟩⟩
```

(read A / B as "A over B"). All three sets are pairwise disjoint, since A / B ⊆ A and A / B ∩ B = ∅, and conversely, by construction.

The two remaining sets are also symmetric in the sense that `lt(A, B) = gt(B, A)` but are more difficult to characterize. My first attempt was to think of each element in A as being processed in turn and put into `lt(A, B)` just when strictly less than all the elements in B against which it could be matched, namely `lt(A, B) = ⟨S x : A : x < ⟨min y : B : x ≤ y⟩⟩`. The condition can be simplified with a bit of equational reasoning:

```   x∈A ∧ x < ⟨min y : B : x ≤ y⟩
≡ { GLB }
x∈A ∧ ⟨∀ y : y∈B ∧ x ≤ y : x < y⟩⟩
x∈A ∧ ⟨∀ y : y∈B : x > y ∨ x < y⟩⟩
≡ { Trichotomy }
x∈A ∧ ⟨∀ y : y∈B : x ≠ y⟩⟩
≡
x∈A ∧ x∉B
```

In other words, A − B. The problem is that, since the quantification over an empty set is trivially true, this set is too big as it includes the respective remainder; that is to say A / B ⊆ A − B as I showed above. To preserve disjointness I define:

```lt(A, B) = A − B − A / B
gt(A, B) = B − A − B / A
```

In a Venn diagram, these five sets are:

So by including or excluding one of the five components depending on the function passed to each of the five operations, the 32 set operations achievable by `merge` are:

Or in tabular form:

N`ln``rn``lt``eq``gt`Function
0`self``self``cons``cons``cons`A∪B
1`self``self``cons``cons``tail`A B/A
2`self``self``cons``tail``cons`A∆B
3`self``self``cons``tail``tail`A−BB/A
4`self``self``tail``cons``cons`B A/B
5`self``self``tail``cons``tail`A∩BB/AA/B
6`self``self``tail``tail``cons`B−AA/B
7`self``self``tail``tail``tail` B/AA/B
8`self``null``cons``cons``cons`A∪BA/B
9`self``null``cons``cons``tail`A B/AA/B
10`self``null``cons``tail``cons`A∆BA/B
11`self``null``cons``tail``tail`A−BB/AA/B
12`self``null``tail``cons``cons`B
13`self``null``tail``cons``tail`A∩BB/A
14`self``null``tail``tail``cons`B−A
15`self``null``tail``tail``tail` B/A
16`null``self``cons``cons``cons`A∪BB/A
17`null``self``cons``cons``tail`A
18`null``self``cons``tail``cons`A∆BB/A
19`null``self``cons``tail``tail`A−B
20`null``self``tail``cons``cons`B B/AA/B
21`null``self``tail``cons``tail`A∩BA/B
22`null``self``tail``tail``cons`B−AB/AA/B
23`null``self``tail``tail``tail` A/B
24`null``null``cons``cons``cons`A∪BB/AA/B
25`null``null``cons``cons``tail`A A/B
26`null``null``cons``tail``cons`A∆BB/AA/B
27`null``null``cons``tail``tail`A−BA/B
28`null``null``tail``cons``cons`B B/A
29`null``null``tail``cons``tail`A∩B
30`null``null``tail``tail``cons`B−AB/A
31`null``null``tail``tail``tail`

Arguably, besides the traditional five set operations A ∪ B, A ∩ B, A − B, B − A and A ∆ B, only the remainders A / B, B / A and perhaps A / B ∪ B / A = A ⊔ B, the join of A and B (not to be confused with the relational operation), are independently useful. These three are obscure, and as far as I know have no name, although I'd love to hear if there is literature about them. This might mean that this exhaustive catalog of set merges is rather pointless, but at least now I know for sure.

## 2012-08-02

### A Helping Phantom Hand

You don't have to be writing an interpreter or some other kind of abstract code to profit from some phantom types. Suppose you have two or more functions that work by "cooking" a simple value (a `float`, say) with a lengthy computation before proceeding:

```let sun_geometric_longitude j =
let je = to_jcen (to_jde j) in
(* … computation with je … *)

let sun_apparent_longitude j =
let je = to_jcen (to_jde j) in
let q  = sun_geometric_longitude j in
(* … computation with je … *)
```

In this case j is a date expressed in Julian Days as a `float`, and `to_jde` computes the Ephemeris Time as a 63-term trigonometric polynomial correction on it. `sun_apparent_longitude` calls `sun_geometric_longitude` and both call `to_jde`. Obviously this unnecessary duplication can be factored out:

```let sun_geometric_longitude je =
let t  = to_jcen je in
(* … computation with je … *)

let sun_apparent_longitude je =
let q  = sun_geometric_longitude je in
let t  = to_jcen je in
(* … computation with je … *)

let foo j =
let je = to_jde j in
let l  = sun_apparent_longitude je in
(* … *)
```

(`to_jcen` is cheap and not worth factoring out.) But now a naked `float` represent two different things, Universal Time and Ephemeris Time, and we have a valid concern of mixing them up. We can wrap the time in an ADT:

```type dt = UT of float | TD of float

let to_jcen j = match j with
| UT j ->
(* … lengthy computation … *)
TD j
| TD _ -> invalid_arg "to_jcen"

let sun_geometric_longitude je = match je with
| TD je ->
let t  = to_jcen je in
(* … computation with je … *)
| UT _  -> invalid_arg "sun_geometric_longitude"

let sun_apparent_longitude je = match je with
| TD je ->
let q  = sun_geometric_longitude je in
let t  = to_jcen je in
(* … computation with je … *)
| UT _  -> invalid_arg "sun_apparent_longitude"

let foo j =
let je = to_jde j in
(* … computation with sun_apparent_longitude je … *)
```

but this forces us to check at run-time whether we mixed times up in our code. A better technique is to use a phantom type. First fix an abstract signature for the module implementing these functions:

```module Test : sig
type 'a dt

val datetime : yy:int -> mm:int -> dd:int -> hh:int -> nn:int -> ss:float -> [ `JD ] dt
val to_jde   : [ `JD ] dt -> [ `JDE ] dt
val to_jcen  : 'a dt -> float

val sun_geometric_longitude : [ `JDE ] dt -> float
val sun_apparent_longitude  : [` JDE ] dt -> float
end = struct
(* … *)
```

We have a way to construct our type `α dt` from a Universal `datetime`, a way to convert it to Dynamical Time with `to_jde` and operations that respect the kind of measure. The implementation is as before:

```  (* … *)
type 'a dt = float (* phantom type! *)

let datetime ~yy ~mm ~dd ~hh ~nn ~ss = (* … *)

let to_jde  j = (* … *)
let to_jcen j = (* … *)

let sun_geometric_longitude je =
(* … computation with a statically checked je … *)

let sun_apparent_longitude je =
let q  = sun_geometric_longitude je in
(* … computation with a statically checked je … *)
end
```

Now the compiler checks for us that we don't mix up measures. The only inconvenient of this approach is that the type `α dt` is fully abstract, and you must provide coercions, `string_of`s and pretty printers for it if you need to show them or debug your code. There is a way out, though; just make it a `private` type abbreviation:

```module Test : sig
type 'a dt = private float
(* … signature exactly as before … *)
end = struct
type 'a dt = float (* phantom type! *)
(* … implementation exactly as before … *)
end
```

Now `α dt` will be shown in the top-level, can be printed with a coercion `(je :> float)`, etc.

For another simple example, suppose you want to represent sets as `list`s. The best way to do that is to keep them sorted so that set operations run in linear time. If you want to provide some operations that temporarily destroy the ordering, a phantom type can keep track of the invariant "this `list` is sorted":

```module Set : sig
type ('a, 'b) t = private 'a list

val of_list : 'a list -> ('a, [ `S ]) t
val as_set  : ('a, [ `U ]) t -> ('a, [ `S ]) t
val empty   : ('a, [ `S ]) t
val mem     : 'a -> ('a, [ `S ]) t -> bool
val add     : 'a -> ('a, [ `S ]) t -> ('a, [ `S ]) t
val union   : ('a, [ `S ]) t -> ('a, [ `S ]) t -> ('a, [ `S ]) t
val inter   : ('a, [ `S ]) t -> ('a, [ `S ]) t -> ('a, [ `S ]) t
val append  : ('a, 'b) t -> ('a, 'c) t -> ('a, [ `U ]) t
end = struct
type ('a, 'b) t = 'a list

let of_list l   = List.sort compare l
and as_set  l   = List.sort compare l
and empty       = []
let union   l r = (* … linear merge … *)
and inter   l r = (* … linear merge … *)
let mem     e   = List.mem e
and add     e   = union [e]
and append      = List.append
end
```

The phantom type `[ `S | `U ]` tracks the sortedness of the `list`. Note that in the case of `append` the argument lists can have any ordering but the result is known to be unsorted. Note also how the fact that the empty `list` is by definition sorted is directly reflected in the type.