2012-08-06

Merge Right

The so-called master-transaction update is one of the, if not the defining algorithms of the discipline formerly known as "data processing". Given two sorted files of records with increasing keys, the process applies each record in the transaction file to each record of the the master file and outputs the result, if any, to the updated master file in one pass over each input. The same algorithm can compute the union, intersection or difference of sorted sequences. For instance, the union of two sets represented as sorted lists of unique elements is:

let union       =
  let rec go l r = match l, r with
  | [], xs | xs, []  -> xs
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> x :: go       xs (y :: ys)
    |  0 -> x :: go       xs       ys
    |  1 -> y :: go (x :: xs)      ys
    |  _ -> assert false
  in go

Intersection is:

let inter       =
  let rec go l r = match l, r with
  | [], _  | _, []   -> []
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 ->      go       xs (y :: ys)
    |  0 -> x :: go       xs       ys
    |  1 ->      go (x :: xs)      ys
    |  _ -> assert false
  in go

while difference is:

let diff       =
  let rec go l r = match l, r with
  | [], _  | _, []   -> l
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> x :: go       xs (y :: ys)
    |  0 ->      go       xs       ys
    |  1 ->      go (x :: xs)      ys
    |  _ -> assert false
  in go

And so on. The three functions use the same underlying merge schemata; what varies is the operation to perform in each of the five possible cases:

  1. The left sequence is nil
  2. The right sequence is nil
  3. The next element in the left sequence is less than the next element in the right sequence
  4. The next element in the left sequence is equal to the next element in the right sequence
  5. The next element in the left sequence is greater than the next element in the right sequence

The question is, then, how many set operations can the merge algorithm implement? These five cases partition both input sequences in disjoint sets such that the output sequence is the natural merge of zero or more of them. For example, processing the sets { 1, 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9 } results in the following steps:

LNRNLTEQGTArguments
1 { 1, 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9, 10 }
2{ 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9, 10 }
3 { 3, 4, 6, 7, 8 } ⋈ { 3, 5, 6, 8, 9, 10 }
4 { 4, 6, 7, 8 } ⋈ { 5, 6, 8, 9, 10 }
5{ 6, 7, 8 } ⋈ { 5, 6, 8, 9, 10 }
6 { 6, 7, 8 } ⋈ { 6, 8, 9, 10 }
7 { 7, 8 } ⋈ { 8, 9, 10 }
8 { 8 } ⋈ { 9, 10 }
9,10∅ ⋈ { 9, 10 }

Abstracting away the operations to perform in each of these five cases we have the following schema:

let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
  | [], ys -> ln ys
  | xs, [] -> rn xs
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> lt x (go       xs (y :: ys))
    |  0 -> eq x (go       xs       ys )
    |  1 -> gt y (go (x :: xs)      ys )
    |  _ -> assert false
  in go

Both ln and rn must decide what to do with the remaining list and so have type α list → α list, while lt, eq and gt must decide what to do with the element in consideration and so have type αα list → α list; thus the type of merge is (α list → α list) → (α list → α list) → (αα list → α list) → (αα list → α list) → (αα list → α list) → α list → α list → α list. The operations on the remainder either pass it unchanged or return nil, while the operations on the next element either add it to the output sequence or not:

let self   xs =      xs
and null   _  =      []
and cons x xs = x :: xs
and tail _ xs =      xs

(some of these have well-known names in functional programming, but here I choose to use these neat, four-letter ones.) With the proviso that the output sequence must be increasing these four functions exhaust the possibilities by parametricity; otherwise, duplications and rearrangements would satisfy the parametric signature. Now union, intersection and difference are simply:

let union l r = merge self self  cons cons cons l r
and inter l r = merge null null  tail cons tail l r
and diff  l r = merge null self  cons tail tail l r

It is obvious that the question I posed above is answered as 25 = 32 possible set operations obtainable by varying each of the five operations. The next question is, then, what are these 32 operations? Let me characterize each of the five sets ln, rn, lt, eq and gt. The easiest one is eq, as it obviously is the intersection of both sets:

eq(A, B) = A ∩ B

By substitution in merge it is possible to show that ln(A, B) = rn(B, A) and vice versa; hence just one set expression suffices. The merge ends with rn for every element in A that is greater than every element in B, as the latter were included in the comparison sets; and conversely for ln. Hence:

ln(A, B) = B / A ≡ ⟨S y : B : ⟨∀ x : A : y > x⟩⟩
rn(A, B) = A / B ≡ ⟨S x : A : ⟨∀ y : B : x > y⟩⟩

(read A / B as "A over B"). All three sets are pairwise disjoint, since A / B ⊆ A and A / B ∩ B = ∅, and conversely, by construction.

The two remaining sets are also symmetric in the sense that lt(A, B) = gt(B, A) but are more difficult to characterize. My first attempt was to think of each element in A as being processed in turn and put into lt(A, B) just when strictly less than all the elements in B against which it could be matched, namely lt(A, B) = ⟨S x : A : x < ⟨min y : B : x ≤ y⟩⟩. The condition can be simplified with a bit of equational reasoning:

   x∈A ∧ x < ⟨min y : B : x ≤ y⟩
≡ { GLB }
   x∈A ∧ ⟨∀ y : y∈B ∧ x ≤ y : x < y⟩⟩
≡ { Trading }
   x∈A ∧ ⟨∀ y : y∈B : x > y ∨ x < y⟩⟩
≡ { Trichotomy }
   x∈A ∧ ⟨∀ y : y∈B : x ≠ y⟩⟩
≡
   x∈A ∧ x∉B

In other words, A − B. The problem is that, since the quantification over an empty set is trivially true, this set is too big as it includes the respective remainder; that is to say A / B ⊆ A − B as I showed above. To preserve disjointness I define:

lt(A, B) = A − B − A / B
gt(A, B) = B − A − B / A

In a Venn diagram, these five sets are:

Venn diagram of the five component sets

So by including or excluding one of the five components depending on the function passed to each of the five operations, the 32 set operations achievable by merge are:

Venn diagrams for all 32 set operations

Or in tabular form:

NlnrnlteqgtFunction
0selfselfconsconsconsA∪B
1selfselfconsconstailA B/A
2selfselfconstailconsA∆B
3selfselfconstailtailA−BB/A
4selfselftailconsconsB A/B
5selfselftailconstailA∩BB/AA/B
6selfselftailtailconsB−AA/B
7selfselftailtailtail B/AA/B
8selfnullconsconsconsA∪BA/B
9selfnullconsconstailA B/AA/B
10selfnullconstailconsA∆BA/B
11selfnullconstailtailA−BB/AA/B
12selfnulltailconsconsB
13selfnulltailconstailA∩BB/A
14selfnulltailtailconsB−A
15selfnulltailtailtail B/A
16nullselfconsconsconsA∪BB/A
17nullselfconsconstailA
18nullselfconstailconsA∆BB/A
19nullselfconstailtailA−B
20nullselftailconsconsB B/AA/B
21nullselftailconstailA∩BA/B
22nullselftailtailconsB−AB/AA/B
23nullselftailtailtail A/B
24nullnullconsconsconsA∪BB/AA/B
25nullnullconsconstailA A/B
26nullnullconstailconsA∆BB/AA/B
27nullnullconstailtailA−BA/B
28nullnulltailconsconsB B/A
29nullnulltailconstailA∩B
30nullnulltailtailconsB−AB/A
31nullnulltailtailtail

Arguably, besides the traditional five set operations A ∪ B, A ∩ B, A − B, B − A and A ∆ B, only the remainders A / B, B / A and perhaps A / B ∪ B / A = A ⊔ B, the join of A and B (not to be confused with the relational operation), are independently useful. These three are obscure, and as far as I know have no name, although I'd love to hear if there is literature about them. This might mean that this exhaustive catalog of set merges is rather pointless, but at least now I know for sure.

2 comments:

Anonymous said...

coq and metaocaml variants : http://gds.livejournal.com/65745.html

Anonymous said...

Interesting observation! I like such simple and clear structures which is easy to observe when using functional style and which is not so noticeable when using java or c. I'm your reader now!