2010-02-16

A Heap of (Regular) Strings

Once you have a priority queue, suddenly it gets put to use more than once. Reading Pete Manolios' "Enumerating the strings of a Regular Expression", I found (§ 3.4) the following single-paragraph explanation of an algorithm for generating all the strings accepted by a DFA:

Another approach is to […] enumerate the expression using the DFA. This can be done with a function `enumDfa` whose single argument is a list of pairs, where each pair consists of a string and a state in the DFA. We maintain the invariant that the pairs in the list are sorted with respect to ≤, using their strings as keys. In addition, the path through the DFA determined by the string of a pair leads to the state of the pair. The inital call of `enumDfa` consists of the list with the single pair consisting of the empty string and the start state. As long as its argument is not empty, `enumDfa` removes the pair at the head of the list. If it contains an accepting state, the string is added to the enumeration. In either case, the pairs consisting of states reachable in a single step and their corresponding strings are inserted into the list, which is the argument to the recursive call of `enumDfa`.

Here ≤ is a regular preorder as defined in the paper; in order for the enumeration to exhaust all possible recognized strings shorter strings must compare earlier than longer ones. I generalize a bit and use lists:

```let regular_compare cmp (l : 'a list) (r : 'a list) =
let rec go c = function
| [], [] ->  c
| [], _  -> -1
| _ , [] ->  1
| x :: xs, y :: ys -> go (if c = 0 then cmp x y else c) (xs, ys)
in go 0 (l, r)
```

Note how the list comparison is based on an elementwise comparison, and once corresponding elements are found not to match it is still necessary to recur on the tail to look for a potential difference in length that overrides a difference in content. I represent DFAs by nodes that can be terminal or not, and that are followed by a list of labeled successor edges:

```type 'a dfa = { final : bool; succ : ('a * 'a dfa) list }
```

This simple representation suffices to generate regular strings. Manolios's algorithm is straightforward, if we replace the list sorted by its first component (the shortest string generated so far) by a binary heap:

```let enum_dfa cmp (f : 'a list -> unit) (g : 'a dfa) =
let h = Heap.make (fun (w, _) (w', _) -> regular_compare cmp w w') in
Heap.insert h ([], g);
while not (Heap.is_empty h) do
let (w, q) = Heap.min h in
Heap.removemin h;
if q.final then f w;
List.iter (fun (a, q) -> Heap.insert h (w @ [a], q)) q.succ
done
```

The heap initially contains the start state of the DFA labeled by the empty string. This shortest string is repeatedly taken out of the heap together with the state reachable by it. If this state is final, the string is recognized by supplying it to the enumerating function `f`. All the edges leading out of said state augment the current string by one more character. This algorithm in fact performs a level-first search of the DFA. As it is the enumeration is extremely simple but crude and too low level. It is not difficult to wrap it in an unfold with rather better functional structure, besides the ability to control the number of strings visited:

```let unfold_iterator f e (iter : ('a -> unit) -> 'b -> unit) o =
let arg = ref e
and res = ref [] in
begin try
iter (fun x -> match f (!arg, x) with
| Some (e, y) -> arg := e; res := y :: !res
| None        -> raise Exit) o
with Exit -> () end;
List.rev !res
```

With this, stopping after generating N strings is a one-liner:

```let take_iterator n =
unfold_iterator (fun (i, x) -> if i > 0 then Some (i-1, x) else None) n
```

as is directly taking the first N character strings as such recognized by a textual DFA:

```let implode l =
let b = Buffer.create 8 in
Buffer.contents b

let take_str n dfa = List.map implode (take_iterator n (enum_dfa compare) dfa)
```

As a first example, here is the DFA recognizing Doug McIlroy's sandwich language `ab*a`:

```let sandwich =
let     s2 = { final = true ; succ = []; } in
let rec s1 = { final = false; succ = [('b', s1); ('a', s2)]; } in
let     s0 = { final = false; succ = [('a', s1)]; } in
s0
```

Its first six words are:

```# take_str 6 sandwich ;;
- : string list = ["aa"; "aba"; "abba"; "abbba"; "abbbba"; "abbbbba"]
```

As a second example, here is the DFA recognizing Doug McIlroy's even A's language `(ab*a|b)*`:

```let even_a =
let rec s1 = { final = false; succ = [('a', s0); ('b', s1)]; }
and s0 = { final = true ; succ = [('b', s0); ('a', s1)]; } in
s0
```

Its first 20 words are:

```# take_str 20 even_a ;;
- : string list =
[""; "b"; "aa"; "bb"; "aab"; "aba"; "baa"; "bbb"; "aaaa"; "aabb"; "abab";
"abba"; "baab"; "baba"; "bbaa"; "bbbb"; "aaaab"; "aaaba"; "aabaa"; "aabbb"]
```

Note that the brevity of the enumeration procedure has nothing to do with functional programming or the expressiveness of OCaml: the same solution would translate almost verbatim to an imperative language like C, modulo the explicit creation and destruction of the heap and its elements. Also, note how graph traversals are one-to-one with the data structure used to keep intermediate state: with a stack you have DFS, with a queue you have BFS, and with a priority queue you have some kind of LFS, depending on the underlying order chosen.