In my last post, Benedikt Grundmann asked whether changing the representation of trie nodes, in particular eliminating the `'a option`-boxed references, would represent a speed-up or not. I started tinkering with a direct representation:

```ocaml
type 'a t = { value : 'a option; split : char; lokid : 'a t; eqkid : 'a t; hikid : 'a t; }
```

(call this representation `trie₁`). This forces me to use a sentinel node to represent empty trees:

```ocaml
let rec empty = { value = None; split = '\255'; lokid = empty; eqkid = empty; hikid = empty; }
let is_empty t = t == empty
```

(the `split` value is entirely arbitrary and never relied upon). This recursive structure means that I cannot use pattern matching to discriminate and deconstruct nodes, and all comparisons must be strictly by reference equality on the carefully preserved unique sentinel node. This represented a rough 10% speedup over my original code.
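To make the sentinel concrete, here is a self-contained sketch (the `singleton` value is my own illustration, not part of the post's code) of why emptiness is tested with physical equality:

```ocaml
type 'a t = { value : 'a option; split : char;
              lokid : 'a t; eqkid : 'a t; hikid : 'a t; }

(* The unique sentinel is a cyclic record: it is its own child on all
   three sides.  It is distinguished by identity, not by shape, so the
   test is physical equality (==) against this one value. *)
let rec empty = { value = None; split = '\255';
                  lokid = empty; eqkid = empty; hikid = empty; }

let is_empty t = t == empty

(* A one-key tree: any proper node shares the sentinel as its children. *)
let singleton = { empty with split = 'a'; value = Some 1 }
```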

Meanwhile, Mauricio Fernández tried his hand at it, and came up with a tantalizing result:

> In the lapse between the present post and the preceding one, I implemented ternary trees with different constructors for the nodes with and without a value. In my tests, construction is 50% faster, and lookups are between 20% and 80% faster. The code is much less elegant than the one presented here, because it moreover includes some special-casing to handle the `""` (empty string) key, which the above code chokes on.

(It's true that the empty string cannot be a key in either the original or the above implementation. I thought it obvious from the tradeoffs I listed in selecting the representation for the key, but the truth is I never guarded the operations against it. It's a bug, not a feature.) The numbers are too exciting not to try this approach. By carefully teasing apart each case, I arrived at the following data type:

```ocaml
type 'a t =
| E
| L of 'a
| B of char * 'a t * 'a t * 'a t
| K of 'a * char * 'a t * 'a t * 'a t
```

(call this representation `trie₂`). The first constructor `E` represents an empty tree. The second constructor `L` represents a leaf with value `α`. The third constructor `B` represents a branching node without a value. The fourth constructor `K` represents a marked, valued branching node. Functions over this type require eight matches in general: four for the middle of the key and four for its end.
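To see the constructors in action, here is a small self-contained illustration (keys and values chosen arbitrarily) of the trees built for the keys `"a"` and `"ab"`, using the `add` function given later in this post, reproduced verbatim:

```ocaml
type 'a t =
| E
| L of 'a
| B of char * 'a t * 'a t * 'a t
| K of 'a * char * 'a t * 'a t * 'a t

(* The trie₂ insertion function from later in the post. *)
let add k v t =
  let n = String.length k in
  let rec go i = function
  | E when i == n -> L v
  | E -> B (k.[i], E, go (i+1) E, E)
  | L _ when i == n -> L v
  | L v -> K (v, k.[i], E, go (i+1) E, E)
  | B (c, l, q, h) when i == n -> K (v, c, l, q, h)
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then B (c, go i l, q, h) else
    if c' > c then B (c, l, q, go i h) else
    B (c, l, go (i+1) q, h)
  | K (_, c, l, q, h) when i == n -> K (v, c, l, q, h)
  | K (v, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then K (v, c, go i l, q, h) else
    if c' > c then K (v, c, l, q, go i h) else
    K (v, c, l, go (i+1) q, h)
  in go 0 t

(* "a" -> 1 alone: a branch on 'a' whose middle child is a leaf L. *)
let just_a = add "a" 1 E

(* Adding "ab" -> 2 turns that leaf into a valued branching node K. *)
let a_and_ab = add "ab" 2 just_a
```

The second step shows exactly the `L` to `K` promotion the constructors exist for.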

I'll present the benchmark results first, then the code. In every case I've run the benchmark three times in sequence from the command line and retained the last measurement. My test machine is an MSI U100 with a 1.6 GHz Intel Atom and 2 GB RAM running Mac OS X 10.5.7, and the code is compiled with `ocamlopt -g -inline 10 -w Aelz -pp camlp4o` version 3.11.0. Each run comprises three tests using the standard `Map` balanced-height tree and the `trie₁` and `trie₂` ternary trees, with a `Gc.full_major` collection at the start of each. The test data sets are the 234936-word `/usr/share/dict/words` file in Mac OS X and an 89546-word Spanish lexicon, with 3.91% overlap between the two. The tests measure:

- Insertion of all the English words into an empty tree via a `fold_left`
- Retrieval of the list of all inserted words via a `S.fold` (with validation of length and sortedness outside the timing)
- Retrieval of all the English words
- Retrieval of all the Spanish words
- Deletion of all the English words from the tree via a `fold_left` (with validation that the resulting tree is empty)
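The harness itself is not shown; the following is my own minimal sketch of the timing pattern described above (the `time` and `build` names are assumptions, and the actual benchmark reports throughput rather than raw seconds):

```ocaml
(* Run a test thunk after a Gc.full_major collection, as described above,
   reporting elapsed CPU time.  A sketch, not the post's actual harness. *)
let time label f =
  Gc.full_major ();
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %.4fs\n" label (Sys.time () -. t0);
  r

(* The first test: insert every word into an empty tree via fold_left.
   [add] and [empty] stand for the operations of whichever structure is
   under test. *)
let build add empty words =
  time "insert" (fun () ->
    List.fold_left (fun t w -> add w () t) empty words)
```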

These are the results for the list of words in *dictionary* order:

Data Structure | Insert | List All | Find All | Find Sp. | Delete All |
---|---|---|---|---|---|
`Map` | 167,105.77 | 2,056,784.27 | 254,101.92 | 225,067.88 | 802,020.62 |
`trie₁` | 184,619.90 | 373,794.66 | 981,049.90 | 1,201,493.12 | 273,188.70 |
`trie₂` | 232,617.76 | 468,132.42 | 791,158.14 | 984,052.05 | 371,658.90 |

These are the results for the list of words in *median* order, that is, the list with the median first, followed by the left and right sublists, both in median order:

Data Structure | Insert | List All | Find All | Find Sp. | Delete All |
---|---|---|---|---|---|
`Map` | 168,873.28 | 2,100,585.38 | 251,426.31 | 224,190.06 | 318,687.33 |
`trie₁` | 270,556.35 | 352,670.96 | 1,669,707.66 | 2,139,124.18 | 707,988.31 |
`trie₂` | 329,939.38 | 462,198.78 | 1,385,399.36 | 1,692,227.60 | 804,071.63 |

These are the combined results, normalized so that the corresponding operations on `Map` are 1:

Data Structure | Insert | List All | Find All | Find Sp. | Delete All |
---|---|---|---|---|---|
`trie₁` (dict) | 1.10481 | 0.181737 | 3.86085 | 5.33836 | 0.340626 |
`trie₂` (dict) | 1.39204 | 0.227604 | 3.11355 | 4.37225 | 0.463403 |
`trie₁` (median) | 1.60213 | 0.167892 | 6.64094 | 9.54157 | 2.22158 |
`trie₂` (median) | 1.95377 | 0.220033 | 5.51016 | 7.54818 | 2.52307 |

Some things are easy to see and to explain:

- The insertion code is faster for ternary trees than for the height-balanced tree, as there is no need to maintain expensive shape invariants
- The ternary trees highly favor lookups, *especially* of absent keys
- Insertion in sorted order is pessimal for ternary trees. Even so, they remain very fast, as claimed by Bentley and Sedgewick
- The `fold` reconstructs the keys one character at a time by appending to the current prefix, which is quadratic in the prefix length. This accounts for its speed being 20% of the corresponding `Map.fold`'s
- The shape of the ternary trees depends dramatically on the order of key insertion
- The shape of height-balanced trees *also* depends heavily on the order of key insertion
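The quadratic prefix rebuild in `fold` is fixable. Here is a sketch of my own (not the post's code) for the `trie₂` type, threading a single `Buffer` through the traversal and truncating it on the way back up, so that each key costs time proportional to its length:

```ocaml
type 'a t =
| E
| L of 'a
| B of char * 'a t * 'a t * 'a t
| K of 'a * char * 'a t * 'a t * 'a t

(* Fold over the bindings in ascending key order, reusing one scratch
   Buffer for the current prefix instead of appending to an immutable
   string at every character. *)
let fold f t z =
  let b = Buffer.create 16 in
  let rec go t z = match t with
  | E -> z
  | L v -> f (Buffer.contents b) v z
  | B (c, l, q, h) ->
    let z = go l z in
    Buffer.add_char b c;
    let z = go q z in
    Buffer.truncate b (Buffer.length b - 1);
    go h z
  | K (v, c, l, q, h) ->
    (* The value at a K node belongs to the prefix read so far. *)
    let z = f (Buffer.contents b) v z in
    let z = go l z in
    Buffer.add_char b c;
    let z = go q z in
    Buffer.truncate b (Buffer.length b - 1);
    go h z
  in go t z

(* The tree for "a" -> 1, "ab" -> 2. *)
let example = B ('a', E, K (1, 'b', E, L 2, E), E)
```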

Some other things are not so easy to understand. In particular, it is not easy to see why `trie₂` is 20%–30% faster than `trie₁` *except* for lookups, where it is 20% slower.

For comparison, here's `trie₁`'s `lookup`:

```ocaml
let lookup k t =
  let n = String.length k in
  if n == 0 then failwith "lookup" else
  let rec go i t =
    if is_empty t then None else
    let c = k.[i] in
    if t.split > c then go i t.lokid else
    if t.split < c then go i t.hikid else
    if i + 1 < n then go (i+1) t.eqkid else
    t.value
  in go 0 t
```

And here's `trie₂`'s:

```ocaml
let lookup k t =
  let n = String.length k in
  let rec go i = function
  | E -> None
  | L v when i == n -> Some v
  | L _ -> None
  | B (_, _, _, _) when i == n -> None
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then go i l else
    if c' > c then go i h else
    go (i+1) q
  | K (v, _, _, _, _) when i == n -> Some v
  | K (_, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then go i l else
    if c' > c then go i h else
    go (i+1) q
  in go 0 t
```

(It's not *that* ugly.) Twice the number of lines, a slew of cases: it should be obvious why the latter is slower, right? Then compare and contrast: here's `trie₁`'s `add`:

```ocaml
let add k v t =
  let n = String.length k in
  if n == 0 then failwith "add" else
  let rec go i t =
    if is_empty t then
      if i+1 < n then { empty with split = k.[i]; eqkid = go (i+1) t }
      else { empty with split = k.[i]; value = Some v }
    else let c = k.[i] in
    if t.split > c then { t with lokid = go i t.lokid } else
    if t.split < c then { t with hikid = go i t.hikid } else
    if i + 1 < n then { t with eqkid = go (i+1) t.eqkid } else
    { t with value = Some v }
  in go 0 t
```

And here's `trie₂`'s:

```ocaml
let add k v t =
  let n = String.length k in
  let rec go i = function
  | E when i == n -> L v
  | E -> B (k.[i], E, go (i+1) E, E)
  | L _ when i == n -> L v
  | L v -> K (v, k.[i], E, go (i+1) E, E)
  | B (c, l, q, h) when i == n -> K (v, c, l, q, h)
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then B (c, go i l, q, h) else
    if c' > c then B (c, l, q, go i h) else
    B (c, l, go (i+1) q, h)
  | K (_, c, l, q, h) when i == n -> K (v, c, l, q, h)
  | K (v, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then K (v, c, go i l, q, h) else
    if c' > c then K (v, c, l, q, go i h) else
    K (v, c, l, go (i+1) q, h)
  in go 0 t
```

(Now this is not as pretty.) The two pairs are exactly the same in shape and in their relation to one another. Some things I've observed:

- OCaml is not as clever as it should be at compiling or-patterns. Many of the branches in `trie₂`'s functions could be coalesced into or-patterns, but the speed hit is very consistently noticeable
- Refining the constructors for each code path pays off in algorithmic ways: the normalization required in `remove` can be restricted to those branches statically known to have modified children, and the sharing can be made explicit:

```ocaml
let remove k t =
  let prune = function
  | B (_, E, E, E)    -> E
  | K (v, _, E, E, E) -> L v
  | t -> t in
  let n = String.length k in
  let rec go i t = match t with
  | E -> t
  | L _ when i == n -> E
  | L _ -> t
  | B (_, _, _, _) when i == n -> t
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then prune (B (c, go i l, q, h)) else
    if c' > c then prune (B (c, l, q, go i h)) else
    prune (B (c, l, go (i+1) q, h))
  | K (_, c, l, q, h) when i == n -> B (c, l, q, h)
  | K (v, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then prune (K (v, c, go i l, q, h)) else
    if c' > c then prune (K (v, c, l, q, go i h)) else
    prune (K (v, c, l, go (i+1) q, h))
  in go 0 t
```

Furthermore, the duality between `add` and `remove` is easy to see: the former's base cases introduce `L` and `K` nodes; the latter turns them back into `E` and `B` nodes, respectively.
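That duality can be checked concretely. The following self-contained sketch reuses the `remove` above verbatim on a hand-built two-key tree (the example keys and values are my own):

```ocaml
type 'a t =
| E
| L of 'a
| B of char * 'a t * 'a t * 'a t
| K of 'a * char * 'a t * 'a t * 'a t

(* remove, reproduced from the post. *)
let remove k t =
  let prune = function
  | B (_, E, E, E)    -> E
  | K (v, _, E, E, E) -> L v
  | t -> t in
  let n = String.length k in
  let rec go i t = match t with
  | E -> t
  | L _ when i == n -> E
  | L _ -> t
  | B (_, _, _, _) when i == n -> t
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then prune (B (c, go i l, q, h)) else
    if c' > c then prune (B (c, l, q, go i h)) else
    prune (B (c, l, go (i+1) q, h))
  | K (_, c, l, q, h) when i == n -> B (c, l, q, h)
  | K (v, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then prune (K (v, c, go i l, q, h)) else
    if c' > c then prune (K (v, c, l, q, go i h)) else
    prune (K (v, c, l, go (i+1) q, h))
  in go 0 t

(* The tree for "a" -> 1, "ab" -> 2. *)
let t = B ('a', E, K (1, 'b', E, L 2, E), E)
```

Removing `"ab"` collapses the `K` node back to an `L`; removing `"a"` instead demotes it to a `B`; removing both leaves `E`.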

In any case, I look forward to reading Mauricio's analysis and code. All this goes to show that these ternary trees are *excellent* data structures for representing string maps, as claimed by Bentley and Sedgewick.

## 1 comment:

Try lambda lifting your "go" function in "lookup".
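Lambda lifting `go` would mean passing the captured `k` and `n` as explicit arguments, so that the inner function becomes a plain top-level recursive function with no closure. A sketch of what the commenter may have in mind, applied to the `trie₁` version (the `example` value is an illustration of my own):

```ocaml
type 'a t = { value : 'a option; split : char;
              lokid : 'a t; eqkid : 'a t; hikid : 'a t; }

let rec empty = { value = None; split = '\255';
                  lokid = empty; eqkid = empty; hikid = empty; }
let is_empty t = t == empty

(* go lifted to the top level: k and n are now parameters instead of
   variables captured from lookup's scope. *)
let rec go k n i t =
  if is_empty t then None else
  let c = k.[i] in
  if t.split > c then go k n i t.lokid else
  if t.split < c then go k n i t.hikid else
  if i + 1 < n then go k n (i+1) t.eqkid else
  t.value

let lookup k t =
  let n = String.length k in
  if n == 0 then failwith "lookup" else go k n 0 t

(* A one-key tree: "a" -> 1. *)
let example = { empty with split = 'a'; value = Some 1 }
```

Whether this wins anything depends on how the compiler treats the original closure; it is a plausible micro-optimization to measure, not a guaranteed one.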
