2009-06-17

The Tree Nursery

On my last post, Benedikt Grundmann asked if changing representations for trie nodes, in particular eliminating 'a option-boxed references, would represent a speed-up or not. I started tinkering with a direct representation:

type 'a t =
{ value : 'a option;
  split : char;
  lokid : 'a t;
  eqkid : 'a t;
  hikid : 'a t; }

(call this representation trie1). This forces me to use a sentinel node to represent empty trees:

let rec empty = { value = None; split = '\255'; lokid = empty; eqkid = empty; hikid = empty; }

let is_empty t = t == empty

(the split value is entirely arbitrary and never relied upon). This recursive structure means that I cannot use pattern matches to discriminate and deconstruct nodes, and all comparisons must be strictly by reference equality on the carefully preserved unique sentinel node. This represented a rough 10% speedup on my original code.

Meanwhile, Mauricio Fernández tried his hand at it, and came with a tantalizing result:

in the lapse between the present post and the preceding one, I implemented ternary trees with different constructors for the nodes with and without a value. In my tests, construction is 50% faster, and lookups are between 20% and 80% faster. The code is much less elegant than the one presented here, because it moreover includes some special-casing to handle the "" (empty string) key, which the above code chokes at.

(It's true that the empty string cannot be a key in both the original and the above implementation. I thought it obvious by the tradeoffs I listed in selecting the representation for the key, but truth is I never guarded the operations against it. It's a bug, not a feature). The numbers are too exciting to not try this approach. By carefully teasing apart each case, I arrived to the following data type:

type 'a t = E
          | L of 'a
          | B of      char * 'a t * 'a t * 'a t
          | K of 'a * char * 'a t * 'a t * 'a t

(call this representation trie2). The first constructor E represents an empty tree. The second constructor L represents a leaf with value α. The third constructor B represents a branching node without value. The fourth constructor K represents a marked, valued branching node. Functions over this type require eight matches in general, four each for the middle and four for the end of the key.

I'll present the benchmark results first, then the code. In every case I've run the benchmark three times in sequence from the command line and retained the last measurement. My test machine is an MSI U100 with an 1.6 GHz Intel Atom and 2 GB RAM running Mac OS X 10.5.7, and the code is compiled with ocamlopt -g -inline 10 -w Aelz -pp camlp4o version 3.11.0. Each run is comprised of three tests using the standard Map balanced-height tree, trie1 and trie2 ternary trees, with a Gc.full_major collection at the start of each. The test data sets are the 234936-word /usr/share/dict/words file in Mac OS X, and a 89546-word Spanish lexicon, with 3.91% overlap between both. The tests measure:

  1. Insertion of all the English words into an empty tree via a fold_left
  2. Retrieval of the list of all inserted words via a S.fold (with validation of length and sortedness outside the timing)
  3. Retrieval of all the English words
  4. Retrieval of all the Spanish words
  5. Deletion of all the English words from the tree via a fold_left (with validation that the resulting tree is empty)

These are the results for the list of words in dictionary order:

Test of 89546-word set in dictionary order. In ops/s.
Data StructureInsertList AllFind AllFind Sp.Delete All
Map167,105.772,056,784.27254,101.92225,067.88802,020.62
trie1184,619.90373,794.66981,049.901,201,493.12273,188.70
trie2232,617.76468,132.42791,158.14984,052.05371,658.90

These are the results for the list of words in median order, that is the list with the median first followed by the left and right sublists both in median order:

Test of 89546-word set in median order. In ops/s.
Data StructureInsertList AllFind AllFind Sp.Delete All
Map168,873.282,100,585.38251,426.31224,190.06318,687.33
trie1270,556.35352,670.961,669,707.662,139,124.18707,988.31
trie2329,939.38462,198.781,385,399.361,692,227.60804,071.63

These are the combined results, normalized so that the corresponding operations on Map are 1:

Test of 89546-word set both in dictionary and in median order. Relative to Map
Data StructureInsertList AllFind AllFind Sp.Delete All
trie1 (dict)1.104810.1817373.860855.338360.340626
trie2 (dict)1.392040.2276043.113554.372250.463403
trie1 (median)1.602130.1678926.640949.541572.22158
trie2 (median)1.953770.2200335.510167.548182.52307

Some things are easy to see and to explain:

  • The insertion code is faster for ternary trees than for the height-balanced tree, as there is no need to maintain expensive shape invariants
  • The ternary trees highly favor lookups, especially by non-present keys
  • Insertion in sorted order is pessimal for ternary trees. Even so, they remain very fast, as claimed by Bentley and Sedgewick
  • The fold reconstructs the keys one character at a time by appending to the current prefix, which is quadratic on the prefix length. This accounts for the speed being a 20% of the corresponding Map.fold
  • The shape of the ternary trees depend on the order of key insertion in a dramatic way
  • The shape of height-balanced trees also depend heavily on the order of key insertion

Some other things are not so easy to understand. In particular, it is not easy to see why trie2 is 20%–30% faster than trie1 except for lookups where it is a 20% slower.

For comparison, here's trie1's lookup:

let lookup k t =
  let n = String.length k in
  if n == 0 then failwith "lookup" else
  let rec go i t =
    if is_empty t then None else
    let c = k.[i] in
    if t.split > c then go  i    t.lokid else
    if t.split < c then go  i    t.hikid else
    if i + 1   < n then go (i+1) t.eqkid else
                        t.value
  in go 0 t

And here's trie2's:

let lookup k t =
  let n = String.length k in
  let rec go i = function
  | E                             -> None
  | L  v              when i == n -> Some v
  | L  _                          -> None
  | B (   _, _, _, _) when i == n -> None
  | B (   c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then go  i    l else
    if  c' > c then go  i    h else
                    go (i+1) q
  | K (v, _, _, _, _) when i == n -> Some v
  | K (_, c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then go  i    l else
    if  c' > c then go  i    h else
                    go (i+1) q
  in go 0 t

(It's not that ugly). Twice the number of lines, a slew of cases, it should be obvious why the latter is slower, right? Then compare and contrast: here's trie1's insert:

let add k v t =
  let n = String.length k in
  if n == 0 then failwith "add" else
  let rec go i t =
    if is_empty t then
      if i+1 < n then { empty with split = k.[i]; eqkid = go (i+1) t }
                 else { empty with split = k.[i]; value = Some v     }
    else
    let c = k.[i] in
    if t.split > c then { t with lokid = go  i    t.lokid } else
    if t.split < c then { t with hikid = go  i    t.hikid } else
    if i + 1   < n then { t with eqkid = go (i+1) t.eqkid } else
                        { t with value = Some v           }
  in go 0 t

Here's trie2's

let add k v t =
  let n = String.length k in
  let rec go i = function
  | E                 when i == n -> L  v
  | E                             -> B (   k.[i], E, go (i+1) E, E)
  | L  _              when i == n -> L  v
  | L  v                          -> K (v, k.[i], E, go (i+1) E, E)
  | B (   c, l, q, h) when i == n -> K (v, c    , l, q         , h)
  | B (   c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then B (   c, go i l,          q,      h) else
    if  c' > c then B (   c,      l,          q, go i h) else
                    B (   c,      l, go (i+1) q,      h)
  | K (_, c, l, q, h) when i == n -> K (v, c    , l, q         , h)
  | K (v, c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then K (v, c, go i l,          q,      h) else
    if  c' > c then K (v, c,      l,          q, go i h) else
                    K (v, c,      l, go (i+1) q,      h)
  in go 0 t

(Now this is not as pretty). They are exactly in the same shape and relation one with another. Some things I've observed:

  • OCaml is not as clever as it should be compiling or-patterns. Many of the branches in trie2's functions could be coalesced into those, but the speed hit is very consistently noticeable
  • Refining the constructors for each code path pays off in algorithmic ways: the normalization required in remove can be restricted to those branches statically known to have modified children, and the sharing can be made explicit:
    let remove k t =
      let prune = function
      | B (   _, E, E, E) -> E
      | K (v, _, E, E, E) -> L v
      | t                 -> t
      in
      let n = String.length k in
      let rec go i t = match t with
      | E                             -> t
      | L  _              when i == n -> E
      | L  _                          -> t
      | B (   _, _, _, _) when i == n -> t
      | B (   c, l, q, h)             ->
        let c' = k.[i] in
        if  c' < c then prune (B (   c, go i l,          q,      h)) else
        if  c' > c then prune (B (   c,      l,          q, go i h)) else
                        prune (B (   c,      l, go (i+1) q,      h))
      | K (_, c, l, q, h) when i == n -> B (c, l, q, h)
      | K (v, c, l, q, h) ->
        let c' = k.[i] in
        if  c' < c then prune (K (v, c, go i l,          q,      h)) else
        if  c' > c then prune (K (v, c,      l,          q, go i h)) else
                        prune (K (v, c,      l, go (i+1) q,      h))
      in go 0 t
    

    Furthermore, the duality between add and remove is easy to see: the former's base cases introduce L and K nodes, the latter turns them into E and B nodes, respectively.

In any case, I look forward to read Mauricio's analysis and code. All this goes to show that these ternary trees are excellent data structures to represent string maps, as claimed by Bentley and Sedgewick.

1 comment:

Jon Harrop said...

Try lambda lifting your "go" function in "lookup".