In my last post, Benedikt Grundmann asked whether changing the representation of trie nodes, in particular eliminating the `'a option`-boxed references, would yield a speed-up or not. I started tinkering with a direct representation:
```ocaml
type 'a t = {
  value : 'a option;
  split : char;
  lokid : 'a t;
  eqkid : 'a t;
  hikid : 'a t;
}
```
(call this representation `trie1`). This forces me to use a sentinel node to represent empty trees:
```ocaml
let rec empty =
  { value = None; split = '\255'; lokid = empty; eqkid = empty; hikid = empty }

let is_empty t = t == empty
```
(The `split` value is entirely arbitrary and never relied upon.) This recursive structure means that I cannot use pattern matching to discriminate and deconstruct nodes, and every emptiness test must be a reference-equality comparison against the carefully preserved, unique sentinel node. This representation gave a roughly 10% speedup over my original code.
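As a small illustration (this example is not from the original post), here is the `trie1` tree holding `"a"` bound to `1` and `"ab"` bound to `2`, exactly as the `add` shown further down builds it:

```ocaml
(* Illustration only: trie1 holding "a" -> 1 and "ab" -> 2.  Absent children
   point back at the shared sentinel [empty], and the value for a key lives
   in the node whose [split] is the key's last character. *)
let example : int t =
  { value = Some 1; split = 'a'; lokid = empty; hikid = empty;
    eqkid = { value = Some 2; split = 'b';
              lokid = empty; eqkid = empty; hikid = empty } }
```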
Meanwhile, Mauricio Fernández tried his hand at it and came up with a tantalizing result:
> In the lapse between the present post and the preceding one, I implemented ternary trees with different constructors for the nodes with and without a value. In my tests, construction is 50% faster, and lookups are between 20% and 80% faster. The code is much less elegant than the one presented here, because it moreover includes some special-casing to handle the `""` (empty string) key, which the above code chokes on.
(It's true that the empty string cannot be a key in either the original or the above implementation. I thought this obvious from the tradeoffs I listed when selecting the representation for the key, but the truth is I never guarded the operations against it. It's a bug, not a feature.) The numbers are too exciting not to try this approach. By carefully teasing apart each case, I arrived at the following data type:
```ocaml
type 'a t =
| E
| L of 'a
| B of char * 'a t * 'a t * 'a t
| K of 'a * char * 'a t * 'a t * 'a t
```
(call this representation `trie2`). The first constructor, `E`, represents the empty tree. The second, `L`, represents a leaf holding a value. The third, `B`, represents a branching node without a value. The fourth, `K`, represents a valued branching node. Functions over this type require eight matches in general: four for the middle of the key and four for its end.
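For comparison with the record representation above, here is the same illustrative tree, `"a"` bound to `1` and `"ab"` bound to `2`, as `trie2`'s `add` (shown below) builds it when the keys are inserted in that order:

```ocaml
(* Illustration only: trie2 holding "a" -> 1 and "ab" -> 2.  The K node
   carries the value for "a" alongside the split character 'b'; the leaf
   under its middle child holds the value for "ab". *)
let example : int t =
  B ('a', E, K (1, 'b', E, L 2, E), E)
```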
I'll present the benchmark results first, then the code. In every case I ran the benchmark three times in sequence from the command line and retained the last measurement. My test machine is an MSI U100 with a 1.6 GHz Intel Atom and 2 GB RAM running Mac OS X 10.5.7, and the code is compiled with `ocamlopt -g -inline 10 -w Aelz -pp camlp4o`, OCaml version 3.11.0. Each run comprises three tests, one each for the standard `Map` height-balanced tree and the `trie1` and `trie2` ternary trees, with a `Gc.full_major` collection at the start of each. The test data sets are the 234,936-word `/usr/share/dict/words` file in Mac OS X and an 89,546-word Spanish lexicon, with a 3.91% overlap between the two. The tests measure the following (a minimal sketch of the harness appears after the list):
- Insertion of all the English words into an empty tree via a `fold_left`
- Retrieval of the list of all inserted words via an `S.fold` (with validation of length and sortedness outside the timing)
- Retrieval of all the English words
- Retrieval of all the Spanish words
- Deletion of all the English words from the tree via a `fold_left` (with validation that the resulting tree is empty)
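To make the procedure concrete, here is a minimal sketch of the kind of harness I mean; it is not the actual benchmark code, and the `Bench` functor, its signature, and the reporting format are assumptions for illustration (the `Map`-based contender is wrapped to the same interface), but the phases mirror the list above:

```ocaml
(* Sketch of a timing harness: one functor application per implementation,
   a Gc.full_major at the start of the test, one timed pass per phase. *)
module Bench (T : sig
  type 'a t
  val empty  : 'a t
  val add    : string -> 'a -> 'a t -> 'a t
  val lookup : string -> 'a t -> 'a option
  val remove : string -> 'a t -> 'a t
  val fold   : (string -> 'a -> 'b -> 'b) -> 'a t -> 'b -> 'b
end) = struct
  let time label f =
    let t0 = Sys.time () in
    let r  = f () in
    Printf.printf "%-10s %8.3fs\n" label (Sys.time () -. t0);
    r

  let run english spanish =
    Gc.full_major ();
    let t = time "insert" (fun () ->
      List.fold_left (fun t w -> T.add w () t) T.empty english) in
    (* length and sortedness of [_all] are validated outside the timing *)
    let _all = time "list all" (fun () -> T.fold (fun k _ l -> k :: l) t []) in
    let _ = time "find all" (fun () -> List.map (fun w -> T.lookup w t) english) in
    let _ = time "find sp." (fun () -> List.map (fun w -> T.lookup w t) spanish) in
    time "delete all" (fun () -> List.fold_left (fun t w -> T.remove w t) t english)
end
```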
These are the results for the list of words in dictionary order (higher is better):
Data Structure | Insert | List All | Find All | Find Sp. | Delete All |
---|---|---|---|---|---|
Map | 167,105.77 | 2,056,784.27 | 254,101.92 | 225,067.88 | 802,020.62 |
trie1 | 184,619.90 | 373,794.66 | 981,049.90 | 1,201,493.12 | 273,188.70 |
trie2 | 232,617.76 | 468,132.42 | 791,158.14 | 984,052.05 | 371,658.90 |
These are the results for the list of words in median order, that is, the list with the median first followed by the left and right sublists, each themselves in median order (higher is better):
Data Structure | Insert | List All | Find All | Find Sp. | Delete All |
---|---|---|---|---|---|
Map | 168,873.28 | 2,100,585.38 | 251,426.31 | 224,190.06 | 318,687.33 |
trie1 | 270,556.35 | 352,670.96 | 1,669,707.66 | 2,139,124.18 | 707,988.31 |
trie2 | 329,939.38 | 462,198.78 | 1,385,399.36 | 1,692,227.60 | 804,071.63 |
These are the combined results, normalized so that the corresponding operation on `Map` is 1; higher is faster (for instance, `trie1` (dict) Insert is 184,619.90 / 167,105.77 ≈ 1.10481):
Data Structure | Insert | List All | Find All | Find Sp. | Delete All |
---|---|---|---|---|---|
trie1 (dict) | 1.10481 | 0.181737 | 3.86085 | 5.33836 | 0.340626 |
trie2 (dict) | 1.39204 | 0.227604 | 3.11355 | 4.37225 | 0.463403 |
trie1 (median) | 1.60213 | 0.167892 | 6.64094 | 9.54157 | 2.22158 |
trie2 (median) | 1.95377 | 0.220033 | 5.51016 | 7.54818 | 2.52307 |
Some things are easy to see and to explain:
- The insertion code is faster for the ternary trees than for the height-balanced tree, as there is no need to maintain expensive shape invariants
- The ternary trees heavily favor lookups, especially of keys that are not present
- Insertion in sorted order is pessimal for ternary trees; even so, they remain very fast, as claimed by Bentley and Sedgewick
- The `fold` reconstructs the keys one character at a time by appending to the current prefix, which is quadratic in the prefix length; this accounts for its speed being about 20% of the corresponding `Map.fold`'s (see the sketch after this list)
- The shape of the ternary trees depends dramatically on the order of key insertion
- The shape of the height-balanced tree also depends heavily on the order of key insertion
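Here is the sketch promised in the list above: a plausible shape for the ternary tree's `fold` (not necessarily the exact code I benchmarked), showing how the key is rebuilt one character per middle-child descent, copying the whole accumulated prefix each time:

```ocaml
(* Sketch of a fold over trie2 that visits keys in ascending order.  The
   prefix grows by concatenation at every eqkid descent; since ^ copies both
   operands, rebuilding a key of length n costs O(n^2) character copies. *)
let fold f t init =
  let rec go prefix t acc = match t with
  | E -> acc
  | L v -> f prefix v acc
  | B (c, l, q, h) ->
    let acc = go prefix l acc in
    let acc = go (prefix ^ String.make 1 c) q acc in
    go prefix h acc
  | K (v, c, l, q, h) ->
    let acc = f prefix v acc in
    let acc = go prefix l acc in
    let acc = go (prefix ^ String.make 1 c) q acc in
    go prefix h acc
  in go "" t init
```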
Some other things are not so easy to understand. In particular, it is not easy to see why `trie2` is 20%–30% faster than `trie1` in every operation except lookups, where it is about 20% slower.
For comparison, here's `trie1`'s `lookup`:
```ocaml
let lookup k t =
  let n = String.length k in
  if n == 0 then failwith "lookup" else
  let rec go i t =
    if is_empty t then None else
    let c = k.[i] in
    if t.split > c then go i t.lokid else
    if t.split < c then go i t.hikid else
    if i + 1 < n then go (i+1) t.eqkid else
    t.value
  in go 0 t
```
And here's `trie2`'s:
```ocaml
let lookup k t =
  let n = String.length k in
  let rec go i = function
  | E -> None
  | L v when i == n -> Some v
  | L _ -> None
  | B (_, _, _, _) when i == n -> None
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then go i l else
    if c' > c then go i h else
    go (i+1) q
  | K (v, _, _, _, _) when i == n -> Some v
  | K (_, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then go i l else
    if c' > c then go i h else
    go (i+1) q
  in go 0 t
```
(It's not that ugly.) Twice the number of lines and a slew of cases: it should be obvious why the latter is slower, right? Then compare and contrast; here's `trie1`'s `add`:
```ocaml
let add k v t =
  let n = String.length k in
  if n == 0 then failwith "add" else
  let rec go i t =
    if is_empty t then
      if i + 1 < n then { empty with split = k.[i]; eqkid = go (i+1) t }
      else { empty with split = k.[i]; value = Some v }
    else
      let c = k.[i] in
      if t.split > c then { t with lokid = go i t.lokid } else
      if t.split < c then { t with hikid = go i t.hikid } else
      if i + 1 < n then { t with eqkid = go (i+1) t.eqkid } else
      { t with value = Some v }
  in go 0 t
```
Here's `trie2`'s:
```ocaml
let add k v t =
  let n = String.length k in
  let rec go i = function
  | E when i == n -> L v
  | E -> B (k.[i], E, go (i+1) E, E)
  | L _ when i == n -> L v
  | L v -> K (v, k.[i], E, go (i+1) E, E)
  | B (c, l, q, h) when i == n -> K (v, c, l, q, h)
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then B (c, go i l, q, h) else
    if c' > c then B (c, l, q, go i h) else
    B (c, l, go (i+1) q, h)
  | K (_, c, l, q, h) when i == n -> K (v, c, l, q, h)
  | K (v, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then K (v, c, go i l, q, h) else
    if c' > c then K (v, c, l, q, go i h) else
    K (v, c, l, go (i+1) q, h)
  in go 0 t
```
(Now this is not as pretty.) The two versions have exactly the same shape and stand in the same relation to one another. Some things I've observed:
- OCaml is not as clever as it should be when compiling or-patterns. Many of the branches in `trie2`'s functions could be coalesced into them, but the speed hit is very consistently noticeable (a sketch of such a coalesced `lookup` appears further below)
- Refining the constructors for each code path pays off in algorithmic ways: the normalization required in `remove` can be restricted to the branches statically known to have modified children, and the sharing can be made explicit:

```ocaml
let remove k t =
  let prune = function
  | B (_, E, E, E) -> E
  | K (v, _, E, E, E) -> L v
  | t -> t
  in
  let n = String.length k in
  let rec go i t = match t with
  | E -> t
  | L _ when i == n -> E
  | L _ -> t
  | B (_, _, _, _) when i == n -> t
  | B (c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then prune (B (c, go i l, q, h)) else
    if c' > c then prune (B (c, l, q, go i h)) else
    prune (B (c, l, go (i+1) q, h))
  | K (_, c, l, q, h) when i == n -> B (c, l, q, h)
  | K (v, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then prune (K (v, c, go i l, q, h)) else
    if c' > c then prune (K (v, c, l, q, go i h)) else
    prune (K (v, c, l, go (i+1) q, h))
  in go 0 t
```
Furthermore, the duality between `add` and `remove` is easy to see: the former's base cases introduce `L` and `K` nodes; the latter turns them into `E` and `B` nodes, respectively.
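Finally, here is the or-pattern coalescing alluded to in the first observation above, applied to `trie2`'s `lookup`; it is behaviorally equivalent to the version shown earlier, and it is this kind of rewrite that consistently measured slower in my tests:

```ocaml
(* The two branching cases share one or-pattern arm; the end-of-key guards
   must stay separate.  Semantically identical to the earlier lookup. *)
let lookup k t =
  let n = String.length k in
  let rec go i = function
  | E -> None
  | L v when i == n -> Some v
  | L _ -> None
  | B (_, _, _, _) when i == n -> None
  | K (v, _, _, _, _) when i == n -> Some v
  | B (c, l, q, h) | K (_, c, l, q, h) ->
    let c' = k.[i] in
    if c' < c then go i l else
    if c' > c then go i h else
    go (i+1) q
  in go 0 t
```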
In any case, I look forward to reading Mauricio's analysis and code. All this goes to show that these ternary trees are excellent data structures for representing string maps, as claimed by Bentley and Sedgewick.