No matter how credentialed a writer is, he is bound to make a mistake in print sooner or later. Raymond Chen's latest entry was for me oddly synchronous and oddly at variance with my own investigations on Bentley and Sedgewick's Ternary Trees. His apology of the hash table is technically sound, but only on the surface: there is not a single measurement to back up his assertions. Here is my data.
I pitted four different data structures intended as finite maps on strings: the standard
Hashtbl, my implementation of Cuckoo Hash, the standard height-balanced
Map and my own implementation of ternary trees as Trie Maps. I used Mac OS X 10.5's /usr/share/dict/words file against a Spanish lexicon of 90.000 entries. I tested both successful and unsuccessful searches by looking up every entry in the original file first, and every entry in the alternate file second. In both cases the files were read into memory and then processed with the algorithm benchmarks. I used a bit of generic programming to have the same benchmark code operate on the four data structures, and in each case the mapped domain was
unit, treating the maps effectively as sets.
Here are the results for reading the big file and testing against the small file:
As you can see,
Hashtbl is competitive inserting data, and Cuckoo Hash is smoking fast looking up data. This is not a very fair comparison, as these are imperative data structures, whereas both the standard
Map and the Trie Map are purely functional data structures. Even so, the Trie Map is slightly (3.5%) slower in successful searches and substantially (60%) faster in unsuccessful searches than
Unfortunately the champion apparent, the Cuckoo Hash, has a rather fragile disposition. The Spanish lexicon thoroughly broke its insertion routine.
Aside: I used to view with suspicion the lack of termination analysis for the insertion routine, and the very scarce code floating around implementing Cuckoo Hashes. After seeing here a practical case of non-termination for insertion with a dataset of less than 4,000 keys I consider this algorithm completely unsuitable for use. I have confidence that the universal hash I used is essentially random, as it passes standard tests for this data set, and that I implemented the published algorithm faithfully.
Even so, the Ternary Trie remains the victor in both successful and unsuccessful searches, being 60% faster than
Hashtbls (and almost 4 times as fast as height-balanced trees) in the latter:
What is the take-away lesson in this? First, do not let expert advice lull you into complacency. Take the time to experiment and analyze for yourself: it is not only intellectually challenging and rewarding, it might be a store of surprises (and of blogging material) for you.