How good is the hash function in your language of choice, particularly for strings? Chances are, not very. I applied the classical statistical tests popularized by Knuth and I was *very* surprised to see how bad OCaml's hash function on strings is.

I applied the chi-square test to the distribution of hashes over `/usr/share/dict/words`, almost 235,000 words. I binned the integers by their most significant bits, taking successively more bits into account, and tested the resulting distribution against a uniform distribution. Here are the results:

Bins (ν) | X² | Pr[χ² ≤ X²] |
---|---|---|
2 | 5250.0017026 | 1.0000000 |
4 | 6491.6332618 | 1.0000000 |
8 | 9310.9851194 | 1.0000000 |
16 | 20682.9648245 | 1.0000000 |
32 | 42563.4542173 | 1.0000000 |
64 | 88265.1637552 | 1.0000000 |
128 | 94121.1397827 | 1.0000000 |
256 | 155277.5725134 | 1.0000000 |
512 | 209363.2757857 | 1.0000000 |
1024 | 317909.0541084 | 1.0000000 |
2048 | 389422.3450812 | 1.0000000 |
4096 | 448253.5283822 | 1.0000000 |
8192 | 663614.9166071 | 1.0000000 |
16384 | 887812.2319610 | 1.0000000 |
32768 | 1037118.9427589 | 1.0000000 |

The first column indicates how many bins were used, which is also the number of degrees of freedom in the probability distribution (ν). The second column shows the computed statistic. The third column shows the probability that a χ² deviate with ν degrees of freedom is at most this value. By Knuth's criterion an entry fails if this probability lies in the first or last percentile, and is suspect if it lies in the first or last five percentiles; here every single entry fails.
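The binning procedure can be sketched as follows. This is a minimal reconstruction rather than the exact harness I used; `keys` stands in for the word list, and the number of bins is 2 raised to the `bits` parameter:

```ocaml
(* Chi-square test over binned hashes: hash each key, keep its [bits]
   most significant bits as a bin index, and compare the observed
   counts against the uniform expectation n / 2^bits using
   X² = Σ (observed - expected)² / expected. *)
let chi_square bits keys =
  let nbins = 1 lsl bits in
  let counts = Array.make nbins 0 in
  List.iter
    (fun k ->
       (* Hashtbl.hash returns a non-negative 30-bit integer; the bin
          is its [bits] most significant bits. *)
       let h = Hashtbl.hash k in
       let bin = h lsr (30 - bits) in
       counts.(bin) <- counts.(bin) + 1)
    keys;
  let n = float_of_int (List.length keys) in
  let expected = n /. float_of_int nbins in
  Array.fold_left
    (fun acc c ->
       let d = float_of_int c -. expected in
       acc +. d *. d /. expected)
    0.0 counts
```

A perfectly uniform assignment of keys to bins would give X² = 0; the larger the statistic, the less uniform the distribution.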

Pretty dismal, isn't it? The Kolmogorov-Smirnov test is equally appalling: the probability that entries deviate this far above a uniform distribution is `p+` = 0.0016593, and the probability that they deviate this far below it is `p-` = 1.0000000, both of which are extreme.
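The one-sided Kolmogorov-Smirnov statistics can be sketched like this; again a reconstruction, not my exact harness, with hashes scaled into [0, 1) against the uniform CDF F(x) = x:

```ocaml
(* One-sided Kolmogorov-Smirnov statistics K+ and K- for a sample of
   30-bit hashes scaled to [0,1).  K+ measures the largest deviation
   of the empirical CDF above the uniform CDF, K- the largest below. *)
let ks_statistics hashes =
  let xs =
    List.sort compare
      (* 2^30 = 1073741824: the range of OCaml's Hashtbl.hash *)
      (List.map (fun h -> float_of_int h /. 1073741824.0) hashes)
  in
  let n = float_of_int (List.length xs) in
  let _, dplus, dminus =
    List.fold_left
      (fun (i, dp, dm) x ->
         let fi = float_of_int i in
         (i + 1,
          max dp ((fi +. 1.0) /. n -. x),   (* empirical CDF above F *)
          max dm (x -. fi /. n)))           (* empirical CDF below F *)
      (0, neg_infinity, neg_infinity) xs
  in
  (sqrt n *. dplus, sqrt n *. dminus)
```

The probabilities `p+` and `p-` quoted in the text are then obtained from the distribution of K+ and K- as tabulated by Knuth.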

Postprocessing the hashes with a simple mixing function helps matters considerably. I use the Murmur hash specialized to a single 32-bit value:

```ocaml
(* MurmurHash2's mixing step, specialized to a single 32-bit value. *)
let murmur_hash_32 k h =
  let m = 0x5bd1e995l in
  (* Mix the key [k]. *)
  let k = Int32.mul k m in
  let k = Int32.logxor k (Int32.shift_right_logical k 24) in
  let k = Int32.mul k m in
  (* Fold it into the state [h]. *)
  let h = Int32.mul h m in
  let h = Int32.logxor h k in
  (* Finalize: avalanche the last few bits. *)
  let h = Int32.logxor h (Int32.shift_right_logical h 13) in
  let h = Int32.mul h m in
  let h = Int32.logxor h (Int32.shift_right_logical h 15) in
  h

(* The i-th member of the family, truncated to OCaml's non-negative
   30-bit hash range. *)
let universal_hash i h =
  let r = murmur_hash_32 (Int32.of_int h) (Int32.of_int (succ i)) in
  Int32.to_int (Int32.logand r 0x3fffffffl)
```

Since Murmur hash achieves full 32-bit avalanche, I can use it to construct a universal hash function. The results of applying this mixing function to `Hashtbl.hash` are much more encouraging:
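For reference, the family can be used like this; the definitions are repeated from above so the snippet stands alone, and `hash_with` is just an illustrative wrapper, not part of any library:

```ocaml
(* Repeated from above: MurmurHash2 mixing for one 32-bit value. *)
let murmur_hash_32 k h =
  let m = 0x5bd1e995l in
  let k = Int32.mul k m in
  let k = Int32.logxor k (Int32.shift_right_logical k 24) in
  let k = Int32.mul k m in
  let h = Int32.mul h m in
  let h = Int32.logxor h k in
  let h = Int32.logxor h (Int32.shift_right_logical h 13) in
  let h = Int32.mul h m in
  let h = Int32.logxor h (Int32.shift_right_logical h 15) in
  h

let universal_hash i h =
  let r = murmur_hash_32 (Int32.of_int h) (Int32.of_int (succ i)) in
  Int32.to_int (Int32.logand r 0x3fffffffl)

(* Varying the index i selects different members of the family: handy
   when a scheme such as Cuckoo hashing needs several independent hash
   functions for the same key. *)
let hash_with i key = universal_hash i (Hashtbl.hash key)
```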

Bins (ν) | X² | Pr[χ² ≤ X²] |
---|---|---|
2 | 0.0360268 | 0.1505399 |
4 | 4.9407498 | 0.8238125 |
8 | 11.4643648 | 0.8803928 |
16 | 17.1057309 | 0.6874168 |
32 | 25.6769844 | 0.2633473 |
64 | 55.6071781 | 0.2655752 |
128 | 121.7178466 | 0.3843221 |
256 | 250.5078489 | 0.4322981 |
512 | 480.1555487 | 0.1675241 |
1024 | 934.7566997 | 0.0230148 |
2048 | 1983.7491061 | 0.1614639 |
4096 | 3951.4456363 | 0.0550442 |
8192 | 8032.6611503 | 0.1074982 |
16384 | 16249.4879286 | 0.2308936 |
32768 | 32526.7885722 | 0.1741126 |

Much better, except maybe for the distribution on 1024 bins (the 10 most significant bits). The Kolmogorov-Smirnov test shows that the probability that entries deviate this far above a uniform distribution is `p+` = 0.7688490, and the probability that they deviate this far below it is `p-` = 0.4623243, which are much more balanced.

Know your hash function. It should be random.

## 2 comments:

Only in theory. :-)

I found I could get substantial speedups in my naive GC by altering the hash function to deliberately collide on close input values (essentially by dropping a few least significant bits). If your bucket is an array then that greatly improves cache coherence.

@Jon, with linear chaining (almost) any hash function is acceptable. In the worst case you have a linear search, but the algorithms for lookup, insertion &c all terminate. With open addressing schemes, you're out of luck, since the correctness of the algorithms depends crucially on the hash function behaving essentially as a random map from your key space to your index space.
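The point about chaining can be illustrated with a toy table. `Chained` here is a hypothetical minimal implementation, not OCaml's `Hashtbl`, and the constant hash function is the deliberately pathological case:

```ocaml
(* A minimal chained hash table: each bucket is an association list,
   so even a worst-case hash function only degrades lookups to a
   linear scan; every operation still terminates. *)
module Chained = struct
  type ('k, 'v) t = { buckets : ('k * 'v) list array; hash : 'k -> int }

  let create n hash = { buckets = Array.make n []; hash }

  let index t k = t.hash k mod Array.length t.buckets

  let add t k v =
    let i = index t k in
    t.buckets.(i) <- (k, v) :: t.buckets.(i)

  let find t k =
    let i = index t k in
    List.assoc_opt k t.buckets.(i)
end
```

With `(fun _ -> 0)` as the hash, every binding lands in bucket 0 and `find` degenerates to a list scan, yet it still returns the right answer; an open-addressing probe sequence offers no such graceful degradation.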

In fact, I embarked on this analysis because I can't seem to make insertion into a Cuckoo hash work with a particular dataset I found.
