2009-06-12

Hash 'n Mix

How good is the hash function in your language of choice, particularly for strings? Chances are, not very. I applied the classical statistical tests popularized by Knuth and I was very surprised to see how bad OCaml's hash function on strings is.

I applied the chi-square test to the distribution of hashes over /usr/share/dict/words, almost 235,000 words in all. I binned the integers from the most significant bit down, taking successively more bits into account, and tested the resulting distribution against a uniform distribution. Here are the results:

χ² test of 234936 hashes with Hashtbl.hash
Bins         X²               Pr[χ² ≤ X²]
    2       5250.0017026      1.0000000
    4       6491.6332618      1.0000000
    8       9310.9851194      1.0000000
   16      20682.9648245      1.0000000
   32      42563.4542173      1.0000000
   64      88265.1637552      1.0000000
  128      94121.1397827      1.0000000
  256     155277.5725134      1.0000000
  512     209363.2757857      1.0000000
 1024     317909.0541084      1.0000000
 2048     389422.3450812      1.0000000
 4096     448253.5283822      1.0000000
 8192     663614.9166071      1.0000000
16384     887812.2319610      1.0000000
32768    1037118.9427589      1.0000000

The first column indicates how many bins were used; a χ² test over k bins has ν = k − 1 degrees of freedom. The second column shows the computed statistic X². The third column shows the probability that a χ² deviate with ν degrees of freedom would be at most this value. Highlighted in red are the failed entries according to Knuth's criterion (probability in the first or last percentile) and in yellow the suspect entries (probability in the first five or last five percentiles).
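The binned statistic itself is simple to compute. Here is a minimal sketch (the function and parameter names are mine, not the exact harness I used); it exploits the fact that Hashtbl.hash returns non-negative integers below 2^30, so the most significant bit is bit 29:

```ocaml
(* Sketch: χ² statistic for the top [bits] bits of each hash in [hashes].
   Bins the values by their most significant bits and compares the counts
   against the uniform expectation n / bins. *)
let chi_square bits hashes =
  let bins = 1 lsl bits in
  let counts = Array.make bins 0 in
  List.iter (fun h ->
      let b = h lsr (30 - bits) in     (* top [bits] bits as bin index *)
      counts.(b) <- counts.(b) + 1)
    hashes;
  let n = float_of_int (List.length hashes) in
  let expected = n /. float_of_int bins in
  Array.fold_left (fun acc c ->
      let d = float_of_int c -. expected in
      acc +. d *. d /. expected)
    0.0 counts
```

A perfectly even split over the bins yields a statistic of zero; the table above shows how far `Hashtbl.hash` strays from that.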

Pretty dismal, isn't it? The Kolmogorov-Smirnov test is equally appalling: the probability of the empirical distribution deviating this far above a uniform deviate is p+ = 0.0016593, and of deviating this far below it is p- = 1.0000000, both of which are extreme.
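For reference, the two one-sided KS statistics can be computed along these lines (a sketch following Knuth's formulation; the name ks_statistics is mine, and the hashes are assumed normalized to [0, 1)):

```ocaml
(* Sketch: one-sided Kolmogorov-Smirnov statistics (K+, K-) of [xs]
   against the uniform CDF on [0,1). K+ measures how far the empirical
   CDF rises above the uniform one, K- how far it falls below. *)
let ks_statistics xs =
  let a = Array.of_list xs in
  Array.sort compare a;
  let n = Array.length a in
  let fn = float_of_int n in
  let kp = ref neg_infinity and km = ref neg_infinity in
  Array.iteri (fun i x ->
      kp := max !kp (float_of_int (i + 1) /. fn -. x);
      km := max !km (x -. float_of_int i /. fn))
    a;
  (sqrt fn *. !kp, sqrt fn *. !km)
```

The probabilities p+ and p- quoted above are then obtained from the limiting distribution of these statistics.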

Postprocessing the hashes with a simple mixing function helps matters considerably. I use the Murmur hash specialized to a single 32-bit value:

let murmur_hash_32 k h =
  (* MurmurHash2 mixing step: fold the 32-bit key [k] into the state [h] *)
  let m = 0x5bd1e995l in
  let k = Int32.mul k m in
  let k = Int32.logxor k (Int32.shift_right_logical k 24) in
  let k = Int32.mul k m in
  let h = Int32.mul h m in
  let h = Int32.logxor h k in
  (* final avalanche *)
  let h = Int32.logxor h (Int32.shift_right_logical h 13) in
  let h = Int32.mul h m in
  let h = Int32.logxor h (Int32.shift_right_logical h 15) in
  h

let universal_hash i h =
  (* the family index [i] seeds the state; mask the result back into
     OCaml's non-negative 30-bit int range *)
  let r = murmur_hash_32 (Int32.of_int h) (Int32.of_int (succ i)) in
  Int32.to_int (Int32.logand r 0x3fffffffl)

Since Murmur hash achieves full 32-bit avalanche, I can use it to construct a universal hash function. The results of applying this mixing function to Hashtbl.hash are much more encouraging:
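Concretely, members of the family are obtained by fixing the index. A sketch of how two of them might serve, say, as the pair of hash functions for a Cuckoo table (the definitions from the post are repeated so the snippet stands alone; the indices 0 and 1 are arbitrary choices of mine):

```ocaml
(* murmur_hash_32 and universal_hash exactly as defined above *)
let murmur_hash_32 k h =
  let m = 0x5bd1e995l in
  let k = Int32.mul k m in
  let k = Int32.logxor k (Int32.shift_right_logical k 24) in
  let k = Int32.mul k m in
  let h = Int32.mul h m in
  let h = Int32.logxor h k in
  let h = Int32.logxor h (Int32.shift_right_logical h 13) in
  let h = Int32.mul h m in
  let h = Int32.logxor h (Int32.shift_right_logical h 15) in
  h

let universal_hash i h =
  let r = murmur_hash_32 (Int32.of_int h) (Int32.of_int (succ i)) in
  Int32.to_int (Int32.logand r 0x3fffffffl)

(* Two members of the family, post-mixing Hashtbl.hash *)
let hash1 s = universal_hash 0 (Hashtbl.hash s)
let hash2 s = universal_hash 1 (Hashtbl.hash s)
```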

χ² test of 234936 hashes with mixing function
Bins         X²               Pr[χ² ≤ X²]
    2          0.0360268      0.1505399
    4          4.9407498      0.8238125
    8         11.4643648      0.8803928
   16         17.1057309      0.6874168
   32         25.6769844      0.2633473
   64         55.6071781      0.2655752
  128        121.7178466      0.3843221
  256        250.5078489      0.4322981
  512        480.1555487      0.1675241
 1024        934.7566997      0.0230148
 2048       1983.7491061      0.1614639
 4096       3951.4456363      0.0550442
 8192       8032.6611503      0.1074982
16384      16249.4879286      0.2308936
32768      32526.7885722      0.1741126

Much better, except maybe for the distribution on 1024 bins (the 10 most significant bits). The Kolmogorov-Smirnov test now shows a probability of p+ = 0.7688490 of deviating this far above a uniform deviate, and of p- = 0.4623243 of deviating this far below it, which is much more balanced.

Know your hash function. It should be random.

2 comments:

Flying Frog Consultancy Ltd. said...

Only in theory. :-)

I found I could get substantial speedups in my naive GC by altering the hash function to deliberately collide on close input values (essentially by dropping a few least significant bits). If your bucket is an array then that greatly improves cache coherence.

Matías Giovannini said...

@Jon, with linear chaining (almost) any hash function is acceptable. In the worst case you have a linear search, but the algorithms for lookup, insertion &c all terminate. With open addressing schemes you're out of luck, since the correctness of the algorithms depends crucially on the hash function behaving essentially as a random map from your key space to your index space.

In fact, I embarked on this analysis because I can't seem to make insertion into a Cuckoo hash work with a particular dataset I found.