The Tree Nursery

In my last post, Benedikt Grundmann asked whether changing the representation of trie nodes, in particular eliminating the 'a option-boxed references, would be a speed-up or not. I started tinkering with a direct representation:

type 'a t =
{ value : 'a option;
  split : char;
  lokid : 'a t;
  eqkid : 'a t;
  hikid : 'a t; }

(call this representation trie1). This forces me to use a sentinel node to represent empty trees:

let rec empty = { value = None; split = '\255'; lokid = empty; eqkid = empty; hikid = empty; }

let is_empty t = t == empty

(the split value is entirely arbitrary and never relied upon). This recursive structure means that I cannot use pattern matching to discriminate and deconstruct nodes, and all comparisons must be strictly by reference equality against the carefully preserved unique sentinel node. This represented a rough 10% speedup over my original code.

Meanwhile, Mauricio Fernández tried his hand at it, and came up with a tantalizing result:

in the lapse between the present post and the preceding one, I implemented ternary trees with different constructors for the nodes with and without a value. In my tests, construction is 50% faster, and lookups are between 20% and 80% faster. The code is much less elegant than the one presented here, because it moreover includes some special-casing to handle the "" (empty string) key, which the above code chokes at.

(It's true that the empty string cannot be a key in either the original or the above implementation. I thought it obvious from the tradeoffs I listed in selecting the representation for the key, but the truth is I never guarded the operations against it. It's a bug, not a feature.) The numbers are too exciting not to try this approach. By carefully teasing apart each case, I arrived at the following data type:

type 'a t = E
          | L of 'a
          | B of      char * 'a t * 'a t * 'a t
          | K of 'a * char * 'a t * 'a t * 'a t

(call this representation trie2). The first constructor E represents an empty tree. The second constructor L represents a leaf with value α. The third constructor B represents a branching node without a value. The fourth constructor K represents a marked, valued branching node. Functions over this type require eight matches in general: four for the middle of the key and four for its end.
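To make the constructors concrete, here is a small hand-built value (my own illustration, with the type repeated so the snippet stands alone): the tree that maps "a" to 1 and "ab" to 2.

```ocaml
type 'a t = E
          | L of 'a
          | B of      char * 'a t * 'a t * 'a t
          | K of 'a * char * 'a t * 'a t * 'a t

(* The root is a valueless branch on 'a'; its eqkid is a K node holding
   the value for "a" and branching on 'b'; that node's eqkid is the
   leaf for "ab". *)
let example : int t = B ('a', E, K (1, 'b', E, L 2, E), E)
```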

I'll present the benchmark results first, then the code. In every case I've run the benchmark three times in sequence from the command line and retained the last measurement. My test machine is an MSI U100 with a 1.6 GHz Intel Atom and 2 GB RAM running Mac OS X 10.5.7, and the code is compiled with ocamlopt -g -inline 10 -w Aelz -pp camlp4o, version 3.11.0. Each run comprises three tests using the standard Map balanced-height tree and the trie1 and trie2 ternary trees, with a Gc.full_major collection at the start of each. The test data sets are the 234936-word /usr/share/dict/words file in Mac OS X and an 89546-word Spanish lexicon, with 3.91% overlap between the two. The tests measure:

  1. Insertion of all the English words into an empty tree via a fold_left
  2. Retrieval of the list of all inserted words via an S.fold (with validation of length and sortedness outside the timing)
  3. Retrieval of all the English words
  4. Retrieval of all the Spanish words
  5. Deletion of all the English words from the tree via a fold_left (with validation that the resulting tree is empty)
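The harness itself is not shown here; as a rough sketch of what each timed test looks like (the word list, labels and use of Sys.time are illustrative assumptions, and the stdlib Map stands in for the structures under test):

```ocaml
module M = Map.Make (String)

(* Time one bulk operation over a word list and report ops/s,
   with a major collection before the clock starts. *)
let time_ops label words (f : string list -> unit) =
  Gc.full_major ();
  let t0 = Sys.time () in
  f words;
  let dt = Sys.time () -. t0 in
  if dt > 0.0 then
    Printf.printf "%s: %.0f ops/s\n" label
      (float_of_int (List.length words) /. dt)

(* Test 1: insertion of all words into an empty tree via a fold_left. *)
let () =
  let words = ["cat"; "cow"; "dog"] in
  time_ops "insert" words (fun ws ->
    ignore (List.fold_left (fun m w -> M.add w () m) M.empty ws))
```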

These are the results for the list of words in dictionary order:

Test of 89546-word set in dictionary order. In ops/s.
Data Structure | Insert | List All | Find All | Find Sp. | Delete All

These are the results for the list of words in median order, that is, the list with the median first, followed by the left and right sublists, each in median order:

Test of 89546-word set in median order. In ops/s.
Data Structure | Insert | List All | Find All | Find Sp. | Delete All
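For reference, median order can be produced from a sorted list with a small helper; this is my own sketch, not the code used in the benchmark:

```ocaml
(* Emit the median of a sorted list first, then the left and right
   halves, each recursively in median order. Inserting keys in this
   order keeps a ternary (or binary) search tree balanced. *)
let median_order l =
  let a = Array.of_list l in
  let out = ref [] in
  let rec go lo hi =
    if lo <= hi then begin
      let mid = (lo + hi) / 2 in
      out := a.(mid) :: !out;   (* median first *)
      go lo (mid - 1);          (* then the left sublist *)
      go (mid + 1) hi           (* then the right sublist *)
    end
  in
  go 0 (Array.length a - 1);
  List.rev !out
```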

These are the combined results, normalized so that the corresponding operations on Map are 1:

Test of 89546-word set both in dictionary and in median order. Relative to Map.
Data Structure | Insert | List All | Find All | Find Sp. | Delete All
trie1 (dict)   | 1.10481 | 0.181737 | 3.86085 | 5.33836 | 0.340626
trie2 (dict)   | 1.39204 | 0.227604 | 3.11355 | 4.37225 | 0.463403
trie1 (median) | 1.60213 | 0.167892 | 6.64094 | 9.54157 | 2.22158
trie2 (median) | 1.95377 | 0.220033 | 5.51016 | 7.54818 | 2.52307

Some things are easy to see and to explain:

  • The insertion code is faster for ternary trees than for the height-balanced tree, as there is no need to maintain expensive shape invariants
  • The ternary trees highly favor lookups, especially by non-present keys
  • Insertion in sorted order is pessimal for ternary trees. Even so, they remain very fast, as claimed by Bentley and Sedgewick
  • The fold reconstructs the keys one character at a time by appending to the current prefix, which is quadratic in the prefix length. This accounts for the speed being 20% of the corresponding Map.fold
  • The shape of the ternary trees depends dramatically on the order of key insertion
  • The shape of height-balanced trees also depends heavily on the order of key insertion

Some other things are not so easy to understand. In particular, it is not easy to see why trie2 is 20%–30% faster than trie1 except for lookups, where it is 20% slower.

For comparison, here's trie1's lookup:

let lookup k t =
  let n = String.length k in
  if n == 0 then failwith "lookup" else
  let rec go i t =
    if is_empty t then None else
    let c = k.[i] in
    if t.split > c then go  i    t.lokid else
    if t.split < c then go  i    t.hikid else
    if i + 1   < n then go (i+1) t.eqkid else
    t.value
  in go 0 t

And here's trie2's:

let lookup k t =
  let n = String.length k in
  let rec go i = function
  | E                             -> None
  | L  v              when i == n -> Some v
  | L  _                          -> None
  | B (   _, _, _, _) when i == n -> None
  | B (   c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then go  i    l else
    if  c' > c then go  i    h else
                    go (i+1) q
  | K (v, _, _, _, _) when i == n -> Some v
  | K (_, c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then go  i    l else
    if  c' > c then go  i    h else
                    go (i+1) q
  in go 0 t

(It's not that ugly). Twice the number of lines, a slew of cases, it should be obvious why the latter is slower, right? Then compare and contrast: here's trie1's insert:

let add k v t =
  let n = String.length k in
  if n == 0 then failwith "add" else
  let rec go i t =
    if is_empty t then
      if i+1 < n then { empty with split = k.[i]; eqkid = go (i+1) t }
                 else { empty with split = k.[i]; value = Some v     }
    else
    let c = k.[i] in
    if t.split > c then { t with lokid = go  i    t.lokid } else
    if t.split < c then { t with hikid = go  i    t.hikid } else
    if i + 1   < n then { t with eqkid = go (i+1) t.eqkid } else
                        { t with value = Some v           }
  in go 0 t

Here's trie2's:

let add k v t =
  let n = String.length k in
  let rec go i = function
  | E                 when i == n -> L  v
  | E                             -> B (   k.[i], E, go (i+1) E, E)
  | L  _              when i == n -> L  v
  | L  v                          -> K (v, k.[i], E, go (i+1) E, E)
  | B (   c, l, q, h) when i == n -> K (v, c    , l, q         , h)
  | B (   c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then B (   c, go i l,          q,      h) else
    if  c' > c then B (   c,      l,          q, go i h) else
                    B (   c,      l, go (i+1) q,      h)
  | K (_, c, l, q, h) when i == n -> K (v, c    , l, q         , h)
  | K (v, c, l, q, h)             ->
    let c' = k.[i] in
    if  c' < c then K (v, c, go i l,          q,      h) else
    if  c' > c then K (v, c,      l,          q, go i h) else
                    K (v, c,      l, go (i+1) q,      h)
  in go 0 t

(Now this is not as pretty.) The two are exactly the same in shape and in their relation to one another. Some things I've observed:

  • OCaml is not as clever as it should be when compiling or-patterns. Many of the branches in trie2's functions could be coalesced with or-patterns, but the speed hit is very consistently noticeable
  • Refining the constructors for each code path pays off in algorithmic ways: the normalization required in remove can be restricted to those branches statically known to have modified children, and the sharing can be made explicit:
    let remove k t =
      let prune = function
      | B (   _, E, E, E) -> E
      | K (v, _, E, E, E) -> L v
      | t                 -> t
      in
      let n = String.length k in
      let rec go i t = match t with
      | E                             -> t
      | L  _              when i == n -> E
      | L  _                          -> t
      | B (   _, _, _, _) when i == n -> t
      | B (   c, l, q, h)             ->
        let c' = k.[i] in
        if  c' < c then prune (B (   c, go i l,          q,      h)) else
        if  c' > c then prune (B (   c,      l,          q, go i h)) else
                        prune (B (   c,      l, go (i+1) q,      h))
      | K (_, c, l, q, h) when i == n -> B (c, l, q, h)
      | K (v, c, l, q, h) ->
        let c' = k.[i] in
        if  c' < c then prune (K (v, c, go i l,          q,      h)) else
        if  c' > c then prune (K (v, c,      l,          q, go i h)) else
                        prune (K (v, c,      l, go (i+1) q,      h))
      in go 0 t

    Furthermore, the duality between add and remove is easy to see: the former's base cases introduce L and K nodes, the latter turns them into E and B nodes, respectively.

In any case, I look forward to reading Mauricio's analysis and code. All this goes to show that these ternary trees are excellent data structures for representing string maps, as claimed by Bentley and Sedgewick.


Simple, Efficient Trie Maps

I became interested in implementing Bentley and Sedgewick's Ternary Trees by reading the challenge proposed on Programming Praxis (a great blog, by the way). I thought the algorithms would be a good fit for a purely functional data structure, and I am happy with the results. Here is my implementation.

I follow the Map.S signature, so that this can be a drop-in replacement for string-keyed maps; the interface (.mli file) is simply:

include Map.S with type key = string

For the implementation I restrict the key space to strings:

type key = string

Trees are optional references to nodes, and these are records containing the mapped value whenever this node represents a full key, the character suffix for the tree and the three branches:

type 'a t = 'a node option
and 'a node =
{ value : 'a option;
  split : char;
  lokid : 'a t;
  eqkid : 'a t;
  hikid : 'a t; }

In this I follow the paper closely, except for a crucial detail: in it, the authors make use of the terminator inherent in every C string, the null byte '\0'. This forces a design decision:

  • emulate the termination character. This has the distinct disadvantage of having to prohibit it from occurring in string keys
  • use a char option as the key. This forces boxing the key into a constructed type and complicates the logic in all the comparisons, which has a negative impact on the inner loops
  • use a char list or string as the key. This has the advantage that it can be adapted to compress the search paths, but it requires generating substrings of everything
  • guard on the string length. This is the choice I took. It complicates the comparison logic somewhat, but it has the advantage of making the node representation more regular and compact, and allowing the full range of characters in string keys

An empty tree is represented by a None reference:

let empty = None

Consequently, the test for an empty tree is a match on the option:

let is_empty = function None -> true | Some _ -> false

This means that I'll be forced to normalize trees on deletion, of which more later. The workhorse function is lookup:

let lookup k t =
  let n = String.length k in
  let rec go i = function
  | None   -> None
  | Some t ->
    let cmp = Char.compare k.[i] t.split in
    if cmp < 0 then go  i    t.lokid else
    if cmp > 0 then go  i    t.hikid else
    if i+1 < n then go (i+1) t.eqkid else
    t.value
  in go 0 t

It is as simple as it can be, given the choices I outlined above. The crucial bit is that the presence or absence of the value is the flag that indicates whether this node represents a present or absent key in the tree. The signature's retrieval and membership functions are simply wrappers around this:

let find k t = match lookup k t with
| Some x -> x
| None   -> raise Not_found

let mem k t = match lookup k t with
| Some _ -> true
| None   -> false

Inserting a key-value pair in the tree is a bit more involved:

let add k v t =
  let n = String.length k in
  let rec go i = function
  | None when i+1 < n ->
    Some { value = None  ; split = k.[i]; lokid = None; eqkid = go (i+1) None; hikid = None }
  | None              ->
    Some { value = Some v; split = k.[i]; lokid = None; eqkid =          None; hikid = None }
  | Some t            ->
    let cmp = Char.compare k.[i] t.split in
    if cmp < 0 then Some { t with lokid = go  i    t.lokid } else
    if cmp > 0 then Some { t with hikid = go  i    t.hikid } else
    if i+1 < n then Some { t with eqkid = go (i+1) t.eqkid } else
                    Some { t with value = Some v }
  in go 0 t

The case when the end of a branch is reached is inlined, so that a descending branch keyed on the key's suffix is built until reaching its last character, which stores the keyed value. Otherwise, the suffix is compared to the node key, and the appropriate child is built. Note that the recursively built subtree replaces the corresponding child in the node; this is what makes this data structure purely functional and persistent. Removing a mapping is similar, but complicated by the need to normalize the tree so that an empty tree is represented by None:

let remove k t =
  let prune = function
  | { value = None; lokid = None; eqkid = None; hikid = None } -> None
  | t -> Some t
  in
  let n = String.length k in
  let rec go i = function
  | None   -> None
  | Some t ->
    let cmp = Char.compare k.[i] t.split in
    if cmp < 0 then prune { t with lokid = go  i    t.lokid } else
    if cmp > 0 then prune { t with hikid = go  i    t.hikid } else
    if i+1 < n then prune { t with eqkid = go (i+1) t.eqkid } else
                    prune { t with value = None             }
  in go 0 t

In this case, prune works somewhat as a smart constructor for node references.
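Its behavior is easy to check in isolation (type and prune repeated so the snippet stands alone):

```ocaml
type 'a t = 'a node option
and 'a node =
{ value : 'a option;
  split : char;
  lokid : 'a t;
  eqkid : 'a t;
  hikid : 'a t; }

(* A node with no value and no children collapses to None;
   anything else survives as a Some reference. *)
let prune = function
| { value = None; lokid = None; eqkid = None; hikid = None } -> None
| t -> Some t
```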

The big advantage of ternary search trees is that, since they are modeled on three-way quicksort, they provide ordered key storage with relative insensitivity to the order of insertions. The big disadvantage is that they don't have a canonical form for a given key set, that is, their shape depends on the order in which the keys are inserted. This means that they provide a cheap fold but no easy way to compare trees:

let fold f t e =
  let rec go prefix e = function
  | None   -> e
  | Some t ->
    let e   = go prefix e t.lokid in
    let key = prefix ^ String.make 1 t.split in
    let e   = match t.value with None -> e | Some v -> f key v e in
    let e   = go key    e t.eqkid in
              go prefix e t.hikid
  in go "" e t

Note that the key is built character by character but only on the equal kids. Note also that the resulting fold is guaranteed to be ordered on the keys by the sequence in which the recursion is carried out. The rest of the iteration functions in the signature are trivial wrappers:

let iter f t = fold (fun k v () ->        f k v ) t ()
and map  f t = fold (fun k v    -> add k (f v  )) t empty
and mapi f t = fold (fun k v    -> add k (f k v)) t empty

As I indicated above, ternary search trees have no canonical form, so to compare two trees you have to somehow traverse them in synchronization. One easy but expensive way would be to turn both into key-value lists and compare those; this would require O(min(n, m)) time and O(n + m) space. The best way would be to turn the fold into a continuation-passing iterator that lazily streams keys and traverse both in tandem. I leave it as an exercise for the reader:

let compare _ _ _ = failwith "Not comparable"

let equal eq t t' =
  0 == compare (fun v w -> if eq v w then 0 else 1) t t'
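For completeness, here is a sketch of the easy-but-expensive list-based comparison; bindings stands for any function (e.g. built on fold) that produces the sorted key-value list of a map:

```ocaml
(* Compare two maps by materializing both as sorted association lists
   and comparing those lexicographically: O(n + m) space, and time
   proportional to the shorter list. *)
let compare_via_lists cmp bindings t t' =
  let rec go l l' = match l, l' with
  | [], [] -> 0
  | [], _  -> -1
  | _, []  -> 1
  | (k, v) :: r, (k', v') :: r' ->
    let c = String.compare k k' in
    if c <> 0 then c else
    let c = cmp v v' in
    if c <> 0 then c else go r r'
  in go (bindings t) (bindings t')
```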


Algorithm-conscious, cache-oblivious

No matter how credentialed a writer is, he is bound to make a mistake in print sooner or later. Raymond Chen's latest entry was for me oddly synchronous with, and oddly at variance with, my own investigations of Bentley and Sedgewick's Ternary Trees. His apologia for the hash table is technically sound, but only on the surface: there is not a single measurement to back up his assertions. Here is my data.

I pitted four different data structures intended as finite maps on strings: the standard Hashtbl, my implementation of Cuckoo Hash, the standard height-balanced Map, and my own implementation of ternary trees as Trie Maps. I used Mac OS X 10.5's /usr/share/dict/words file against a Spanish lexicon of 90,000 entries. I tested both successful and unsuccessful searches by looking up every entry in the original file first, and every entry in the alternate file second. In both cases the files were read into memory and then processed by the algorithm benchmarks. I used a bit of generic programming to have the same benchmark code operate on the four data structures, and in each case the mapped domain was unit, treating the maps effectively as sets.

Here are the results for reading the big file and testing against the small file:

Test of 234936-word set against 89546 queries (3.91% success rate). In ops/s.
Data Structure | Insert | Read | Lookup
Trie Map | 118,932.34 | 770,795.15 | 991,782.69

As you can see, Hashtbl is competitive inserting data, and Cuckoo Hash is smoking fast looking up data. This is not a very fair comparison, as these are imperative data structures, whereas both the standard Map and the Trie Map are purely functional data structures. Even so, the Trie Map is slightly (3.5%) slower in successful searches and substantially (60%) faster in unsuccessful searches than Hashtbl.

Unfortunately the champion apparent, the Cuckoo Hash, has a rather fragile disposition. The Spanish lexicon thoroughly broke its insertion routine.

Aside: I used to view with suspicion the lack of termination analysis for the insertion routine, and the very scarce code floating around implementing Cuckoo Hashes. After seeing here a practical case of non-termination for insertion with a dataset of less than 4,000 keys I consider this algorithm completely unsuitable for use. I have confidence that the universal hash I used is essentially random, as it passes standard tests for this data set, and that I implemented the published algorithm faithfully.

Even so, the Ternary Trie remains the victor in both successful and unsuccessful searches, being 60% faster than Hashtbls (and almost 4 times as fast as height-balanced trees) in the latter:

Test of 89546-word set against 234936 queries (1.49% success rate). In ops/s.
Data Structure | Insert | Read | Lookup
Trie Map | 108,718.90 | 889,076.24 | 1,300,517.50

What is the take-away lesson in this? First, do not let expert advice lull you into complacency. Take the time to experiment and analyze for yourself: it is not only intellectually challenging and rewarding, it might be a store of surprises (and of blogging material) for you.


Hash 'n Mix

How good is the hash function in your language of choice, particularly for strings? Chances are, not very. I applied the classical statistical tests popularized by Knuth and I was very surprised to see how bad OCaml's hash function on strings is.

I applied the chi-square test to the distribution of hashes over /usr/share/dict/words, almost 235,000 words. I binned the integers from the most significant bit down, taking successively more bits into account, and tested the resulting distribution against a uniform distribution. Here are the results:

χ² test of 234936 hashes with Hashtbl.hash
Bins (ν) | X² | Pr[χ² ≤ X²]

The first column indicates how many bins were used, which is the degrees of freedom in the probability distribution (ν). The second column shows the computed statistic. The third column shows the probability with which a χ² deviate with ν degrees of freedom would be at least this value. Highlighted in red are the failed entries according to Knuth's criterion (probability in first or last percentile) and in yellow the suspect entries (probability in first five or last five percentiles).

Pretty dismal, isn't it? The Kolmogorov-Smirnov test is equally appalling: the probability that the entries are this much higher than a uniform deviate is p+ = 0.0016593, and the probability that they are this much lower is p- = 1.0000000, both of which are extreme.
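The binning itself is straightforward to reproduce; here is a sketch, assuming OCaml's Hashtbl.hash yields non-negative 30-bit values:

```ocaml
(* Bucket each hash by its [bits] most significant bits and return the
   chi-square statistic of the observed counts against a uniform
   distribution over the 2^bits bins. *)
let chi_square bits hashes =
  let bins = 1 lsl bits in
  let count = Array.make bins 0 in
  List.iter
    (fun h ->
      let i = (h land 0x3fffffff) lsr (30 - bits) in
      count.(i) <- count.(i) + 1)
    hashes;
  let expected = float_of_int (List.length hashes) /. float_of_int bins in
  Array.fold_left
    (fun x2 observed ->
      let d = float_of_int observed -. expected in
      x2 +. d *. d /. expected)
    0.0 count
```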

Postprocessing the hashes with a simple mixing function helps matters considerably. I use the Murmur hash specialized to a single 32-bit value:

let murmur_hash_32 k h =
  let m = 0x5bd1e995l in
  let k = Int32.mul k m in 
  let k = Int32.logxor k (Int32.shift_right_logical k 24) in 
  let k = Int32.mul k m in 
  let h = Int32.mul h m in 
  let h = Int32.logxor h k in
  let h = Int32.logxor h (Int32.shift_right_logical h 13) in
  let h = Int32.mul h m in
  let h = Int32.logxor h (Int32.shift_right_logical h 15) in
  h

let universal_hash i h =
  let r = murmur_hash_32 (Int32.of_int h) (Int32.of_int (succ i)) in
  Int32.to_int (Int32.logand r 0x3fffffffl)

Since Murmur hash achieves full 32-bit avalanche, I can use it to construct a universal hash function. The results of applying this mixing function to Hashtbl.hash are much more encouraging:

χ² test of 234936 hashes with mixing function
Bins (ν) | X² | Pr[χ² ≤ X²]

Much better, except maybe for the distribution over 1024 bins (the 10 most significant bits). The Kolmogorov-Smirnov test shows that the probability that the entries are this much higher than a uniform deviate is p+ = 0.7688490, and the probability that they are this much lower is p- = 0.4623243, which is much more balanced.

Know your hash function. It should be random.


Euler's Power Riffs

Ed Sandifer is an exegete of Euler's methods. In his latest "How Euler Did It", he shows the method through which Euler arrived at closed-form polynomials for the sums of n-th powers of the first x integers:

In general, Euler has us multiply the terms of Σxⁿ by (n + 1)/(n + 2) x, (n + 1)/(n + 1) x, (n + 1)/n x, … (n + 1)/2 x and (n + 1)/1 x, respectively, and add the appropriate linear term to get the formula for Σxⁿ⁺¹.

where Σxⁿ is Euler's notation for the sum in question. The algorithm is very easily translated into a short OCaml function. First, I'll need a way to generate an integer range:

let range i j =
  let rec go l j =
    if j < i then l else
    go (j :: l) (pred j)
  in go [] j

Simple, eager, tail-call; almost imperative, definitely dirty. With that, Sandifer's rendition of Euler's method is simply:

open Num

let rec eusum n =
  if n = 0 then [Int 0; Int 1] else
  let l = eusum (pred n)
  (* n/1 n/2 n/3 ... n/(n+1) *)
  and k = List.map (fun d -> Int n // Int d) (range 1 (n + 1)) in
  (* the lowest coefficient is that of x^1 and always 0 *)
  let _ :: xs = List.map2 ( */ ) k l in
  (* normalize the sum of coefficients to 1 *)
  let x = List.fold_left ( -/ ) (Int 1) xs in
  Int 0 :: x :: xs

I want to clarify a couple of subtleties in the implementation. First, I use lists of Nums representing dense polynomials, least-order coefficient first. The base case is that for the sum of x ones (with exponent n = 0), namely x itself. As the article points out, the constant term is always zero, because a sum of zero powers is zero.

Then, the general case is built out of Σxⁿ (called l) and the terms by which to multiply it (called k, in reverse order to that reported by Sandifer, who follows Euler in putting the highest power first). Both polynomials are multiplied term by term with map2; however, the degree is not yet raised at this stage. Also, the lowest term is always zero by the previous observation, and I disregard it with an irrefutable pattern match (I could've used List.tl, but I prefer not to, in general). A simple fold_left computes the coefficient for the linear term, and the polynomial is completed with it and the zero constant term, thus raising the degree by one (one term discarded, two added).

That's it, seven lines of code. Regrettably, it is not of much use as it is:

# eusum 4 ;;
- : Num.num list =
[Int 0; Ratio <abstr>; Int 0; Ratio <abstr>; Ratio <abstr>; Ratio <abstr>]

Bummer, abstract data types. It turns out, pretty-printing polynomials is surprisingly tedious to get right:

let pp_print_poly pp l =
  let pow = ref 0
  and any = ref false in
  Format.pp_open_box pp 1;
  List.iter (fun k ->
    if k <>/ Int 0 then begin
      if !any then Format.pp_print_space pp ();
      Format.pp_open_hbox pp ();
      (* Print coefficient *)
      if k </ Int 0 then Format.pp_print_char pp '-' else
      if !any then Format.pp_print_char pp '+';
      if !any then Format.pp_print_space pp ();
      let n = Num.abs_num k in
      if n <>/ Int 1 || !pow = 0 then Format.pp_print_string pp (Num.string_of_num n);
      (* Print formal power *)
      if n <>/ Int 1 && !pow > 0 then Format.pp_print_char pp '*';
      if !pow > 0 then Format.pp_print_char pp 'x';
      if !pow > 1 then Format.fprintf pp "^%d" !pow;
      Format.pp_close_box pp ();
      any := true
    end;
    incr pow
  ) l;
  Format.pp_close_box pp ()

I follow some very general rules for writing pretty printers. First, always accept the pretty printing channel pp as a parameter for maximum generality (you can then use it with #install_printer in the top-level). Second, always box everything. In this case I use a vertical-or-horizontal box around the entire polynomial, with one space for continuation lines, and a horizontal-only box around monomials.

As it is evident here, the usual rules for polynomials make the function logic-heavy: the first term uses the sign of the coefficient, but subsequent monomials are added or subtracted with the operation sign offset with spaces, and the coefficient is taken in absolute value. If the coefficient or the variable are 1 they are not shown, unless it is the constant term. If both the coefficient and the power are shown, they are separated by the multiplication sign.

With this it is easy to recreate the table in page 2 of the paper:

# let () =
    Format.open_box 1;
    List.iter (fun i ->
      pp_print_poly Format.std_formatter (eusum i);
      Format.print_newline ()) (range 0 8);
    Format.close_box ()
x
1/2*x + 1/2*x^2
1/6*x + 1/2*x^2 + 1/3*x^3
1/4*x^2 + 1/2*x^3 + 1/4*x^4
-1/30*x + 1/3*x^3 + 1/2*x^4 + 1/5*x^5
-1/12*x^2 + 5/12*x^4 + 1/2*x^5 + 1/6*x^6
1/42*x - 1/6*x^3 + 1/2*x^5 + 1/2*x^6 + 1/7*x^7
1/12*x^2 - 7/24*x^4 + 7/12*x^6 + 1/2*x^7 + 1/8*x^8
-1/30*x + 2/9*x^3 - 7/15*x^5 + 2/3*x^7 + 1/2*x^8 + 1/9*x^9

As noted by Euler, the coefficient of the linear term is a Bernoulli number. There are more efficient ways to compute them, but in a pinch this serves:

let bernoulli n = let _ :: b :: _ = eusum n in b


Immutability Redux

My last post generated a ton of very interesting discussion. I'd like to summarize it here to bring the topic to closure.

In view of the excellent analysis performed by Mauricio Fernández, Ketan's comment:

At this point, you don't have the data to say what you said. You could just as easily say OCaml penalizes mutable data, or, less judgmentally, that it is less well-optimized for that use case.

is entirely justified. His first observation, that "[I have] more data to support that the performance disparity is due to the idiom rather than the programming language", was as misunderstood by me as it was insightful.

Mauricio did an in-depth analysis of the disparity and reported on it first on reddit, then on his own blog (which I cannot recommend highly enough). His follow-up is somewhat fuller than his first analysis, but I'll paraphrase it anyway for completeness's sake. OCaml's garbage collector is of the generational variety. This means that it works under the generational hypothesis: the most recently created objects are also those most likely to become unreachable quickly. In order to exploit this fact, it divides the heap into at least two regions: the young space, with newly created objects that mostly point to themselves and to older objects, and the old space, which comprises objects that have survived a number of collection phases. Any time an old object points to a young object, the reference must be added to the remembered set, a list of roots that allows the collector to quickly decide what is reachable without scanning the entire old space.

Mauricio shows that my "synthetic business entity update benchmark" violates the generational hypothesis: since floats are in general boxed, updating a mutable float field generates a boxed cell with the new value. After enough allocations, the fixed, mutable record is promoted to the old space, from which it points to the young space containing the box. Every mutation then adds to the remembered set via the caml_modify write barrier. His post shows ways to ameliorate this penalty by exploiting the conditions under which OCaml unboxes floats, in particular by using records comprised entirely of floats, mutable or immutable (in this case, just one suffices). He followed up with an in-depth analysis of the write barrier and a simple optimization that gains up to 50% in speed on synthetic benchmarks, and an excellent 32% on this particular benchmark.
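A minimal illustration of the layout difference (my sketch, not Mauricio's code):

```ocaml
(* A mutable float field mixed with non-float fields stays boxed: each
   update stores a pointer to a freshly allocated float block and, once
   the record has been promoted to the old generation, goes through the
   caml_modify write barrier. *)
type mixed = { mutable total : float; name : string }

(* A record containing only floats is stored flat (unboxed), so the
   update below overwrites the double in place: no allocation, no
   barrier. *)
type flat = { mutable sum : float }

let bump_mixed r = r.total <- r.total +. 1.0
let bump_flat  r = r.sum   <- r.sum   +. 1.0
```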

It is hoped that the OCaml team continues improving the performance of the garbage collector (indeed, some people are asking very vocally for different kinds of overhaul to it). Meanwhile, instead of recommending workarounds or speed-ups, I'd say "go with the idioms best supported by the language". In this case, indeed, it pays off to use immutable records in OCaml. This shouldn't be extrapolated to a general methodology applicable to other programming languages.