2007-12-11

The Alphanumeric Sort

A discussion over reddit about Dave Koelle's "Alphanum Algorithm" to sort "naturally" strings that include numerals. The problem to solve is, in Koelle's words:

For example, in a sorted list of files, "z100.html" is sorted before "z2.html". But obviously, 2 comes before 100!

First, it's clear that it's a comparison function between (rather, a total order on) strings what's at issue here; second, as pointed out in reddit, it's what OS X does in the Finder (I wouldn't swear it, but I think iTunes sorts it like this, too). Martin Pool's Natural String Order comparison is another implementation of this idea, one that is special in that (in his words) [p]erformance is linear: each character of the string is scanned at most once, and only as many characters as necessary to decide are considered. His page also lists a number of alternative implementations, giving priority to Stuart Cheshire.

I'll not aim that high; my implementation is also linear but I won't bother trying to visit each character at most once. The idea is to separate numbers from non-numbers, and compare corresponding numbers according to numeric value and not as strings. In other words, 100 > 20 even when "100" < "20" (as the string comparison stops as soon as the first character '1' on the left compares less than the first character '2' on the right). The first thing needed would then be a way to split a string between numeric and non-numeric components. OCaml regexps are somewhat crude, but a multipass (cue Milla Jovovich's Leeloo Dallas multipass!) approach works: I'll insert a delimiter between a sequence of digits followed by a sequence of non-digits; then I'll insert the same delimiter between a sequence of non-digits followed by a sequence of digits. Finally, I'll split the string on the delimiter. The only problem is that any character is potentially valid in a string; I cop out and use a null byte:

let (%) f g x = f (g x)

let split =
  let re_d_D = Str.regexp "\\([0-9]+\\)\\([^0-9]+\\)"
  and re_D_d = Str.regexp "\\([^0-9]+\\)\\([0-9]+\\)"
  and re_NUL = Str.regexp_string "\000"
  and tp_spl = "\\1\000\\2" in
    Str.split re_NUL
  % Str.global_replace re_D_d tp_spl
  % Str.global_replace re_d_D tp_spl

Next, I need to lift a comparison on elements to a comparison on a list of those elements. There is no standard lexicographical comparison on lists, but the implementation is trivial. As long as elements compare as equal, recur on the rest of the lists until a difference is found or the end of the shortest list is reached, whichever comes first:

let rec cmp_list cmp l m = match l, m with
| [], [] ->  0
| [], _  -> -1
| _, []  ->  1
| a :: l, b :: m ->
  let c = cmp a b in
  if c == 0 then cmp_list cmp l m else c

Finally comes the alphanumeric comparison. The idea is to try first to compare both strings as numbers, and if that isn't possible to fall back on plain string comparison. For that, I take advantage of the fact that int_of_string fails with exception on non-numeric input:

let cmp_alnum x y =
  try Pervasives.compare (int_of_string x) (int_of_string y)
  with Failure _ -> Pervasives.compare x y

Putting all together, I perform a Schwartzian Transform on the list of strings to be sorted, splitting on numbers to give a list of lists. Then, I sort using the standard sort on lists with a comparison function built by lifting the alphanumeric comparison to a lexicographic ordering. Lastly, I concatenate the pieces back into strings:

let sort_alnum =
    List.map (String.concat "")
  % List.sort (cmp_list cmp_alnum)
  % List.map split

To prove my claim of linearity, note that split is (likely—I haven't studied Str's code) linear since the regexes are deterministic, that cmp_list is obviously linear on the size of the lists hence linear on the size of the original string, and that cmp_alnum is linear since conversion of string to numbers is linear just as is lexicographical comparison. Simple, huh?

No comments: