2010-07-02

Sukhotin's Algorithm

There is an obscure algorithm devised by one Boris Viktorovich Sukhotin, a Russian… linguist? mathematician? cryptographer? let's say researcher, in the early '70s, usually called Sukhotin's Vowel Identification Algorithm, whose initial aim was to quickly attack an alphabetic cyphertext encoded with an unknown substitution cypher. Jacques Guy, a linguist and programmer well-known for his interest in the Voynich manuscript, translated the algorithm from Russian and simplified it but his publishing in Cryptologia didn't help it being more widely known. Besides, his explanation is not very illuminating, so I'll show the algorithm and then try to explain it from first principles.

This classification algorithm, given a text over an alphabet Σ of size N = |Σ|, partitions Σ into two disjoint subsets V and K (for vowels and consonants). It takes as input the symmetric digraph frequencies F[xy] = F[yx], that is, the frequency with which letter xΣ appears next to letter yΣ in the text. The algorithm is predicated on the assumption that in natural languages vowels tend to appear next to consonants rather than to other vowels. On input, the array d records the frequencies F[xy], subject to the condition:

⟨∀i, j : 0 ≤ i, j < N : d.i.j = d.j.i ≥ 0⟩

On exit, the stack v will contain the letters identified as vowels by the algorithm. I use Dijkstra's guarded command language, except that I omit trivial skip branches in single-branched ifs.

begin
   var v: stack of int;
   var r: array [0 … N) of int;
   var m, i, j: int;
   m, i := 0, 0;
   do iNr.i, j := 0, 0;
      do jNr.i := r.i + d.i.j;
         j := j + 1
      od;
      if r.m < r.im := i fi;
      i := i + 1
   od;
   { P }
   do r.m > 0 →
      v:push.m;
      i, j := 0, 0;
      do iNif imr.i :=  r.i − 2⋅d.i.mi = mr.i := −r.i
         fi;
         if r.j < r.ij := i fi;
         i := i + 1
      od;
      m := j
      { P }
   od
   { v = V }
end

The algorithm initially assumes that every letter is a consonant, that is, initially K = Σ. The first loop computes in r.i the cumulative frequency of character i, recording in m the character with maximum frequency, thus establishing the initial validity of the loop invariant P :

P ≡ ⟨∀ i : 0 ≤ i < N : r.i = ⟨∑j : jK : F[ij]⟩ ∧ r.mr.i

The second loop tries to maximize:

Q = ⟨∑i : iV : ⟨∑j : jK : F[ij]⟩⟩

under invariance of P. This measure is motivated by the original assumption that vowels are more likely to be preceded or followed by consonants than by other vowels, and that a partition of Σ should maximize this likelihood. The second loop proceeds to hill-climb to this maximum by assuming that the most frequent letter is a vowel, adding each r.m to V and removing it from K under the invariance of P. Its inner loop restores it by discounting twice the frequency for m: once for d.m.i and then for d.i.m, except for d.m.m, since after reduction it is precisely equal to r.m. It also keeps track of the new maximum for the next iteration. This process continues as long as Q can increase, which is implied by the loop condition.

References

  1. B. V. Sukhotin. DECIPHERING METHODS AS A MEANS OF LINGUISTIC RESEARCH, COLING Vol.1 1973. Online.
  2. ———. OPTIMIZATION ALGORITHMS OF DECIPHERING AS THE ELEMENTS OF A LINGUISTIC THEORY, COLING Budapest Vol.1 1988. Online.
  3. Jacques B. M. Guy. VOWEL IDENTIFICATION: AN OLD (BUT GOOD) ALGORITHM, CRYPTOLOGIA Vol.XV No.3 1991. Online.
  4. David M. W. Powers. Unsupervised learning of linguistic structure. IJCL Vol.2 No.1 1997. Online.

No comments: