2011-02-05

Finding Duplicate Files, on Batteries

I just got another machine. I had to migrate almost 18 years of backups, documents and other digital records. Many of these were duplicated; some I filtered by hand but I needed to automate the bulk of the process. Nothing fancy, just something to guide me in the pruning. To reduce the amount of code needed I dove head first on Batteries. Here is the result, in less than two pages of fully commented code.

The gist of the program is to find all the files in one or more directories, get their lengths, group the files by length and retain as candidates groups of two or more; then for each group compute an MD5 digest, find groups of two or more files with the same digest and deem them duplicates. First I open the modules I need most:

open Batteries
open List
open Unix

I produce an Enum with all the files in a directory and their subdirectories together with their lengths, except that I'm not interested on some of the files (zero-length files, Mac OS-specific metadata and Subversion control files):

(* Filter files based on specific criteria *)
let prune leaf = leaf = ".svn" || leaf = ".DS_Store"

(* Find all regular files and their lengths in the given path *)
let rec all_file_lengths path =
    let open Enum in
       Sys.files_of path
    |> map (fun leaf ->
        (* Skip undesirable files *)
        if prune leaf then empty () else
        let path = Filename.concat path leaf in
        let stat = stat path in
        (* Depth-first recursion on directories *)
        if stat.st_kind = S_DIR then all_file_lengths path else
        (* Skip empty files *)
        if stat.st_size = 0 then empty () else
        (* Report the file and its size *)
        singleton (path, stat.st_size))
    |> concat

It might seem to be a confusion of concerns to recursively traverse the directory tree and to compute the lengths of the files, but in order to minimize the I/O effort required I make use of a single stat call to find whether the file is a directory or not and its length. Note that at each step the result is a sub-enumeration, either empty or consisting of a single file or of a whole subdirectory of files. Note also that Sys.files_of all but forces me to work with enumerations.

The algorithmic heart of the program is to find partitions on a list according to some criterion having two or more members each. In the past I had to write my own function for that; now Batteries gives me everything I need:

(* Find equivalence classes of at least two members *)
let quotient ~by =
    (* Map criterion (Schwartzian transform) *)
       map (fun x -> (x, by x))
    (* Group by criterion *)
    |- group (fun (_, x) (_, y) -> compare x y)
    (* Filter groups with at least two members *)
    |- filter (function [] | [_] -> false | _ -> true)
    (* Project out the original elements *)
    |- map (map fst)

(If you have a better, non set-theoretical name for this function, I'm all ears.) Batteries' List.group sorts internally and finds consecutive runs of elements satisfying the given criterion. Since I'm going to use a computationally expensive grouping criterion (MD5 digests) I use a Schwartzian transform to process each element just once. Now since grouping involves sorting to avoid an O(n²) cost, I have to convert the enumeration into a list. Also, in my test runs I found that I had entire duplicate directories; in order not to complicate the code too much and yet be able to effectively identify those duplicate directories, a good compromise for me was to sort the list of duplicates so that files are kept together by directory:

(* List all duplicate files in a listing *)
let all_duplicate_files listing =
       of_enum listing
    (* Find duplicate lengths in list *)
    |> quotient ~by:snd
    (* Project out the path *)
    |> map (map fst)
    (* Find duplicate signatures in each group *)
    |> map (quotient ~by:Digest.file)
    (* Flatten the result *)
    |> concat
    (* Sort each group *)
    |> map (sort ~cmp:String.icompare)
    (* Sort the report by group leader *)
    |> sort ~cmp:(make_compare String.icompare)

I tried to make use of everything Batteries has, so that's it. To make this into a command-line tool I need to print the results:

(* Pretty-print a report of all duplicate files *)
let report_duplicate_files =
    let open Printf in
    iter (function
    | []      -> ()
    | p :: ps -> printf "> %s\n" p; iter (printf "< %s\n") ps; printf "\n")

and finally add a driver function:

(* Main function *)
let () =
    if !Sys.interactive then () else
    if Array.length Sys.argv = 1 then begin
        prerr_endline "usage - finddups <dir>...";
        exit 2
    end else try
        let listing = ref (Enum.empty ()) in
        for i = 1 to Array.length Sys.argv - 1 do
            listing := Enum.append !listing (all_file_lengths Sys.argv.(i))
        done;
        !listing |> all_duplicate_files |> report_duplicate_files;
        exit 0
    with e ->
        prerr_endline (Printexc.to_string e);
        exit 1

This simple script is I/O-bound, so I could have made it a hash-bang executable file without impacting its performance too much, but I opted to compile it with:

ocamlfind ocamlopt -thread -package batteries -linkpkg -o finddups finddups.ml

I hope you give Batteries a chance too!

2011-02-03

OCaml 3.12 and ocamlfind ocamldoc

Just a quick note: ocamldoc in 3.12 dumps its help to stderr, unlike every other tool in the distribution. The automatic argument detection of Findlib 1.2.6 fails to catch the output and so does not recognize any command line option. The quick fix is to edit and/or patch tools/extract_args/extract_args.ml, line 32 to read:

Sys.command (sprintf "%s -help >%s 2>&1"

before running ./configure

2011-01-09

A jolt, or a shock?

Things in OCaml Batteries that annoy me (i.e., this is merely opinion on my part, and not very informed at that. Caveat lector):

  • It is incompatible with stdlib in minor random ways:
    • List.sort requires label ~cmp
    • Channels are not completely wrapped, so that in_channel_length (open_in_bin "foo") doesn't type
  • Functional combinators are not what I've grown accustomed to. Turnstiles for composition are something I would never have thought of
  • Enumerators are in scope by default but conversion functions aren't. You can't do anything directly useful with --, for instance
  • Compiling it in results in huge executables
  • The help system is broken, at least on my install. I can't persuade it to know about anything at all

Edit: A maintainer left me a comment regarding reporting the issue to GitHub. I haven't got an account with them, and I don't plan to have one in the future, so to give further information:

# let inch = open_in_bin "dblib.mli" in let len = in_channel_length inch in close_in inch; len ;;
Characters 66-70:
  let inch = open_in_bin "dblib.mli" in let len = in_channel_length inch in close_in inch; len ;;
                                                                    ^^^^
Error: This expression has type BatIO.input = BatInnerIO.input
       but an expression was expected of type
         Batteries.in_channel = in_channel
# Pervasives.(let inch = open_in_bin "dblib.mli" in let len = in_channel_length inch in close_in inch; len) ;;
- : int = 21289

It is a minor oversight (in_channel_length needs lifting), but it bite me. This means in practice that you can't program against Batteries' Pervasives as if it were the stdlib's.

As to the second point, I entered the following in my .ocamlinit:


let id _ = failwith "Use Batteries ``identity'' function" ;;

let ( % ) _ = failwith "Use Batteries turnstiles ``|-'' and ``-|''" ;;

I expect the conditioning to kick in pretty quickly. As to the third point, of course it is my ignorance of the extensive library that frustrates me, and not a limitation of Batteries itself. Maybe I should retract it, but rest assured I am aware that I'm railing against my own limitations here.

As to the fourth… #man "modules" doesn't work; #man "topics" doesn't work; #man_module "BatIO" doesn't work… I'm reading batteries_help.ml here and nothing I can think of that is reasonable gives me a response other than Sorry, I don't know anything about X. If the indices can't be read, I'd expect an error message. If the syntax is incorrect, I'd expect a short blurb guiding me in the right direction. I just don't know what to tell it to satisfy it.

Edit 2: Regarding the help issue, upon further investigation I've found that none of the .idex files in /usr/local/share/doc/batteries-included/html/api were generated, so that #man is right in being perplexed. I've also found a number of working starting pointers (write Hashtbl.keys Toploop.directive_table |> List.of_enum and be amazed).

I've made a couple of changes that I expect will make my life easier with Batteries on Cygwin/MINGW:

  • Added the following to my .ocamlinit:
    let (browser: (_, _, _) format) = "\"path/to/chrome.exe\" %s" in
    Batteries_config.set_browser (fun url -> Sys.command (Printf.sprintf browser url))
    ;;
    
  • Rewritten /usr/local/share/doc/ocaml-batteries/language.idex to read:
    "batteries":  "html/index.html"
    "directives": "html/toplevel.html#directives"
    "ocaml":      "http://caml.inria.fr/pub/docs/manual-ocaml/"
    "wrappers":   "http://www.linux-nantes.org/~fmonnier/ocaml/ocaml-wrapping-c.php"
    
  • Rewritten /usr/local/share/doc/ocaml-batteries/toplevel.help to read:
    Welcome to OCaml, Batteries Included.
    
    Some directives:
     #quit;;                   (*Use this to quit OCaml.                *)
     #use "some_file.ml";;     (*Use this to load another program.      *)
     #require "some_package";; (*Use this to load an installed library. *)
     #help;;                   (*Well, you just used that one.          *)
     #man "some subject";;     (*Read the manual on some subject.       *)
     #browse "Some_module";;   (*Describe Some_module's contents.       *)
     #warnings "on";;          (*Turn on warnings.                      *)
     #warn_errors "on";;       (*Turn warnings into errors.             *)
    
    Some starting points:
     #man "batteries";;
     #man "directives";;
     #man "ocaml";;
     #man "wrappers";;
    

Now #help suits me.

2010-11-13

A First-Principles DNS Client

Ever wondered how nslookup works? I'm a protocol junkie, and Domain Name (RFC 1035) has it all. It's a simple protocol with a highly structured message format, so the rich machinery of OCaml makes easy to build a simple but relatively complete client. The code is long for a blog post (a bit less than 500 lines), so I'll show the highlights and leave the rest for you to use as you see fit. Since I'm not constrained by presenting a literate bottom-up program, I'll start with a couple of examples. A simple query is straightforward:

# query_dns "10.0.0.1" (query ~q_type:`A 0 "www.nytimes.com");;
- : dns_record =
{id = 0; detail = 33152;
 question = [{q_name = "www.nytimes.com"; q_type = `A; q_class = `IN}];
 answer =
  [{rr_name = "www.nytimes.com"; rr_type = `A; rr_class = `IN; rr_ttl = 120l;
    rr_rdata = `Address 164.107.65.0}];
 authority =
  [{rr_name = "www.nytimes.com"; rr_type = `NS; rr_class = `IN; rr_ttl = 60l;
    rr_rdata = `Domain "nss1.sea1.nytimes.com"};
   {rr_name = "www.nytimes.com"; rr_type = `NS; rr_class = `IN; rr_ttl = 60l;
    rr_rdata = `Domain "nss1.lga2.nytimes.com"}];
 additional =
  [{rr_name = "nss1.lga2.nytimes.com"; rr_type = `A; rr_class = `IN;
    rr_ttl = 60l; rr_rdata = `Address 164.107.65.0}]}

(I installed a top-level pretty printer to show the inet_addresses). As another example, here's how to query the servers to which to send mail for a domain:

let mail_servers server domain =
  let res = query_dns server (query ~q_type:`MX 0 domain) in
  List.sort (fun (p, _) (p', _) -> compare p p')
    (List.map (function { rr_rdata = `Exchange (p, d); _ } -> (p, d))
      (List.filter (fun { rr_type; _ } -> rr_type = `MX)
        res.answer))

And here it is in action:

# mail_servers "10.0.0.1" "google.com";;
- : (int * domain_name) list =
[(100, "google.com.s9a1.psmtp.com"); (200, "google.com.s9a2.psmtp.com");
 (300, "google.com.s9b1.psmtp.com"); (400, "google.com.s9b2.psmtp.com")]

In total, I think I've put in 20 hours to this little project, including this write-up. The networking part is very simple, though it took me a number of tries to get it right (I'm rusty):

let dns_port  = (Unix.getservbyname "domain" "udp").Unix.s_port

let query_dns addr q =
  let len = 4096 in
  let buf  = String.create len in
  let msg  = Writer.run (write_dns_record q) in
  let sock = Unix.socket Unix.PF_INET Unix.SOCK_DGRAM 0 in
  let peer = Unix.ADDR_INET (Unix.inet_addr_of_string addr, dns_port) in
  unwind ~protect:Unix.close (fun sock ->
    Unix.setsockopt_float sock Unix.SO_RCVTIMEO  1.;
    let _ = Unix.sendto sock msg 0 (String.length msg) [] peer in
    let cnt, _ = Unix.recvfrom sock buf 0 len [] in
    match Parser.run parse_dns_record (String.sub buf 0 cnt) with
    | Some dns -> dns
    | None     -> failwith "Parse error"
  ) sock

It is important to set the receive timeout for the datagram socket, otherwise it blocks trying to read too many bytes. The function takes the address of the Domain Name server to query and a dns_record containing the query. It is serialized, sent to the server, a response is received and parsed. For that, I use a parsing monad and a writer monoid to structure the handling of messages (the slogan is "parsing is monadic, pretty-printing is monoidal"):

module Parser = struct
  type cursor = string * int

  type 'a parser = Parser of (cursor -> 'a option * cursor)

  include Monad (struct
    type 'a t = 'a parser
    let return x = Parser (fun cur -> (Some x, cur))
    let bind f (Parser p) = Parser (fun cur ->
      match p cur with
      | (None  , _  ) -> (None, cur)
      | (Some x, cur) -> let Parser q = f x in q cur)
    let fail = Parser (fun cur -> (None, cur))
  end)

This is a simple deterministic parsing monad using option instead of list. The only quirk is that Domain Name messages use back-references for data compression, so I need full random-access to the message buffer with positioning information in the form of a cursor. Running a parser is done by providing it with a buffer:

  let run (Parser p) str = let (res, _) = p (str, 0) in res

The use of positioning information means that the parser is also a state monad:

  let tell     = Parser (fun (_, pos as cur) -> (Some pos, cur))
  let seek off = Parser (fun (str, _ as cur) ->
    if 0 <= off && off <= String.length str
    then (Some (), (str, off))
    else (None, cur) )

  (* … code omitted … *)
end

I deal with low-level entities like bytes (properly octets, but let's not quibble) and big-endian integers. Parsing these is straightforward:

type int16  = int
type byte   = int
type bytes  = string

let byte : byte Parser.t = Parser.(fmap int_of_char next)

let bytes cnt : bytes Parser.t = Parser.take cnt

let int16 : int16 Parser.t = 
  let open Parser in
  byte >>= fun hi ->
  byte >>= fun lo ->
  return ((hi lsl 8) lor lo)

let int32 : int32 Parser.t =
  let open Parser in
  let (<|) m n = Int32.logor (Int32.shift_left m 8) (Int32.of_int n) in
  byte >>= fun b0 ->
  byte >>= fun b1 ->
  byte >>= fun b2 ->
  byte >>= fun b3 ->
  return (((Int32.of_int b0 <| b1) <| b2) <| b3)

By the way, I love the new syntax for modules in OCaml 3.12! Now labels are length-prefixed sequences of bytes, and domain names are sequences of labels:

type label        = bytes
type domain_name  = bytes

let write_bytes (str : bytes) =
  let len = String.length str in
  if len > 255 then invalid_arg "write_bytes" else
  Writer.(byte len ++ bytes str)

let parse_bytes = Parser.(byte >>= bytes)

let write_label (lbl : label) =
  let len = String.length lbl in
  if len > 63 then invalid_arg "write_label" else
  Writer.(byte len ++ bytes lbl)

let parse_label : label Parser.t =
  let open Parser in
  ensure (byte >>= fun n -> guard (0 < n && n < 64)) >> byte >>= bytes

The RFC specifies that labels are limited to 63 octets in length, and domain names can use back-references for reducing the size of a packet. Thus a domain name is terminated by the null label (corresponding to the top-level domain "."), or by a pointer to another label in the packet:

let re_dot = Str.regexp_string "."

let split_domain (name : domain_name) = Str.split re_dot name

let write_domain_name (name : domain_name) =
  let labels = split_domain name in
  Writer.(sequence write_label labels ++ byte 0)

let parse_domain_name : domain_name Parser.t =
 let open Parser in
 let rec labels () =
    sequence parse_label       >>= fun ls  ->
    byte                       >>= fun n   ->
    if n = 0 then return ls else
    guard (n land 0xc0 = 0xc0) >>
    byte                       >>= fun m   ->
    let off = ((n land 0x3f) lsl 8) lor m in
    tell                       >>= fun pos ->
    seek off                   >>
    labels ()                  >>= fun ls' ->
    seek pos                   >>
    return (ls @ ls')
  in fmap (String.concat ".") $ labels ()

Writing a domain name is straightforward, as I don't do compression. Reading a domain name is done recursively. Note that parse_label fails (via ensure) whenever the label is empty or overlong. The first case corresponds to the top-level domain, which ends the sequence; the second case corresponds to a 16-bit pointer which is signalled by its two most-significant bits set. To read the back-reference I record the position in the packet via tell, seek to the referenced position, parse the label sequence recursively (which might trigger yet more back-references), restore the original position and return all the results. For example, consider the following complete response (don't mind the names yet):

id        0000  \000\000
detail    0002  \132\000
qdcount   0004  \000\001
ancount   0006  \000\002
nscount   0008  \000\000
arcount   0010  \000\000
q_name    0012  \009_services\007_dns-sd\004_udp
          0035  \005local\000
q_type    0042  \000\012
q_class   0044  \000\001
rr_name   0046  \192\012
rr_type   0048  \000\012
rr_class  0050  \000\001
rr_ttl    0052  \000\000\000\010
rr_dlen   0056  \000\020
rr_rdata  0058  \012_workstation
          0071  \004_tcp\192\035
rr_name   0078  \192\012
rr_type   0080  \000\012
rr_class  0082  \000\001
rr_ttl    0084  \000\000\000\010
rr_dlen   0088  \000\008
rr_rdata  0090  \005_http\192\071
          0098

The domain name at offset 0090 (_http) refers back to 0071 (_tcp) which in turn refers back to 0035 (local), which spells _http._tcp.local. Now Domain Name is basically a distributed database system which replies to queries with responses containing zero or more resources. The types of resources is spelled out in detail in the RFC:

type rr_type = [
| `A | `NS | `MD | `MF | `CNAME | `SOA | `MB | `MG
| `MR | `NULL | `WKS | `PTR | `HINFO | `MINFO | `MX | `TXT
]

Why polymorphic variants? Because query types are a superset of resource types:

type q_type = [ rr_type | `AXFR | `MAILB | `MAILA | `ANY ]

This lets me reuse the code for converting back and forth between labels and protocol integers. In turn, each standard resource has its particular format:

type resource = [
| `Hostinfo  of string * string
| `Domain    of domain_name
| `Mailbox   of domain_name * domain_name
| `Exchange  of int * domain_name
| `Data      of bytes
| `Text      of bytes list
| `Authority of domain_name * domain_name * int32 * int32 * int32 * int32 * int32
| `Address   of Unix.inet_addr
| `Services  of int32 * byte * bytes
]

The types are dictated by the protocol, and they in turn dictate how to format and write resource data. The first is straightforward:

let write_resource =
  let open Writer in function
  | `Hostinfo (cpu, os) ->
       write_bytes       cpu
    ++ write_bytes       os
  | `Domain name ->
       write_domain_name name
  | `Mailbox  (rmbx, embx) ->
       write_domain_name rmbx
    ++ write_domain_name embx
  | `Exchange (pref, exch) ->
       int16             pref
    ++ write_domain_name exch
  | `Data data ->
       bytes             data
  | `Text texts ->
    sequence write_bytes texts
  | `Authority (mname, rname, serial, refresh, retry, expire, minimum) ->
       write_domain_name mname
    ++ write_domain_name rname
    ++ int32             serial
    ++ int32             refresh
    ++ int32             retry
    ++ int32             expire
    ++ int32             minimum
  | `Address addr ->
       int32  (Obj.magic addr : int32)
  | `Services (addr, proto, bmap) ->
       int32             addr
    ++ byte              proto
    ++ bytes             bmap

(the use of Obj.magic to coerce addresses back and forth is relatively unperilous). Parsing requires knowing the resource type to decode the payload. Note that Writer.sequence formats a list by formatting each element in turn while Parser.sequence applies repeatedly a parser until it fails and returns the list of intermediate parsers:

let parse_resource rr_type rr_dlen =
  let open Parser in match rr_type with
  | `HINFO ->
    parse_bytes                >>= fun cpu ->
    parse_bytes                >>= fun os  ->
    return (`Hostinfo (cpu, os))
  | `MB | `MD | `MF | `MG | `MR | `NS
  | `CNAME | `PTR ->
    parse_domain_name          >>= fun name ->
    return (`Domain name)
  | `MINFO ->
    parse_domain_name          >>= fun rmailbx ->
    parse_domain_name          >>= fun emailbx ->
    return (`Mailbox (rmailbx, emailbx))
  | `MX ->
    int16                      >>= fun preference ->
    parse_domain_name          >>= fun exchange   ->
    return (`Exchange (preference, exchange))
  | `NULL ->
    bytes rr_dlen              >>= fun data ->
    return (`Data data)
  | `TXT ->
    sequence (byte >>= bytes)  >>= fun texts ->
    return (`Text texts)
  | `SOA ->
    parse_domain_name          >>= fun mname   ->
    parse_domain_name          >>= fun rname   ->
    int32                      >>= fun serial  ->
    int32                      >>= fun refresh ->
    int32                      >>= fun retry   ->
    int32                      >>= fun expire  ->
    int32                      >>= fun minimum ->
    return (`Authority (mname, rname, serial, refresh, retry, expire, minimum))
  | `A ->
    int32                      >>= fun addr ->
    return (`Address  (Obj.magic addr : Unix.inet_addr))
  | `WKS ->
    int32                      >>= fun addr   ->
    byte                       >>= fun proto  ->
    bytes (rr_dlen-5)          >>= fun bitmap ->
    return (`Services (addr, proto, bitmap))

There's nothing to it, really, as monadic notation makes the code read straight. A resource is described by a rsrc_record, which can be written and parsed fairly simply:

type rsrc_record = {
  rr_name  : domain_name;
  rr_type  : rr_type;
  rr_class : rr_class;
  rr_ttl   : int32;
  rr_rdata : resource;
}

let write_rsrc_record r =
  let open Writer in
  let rr_rdata = run (write_resource r.rr_rdata) in
     write_domain_name      r.rr_name
  ++ int16 (int_of_rr_type  r.rr_type)
  ++ int16 (int_of_rr_class r.rr_class)
  ++ int32                  r.rr_ttl
  ++ int16     (String.length rr_rdata)
  ++ bytes                    rr_rdata

let parse_rsrc_record =
  let open Parser in
  parse_domain_name              >>= fun rr_name  ->
  fmap rr_type_of_int  int16     >>= fun rr_type  ->
  fmap rr_class_of_int int16     >>= fun rr_class ->
  int32                          >>= fun rr_ttl   ->
  int16                          >>= fun rr_dlen  ->
  parse_resource rr_type rr_dlen >>= fun rr_rdata ->
  return { rr_name; rr_type; rr_class; rr_ttl; rr_rdata; }

Note that since the resource payload is prefixed by its length I have to write it out-of-band to know its length. This is the only instance of buffer copying in the implementation. A question is handled similarly:

type question = {
  q_name  : domain_name;
  q_type  : q_type;
  q_class : q_class;
}
 
let write_question q =
  let open Writer in
     write_domain_name     q.q_name
  ++ int16 (int_of_q_type  q.q_type )
  ++ int16 (int_of_q_class q.q_class)

let parse_question =
  let open Parser in
  parse_domain_name         >>= fun q_name  ->
  fmap q_type_of_int  int16 >>= fun q_type  ->
  fmap q_class_of_int int16 >>= fun q_class ->
  return { q_name; q_type; q_class; }

A Domain Name record has a serial number, a bit field with options and response codes and a sequence of question followed by various sequences of resources as answers:

type dns_record = {
  id         : int16;
  detail     : int16;
  question   : question    list;
  answer     : rsrc_record list;
  authority  : rsrc_record list;
  additional : rsrc_record list;
}

let write_dns_record d =
  let open Writer in
     int16                      d.id
  ++ int16                      d.detail
  ++ int16 (List.length         d.question  )
  ++ int16 (List.length         d.answer    )
  ++ int16 (List.length         d.authority )
  ++ int16 (List.length         d.additional)
  ++ sequence write_question    d.question
  ++ sequence write_rsrc_record d.answer
  ++ sequence write_rsrc_record d.authority
  ++ sequence write_rsrc_record d.additional

let parse_dns_record =
  let open Parser in
  int16                            >>= fun id         ->
  int16                            >>= fun detail     ->
  int16                            >>= fun qdcount    ->
  int16                            >>= fun ancount    ->
  int16                            >>= fun nscount    ->
  int16                            >>= fun arcount    ->
  repeat qdcount parse_question    >>= fun question   ->
  repeat ancount parse_rsrc_record >>= fun answer     ->
  repeat nscount parse_rsrc_record >>= fun authority  ->
  repeat arcount parse_rsrc_record >>= fun additional ->
  return { id; detail; question; answer; authority; additional; }

Very high-level, straight-line code thanks to the structuring power of the algebraic structures! A query involves asking for a resource identified by domain name. If you analyze the bit field, you'll notice that I ask for a recursive query:

let query ?(q_type=`A) id q_name =
  if id land 0xffff <> id then invalid_arg "query" else {
  id;
  detail     = 0b0_0000_0010_000_0000;
  question   = [ { q_name; q_type; q_class = `IN; } ];
  answer     = [];
  authority  = [];
  additional = []; }

That's it! Grab the code and play with it; I've put it in the Public Domain so that you can use it for whatever purpose you want.

2010-10-17

Text to PDF

I find that the best way for me to learn is to reexpress what I read in a different but equivalent form. Transcribing notes and translation are two applications of this heuristic. In this case, I've translated this simple and pretty Factor PDF text formatter into OCaml. The result is almost as compact, at less than 200 lines of code, thanks to the compositional nature that OCaml and Factor share, even if OCaml's library of combinators is rather poor. As a conscious style decision, I try to keep all definitions as short as possible. As usual, I start with an operator for functional composition:

let ( % ) f g x = f (g x)

PDF is essentially a textual format, so it's convenient to perform the conversion in memory by using strings. A PDF object is a string or stream tagged by a reference number:

let pdf_object n str =
  String.concat "\n" [string_of_int n ^ " 0 obj"; str; "endobj"]

These objects can be referenced later by such a number:

let pdf_ref n = string_of_int n ^ " 0 R"

PDF strings are quoted by parentheses like in Postscript, so they must be escaped. The function escape_with is somewhat crude but sufficient for the small number of escapes required:

let escape_with escapes str =
  let buf = Buffer.create (String.length str) in
  String.iter (fun c ->
    try Buffer.add_string buf (List.assoc c escapes)
    with Not_found -> Buffer.add_char buf c) str;
  Buffer.contents buf

let pdf_escapes = [
  '\008', "\\b";
  '\t'  , "\\t";
  '\n'  , "\\n";
  '\012', "\\f";
  '\r'  , "\\r";
  '('   , "\\(";
  ')'   , "\\)";
  '\\'  , "\\\\";
]

let pdf_string str = "(" ^ escape_with pdf_escapes str ^ ")"

A PDF dictionary is a list of key-value pairs with special syntax. I indulge in a bit of special-casing for the sake of better formatting:

let pdf_dict kvs =
  let buf = Buffer.create 16 in
  Buffer.add_string buf "<<";
  begin match kvs with
  | []     -> ()
  | [k, v] -> Printf.bprintf buf " /%s %s " k v
  | kvs    ->
    Buffer.add_char buf '\n';
    List.iter (fun (k, v) -> Printf.bprintf buf "/%s %s\n" k v) kvs
  end;
  Buffer.add_string buf ">>";
  Buffer.contents buf

This way short dictionaries occupy a single line. Every PDF file is annotated by metadata describing the particulars of its creation:

let iso_time () =
  let tm = Unix.gmtime (Unix.time ()) in
  Printf.sprintf "%04d%02d%02d%02d%02d%02d"
    (tm.Unix.tm_year + 1900) (tm.Unix.tm_mon + 1) tm.Unix.tm_mday
    tm.Unix.tm_hour tm.Unix.tm_min tm.Unix.tm_sec

let app_name = Filename.basename Sys.executable_name
and usr_name = try Sys.getenv "USER" with Not_found -> "unknown"

let pdf_info () =
  pdf_dict [
    "CreationDate", "D:" ^ iso_time ();
    "Producer"    , pdf_string app_name;
    "Author"      , pdf_string usr_name;
    "Creator"     , pdf_string "created with OCaml";
  ]

The pages in a PDF document form a tree rooted in a special page table referenced by a Catalog object. Here, as in the original program, the reference numbers are hardwired since the document structure is statically known:

let pdf_catalog () =
  pdf_dict [
    "Type" , "/Catalog";
    "Pages", pdf_ref 4;
  ]

All fonts in use must be similarly declared beforehand:

let pdf_font () =
  pdf_dict [
    "Type"    , "/Font";
    "Subtype" , "/Type1";
    "BaseFont", "/Courier";
  ]

As with dictionaries, arrays have their own syntax, very much like OCaml lists:

let pdf_array vs =
  let buf = Buffer.create 16
  and any = ref false in
  Buffer.add_string buf "[";
  List.iter (fun v -> Printf.bprintf buf " %s" v; any := true) vs;
  if !any then Buffer.add_char buf ' ';
  Buffer.add_string buf "]";
  Buffer.contents buf

PDF pages have definite sizes represented by a media box which serves as a coordinate system and clipping region:

let paper_size = function
| `A4     -> ( 595,  842)
| `Letter -> ( 612,  792)
| _       -> failwith "paper_size"

let media_box paper =
  let (w, h) = paper_size paper in
  pdf_array ["0"; "0"; string_of_int w; string_of_int h]

As I mentioned before, pages are rooted on a page table that forward references each child. Each page is comprised of a page dictionary and of its contents themselves, hence the need to reference every other object:

let rec table f n =
  let rec go i =
    if i = n then [] else f i :: go (succ i)
  in go 0

let pdf_pages ?(paper=`Letter) npages =
  pdf_dict [
    "Type"    , "/Pages";
    "MediaBox", media_box paper;
    "Count"   , string_of_int npages;
    "Kids"    , pdf_array (table (fun i -> pdf_ref (5 + 2 * i)) npages);
  ]

The page dictionary links back to the page table it depends on, the contents that follow it and any resources it needs:

let pdf_page pageno =
  pdf_dict [
    "Type"     , "/Page";
    "Parent"   , pdf_ref 4;
    "Contents" , pdf_ref (succ pageno);
    "Resources", pdf_dict ["Font", pdf_dict ["F1", pdf_ref 3]];
  ]

The page contents are represented by a PDF stream. In this case, I use the simplest formatting I can get away with:

let pdf_stream str =
  String.concat "\n" [
    pdf_dict ["Length", string_of_int (String.length str + 1)];
    "stream";
    str;
    "endstream"
  ]

Textual content is a kind of PDF stream comprised of text layout commands. Lines of text are shipped out by the ' (quote) operator one by one, after defining the characteristics (origin, font, etc) of the text box they are displayed in:

let pdf_text text =
  pdf_stream (String.concat "\n" ([
    "BT";
    "54 738 Td";
    "/F1 10 Tf";
    "12 TL";
  ] @ List.map (fun l -> pdf_string l ^ "'") text @ [
    "ET"
  ]))

All a PDF file contains is a sequence of cross-referenced objects. Given a collection of lines of text grouped into pages, the program writes the metadata, the catalog, the required resources, the page table and the sequence of pages given by their dictionary and contents:

let mapi f =
  let rec go i = function
  | [] -> []
  | x :: xs -> f i x :: go (succ i) xs
  in go 0

let objects_of_pages pages =
  mapi (pdf_object % succ) (List.concat ([
    pdf_info ();
    pdf_catalog ();
    pdf_font ();
    pdf_pages (List.length pages);
  ] :: mapi (fun i p -> [pdf_page (5 + 2 * i); pdf_text p]) pages))

Again, since the structure of the PDF file is known the references can be hardwired. Once the content is written, the PDF file ends with a trailer that cross-references the positions of all its contained objects by stream length. I fold over the list of objects to compute the file offsets at which they begin, starting at 9 to account for the PDF header:

let pdf_trailer objects =
  let len, tot, xref = List.fold_left ( fun (i, n, l) s ->
    (succ i, String.length s + 1 + n, Printf.sprintf "%010x 00000 n" n :: l) )
    (1, 9, []) objects in
  String.concat "\n" ([
    "xref";
    "0 " ^ string_of_int len;
    "0000000000 65535 f";
  ] @ List.rev xref @ [
    "trailer";
    pdf_dict [
      "Size", string_of_int len;
      "Info", pdf_ref 1;
      "Root", pdf_ref 2;
    ];
    "startxref";
    string_of_int tot;
    "%%EOF";
  ])

In this simple case the cross-reference table is written at the end; a stream-aware PDF document would have the table at the beginning, but that would need a two-pass algorithm. That is all that it's needed to write the PDF document:

let pdf_of_objects objects =
  String.concat "\n" (["%PDF-1.4"] @ objects @ [pdf_trailer objects])

(the header has length 9, as indicated above). It remains to format the text into lines. Lines will be split by terminator, accounting for all possibilities and taking care not to lose empty lines at the beginning or end of the text:

let split_lines = Str.split_delim (Str.regexp "\n\\|\r\n?")

In the same way that the original Factor code tab expansion is rather naïve:

let expand_tabs =
  String.concat "    " % Str.split_delim (Str.regexp_string "\t")

Composing both, making sure like in Factor that empty lines have at least something to print:

let lines_of_string =
  List.map (fun s -> if s = "" then " " else s) % split_lines % expand_tabs

Line wrapping and page layout are performed by grouping content lists by length. In the case of lines, I group lists of characters, so I need to convert from strings to lists and back. This is unnecessary in Factor since the word <groups> is generic on sequences, and so it would be in Haskell, but this is OCaml:

let explode str =
  let res = ref [] in
  String.iter (fun c -> res := c :: !res) str;
  List.rev !res

and implode cs =
  let buf = Buffer.create 16 in
  List.iter (Buffer.add_char buf) cs;
  Buffer.contents buf

Grouping lists by length is a simple function with a number of special cases:

let groups_of n =
  let rec go gss gs i = function
  | [] when gs = [] -> List.rev                 gss
  | []              -> List.rev (List.rev gs :: gss)
  | xs when i = n   -> go (List.rev gs :: gss) [] 0  xs
  | x :: xs         -> go gss (x :: gs)     (succ i) xs
  in go [] [] 0

As with the Factor code, I format text in pages of 57 lines of 84 characters. Except for the reading direction and use of ordnance, this function reads exactly like the original:

let pages_of_lines =
  groups_of 57 % List.concat % List.map (List.map implode % groups_of 84 % explode)

That's it. Formatting a text into a PDF document is a pipeline connecting all the pieces:

let pdf_of_text =
  pdf_of_objects % objects_of_pages % pages_of_lines % lines_of_string

As a bonus, a simple file filter:

let unwind ~(protect:'a -> unit) f (x : 'a) =
  try let y = f x in protect x; y with e -> protect x; raise e

let with_input_file  f file = unwind ~protect:close_in  f (open_in_bin file)
and with_output_file f file = unwind ~protect:close_out f (open_out    file)

let file_contents = with_input_file (fun inch ->
  let len = in_channel_length inch in
  let buf = String.create len in
  really_input inch buf 0 len;
  buf)

let pdf_of_file inf =
  let pdf = pdf_of_text (file_contents inf) in
  with_output_file (fun otch -> output_string otch pdf)

The entire program comes down to less than four pages in a formatted PDF document. As the original author remarks, this solution also compares favorably to a 600 line Python version and a 450 line C version. I expect that a Haskell version would be much nearer the 140 lines in the original program.