2011-02-05

Finding Duplicate Files, on Batteries

I just got another machine. I had to migrate almost 18 years of backups, documents and other digital records. Many of these were duplicated; some I filtered by hand but I needed to automate the bulk of the process. Nothing fancy, just something to guide me in the pruning. To reduce the amount of code needed I dove head first on Batteries. Here is the result, in less than two pages of fully commented code.

The gist of the program is to find all the files in one or more directories, get their lengths, group the files by length and retain as candidates groups of two or more; then for each group compute an MD5 digest, find groups of two or more files with the same digest and deem them duplicates. First I open the modules I need most:

open Batteries
open List
open Unix

I produce an Enum with all the files in a directory and their subdirectories together with their lengths, except that I'm not interested on some of the files (zero-length files, Mac OS-specific metadata and Subversion control files):

(* Filter files based on specific criteria *)
let prune leaf = leaf = ".svn" || leaf = ".DS_Store"

(* Find all regular files and their lengths in the given path *)
let rec all_file_lengths path =
    let open Enum in
       Sys.files_of path
    |> map (fun leaf ->
        (* Skip undesirable files *)
        if prune leaf then empty () else
        let path = Filename.concat path leaf in
        let stat = stat path in
        (* Depth-first recursion on directories *)
        if stat.st_kind = S_DIR then all_file_lengths path else
        (* Skip empty files *)
        if stat.st_size = 0 then empty () else
        (* Report the file and its size *)
        singleton (path, stat.st_size))
    |> concat

It might seem to be a confusion of concerns to recursively traverse the directory tree and to compute the lengths of the files, but in order to minimize the I/O effort required I make use of a single stat call to find whether the file is a directory or not and its length. Note that at each step the result is a sub-enumeration, either empty or consisting of a single file or of a whole subdirectory of files. Note also that Sys.files_of all but forces me to work with enumerations.

The algorithmic heart of the program is to find partitions on a list according to some criterion having two or more members each. In the past I had to write my own function for that; now Batteries gives me everything I need:

(* Find equivalence classes of at least two members *)
let quotient ~by =
    (* Map criterion (Schwartzian transform) *)
       map (fun x -> (x, by x))
    (* Group by criterion *)
    |- group (fun (_, x) (_, y) -> compare x y)
    (* Filter groups with at least two members *)
    |- filter (function [] | [_] -> false | _ -> true)
    (* Project out the original elements *)
    |- map (map fst)

(If you have a better, non set-theoretical name for this function, I'm all ears.) Batteries' List.group sorts internally and finds consecutive runs of elements satisfying the given criterion. Since I'm going to use a computationally expensive grouping criterion (MD5 digests) I use a Schwartzian transform to process each element just once. Now since grouping involves sorting to avoid an O(n²) cost, I have to convert the enumeration into a list. Also, in my test runs I found that I had entire duplicate directories; in order not to complicate the code too much and yet be able to effectively identify those duplicate directories, a good compromise for me was to sort the list of duplicates so that files are kept together by directory:

(* List all duplicate files in a listing *)
let all_duplicate_files listing =
       of_enum listing
    (* Find duplicate lengths in list *)
    |> quotient ~by:snd
    (* Project out the path *)
    |> map (map fst)
    (* Find duplicate signatures in each group *)
    |> map (quotient ~by:Digest.file)
    (* Flatten the result *)
    |> concat
    (* Sort each group *)
    |> map (sort ~cmp:String.icompare)
    (* Sort the report by group leader *)
    |> sort ~cmp:(make_compare String.icompare)

I tried to make use of everything Batteries has, so that's it. To make this into a command-line tool I need to print the results:

(* Pretty-print a report of all duplicate files *)
let report_duplicate_files =
    let open Printf in
    iter (function
    | []      -> ()
    | p :: ps -> printf "> %s\n" p; iter (printf "< %s\n") ps; printf "\n")

and finally add a driver function:

(* Main function *)
let () =
    if !Sys.interactive then () else
    if Array.length Sys.argv = 1 then begin
        prerr_endline "usage - finddups <dir>...";
        exit 2
    end else try
        let listing = ref (Enum.empty ()) in
        for i = 1 to Array.length Sys.argv - 1 do
            listing := Enum.append !listing (all_file_lengths Sys.argv.(i))
        done;
        !listing |> all_duplicate_files |> report_duplicate_files;
        exit 0
    with e ->
        prerr_endline (Printexc.to_string e);
        exit 1

This simple script is I/O-bound, so I could have made it a hash-bang executable file without impacting its performance too much, but I opted to compile it with:

ocamlfind ocamlopt -thread -package batteries -linkpkg -o finddups finddups.ml

I hope you give Batteries a chance too!

4 comments:

gwern said...

I take it you wrote this for fun, not because you couldn't find an existing tool like fdupes or competitors: http://en.wikipedia.org/wiki/Fdupes#Similar_programs

Matías Giovannini said...

The fact that so many alternatives exist is an indication that this particular itch is easier to scratch by oneself than by finding someone else to do it: it never occurred to me to find a ready-made tool, either built-in or to port. In fact, had I not run it a couple of times tweaking the code here and there I would not have understood exactly what to keep and what to delete.

gasche said...

> It might seem to be a confusion of
> concerns to recursively traverse the
> directory tree and to compute the
> lengths of the files, but in order to
> minimize the I/O effort required I make
> use of a single stat call to find
> whether the file is a directory or not
> and its length.

A solution may be to formulate your recursive traversal as a fold parametrized by a function acting on the stat result. You would write the fold with only the traversal concern, then specialize it to your size-collecting traversal by giving only the size-collection logic.

Matías Giovannini said...

Indeed, that was my original formulation before I found that Batteries could enumerate directories for me. I ended up with half the code by doing things this way.