Back to OCaml! Catching up with StackOverflow's OCaml questions, I found an interesting one about private type abbreviations in module signatures. One thing in that conversation that struck me as odd was the assertion that the compiler optimizes single-constructor variants, thereby subsuming the semantics of all three of Haskell's declarators, data, type and newtype, into one. Definitive proof would require diving into the compiler's source; I decided instead to read the output of ocamlopt and try to catch a glimpse of this, knowing full well that it doesn't constitute definitive proof. What I found is both good and bad, or as we say here, una de cal y una de arena (one of lime and one of sand: some good, some bad).
Consider the following module definitions:
module S = struct
  type t = P of int * int
  let make x y = P (x, y)
end

module T = struct
  type t = int * int
  let make x y = x, y
end
Module S (for "single sum type") declares a single-variant type, while T (for "tuple type") declares an abbreviation. Putting them in a file and compiling it with ocamlopt -S -c generates the following listing (prelude and other specificities omitted):
S.make:

_camlAbbrev__make_1033:
        subq    $8, %rsp
        movq    %rax, %rdi
.L101:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L102
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    %rdi, (%rax)
        movq    %rbx, 8(%rax)
        addq    $8, %rsp
        ret
.L102:
        call    _caml_call_gc
        jmp     .L101

T.make:

_camlAbbrev__make_1038:
        subq    $8, %rsp
        movq    %rax, %rdi
.L105:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L106
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    %rdi, (%rax)
        movq    %rbx, 8(%rax)
        addq    $8, %rsp
        ret
.L106:
        call    _caml_call_gc
        jmp     .L105
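As a value-level aside (my addition, not part of the original experiment), the unsafe Obj module can be used to inspect the blocks the two make functions actually build at run time. A minimal sketch, assuming the definitions of S and T above are in scope:

(* Inspect the runtime representation of the values built by S.make and
   T.make.  Obj is unsafe and used here purely for inspection. *)
let describe name v =
  let r = Obj.repr v in
  Printf.printf "%s: tag %d, size %d, fields (%d, %d)\n"
    name (Obj.tag r) (Obj.size r)
    (Obj.obj (Obj.field r 0) : int)
    (Obj.obj (Obj.field r 1) : int)

let () =
  describe "S.make 2 3" (S.make 2 3);
  describe "T.make 2 3" (T.make 2 3)

As far as I can tell, both calls report a block of tag 0 and size 2 holding 2 and 3, consistent with the stores in the listings.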
(ocamlopt -version is 3.12.1). The compiler generates the exact same code in both cases. The three movq stores build the heap-allocated 3-word pair: the header (carrying the tag), the first component and the second component. In order to test modular abstraction, I coerce these implementations into the following signatures:
module A : sig
  type t = int * int
  val make : int -> int -> t
end = T

module B : sig
  type t = private int * int
  val make : int -> int -> t
end = T

module X : sig
  type t = P of int * int
  val make : int -> int -> t
end = S

module Y : sig
  type t = private P of int * int
  val make : int -> int -> t
end = S
Modules A and B coerce module T with a public and a private type definition, respectively. Modules X and Y do the same with module S. The following code shows how to use each variant of type t in a destructuring pattern match:
let a () = A.(let (x, y) = make 2 3 in x + y)
let b () = B.(let (x, y) = (make 2 3 :> int * int) in x + y)
let x () = X.(let P (x, y) = make 2 3 in x + y)
let y () = Y.(let P (x, y) = make 2 3 in x + y)
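For contrast, here is a sketch of what the type checker rejects; the commented-out variants b' and y' are my additions, not in the original post:

(* Hypothetical variant of b without the coercion: rejected, because
   outside of B the private abbreviation B.t is not interchangeable
   with int * int. *)
(* let b' () = B.(let (x, y) = make 2 3 in x + y) *)

(* Hypothetical attempt to build a Y.t directly: rejected, because a
   private variant can be matched on (as in y above) but not
   constructed outside its defining module. *)
(* let y' () = Y.P (2, 3) *)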
As explained in the manual, a private type abbreviation requires an explicit coercion for the types to be compatible. The use of a single-variant type makes not only for safer coding (the intent behind Haskell's newtype) but also for lighter syntax. I was surprised by the assembly code generated by ocamlopt (somewhat simplified):
a:

_camlAbbrev__a_1060:
        subq    $8, %rsp
.L109:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L110
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L110:
        call    _caml_call_gc
        jmp     .L109

b:

_camlAbbrev__b_1063:
        subq    $8, %rsp
.L113:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L114
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L114:
        call    _caml_call_gc
        jmp     .L113

x:

_camlAbbrev__x_1066:
        subq    $8, %rsp
.L117:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L118
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L118:
        call    _caml_call_gc
        jmp     .L117

y:

_camlAbbrev__y_1070:
        subq    $8, %rsp
.L121:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L122
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L122:
        call    _caml_call_gc
        jmp     .L121
Identical machine code is generated in all four cases, which I find very encouraging. Here the compiler inlines the tupling constructor make (the three stores that fill the fresh block; note that an integer n is tagged as 2·n + 1, so that 2 is represented by 5 and 3 by 7). What I find a bummer is that the tuple is immediately destructured (into rbx and rax, the next two loads) and operated on componentwise (the final leaq; note in passing how addition of tagged integers is done with leaq, since 2·(m + n) + 1 = (2·m + 1) + (2·n + 1) - 1; a small sketch of this arithmetic follows the quotation below), without any kind of CSE being performed. This is perfectly summarized by Pascal Cuoq in a comment to another StackOverflow question:
[…] Last time I checked it didn't even optimize the allocation of a, b in match a, b with x, y -> .... If you wish to check by yourself, I found that reading the x86 assembly generated by ocamlopt -S was convenient for me because I didn't have to learn a new representation.
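As promised above, here is a minimal sketch of the tagging arithmetic, spelled out on plain ints (my addition; tag, untag and add_tagged are illustrative names, not compiler primitives):

(* An OCaml int n is represented at run time as 2*n + 1.  Adding two
   tagged integers is then a single machine addition followed by
   subtracting 1, which is what the leaq above computes. *)
let tag n = 2 * n + 1
let untag t = t asr 1                (* arithmetic shift drops the tag bit *)
let add_tagged tm tn = tm + tn - 1   (* = tag (untag tm + untag tn) *)

let () =
  assert (tag 2 = 5 && tag 3 = 7);
  assert (add_tagged (tag 2) (tag 3) = tag 5);
  assert (untag (add_tagged (tag 2) (tag 3)) = 5)

The single leaq in the listings computes exactly add_tagged, with the trailing -1 folded into the addressing mode.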
This lack of optimization, I suspect, is an artifact of inlining in the presence of code emission that must remain correct under separate compilation. Disturbingly enough, ocamlopt seems to inline even across module boundaries. Separating the test into an abbrev.ml implementation:
module S = struct
  type t = P of int * int
  let make x y = P (x, y)
end

module T = struct
  type t = int * int
  let make x y = x, y
end

module A = T
module B = T
module X = S
module Y = S
with interface abbrev.mli:
module A : sig
  type t = int * int
  val make : int -> int -> t
end

module B : sig
  type t = private int * int
  val make : int -> int -> t
end

module X : sig
  type t = P of int * int
  val make : int -> int -> t
end

module Y : sig
  type t = private P of int * int
  val make : int -> int -> t
end
the following code in test.ml:
open Abbrev

let a () = A.(let (x, y) = make 2 3 in x + y)
let b () = B.(let (x, y) = (make 2 3 :> int * int) in x + y)
let x () = X.(let P (x, y) = make 2 3 in x + y)
let y () = Y.(let P (x, y) = make 2 3 in x + y)
compiles to the same assembly as above! Xavier Leroy spells out the conditions for ocamlopt to do cross-module inlining; I think it is obvious that these tests fulfill them. An interesting tidbit in that message is the command-line argument -dcmm, which dumps an RTL-like representation of the generated code:
(function camlTest__a_1038 (param/1056: addr)
  (let match/1057 (alloc 2048 5 7)
    (+ (+ (load match/1057) (load (+a match/1057 8))) -1)))

(function camlTest__b_1041 (param/1058: addr)
  (let match/1059 (alloc 2048 5 7)
    (+ (+ (load match/1059) (load (+a match/1059 8))) -1)))

(function camlTest__x_1044 (param/1060: addr)
  (let match/1061 (alloc 2048 5 7)
    (+ (+ (load match/1061) (load (+a match/1061 8))) -1)))

(function camlTest__y_1048 (param/1062: addr)
  (let match/1063 (alloc 2048 5 7)
    (+ (+ (load match/1063) (load (+a match/1063 8))) -1)))
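For reference, one plausible way to reproduce all of this from the command line (my reconstruction; the exact invocations are not spelled out above):

ocamlopt -c abbrev.mli
ocamlopt -S -c abbrev.ml
ocamlopt -dcmm -S -c test.ml

Compiling abbrev.ml first leaves behind the abbrev.cmx that ocamlopt needs in order to inline across the module boundary when compiling test.ml.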
I think it is not very productive to worry too much about the cost of abstractions in OCaml, as it probably is offset by the naïveté of the code generator.
2 comments:
Pattern matching is compiled away (and, in the process, optimizes things such as `match 2,3 with (x,y) -> ...`) in an early stage of OCaml's compilation that is common to both the bytecode and the native compilers. Inlining is performed later, in the native compiler only.

This means that the pattern matching compiler won't get access to the additional optimizations made possible by the inlining pass. Ordering or repeating compiler passes is a delicate problem in compiler design (should a pass be performed on the source, on a low-level intermediate language, or on both?), and OCaml makes the choice of simplicity here.

If you can produce a real-world code example where optimizing this kind of code makes a noticeable difference in performance (even better if it is already existing code), you should submit a request on the OCaml bugtracker.
It clearly is not an issue with the pattern matching compiler; the tuple is allocated and built just to be destructured and used, well after that early stage. Whether this optimization should be performed on the RTL or as a peephole pass over the generated instructions, I don't know. I don't know what the "real-world" (them's fighting words; Dijkstra was famously wary of those) impact of this is, either, and frankly it is not something that concerns me here; the exercise was admittedly artificial, done purely to satisfy my curiosity, and I derive no value judgements from it.