Back to OCaml! Catching up with StackOverflow's OCaml questions, I found an interesting one about private type abbreviations in module signatures. One thing in that conversation that struck me as odd was the assertion that the compiler optimizes single-constructor variants, thereby subsuming the semantics of all three of Haskell's declarators, data, type and newtype, into one. Definitive proof would require diving into the compiler's source; I decided instead to read the output of ocamlopt and try to catch a glimpse of this, knowing full well that it doesn't constitute definitive proof. What I found is both good and bad, or as we say here, una de cal y una de arena (one of lime and one of sand: some good, some bad).
Consider the following module definitions:
module S = struct
  type t = P of int * int
  let make x y = P (x, y)
end

module T = struct
  type t = int * int
  let make x y = x, y
end
Module S (for "single sum type") declares a single-variant type, while T (for "tuple type") declares an abbreviation. Putting them in a file and compiling it with ocamlopt -S -c generates the following listing (prelude and other specificities omitted):
S.make:

_camlAbbrev__make_1033:
        subq    $8, %rsp
        movq    %rax, %rdi
.L101:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L102
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    %rdi, (%rax)
        movq    %rbx, 8(%rax)
        addq    $8, %rsp
        ret
.L102:
        call    _caml_call_gc
        jmp     .L101

T.make:

_camlAbbrev__make_1038:
        subq    $8, %rsp
        movq    %rax, %rdi
.L105:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L106
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    %rdi, (%rax)
        movq    %rbx, 8(%rax)
        addq    $8, %rsp
        ret
.L106:
        call    _caml_call_gc
        jmp     .L105
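As a value-level aside (my addition, not part of the original experiment), the unsafe Obj module can be used to inspect the blocks the two make functions actually build at run time. A minimal sketch, assuming the definitions of S and T above are in scope:

(* Inspect the runtime representation of the values built by S.make and
   T.make.  Obj is unsafe and used here purely for inspection. *)
let describe name v =
  let r = Obj.repr v in
  Printf.printf "%s: tag %d, size %d, fields (%d, %d)\n"
    name (Obj.tag r) (Obj.size r)
    (Obj.obj (Obj.field r 0) : int)
    (Obj.obj (Obj.field r 1) : int)

let () =
  describe "S.make 2 3" (S.make 2 3);
  describe "T.make 2 3" (T.make 2 3)

As far as I can tell, both calls report a block of tag 0 and size 2 holding 2 and 3, consistent with the stores in the listings.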
(ocamlopt -version is 3.12.1). The compiler generates the exact same code in both cases. The three movq stores build the heap-allocated 3-word pair: the header (carrying the tag), the first component and the second component. In order to test modular abstraction, I coerce these implementations into the following signatures:
module A : sig
  type t = int * int
  val make : int -> int -> t
end = T

module B : sig
  type t = private int * int
  val make : int -> int -> t
end = T

module X : sig
  type t = P of int * int
  val make : int -> int -> t
end = S

module Y : sig
  type t = private P of int * int
  val make : int -> int -> t
end = S
Modules A and B coerce module T with a public and a private type definition, respectively. Modules X and Y do the same with module S. The following code shows how to use each variant of type t in a destructuring pattern match:
let a () = A.(let (x, y) = make 2 3 in x + y)
let b () = B.(let (x, y) = (make 2 3 :> int * int) in x + y)
let x () = X.(let P (x, y) = make 2 3 in x + y)
let y () = Y.(let P (x, y) = make 2 3 in x + y)
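For contrast, here is a sketch of what the type checker rejects; the commented-out variants b' and y' are my additions, not in the original post:

(* Hypothetical variant of b without the coercion: rejected, because
   outside of B the private abbreviation B.t is not interchangeable
   with int * int. *)
(* let b' () = B.(let (x, y) = make 2 3 in x + y) *)

(* Hypothetical attempt to build a Y.t directly: rejected, because a
   private variant can be matched on (as in y above) but not
   constructed outside its defining module. *)
(* let y' () = Y.P (2, 3) *)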
As explained in the manual, a private type abbreviation requires an explicit coercion for the types to be compatible. The use of a single-variant type makes not only for safer coding (the intent behind Haskell's newtype) but also for lighter syntax. I was surprised by the assembly code generated by ocamlopt (somewhat simplified):
a:

_camlAbbrev__a_1060:
        subq    $8, %rsp
.L109:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L110
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L110:
        call    _caml_call_gc
        jmp     .L109

b:

_camlAbbrev__b_1063:
        subq    $8, %rsp
.L113:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L114
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L114:
        call    _caml_call_gc
        jmp     .L113

x:

_camlAbbrev__x_1066:
        subq    $8, %rsp
.L117:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L118
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L118:
        call    _caml_call_gc
        jmp     .L117

y:

_camlAbbrev__y_1070:
        subq    $8, %rsp
.L121:
        subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L122
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L122:
        call    _caml_call_gc
        jmp     .L121
Identical machine code is generated in all four cases, which I find very encouraging. Here the compiler inlines the tupling constructor make (the three stores that fill the fresh block; note that an integer n is tagged as 2·n + 1, so that 2 is represented by 5 and 3 by 7). What I find a bummer is that the tuple is immediately destructured (into rbx and rax, the next two loads) and operated on componentwise (the final leaq; note in passing how addition of tagged integers is done with leaq, since 2·(m + n) + 1 = (2·m + 1) + (2·n + 1) - 1; a small sketch of this arithmetic follows the quotation below), without any kind of CSE being performed. This is perfectly summarized by Pascal Cuoq in a comment to another StackOverflow question:
[…] Last time I checked it didn't even optimize the allocation of a, b in match a, b with x, y -> .... If you wish to check by yourself, I found that reading the x86 assembly generated by ocamlopt -S was convenient for me because I didn't have to learn a new representation.
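As promised above, here is a minimal sketch of the tagging arithmetic, spelled out on plain ints (my addition; tag, untag and add_tagged are illustrative names, not compiler primitives):

(* An OCaml int n is represented at run time as 2*n + 1.  Adding two
   tagged integers is then a single machine addition followed by
   subtracting 1, which is what the leaq above computes. *)
let tag n = 2 * n + 1
let untag t = t asr 1                (* arithmetic shift drops the tag bit *)
let add_tagged tm tn = tm + tn - 1   (* = tag (untag tm + untag tn) *)

let () =
  assert (tag 2 = 5 && tag 3 = 7);
  assert (add_tagged (tag 2) (tag 3) = tag 5);
  assert (untag (add_tagged (tag 2) (tag 3)) = 5)

The single leaq in the listings computes exactly add_tagged, with the trailing -1 folded into the addressing mode.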
This lack of optimization, I suspect, is an artifact of inlining in the presence of code emission that must remain correct under separate compilation. Disturbingly enough, ocamlopt seems to inline even across module boundaries. Separating the test into an abbrev.ml implementation:
module S = struct
  type t = P of int * int
  let make x y = P (x, y)
end

module T = struct
  type t = int * int
  let make x y = x, y
end

module A = T
module B = T
module X = S
module Y = S
with interface abbrev.mli:
module A : sig
  type t = int * int
  val make : int -> int -> t
end

module B : sig
  type t = private int * int
  val make : int -> int -> t
end

module X : sig
  type t = P of int * int
  val make : int -> int -> t
end

module Y : sig
  type t = private P of int * int
  val make : int -> int -> t
end
the following code in test.ml:
open Abbrev

let a () = A.(let (x, y) = make 2 3 in x + y)
let b () = B.(let (x, y) = (make 2 3 :> int * int) in x + y)
let x () = X.(let P (x, y) = make 2 3 in x + y)
let y () = Y.(let P (x, y) = make 2 3 in x + y)
compiles to the same assembly as above! Xavier Leroy spells out the conditions for ocamlopt to do cross-module inlining; I think it is obvious that these tests fulfill them. An interesting tidbit in that message is the command-line argument -dcmm, which dumps an RTL-like representation of the generated code:
(function camlTest__a_1038 (param/1056: addr)
  (let match/1057 (alloc 2048 5 7)
    (+ (+ (load match/1057) (load (+a match/1057 8))) -1)))

(function camlTest__b_1041 (param/1058: addr)
  (let match/1059 (alloc 2048 5 7)
    (+ (+ (load match/1059) (load (+a match/1059 8))) -1)))

(function camlTest__x_1044 (param/1060: addr)
  (let match/1061 (alloc 2048 5 7)
    (+ (+ (load match/1061) (load (+a match/1061 8))) -1)))

(function camlTest__y_1048 (param/1062: addr)
  (let match/1063 (alloc 2048 5 7)
    (+ (+ (load match/1063) (load (+a match/1063 8))) -1)))
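For reference, one plausible way to reproduce all of this from the command line (my reconstruction; the exact invocations are not spelled out above):

ocamlopt -c abbrev.mli
ocamlopt -S -c abbrev.ml
ocamlopt -dcmm -S -c test.ml

Compiling abbrev.ml first leaves behind the abbrev.cmx that ocamlopt needs in order to inline across the module boundary when compiling test.ml.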
I think it is not very productive to worry too much about the cost of abstractions in OCaml, as it probably is offset by the naïveté of the code generator.
2 comments:
Pattern matching is compiled away (and, in the process, optimizes things such as `match 2,3 with (x,y) -> ...`) in an early stage of OCaml's compilation that is common to both the bytecode and the native compilers. Inlining is performed later, in the native compiler only.

This means that the pattern matching compiler won't get access to the additional optimizations made possible by the inlining pass. Ordering or repeating compiler passes is a delicate problem in compiler design (should a pass be performed on the source, on a low-level intermediate language, or on both?), and OCaml makes the choice of simplicity here.

If you can produce a real-world code example where optimizing this kind of code makes a noticeable difference in performance (even better if it is already existing code), you should submit a request on the OCaml bugtracker.
It clearly is not an issue with the pattern matching compiler; the tuple is allocated and built just to be destructured and used, well after that early stage. Whether this optimization should be performed on the RTL or as a peephole pass over the generated instructions, I don't know. I don't know what the "real-world" (them's fighting words; Dijkstra was famously wary of those) impact of this is, either, and frankly it is not something that concerns me here; the exercise was admittedly artificial, done purely to satisfy my curiosity, and I derive no value judgements from it.