Pi Day 2026: OCaml vs OxCaml
Mark Elvers
7 min read

Categories

  • pi

Tags

  • tunbury.org

For Pi Day, I have implemented the same algorithm in both OCaml and OxCaml and compared the generated assembly and runtime performance.

OxCaml is a performance-focused extension of OCaml developed at Jane Street. It adds unboxed types, stack allocation, mutable bindings, and compile-time zero-allocation checking to the language. But what do these features actually look like in practice, and do they make a measurable difference?

The algorithm

This year, I’m going to use the Gauss-Legendre algorithm to compute pi by iteratively refining four variables. Starting from:

a = 1,  b = 1/sqrt(2),  t = 1/4,  p = 1

each iteration updates them:

a' = (a + b) / 2
b' = sqrt(a * b)
t' = t - p * (a - a')^2
p' = 2 * p

and pi is approximated by:

pi ~ (a + b)^2 / (4 * t)
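As a quick sanity check of the convergence (a throwaway snippet of mine, not part of the benchmark), each iteration roughly doubles the number of correct digits, so three iterations already exhaust a 64-bit float:

```ocaml
(* Gauss-Legendre: each iteration roughly doubles the correct digits. *)
let approx_pi n =
  let a = ref 1.0 and b = ref (1.0 /. sqrt 2.0)
  and t = ref 0.25 and p = ref 1.0 in
  for _ = 1 to n do
    let a' = (!a +. !b) /. 2.0 in
    let b' = sqrt (!a *. !b) in
    let d = !a -. a' in
    t := !t -. !p *. d *. d;
    p := 2.0 *. !p;
    a := a'; b := b'
  done;
  let s = !a +. !b in
  s *. s /. (4.0 *. !t)

let () =
  for n = 1 to 3 do
    Printf.printf "after %d iteration(s): %.15f\n" n (approx_pi n)
  done
```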

The code is deliberately structured as two functions: a step function that computes one iteration and returns the updated state, and a gauss_legendre function that iterates over step calls. The step function is marked [@inline never] to simulate a realistic cross-module call boundary of the kind that appears in real applications.

OCaml

(* pi_ocaml.ml - OCaml 5.4.0 *)

let[@inline never] step a b t p =
  let a' = (a +. b) /. 2.0 in
  let b' = sqrt (a *. b) in
  let d = a -. a' in
  let t' = t -. p *. d *. d in
  let p' = 2.0 *. p in
  (a', b', t', p')

let[@inline never] gauss_legendre () =
  let rec loop a b t p = function
    | 0 ->
      let s = a +. b in
      s *. s /. (4.0 *. t)
    | i ->
      let (a', b', t', p') = step a b t p in
      loop a' b' t' p' (i - 1)
  in
  loop 1.0 (1.0 /. sqrt 2.0) 0.25 1.0 25

The step function returns a float * float * float * float tuple. The gauss_legendre function recurses, threading the four values through each iteration.

OxCaml

(* pi_oxcaml.ml - OxCaml 5.2.0+ox *)

module F = Float_u
open Float_u.O_dot

let[@zero_alloc][@inline never] step a b t p
  : #(float# * float# * float# * float#) =
  let a' = (a +. b) /. #2.0 in
  let b' = F.sqrt (a *. b) in
  let d = a -. a' in
  let t' = t -. p *. d *. d in
  let p' = #2.0 *. p in
  #(a', b', t', p')

let[@zero_alloc][@inline never] gauss_legendre () : float# =
  let mutable a = #1.0 in
  let mutable b = #1.0 /. F.sqrt #2.0 in
  let mutable t = #0.25 in
  let mutable p = #1.0 in
  for _ = 1 to 25 do
    let #(a', b', t', p') = step a b t p in
    a <- a'; b <- b'; t <- t'; p <- p'
  done;
  let s = a +. b in
  s *. s /. (#4.0 *. t)

Three OxCaml features are used here:

unboxed types

Where an ordinary float is a boxed 64-bit IEEE double, allocated on the heap and subject to garbage collection, a float# is the bare value, held directly in a register or on the stack. The #(float# * float# * float# * float#) return type is an unboxed tuple: four values returned in four XMM registers, with no heap allocation at all.

Stack-local mutable bindings

let mutable creates a mutable binding that lives on the stack (or in registers), rather than the heap-allocated ref cell that standard OCaml would need to mutate a local across loop iterations.

Compile-time allocation checking

The [@zero_alloc] annotation asks the compiler to prove that the function performs no heap allocation. If any code path allocates, for example by boxing a float or building a tuple (and thereby potentially triggering the GC), the build fails with an error showing exactly where the allocation occurs.
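For illustration (my sketch, not from the post's code), a function that boxes its result fails the check at compile time; I'm assuming Float_u.to_float as the boxing conversion here:

```ocaml
(* OxCaml only: converting float# to float boxes the value on the heap,
   so the [@zero_alloc] check rejects this function at compile time,
   pointing at the allocation site. *)
let[@zero_alloc] boxes (x : float#) : float = Float_u.to_float x
```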

What the compiler generates

Both versions perform the same arithmetic with the same instruction sequence: addsd, divsd, sqrtsd, mulsd, subsd. The difference is what happens when step needs to return its four results to the caller.

OCaml: 104 bytes of heap allocation

    subq    $104, %r15              ; reserve 104 bytes on the minor heap
    cmpq    (%r14), %r15            ; enough space?
    jb      .L102                   ; if not → trigger GC

    ; Box each float (16 bytes each: 8-byte GC header + 8-byte value)
    movq    $1277, -8(%rbx)         ; GC tag for boxed float
    movsd   %xmm0, (%rbx)           ; store p'
    movq    $1277, -8(%rdi)         ; GC tag
    movsd   %xmm1, (%rdi)           ; store t'
    movq    $1277, -8(%rsi)         ; GC tag
    movsd   %xmm3, (%rsi)           ; store b'
    movq    $1277, -8(%rdx)         ; GC tag
    movsd   %xmm2, (%rdx)           ; store a'

    ; Build the 4-element tuple (40 bytes: 8-byte header + 4 pointers)
    movq    $4096, -8(%rax)         ; tuple header tag
    movq    %rdx, (%rax)            ; pointer → a'
    movq    %rsi, 8(%rax)           ; pointer → b'
    movq    %rdi, 16(%rax)          ; pointer → t'
    movq    %rbx, 24(%rax)          ; pointer → p'
    ret

.L102:
    call    caml_call_gc            ; garbage collection
    jmp     .L104

Every call to step allocates 104 bytes: four 16-byte boxed floats plus a 40-byte tuple to hold pointers to them. It also performs a GC boundary check and may trigger a garbage collection if the minor heap is full.
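The boxing is observable from within standard OCaml itself. A minimal sketch using the stdlib Gc module (step reproduced from the OCaml version above; the word count assumes a 64-bit native build, and bytecode reports more because it boxes intermediates too):

```ocaml
let[@inline never] step a b t p =
  let a' = (a +. b) /. 2.0 in
  let b' = sqrt (a *. b) in
  let d = a -. a' in
  let t' = t -. p *. d *. d in
  let p' = 2.0 *. p in
  (a', b', t', p')

(* Words allocated on the minor heap by a single call to step. *)
let words_per_call () =
  let before = Gc.minor_words () in
  let r = step 1.0 (1.0 /. sqrt 2.0) 0.25 1.0 in
  let after = Gc.minor_words () in
  ignore (Sys.opaque_identity r);
  after -. before

let () =
  (* Expect at least 13 words (104 bytes): four 2-word boxed floats
     plus a 5-word tuple. *)
  Printf.printf "words allocated: %.0f\n" (words_per_call ())
```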

OxCaml: zero allocation

    vaddsd  %xmm1, %xmm4, %xmm0   ; a' = (a + b) / 2
    vdivsd  %xmm3, %xmm0, %xmm0
    vmulsd  %xmm1, %xmm4, %xmm1   ; b' = sqrt(a * b)
    vsqrtsd %xmm1, %xmm6, %xmm1
    vsubsd  %xmm0, %xmm4, %xmm4   ; d = a - a'
    vmulsd  %xmm5, %xmm3, %xmm3   ; p' = 2 * p
    vmulsd  %xmm4, %xmm5, %xmm5   ; t' = t - p * d * d
    vmulsd  %xmm4, %xmm5, %xmm4
    vsubsd  %xmm4, %xmm2, %xmm2
    ret                           ; a' in xmm0, b' in xmm1,
                                  ; t' in xmm2, p' in xmm3

Just the pure arithmetic steps: four values go in via XMM registers, four come back out the same way.

Benchmark

The algorithm is too fast to benchmark on its own, so I ran the computation in a for loop 10 million times (250 million step calls in total) and measured with hyperfine on Linux (10 runs, 3 warmup runs). OxCaml comes out 1.44x faster:

Benchmark 1: ./pi_ocaml
  Time (mean +/- s):      2.931 s +/-  0.048 s    [User: 2.927 s, System: 0.004 s]
  Range (min ... max):    2.854 s ...  3.020 s    10 runs

Benchmark 2: ./pi_oxcaml
  Time (mean +/- s):      2.033 s +/-  0.090 s    [User: 2.018 s, System: 0.015 s]
  Range (min ... max):    1.945 s ...  2.188 s    10 runs

Summary
  ./pi_oxcaml ran
    1.44 +/- 0.07 times faster than ./pi_ocaml

Results

Both versions produce the correct value!

Pi = 3.141592653589794  (error: 8.88e-16)

Why [@inline never]?

The [@inline never] annotation deserves a closer look: without it, the two compilers behave quite differently, and the comparison changes.

Without the annotation, OxCaml inlines step into gauss_legendre, eliminating the function call entirely; the loop body becomes just the register arithmetic, with no call instruction anywhere:

; OxCaml gauss_legendre WITHOUT [@inline never] - step is fully inlined
.L124:
    vmulsd  %xmm0, %xmm1, %xmm4   ; b' = sqrt(a * b)
    vsqrtsd %xmm4, %xmm5, %xmm4
    vaddsd  %xmm0, %xmm1, %xmm0   ; a' = (a + b) / 2
    vdivsd  %xmm5, %xmm0, %xmm0
    vsubsd  %xmm0, %xmm1, %xmm1   ; d = a - a'
    vmulsd  %xmm1, %xmm2, %xmm6   ; t' = t - p * d^2
    vmulsd  %xmm1, %xmm6, %xmm1
    vsubsd  %xmm1, %xmm3, %xmm3
    vmulsd  %xmm2, %xmm5, %xmm2   ; p' = 2 * p
    ...
    jmp     .L124                   ; loop — no call instruction anywhere

OCaml 5.4.0 does not inline step, even without the annotation. The loop still contains call camlPi_ocaml_inline.step_274@PLT, and every call still boxes four floats and builds a tuple on the heap:

; OCaml gauss_legendre WITHOUT [@inline never] - step is NOT inlined
.L111:
    cmpq    $1, %rdx                ; base case check (i = 0)
    je      .L110
    movq    %rdx, (%rsp)            ; spill loop counter
    call    camlPi_ocaml_inline.step_274@PLT  ; still a real call
    movq    24(%rax), %rsi          ; unbox returned tuple
    movq    16(%rax), %rdi
    movq    8(%rax), %rbx
    movq    (%rax), %rax
    jmp     .L111                   ; loop - passing boxed pointers

So [@inline never] on the OxCaml version levels the playing field. It prevents flambda2 from optimising away the very overhead I was trying to measure. Without it, the gap would be even larger, because OxCaml would inline everything while OCaml would still be boxing.

In practice, [@inline never] also represents the realistic case: real applications are made of many modules, and calls across compilation-unit boundaries cannot be inlined by either compiler. The boxing overhead shown here occurs whenever a function in one .ml file returns multiple float values to a caller in another.

Note

I wrote this post last weekend in readiness for Pi Day, not realising that I would have the chance to use these techniques during the week in my OxCaml inference engine.