After reading Anil’s post about his zero-allocation HTTP parser httpz, I decided to apply some OxCaml optimisation techniques to my pure OCaml MP3 encoder/decoder.
The OCaml-based MP3 encoder/decoder has been the most ambitious project I’ve tried with Opus 4.5. It was a struggle to get it over the line: I even needed to read large chunks of the ISO standard and get to grips with some of the maths to help the AI troubleshoot.
Profiling an OCaml MP3 Decoder with Landmarks
Before diving into OxCaml, I wanted to get a feel for the current performance and also to make the obvious non-OxCaml performance improvements; otherwise, I would be comparing an optimised OxCaml version with an underperforming OCaml version.
It was 40 times slower than ffmpeg: 29.5 seconds to decode a 3-minute file versus 0.74 seconds. I used the landmarks profiling library to identify and fix the bottlenecks, bringing decode time down to 3.5 seconds (an 8x speedup).
Setting Up Landmarks
Landmarks is an OCaml profiling library that instruments functions and reports cycle counts. It was easy to add to the project (*) with a simple edit of the dune file:
(libraries ... landmarks)
(preprocess (pps landmarks-ppx --auto))
The --auto flag automatically instruments every top-level function — no manual annotation needed. Running the decoder with OCAML_LANDMARKS=on prints a call tree with cycle counts and percentages.
(*) It needed OCaml 5.3.0 for landmarks-ppx compatibility; it wouldn’t install on OCaml 5.4.0 due to a ppxlib version constraint.
Issues
78% of the time was spent in the Huffman decoding, specifically decode_pair. The implementation read one bit at a time, then scanned the table for a matching Huffman code. I initially tried a Hashtbl, which was much better than the scan, before deciding to use an array lookup instead.
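To make the array-lookup idea concrete, here is a minimal sketch of table-driven prefix decoding. It uses a made-up four-symbol code rather than any of the real MP3 Huffman tables, and the names (codes, table, peek, decode) are hypothetical, not taken from the decoder. The point is only that a window of max_len bits indexes straight into an array holding the symbol and the number of bits to consume.

```ocaml
(* Toy prefix code, NOT an MP3 Huffman table: (code bits, code length, symbol). *)
let codes = [ (0b0, 1, 'a'); (0b10, 2, 'b'); (0b110, 3, 'c'); (0b111, 3, 'd') ]

(* Length of the longest code; the lookup window is this many bits wide. *)
let max_len = 3

(* table.(w) = (symbol, code length) for every max_len-bit window w
   whose leading bits match that symbol's code. *)
let table =
  let t = Array.make (1 lsl max_len) ('?', 0) in
  List.iter
    (fun (code, len, sym) ->
      let pad = max_len - len in
      let first = code lsl pad in
      for w = first to first + (1 lsl pad) - 1 do
        t.(w) <- (sym, len)
      done)
    codes;
  t

(* Peek max_len bits starting at bit position pos in a string of '0'/'1'
   characters; bits past the end read as 0. A real decoder would peek
   from the bitstream instead. *)
let peek bits pos =
  let w = ref 0 in
  for i = 0 to max_len - 1 do
    let b =
      if pos + i < String.length bits && bits.[pos + i] = '1' then 1 else 0
    in
    w := (!w lsl 1) lor b
  done;
  !w

(* Decode by repeated lookup: one array access per symbol, no scanning. *)
let decode bits =
  let buf = Buffer.create 16 in
  let pos = ref 0 in
  while !pos < String.length bits do
    let sym, len = table.(peek bits !pos) in
    Buffer.add_char buf sym;
    pos := !pos + len
  done;
  Buffer.contents buf
```

The table costs 2^max_len entries, so for long codes a real implementation might split it into a first-level table plus overflow tables; that trade-off is not covered here.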
The bitstream operations still accounted for much of the time, but these could be optimised with appropriate Bytes.get_... calls, as the most frequent path is reading 32 bits in big endian layout.
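As a rough illustration of that fast path, the sketch below loads one big-endian 32-bit word with Bytes.get_int32_be and extracts the requested bits with a shift and a mask. The reader record and read_bits are hypothetical names, and the code assumes a 64-bit OCaml, reads of at most 25 bits, and at least four readable bytes at the current byte offset; a real reader would need a slower fallback near the end of the buffer.

```ocaml
(* Hypothetical reader state: the buffer plus a position counted in bits. *)
type reader = { buf : Bytes.t; mutable pos : int }

(* Read n bits (1 <= n <= 25), most significant bit first.
   Assumes a 64-bit OCaml int and at least 4 bytes available at the
   current byte offset; no bounds handling beyond what Bytes performs. *)
let read_bits r n =
  let byte = r.pos lsr 3 in
  let shift = r.pos land 7 in
  (* Load 32 bits big-endian, then mask to undo Int32 sign extension. *)
  let word = Int32.to_int (Bytes.get_int32_be r.buf byte) land 0xFFFF_FFFF in
  r.pos <- r.pos + n;
  (* Drop the bits to the right of the field, then keep the low n bits. *)
  (word lsr (32 - shift - n)) land ((1 lsl n) - 1)
```

With a buffer starting with the bytes FF FB 90 64, reading 11 bits yields the 0x7FF frame sync word, and the following reads pick out the header fields in order.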
The profile now showed find_sfb_long consuming 3.4 billion cycles inside requantization. This function does a linear search through scalefactor band boundaries for every one of the 576 frequency lines, every granule, every frame. Switching to precomputed 576-entry arrays that map each frequency line directly to its scalefactor band index removed that search.
There were some additional tweaks, such as adding more precomputed lookup tables stored in floatarray, using [@inline] and unsafe_get, and land instead of mod.
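A minimal sketch of that change, under a few assumptions: the boundary values are meant to be the long-block scalefactor band boundaries for 44.1 kHz and should be checked against the standard before reuse, find_sfb_linear is a stand-in for the kind of search described above rather than the decoder's actual find_sfb_long, and sfb_of_line is a hypothetical name for the precomputed table.

```ocaml
(* Long-block scalefactor band boundaries, intended to be the 44.1 kHz
   values; verify against the ISO tables before relying on them. *)
let sfb_long_bounds =
  [| 0; 4; 8; 12; 16; 20; 24; 30; 36; 44; 52; 62; 74; 90; 110; 134; 162;
     196; 238; 288; 342; 418; 576 |]

(* Stand-in for a per-line linear search: walk the boundaries until the
   band containing the line is found. Valid for 0 <= line < 576. *)
let find_sfb_linear line =
  let sfb = ref 0 in
  while sfb_long_bounds.(!sfb + 1) <= line do
    incr sfb
  done;
  !sfb

(* Built once at start-up: sfb_of_line.(line) is the band index, so the
   hot loop does a single array read per frequency line. *)
let sfb_of_line =
  let t = Array.make 576 0 in
  for sfb = 0 to Array.length sfb_long_bounds - 2 do
    for line = sfb_long_bounds.(sfb) to sfb_long_bounds.(sfb + 1) - 1 do
      t.(line) <- sfb
    done
  done;
  t
```

One such table would be needed per sample rate (and separately for short blocks), which is still tiny compared with the cost of repeating the search for every line of every granule.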
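Two of these tweaks are easy to show in isolation. The sketch below is illustrative only: cos_table is a guess at what a precomputed IMDCT cosine table for long blocks might look like (36 outputs by 18 inputs, laid out so that entry i * 18 + k holds the cosine for output i and input k), and wrap is a hypothetical helper showing land as a replacement for mod when the modulus is a power of two.

```ocaml
(* Assumed layout: cos_table[i * 18 + k] = cos(pi/72 * (2i + 1 + 18) * (2k + 1)),
   the long-block IMDCT cosine for output i and input k. Stored in a
   floatarray so the values are kept unboxed and contiguous. *)
let cos_table =
  Float.Array.init (36 * 18) (fun idx ->
    let i = idx / 18 and k = idx mod 18 in
    cos (Float.pi /. 72.0 *. float_of_int ((2 * i + 1 + 18) * (2 * k + 1))))

(* Unchecked read; only safe because the caller guarantees
   0 <= i < 36 and 0 <= k < 18. *)
let[@inline] cos_at i k = Float.Array.unsafe_get cos_table (i * 18 + k)

(* For a power-of-two size and a non-negative index, land with (size - 1)
   gives the same result as mod without a division. The two differ for
   negative indices, so this is not a drop-in replacement in general. *)
let[@inline] wrap i = i land 1023
```

Whether each of these helps is workload dependent, so it is worth re-running the profiler after every change rather than applying them blindly.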
After this, no single function dominated the profile, and I could move on to OxCaml.
OxCaml
OxCaml has float#, an unboxed float type that lives in registers, and let mutable for stack-allocated mutable variables. Together, they let you write inner loops where the accumulator never touches the heap:
module F = Stdlib_upstream_compatible.Float_u
let[@inline] imdct_long input =
  for i = 0 to 35 do
    let mutable sum : float# = F.of_float 0.0 in
    for k = 0 to 17 do
      let cos_val = F.of_float (Float.Array.unsafe_get cos_table (i * 18 + k)) in
      let inp_val = F.of_float (Array.unsafe_get input k) in
      sum <- F.add sum (F.mul inp_val cos_val)
    done;
    Array.unsafe_set output i (F.to_float sum)
  done
These kinds of optimisations got me from 2.35s down to 2.01s.
What I felt was missing was an accessor function which returned an unboxed float from a floatarray, so I wouldn’t need to unbox with F.of_float. However, I couldn’t find it.
The httpz parser really benefited from OxCaml’s unboxed types because its hot path operates on small unboxed records that stay entirely in registers:
#{ off: int16#; len: int16# }
Results
The optimisations brought a 29.5s MP3 decoder down to 2.01s. Most of the gain came from standard OCaml optimisations, but OxCaml’s float# saved another ~14%.
| Decoder | Time | vs ffmpeg |
|---|---|---|
| ffmpeg | 0.74s | 1x |
| LAME | 0.81s | 1.1x |
| ocaml-mp3 (original) | 29.5s | 40x |
| ocaml-mp3 (Hashtbl) | 6.4s | 8.6x |
| ocaml-mp3 (flat + fast bitstream) | 3.5s | 4.7x |
| ocaml-mp3 (best) | 2.4s | 3.2x |
| ocaml-mp3 (OxCaml) | 2.0s | 2.7x |