While I typically write posts as I go, I’ve only managed one this week because there’s been a lot going on. To catch up, I’m going to try to write a summary of the week instead.

OCaml 5.5.0 (and 4.14.4)

OCaml 5.5.0 has been released! The rollout starts with ocurrent/ocaml-version, the library most of our CI uses to know which compiler versions exist and what to test. It was really a double bump, with a new maintenance release alongside the major one: I’ve been rolling 4.14.3 -> 4.14.4 across everything at the same time as 5.5.0.

Florian did the actual changes (ocurrent/ocaml-version#91 for 4.14.4 and #92 for 5.5.0), and I merged those and published ocaml-version.4.1.3 into opam-repository (ocaml/opam-repository#30077). From there, it cascades in a specific order. First, the base images have to be built against the new ocaml-version: ocurrent/docker-base-images#355 covers Linux and Windows, both of which build as Docker containers.

The other platforms, which don’t use Docker, need to be handled separately. Ansible playbooks exist for macOS and FreeBSD to build under ZFS (obuilder takes a snapshot per step), while OpenBSD has a Makefile to rebuild the QEMU image. All of this has to happen before the CI systems are touched, because the moment a CI requests a compiler version whose base image doesn’t yet exist, the jobs simply fail.

The FreeBSD images were updated in ocurrent/freebsd-infra#22. The OpenBSD images (built under QEMU on oregano) were updated in ocurrent/obuilder#211. The macOS workers got the same 4.14.4 / 5.5.0 treatment via ocurrent/macos-infra#59.

Only once every base image - Docker (Linux/Windows), FreeBSD, OpenBSD and macOS - is in place can the CI systems themselves be refreshed to expect the new versions: ocurrent/ocaml-ci#1059 and ocurrent/opam-repo-ci#475.
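The ordering constraint above is essentially a dependency graph. As a minimal sketch in Python (the node names are informal labels rather than real job names, and the edges are a reading of the description above), a topological sort gives one valid rollout order:

```python
from graphlib import TopologicalSorter

# Informal labels only. Each key must wait for everything in its set.
IMAGES = {
    "docker base images (Linux/Windows)",
    "freebsd images",
    "openbsd images",
    "macos workers",
}
ROLLOUT_DEPS = {
    "opam-repository release": {"ocaml-version"},
    **{image: {"opam-repository release"} for image in IMAGES},
    "ocaml-ci": set(IMAGES),
    "opam-repo-ci": set(IMAGES),
}


def rollout_order(deps):
    """Return one order in which every step comes after its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


for step in rollout_order(ROLLOUT_DEPS):
    print(step)
```

Any order it prints puts the CI refreshes after all four image builds, which is the property that keeps jobs from requesting a base image that does not exist yet.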

The quiet win of the week was building multi-arch images with buildx imagetools create, which I added last week in ocurrent/ocurrent#474 to replace the older docker manifest. The older style failed on roughly half the jobs on the first pass, and I’d been retrying them, while the new style worked on every single image on the first try! However, this change meant that I needed to add the Docker buildx tools to the base image builder, which I did in ocurrent/docker-base-images#356.
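For readers who haven't used it, the command combines per-architecture images into one multi-arch tag. A small Python sketch of assembling the invocation (the image names below are made up for illustration, and this is not the OCurrent code):

```python
def imagetools_create_argv(target, per_arch_images):
    """Build the argv for `docker buildx imagetools create`.

    `target` is the multi-arch tag to publish; `per_arch_images` are the
    already-pushed single-architecture images it should point at.
    """
    if not per_arch_images:
        raise ValueError("need at least one source image")
    return [
        "docker", "buildx", "imagetools", "create",
        "--tag", target,
        *per_arch_images,
    ]


# Hypothetical image names, for illustration only.
argv = imagetools_create_argv(
    "example/opam:debian-13-ocaml-5.5",
    ["example/opam:debian-13-ocaml-5.5-amd64",
     "example/opam:debian-13-ocaml-5.5-arm64"],
)
print(" ".join(argv))
```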

OCaml-CI housekeeping

With the version bump done, I spent some time clearing out the backlog of in-progress changes and PRs in OCaml-CI:

ocurrent/ocaml-ci#1054, which removes opam 2.0 support.
ocurrent/ocaml-ci#1060, which switches to using the builtin-0install solver in Docker builds. This resolves in seconds, allowing me to drop the 900-second timeout I needed on the default solver.
ocurrent/ocaml-ci#1058, which is a workaround for slow builds on Windows and RISC-V: it triples the per-job timeout.
ocurrent/ocaml-ci#1057, which fixes the solver’s dummy job as noted in OCaml CI lingering solves.

The FreeBSD jail leak

I diagnosed the leaking git-daemon processes earlier in the week: orphaned daemons were reparenting to host PID 1 and pinning jails open forever. The fix is ocurrent/obuilder#210, which forces jail -r on success as well as failure.
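The shape of that fix is "always clean up, whatever the outcome". obuilder itself is OCaml; the following is only a Python sketch of the pattern, with a hypothetical run_step_in_jail helper and an injectable runner so it can be exercised without a FreeBSD host:

```python
import subprocess


def run_step_in_jail(jail_name, build_cmd, run=subprocess.run):
    """Run one build step, then remove the jail whether or not it succeeded.

    `build_cmd` is whatever command launches the step; `run` defaults to
    subprocess.run and can be swapped for a fake in tests.
    """
    try:
        return run(build_cmd, check=True)
    finally:
        # jail -r removes the jail and kills any processes still inside it,
        # so daemons left behind by the step cannot pin it open.
        run(["jail", "-r", jail_name], check=False)
```

The cleanup sits in the finally block, so it runs on the success path too, which is the case that was previously leaking.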

That obuilder change is the proper fix: the build system has to reap whatever a build step leaves behind, regardless of how well-behaved the package under test is. Tracing which package was spawning the daemons led me to git-kv, whose test suite starts a throwaway git server and polls for readiness with lsof, which isn’t installed on FreeBSD by default. Still, while I was there I opened robur-coop/git-kv#15 and followed it with robur-coop/git-kv#16, replacing the lsof probe with a portable socket poll so the test suite no longer passes by accident when lsof is missing.
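A socket poll of this kind needs nothing beyond the standard socket API. The git-kv change itself lives in the PRs above; this is just a Python illustration of the idea, with a hypothetical wait_for_port helper:

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once the server accepts a connection, or False if it is
    still unreachable when `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Unlike an lsof probe, this fails loudly when the server never comes up, and it has no dependency on a tool that may not be installed.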

Moving MirageOS infrastructure off Equinix

Equinix is discontinuing its bare-metal product, and the box is being switched off. infra-2 has been up for 1308 days without a reboot, quietly running the authoritative DNS infrastructure as Mirage unikernels under albatross. It hosts not just the websites (www serves both mirageos.org and mirage.io, with next as the staging site) but also ns, which is ns0.mirageos.org, the git-backed primary authoritative DNS server for mirageos.org, mirage.io and openmirage.org, plus le for Let’s Encrypt provisioning and a secondary for another domain.

The unikernels themselves were the easy part, as they’re stateless with the zone data in GitHub-backed repos, so migrating them is really “new host, new IPs, re-run the deploy scripts.” The Equinix-routed /28 couldn’t migrate, so every public IP changed, which meant registrar glue updates for ns0.mirageos.org. Hannes at Robur was instrumental in the off-site secondary name server updates. The cutover completed on 2026-06-24: ns, www and next now run under albatross on dopey.caelum.ci.dev in Cambridge, with public DNS and HTTPS verified.

The deploy pipeline is unchanged (OCluster build -> push to ocurrentbuilder/staging -> crane export -> rsync -> mirage-redeploy); only the SSH target host differs, so the deployer was repointed in ocurrent/ocurrent-deployer#268 and the create-config.sh keyscan now targets dopey rather than the old Equinix IP.

mirage-ci

Separately, and unrelated to the move, Edwin prompted me to merge Hannes’ ocurrent/mirage-ci#56 and followed it up with ocurrent/mirage-ci#57. However, merging was the easy part, as the service didn’t restart! The engine crash-looped on OCurrent’s “set to different values in the same step” error.

The cause was a latent collision that the restart happened to expose. mirage-ci runs two test sets: “released” (skeleton main + released mirage) and “edge” (skeleton dev + mirage main + mirage-dev), with both reporting GitHub commit status under the same context. Whenever main and dev were on the same commit, the two writers tried to set that one context to different values in the same engine step, which OCurrent rightly refuses, taking the whole engine down with it. The fix in ocurrent/mirage-ci#60 threads a per-test-set label into the context, Mirage CI (released) - <platform> and Mirage CI (edge) - <platform>, so the two writers target distinct checks and never collide. Builds are unchanged and still deduplicated by the cluster cache; the only catch is that the rename means any branch-protection rules would need to be updated, but it turned out that there weren’t any!
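The collision and the fix can be modelled in a few lines. This is a toy Python model and not OCurrent's implementation: StatusStep, StepConflict and the unlabelled context string are all invented for illustration, and only the labelled context format comes from the description above.

```python
class StepConflict(Exception):
    """Raised when one context gets two different values in a single step."""


class StatusStep:
    """Toy stand-in for the set of commit-status writes in one engine step."""

    def __init__(self):
        self._values = {}

    def set(self, context, value):
        previous = self._values.setdefault(context, value)
        if previous != value:
            raise StepConflict(
                f"{context!r} set to different values in the same step")


def context_for(platform, test_set=None):
    # The unlabelled form is a guess at the old shape; the labelled form
    # follows the format quoted in the text.
    if test_set is None:
        return f"Mirage CI - {platform}"
    return f"Mirage CI ({test_set}) - {platform}"


step = StatusStep()
step.set(context_for("unix", "released"), "passed")
step.set(context_for("unix", "edge"), "failed")  # distinct contexts: no clash
```

With the label omitted, the same two writes land on one key and raise StepConflict, which mirrors the crash described above.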

Edwin also noted that the CI builds ran on different base images. This is a caching issue where ocurrent/obuilder caches the layer (from ocaml/opam:debian-13-ocaml-4.14) and doesn’t refresh it as new images are released. OCaml-CI has long since solved this problem by fetching the SHA of the Docker images on a schedule, changing the spec to (from ocaml/opam:debian-13-ocaml-4.14@sha256:....). I ported this to Mirage CI in ocurrent/mirage-ci#59.
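Pinning works because a Docker reference of the form name@sha256:<hex> names one exact image, so the cache key changes whenever the digest does. A minimal Python sketch of the rewrite step, using a hypothetical pin_from helper and a placeholder digest rather than a real one:

```python
def pin_from(spec, image, digest):
    """Rewrite `(from image)` in an obuilder-style spec to a digest-pinned form."""
    unpinned = f"(from {image})"
    if unpinned not in spec:
        raise ValueError(f"spec has no unpinned stage for {image}")
    return spec.replace(unpinned, f"(from {image}@{digest})", 1)


# Placeholder digest: a real one would be fetched from the registry on a schedule.
PLACEHOLDER_DIGEST = "sha256:" + "0" * 64

spec = "((from ocaml/opam:debian-13-ocaml-4.14))"
print(pin_from(spec, "ocaml/opam:debian-13-ocaml-4.14", PLACEHOLDER_DIGEST))
```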

Tessera v1.1

Work continues on running the Tessera v1.1 inference pipeline via my OCaml/Eio orchestrator mtelvers/genesis, which issues 0.1° grid cells over a selected country or ROI polygon to remote inference workers running the Tessera model.
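To give a feel for the cell enumeration, here is a simplified Python sketch. It covers only a bounding box (genesis works from polygons, and this is not its code), and it identifies each cell by integer indices in tenths of a degree, which is an assumption made here to sidestep floating-point drift:

```python
import math

CELLS_PER_DEGREE = 10  # 0.1 degree cells


def grid_cells(lon_min, lat_min, lon_max, lat_max):
    """Yield (ix, iy) indices of every 0.1 degree cell touching the box.

    A cell's south-west corner is (ix / 10, iy / 10) degrees.
    """
    def first(v):
        return math.floor(round(v * CELLS_PER_DEGREE, 9))

    def stop(v):
        return math.ceil(round(v * CELLS_PER_DEGREE, 9))

    for iy in range(first(lat_min), stop(lat_max)):
        for ix in range(first(lon_min), stop(lon_max)):
            yield ix, iy


cells = list(grid_cells(0.0, 51.0, 0.2, 51.1))
print(len(cells), "cells to dispatch")
```

A polygon version would filter these candidates by intersection with the ROI before handing them to workers.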

The model happily runs on A10 GPUs, but A10 capacity is in short supply, and my 100 A10 worker machines struggle to maintain 50% online at any given time due to spot evictions. So I turned to Intel AMX CPU VMs instead. They are plentiful and virtually never get spot-evicted. I ran some AMX sizing experiments, where smaller VMs turned out to do more work per dollar because AMX usage isn’t wide enough. I spun up the AMX fleet using any available capacity across Europe.

I’ve continued to work through the embedding requests which arrive as issues at ucam-eo/geotessera. There is high demand!

The server was struggling a bit with some of the early choices I made about thumbnail generation. One PNG per tile seems like a good idea until you have a million of them. Therefore, I redesigned it using a per-year mmap’d atlas, which is a single sparse file holding all the thumbnails.
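The atlas idea can be sketched with fixed-size slots in one memory-mapped file. The layout here (4096-byte slots with a 4-byte length prefix, indexed by a slot number) is an assumption for illustration and not the format the server uses:

```python
import mmap
import os

SLOT_BYTES = 4096   # assumed fixed slot size per thumbnail
LEN_PREFIX = 4      # little-endian payload length stored at the slot start


def open_atlas(path, slots):
    """Create or open an atlas file and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Extending with ftruncate writes no data, so untouched slots stay
        # as holes on filesystems that support sparse files.
        os.ftruncate(fd, slots * SLOT_BYTES)
        return mmap.mmap(fd, slots * SLOT_BYTES)
    finally:
        os.close(fd)  # the mapping keeps its own handle


def put(atlas, slot, data):
    if len(data) > SLOT_BYTES - LEN_PREFIX:
        raise ValueError("thumbnail too large for slot")
    off = slot * SLOT_BYTES
    atlas[off:off + LEN_PREFIX] = len(data).to_bytes(LEN_PREFIX, "little")
    atlas[off + LEN_PREFIX:off + LEN_PREFIX + len(data)] = data


def get(atlas, slot):
    off = slot * SLOT_BYTES
    size = int.from_bytes(atlas[off:off + LEN_PREFIX], "little")
    if size == 0:
        return None  # empty slot: nothing was ever written here
    return bytes(atlas[off + LEN_PREFIX:off + LEN_PREFIX + size])
```

Lookups become an offset calculation into one file instead of a directory walk over a million small PNGs.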

Running quietly alongside the v1.1 generation, there’s also some early, exploratory work against a pre-release Tessera v2 model, which I’ve been running on a secondary Genesis server backed by the Vultr machines.

An OCaml S3 client

Storage has been really tight. The disk was full when I started, but I still found space. Now, with more than 75TB of embeddings, there’s nothing left to find, and I have moved the primary storage to Scaleway. Of course, this meant migrating those 75TB to Scaleway. I have a Scaleway CephFS cluster, which I repurposed as an S3 object store running RGW.

Rather than saving to disk and shelling out to s5cmd, I searched opam and found an existing OCaml library, abdufelsayed/awskit. I integrated it into Genesis and pushed some Tessera traffic through it on a small-scale trial. Listing an embeddings bucket of 1,426,671 objects exposed some weaknesses: s5cmd walked the whole thing in under five minutes, while awskit took what felt like a lifetime per page. Underneath was a cluster of cohttp-eio body-reading bugs: responses with no body (HEAD, 204, 304) weren’t treated as bodiless, so the client sat waiting to drain a body that was never coming; chunked keep-alive responses were read past EOF, producing a 65-second per-page stall; and the upload path wasn’t respecting how much single_read had actually filled the buffer.
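Two of those bugs are easy to state as rules. The Python sketch below is not the awskit or cohttp-eio code; it shows the bodiless-response check (which also covers 1xx responses, per the HTTP semantics spec) and a copy loop that forwards only the bytes each read actually returned:

```python
import io


def response_has_body(method, status):
    """Return False for responses that never carry a body."""
    if method.upper() == "HEAD":
        return False
    if 100 <= status < 200 or status in (204, 304):
        return False
    return True


def copy_exact(src, dst, bufsize=65536):
    """Copy src to dst, writing only the filled part of the buffer each time."""
    buf = bytearray(bufsize)
    view = memoryview(buf)
    total = 0
    while True:
        n = src.readinto(buf)   # may fill less than the whole buffer
        if not n:
            break               # end of stream
        dst.write(view[:n])     # never the stale tail of the buffer
        total += n
    return total


payload = b"x" * 100_000
out = io.BytesIO()
copied = copy_exact(io.BytesIO(payload), out)
```

Skipping the body read for bodiless responses is what stops a keep-alive client from waiting on data that will never arrive.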

I sent three fixes upstream: abdufelsayed/awskit#10, #11 and #12. Then I found a file descriptor leak which slowly accumulated until the process ran out. At that point, it was time to stop patching and own the problem. I didn’t need the complexity of awskit, so I wrote mtelvers/ocaml-s3, a small, dependency-light Amazon S3 / Ceph RGW compatible object-storage client built on Eio and cohttp-eio, with s3cli mirroring some s5cmd functions for testing. I moved Genesis over to the dedicated ocaml-s3 client.

Monitoring odds and ends

A few things to mention. My monitoring tool mtelvers/mon is back: I used to run ocurrent/ocurrent-observer, which did periodic dig + curl checks, but that only gives a snapshot, not a history. I’d started a Prometheus-based monitor when I was trying to find the unreliability in get.dune.build and then stopped it when get.dune.build moved behind the Cloudflare CDN, as that messed up my per-IP status model. I added a pooled mode which allows this to be handled gracefully.

Cuihtlauac noticed that the certificate on watch.ocaml.org had expired. Weirdly, this wasn’t because the renewal had failed, but because nginx was still serving the old certificate. I added a periodic reload to nginx so a freshly renewed certificate always gets picked up.

My RISC-V QEMU worker pool needed some attention due to disk space exhaustion, as the switch to containerd moved the Docker partition outside of the normal ocurrent/ocluster monitoring path. I now bind mount the “docker partition” to both /var/lib/docker and /var/lib/containerd.

Late Thursday, I noticed that ocaml-opam/opam2web, which generates the opam index at opam.ocaml.org, had started failing to build. I poked around and noticed that the failure started after some commits to opam which added a dependency on dune-local. Fortunately, Kate was still at her keyboard and kindly turned around a fix in ocaml-opam/opam2web#257.

Notes from week 26
