tunbury.org
ZFS Replication with Ansible

Rather than using the agent-based approach proposed yesterday, it’s worth considering an Ansible-based solution instead.
ZFS System Concept

How would the distributed ZFS storage system look in practical terms? Each machine with a ZFS store would have an agent application installed. Centrally, there would be a tracker server, and users would interact with the system using a CLI tool. The elements will interact with each other using Cap’n Proto capability files.
Opam Health Check with OxCaml

Arthur mentioned that it would be great to know which packages build successfully with OxCaml and which don’t.
Ubuntu 24.04 runc issues with AppArmor

Patrick reported issues with OCaml-CI running tests on ocaml-ppx.
Posthog on OCaml.org

Sabine would like to switch OCaml.org from using Plausible over to Posthog. The underlying reason for the move is that the self-hosted product from Posthog has more features than the equivalent from Plausible. Of particular interest is the heatmap feature to assess the number of visitors who finish the Tour of OCaml.
Worker moves

Following the setup of rosemary with FreeBSD 14 (with 20C/40T), I have paused spring and summer (which combined have 12C/24T) and rosemary is now handling all of the FreeBSD workload.
Debugging OBuilder on macOS

The log from an OBuilder job starts with the steps needed to reproduce the job locally. This boilerplate output assumes that all OBuilder jobs start from a Docker base image, but on some operating systems, such as FreeBSD and macOS, OBuilder uses ZFS base images. On OpenBSD and Windows, it uses QEMU images. The situation is further complicated when the issue only affects a specific architecture that may be unavailable to the user.
Otter Wiki with Raven Authentication

We’d like to have a go using Otter Wiki, but rather than having yet more usernames and passwords, we would like to integrate this into the Raven authentication system. There is a guide on using SAML2 with Apache
iPXE boot for FreeBSD with an UEFI BIOS

I had assumed that booting FreeBSD over the network using iPXE would be pretty simple. There is even a freebsd.ipxe file included with Netboot.xyz. However, I quickly realised that most of the Internet wisdom on this process centred around legacy BIOS rather than UEFI. When booting with UEFI, the Netboot.xyz menu omits the FreeBSD option as it only supports legacy BIOS. Even in legacy mode, it uses memdisk from the Syslinux project rather than a FreeBSD loader.
OS Boot Media with Ventoy

I need to install a chunky Windows application (90GB download, +250 GB install), but all my Windows VMs are pretty small, so I decided to use a spare Dell OptiPlex 7090. It had Windows 10 installed, but it was pretty messy from the previous use, so I decided to install Windows 11. I had a Windows 11 ISO on hand, so I wrote that to a USB memory stick using the Raspberry Pi Imaging tool (effectively dd in this use case). The machine booted without issue, but the installation failed, citing “A media driver your computer needs is missing”. This error looked familiar: a mass storage driver was missing. I often see this in QEMU or similar situations, and it’s also common on server hardware. However, pressing Shift-F10 and opening diskpart showed all my storage.
ZFS Send Streams

We often say that ZFS is an excellent replicated file system, but not the best local filesystem. This led me to think that if we run zfs send on one machine, we might want to write that out as a different filesystem. Is that even possible?
Reconfiguring a system with an mdadm RAID5 root

Cloud providers automatically configure their machines as they expect you to use them. For example, a machine with 4 x 8T disks might come configured with an mdadm RAID5 array spanning the disks. This may be what most people want, but we don’t want this configuration, as we want to see the bare disks. Given you have only a serial console (over SSH) and no access to the cloud-init environment, how do you boot the machine in a different configuration?
Distributed ZFS Storage

Following Anil’s note, we will design and implement a distributed storage archive system for ZFS volumes and associated metadata. Metadata here refers to key information about the dataset itself:
Raptor Talos II - POWER9 unreliability

We have two Raptor Computing Talos II POWER9 machines. One of these has had issues for some time and cannot run for more than 20 minutes before locking up completely. Over the last few days, our second machine has exhibited similar issues and needs to be power-cycled every ~24 hours. I spent some time today trying to diagnose the issue with the first machine, removing the motherboard as recommended by Raptor support, to see if the issue still exists with nothing else connected. Sadly, it does. I noted that a firmware update is available, which would move from v2.00 to v2.10.
Equinix Moves

The moves of registry.ci.dev, opam-repo-ci, and get.dune.build have followed the template of OCaml-CI. Notable differences have been that I have hosted get.dune.build in a VM, as the services required very little disk space or CPU/RAM. For opam-repo-ci, the rsync was pretty slow, so I tried running multiple instances using GNU parallel with marginal gains.
Moving OCaml-CI

As noted on Thursday, the various OCaml services will need to be moved away from Equinix. Below are my notes on moving OCaml-CI.
Bluesky SSH Authentication #2

Addressing the glaring omissions from yesterday’s proof of concept, such as the fact that you could sign in as any user, you couldn’t revoke access, all hosts had the same users, and there was no mapping between Bluesky handles and POSIX users, I have updated mtelvers/bluesky-ssh-key-extractor and newly published mtelvers/bluesky-collection.
Bluesky SSH Authentication

If you have signed up to tangled.sh you will have published your SSH public key on the Bluesky ATproto network. Have a browse to your Bluesky ID, or mine. Look under sh.tangled.publicKey.
Blade Server Reallocation

We have changed our mind about using dm-cache in the SSD/RAID1 configuration. The current thinking is that the mechanical drives would be better served as extra capacity for our distributed ZFS infrastructure, where we intend to have two copies of all data, and these disks represent ~100TB of storage.
OCaml Infra Map

Yesterday, we were talking about extending the current infrastructure database to incorporate other information to provide prompts to return machines to the pool of resources after they have completed their current role/loan, etc. There is also a wider requirement to bring these services back to Cambridge from Equinix/Scaleway, which will be the subject of a follow-up post. However, the idea of extending the database made me think that it would be amusing to overlay the machines’ positions onto Google Maps.
Blade Server Allocation

Equinix has stopped commercial sales of Metal and will sunset the service at the end of June 2026. Equinix has long been a supporter of OCaml and has provided free credits to use on its Metal platform. These credits are coming to an end at the end of this month, meaning that we need to move some of our services away from Equinix. We have two new four-node blade servers, which will become the new home for these services. The blades have dual 10C/20T processors with either 192GB or 256GB of RAM and a combination of SSD and spinning disk.
OCaml < 4.14, Fedora 42 and GCC 15

Late last week, @MisterDA added Fedora 42 support to the Docker base image builder. The new base images attempted to build over the weekend, but there have been a few issues!
Clock winder repair

The galvanised steel wire rope on one of my clock winders has snapped. This is a 3mm rope, so it would have a rating of greater than 500 kg. I am quite surprised that it snapped, as the load on this wire rope is much lower than that of others in use in the same system.
Ubuntu cloud-init with LVM and dm-cache

dm-cache has been part of the mainline Linux kernel for over a decade, making it possible for faster SSD and NVMe drives to be used as a cache within a logical volume. This technology brief from Dell gives a good overview of dm-cache and the performance benefits. Skip to the graph on page 25, noting the logarithmic scale.
Gluster

Gluster is a free and open-source software network filesystem. It has been a few years since I last looked at the project, and I was interested in taking another look. Some features, like automatic tiering of hot/cold data, have been removed, and the developers now recommend dm-cache with LVM instead.
Ubuntu cloud-init

Testing cloud-init is painful on real (server) hardware, as the faster the server, the longer it seems to take to complete POST. Therefore, I highly recommend testing with a virtual machine before moving to real hardware.
Slurm Workload Manager

Sadiq mentioned slurm as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that slurmd and slurmctld can run on the same machine.
GNU Parallel

If you haven’t used it before, or perhaps it has been so long that it has been swapped out to disk, let me commend GNU’s Parallel to you.
Box Diff Tool

Over the weekend, I extended mtelvers/ocaml-box-diff to include the ability to upload files over 50MB. This is a more complex API which requires a call to https://upload.box.com/api/2.0/files/upload_sessions by posting JSON containing the name of the file, the folder ID and the file size. Box replies with various session endpoints which give the URIs to use to upload the parts and to commit the file. Box also specifies the size of each part.
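As a rough illustration of the chunking step, the sketch below splits a file of a given size into parts of the size the session reports. The function name `part_ranges` is hypothetical, and the `bytes first-last/total` label is an assumption about how each part would be identified when it is uploaded. No network calls are made, and this is not taken from mtelvers/ocaml-box-diff.

```python
def part_ranges(file_size, part_size):
    """Return (offset, length, content_range) for each part of an upload.

    file_size and part_size are byte counts; part_size is assumed to be the
    value returned when the upload session is created.
    """
    if file_size <= 0 or part_size <= 0:
        raise ValueError("file_size and part_size must be positive")

    parts = []
    offset = 0
    while offset < file_size:
        # The final part may be shorter than part_size.
        length = min(part_size, file_size - offset)
        last = offset + length - 1
        # Assumed label format: inclusive byte range over the total size.
        parts.append((offset, length, f"bytes {offset}-{last}/{file_size}"))
        offset += length
    return parts


if __name__ == "__main__":
    for offset, length, label in part_ranges(10, 4):
        print(offset, length, label)
```

Each tuple gives the position to seek to in the file, the number of bytes to read, and the range label for that part.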
Dell R640 Ubuntu Installation

I could have scripted this via Ansible, but there would always be a manual element, such as configuring the H740P controller and booting from the network to get to the point where you can SSH to the machine. Therefore, I decided to just document the steps required.
Dell R640 installation

Today we have racked the five 14th generation Dell R640 servers and a Dell N4032 switch.
Box API with OCaml and Claude

Over the weekend, I decided to extend my Box tool to incorporate file upload. There is a straightforward POST API for this with a curl one-liner given in the Box documentation. Easy.
opam repo ci job timeouts

It’s Tuesday morning, and virtually all opam repo ci jobs are failing with timeouts. This comes at a critical time as these are the first jobs following the update of ocurrent/ocaml-version noted on 24th March.
More Kingston Drives

We have received the second batch of 40 x 7.68TB Kingston SSD drives, bringing the total to 50 drives.
Ubuntu with ZFS root

The installation of Ubuntu on ZFS contains about 50 steps of detailed configuration. I have 10 servers to install, so I would like to script this process as much as possible.
Updating Docker and Go

For some time, we have had issues on Ubuntu Noble when extracting tar files within Docker containers. See ocaml/infrastructure#121. This is only an issue on exotic architectures like RISCV and PPC64LE.
Installation order for opam packages

Previously, I discussed the installation order for a simple directed acyclic graph. However, opam packages include post dependencies. Rather than package A depending upon B where B would be installed first, post dependencies require X to be installed after Y. The post dependencies only occur in a small number of core OCaml packages. They are quite often empty and exist to direct the solver. Up until now, I had been using a base layer with an opam switch containing the base compiler and, therefore, did not need to deal with any post dependencies.
Box Diff Tool

Box has an unlimited storage model but has an upload limit of 1TB per month. I have been uploading various data silos but would now like to verify that the data is all present. Box has an extensive API, but I only need the list items in folder call.
Dell PowerEdge R640 Storage Server

We have received our first batch of 7.68TB Kingston SSD drives for deployment in some Dell PowerEdge R640 servers, which will be used to create a large storage pool.
FreeBSD 14.2 Upgrade

CI workers spring and summer run FreeBSD and need to be updated.
Topological Sort of Packages

Given a list of packages and their dependencies, what order should those packages be installed in?
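One common answer to this question is a topological sort. The sketch below uses Kahn's algorithm, which is an assumption here rather than necessarily the approach taken in the post. The function name `install_order` and the package names in the usage example are purely illustrative.

```python
from collections import deque


def install_order(deps):
    """Return an order in which every package follows its dependencies.

    deps maps each package name to an iterable of the packages it needs.
    Raises ValueError if the dependencies contain a cycle.
    """
    # Collect every package, including ones that only appear as a dependency.
    packages = set(deps)
    for requires in deps.values():
        packages.update(requires)

    # remaining[p] counts the dependencies of p that are not yet placed.
    remaining = {p: len(set(deps.get(p, ()))) for p in packages}
    # dependants[d] lists the packages waiting on d.
    dependants = {p: [] for p in packages}
    for pkg, requires in deps.items():
        for dep in set(requires):
            dependants[dep].append(pkg)

    # Start with packages that need nothing; sort for a deterministic result.
    ready = deque(sorted(p for p in packages if remaining[p] == 0))
    order = []
    while ready:
        pkg = ready.popleft()
        order.append(pkg)
        for waiting in sorted(dependants[pkg]):
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    if len(order) != len(packages):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    print(install_order({"lwt": ["dune", "ocaml"], "dune": ["ocaml"], "ocaml": []}))
```

A package only joins the queue once everything it depends on has been placed, so any order the function returns is safe to install from front to back.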
Recent OCaml Versions

Following my post on discuss.ocaml.org, I have created a new release of ocurrent/ocaml-version that moves the minimum version of OCaml, considered as recent, from 4.02 to 4.08.
Real Time Trains API

After the Heathrow substation electrical fire, I found myself in Manchester with a long train ride ahead. Checking on Real Time Trains for the schedule I noticed that they had an API. With time to spare, I registered for an account and downloaded the sample code from ocaml-cohttp.
Playing with Cap’n Proto

Cap’n Proto has become a hot topic recently and while this is used for many OCaml-CI services, I spent some time creating a minimal application.
Irmin Database

After Thomas’ talk today I wanted to try Irmin for myself.
Setup Tangled with Bluesky

Bluesky Personal Data Server (PDS)

Today I have set up my own Bluesky (PDS) Personal Data Server.
Pi Day - Archimedes Method

It’s Pi Day 2025
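For readers unfamiliar with the method in the title, here is a minimal sketch of the classical polygon-doubling scheme, assuming a circle of diameter 1 so that the polygon perimeters bracket pi directly. It starts from hexagons and may differ from what the post does. Note that it leans on `math.sqrt`, whereas Archimedes approximated the square roots by hand.

```python
import math


def archimedes_bounds(doublings):
    """Return (lower, upper) bounds on pi from regular polygon perimeters.

    Starts with hexagons around a circle of diameter 1 and doubles the
    number of sides `doublings` times.
    """
    upper = 2 * math.sqrt(3)  # perimeter of the circumscribed hexagon
    lower = 3.0               # perimeter of the inscribed hexagon
    for _ in range(doublings):
        # Circumscribed 2n-gon: harmonic mean of the two n-gon perimeters.
        upper = 2 * upper * lower / (upper + lower)
        # Inscribed 2n-gon: geometric mean of the new upper and old lower.
        lower = math.sqrt(upper * lower)
    return lower, upper


if __name__ == "__main__":
    # Four doublings take the hexagon to the 96-sided polygon.
    print(archimedes_bounds(4))
```

Four doublings reach the 96-sided polygons, which is where the familiar bounds 223/71 < pi < 22/7 come from.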
Deepseek R1 on a Raspberry Pi

I’ve heard a lot about Deepseek and wanted to try it for myself.
Arduino PWM Train Controller

Circuit
3d Printed Train

Creating a new OO train body drawn from scratch in Fusion 360 to mimic the original damaged version.
Foot Operated Timer

At the end of a quarter peal there is always the question of how long it took and whether anyone really noted the start time. Mike proposed a foot operated timer.