tunbury.org
Distributed ZFS Storage

Following Anil’s note, we will design and implement a distributed storage archive system for ZFS volumes and associated metadata. Metadata here refers to key information about the dataset itself:
Raptor Talos II - POWER9 unreliability

We have two Raptor Computing Talos II POWER9 machines. One of these has had issues for some time and cannot run for more than 20 minutes before locking up completely. Over the last few days, our second machine has exhibited similar issues and needs to be power-cycled every ~24 hours. I spent some time today trying to diagnose the issue with the first machine, removing the motherboard as recommended by Raptor support, to see if the issue still exists with nothing else connected. Sadly, it does. I noted that a firmware update is available, which would move from v2.00 to v2.10.
Equinix Moves

The moves of registry.ci.dev, opam-repo-ci, and get.dune.build have followed the template of OCaml-CI. Notable differences have been that I have hosted get.dune.build in a VM, as the services required very little disk space or CPU/RAM. For opam-repo-ci, the rsync was pretty slow, so I tried running multiple instances using GNU parallel with marginal gains.
Moving OCaml-CI

As noted on Thursday, the various OCaml services will need to be moved away from Equinix. Below are my notes on moving OCaml-CI.
Bluesky SSH Authentication #2

Addressing the glaring omissions from yesterday’s proof of concept, such as the fact that you could sign in as any user, you couldn’t revoke access, all hosts had the same users, and there was no mapping between Bluesky handles and POSIX users, I have updated mtelvers/bluesky-ssh-key-extractor and newly published mtelvers/bluesky-collection.
Bluesky SSH Authentication

If you have signed up to tangled.sh, you will have published your SSH public key on the Bluesky ATproto network. Have a browse to your Bluesky ID, or mine. Look under sh.tangled.publicKey.
Blade Server Reallocation

We have changed our mind about using dm-cache in the SSD/RAID1 configuration. The current thinking is that the mechanical drives would be better served as extra capacity for our distributed ZFS infrastructure, where we intend to have two copies of all data, and these disks represent ~100TB of storage.
OCaml Infra Map

Yesterday, we were talking about extending the current infrastructure database to incorporate other information to provide prompts to return machines to the pool of resources after they have completed their current role/loan, etc. There is also a wider requirement to bring these services back to Cambridge from Equinix/Scaleway, which will be the subject of a follow-up post. However, the idea of extending the database made me think that it would be amusing to overlay the machines’ positions onto Google Maps.
Blade Server Allocation

Equinix has stopped commercial sales of Metal and will sunset the service at the end of June 2026. Equinix has long been a supporter of OCaml and has provided free credits to use on its Metal platform. These credits are coming to an end at the end of this month, meaning that we need to move some of our services away from Equinix. We have two new four-node blade servers, which will become the new home for these services. The blades have dual 10C/20T processors with either 192GB or 256GB of RAM and a combination of SSD and spinning disk.
OCaml < 4.14, Fedora 42 and GCC 15

Late last week, @MisterDA added Fedora 42 support to the Docker base image builder. The new base images attempted to build over the weekend, but there have been a few issues!
Clock winder repair

The galvanised steel wire rope on one of my clock winders has snapped. This is a 3mm rope, so it would have a rating of greater than 500 kg. I am quite surprised that it snapped, as the load on this wire rope is much lower than that of others in use in the same system.
Ubuntu cloud-init with LVM and dm-cache

dm-cache has been part of the mainline Linux kernel for over a decade, making it possible for faster SSD and NVMe drives to be used as a cache within a logical volume. This technology brief from Dell gives a good overview of dm-cache and the performance benefits. Skip to the graph on page 25, noting the logarithmic scale.
Gluster

Gluster is a free and open-source software network filesystem. It has been a few years since I last looked at the project, and I was interested in taking another look. Some features, like automatic tiering of hot/cold data, have been removed, and the developers now recommend dm-cache with LVM instead.
Ubuntu cloud-init

Testing cloud-init is painful on real (server) hardware, as the faster the server, the longer it seems to take to complete POST. Therefore, I highly recommend testing with a virtual machine before moving to real hardware.
Slurm Workload Manager

Sadiq mentioned slurm as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that slurmd and slurmctld can run on the same machine.
GNU Parallel

If you haven’t used it before, or perhaps it has been so long that it has been swapped out to disk, let me commend GNU’s Parallel to you.
Box Diff Tool

Over the weekend, I extended mtelvers/ocaml-box-diff to include the ability to upload files over 50MB. This is a more complex API which requires a call to https://upload.box.com/api/2.0/files/upload_sessions by posting JSON containing the name of the file, the folder ID and the file size. Box replies with various session endpoints which give the URIs to use to upload the parts and to commit the file. Box also specifies the size of each part.
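As a rough illustration of the session flow described above, the following Python sketch builds the JSON body for the session request and splits a file into byte ranges using the part size that Box returns. It makes no network calls. The field names (folder_id, file_size, file_name) and the Content-Range header format are taken from my recollection of the Box chunked upload documentation and should be treated as assumptions; the helper names are hypothetical and are not from ocaml-box-diff.

```python
import json


def session_request(file_name, folder_id, file_size):
    """Build the JSON body posted to the upload_sessions endpoint.

    Field names are assumed from the Box chunked upload documentation.
    """
    return json.dumps(
        {"folder_id": folder_id, "file_size": file_size, "file_name": file_name}
    )


def part_ranges(file_size, part_size):
    """Return inclusive (start, end) byte offsets for each part.

    part_size is the value Box reports when the session is created;
    the final part may be shorter than the others.
    """
    return [
        (start, min(start + part_size, file_size) - 1)
        for start in range(0, file_size, part_size)
    ]


def content_range(start, end, file_size):
    """Format the (assumed) Content-Range header sent with each part."""
    return f"bytes {start}-{end}/{file_size}"


if __name__ == "__main__":
    # Hypothetical 100-byte file with a 40-byte part size, for illustration only.
    for start, end in part_ranges(100, 40):
        print(content_range(start, end, 100))
```

Each range would then be sent to the part upload URI from the session reply, followed by a final request to the commit URI once all parts have been accepted.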
Dell R640 Ubuntu Installation

I could have scripted this via Ansible, but there would always be a manual element, such as configuring the H740P controller and booting from the network to get to the point where you can SSH to the machine. Therefore, I decided to just document the steps required.
Dell R640 installation

Today we have racked the five 14th generation Dell R640 servers and a Dell N4032 switch.
Box API with OCaml and Claude

Over the weekend, I decided to extend my Box tool to incorporate file upload. There is a straightforward POST API for this with a curl one-liner given in the Box documentation. Easy.
opam repo ci job timeouts

It’s Tuesday morning, and virtually all opam repo ci jobs are failing with timeouts. This comes at a critical time as these are the first jobs following the update of ocurrent/ocaml-version noted on 24th March.
More Kingston Drives

We have received the second batch of 40 x 7.68TB Kingston SSD drives, bringing the total to 50 drives.
Ubuntu with ZFS root

The installation of Ubuntu on ZFS contains about 50 steps of detailed configuration. I have 10 servers to install, so I would like to script this process as much as possible.
Updating Docker and Go

For some time, we have had issues on Ubuntu Noble when extracting tar files within Docker containers. See ocaml/infrastructure#121. This is only an issue on exotic architectures like RISCV and PPC64LE.
Installation order for opam packages

Previously, I discussed the installation order for a simple directed acyclic graph. However, opam packages include post dependencies. Rather than package A depending upon B where B would be installed first, post dependencies require X to be installed after Y. The post dependencies only occur in a small number of core OCaml packages. They are quite often empty and exist to direct the solver. Up until now, I had been using a base layer with an opam switch containing the base compiler and, therefore, did not need to deal with any post dependencies.
Box Diff Tool

Box has an unlimited storage model but has an upload limit of 1TB per month. I have been uploading various data silos but would now like to verify that the data is all present. Box has an extensive API, but I only need the list items in folder call.
Dell PowerEdge R640 Storage Server

We have received our first batch of 7.68TB Kingston SSD drives for deployment in some Dell PowerEdge R640 servers, which will be used to create a large storage pool.
FreeBSD 14.2 Upgrade

CI workers spring and summer run FreeBSD and need to be updated.
Topological Sort of Packages

Given a list of packages and their dependencies, what order should those packages be installed in?
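One common answer to this question is Kahn's algorithm: repeatedly install any package whose dependencies are all already installed. The Python sketch below is a minimal illustration under that assumption, not necessarily the approach taken in the post; the function name install_order and the example package names are hypothetical.

```python
from collections import deque


def install_order(deps):
    """Return an installation order for deps, a dict mapping each
    package to the list of packages it depends on.

    Dependencies always appear before the packages that need them.
    Raises ValueError if the graph contains a cycle.
    """
    # Collect every package, including ones only named as a dependency.
    packages = set(deps)
    for ds in deps.values():
        packages.update(ds)

    # remaining[p] holds the dependencies of p that are not yet installed.
    remaining = {p: set(deps.get(p, ())) for p in packages}

    # dependents[d] lists the packages that are waiting on d.
    dependents = {p: [] for p in packages}
    for p, ds in remaining.items():
        for d in ds:
            dependents[d].append(p)

    # Start with packages that have no dependencies (sorted for stable output).
    ready = deque(sorted(p for p, ds in remaining.items() if not ds))
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        for q in sorted(dependents[p]):
            remaining[q].discard(p)
            if not remaining[q]:
                ready.append(q)

    if len(order) != len(packages):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    # Hypothetical packages: a needs b and c, b needs c.
    print(install_order({"a": ["b", "c"], "b": ["c"], "c": []}))
```

Python 3.9 and later also ship graphlib.TopologicalSorter in the standard library, which could replace the hand-written loop.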
Recent OCaml Versions

Following my post on discuss.ocaml.org, I have created a new release of ocurrent/ocaml-version that moves the minimum version of OCaml, considered as recent, from 4.02 to 4.08.
Real Time Trains API

After the Heathrow substation electrical fire, I found myself in Manchester with a long train ride ahead. Checking Real Time Trains for the schedule, I noticed that they had an API. With time to spare, I registered for an account and downloaded the sample code from ocaml-cohttp.
Playing with Cap’n Proto

Cap’n Proto has become a hot topic recently and while this is used for many OCaml-CI services, I spent some time creating a minimal application.
Irmin Database

After Thomas’ talk today I wanted to try Irmin for myself.
Setup Tangled with Bluesky

Bluesky Personal Data Server (PDS)

Today I have set up my own Bluesky (PDS) Personal Data Server.
Pi Day - Archimedes Method

It’s Pi Day 2025
Deepseek R1 on a Raspberry Pi

I’ve heard a lot about Deepseek and wanted to try it for myself.
Arduino PWM Train Controller

Circuit
3d Printed Train

Creating a new OO train body drawn from scratch in Fusion 360 to mimic the original damaged version.
Foot Operated Timer

At the end of a quarter peal there is always the question of how long it took and whether anyone really noted the start time. Mike proposed a foot operated timer.