tunbury.org

Distributed ZFS Storage

Following Anil’s note, we will design and implement a distributed storage archive system for ZFS volumes and associated metadata. Metadata here refers to key information about the dataset itself:

1 min read

Raptor Talos II - POWER9 unreliability

We have two Raptor Computing Talos II POWER9 machines. One of these has had issues for some time and cannot run for more than 20 minutes before locking up completely. Over the last few days, our second machine has exhibited similar issues and needs to be power-cycled every ~24 hours. I spent some time today trying to diagnose the issue with the first machine, removing the motherboard as recommended by Raptor support, to see if the issue still exists with nothing else connected. Sadly, it does. I noted that a firmware update is available, which would move from v2.00 to v2.10.

~1 min read

Equinix Moves

The moves of registry.ci.dev, opam-repo-ci, and get.dune.build have followed the template of OCaml-CI. Notable differences have been that I have hosted get.dune.build in a VM, as the services required very little disk space or CPU/RAM. For opam-repo-ci, the rsync was pretty slow, so I tried running multiple instances using GNU parallel with marginal gains.

~1 min read

Moving OCaml-CI

As noted on Thursday, the various OCaml services will need to be moved away from Equinix. Below are my notes on moving OCaml-CI.

1 min read

Bluesky SSH Authentication

If you have sign up to tangled.sh you will have published your SSH public key on the Bluesky ATproto network. Have a browse to your Bluesky ID, or mine. Look under sh.tangled.publicKey.

~1 min read

Blade Server Reallocation

We have changed our mind about using dm-cache in the SSD/RAID1 configuration. The current thinking is that the mechanical drives would be better served as extra capacity for our distributed ZFS infrastructure, where we intend to have two copies of all data, and these disks represent ~100TB of storage.

~1 min read

OCaml Infra Map

Yesterday, we were talking about extending the current infrastructure database to incorporate other information to provide prompts to return machines to the pool of resources after they have completed their current role/loan, etc. There is also a wider requirement to bring these services back to Cambridge from Equinix/Scaleway, which will be the subject of a follow-up post. However, the idea of extending the database made me think that it would be amusing to overlay the machine’s positions onto Google Maps.

1 min read

Blade Server Allocation

Equinix has stopped commercial sales of Metal and will sunset the service at the end of June 2026. Equinix have long been a supporter of OCaml and has provided free credits to use on their Metal platform. These credits are coming to an end at the end of this month, meaning that we need to move some of our services away from Equinix. We have two new four-node blade servers, which will become the new home for these services. The blades have dual 10C/20T processors with either 192GB or 256GB of RAM and a combination of SSD and spinning disk.

1 min read

Clock winder repair

The galvanised steel wire rope on one of my clock winders has snapped. This is a 3mm rope, so it would have a rating of greater than 500 kg. I am quite surprised that it snapped, as the load on this wire rope is much lower than that of others in use in the same system.

1 min read

Gluster

Gluster is a free and open-source software network filesystem. It has been a few years since I last looked at the project, and I was interested in taking another look. Some features, like automatic tiering of hot/cold data, have been removed, and the developers now recommend dm-cache with LVM instead.

3 min read

Ubuntu cloud-init

Testing cloud-init is painful on real (server) hardware, as the faster the server, the longer it seems to take to complete POST. Therefore, I highly recommend testing with a virtual machine before moving to real hardware.

8 min read

Slurm Workload Manager

Sadiq mentioned slurm as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that slurmd and slurmctld can run on the same machine.

1 min read

GNU Parallel

If you haven’t used it before, or perhaps it has been so long that it has been swapped out to disk, let me commend GNU’s Parallel to you.

~1 min read

Box Diff Tool

Over the weekend, I extended mtelvers/ocaml-box-diff to include the ability to upload files over 50MB. This is a more complex API which requires a call to https://upload.box.com/api/2.0/files/upload_sessions by posting JSON containing the name of the file, the folder ID and the file size. Box replies with various session endpoints which give the URIs to use to upload the parts and to commit the the file. Box also specifies the size of each part.

~1 min read

Dell R640 Ubuntu Installation

I could have scripted this via Ansible, but there would always be a manual element, such as configuring the H740P controller and booting from the network to get to the point where you can SSH to the machine. Therefore, I decided to just document the steps required.

2 min read

Dell R640 installation

Today we have racked the five 14th generation Dell R640 servers and a Dell N4032 switch.

1 min read

opam repo ci job timeouts

It’s Tuesday morning, and virtually all opam repo ci jobs are failing with timeouts. This comes at a critical time as these are the first jobs following the update of ocurrent/ocaml-version noted on 24th March.

3 min read

More Kingston Drives

We have received the second batch of 40 x 7.68TB Kingston SSD drives, bringing the total to 50 drives.

~1 min read

Ubuntu with ZFS root

The installation of Ubuntu on ZFS contains about 50 steps of detailed configuration. I have 10 servers to install, so I would like to script this process as much as possible.

~1 min read

Updating Docker and Go

For some time, we have had issues on Ubuntu Noble when extracting tar files within Docker containers. See ocaml/infrastructure#121. This is only an issue on exotic architectures like RISCV and PPC64LE.

~1 min read

Installation order for opam packages

Previously, I discussed the installation order for a simple directed acyclic graph without any cycles. However, opam packages include post dependencies. Rather than package A depending upon B where B would be installed first, post dependencies require X to be installed after Y. The post dependencies only occur in a small number of core OCaml packages. They are quite often empty and exist to direct the solver. Up until now, I had been using a base layer with an opam switch containing the base compiler and, therefore, did not need to deal with any post dependencies.

1 min read

Box Diff Tool

Box has an unlimited storage model but has an upload limit of 1TB per month. I have been uploading various data silos but would now like to verify that the data is all present. Box has an extensive API, but I only need the list items in folder call.

1 min read

Dell PowerEdge R640 Storage Server

We have received our first batch of 7.68TB Kingston SSD drives for deployment in some Dell PowerEdge R640 servers, which will be used to create a large storage pool.

~1 min read

FreeBSD 14.2 Upgrade

CI workers spring and summer run FreeBSD and need to be updated.

1 min read

Topological Sort of Packages

Given a list of packages and their dependencies, what order should those packages be installed in?

1 min read

Real Time Trains API

After the Heathrow substation electrical fire, I found myself in Manchester with a long train ride ahead. Checking on Real Time Trains for the schedule I noticed that they had an API. With time to spare, I registered for an account and downloaded the sample code from ocaml-cohttp.

2 min read

Playing with Cap’n Proto

Cap’n Proto has become a hot topic recently and while this is used for many OCaml-CI services, I spent some time creating a minimal application.

2 min read

Irmin Database

After Thomas’ talk today I wanted to try Irmin for myself.

~1 min read

Setup Tangled with Bluesky

To setup this up, I’m using a modified version of Anil’s repo. My repo is here. Firstly, clone the repo and run gen-key.sh.

1 min read

3d Printed Train

Creating a new OO train body drawn from scratch in Fusion 360 to minic the original damaged version.

~1 min read

Foot Operated Timer

At the end of a quarter peal there is always the question of how long it took and whether anyone really noted the start time. Mike proposed a foot operated timer.

2 min read
Back to Top ↑