Deploying Ubuntu 26.04 on the CI workers

As mentioned in the Docker 29 post, I have been moving the Linux workers to Ubuntu 26.04 so we can test io_uring on the 7.0 kernel. There are eighteen Linux workers across four architectures: linux-x86_64, linux-arm64, linux-ppc64 and linux-s390x.

Right now that makes us early adopters: 26.04 LTS, aka “Resolute Raccoon”, doesn’t become the standard upgrade path until the first point release, 26.04.1. You can see this in the meta-release file the upgrader consults:

$ curl -s https://changelogs.ubuntu.com/meta-release-lts | grep -A3 resolute
Dist: resolute
Name: Resolute Raccoon
Version: 26.04 LTS
Supported: 0

Supported: 0, so a plain do-release-upgrade reports “no new release found”. The development flag overrides this:

$ do-release-upgrade -d -c
Checking for a new Ubuntu release
New release '26.04 LTS' available.
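When checking many workers it can be handy to script the Supported lookup rather than eyeball the grep output. The following is a minimal sketch, assuming the file keeps the stanza layout shown above (a Dist: line followed later by a Supported: line); release_supported is a hypothetical helper name, not part of any Ubuntu tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Supported flag for one codename.
# stdin: contents of a meta-release file; $1: codename such as "resolute".
# Exits non-zero if the codename has no Supported line.
release_supported() {
    awk -v dist="$1" '
        $1 == "Dist:"                          { current = $2 }
        $1 == "Supported:" && current == dist  { print $2; found = 1; exit }
        END                                    { exit !found }
    '
}
```

Usage would look like: curl -s https://changelogs.ubuntu.com/meta-release-lts | release_supported resolute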

do-release-upgrade has a non-interactive frontend that preserves existing configuration files. It doesn’t reboot at the end of the installation. I’m going to use:

do-release-upgrade -d -f DistUpgradeViewNonInteractive

The first problem I encountered was on a ppc64 machine where the root filesystem was 100% full.

/dev/mapper/ubuntu--vg--1-ubuntu--lv  196G  189G   0  100% /

do-release-upgrade quite reasonably refused right upfront. The odd part was that there didn’t seem to be an obvious culprit. I normally run du -sh /* and drill down into the largest directory, but even a full du only found 103 GB:

$ du -x -sh /
103G    /

My assumption was that deleted files, which were still being held open, were consuming space, but lsof couldn’t find them.
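For reference, lsof +L1 is the usual way to list open files whose link count has dropped to zero. The same check can be sketched without lsof by walking /proc, as below. This is an illustration rather than the exact check I ran: list_deleted_open is a hypothetical helper name, it is Linux-only, and it needs root to see other users’ processes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list file descriptors that still point at deleted files.
# Prints "<proc dir><TAB><target>" for each match. The kernel appends
# " (deleted)" to the link target of an unlinked file; anonymous memfd
# mappings can show up the same way.
list_deleted_open() {
    local link target
    for link in /proc/[0-9]*/fd/*; do
        # Processes exit and permissions vary, so skip unreadable entries.
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s\t%s\n' "${link%/fd/*}" "$target" ;;
        esac
    done
}
```

An empty result, as on this worker, means the missing space has to be somewhere else.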

The worker has a 3.4 TB NVMe mounted at /var/lib/docker, and a mount hides whatever is already in the directory underneath it, so any data on the root filesystem under /var/lib/docker would be invisible to du. The trick here is to bind-mount the root filesystem somewhere else, which allows the underlying content to be inspected without stopping Docker or unmounting anything.

$ mount --bind / /mnt/rootfs
$ du -x -sh /mnt/rootfs/var/lib/docker
87G     /mnt/rootfs/var/lib/docker

So clearly, at some point in the past, the NVMe failed to mount and Docker started in the empty directory and filled the disk. Later the NVMe mounted on top and hid 87 GB of orphaned layers in plain sight. The likely issue is in /etc/fstab.

/dev/nvme1n1p1 /var/lib/docker ext4 defaults 0 1

NVMe device enumeration is not stable across boots, so /dev/nvme1n1p1 is a gamble. The fix is to stop Docker, unmount the NVMe, delete the orphaned tree on the root disk, and switch fstab to the filesystem UUID so it can never mount the wrong thing again:

$ systemctl stop ocluster-worker docker containerd
$ umount /var/lib/docker
$ rm -rf /var/lib/docker/*
$ blkid -s UUID -o value /dev/nvme1n1p1
f3650834-c3aa-4534-ad3b-f53fc9930013
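The fstab change itself, and a check for other entries with the same weakness, can be sketched as two small filters. Both function names are hypothetical, and the device-name pattern only covers the nvme, sd and vd naming schemes, which is an assumption about what the workers use.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not existing tools.

# Print "<device> <mountpoint>" for fstab entries (read from stdin) whose
# device field is a kernel-assigned name that can change between boots.
fstab_unstable_devices() {
    awk '$1 ~ /^\/dev\/(nvme[0-9]+n[0-9]+(p[0-9]+)?|sd[a-z]+[0-9]*|vd[a-z]+[0-9]*)$/ { print $1, $2 }'
}

# Rewrite the device field of the entry for $1 to UUID=$2, passing every
# other line through. awk rebuilds the edited line with single spaces.
fstab_use_uuid() {
    local dev=$1 uuid=$2
    awk -v dev="$dev" -v uuid="$uuid" '$1 == dev { $1 = "UUID=" uuid } { print }'
}
```

For example, fstab_use_uuid /dev/nvme1n1p1 f3650834-c3aa-4534-ad3b-f53fc9930013 < /etc/fstab writes the updated file to stdout, with the Docker entry now starting UUID=f3650834-c3aa-4534-ad3b-f53fc9930013. Writing to a new file and reviewing it before replacing /etc/fstab is safer than editing in place.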

The next issue was an arm64 worker. The upgrade changed the sources to resolute and then aborted on apt update:

E:Failed to fetch .../dists/resolute/main/binary-armhf/Packages  404  Not Found

binary-armhf. The host had the armhf architecture enabled as a foreign architecture (dpkg --add-architecture armhf), with zero installed packages. The amd64/arm64 archive does not publish an armhf index for resolute, so the fetch 404s.

$ dpkg --print-foreign-architectures
armhf
$ dpkg -l | grep ':armhf' | wc -l
0

As nothing depended on it, I ran dpkg --remove-architecture armhf, and the upgrade then completed successfully.
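The same check can be run ahead of the upgrade on the remaining workers. A minimal sketch follows, assuming the list of package architectures comes from dpkg-query; unused_foreign_archs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each architecture given as an argument that has
# no packages in the list read from stdin.
# stdin: one architecture per package, for example from
#        dpkg-query -W -f='${Architecture}\n'
unused_foreign_archs() {
    local installed arch
    installed=$(cat)
    for arch in "$@"; do
        # -x matches whole lines only, so "arm" does not match "armhf".
        if ! printf '%s\n' "$installed" | grep -qx -- "$arch"; then
            printf '%s\n' "$arch"
        fi
    done
}
```

Usage would look like: dpkg-query -W -f='${Architecture}\n' | unused_foreign_archs $(dpkg --print-foreign-architectures). Anything it prints is a candidate for dpkg --remove-architecture.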

On one of the x86_64 workers, the installation failed at the end, setting up GRUB:

Setting up grub-efi-amd64-signed ...
mount: /var/lib/grub/esp: special device
  /dev/disk/by-id/scsi-SATA_THNSF8400CCSE_67BS10GBTBST-part1 does not exist.
dpkg: error processing package grub-efi-amd64-signed (--configure):

The EFI System Partition looked fine: it was on /dev/sda1, mounted at /boot/efi, exactly as /etc/fstab expected. The stale reference was in debconf:

$ debconf-show grub-efi-amd64-signed | grep install_devices
* grub-efi/install_devices: /dev/disk/by-id/scsi-SATA_THNSF8400CCSE_...-part1

The SATA SSD must have been removed at some point, as it’s no longer in the system. Ubuntu 24.04 had booted correctly, so the boot setup can’t be all bad. dpkg --configure -a and then grub-install finished cleanly.
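A stale entry like this can be spotted before the upgrade by checking that every device recorded in debconf still exists. The sketch below assumes the debconf-show line format shown above, with multiple devices separated by commas; missing_install_devices is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print any grub-efi/install_devices path that no
# longer exists on this machine.
# stdin: output of debconf-show for the GRUB package.
missing_install_devices() {
    sed -n 's/^[* ]*grub-efi\/install_devices: *//p' |
        tr ',' '\n' |
        while read -r dev; do
            # read strips the space that follows each comma.
            if [ -n "$dev" ] && [ ! -e "$dev" ]; then
                printf '%s\n' "$dev"
            fi
        done
}
```

Usage would look like: debconf-show grub-efi-amd64-signed | missing_install_devices. No output means every recorded device is still present.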

A couple of workers vanished from the network after their final reboot, and ping reported No route to host. Logging in via the serial console showed that they had rebooted correctly into 26.04. The only issue was that they had a new DHCP lease on reboot, and my DNS entry was cached.

The ocluster worker makes an outbound connection to the scheduler, so it had reconnected regardless. My machine’s DNS cache will clear on its own.

We have exactly one s390x worker, so I can’t really afford to break it. It’s also the only architecture that doesn’t boot via GRUB. After the upgrade had installed the 7.0 kernel, I pointed the IPL record at it explicitly:

$ zipl
Building bootmap in '/boot'
Adding #1: IPL section 'ubuntu' (default)
Adding #2: IPL section 'old'
Preparing boot device: dasda.
Done.

Its 20 GB root filesystem was also 81% full, which is not enough headroom for a release upgrade. Before starting, journalctl --vacuum-size=100M reclaimed 1.9 GB of archived journals, which was plenty.
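A free-space check like this is easy to run as a pre-flight step on every worker. The sketch below is an assumption about how one might do it: free_kib and has_free_gib are hypothetical helpers, and the threshold is left to the caller because the space a release upgrade needs depends on what is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a pre-upgrade disk space check.

# Print the available space, in KiB, on the filesystem holding path $1.
# -P keeps each filesystem on one line so the field positions are stable.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem holding $1 has at least $2 GiB available.
has_free_gib() {
    local path=$1 need_gib=$2 avail
    avail=$(free_kib "$path")
    [ "$avail" -ge $((need_gib * 1024 * 1024)) ]
}
```

For example, has_free_gib / 5 || journalctl --vacuum-size=100M would only trim the journal when less than 5 GiB is free; the 5 is an arbitrary placeholder. journalctl --disk-usage shows how much there is to reclaim first.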

In the past, I’ve used Ansible to do these upgrades, and that would have been a good option here too, but I ended up writing a script instead.

The riscv-bm machines need to stay on 24.04 as the hardware does not support 26.04, but everything else is now on the 7.0 kernel, which is what we wanted for the io_uring work.
