Installation from recovery console
Mark Elvers

Categories

  • ubuntu,zfs

Tags

  • tunbury.org

Over the weekend, one of the NVMe drives in pima failed, which brought down the whole system.

Booting over the network to a recovery console showed that nvme6 was dead. The kernel logged errors on every access, and the SMART log confirmed the failure: it showed a critical warning flag of 0x4 (reliability degraded) despite zero media errors.
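
To make the 0x4 value less opaque, here is a minimal sketch that decodes the critical warning bitmask. The bit meanings are taken from the NVMe SMART / Health Information log definition. The function name decode_critical_warning is made up for this example, and the post does not say which tool produced the SMART log.

```shell
# Decode an NVMe SMART "critical warning" bitmask into readable flags.
# Accepts decimal or 0x-prefixed hex. The function name is illustrative only.
decode_critical_warning() {
  value=$(( $1 ))
  bit=0
  for name in \
    "available spare below threshold" \
    "temperature threshold exceeded" \
    "reliability degraded" \
    "media in read-only mode" \
    "volatile memory backup failed"
  do
    # Test each bit in turn, lowest bit first.
    if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
      echo "bit ${bit}: ${name}"
    fi
    bit=$(( bit + 1 ))
  done
}

decode_critical_warning 0x4   # prints "bit 2: reliability degraded"
```

A value of 0x4 therefore means the drive itself is reporting degraded reliability, which fits with the kernel errors even though the media error counter is zero.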

I have logged a ticket with Micron to investigate the failure, but we’d like to get the machine back online as soon as possible. Since the other seven drives have the same firmware, I suspect that another drive or two could fail without warning, so I’m going to rebuild with extra redundancy.

There is already a reasonable partition table on each of the drives, so I’m going to reuse it. md127 turned out to be the swap space. I’ll create a RAID6 MD array on p2 across the seven healthy drives, where ~20GB usable will be enough for root, and a ZFS RAIDZ2 pool on p4, giving ~65TB of usable space.
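
The capacity figures follow from the double-parity layout: two members' worth of space goes to parity, so usable space is (members - 2) × member size. A small sketch of that arithmetic, using the 4G and 14T partition sizes from the lsblk output below (the function name is only for illustration):

```shell
# Usable capacity of a double-parity array (RAID6 or RAIDZ2):
# two members' worth of space is consumed by parity.
double_parity_usable() {
  members=$1
  size_each=$2
  echo $(( (members - 2) * size_each ))
}

echo "root (RAID6 on p2):  $(double_parity_usable 7 4)G"
echo "data (RAIDZ2 on p4): $(double_parity_usable 7 14)T before ZFS overhead"
```

The 70T result is an upper bound. The ~65TB estimate is lower, presumably because of ZFS metadata, padding and reserved space, though the exact overhead depends on the pool settings.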

# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE   MOUNTPOINTS
nvme6n1     259:0    0   14T  0 disk   
nvme5n1     259:1    0   14T  0 disk   
├─nvme5n1p1 259:2    0  512M  0 part   
├─nvme5n1p2 259:3    0    4G  0 part   
│ └─md127     9:127  0   16G  0 raid10 
├─nvme5n1p3 259:4    0    2G  0 part   
└─nvme5n1p4 259:5    0   14T  0 part   
...

Stop md127 and clear the MD superblocks on the p2 partitions.

mdadm --stop /dev/md127
mdadm --zero-superblock /dev/nvme{0,1,2,3,4,5,7}n1p2

Then create the new array and format it with ext4.

mdadm --create /dev/md0 --level=6 --raid-devices=7 \
  /dev/nvme0n1p2 /dev/nvme1n1p2 /dev/nvme2n1p2 \
  /dev/nvme3n1p2 /dev/nvme4n1p2 /dev/nvme5n1p2 /dev/nvme7n1p2

mkfs.ext4 -L root /dev/md0
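
The same list of seven partitions, skipping the failed nvme6, recurs in several commands. As an optional convenience, a small helper can generate it so that the failed drive is never included by accident. The name healthy_parts and its default of drive 6 are assumptions for this sketch.

```shell
# Print partition PART of every drive nvme0..nvme7 except the failed one.
# Usage: healthy_parts PART [FAILED_INDEX]   (FAILED_INDEX defaults to 6)
healthy_parts() {
  part=$1
  failed=${2:-6}
  i=0
  while [ "$i" -le 7 ]; do
    if [ "$i" -ne "$failed" ]; then
      printf '/dev/nvme%sn1p%s\n' "$i" "$part"
    fi
    i=$(( i + 1 ))
  done
}

healthy_parts 2
```

The output could then be passed to the array creation, for example mdadm --create /dev/md0 --level=6 --raid-devices=7 $(healthy_parts 2).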

We are already in the recovery console, so we can start the installation directly by mounting the new file system and running debootstrap.

mount /dev/md0 /mnt
debootstrap noble /mnt http://archive.ubuntu.com/ubuntu

Once that completes, prepare for a chroot environment by mounting the pseudo-filesystems and the first EFI partition.

mount --bind /dev /mnt/dev
mount --bind /dev/pts /mnt/dev/pts
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
mount --bind /run /mnt/run

mkdir -p /mnt/boot/efi
mount /dev/nvme0n1p1 /mnt/boot/efi

Chroot into the new environment.

chroot /mnt /bin/bash

Then, inside the chroot, create /etc/fstab.

/dev/md0       /           ext4    errors=remount-ro   0 1
/dev/nvme0n1p1 /boot/efi   vfat    umask=0077          0 1
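
Device names work here, but an MD array can come up under a different name (the old array appeared as md127 above), so some people prefer to reference filesystems by UUID. A hedged sketch of generating such an entry follows. The helper name fstab_entry and the UUID shown are made up, and the real value would come from blkid.

```shell
# Format one fstab entry that refers to a filesystem by UUID.
# Usage: fstab_entry UUID MOUNTPOINT FSTYPE OPTIONS PASS
fstab_entry() {
  uuid=$1
  mountpoint=$2
  fstype=$3
  options=$4
  pass=$5
  printf 'UUID=%s %s %s %s 0 %s\n' "$uuid" "$mountpoint" "$fstype" "$options" "$pass"
}

# Hypothetical UUID for illustration; on the real system it would come from
#   blkid -s UUID -o value /dev/md0
fstab_entry "0a1b2c3d-1111-2222-3333-444455556666" / ext4 errors=remount-ro 1
```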

Also create /etc/apt/sources.list.

deb http://archive.ubuntu.com/ubuntu noble main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-security main restricted universe multiverse

Install the kernel, GRUB, the admin tools for MD and ZFS, and the SSH server.

apt update && apt install -y linux-image-generic grub-efi-amd64 mdadm zfsutils-linux openssh-server networkd-dispatcher

Create /etc/default/grub. This machine uses the second serial port (ttyS1) as its console.

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Ubuntu"
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"

Enable getty on ttyS1.

systemctl enable serial-getty@ttyS1.service

Update mdadm.conf so it finds the array at boot.

mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
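
Because the scan output is appended with >>, running the command a second time would leave duplicate ARRAY lines in mdadm.conf. One way to make the step repeatable is to append only lines that are not already present. This is a sketch: append_missing_lines is a hypothetical helper, and the sample ARRAY line is invented for the demonstration.

```shell
# Append each line read from stdin to FILE unless an identical line already exists.
append_missing_lines() {
  file=$1
  touch "$file"
  while IFS= read -r line; do
    # -x: whole-line match, -F: fixed string, so the line is compared literally.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
  done
}

# Demonstration against a temporary file with a made-up ARRAY line.
conf=$(mktemp)
sample='ARRAY /dev/md0 metadata=1.2 UUID=00000000:00000000:00000000:00000000'
printf '%s\n' "$sample" | append_missing_lines "$conf"
printf '%s\n' "$sample" | append_missing_lines "$conf"
grep -c . "$conf"   # the line was stored only once
rm -f "$conf"
```

On the real system the equivalent call would be mdadm --detail --scan | append_missing_lines /etc/mdadm/mdadm.conf, followed by update-initramfs -u as before.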

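Install GRUB to the EFI partition on the first drive. The other six EFI partitions are populated in the next step, so that the machine can boot from any of the seven drives.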

grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu --recheck

Copy the EFI bootloader to the other drives.

mkdir -p /tmp/efi
for disk in nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme7n1; do
  mount "/dev/${disk}p1" /tmp/efi
  cp -r /boot/efi/EFI /tmp/efi/
  umount /tmp/efi
done
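
Since these copies are the only thing that makes the other drives bootable, it may be worth confirming that each one matches the original. The sketch below copies an EFI tree and then compares the two with diff -r. The helper name copy_efi_tree is an assumption, and the demonstration uses temporary directories with a placeholder file rather than real EFI partitions.

```shell
# Copy the EFI tree from SRC to DST, then confirm the two trees are identical.
# Returns non-zero if the copy fails or the trees differ.
copy_efi_tree() {
  src=$1
  dst=$2
  cp -r "$src/EFI" "$dst/" && diff -r "$src/EFI" "$dst/EFI" > /dev/null
}

# Demonstration with temporary directories standing in for mounted partitions.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/EFI/ubuntu"
echo "placeholder" > "$src/EFI/ubuntu/grubx64.efi"
copy_efi_tree "$src" "$dst" && echo "copy verified"
rm -rf "$src" "$dst"
```

In the loop above, the call would be copy_efi_tree /boot/efi /tmp/efi in place of the plain cp.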

Update the GRUB installation.

update-grub

Set a root password with passwd root.

Set the hostname.

echo "pima" > /etc/hostname

Create a netplan file, /etc/netplan/01-netcfg.yaml, to match the configuration of the machine, and restrict its permissions so that only root can read it.

chmod 600 /etc/netplan/01-netcfg.yaml
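
The actual network configuration is specific to the machine and is not shown here. Purely as an illustration of the file's shape, the sketch below writes a minimal netplan file and sets its permissions in one step. The interface name eno1 and the use of DHCP are placeholders, not pima's real settings, and write_netplan is a hypothetical helper.

```shell
# Write a minimal netplan file and make it readable by its owner only.
# "eno1" and "dhcp4: true" are placeholders, not the machine's real settings.
write_netplan() {
  target=$1
  cat > "$target" <<'EOF'
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: true
EOF
  chmod 600 "$target"
}

# Example call inside the chroot:
#   write_netplan /etc/netplan/01-netcfg.yaml
```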

Create a regular user with sudo access and get their keys from GitHub.

useradd -m -s /bin/bash -G sudo username
passwd username
mkdir -p -m 0700 /home/username/.ssh
curl -o /home/username/.ssh/authorized_keys https://github.com/username.keys
chown username:username /home/username/.ssh /home/username/.ssh/authorized_keys
chmod 0600 /home/username/.ssh/authorized_keys
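
If the download fails, curl -o can leave an error page in authorized_keys instead of keys, which would lock the user out of SSH after the reboot. A simple sanity check is sketched below. The helper name looks_like_authorized_keys is an assumption, and the pattern only covers the common key types, so treat it as a rough filter rather than a full validator.

```shell
# Succeed only if FILE contains at least one public key line and nothing else
# apart from blank lines. Covers common key types only.
looks_like_authorized_keys() {
  file=$1
  pattern='^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)|sk-(ssh-ed25519|ecdsa-sha2-nistp256)@openssh\.com) [A-Za-z0-9+/=]+'
  # At least one key line must be present...
  grep -qE "$pattern" "$file" || return 1
  # ...and no other non-blank content is allowed.
  if grep -vE "$pattern" "$file" | grep -q '[^[:space:]]'; then
    return 1
  fi
  return 0
}

# Demonstration: an error page is rejected.
tmp=$(mktemp)
echo "Not Found" > "$tmp"
looks_like_authorized_keys "$tmp" || echo "rejected: not a key file"
rm -f "$tmp"
```

Running looks_like_authorized_keys /home/username/.ssh/authorized_keys before leaving the chroot gives a quick indication that the keys arrived intact.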

Set the timezone.

ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime

Set the locale.

apt install -y locales
locale-gen en_GB.UTF-8
update-locale LANG=en_GB.UTF-8

Exit and reboot.

exit
umount /mnt/boot/efi
umount /mnt/{dev/pts,dev,proc,sys,run}
umount /mnt
reboot

After the reboot, create the ZFS pool on the new system.

zpool create -f -o ashift=12 data raidz2 \
  /dev/nvme0n1p4 /dev/nvme1n1p4 /dev/nvme2n1p4 \
  /dev/nvme3n1p4 /dev/nvme4n1p4 /dev/nvme5n1p4 /dev/nvme7n1p4