Over the weekend, one of the NVMe drives in pima failed, which brought down the whole system.
Booting over the network to a recovery console showed that nvme6 was dead. The kernel logged errors on any access, and the SMART log confirmed the failure: the critical warning field was 0x4, even though the media error count was zero.
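For reference, the critical warning field is a bitmask, and a small decoder makes the 0x4 reading easier to interpret. This is a minimal sketch based on the bit layout of the NVMe SMART / Health Information log page; the function name is my own. Bit 2 covers degradation from media errors or from any internal error, which would be consistent with a zero media error count.

```shell
#!/bin/sh
# Decode the NVMe SMART "critical warning" byte into its flag names.
# decode_critical_warning is a hypothetical helper, not part of nvme-cli.
decode_critical_warning() {
    value=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    [ "$value" -eq 0 ] && { echo "no warnings"; return 0; }
    [ $(( value & 0x01 )) -ne 0 ] && echo "available spare below threshold"
    [ $(( value & 0x02 )) -ne 0 ] && echo "temperature outside threshold"
    [ $(( value & 0x04 )) -ne 0 ] && echo "NVM subsystem reliability degraded"
    [ $(( value & 0x08 )) -ne 0 ] && echo "media placed in read-only mode"
    [ $(( value & 0x10 )) -ne 0 ] && echo "volatile memory backup device failed"
    return 0
}

decode_critical_warning 0x4   # prints: NVM subsystem reliability degraded
```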
I have logged a ticket with Micron to investigate the failure, but we’d like to get the machine back online as soon as possible. Since the other seven drives have the same firmware, I suspect that another drive or two could fail without warning; therefore, I’m going to rebuild with (extra) redundancy.
There is already a reasonable partition table on each of the drives, so I’m going to use that. md127 turned out to be the swap space. I’ll create a RAID6 MD array on p2 across the seven healthy drives, where ~20GB usable will be enough for root, and a ZFS RAIDZ2 pool on p4, giving ~65TB of usable space.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme6n1 259:0 0 14T 0 disk
nvme5n1 259:1 0 14T 0 disk
├─nvme5n1p1 259:2 0 512M 0 part
├─nvme5n1p2 259:3 0 4G 0 part
│ └─md127 9:127 0 16G 0 raid10
├─nvme5n1p3 259:4 0 2G 0 part
└─nvme5n1p4 259:5 0 14T 0 part
...
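The capacity figures quoted above follow from the double-parity layout: two members' worth of space goes to parity and the rest holds data. A quick sketch of the arithmetic, using the partition sizes reported by lsblk (the helper name is hypothetical). For the ZFS pool the result is only an upper bound, since metadata, padding and reserved space reduce what is actually usable.

```shell
#!/bin/sh
# Usable capacity of a double-parity layout (RAID6 or RAIDZ2).
# double_parity_usable is a hypothetical helper for illustration.
double_parity_usable() {
    members=$1      # number of devices in the array
    size=$2         # size of each member, in any unit
    echo $(( (members - 2) * size ))
}

echo "root (RAID6 on p2):  $(double_parity_usable 7 4)G"
echo "data (RAIDZ2 on p4): $(double_parity_usable 7 14)T before ZFS overhead"
```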
Stop md127 and wipe the old MD superblocks from the p2 partitions.
mdadm --stop /dev/md127
mdadm --zero-superblock /dev/nvme{0,1,2,3,4,5,7}n1p2
Then create the new array and format it with ext4.
mdadm --create /dev/md0 --level=6 --raid-devices=7 \
/dev/nvme0n1p2 /dev/nvme1n1p2 /dev/nvme2n1p2 \
/dev/nvme3n1p2 /dev/nvme4n1p2 /dev/nvme5n1p2 /dev/nvme7n1p2
mkfs.ext4 -L root /dev/md0
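mdadm --create returns straight away while the initial resync runs in the background, and the array is usable in the meantime. If you want to keep an eye on it, the progress can be pulled out of /proc/mdstat. A minimal sketch follows; the function name is my own and the sample text is illustrative, not output captured from pima.

```shell
#!/bin/sh
# Print the resync/recovery percentage from mdstat-formatted text on stdin.
# md_sync_progress is a hypothetical helper.
md_sync_progress() {
    sed -nE 's/.*(resync|recovery) *= *([0-9.]+%).*/\2/p'
}

# Illustrative sample only.
sample='md0 : active raid6 nvme7n1p2[6] nvme5n1p2[5] nvme4n1p2[4] nvme3n1p2[3] nvme2n1p2[2] nvme1n1p2[1] nvme0n1p2[0]
      [==>..................]  resync = 12.6% (528384/4190208) finish=0.3min speed=176128K/sec'
printf '%s\n' "$sample" | md_sync_progress
```

On the real machine this would be run as md_sync_progress < /proc/mdstat; it prints nothing once the array is in sync.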
We are in the recovery console so we can start the installation directly by mounting the new file system and running debootstrap.
mount /dev/md0 /mnt
debootstrap noble /mnt http://archive.ubuntu.com/ubuntu
Once that completes, prepare for a chroot environment by mounting the pseudo-filesystems and the first EFI partition.
mount --bind /dev /mnt/dev
mount --bind /dev/pts /mnt/dev/pts
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
mount --bind /run /mnt/run
mkdir -p /mnt/boot/efi
mount /dev/nvme0n1p1 /mnt/boot/efi
Chroot into the new environment.
chroot /mnt /bin/bash
Then, inside the chroot, create /etc/fstab.
/dev/md0 / ext4 errors=remount-ro 0 1
/dev/nvme0n1p1 /boot/efi vfat umask=0077 0 1
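Device names such as /dev/md0 and /dev/nvme0n1p1 are not guaranteed to stay the same across boots, so a UUID-based fstab is a possible alternative. The sketch below only formats the lines; the UUIDs are placeholders and fstab_entry is a hypothetical helper. On the real system the values would come from blkid -s UUID -o value run against each device.

```shell
#!/bin/sh
# Format one fstab line keyed by filesystem UUID.
# fstab_entry is a hypothetical helper; the UUIDs below are placeholders.
fstab_entry() {
    # $1 uuid  $2 mountpoint  $3 fstype  $4 options  $5 fsck pass
    printf 'UUID=%s %s %s %s 0 %s\n' "$1" "$2" "$3" "$4" "$5"
}

fstab_entry "00000000-0000-0000-0000-000000000000" / ext4 errors=remount-ro 1
fstab_entry "0000-0000" /boot/efi vfat umask=0077 1
```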
Also create /etc/apt/sources.list.
deb http://archive.ubuntu.com/ubuntu noble main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-security main restricted universe multiverse
Install the kernel, GRUB, the admin tools for MD and ZFS, and the SSH server.
apt update && apt install -y linux-image-generic grub-efi-amd64 mdadm zfsutils-linux openssh-server networkd-dispatcher
Create /etc/default/grub. This machine uses a serial port console on the second serial port.
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Ubuntu"
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"
Enable getty on ttyS1.
systemctl enable serial-getty@ttyS1.service
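The serial settings have to agree in several places: the GRUB serial unit, the kernel console= argument and the getty instance all need to point at the same port (unit 1 corresponds to ttyS1). A small sanity check along these lines can catch a mismatch before rebooting. It is a sketch under the assumption that both settings live in one GRUB defaults file; check_serial_console is a hypothetical helper.

```shell
#!/bin/sh
# Compare the GRUB serial unit with the kernel's ttyS console number.
# check_serial_console is a hypothetical helper.
check_serial_console() {
    # $1: path to a GRUB defaults file
    unit=$(sed -nE 's/.*--unit=([0-9]+).*/\1/p' "$1" | head -n 1)
    tty=$(sed -nE 's/.*console=ttyS([0-9]+).*/\1/p' "$1" | head -n 1)
    if [ -n "$unit" ] && [ "$unit" = "$tty" ]; then
        echo "ok: GRUB unit $unit matches ttyS$tty"
    else
        echo "mismatch: GRUB unit '$unit' vs ttyS '$tty'"
        return 1
    fi
}

# Demonstrate on a temporary copy of the two relevant lines.
grub_defaults=$(mktemp)
cat > "$grub_defaults" <<'EOF'
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"
EOF
check_serial_console "$grub_defaults"
rm -f "$grub_defaults"
```

Inside the chroot it would be called as check_serial_console /etc/default/grub.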
Update mdadm.conf so it finds the array at boot.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
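Because the scan output is appended with >>, running that step a second time would leave duplicate ARRAY lines in mdadm.conf. One way to make it repeatable is to append only the lines that are not already present. This is a sketch: append_missing_lines is a hypothetical helper and the ARRAY line uses a placeholder UUID.

```shell
#!/bin/sh
# Append each stdin line to a file unless an identical line is already there.
# append_missing_lines is a hypothetical helper.
append_missing_lines() {
    # $1: target file; lines to add arrive on stdin
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        grep -qxF -- "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
    done
}

# Demonstrate on a temporary file with a placeholder ARRAY line.
conf=$(mktemp)
sample='ARRAY /dev/md0 metadata=1.2 UUID=00000000:00000000:00000000:00000000'
printf '%s\n' "$sample" | append_missing_lines "$conf"
printf '%s\n' "$sample" | append_missing_lines "$conf"   # second run adds nothing
cat "$conf"
rm -f "$conf"
```

On the real system the equivalent would be mdadm --detail --scan | append_missing_lines /etc/mdadm/mdadm.conf, followed by update-initramfs -u.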
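Install GRUB to the EFI partition on the first drive, which is mounted at /boot/efi. For redundancy, the result is then copied to the EFI partitions on the other six drives.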
grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu --recheck
Copy the EFI bootloader to the other drives.
mkdir -p /tmp/efi
for disk in nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme7n1; do
    mount "/dev/${disk}p1" /tmp/efi
    cp -r /boot/efi/EFI /tmp/efi/
    umount /tmp/efi
done
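Copying the files does not by itself register anything with the firmware, since UEFI boot entries live in NVRAM. If the firmware should be able to boot from any of the copies, an entry per disk can be added with efibootmgr. The sketch below only prints the commands so they can be reviewed first. The loader path \EFI\ubuntu\grubx64.efi is an assumption about what grub-install wrote (check the contents of /boot/efi/EFI/ubuntu), and both the labels and the function name are made up for illustration.

```shell
#!/bin/sh
# Print (not run) one efibootmgr command per disk.
# print_boot_entries is a hypothetical helper; the loader path is an assumption.
print_boot_entries() {
    # $1: EFI loader path; remaining args: disk names
    loader=$1; shift
    for disk in "$@"; do
        printf "efibootmgr --create --disk /dev/%s --part 1 --label 'ubuntu (%s)' --loader '%s'\n" \
            "$disk" "$disk" "$loader"
    done
}

print_boot_entries '\EFI\ubuntu\grubx64.efi' \
    nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme7n1
```

Once the output looks right, it can be piped to sh on a system where the EFI variables are writable.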
Generate the GRUB configuration.
update-grub
Set a root password.
passwd root
Set the hostname.
echo "pima" > /etc/hostname
Create a netplan file, /etc/netplan/01-netcfg.yaml, to match the network configuration of the machine, and restrict its permissions so that only root can read it.
chmod 600 /etc/netplan/01-netcfg.yaml
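The actual network configuration is specific to the machine, so the following is only a shape to start from: a minimal sketch that writes a single DHCP interface and applies the same 600 permissions. The interface name eno1 is a placeholder, not pima's real interface, and write_netplan is a hypothetical helper.

```shell
#!/bin/sh
# Write a minimal netplan file for one DHCP interface, readable by root only.
# write_netplan is a hypothetical helper; eno1 below is a placeholder name.
write_netplan() {
    # $1: output path, $2: interface name
    cat > "$1" <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    $2:
      dhcp4: true
EOF
    chmod 600 "$1"
}

# Demonstrate in a temporary directory rather than /etc/netplan.
demo_dir=$(mktemp -d)
write_netplan "$demo_dir/01-netcfg.yaml" eno1
cat "$demo_dir/01-netcfg.yaml"
rm -rf "$demo_dir"
```

A static address, bond or VLAN setup would replace the dhcp4 line; netplan generate can be used to check the syntax of the real file.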
Create a regular user with sudo access and get their keys from GitHub.
useradd -m -s /bin/bash -G sudo username
passwd username
mkdir -p -m 0700 /home/username/.ssh
curl -fsSL -o /home/username/.ssh/authorized_keys https://github.com/username.keys
chown username:username /home/username/.ssh /home/username/.ssh/authorized_keys
chmod 0600 /home/username/.ssh/authorized_keys
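Since SSH keys are the way back into the machine after the reboot, it is worth checking that the downloaded file really contains keys and not, say, an error page. A rough sketch, assuming the usual OpenSSH public key prefixes; looks_like_authorized_keys is a hypothetical helper and the key in the demo is a dummy string.

```shell
#!/bin/sh
# Succeed only if the file is non-empty and every non-blank, non-comment
# line starts with a recognised public key type.
# looks_like_authorized_keys is a hypothetical helper.
looks_like_authorized_keys() {
    [ -s "$1" ] || return 1
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) ;;
            ssh-ed25519\ *|ssh-rsa\ *|ecdsa-sha2-*\ *) ;;
            sk-ssh-ed25519@openssh.com\ *|sk-ecdsa-sha2-*\ *) ;;
            *) return 1 ;;
        esac
    done < "$1"
    return 0
}

# Demonstrate with a dummy key and with an error page body.
demo=$(mktemp)
echo "ssh-ed25519 AAAAC3NzaDummyKeyForIllustration" > "$demo"
looks_like_authorized_keys "$demo" && echo "keys look plausible"
echo "Not Found" > "$demo"
looks_like_authorized_keys "$demo" || echo "not a key file"
rm -f "$demo"
```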
Set the timezone.
ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime
Set the locale.
apt install -y locales
locale-gen en_GB.UTF-8
update-locale LANG=en_GB.UTF-8
Exit the chroot, unmount everything and reboot.
exit
umount /mnt/boot/efi
umount /mnt/{dev/pts,dev,proc,sys,run}
umount /mnt
reboot
Once the machine is back up on the new root, create the ZFS pool.
zpool create -f -o ashift=12 data raidz2 \
/dev/nvme0n1p4 /dev/nvme1n1p4 /dev/nvme2n1p4 \
/dev/nvme3n1p4 /dev/nvme4n1p4 /dev/nvme5n1p4 /dev/nvme7n1p4
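Both the mdadm and the zpool commands list the same seven drives by hand, which is easy to get wrong. A small generator that skips the failed drive is one way to avoid typos. This is a sketch with a hypothetical function name. It assumes the drives are still numbered 0 to 7 with nvme6 as the dead one; kernel NVMe numbering is not guaranteed to be stable across boots, so check lsblk first.

```shell
#!/bin/sh
# List the given partition on every drive except the failed one.
# healthy_members is a hypothetical helper.
healthy_members() {
    # $1: partition number, $2: index of the failed drive, $3: drive count
    i=0
    while [ "$i" -lt "$3" ]; do
        [ "$i" -ne "$2" ] && printf '/dev/nvme%sn1p%s\n' "$i" "$1"
        i=$(( i + 1 ))
    done
}

healthy_members 4 6 8
```

The pool creation could then be written as zpool create -f -o ashift=12 data raidz2 $(healthy_members 4 6 8).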