Scaleway Elastic Metal Reconfiguration
Mark Elvers
6 min read

Categories

  • Scaleway

Tags

  • tunbury.org

Scaleway offers the EM-L110X-SATA machine, which has 4 x 12TB disks. I’ve noted in a previous post that the configuration isn’t ideal for my purposes, and I outlined a way to reconfigure the machine. The premise of that post is that you can eject one of the disks from the RAID5 array to use as the new root filesystem. All well and good, but you must wait for the RAID5 array to finish building; otherwise, ejecting the disk immediately leads to an inaccessible file system.

Scaleway allows you to boot into a rescue console. This is a netboot environment which has SSH access using a randomly generated username and password.

Once booted, lsblk shows md0 is now md127 and md1 is missing.

NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0           7:0    0 826.6M  1 loop  /usr/lib/live/mount/rootfs/filesystem.squashfs
sda             8:0    0  10.9T  0 disk  
├─sda1          8:1    0     1M  0 part  
├─sda2          8:2    0   512M  0 part  
│ └─md127       9:127  0   511M  0 raid1 
│   └─md127p1 259:0    0   506M  0 part  
├─sda3          8:3    0  10.7T  0 part  
└─sda4          8:4    0   512M  0 part  
sdb             8:16   0  10.9T  0 disk  
├─sdb1          8:17   0     1M  0 part  
├─sdb2          8:18   0   512M  0 part  
│ └─md127       9:127  0   511M  0 raid1 
│   └─md127p1 259:0    0   506M  0 part  
├─sdb3          8:19   0  10.7T  0 part  
└─sdb4          8:20   0   512M  0 part  
sdc             8:32   0  10.9T  0 disk  
├─sdc1          8:33   0     1M  0 part  
├─sdc2          8:34   0   512M  0 part  
│ └─md127       9:127  0   511M  0 raid1 
│   └─md127p1 259:0    0   506M  0 part  
├─sdc3          8:35   0  10.7T  0 part  
└─sdc4          8:36   0   512M  0 part  
sdd             8:48   0  10.9T  0 disk  
├─sdd1          8:49   0     1M  0 part  
├─sdd2          8:50   0   512M  0 part  
│ └─md127       9:127  0   511M  0 raid1 
│   └─md127p1 259:0    0   506M  0 part  
├─sdd3          8:51   0  10.7T  0 part  
└─sdd4          8:52   0   512M  0 part  

cat /proc/mdstat shows that md1 is now md126 but is inactive:

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md126 : inactive sdb3[4] sdc3[0] sda3[2] sdd3[1]
      45751787520 blocks super 1.2
       
md127 : active (auto-read-only) raid1 sdb2[3] sdc2[0] sda2[2] sdd2[1]
      523264 blocks super 1.2 [4/4] [UUUU]
      
unused devices: <none>

We can now use mdadm --assemble --force --run /dev/md126 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 bring the array back online

mdadm: Fail create md126 when using /sys/module/md_mod/parameters/new_array
mdadm: Marking array /dev/md126 as 'clean'
mdadm: /dev/md126 has been started with 3 drives (out of 4) and 1 rebuilding.

This is confirmed with cat /proc/mdstat which shows that the rebuild has automatically restarted and will finish in about a day.

root@51-159-101-156:~# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md126 : active raid5 sdc3[0] sdb3[4] sda3[2] sdd3[1]
      34313840640 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  8.8% (1014124636/11437946880) finish=1579.5min speed=109982K/sec
      bitmap: 10/86 pages [40KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sdb2[3] sdc2[0] sda2[2] sdd2[1]
      523264 blocks super 1.2 [4/4] [UUUU]
      
unused devices: <none>

Stop the rebuild echo frozen > /sys/block/md126/md/sync_action and mount the drive as read-only.

mkdir -p /mnt/old
mount -o ro /dev/md126p1 /mnt/old

The Scaleway base installation is only ~2G, and /tmp is huge (as these systems have 96GB of RAM)

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            48G   28K   48G   1% /tmp

Create the backup

cd /mnt/old
tar czf /tmp/rootfs-backup.tar.gz \
  --exclude=./proc \
  --exclude=./sys \
  --exclude=./dev \
  --exclude=./tmp \
  --exclude=./run \
  --exclude=./mnt \
  .

Check if the backup was created successfully.

-rw-r--r-- 1 root root 1.2G Oct 31 14:50 /tmp/rootfs-backup.tar.gz

Unmount the drive, and stop the array.

cd /
umount /mnt/old
mdadm --stop /dev/md126

I found that the kernel was keen to remount the device, so I zeroed it out to prevent it.

mdadm --zero-superblock /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

Remove the partition from all the disks.

for disk in sda sdb sdc sdd; do
  parted /dev/$disk --script "rm 3 mkpart primary 1025MiB 34GiB set 3 raid on"
done

Create new 99GB RAID5 array (33GB × 3 usable with RAID5)

mdadm --create /dev/md126 --level=5 --raid-devices=4 \
  /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 \
  --chunk=512 --metadata=1.2

Check that it is building cat /proc/mdstat: 2 minutes to go!

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md126 : active raid5 sdd3[4] sdc3[2] sdb3[1] sda3[0]
      103704576 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [===>.................]  recovery = 19.1% (6606848/34568192) finish=2.2min speed=202080K/sec
      
md127 : active (auto-read-only) raid1 sdd2[1] sdc2[0] sdb2[3] sda2[2]
      523264 blocks super 1.2 [4/4] [UUUU]
      
unused devices: <none>

Create GPT partition table.

parted /dev/md126 mklabel gpt
parted /dev/md126 mkpart primary ext4 0% 100%

Format with ext4

mkfs.ext4 -L root /dev/md126p1

Verify

root@51-159-101-156:/# lsblk | grep md126
│ └─md126       9:126  0  98.9G  0 raid5 
│   └─md126p1 259:2    0  98.9G  0 part  
│ └─md126       9:126  0  98.9G  0 raid5 
│   └─md126p1 259:2    0  98.9G  0 part  
│ └─md126       9:126  0  98.9G  0 raid5 
│   └─md126p1 259:2    0  98.9G  0 part  
│ └─md126       9:126  0  98.9G  0 raid5 
│   └─md126p1 259:2    0  98.9G  0 part  

Mount the new filesystem

mkdir -p /mnt/new
mount /dev/md126p1 /mnt/new

Restore system

cd /mnt/new
tar xzf /tmp/rootfs-backup.tar.gz

Create system directories with correct permissions as these were excluded from the tar operation.

mkdir -p /mnt/new/proc
mkdir -p /mnt/new/sys
mkdir -p /mnt/new/dev
mkdir -p /mnt/new/run
mkdir -p /mnt/new/mnt
mkdir -p /mnt/new/tmp

Set correct permissions with the sticky bit for /tmp.

chmod 0555 /mnt/new/proc
chmod 0555 /mnt/new/sys
chmod 0755 /mnt/new/dev
chmod 0755 /mnt/new/run
chmod 0755 /mnt/new/mnt
chmod 1777 /mnt/new/tmp

Set ownership (but should already be root:root)

chown root:root /mnt/new/{proc,sys,dev,run,mnt,tmp}

Mount boot partition.

mount /dev/md127p1 /mnt/new/boot

Bind mount system directories for chroot.

mount --bind /dev /mnt/new/dev
mount --bind /proc /mnt/new/proc
mount --bind /sys /mnt/new/sys

Chroot into the system.

chroot /mnt/new /bin/bash

Check the original mdadm.conf file in /mnt/new/etc/mdadm/mdadm.conf.

ARRAY /dev/md0 metadata=1.2 UUID=54d65d24:831a6594:d2c51416:5dd1692c
ARRAY /dev/md1 metadata=1.2 spares=1 UUID=dd7844ac:07f188e7:995ade90:71c23f7b
MAILADDR root

And compare that with the output from mdadm --detail --scan.

ARRAY /dev/md/ubuntu-server:0 metadata=1.2 name=ubuntu-server:0 UUID=54d65d24:831a6594:d2c51416:5dd1692c
ARRAY /dev/md126 metadata=1.2 name=52-158-100-155:126 UUID=6a249202:c916a184:76fd6446:839ad3a4

Fix the UUID /dev/md1 in mdadm.conf using your favourite text editor:

sed -i "s/spares=1 UUID=.*/UUID=6a249202:c916a184:76fd6446:839ad3a4/g" /mnt/new/etc/mdadm/mdadm.conf 

Verify the changes to /mnt/new/etc/mdadm/mdadm.conf.

ARRAY /dev/md0 metadata=1.2 UUID=54d65d24:831a6594:d2c51416:5dd1692c
ARRAY /dev/md1 metadata=1.2 UUID=6a249202:c916a184:76fd6446:839ad3a4
MAILADDR root

Make the same edit to /etc/fstab.

sed -i 's/dd7844ac:07f188e7:995ade90:71c23f7b/6a249202:c916a184:76fd6446:839ad3a4/' /etc/fstab

Update initramfs with new array config.

update-initramfs -u -k all

Reinstall GRUB on all 4 disks (for redundancy).

for disk in sda sdb sdc sdd; do
  echo "Installing GRUB on /dev/$disk..."
  grub-install /dev/$disk
done

Update GRUB config.

update-grub

Exit the chroot environment.

exit

Many people would be happy here, but the free space is now in the middle of the disk with the swap space (nearly) at the end, and this means that my new partition, 5, would be out of order. parted /dev/sda print free

Model: ATA TOSHIBA MG07ACA1 (scsi)
Disk /dev/sda: 12.0TB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 
Number  Start   End     Size    File system     Name     Flags
        17.4kB  1049kB  1031kB  Free Space
 1      1049kB  2097kB  1049kB                           bios_grub
 2      2097kB  539MB   537MB
        539MB   1075MB  536MB   Free Space
 3      1075MB  36.5GB  35.4GB                  primary  raid
        36.5GB  11.7TB  11.7TB  Free Space
 4      11.7TB  11.7TB  537MB   linux-swap(v1)
        11.7TB  12.0TB  286GB   Free Space

I’m going to delete the swap partition, create the new data partition, and finally, create the swap partition at the very end.

for disk in sda sdb sdc sdd; do
  parted /dev/$disk --script "rm 4 mkpart primary ext4 36.5GB -1GB mkpart primary linux-swap -1GB 100%"
  mkswap /dev/${disk}5
done

Delete the old references to the swap space from /etc/fstab.

sed -i '/swap/d' /mnt/new/etc/fstab

Then add the new swap space to /etc/fstab.

blkid | awk -F'"' '/TYPE="swap"/ {print "/dev/disk/by-uuid/" $2 " none swap sw 0 0"}' >> /etc/fstab

Unmount everything.

umount /mnt/new/boot
umount /mnt/new/dev
umount /mnt/new/proc
umount /mnt/new/sys
umount /mnt/new

Final sync and reboot.

sync && reboot