Gluster is a free and open-source network filesystem. It has been a few years since I last looked at the project, and I was interested in taking another look. Some features, like automatic tiering of hot/cold data, have been removed, and the developers now recommend dm-cache with LVM instead.
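As an aside, the dm-cache route they point to is driven through LVM. A minimal sketch, assuming an existing volume group vg0 with a slow logical volume data and a fast device /dev/nvme0n1 (all names here are invented for illustration):
# carve a cache pool out of the fast device and attach it to the slow LV
lvcreate --type cache-pool -L 50G -n fastcache vg0 /dev/nvme0n1
lvconvert --type cache --cachepool vg0/fastcache vg0/data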
I am going to use four QEMU VMs on which I have installed Ubuntu via PXE boot. For easy repetition, I have wrapped my qemu-system-x86_64 commands into a Makefile.
machine: disk0.qcow2 disk1.qcow2 OVMF_VARS.fd
	qemu-system-x86_64 -m 8G -smp 4 -machine accel=kvm,type=pc -cpu host -display none -vnc :11 \
	  -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
	  -drive if=pflash,format=raw,file=OVMF_VARS.fd \
	  -serial stdio \
	  -device virtio-scsi-pci,id=scsi0 \
	  -device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \
	  -drive file=disk0.qcow2,if=none,id=drive0 \
	  -device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=1,lun=0 \
	  -drive file=disk1.qcow2,if=none,id=drive1 \
	  -net nic,model=virtio-net-pci,macaddr=02:00:00:00:00:11 \
	  -net bridge,br=br0

disk%.qcow2:
	qemu-img create -f qcow2 $@ 1T

OVMF_VARS.fd:
	cp /usr/share/OVMF/OVMF_VARS.fd OVMF_VARS.fd

clean:
	rm -f *.qcow2 OVMF_VARS.fd
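Running make machine creates the disk images and the UEFI variable store if they are missing, then boots the VM. I assume each of the four VMs gets its own copy of this Makefile with a different MAC address and VNC display number:
make machine   # builds disk0.qcow2, disk1.qcow2 and OVMF_VARS.fd as needed, then boots the VM
make clean     # removes the disk images and firmware variables to start over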
Gluster works on any file system that supports extended attributes (xattr), which includes ext[2-4]. However, XFS is typically used as it performs well with parallel read/write operations and large files. I have used 512-byte inodes, -i size=512, which is recommended as it creates extra space for the extended attributes.
mkfs.xfs -i size=512 /dev/sdb
mkdir -p /gluster/sdb
echo "/dev/sdb /gluster/sdb xfs defaults 0 0" >> /etc/fstab
mount -a
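To double-check that the 512-byte inodes took effect, xfs_info on the mounted filesystem reports the inode size:
# look for isize=512 in the meta-data line
xfs_info /gluster/sdb | grep isize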
With the filesystem prepared, install and start Gluster. Gluster stores its settings in /var/lib/glusterd, so if you need to reset your installation, stop the gluster daemon and remove that directory.
apt install glusterfs-server
systemctl enable glusterd
systemctl start glusterd
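If you later need the reset described above, it comes down to something like this; it wipes all Gluster configuration on the node, so only do it when you really want to start over:
systemctl stop glusterd
rm -rf /var/lib/glusterd
systemctl start glusterd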
From one node, probe all the other nodes. You can do this by IP address or by hostname.
gluster peer probe node222
gluster peer probe node200
gluster peer probe node152
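Probing by hostname assumes the names resolve on every node; without DNS, entries in /etc/hosts are enough. The addresses below are invented for illustration:
# example /etc/hosts entries on each node (addresses are hypothetical)
192.168.1.200  node200
192.168.1.222  node222
192.168.1.223  node223
192.168.1.152  node152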
gluster pool list should now list all the nodes. localhost indicates your current host.
UUID Hostname State
8d2a1ef0-4c23-4355-9faa-8f3387054d41 node222 Connected
4078f192-b2bb-4c74-a588-35d5475dedc7 node200 Connected
5b2fc21b-b0ab-401e-9848-3973121bfec7 node152 Connected
d5878850-0d40-4394-8dd8-b9b0d4266632 localhost Connected
Now we need to add a volume. A Gluster volume can be distributed, replicated, or dispersed. It is possible to mix distributed with the other two types, giving a distributed replicated volume or a distributed dispersed volume. Briefly: distributed splits the data across the nodes without redundancy but gives a performance advantage; replicated creates two or more copies of the data; dispersed uses erasure coding, which can be thought of as RAID5 across nodes.
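For comparison, this is roughly what creating the other types looks like; the volume names and the two extra nodes in the second command are invented, and I did not run these:
# replicated: every file stored in full on all three bricks
gluster volume create volr replica 3 node{200,222,223}:/gluster/sdb/volr
# distributed dispersed: two 2+1 erasure-coded sets, with files distributed between them
gluster volume create vold disperse 3 redundancy 1 node{200,222,223,152,153,154}:/gluster/sdb/vold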
Once a volume has been created, it needs to be started. The commands to create and start the volume only need to be executed on one of the nodes.
gluster volume create vol1 disperse 4 transport tcp node{200,222,223,152}:/gluster/sdb/vol1
gluster volume start vol1
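Running disperse 4 without specifying redundancy leaves Gluster to choose it (and prompt for confirmation), which here works out as 3 data bricks plus 1 redundancy brick. If you would rather be explicit, I believe the equivalent is:
# same 3+1 layout with the split spelled out
gluster volume create vol1 disperse-data 3 redundancy 1 transport tcp node{200,222,223,152}:/gluster/sdb/vol1
gluster volume start vol1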
On each node, or on a remote machine, you can now mount the Gluster volume. Here I have mounted it to /mnt from the node itself. All writes to /mnt will be dispersed across the nodes.
echo "localhost:/vol1 /mnt glusterfs defaults 0 0" >> /etc/fstab
mount -a
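A quick check that the mount is live; with a 3+1 dispersed layout over 1 TB bricks, the usable size reported should be roughly 3 TB:
df -h /mnt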
The volume can be inspected with gluster volume info.
Volume Name: vol1
Type: Disperse
Volume ID: 31e165b2-da96-40b2-bc09-e4607a02d14b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (3 + 1) = 4
Transport-type: tcp
Bricks:
Brick1: node200:/gluster/sdb/vol1
Brick2: node222:/gluster/sdb/vol1
Brick3: node223:/gluster/sdb/vol1
Brick4: node152:/gluster/sdb/vol1
Options Reconfigured:
network.ping-timeout: 4
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
In initial testing, any file operation on the mounted volume appeared to hang when a node went down. This is because Gluster's network.ping-timeout defaults to 42 seconds. This command sets a lower value:
gluster volume set vol1 network.ping-timeout 4
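The new value can be read back with volume get, and it also appears under Options Reconfigured in the gluster volume info output above:
gluster volume get vol1 network.ping-timeout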
The video below shows the four VMs running. One is writing random data to /mnt/random. The other machines are running ls -phil /mnt so we can watch the file growing. node222 is killed, and after the 4-second pause, the other nodes continue. When the node is rebooted, it automatically recovers.
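The exact commands are not visible in the video; the writer side is my reconstruction, the reader side is the ls shown above:
# on one node: stream random data into a file on the Gluster mount
dd if=/dev/urandom of=/mnt/random bs=1M status=progress
# on the other nodes, repeated to watch the file grow
ls -phil /mnt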
While I used 4 nodes, this works equally well with 3 nodes.
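With three nodes, the dispersed volume becomes a 2+1 layout; the create command would look something like this (untested here):
gluster volume create vol1 disperse 3 redundancy 1 transport tcp node{200,222,223}:/gluster/sdb/vol1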