Gluster is a free and open-source network filesystem. It has been a few years since I last looked at the project, and I was interested in taking another look. Some features, like automatic tiering of hot/cold data, have been removed, and the developers now recommend dm-cache with LVM instead.
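As an aside, the dm-cache route they point to is driven through LVM. A minimal sketch, assuming an existing volume group vg0 with a slow logical volume data and a fast device /dev/nvme0n1 (all names here are invented for illustration):
# carve a cache pool out of the fast device and attach it to the slow LV
lvcreate --type cache-pool -L 50G -n fastcache vg0 /dev/nvme0n1
lvconvert --type cache --cachepool vg0/fastcache vg0/data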
I am going to use four QEMU VMs on which I have installed Ubuntu via PXE boot. For easy repetition, I have wrapped my qemu-system-x86_64 commands into a Makefile.
machine: disk0.qcow2 disk1.qcow2 OVMF_VARS.fd
	qemu-system-x86_64 -m 8G -smp 4 -machine accel=kvm,type=pc -cpu host -display none -vnc :11 \
	  -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
	  -drive if=pflash,format=raw,file=OVMF_VARS.fd \
	  -serial stdio \
	  -device virtio-scsi-pci,id=scsi0 \
	  -device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \
	  -drive file=disk0.qcow2,if=none,id=drive0 \
	  -device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=1,lun=0 \
	  -drive file=disk1.qcow2,if=none,id=drive1 \
	  -net nic,model=virtio-net-pci,macaddr=02:00:00:00:00:11 \
	  -net bridge,br=br0

disk%.qcow2:
	qemu-img create -f qcow2 $@ 1T

OVMF_VARS.fd:
	cp /usr/share/OVMF/OVMF_VARS.fd OVMF_VARS.fd

clean:
	rm -f *.qcow2 OVMF_VARS.fd
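Running make machine creates the disk images and the UEFI variable store if they are missing, then boots the VM. I assume each of the four VMs gets its own copy of this Makefile with a different MAC address and VNC display number:
make machine   # builds disk0.qcow2, disk1.qcow2 and OVMF_VARS.fd as needed, then boots the VM
make clean     # removes the disk images and firmware variables to start over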
Gluster works on any file system that supports extended attributes (xattr), which includes ext[2-4]. However, XFS is typically used as it performs well with parallel read/write operations and large files. I have used 512-byte inodes, -i size=512, which is recommended as it creates extra space for the extended attributes.
mkfs.xfs -i size=512 /dev/sdb
mkdir -p /gluster/sdb
echo "/dev/sdb /gluster/sdb xfs defaults 0 0" >> /etc/fstab
mount -a
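To double-check that the 512-byte inodes took effect, xfs_info on the mounted filesystem reports the inode size:
# look for isize=512 in the meta-data line
xfs_info /gluster/sdb | grep isize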
With the filesystem prepared, install and start Gluster. Gluster stores its settings in /var/lib/glusterd, so if you need to reset your installation, stop the gluster daemon and remove that directory.
apt install glusterfs-server
systemctl enable glusterd
systemctl start glusterd
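If you later need the reset described above, it comes down to something like this; it wipes all Gluster configuration on the node, so only do it when you really want to start over:
systemctl stop glusterd
rm -rf /var/lib/glusterd
systemctl start glusterd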
From one node, probe all the other nodes. You can do this by IP address or by hostname.
gluster peer probe node222
gluster peer probe node200
gluster peer probe node152
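Probing by hostname assumes the names resolve on every node; without DNS, entries in /etc/hosts are enough. The addresses below are invented for illustration:
# example /etc/hosts entries on each node (addresses are hypothetical)
192.168.1.200  node200
192.168.1.222  node222
192.168.1.223  node223
192.168.1.152  node152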
gluster pool list should now list all the nodes. localhost indicates your current host.
UUID Hostname State
8d2a1ef0-4c23-4355-9faa-8f3387054d41 node222 Connected
4078f192-b2bb-4c74-a588-35d5475dedc7 node200 Connected
5b2fc21b-b0ab-401e-9848-3973121bfec7 node152 Connected
d5878850-0d40-4394-8dd8-b9b0d4266632 localhost Connected
Now we need to add a volume. A Gluster volume can be distributed, replicated, or dispersed. It is possible to mix distributed with the other two types, giving a distributed replicated volume or a distributed dispersed volume. Briefly: distributed splits the data across the nodes without redundancy but gives a performance advantage; replicated creates two or more copies of the data; dispersed uses erasure coding, which can be thought of as RAID5 across nodes.
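For comparison, this is roughly what creating the other types looks like; the volume names and the two extra nodes in the second command are invented, and I did not run these:
# replicated: every file stored in full on all three bricks
gluster volume create volr replica 3 node{200,222,223}:/gluster/sdb/volr
# distributed dispersed: two 2+1 erasure-coded sets, with files distributed between them
gluster volume create vold disperse 3 redundancy 1 node{200,222,223,152,153,154}:/gluster/sdb/vold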
Once a volume has been created, it needs to be started. The commands to create and start the volume only need to be executed on one of the nodes.
gluster volume create vol1 disperse 4 transport tcp node{200,222,223,152}:/gluster/sdb/vol1
gluster volume start vol1
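Running disperse 4 without specifying redundancy leaves Gluster to choose it (and prompt for confirmation), which here works out as 3 data bricks plus 1 redundancy brick. If you would rather be explicit, I believe the equivalent is:
# same 3+1 layout with the split spelled out
gluster volume create vol1 disperse-data 3 redundancy 1 transport tcp node{200,222,223,152}:/gluster/sdb/vol1
gluster volume start vol1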
On each node, or on a remote machine, you can now mount the Gluster volume. Here I have mounted it to /mnt from the node itself. All writes to /mnt will be dispersed across the nodes.
echo "localhost:/vol1 /mnt glusterfs defaults 0 0" >> /etc/fstab
mount -a
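A quick check that the mount is live; with a 3+1 dispersed layout over 1 TB bricks, the usable size reported should be roughly 3 TB:
df -h /mnt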
The volume can be inspected with gluster volume info.
Volume Name: vol1
Type: Disperse
Volume ID: 31e165b2-da96-40b2-bc09-e4607a02d14b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (3 + 1) = 4
Transport-type: tcp
Bricks:
Brick1: node200:/gluster/sdb/vol1
Brick2: node222:/gluster/sdb/vol1
Brick3: node223:/gluster/sdb/vol1
Brick4: node152:/gluster/sdb/vol1
Options Reconfigured:
network.ping-timeout: 4
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
In initial testing, any file operation on the mounted volume appeared to hang when a node went down. This is because Gluster's network.ping-timeout defaults to 42 seconds. This command sets a lower value:
gluster volume set vol1 network.ping-timeout 4
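The new value can be read back with volume get, and it also appears under Options Reconfigured in the gluster volume info output above:
gluster volume get vol1 network.ping-timeout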
The video below shows the four VMs running. One is writing random data to /mnt/random. The other machines are running ls -phil /mnt so we can watch the file growing. node222 is killed, and after the 4-second pause, the other nodes continue. When the node is rebooted, it automatically recovers.
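The exact commands are not visible in the video; the writer side is my reconstruction, the reader side is the ls shown above:
# on one node: stream random data into a file on the Gluster mount
dd if=/dev/urandom of=/mnt/random bs=1M status=progress
# on the other nodes, repeated to watch the file grow
ls -phil /mnt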
While I used 4 nodes, this works equally well with 3 nodes.
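With three nodes, the dispersed volume becomes a 2+1 layout; the create command would look something like this (untested here):
gluster volume create vol1 disperse 3 redundancy 1 transport tcp node{200,222,223}:/gluster/sdb/vol1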