How do I determine the health of my GlusterFS cluster? How do I determine the health of each volume I created within the GlusterFS cluster? I have a basic GlusterFS setup running with 3 GlusterFS machines.
It's set up roughly like this.
I want to check the health of the mount from the client and the health of the GlusterFS servers themselves.
Client Mount Health
My Ubuntu client has two mounted drives via FUSE.
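For context, a GlusterFS FUSE mount in /etc/fstab (the approach from [1]) looks something like this. The volume name, mount point pairing, and the backup-volfile-servers failover option (see [2]) are illustrative here, not necessarily my exact entries:
192.168.0.200:/volume-one  /volume_one_client  glusterfs  defaults,_netdev,backup-volfile-servers=192.168.0.201:192.168.0.202  0  0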
Looking around, I don't see a lot of tools to check GlusterFS remotely from the client. But thinking it through, what do I really need to test from a client anyway?
1. Is the mount valid?
2. Can I write to it?
We can use this Linux command to check the mount.
> mountpoint /volume_one_client/
That gets me my mount-valid check, so I could use this in some Nagios-type check if I wanted to set that up later.
So I could make a simple bash script for this. (I will be using basic Nagios-type output, where you print some text and exit with 0 for OK, 1 for Warning, 2 for Critical, and 3 for Unknown.)
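Just to make that convention concrete, every check below follows this basic skeleton (nothing GlusterFS specific yet, just the Nagios plugin idea):
#!/bin/bash
#Nagios style plugin convention: print one line of text, then exit with the matching code
OK=0; WARNING=1; CRITICAL=2; UNKNOWN=3
echo "OK: everything looks fine"
exit $OK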
> vi gluster_mount_valid.sh
#!/bin/bash
MOUNT_ONE="/volume_one_client/"

#Check the mount; mountpoint returns non-zero if this is not a valid mount point
MOUNT_ONE_RESULT=$(mountpoint $MOUNT_ONE 2>&1)

if [ $? -ne 0 ]; then
  echo "CRITICAL: $MOUNT_ONE_RESULT"
  exit 2
fi

#Otherwise we did return 0 and its good
echo "OK: $MOUNT_ONE_RESULT"
exit 0
Chmod it so it can be executed more easily
> chmod u+x gluster_mount_valid.sh
Test
> ./gluster_mount_valid.sh
This seems to work well, and I think I would just make one per mount point for health checks.
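Alternatively, instead of one script per mount, a single script could take the mount point as an argument. A quick sketch of that variation:
#!/bin/bash
#Same mountpoint check, but the mount to test is passed in as an argument
MOUNT="$1"
if [ -z "$MOUNT" ]; then
  echo "UNKNOWN: usage $0 <mount point>"
  exit 3
fi

RESULT=$(mountpoint "$MOUNT" 2>&1)
if [ $? -ne 0 ]; then
  echo "CRITICAL: $RESULT"
  exit 2
fi
echo "OK: $RESULT"
exit 0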
Client Write Health
How about I create a simple hidden folder on each mount named .heartbeat, then within that folder create/overwrite a file whose name identifies the computer that made it?
Hmmm, thinking this through, I would also like a history of when this occurred, so each time the script runs it could append a human-readable timestamp to the file… but only keep the last X timestamps.
Let me see if I can make this simple.
First create the .heartbeat folder by hand, not by script.
> mkdir /volume_one_client/.heartbeat
Now for the test script, in Nagios-friendly format.
> vi gluster_mount_writable.sh
#!/bin/bash
#Number of lines to keep from the heartbeat file
KEEP_LINES=10

#One heartbeat file per client, named after the machine that wrote it
HEARTBEAT_FILE="/volume_one_client/.heartbeat/$(hostname)"

ISO_8601_TIMESTAMP=$(date --iso-8601=ns)

#Capture the last X lines (leaving room for the new entry)
PRIOR_LINES=()
if [ -f "$HEARTBEAT_FILE" ]; then
  mapfile -t PRIOR_LINES < <(tail -n $((KEEP_LINES - 1)) "$HEARTBEAT_FILE")
fi

#Zero out file (this is also the actual write test)
echo -n "" > "$HEARTBEAT_FILE"
if [ $? -ne 0 ]; then
  echo "CRITICAL: could not write to $HEARTBEAT_FILE"
  exit 2
fi

#Put back the captured lines
for line in "${PRIOR_LINES[@]}"
do
  echo "$line" >> "$HEARTBEAT_FILE"
done

#Put in the latest line
echo "$ISO_8601_TIMESTAMP" >> "$HEARTBEAT_FILE"

echo "OK: heartbeat written to $HEARTBEAT_FILE"
exit 0
Chmod it so it can be executed more easily
> chmod u+x gluster_mount_writable.sh
Test
> ./gluster_mount_writable.sh
Looks like it works and only leaves behind the last 10 entries.
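A quick way to double check that (this assumes the hostname-named heartbeat file the script writes):
> wc -l /volume_one_client/.heartbeat/$(hostname)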
As for when it fails…
> sudo umount /volume_one_client
> ./gluster_mount_valid.sh
I think that works for my current needs.
GlusterFS Server Peer Health
I can run the following two commands to get some info about the peer status.
> sudo gluster peer status
> sudo gluster pool list
I think the pool list command will be better and more generic to use across all machines.
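gluster pool list prints one row per peer (including the local node) with UUID, Hostname, and State columns, so spotting anything that is not Connected is a quick filter; roughly the idea the script below automates:
> sudo gluster pool list | tail -n +2 | awk '$3 != "Connected"'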
> vi gluster_check_pool.sh
#!/bin/bash
#This has to be run with sudo or root permissions
CMD="gluster pool list"

ALL_GOOD=0
RESULT_ARRAY=()
TOTAL=0
CONNECTED=0

#Go over the pool list (skipping the header row), noting each peer's state
while read -r UUID HOSTNAME STATE; do
  RESULT_ARRAY+=("$HOSTNAME=$STATE")
  TOTAL=$((TOTAL + 1))
  if [ "$STATE" == "Connected" ]; then
    CONNECTED=$((CONNECTED + 1))
  fi
done < <($CMD | tail -n +2)

#ALL_GOOD is non-zero only if every peer in the pool is Connected
if [ $TOTAL -gt 0 ] && [ $TOTAL -eq $CONNECTED ]; then
  ALL_GOOD=1
fi

RESULT_STRING=$(printf "%s, " "${RESULT_ARRAY[@]}")

if [ $ALL_GOOD -ne 0 ]; then
  echo "OK: Pool result $RESULT_STRING"
  exit 0
else
  echo "CRITICAL: Pool result $RESULT_STRING"
  exit 2
fi
Chmod it so it can be executed more easily
> chmod u+x gluster_check_pool.sh
Test
> ./gluster_check_pool.sh
And then let me turn off glusterd on one of my servers.
> sudo systemctl stop glusterd
> ./gluster_check_pool.sh
I think that works.
GlusterFS Server Check Volumes
OK, what can I do to check the status of the volumes?
> sudo gluster volume list
will get me a list of current volumes. And with
> sudo gluster volume info volume-one
I can get specific info on a volume.
I think I need to set up two checks…
One: all volumes are in a "Started" status.
Two: specific volumes I list exist.
> vi gluster_check_volumes.sh
#!/bin/bash
#This has to be run with sudo or root permissions
CMD="gluster volume list"

VOLUME_ARRAY=()
RESULT_ARRAY=()
ALL_GOOD=0
STARTED=0

#Grab the list of volumes, one per line
mapfile -t VOLUME_ARRAY < <($CMD)

#Go over results from volume output looking for "Status" Line
for VOLUME in "${VOLUME_ARRAY[@]}"
do
  STATUS=$(gluster volume info "$VOLUME" | grep "^Status:" | awk '{print $2}')
  RESULT_ARRAY+=("$VOLUME=$STATUS")
  if [ "$STATUS" == "Started" ]; then
    STARTED=$((STARTED + 1))
  fi
done

#ALL_GOOD is non-zero only if every volume is Started
if [ ${#VOLUME_ARRAY[@]} -gt 0 ] && [ ${#VOLUME_ARRAY[@]} -eq $STARTED ]; then
  ALL_GOOD=1
fi

RESULT_STRING=$(printf "%s, " "${RESULT_ARRAY[@]}")

if [ $ALL_GOOD -ne 0 ]; then
  echo "OK: All Volumes Started result $RESULT_STRING"
  exit 0
else
  echo "CRITICAL: Not all Volumes Started result $RESULT_STRING"
  exit 2
fi
Chmod it so it can be executed more easily
> chmod u+x gluster_check_volumes.sh
Test
> ./gluster_check_volumes.sh
Looks like one of my volumes has not started. It was created but not started up.
Let me start it up.
> sudo gluster volume start replicated-volume
Hmmm, I have a little error there:
volume start: replicated-volume: failed: Volume id mismatch for brick 192.168.0.201:/data/brick1. Expected volume id d2995642-b739-40ab-9287-14e6e7b5edb1, volume id 25b17ed8-2dc2-41dd-bd2c-b59f4feee2c2 found
Let me try and fix it…
Ahh, OK. When I made this, as I was fiddling around, I tried to reuse already existing bricks. A brick can only be part of one volume. My bad…
So let me wipe this out and reset…
> sudo gluster volume delete replicated-volume
> sudo gluster volume create replicated-volume replica 2 192.168.0.200:/data/brick3 192.168.0.201:/data/brick3 192.168.0.202:/data/brick3
Hmm, that had a failure. I wonder why?
Trying the command without the replica 2:
> sudo gluster volume create replicated-volume 192.168.0.200:/data/brick3 192.168.0.201:/data/brick3 192.168.0.202:/data/brick3
volume create: replicated-volume: failed: /data/brick3 is already part of a volume
Hmm, but no volume is using /data/brick3.
> echo "glusterfs_00 glusterfs_01 glusterfs_02" | sed 's/\s/\n/g' | xargs -I{} bash -c "echo {}; ssh {} ls /data | sed 's/^/ /'"
OK, every server has a brick3 folder, probably from some leftover stuff I did while testing.
So… how best to deal with abandoned bricks?
After doing some research, it comes down to two steps.
1. Double check and make sure no volume is using the brick
> sudo gluster volume info | egrep Brick
No brick3 there so we are good.
2. Just rm the folder on every server that has it
Well, maybe be a little more anal and just rename the folder first.
> sudo mv /data/brick3 /data/delete_brick3
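Since the stale folder exists on all three servers, the same ssh/xargs trick from earlier can rename it everywhere in one go (a sketch; it assumes passwordless ssh and passwordless sudo on each box):
> echo "glusterfs_00 glusterfs_01 glusterfs_02" | sed 's/\s/\n/g' | xargs -I{} ssh {} "sudo mv /data/brick3 /data/delete_brick3"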
Then make sure nothing broke and everything is happy (if this was a real deploy I might leave it there for a few days)
Now remove it
> sudo rm -r /data/delete_brick3
OK, now I should be good to create this volume. I am running the following command:
> sudo gluster volume create replicated-volume replica 2 192.168.0.200:/data/brick3 192.168.0.201:/data/brick3 192.168.0.202:/data/brick3
OK, that failed… but why?
Let me try it without the replica 2:
> sudo gluster volume create replicated-volume 192.168.0.200:/data/brick3 192.168.0.201:/data/brick3 192.168.0.202:/data/brick3
Well, that failed too, but only because the bricks got created by the previous command. So let me just try a brick4 and see what happens.
> sudo gluster volume create replicated-volume 192.168.0.200:/data/brick4 192.168.0.201:/data/brick4 192.168.0.202:/data/brick4
OK, that worked.
Let me see its status.
> sudo gluster volume info replicated-volume
It's a Distribute type and I want it to be a Replicate type.
OK, so what is the issue? Let me look at the logs.
> sudo tail -n 100 -f /var/log/glusterfs/glusterd.log
[2023-04-14 18:26:50.187354 +0000] E [MSGID: 106558] [glusterd-volgen.c:3263:volgen_graph_build_clients] 0-glusterd: volume inconsistency: total number of bricks (3) is not divisible with number of bricks per cluster (2) in a multi-cluster setup
Hmm, more research:
https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/#types-of-volumes [3]
"Here exact copies of the data are maintained on all bricks."
So I misunderstood this.
OK, so a straight-up Replicated volume in GlusterFS language means you have one brick per replication factor, so the brick count and the replica count need to be equal.
But… if the number of bricks you use is a multiple of your replication factor, you get a "Distributed Replicated" volume.
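To make that concrete (the volume names and brick letters here are just for illustration, not volumes I am actually creating): 3 bricks with replica 3 gives a plain Replicate volume, while 6 bricks with replica 3 gives a Distributed-Replicate volume made of two replica sets.
> sudo gluster volume create rep-vol replica 3 192.168.0.200:/data/brickA 192.168.0.201:/data/brickA 192.168.0.202:/data/brickA
> sudo gluster volume create dist-rep-vol replica 3 192.168.0.200:/data/brickB 192.168.0.201:/data/brickB 192.168.0.202:/data/brickB 192.168.0.200:/data/brickC 192.168.0.201:/data/brickC 192.168.0.202:/data/brickC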
OK.
What I personally want is "Dispersed". With disperse 3 redundancy 1, the data is erasure-coded across the 3 bricks and the volume can keep running if any 1 of them is lost.
So to create what I want I would need to run this.
> sudo gluster volume create my-dispered-volume disperse 3 redundancy 1 192.168.0.200:/data/brick3 192.168.0.201:/data/brick3 192.168.0.202:/data/brick3
Success!!
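Before re-running my check, a quick way to confirm the type actually changed (same volume info trick as earlier):
> sudo gluster volume info my-dispered-volume | grep "^Type:"
which should now report Disperse instead of Distribute.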
OK now let me run my test again
> sudo ./gluster_check_volumes.sh
OK now let me start that volume
> sudo gluster volume start my-dispered-volume
> sudo ./gluster_check_volumes.sh
GlusterFS Server Check Specific Volumes
OK now I want to test a specific volume
> sudo gluster volume info volume-one
I can get specific info on a volume.
Now let's check if it exists and if it has started.
> vi gluster_check_specific_volumes.sh
#!/bin/bash
#This has to be run with sudo or root permissions
#List of volumes I want to confirm exist and are started
VOLUME_LIST=(volume-one volume-two)

RESULT_ARRAY=()
ALL_GOOD_EXIST=0
ALL_GOOD_STOP=0
EXIST_COUNT=0
STARTED_COUNT=0

for VOLUME in "${VOLUME_LIST[@]}"
do
  #Does the volume exist at all?
  if gluster volume list | grep -q "^${VOLUME}$"; then
    EXIST_COUNT=$((EXIST_COUNT + 1))
    #It exists, now check whether it is started
    STATUS=$(gluster volume info "$VOLUME" | grep "^Status:" | awk '{print $2}')
    RESULT_ARRAY+=("$VOLUME=$STATUS")
    if [ "$STATUS" == "Started" ]; then
      STARTED_COUNT=$((STARTED_COUNT + 1))
    fi
  else
    RESULT_ARRAY+=("$VOLUME=missing")
  fi
done

#Flags are non-zero only if every listed volume exists / is Started
[ $EXIST_COUNT -eq ${#VOLUME_LIST[@]} ] && ALL_GOOD_EXIST=1
[ $STARTED_COUNT -eq ${#VOLUME_LIST[@]} ] && ALL_GOOD_STOP=1

RESULT_STRING=$(printf "%s, " "${RESULT_ARRAY[@]}")

if [ $ALL_GOOD_EXIST -ne 0 ]; then
  if [ $ALL_GOOD_STOP -ne 0 ]; then
    echo "OK: $RESULT_STRING"
    exit 0
  else
    echo "CRITICAL: Volume(s) not Started $RESULT_STRING"
    exit 2
  fi
else
  echo "CRITICAL: Volume(s) missing $RESULT_STRING"
  exit 2
fi
Chmod it so it can be executed more easily
> chmod u+x gluster_check_specific_volumes.sh
Test
> sudo ./gluster_check_specific_volumes.sh
OK I think that gives me enough checks for now to be happy.
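And if I do eventually wire these into Nagios like I mentioned at the start, the NRPE side might look roughly like this. The command names, script locations, and the sudoers line are assumptions about a future setup, not something I have configured yet.
#Hypothetical entries in nrpe.cfg (or a drop-in under nrpe.d/)
command[check_gluster_mount_valid]=/usr/local/bin/gluster_mount_valid.sh
command[check_gluster_mount_writable]=/usr/local/bin/gluster_mount_writable.sh
command[check_gluster_pool]=sudo /usr/local/bin/gluster_check_pool.sh
command[check_gluster_volumes]=sudo /usr/local/bin/gluster_check_volumes.sh
#The server-side checks need root, so the nrpe/nagios user would also need a sudoers entry along these lines:
#nagios ALL=(root) NOPASSWD: /usr/local/bin/gluster_check_pool.sh, /usr/local/bin/gluster_check_volumes.sh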
References
[1] GlusterFS /etc/fstab mount options, GlusterFS/NFS testing in Ubuntu 22.04
http://www.whiteboardcoder.com/2023/03/glusterfs-etcfstab-mount-options.html
Accessed 03/2023
[2] GlusterFS how to failover (smartly) if a mounted server has failed?
https://unix.stackexchange.com/questions/213705/glusterfs-how-to-failover-smartly-if-a-mounted-server-is-failed
Accessed 03/2023
[3] Replicated GlusterFS Volume
https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/#types-of-volumes
Accessed 03/2023