GlusterFS Failure Simulation
1. Disk Failure
If the disks sit on a RAID array, simply replace the failed disk.
If there is no RAID, the recovery procedure is:
- On a healthy node, run gluster volume status and record the failed node's UUID.
- On a healthy node, run getfattr -d -m '.*' /brick and record the values of trusted.glusterfs.volume-id and trusted.gfid.
- On the replacement brick, re-apply the recorded values:
  setfattr -n trusted.glusterfs.volume-id -v <recorded value> <brick path>
  setfattr -n trusted.gfid -v <recorded value> <brick path>
- Copy over the .glusterfs directory.
- Restart glusterd.
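The steps above can be sketched as a shell script. The volume name gv1 and brick path /storage/brick2 match the session below; adjust for your layout. RUN=echo makes this a dry run that only prints each command, so nothing is executed until you explicitly set RUN to empty:

```shell
#!/bin/sh
# Sketch of restoring a freshly formatted brick after a non-RAID disk
# failure. Dry run by default: each command is only echoed.
RUN=${RUN:-echo}
BRICK=/storage/brick2

# 1. On a healthy node: identify the failed brick and the node's UUID.
$RUN gluster volume status gv1
$RUN gluster peer status

# 2. On a healthy node: dump the brick's extended attributes and record
#    trusted.glusterfs.volume-id and trusted.gfid.
$RUN getfattr -d -m '.*' -e base64 $BRICK

# 3. On the replacement brick: re-apply the two recorded values
#    (placeholders below; the 0s prefix marks base64 encoding).
$RUN setfattr -n trusted.glusterfs.volume-id -v '0s<recorded-value>' $BRICK
$RUN setfattr -n trusted.gfid -v '0s<recorded-value>' $BRICK

# 4. Restart glusterd so the brick process comes back online.
$RUN systemctl restart glusterd
```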
Failure simulation
Remove a disk from the virtual machine, then attach a new one.
[root@node2 gv1]# gluster volume status
Status of volume: gv1
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick node1:/storage/brick2 49153 0 Y 19812
Brick node2:/storage/brick2 N/A N/A N N/A
Brick node1:/storage/brick1 49154 0 Y 19834
Brick node2:/storage/brick1 49154 0 Y 19103
Self-heal Daemon on localhost N/A N/A Y 19126
Self-heal Daemon on node1 N/A N/A Y 19857
Task Status of Volume gv1
------------------------------------------------------------------------------
There are no active volume tasks
[root@node2 gv1]# mkfs.xfs -f /dev/sdc
meta-data=/dev/sdc isize=512 agcount=4, agsize=131072 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=524288, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@node2 gv1]# mount -a
[root@node2 gv1]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 17G 1.9G 16G 11% /
devtmpfs 485M 0 485M 0% /dev
tmpfs 496M 0 496M 0% /dev/shm
tmpfs 496M 7.1M 489M 2% /run
tmpfs 496M 0 496M 0% /sys/fs/cgroup
/dev/sda1 1014M 130M 885M 13% /boot
tmpfs 100M 0 100M 0% /run/user/0
/dev/sdb 2.0G 35M 2.0G 2% /storage/brick1
127.0.0.1:/gv1 4.0G 110M 3.9G 3% /mnt/gv1
/dev/sdc 2.0G 33M 2.0G 2% /storage/brick2
[root@node2 gv1]# ls /storage/brick2
[root@node1 brick2]# getfattr -d -m '.*' /storage/brick2
getfattr: Removing leading '/' from absolute path names
# file: storage/brick2
trusted.afr.dirty=0sAAAAAAAAAAAAAAAA
trusted.afr.gv1-client-3=0sAAAAAAAAAAAAAAAF
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAAf////g==
trusted.glusterfs.dht.commithash="3688746489"
trusted.glusterfs.volume-id=0sTGg/nUxsS+eaTMeppJ3aRw==
setfattr -n trusted.afr.dirty -v 0sAAAAAAAAAAAAAAAA /storage/brick2
setfattr -n trusted.afr.gv1-client-3 -v 0sAAAAAAAAAAAAAAAF /storage/brick2
setfattr -n trusted.gfid -v 0sAAAAAAAAAAAAAAAAAAAAAQ== /storage/brick2
setfattr -n trusted.glusterfs.dht -v 0sAAAAAQAAAAAAAAAAf////g== /storage/brick2
setfattr -n trusted.glusterfs.dht.commithash -v "3688746489" /storage/brick2
setfattr -n trusted.glusterfs.volume-id -v 0sTGg/nUxsS+eaTMeppJ3aRw== /storage/brick2
==You do not have to set all of these; setting only trusted.glusterfs.volume-id and trusted.gfid is enough==
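The 0s prefix on these values marks base64 encoding (the form getfattr prints with -e base64, and which setfattr accepts back verbatim). Stripping the prefix and decoding shows the raw bytes; for trusted.glusterfs.volume-id they are the 16-byte volume UUID:

```shell
# Decode the 0s-prefixed base64 value recorded from the healthy brick;
# for trusted.glusterfs.volume-id this yields the 16-byte volume UUID.
val='0sTGg/nUxsS+eaTMeppJ3aRw=='
echo "${val#0s}" | base64 -d | od -An -tx1
```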
[root@node1 brick2]# getfattr -d -m '.*' /storage/brick2
getfattr: Removing leading '/' from absolute path names
# file: storage/brick2
trusted.afr.dirty=0sAAAAAAAAAAAAAAAA
trusted.afr.gv1-client-3=0sAAAAAAAAAAAAAAAF
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAAf////g==
trusted.glusterfs.dht.commithash="3688746489"
trusted.glusterfs.volume-id=0sTGg/nUxsS+eaTMeppJ3aRw==
[root@node1 brick2]# systemctl restart glusterd
Repeated retries of the above steps failed to recover the brick:
On top of those steps, copy the /storage/brick2/.glusterfs directory from node1 to the corresponding directory on node2, then run
systemctl restart glusterd
Recovery then succeeded.
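The final fix, copying .glusterfs from the healthy replica, might look like this on node2. The original note does not name a copy tool; rsync is assumed here because -aHX preserves permissions, hard links, and extended attributes, all of which .glusterfs depends on. Dry run by default:

```shell
#!/bin/sh
# Copy the .glusterfs metadata directory from the healthy replica on
# node1, then restart glusterd. RUN=echo keeps this a dry run.
RUN=${RUN:-echo}
SRC=node1
BRICK=/storage/brick2

# -a: archive mode; -H: preserve hard links; -X: preserve xattrs.
$RUN rsync -aHX "$SRC:$BRICK/.glusterfs/" "$BRICK/.glusterfs/"
$RUN systemctl restart glusterd

# Optionally kick off a full self-heal afterwards:
$RUN gluster volume heal gv1 full
```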
2. Host Failure
- Find an identical machine (at minimum the number and size of its disks must match), install the OS, configure the same IP as the failed machine, and install the Gluster packages with identical configuration. On a healthy node, run gluster peer status and note the failed server's UUID.
- Edit /var/lib/glusterd/glusterd.info on the new machine so that its UUID matches the failed machine's.
- Prepare the disks on the new machine (same procedure as for a disk failure).
- Join the new machine to the cluster: gluster peer probe node2
- Restart glusterd: systemctl restart glusterd
- Sync the volume configuration: gluster volume sync master all (it is untested whether this step can be skipped)
- Restart glusterd: systemctl restart glusterd
- On any node, run gluster volume heal gv1 full
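The host-replacement steps can be sketched end to end. The UUID value is a placeholder, and the sed edit assumes glusterd.info carries a UUID=... line (which it does on stock installs); node1 stands in for the healthy source host. Dry run by default:

```shell
#!/bin/sh
# Sketch of replacing failed host node2 with identical hardware.
# RUN=echo keeps this a dry run; the UUID below is a placeholder.
RUN=${RUN:-echo}
FAILED_UUID='<uuid-recorded-from-gluster-peer-status>'

# 1. On a healthy node: look up the failed server's UUID.
$RUN gluster peer status

# 2. On the rebuilt machine: reuse the old UUID so the pool recognises
#    it as the same peer (glusterd.info has a UUID=... line).
$RUN sed -i "s/^UUID=.*/UUID=$FAILED_UUID/" /var/lib/glusterd/glusterd.info

# 3. Prepare the bricks as in the disk-failure procedure, then rejoin
#    the pool and restart glusterd.
$RUN gluster peer probe node2
$RUN systemctl restart glusterd

# 4. Sync the volume configuration from a healthy node, restart
#    glusterd again, then trigger a full self-heal from any node.
$RUN gluster volume sync node1 all
$RUN systemctl restart glusterd
$RUN gluster volume heal gv1 full
```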