1 What is a PG?
Put simply, a PG (Placement Group) is a directory on the OSD.
Below I will explain PGs step by step.
I prepared three nodes (admin, node1, node2) with 6 OSDs in total, and min_size=2. min_size=2 means a PG must keep at least 2 live replicas; with fewer than that, it stops serving client I/O.
[root@admin ceph]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_OK
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e53: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v373: 128 pgs, 2 pools, 330 bytes data, 5 objects
30932 MB used, 25839 MB / 56772 MB avail
128 active+clean
[root@admin ceph]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 up 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
[root@admin ~]# ceph osd pool get rbd min_size
min_size: 2
[root@admin ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
56772M 25839M 30932M 54.48
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 216 0 7382M 1
test-pool 1 114 0 7382M 4
In the ceph df output above we can see that the ID of the rbd pool is 0, so the PGs in the rbd pool all have names starting with 0. Next let's find out which OSDs the PGs in the rbd pool live on, and what their names actually are.
First, use rados to upload a file to the rbd pool:
# test.txt below is the file I will upload
[root@admin tmp]# cat test.txt
abc
123
ABC
# Upload it to the rbd pool as an object named wzl
[root@admin tmp]# rados -p rbd put wzl ./test.txt
# Find which OSDs the object wzl is mapped to
[root@admin tmp]# ceph osd map rbd wzl
osdmap e53 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,4,3], p5) acting ([5,4,3], p5)
See: the object wzl uploaded to the rbd pool maps to PG 0.d, which is placed on OSDs 5, 4 and 3.
Now let's go find these PGs on disk.
# On the admin node, under osd.3
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/ |grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:33 0.d_TEMP
# On node1, under osd.4
[root@node1 ~]# ll /var/lib/ceph/osd/ceph-4/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:34 0.d_TEMP
# On node2, under osd.5
[root@node2 ~]# ll /var/lib/ceph/osd/ceph-5/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:34 0.d_TEMP
As you can see, the PG directory on all three OSDs (5, 4 and 3) is named 0.d_head. That is what three-way replication means: every replica of a PG carries the same name.
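The object-to-PG mapping that `ceph osd map` printed above (hash 0.ff62cf8d folded down to PG 0.d) can be sketched as below. This is a simplified illustration under assumptions: the real pipeline uses the rjenkins hash plus Ceph's stable_mod, which for a power-of-two pg_num reduces to a plain modulo, and pool 'rbd' here is assumed to hold 64 of the cluster's 128 PGs.

```python
# Sketch: how the object hash 0xff62cf8d from `ceph osd map rbd wzl`
# becomes PG id 0.d. Assumptions: pool 'rbd' has pg_num=64 (128 PGs over
# 2 pools), and for a power-of-two pg_num Ceph's stable_mod is equivalent
# to a plain modulo (a bit mask of the low bits of the hash).
def pg_for_hash(obj_hash: int, pool_id: int, pg_num: int) -> str:
    return f"{pool_id}.{obj_hash % pg_num:x}"

print(pg_for_hash(0xff62cf8d, 0, 64))  # -> 0.d, matching the command output
```

CRUSH then maps PG 0.d to the acting set of OSDs ([5,4,3] above); that step depends on the cluster map and is not reproduced here.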
2 What to do when a PG is damaged or lost
First, let's go over PG states.
Degraded
Put simply, degraded means the PG has run into trouble but can still serve client I/O. Let's test it:
As noted above, the PG for the wzl object sits on osd.3, osd.4 and osd.5. What happens if osd.3 goes down?
[root@admin tmp]# systemctl stop [email protected]
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 down 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
66 pgs degraded
66 pgs stuck unclean
66 pgs undersized
recovery 2/18 objects degraded (11.111%)
1/6 in osds are down
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e55: 6 osds: 5 up, 6 in; 66 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v423: 128 pgs, 2 pools, 138 bytes data, 6 objects
30932 MB used, 25839 MB / 56772 MB avail
2/18 objects degraded (11.111%)
66 active+undersized+degraded
62 active+clean
As shown above, after I stopped osd.3, 66 PGs went into the active+undersized+degraded state, i.e. degraded. Let's see whether we can still download the file uploaded earlier:
[root@admin tmp]# rados -p rbd get wzl wzl.txt
[root@admin tmp]# cat wzl.txt
abc
123
ABC
So although the cluster is unhealthy with 66 degraded PGs, it can still serve client I/O.
Peered (critically wounded)
Above we shut down osd.3, leaving two copies of PG 0.d in the cluster, on osd.4 and osd.5. Let's look at the state of PG 0.d now:
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 1 0 0 12 1 1 active+undersized+degraded 2018-02-09 20:35:27.585529 53'1 71:71 [5,4] 5 [5,4] 5 0'0 2018-02-09 01:37:29.127711 0'0 2018-02-09 01:37:29.127711
PG 0.d now lives only on osd.5 and osd.4, in state active+undersized+degraded.
Now let's also stop osd.4 and look at the PG state:
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
99 pgs degraded
19 pgs stuck unclean
99 pgs undersized
recovery 7/18 objects degraded (38.889%)
2/6 in osds are down
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e97: 6 osds: 4 up, 6 in; 99 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v585: 128 pgs, 2 pools, 138 bytes data, 6 objects
30942 MB used, 25829 MB / 56772 MB avail
7/18 objects degraded (38.889%)
62 active+undersized+degraded
37 undersized+degraded+peered
29 active+clean
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 2 0 0 12 1 1 undersized+degraded+peered 2018-02-09 20:42:08.558726 53'1 81:105 [5] 5 [5] 5 0'0 2018-02-09 01:37:29.127711 0'0 2018-02-09 01:37:29.127711
With two OSDs stopped we have dropped below the min_size limit, so PG 0.d's state is now undersized+degraded+peered: client I/O to it is blocked. The dump above also shows that only one copy of 0.d survives, on osd.5.
Now let's lower min_size to 1:
[root@admin tmp]# ceph osd pool set rbd min_size 1
set pool 0 min_size to 1
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 2 0 0 12 1 1 active+undersized+degraded 2018-02-09 20:59:03.684989 53'1 99:163 [5] 5 [5] 5 0'0 2018-02-09 01:37:29.127711 0'0 2018-02-09 01:37:29.127711
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
99 pgs degraded
19 pgs stuck unclean
99 pgs undersized
recovery 7/18 objects degraded (38.889%)
2/6 in osds are down
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e99: 6 osds: 4 up, 6 in; 99 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v594: 128 pgs, 2 pools, 138 bytes data, 6 objects
30942 MB used, 25829 MB / 56772 MB avail
7/18 objects degraded (38.889%)
79 active+undersized+degraded
29 active+clean
20 undersized+degraded+peered
With min_size reset to 1, the PG is no longer peered but active+undersized+degraded, cluster health is back to WARN, and it can serve client I/O again.
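The behaviour of the last two sections can be condensed into one rule: a PG accepts client I/O only while its live replica count is at least the pool's min_size. The sketch below is a simplified model of that rule, not Ceph's actual peering state machine.

```python
# Simplified model of how the live replica count, pool size and min_size
# drive a PG's state. Illustration only; Ceph's real logic is richer.
def pg_state(live_replicas: int, size: int = 3, min_size: int = 2) -> str:
    if live_replicas == 0:
        return "down"
    if live_replicas < min_size:
        return "undersized+degraded+peered"   # client I/O blocked
    if live_replicas < size:
        return "active+undersized+degraded"   # degraded, but still serving
    return "active+clean"

print(pg_state(2))              # osd.3 down: degraded but active
print(pg_state(1))              # osd.3 and osd.4 down: peered, I/O blocked
print(pg_state(1, min_size=1))  # after lowering min_size: active again
```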
Remapped (self-healing)
Ceph has a powerful self-healing ability. If an OSD stays down for more than 300 seconds, the cluster decides it is not coming back, marks it out, and starts copying data from the surviving replicas onto the remaining OSDs to restore redundancy. That is self-healing.
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 down 0 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 down 0 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
From ceph osd tree above: during the first 300 seconds after osd.3 and osd.4 were stopped, their state was down but their REWEIGHT was still 1.00000, meaning they still counted as cluster members. After 300 seconds without recovery, the cluster evicted them: they were marked out and their reweight set to 0. At that point the cluster starts recovering from the only surviving copy:
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_ERR
clock skew detected on mon.node1
22 pgs are stuck inactive for more than 300 seconds
19 pgs degraded
66 pgs peering
7 pgs stuck degraded
22 pgs stuck inactive
88 pgs stuck unclean
7 pgs stuck undersized
19 pgs undersized
recovery 1/18 objects degraded (5.556%)
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e106: 6 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v609: 128 pgs, 2 pools, 138 bytes data, 6 objects
20636 MB used, 16699 MB / 37336 MB avail
1/18 objects degraded (5.556%)
44 remapped+peering
40 active+clean
22 peering
18 active+undersized+degraded
3 activating
1 activating+undersized+degraded
See, 44 PGs are now in the remapped+peering state (badly hurt but recovering). Once recovery finishes, I check again:
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e107: 6 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v634: 128 pgs, 2 pools, 138 bytes data, 6 objects
20637 MB used, 16698 MB / 37336 MB avail
128 active+clean
[root@admin tmp]# ceph osd map rbd wzl
osdmap e107 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,1,0], p5) acting ([5,1,0], p5)
See, the PGs have fully recovered, cluster health is restored, and PG 0.d for the object wzl has been remapped onto osd.5, osd.1 and osd.0.
Note: if we bring the previously stopped OSDs back up, the PGs move back to the OSDs they were originally placed on, and the copies created during the remap are deleted. Let's take a look:
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 up 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
[root@admin tmp]# ceph osd map rbd wzl
osdmap e113 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,4,3], p5) acting ([5,4,3], p5)
We can see that once the OSDs came back, the PGs returned to their original placement: PG 0.d is back on osd.5, osd.4 and osd.3.
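The down-then-out timeline this section relies on can be sketched as follows. The 300-second threshold is the value used in this article; treat it as an assumption (Ceph exposes it as the mon_osd_down_out_interval option, and the default can differ between releases).

```python
# Sketch of the down -> out transition that triggers self-healing.
# 300 s is the interval used in this article; treat it as configurable
# (mon_osd_down_out_interval), not a universal constant.
DOWN_OUT_INTERVAL = 300  # seconds an OSD may stay down before being evicted

def osd_membership(seconds_down: int) -> tuple[str, float]:
    """Return (state, reweight) for an OSD that has been down this long."""
    if seconds_down == 0:
        return ("up", 1.0)
    if seconds_down <= DOWN_OUT_INTERVAL:
        return ("down", 1.0)   # still "in": the cluster waits for it
    return ("out", 0.0)        # evicted: its PGs get remapped and backfilled

print(osd_membership(60))   # within the grace period: down but still in
print(osd_membership(400))  # past the interval: out, self-healing kicks in
```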
Recover (overwrite)
If the OSDs holding a PG's three replicas are all up, but part of a replica is lost or corrupted, the cluster will notice that the replicas of this PG are inconsistent and simply copy a good replica over the bad one. Let's run an experiment:
Delete the 0.d PG directory on osd.3 directly.
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:33 0.d_TEMP
[root@admin tmp]# rm -rf /var/lib/ceph/osd/ceph-3/current/0.d_head/
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:33 0.d_TEMP
The PG 0.d under osd.3 is now deleted; next, tell the cluster to scrub it.
# Scrub this PG
[root@admin tmp]# ceph pg scrub 0.d
instructing pg 0.d on osd.5 to scrub
# Check the PG state again
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 0 0 0 12 1 1 active+clean+inconsistent 2018-02-09 21:31:09.239568 53'1 113:212 [5,4,3] 5 [5,4,3] 5 53'1 2018-02-09 21:31:09.239177 0'0 2018-02-09 01:37:29.127711
The state now includes inconsistent: the cluster has found that the three replicas of this PG disagree.
Running ceph pg repair 0.d repairs it: Ceph copies a good replica over from another OSD.
# Inconsistent PG; the cluster is unhealthy
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_ERR
clock skew detected on mon.node1
1 pgs inconsistent
1 scrub errors
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e113: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v667: 128 pgs, 2 pools, 138 bytes data, 6 objects
30943 MB used, 25828 MB / 56772 MB avail
127 active+clean
1 active+clean+inconsistent
# Repair: Ceph copies over a good replica
[root@admin tmp]# ceph pg repair 0.d
instructing pg 0.d on osd.5 to repair
# Cluster health restored
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e113: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v669: 128 pgs, 2 pools, 138 bytes data, 6 objects
30943 MB used, 25828 MB / 56772 MB avail
128 active+clean
recovery io 0 B/s, 0 objects/s
# PG state recovered
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 0 0 0 12 1 1 active+clean 2018-02-09 21:34:33.338065 53'1 113:220 [5,4,3] 5 [5,4,3] 5 53'1 2018-02-09 21:34:33.321122 53'1 2018-02-09 21:34:33.321122
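The consistency check that scrub performs can be sketched conceptually as below. This is an illustration under assumptions: real Ceph scrub compares per-object metadata and checksums between replicas, while here we simply hash each replica's bytes and compare the digests.

```python
import hashlib

# Conceptual sketch of scrub: a PG is consistent when every replica of an
# object carries the same digest. (Real Ceph scrub is far more detailed:
# it walks object metadata and, in deep scrub, data checksums.)
def scrub_consistent(replicas: dict[int, bytes]) -> bool:
    digests = {hashlib.sha256(data).hexdigest() for data in replicas.values()}
    return len(digests) == 1

good = b"abc\n123\nABC\n"                             # the content of wzl above
print(scrub_consistent({5: good, 4: good, 3: good}))  # True: all copies match
print(scrub_consistent({5: good, 4: good, 3: b""}))   # False: osd.3's copy lost
```

When the check fails, the PG is flagged inconsistent; repair then overwrites the bad replica with a good one, just as `ceph pg repair 0.d` did above.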