==Ceph CRUSH Performance Tuning==
The CRUSH algorithm computes data storage locations to determine how to store and retrieve data. CRUSH allows Ceph clients to communicate with OSDs directly, rather than through a central server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
CRUSH requires a map of the cluster, and uses that map to pseudo-randomly store and retrieve data across the cluster's OSDs.
A CRUSH map contains a list of OSDs, a list of "buckets" that aggregate devices into physical locations, and a list of rules that tell CRUSH how to replicate data in the cluster's pools. By reflecting the underlying physical organization of the installation, CRUSH can model, and thereby address, potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices in different racks, different shelves, different power supplies, different controllers, or even different physical locations.
==Reliability analysis==
Definitions:
- PG: a placement group holds a set of data slices; a file's data is sharded and distributed across PGs
- Replica: the total number of copies of a PG
- Placement domain: the largest region in which a PG's replicas may be placed; one placement domain contains all replicas of a PG
- Failure domain: the largest region within a placement domain in which a given replica exists exactly once; a failure domain holds only one replica of a PG
Factors affecting Ceph's reliability include:
- number of replicas (N)
- size of the failure domain (S)
- failure recovery time (T)
- number of placement domains (R)
- disk failure probability (P)
P/(N*S*T*R) # rough reliability estimate
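To make the estimate concrete, here is a minimal sketch of the calculation; every input value below is a hypothetical example, not a measurement from this cluster:

```shell
# Hypothetical inputs: disk failure probability P, replica count N,
# failure-domain size S, recovery time T, placement-domain count R.
P=0.04 N=3 S=2 T=1 R=2
# Evaluate P / (N * S * T * R) with awk, since the shell lacks floats.
awk -v p="$P" -v n="$N" -v s="$S" -v t="$T" -v r="$R" \
    'BEGIN { printf "%.6f\n", p / (n * s * t * r) }'
```

With these numbers the estimate comes out to 0.003333. Treat it as a relative indicator for comparing CRUSH layouts, not an absolute probability.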
==Steps to edit a CRUSH map==
- Get the CRUSH map;
- Decompile the CRUSH map;
- Edit at least one device, bucket, or rule;
- Recompile the CRUSH map;
- Inject the CRUSH map.
Configure ceph.conf:
osd crush update on start = false # to manage the CRUSH map entirely by hand, the ceph-crush-location hook must be disabled
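In context, the option goes into ceph.conf like this (a sketch; placing it under the [osd] section is an assumption — it can also live under [global]):

```ini
[osd]
# Keep OSDs from rewriting their own CRUSH location on startup,
# so manual edits to the map survive daemon restarts.
osd crush update on start = false
```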
Current Ceph cluster information
Cluster status:
[root@i-91A9F186 ~]# ceph -s
cluster 92cc47e8-bd9f-4ec9-a861-6a20784da190
health HEALTH_OK
monmap e1: 3 mons at {0=10.202.131.33:6789/0,1=10.202.131.195:6789/0,2=10.202.131.206:6789/0}
election epoch 8, quorum 0,1,2 0,1,2
fsmap e7: 1/1/1 up {0=1=up:active}, 2 up:standby
osdmap e35: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v4614: 90 pgs, 6 pools, 2084 bytes data, 23 objects
30920 MB used, 269 GB / 299 GB avail
90 active+clean
[root@i-91A9F186 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.29279 root default
-2 0.09760 host i-8E041728
0 0.04880 osd.0 up 1.00000 1.00000
1 0.04880 osd.1 up 1.00000 1.00000
-3 0.09760 host i-91A9F186
2 0.04880 osd.2 up 1.00000 1.00000
3 0.04880 osd.3 up 1.00000 1.00000
-4 0.09760 host i-03C020FE
4 0.04880 osd.4 up 1.00000 1.00000
5 0.04880 osd.5 up 1.00000 1.00000
==CRUSH map commands==
ceph osd getcrushmap -o crush.source # get the CRUSH map and save it as crush.source
crushtool -d crush.source -o crush.dp # decompile the CRUSH map into crush.dp
crushtool -c crush.dp -o crush.cp # compile the edited CRUSH map
ceph osd setcrushmap -i crush.cp # inject the compiled CRUSH map into the cluster
Original cluster CRUSH map:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host i-8E041728 {
id -2 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.049
item osd.1 weight 0.049
}
host i-91A9F186 {
id -3 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.049
item osd.3 weight 0.049
}
host i-03C020FE {
id -4 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.049
item osd.5 weight 0.049
}
root default {
id -1 # do not change unnecessarily
# weight 0.293
alg straw
hash 0 # rjenkins1
item i-8E041728 weight 0.098
item i-91A9F186 weight 0.098
item i-03C020FE weight 0.098
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
==CRUSH map sections==
A CRUSH map has four main sections.
- Devices: consist of any object storage device, i.e. the storage drive corresponding to a ceph-osd daemon. There should be one device in the CRUSH map for each OSD defined in the Ceph configuration file.
To map placement groups to OSDs, the CRUSH map requires a list of OSD devices (i.e. the OSD daemon names from the configuration file), so they appear first in the map. To declare a device, add a new line to the devices list with the keyword device, followed by a unique numeric ID, followed by the name of the corresponding ceph-osd daemon instance.
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
- Bucket types: define the types of buckets used in the CRUSH hierarchy. Buckets consist of a hierarchical aggregation of storage locations (such as rows, racks, chassis, hosts, and so on) and their assigned weights.
The second list in the CRUSH map defines bucket types. Buckets facilitate a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent physical locations in the hierarchy and aggregate other nodes or leaves; leaf buckets represent ceph-osd daemons and their corresponding storage media.
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
- Bucket instances: once bucket types are defined, you must also declare bucket instances for your hosts and for any other failure domains you plan to use.
The CRUSH algorithm distributes data objects among storage devices according to per-device weights, approximating a uniform probability distribution. CRUSH distributes objects and their replicas according to the cluster map you define; the CRUSH map represents the available storage devices and the logical elements that contain them.
# buckets
host i-8E041728 {
id -2 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.049
item osd.1 weight 0.049
}
host i-91A9F186 {
id -3 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.049
item osd.3 weight 0.049
}
host i-03C020FE {
id -4 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.049
item osd.5 weight 0.049
}
root default {
id -1 # do not change unnecessarily
# weight 0.293
alg straw
hash 0 # rjenkins1
item i-8E041728 weight 0.098
item i-91A9F186 weight 0.098
item i-03C020FE weight 0.098
}
- Rules: consist of the method for selecting buckets.
CRUSH maps support the notion of "CRUSH rules", which determine data placement for a pool. For large clusters, you will likely create many pools, and each pool may have its own CRUSH ruleset and rules. The default CRUSH map has one rule per pool, with one ruleset assigned to each of the default pools.
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
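After injecting a map with a new rule, a pool must be pointed at the corresponding ruleset before it takes effect. A hedged sketch follows; the pool name rbd and ruleset number 0 are assumptions, and on this Jewel-era cluster the property is crush_ruleset (on Luminous and later it is crush_rule):

```shell
# Point a pool at a ruleset from the map above; requires a live cluster.
# Pool name "rbd" and ruleset number 0 are assumptions.
if command -v ceph >/dev/null 2>&1; then
    ceph osd pool set rbd crush_ruleset 0
    # Confirm which ruleset the pool now uses.
    ceph osd pool get rbd crush_ruleset
else
    echo "ceph CLI not available; skipping"
fi
```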
==Reducing failure recovery time==
Define an osd-domain bucket type:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 osd-domain
# buckets
host i-8E041728 {
id -2 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.049
item osd.1 weight 0.049
}
host i-91A9F186 {
id -3 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.049
item osd.3 weight 0.049
}
host i-03C020FE {
id -4 # do not change unnecessarily
# weight 0.098
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.049
item osd.5 weight 0.049
}
osd-domain od-1 {
alg straw
hash 0
item i-8E041728 weight 0.098
}
osd-domain od-2 {
alg straw
hash 0
item i-03C020FE weight 0.098
}
osd-domain od-3 {
alg straw
hash 0
item i-91A9F186 weight 0.098
}
rack rack-01 {
alg straw
hash 0
item od-1
item od-2
}
rack rack-02 {
alg straw
hash 0
item od-3
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
# step take default
step chooseleaf firstn 0 type rack
step emit
}
# end crush map
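Before injecting an edited map it is worth dry-running it. A hedged sketch: crushtool --test simulates placements for a rule without touching the cluster (the file name crush.cp follows the command workflow above). Note that a rule normally begins with an active step take line naming a bucket; if the compiler rejects the commented-out take in the map above, add a root bucket containing rack-01 and rack-02 and take that.

```shell
# Dry-run placement for rule 0 with 2 replicas; requires the compiled
# map crush.cp from the edit workflow above.
if command -v crushtool >/dev/null 2>&1 && [ -f crush.cp ]; then
    crushtool -i crush.cp --test --rule 0 --num-rep 2 --show-mappings
else
    echo "crushtool or crush.cp not available; skipping dry run"
fi
```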
Check the cluster's OSD tree:
[root@i-8E041728 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-8 0.09799 rack rack-02
-6 0.09799 osd-domain od-3
-3 0.09799 host i-91A9F186
2 0.04900 osd.2 up 1.00000 1.00000
3 0.04900 osd.3 up 1.00000 1.00000
-7 0.19598 rack rack-01
-1 0.09799 osd-domain od-1
-2 0.09799 host i-8E041728
0 0.04900 osd.0 up 1.00000 1.00000
1 0.04900 osd.1 up 1.00000 1.00000
-5 0.09799 osd-domain od-2
-4 0.09799 host i-03C020FE
4 0.04900 osd.4 up 1.00000 1.00000
5 0.04900 osd.5 up 1.00000 1.00000
Add a placement-domain bucket type:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 osd-domain
type 12 placement-domain
# buckets
osd-domain od-1 {
alg straw
hash 0
item osd.0 weight 0.049
}
osd-domain od-2 {
alg straw
hash 0
item osd.1 weight 0.049
}
osd-domain od-3 {
alg straw
hash 0
item osd.2 weight 0.049
}
osd-domain od-4 {
alg straw
hash 0
item osd.3 weight 0.049
}
osd-domain od-5 {
alg straw
hash 0
item osd.4 weight 0.049
}
osd-domain od-6 {
alg straw
hash 0
item osd.5 weight 0.049
}
placement-domain pd-1 {
alg straw
hash 0
item od-1
item od-3
item od-5
}
placement-domain pd-2 {
alg straw
hash 0
item od-2
item od-4
item od-6
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
# step take default
step choose firstn 1 type placement-domain
step chooseleaf firstn 0 type osd-domain
step emit
}
# end crush map
Check the cluster's OSD tree:
[root@i-8E041728 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-8 0.14699 placement-domain pd-2
-2 0.04900 osd-domain od-2
1 0.04900 osd.1 up 1.00000 1.00000
-4 0.04900 osd-domain od-4
3 0.04900 osd.3 up 1.00000 1.00000
-6 0.04900 osd-domain od-6
5 0.04900 osd.5 up 1.00000 1.00000
-7 0.14699 placement-domain pd-1
-1 0.04900 osd-domain od-1
0 0.04900 osd.0 up 1.00000 1.00000
-3 0.04900 osd-domain od-3
2 0.04900 osd.2 up 1.00000 1.00000
-5 0.04900 osd-domain od-5
4 0.04900 osd.4 up 1.00000 1.00000