Setting Up HA Configuration for the Ozone OM Service

Preface


In the previous article, I explained the internal mechanics of Ozone OM HA, but did not cover how to actually configure and use it. In this article, I walk through the HA setup I did in my test environment, describing the actual configuration steps and the pitfalls you may hit along the way. For more on the principles behind OM HA, see my previous article, "An Analysis of Ozone OM Service HA".

Initializing the OM HA Configuration


Unlike HDFS NameNode HA, which splits the service into Active and Standby roles, OM HA uses the Raft protocol to keep the service state consistent. Under this implementation, the OM services run as 1 Leader plus multiple Followers, and we generally deploy 2N+1 instances. A 2N+1 deployment guarantees that a leader election can always produce a strict majority and avoids tie votes. For example, with N=1 (3 OMs) the cluster can tolerate 1 failed OM; with N=2 (5 OMs), it can tolerate 2.

Accordingly, in my test environment I set up 3 OM services, deployed across the following machines:

  • lyq-m1-xx.xx.xx.xx
  • lyq-m2-xx.xx.xx.xx
  • lyq-m3-xx.xx.xx.xx

Since the HA implementation is based on the Raft protocol, we need to enable the om ratis option (Ratis is the Java implementation of Raft) and then configure the location of the OM metastore db:

  <property>
    <name>ozone.om.db.dirs</name>
    <value>/home/hdfs/data/meta</value>
  </property>

  <property>
    <name>ozone.om.ratis.enable</name>
    <value>true</value>
  </property>
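
Before going further, it is worth making sure that this meta directory exists and is writable by the user that runs the OM on every node. A small prep sketch, assuming the hdfs user runs the OMs as in my environment and passwordless ssh is available:

for host in lyq-m1-xx.xx.xx.xx lyq-m2-xx.xx.xx.xx lyq-m3-xx.xx.xx.xx; do
  ssh ${host} "mkdir -p /home/hdfs/data/meta && chown -R hdfs:hdfs /home/hdfs/data/meta"
done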

In OM HA mode there is likewise a notion of a service; my configured service id is as follows:

  <property>
    <name>ozone.om.service.ids</name>
    <value>om-service-test</value>
  </property>

Then come the om node ids under this om service, which distinguish the individual OM nodes within the service:

  <property>
    <name>ozone.om.nodes.om-service-test</name>
    <value>omNode-1,omNode-2,omNode-3</value>
  </property>

At present, Ozone OM does not support OM federation, so ozone.om.internal.service.id does not need to be configured. However, if OM supports multiple nameservices in the future, the internal service id will be needed to specify which nameservice's OM service to start.
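
For illustration only, if multiple service ids were ever defined, pinning the local one would presumably look like the sketch below (hypothetical for now, since federation is not supported; om-service-test2 is a made-up second id):

  <property>
    <name>ozone.om.service.ids</name>
    <value>om-service-test,om-service-test2</value>
  </property>

  <property>
    <name>ozone.om.internal.service.id</name>
    <value>om-service-test</value>
  </property>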

With the basic OM HA settings above defined, what remains is the RPC/HTTP address configuration under each om id: 3 entries per OM node, so 9 entries for 3 OMs. The result is as follows:

  <property>
    <name>ozone.om.address.om-service-test.omNode-1</name>
    <value>lyq-m1-xx.xx.xx.xx:9862</value>
  </property>

  <property>
    <name>ozone.om.http-address.om-service-test.omNode-1</name>
    <value>lyq-m1-xx.xx.xx.xx:9874</value>
  </property>

  <property>
    <name>ozone.om.https-address.om-service-test.omNode-1</name>
    <value>lyq-m1-xx.xx.xx.xx:9875</value>
  </property>

  <property>
    <name>ozone.om.address.om-service-test.omNode-2</name>
    <value>lyq-m2-xx.xx.xx.xx:9862</value>
  </property>

  <property>
    <name>ozone.om.http-address.om-service-test.omNode-2</name>
    <value>lyq-m2-xx.xx.xx.xx:9874</value>
  </property>
  
  <property>
    <name>ozone.om.https-address.om-service-test.omNode-2</name>
    <value>lyq-m2-xx.xx.xx.xx:9875</value>
  </property>

  <property>
    <name>ozone.om.address.om-service-test.omNode-3</name>
    <value>lyq-m3-xx.xx.xx.xx:9862</value>
  </property>

  <property>
    <name>ozone.om.http-address.om-service-test.omNode-3</name>
    <value>lyq-m3-xx.xx.xx.xx:9874</value>
  </property>

  <property>
    <name>ozone.om.https-address.om-service-test.omNode-3</name>
    <value>lyq-m3-xx.xx.xx.xx:9875</value>
  </property>
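
Besides the RPC/HTTP(S) ports, each OM also opens a Ratis port for the Raft traffic, 9872 by default, as the startup logs later confirm. I left it at the default, but if it clashes with something on your hosts it should be tunable via ozone.om.ratis.port (a sketch showing the default value):

  <property>
    <name>ozone.om.ratis.port</name>
    <value>9872</value>
  </property>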

At this point, the OM HA configuration can be considered complete; simply distribute this configuration file to the 3 nodes lyq-m1-xx.xx.xx.xx, lyq-m2-xx.xx.xx.xx and lyq-m3-xx.xx.xx.xx.
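
A minimal distribution sketch (assuming the file is ozone-site.xml under the usual etc/hadoop directory of the Ozone installation; adjust the paths to your own layout):

for host in lyq-m1-xx.xx.xx.xx lyq-m2-xx.xx.xx.xx lyq-m3-xx.xx.xx.xx; do
  scp ~/apache/ozone/etc/hadoop/ozone-site.xml ${host}:~/apache/ozone/etc/hadoop/
done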

Starting the Services in OM HA Mode


With the HA configuration in place, the next step is to start the services.

If at this point you rush straight into the OM start command,

ozone/bin/ozone --daemon start om

you will find that the OM does not start, and the log contains the following "OM not initialized" error:

2020-01-17 08:57:55,210 [main] INFO       - registered UNIX signal handlers for [TERM, HUP, INT]
2020-01-17 08:57:55,846 [main] INFO       - ozone.om.internal.service.id is not defined, falling back to ozone.om.service.ids to find serviceID for OzoneManager if it is HA enabled cluster
2020-01-17 08:57:55,872 [main] INFO       - Found matching OM address with OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872
2020-01-17 08:57:55,872 [main] INFO       - Setting configuration key ozone.om.http-address with value of key ozone.om.http-address.omNode-1: lyq-m1-xx.xx.xx.xx:9874
2020-01-17 08:57:55,872 [main] INFO       - Setting configuration key ozone.om.https-address with value of key ozone.om.https-address.omNode-1: lyq-m1-xx.xx.xx.xx:9875
2020-01-17 08:57:55,872 [main] INFO       - Setting configuration key ozone.om.address with value of key ozone.om.address.omNode-1: lyq-m1-xx.xx.xx.xx:9862
OM not initialized.
2020-01-17 08:57:55,887 [shutdown-hook-0] INFO       - SHUTDOWN_MSG:

Here we first need to run the OM initialization command, which sets up the OM directories:

ozone/bin/ozone om --init

After that, re-running the om daemon start command above succeeds.
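
Putting the two steps together, the per-node sequence is simply the following, run on each of the 3 OM nodes (the ~/apache/ozone path matches my environment, as seen in the CLI examples later):

# one-time initialization of the OM metadata directories
~/apache/ozone/bin/ozone om --init
# then start the OM daemon
~/apache/ozone/bin/ozone --daemon start om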

During startup, the OM matches the addresses of the node it is starting on against the per-om-id addresses in the configuration, and from that identifies its own OM id:

2020-01-17 08:57:55,872 [main] INFO       - Found matching OM address with OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872
2020-01-17 08:57:55,872 [main] INFO       - Setting configuration key ozone.om.http-address with value of key ozone.om.http-address.omNode-1: lyq-m1-xx.xx.xx.xx:9874
2020-01-17 08:57:55,872 [main] INFO       - Setting configuration key ozone.om.https-address with value of key ozone.om.https-address.omNode-1: lyq-m1-xx.xx.xx.xx:9875
2020-01-17 08:57:55,872 [main] INFO       - Setting configuration key ozone.om.address with value of key ozone.om.address.omNode-1: lyq-m1-xx.xx.xx.xx:9862

Once all 3 OM services have been started, the OM logs show the leader election taking place. The log below indicates that omNode-1 became the Leader, while omNode-2 and omNode-3 took the Follower role:

2020-01-19 00:27:10,878 INFO org.apache.ratis.server.impl.RaftServerImpl: omNode-2@group-C0483FFA3DBE: changes role from  FOLLOWER to FOLLOWER at term 600 for recognizeCandidate:omNode-1
2020-01-19 00:27:10,878 INFO org.apache.ratis.server.impl.RoleInfo: omNode-2: shutdown FollowerState
2020-01-19 00:27:10,878 INFO org.apache.ratis.server.impl.RoleInfo: omNode-2: start FollowerState
2020-01-19 00:27:10,878 INFO org.apache.ratis.server.impl.FollowerState: omNode-2@group-C0483FFA3DBE-FollowerState was interrupted: java.lang.InterruptedException: sleep interrupted
2020-01-19 00:27:11,221 INFO org.apache.ratis.server.impl.RaftServerImpl: omNode-2@group-C0483FFA3DBE: change Leader from null to omNode-1 at term 600 for appendEntries, leader elected after 56782ms
2020-01-19 00:27:11,261 INFO org.apache.ratis.server.impl.RaftServerImpl: omNode-2@group-C0483FFA3DBE: set configuration 0: [omNode-3:lyq-m3-xx.xx.xx.xx:9872, omNode-1:lyq-m1-xx.xx.xx.xx:9872, omNode-2:lyq-m2-xx.xx.xx.xx:9872], old=null at 0
2020-01-19 00:27:11,267 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: omNode-2@group-C0483FFA3DBE-SegmentedRaftLogWorker: Starting segment from index:0
2020-01-19 00:27:11,397 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: omNode-2@group-C0483FFA3DBE-SegmentedRaftLogWorker: created new log segment /home/hdfs/data/meta/ratis/3bf88a72-722a-315e-b909-c0483ffa3dbe/current/log_inprogress_0

Using the CLI in OM HA Mode


Once the OM services above are fully up, we can use the OM HA admin command to check each node's current role. The result looks like this (the -id argument below is the OM service id):

[hdfs@lyq-m1 yiqlin]$ ~/apache/ozone/bin/ozone admin om getserviceroles -id=om-service-test
omNode-2 : FOLLOWER
omNode-3 : FOLLOWER
omNode-1 : LEADER
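
To sanity-check failover, we can stop the current Leader and run the role query again; after a new election one of the Followers should show up as LEADER (a rough sketch, assuming the daemon script supports stop symmetrically to start):

# on the current Leader node (lyq-m1 here)
~/apache/ozone/bin/ozone --daemon stop om
# then from any surviving node, query the roles again
~/apache/ozone/bin/ozone admin om getserviceroles -id=om-service-test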

In OM HA mode, volume/bucket/key commands must additionally carry the o3:// scheme prefix containing the OM service id, like this:

[hdfs@lyq-m1 yiqlin]$ ~/apache/ozone/bin/ozone sh volume create o3://om-service-test/volumetest
2020-01-19 02:35:59,087 [main] INFO       - Creating Volume: volumetest, with hdfs as owner.

Otherwise, the following error is reported:

[hdfs@lyq-m1 yiqlin]$ ~/apache/ozone/bin/ozone sh volume create /volumetest
Service ID or host name must not be omitted when ozone.om.service.ids is defined.
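
The same o3:// prefix applies to bucket and key operations as well; for example (buckettest, key1 and localfile.txt are names I made up for illustration):

~/apache/ozone/bin/ozone sh bucket create o3://om-service-test/volumetest/buckettest
~/apache/ozone/bin/ozone sh key put o3://om-service-test/volumetest/buckettest/key1 ./localfile.txt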

Appendix: Sample OM HA Configuration


Finally, here is the complete OM HA configuration I used for testing:

  <property>
    <name>ozone.om.db.dirs</name>
    <value>/home/hdfs/data/meta</value>
  </property>

  <property>
    <name>ozone.om.ratis.enable</name>
    <value>true</value>
  </property>
  
  <property>
    <name>ozone.om.service.ids</name>
    <value>om-service-test</value>
  </property>

  <property>
    <name>ozone.om.nodes.om-service-test</name>
    <value>omNode-1,omNode-2,omNode-3</value>
  </property>

  <property>
    <name>ozone.om.address.om-service-test.omNode-1</name>
    <value>lyq-m1-xx.xx.xx.xx:9862</value>
  </property>

  <property>
    <name>ozone.om.http-address.om-service-test.omNode-1</name>
    <value>lyq-m1-xx.xx.xx.xx:9874</value>
  </property>

  <property>
    <name>ozone.om.https-address.om-service-test.omNode-1</name>
    <value>lyq-m1-xx.xx.xx.xx:9875</value>
  </property>

  <property>
    <name>ozone.om.address.om-service-test.omNode-2</name>
    <value>lyq-m2-xx.xx.xx.xx:9862</value>
  </property>

  <property>
    <name>ozone.om.http-address.om-service-test.omNode-2</name>
    <value>lyq-m2-xx.xx.xx.xx:9874</value>
  </property>
  
  <property>
    <name>ozone.om.https-address.om-service-test.omNode-2</name>
    <value>lyq-m2-xx.xx.xx.xx:9875</value>
  </property>

  <property>
    <name>ozone.om.address.om-service-test.omNode-3</name>
    <value>lyq-m3-xx.xx.xx.xx:9862</value>
  </property>

  <property>
    <name>ozone.om.http-address.om-service-test.omNode-3</name>
    <value>lyq-m3-xx.xx.xx.xx:9874</value>
  </property>

  <property>
    <name>ozone.om.https-address.om-service-test.omNode-3</name>
    <value>lyq-m3-xx.xx.xx.xx:9875</value>
  </property>