Case Analysis: A Physical Path Failure on RAC Shared Disks Makes the ASM Disk Group Holding the OCR and Voting Disk Inaccessible

Generally, this kind of message appears in the ASM alert log file in the following situations:

Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup, <<<< The ASM PST heartbeat check against ASM disks in a normal or high redundancy disk group is delayed.
thus the ASM instance dismount the diskgroup. By default, it is 15 seconds. <<<< When the check fails, the ASM instance dismounts the disk group; the default timeout is 15 seconds.

By the way the heart beat delays are sort of ignored for external redundancy diskgroup. <<<< PST heartbeat delays are largely ignored for external redundancy disk groups.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly. <<<< Even if the PST heartbeat delay exceeds 15 seconds, an external redundancy disk group is not dismounted.

The ASM disk could go into unresponsiveness, normally in the following scenarios: <<<< An ASM disk usually becomes unresponsive for one of the following reasons:

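+ Some of the paths of the physical paths of the multipath device are offline or lost <<<< 1. Some physical paths under the multipath device are offline or lost.
+ During path 'failover' in a multipath set up <<<< 2. A failover occurs among the physical paths under the multipath device.
+ Server load, or any sort of storage/multipath/OS maintenance <<<< 3. Heavy server load, or maintenance on the system or the storage device.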

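The description above roughly explains the cause of the problem. Two storage links went down (a failover may have occurred), so the multipathed shared storage device was briefly inaccessible. OCRVDISK is a normal redundancy disk group, so ASM runs the PST heartbeat check against it. Because the disks behind OCRVDISK were inaccessible for more than 15 seconds, ASM dismounted OCRVDISK directly. That made the OCR file inaccessible, which in turn took the crs service OFFLINE. The cssd disk heartbeat timeout is 200 seconds, and cssd accesses the ASM disks directly rather than through the ASM disk group, so the css service was not affected. The hasd high availability stack kept working, no cluster node was evicted, and the database instances kept running normally.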
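The interplay of the two timeouts can be sketched in a few lines of Python. This is only an illustrative model, not Oracle's logic: the function `predict_outcome`, the `Outcome` type, and the 30-second outage used in the usage note are hypothetical, and the only values taken from the text are the 15-second PST heartbeat wait and the 200-second cssd disk heartbeat timeout. It assumes a single outage that affects every disk for the same length of time.

```python
from dataclasses import dataclass

# Values quoted in the article; everything else here is a hypothetical model.
PST_HEARTBEAT_TIMEOUT_S = 15   # default ASM PST heartbeat wait
CSS_DISK_TIMEOUT_S = 200       # cssd disk heartbeat timeout


@dataclass
class Outcome:
    diskgroup_dismounted: bool          # ASM dismounts the disk group
    css_disk_heartbeat_timed_out: bool  # cssd loses its disk heartbeat


def predict_outcome(outage_s: float, redundancy: str,
                    pst_timeout_s: float = PST_HEARTBEAT_TIMEOUT_S,
                    css_timeout_s: float = CSS_DISK_TIMEOUT_S) -> Outcome:
    """Compare one storage outage (in seconds) against both timeouts."""
    redundancy = redundancy.lower()
    if redundancy not in {"external", "normal", "high"}:
        raise ValueError(f"unknown redundancy: {redundancy!r}")
    # PST heartbeat delays dismount only normal/high redundancy disk groups.
    dismounted = redundancy in {"normal", "high"} and outage_s > pst_timeout_s
    # cssd reads the disks directly, so only its own timeout matters here.
    css_timed_out = outage_s > css_timeout_s
    return Outcome(dismounted, css_timed_out)
```

Under these assumptions, `predict_outcome(30, "normal")` reports a dismounted disk group with the cssd heartbeat still intact, which matches the behavior seen in this case, while the same outage on an external redundancy disk group, or with `pst_timeout_s=120`, leaves the disk group mounted.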

Oracle provides a way to address this problem at the database level:

If you can not keep the disk unresponsiveness to below 15 seconds, then the below parameter can be set in the ASM instance ( on all the Nodes of RAC ):

_asm_hbeatiowait <<<< This parameter specifies the PST heartbeat timeout.

As per internal bug 17274537 , based on internal testing the value should be increased to 120 secs, which is fixed in 12.1.0.2 <<<< Starting with 12.1.0.2, the default value of this parameter is increased to 120 seconds.

Run below in asm instance to set desired value for _asm_hbeatiowait

alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*'; <<<< Run this command to change the parameter in the ASM instance, then restart the ASM instance and CRS.

And then restart asm instance / crs, to take new parameter value in effect.

To avoid similar problems, the OCR can be mirrored to a different ASM disk group, which further improves the availability of the ora.crsd service.

For more details, see the article: ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)

--end--
