ORA-29770: global enqueue process LMS hang for 70

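Incident background:

At 18:06 on 2021-03-24 the database instance [oracle2] crashed unexpectedly and recovered after a manual restart. According to the customer, the database was running stored procedures for business processing at the time. The root cause is analyzed below using the logs, host performance data, and related information.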

1 Database alert log:

Wed Mar 24 18:08:12 2021
LMS0 (ospid: 28907) has not called a wait for 2 secs.
Errors in file /u01/app/oracle/diag/rdbms/oracle/oracle2/trace/oracle2_lmhb_28929.trc  (incident=736205):
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/oracle/oracle2/incident/incdir_736205/oracle2_lmhb_28929_i736205.trc
ERROR: Some process(s) is not making progress.
LMHB (ospid: 28929) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 28929): terminating the instance due to error 29770

Findings:

ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds

LMHB (ospid: 28929) is terminating the instance.

28907: ora_lms0_bxpasdb2, hung for longer than the 70-second threshold.

LMHB (ospid: 28929) terminated the instance.
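When many alert logs have to be scanned, the fields highlighted above can be extracted mechanically. The following is a minimal sketch that assumes the message text has exactly the form shown in this alert log; the function name `parse_ora_29770` is hypothetical and chosen only for illustration.

```python
import re

# Pattern for the ORA-29770 message exactly as it appears in the alert log above.
ORA_29770 = re.compile(
    r"ORA-29770: global enqueue process (?P<process>\w+) "
    r"\(OSID (?P<osid>\d+)\) is hung for more than (?P<threshold>\d+) seconds"
)


def parse_ora_29770(line):
    """Return process name, OS pid, and threshold from an ORA-29770 line, or None."""
    m = ORA_29770.search(line)
    if m is None:
        return None
    return {
        "process": m.group("process"),
        "osid": int(m.group("osid")),
        "threshold_sec": int(m.group("threshold")),
    }


line = ("ORA-29770: global enqueue process LMS0 (OSID 28907) "
        "is hung for more than 70 seconds")
print(parse_ora_29770(line))
```

Lines that do not contain the message return None, so the function can be applied to every line of a log file and the results filtered.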

2 Trace file analysis:

LMS0 (ospid: 28907) has not moved for 130 sec (1616580498.1616580368)
Incident 736205 created, dump file: /u01/app/oracle/diag/rdbms/oracle/oracle2/incident/incdir_736205/oracle2_lmhb_28929_i736205.trc
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
kjfmGCR_HBdisambig: action=Inst-kill
kjgcr_Main: KJGCR_ACTION - id 1
*** 2021-03-24 18:08:21.065
kjgcr_poll: Group locked by memno - 3
*** 2021-03-24 18:08:21.065
kjgcr_grouplock: Acquired group lock!
*** 2021-03-24 18:08:21.065
==============================
LMS0 (ospid: 28907) has not moved for 132 sec (1616580500.1616580368)

Finding: LMS0 (ospid: 28907) has not moved for 132 sec (1616580500.1616580368). The difference between the two values in parentheses is 132 seconds, well above the 70-second threshold.
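The arithmetic behind this finding can be checked with a short script. This sketch assumes that the two numbers in parentheses are Unix epoch seconds (check time first, last-progress time second) and that the server clock runs at UTC+8; neither is stated in the trace. The name `parse_not_moved` and the parameter `utc_offset_hours` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# "has not moved" line as printed in the LMHB trace above.
NOT_MOVED = re.compile(
    r"(?P<process>\w+) \(ospid: (?P<ospid>\d+)\) has not moved for "
    r"(?P<secs>\d+) sec \((?P<now>\d+)\.(?P<last>\d+)\)"
)


def parse_not_moved(line, utc_offset_hours=8):
    """Parse a 'has not moved' line; the UTC offset is an assumption, not from the trace."""
    m = NOT_MOVED.search(line)
    if m is None:
        return None
    now, last = int(m.group("now")), int(m.group("last"))
    tz = timezone(timedelta(hours=utc_offset_hours))
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "process": m.group("process"),
        "ospid": int(m.group("ospid")),
        "reported_sec": int(m.group("secs")),
        "computed_sec": now - last,  # should equal reported_sec
        "last_progress": datetime.fromtimestamp(last, tz).strftime(fmt),
        "checked_at": datetime.fromtimestamp(now, tz).strftime(fmt),
    }


result = parse_not_moved(
    "LMS0 (ospid: 28907) has not moved for 132 sec (1616580500.1616580368)"
)
print(result)
```

Under these assumptions the last recorded progress of LMS0 falls at 2021-03-24 18:06:08 and the check at 18:08:20, which lines up with the timestamps in the trace file.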

3 Host performance

*** 2021-03-24 18:06:49.589
kjgcr_Main: KJGCR_ACTION - id 3
CPU is high.  Top oracle users listed below:
     Session           Serial         CPU
     8238               54747            92
     6011               36875            92
     6983               56133            92
     10723               33289            92
     2412               19545            92
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, identify users, count 0
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, kill users, count 0
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, activate RM plan, count 0
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, set BG into RT, count 0

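Finding: the customer's monitoring tool showed normal host performance, but the trace log shows that around 18:06 CPU usage reached 92%, which triggered the internal resource plan logic of the LMHB process.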
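To collect the sessions that the trace reports as top CPU consumers, the table after the "CPU is high" marker can be parsed as follows. This is a minimal sketch that assumes the layout shown in the excerpt above (a header row followed by three numeric columns); `parse_top_cpu_users` is a hypothetical helper name.

```python
def parse_top_cpu_users(trace_text):
    """Extract (session, serial, cpu) rows that follow the 'CPU is high' marker."""
    rows = []
    in_table = False
    for line in trace_text.splitlines():
        if line.startswith("CPU is high"):
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        if fields[:3] == ["Session", "Serial", "CPU"]:
            continue  # header row
        if len(fields) == 3 and all(f.isdigit() for f in fields):
            rows.append(tuple(int(f) for f in fields))
        else:
            break  # the first non-numeric line ends the table
    return rows


sample = """\
*** 2021-03-24 18:06:49.589
kjgcr_Main: KJGCR_ACTION - id 3
CPU is high.  Top oracle users listed below:
     Session           Serial         CPU
     8238               54747            92
     6011               36875            92
*** 2021-03-24 18:06:54.593
"""
for session, serial, cpu in parse_top_cpu_users(sample):
    print(session, serial, cpu)
```

The Session and Serial values presumably correspond to SID and SERIAL# in V$SESSION. Since the instance was terminated, historical views such as DBA_HIST_ACTIVE_SESS_HISTORY (if licensed and retained) would be the place to look up what those sessions were executing.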

4 KJGCR actions

*** 2021-03-24 18:08:18.935
==============================
LMS0 (ospid: 28907) has not moved for 130 sec (1616580498.1616580368)
Incident 736205 created, dump file: /u01/app/oracle/diag/rdbms/oracle/oracle2/incident/incdir_736205/oracle2_lmhb_28929_i736205.trc
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
kjfmGCR_HBdisambig: action=Inst-kill
kjgcr_Main: KJGCR_ACTION - id 1
*** 2021-03-24 18:08:21.065
kjgcr_poll: Group locked by memno - 3
*** 2021-03-24 18:08:21.065
kjgcr_grouplock: Acquired group lock!
*** 2021-03-24 18:08:21.065

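Conclusion: overall, CPU usage rose at around 18:06, which caused the LMHB process to activate its internal algorithm. LMS0 then failed to make progress within the timeout, and LMHB terminated the instance.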

5 Background

Official description of the LMS process:
LMS: Global Cache Service Process
The LMS process maintains records of the data file statuses and each cached block by recording information in a Global Resource Directory (GRD). The LMS process also controls the flow of messages to remote instances and manages global data block access and transmits block images between the buffer caches of different instances. This processing is part of the Cache Fusion feature.
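In short, the Global Cache Service process (LMS) maintains the status records of RAC data files and of each cached block, controls message flow to remote instances, and handles global data block access and the transfer of block images between the buffer caches of different instances.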

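Official description of the LMHB process:
Global Cache/Enqueue Service Heartbeat Monitor. Monitors the heartbeat of LMON, LMD, and LMSn processes. LMHB monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning. Runs in: Database and ASM instances, Oracle RAC.
This process monitors critical RAC background processes such as LMON, LMD, and LMSn to ensure that they are not blocked or spinning. LMHB stands for Lock Manager Heartbeat. If LMHB finds sessions with extremely high CPU usage, its internal algorithm may activate a resource management plan or even kill processes. Starting with 11.2.0.2, LMHB delegates the actual work to GCRn slave processes. LMHB starts and stops the GCRn processes so that several of them can handle synchronization and relieve resource pressure (for example, by killing processes).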

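Analysis: LMHB, introduced in 11.2, is not yet very mature. A search of My Oracle Support (MOS) turns up a large number of published and unpublished bugs related to it.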

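Recommendations:

1. Control host resource consumption and reduce the concurrency of the stored procedures.

2. Apply the database patch updates.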
