NetApp FAS Controller Replacement in Detail & Troubleshooting

Preface

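Replacing a NetApp controller is not a routine operation. The normal procedure is not complicated, but when something does go wrong it tends to be hard to diagnose. Over the past six months I handled controller replacements in two fairly large environments and probably hit nearly every pitfall there is. This article collects those notes in three parts:

  • Part 1: the basic controller replacement procedure and points to watch
  • Part 2: a record of one FAS8200 replacement and the troubleshooting it required
  • Part 3: problems and summary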

Replacement Procedure

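The FAS8200 is used as the example here (other models are much the same).
The overall flow is summarized briefly from the official documentation. Under normal conditions the whole process is not complicated: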

Replace the controller module hardware - FAS8200
Link: https://docs.netapp.com/us-en/ontap-systems/fas8200/controller-replace-move-hardware.html#step-1-open-the-controller-module

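First, note that a controller replacement normally swaps only the controller motherboard and does not include the other hardware modules mounted on it. Those modules therefore have to be removed from the old controller by hand and installed in the new one.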

Step 1: Open the controller module
Pull the failed controller straight out of the chassis.

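Step 2: Remove the boot media
The boot media holds the ONTAP installation and is one of the more important modules. Moving it to the new controller ensures that both controllers run the same version. Its location is shown below.
[Figure: boot media location]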

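Step 3: Remove the cache (NVMEM) battery
Its location is shown below. Note that it is held by a locking clip.
[Figure: cache battery location]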

Step 4: Remove the DIMMs
Reinstall the DIMMs in their original slots and in the original order as far as possible.

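Step 5: Remove any PCIe cards (optional)
Depending on the use case, higher-end models may have additional PCIe cards installed, such as SAS or UTA2 cards. If any are present, move them to the new controller.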

Step 6: Remove the Flash Cache module
This is the Flash Cache that ships with the system.

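Step 7: Install everything in the new controller, boot, and confirm
Finally, install the components above in the same positions in the new controller, insert it into the chassis, and wait for it to boot.
During boot the system reports that the System ID has changed and asks for manual confirmation. It then updates all IDs automatically, including the owner ID on every disk. Once the system is up, verify the result; at that point the controller replacement is complete.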

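Of course, that is only the case when everything goes smoothly. I have replaced roughly 10 controllers in production environments, and only about half went without a hitch. The following records a recent and particularly troublesome one.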

Case Record

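Project overview: controller replacement on a customer's FAS8200 running ONTAP 9.1P7.
The replacement plus troubleshooting took about 3 hours. The steps below omit most of the troubleshooting (the captured logs alone come to nearly 2 MB), and a fair amount of the record was lost along the way, so treat this as a reference only.
Note: the customer environment has 12 disk shelves attached. Because the output is so long, disk-related CLI output is shown only in part as an illustration.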

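I will not go into the hardware swap itself. After the replacement the node powered on normally and detected the System ID change, and I answered y to update it. The node then reached "Waiting for giveback" as expected. Because the default wait is long, I chose to perform the giveback manually.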

Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
Configuring Devices ...

CPU = 1 Processor(s) Detected.
  Intel(R) Xeon(R) CPU D-1587 @ 1.70GHz (CPU 0)
  CPUID: 0x00050664. Cores per Processor = 16
131072 MB System RAM Installed.
SATA (AHCI) Device: SV9MST6D120GLM41NP

Boot Loader version 6.0.10 
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2020 NetApp, Inc. All Rights Reserved.

Starting AUTOBOOT press Ctrl-C to abort...
Loading X86_64/freebsd/image1/kernel:0x200000/10377696 0xbe59e0/6360256 Entry at 0xffffffff80294bf0
Loading X86_64/freebsd/image1/platform.ko:0x11f7000/2513560 0x145d000/393664 0x14bd1c0/543024 
Starting program at 0xffffffff80294bf0
NetApp Data ONTAP 9.1P7
ata2: AHCI reset done: devices=00000001
Trying to mount root from msdosfs:/dev/ad4s1 [ro]...
md0 attached to /X86_64/freebsd/image1/rootfs.img
Trying to mount root from ufs:/env/md0.uzip []...
mountroot: waiting for device /env/md0.uzip ...
Copyright (C) 1992-2017 NetApp.
All rights reserved.
Writing loader environment to the boot device.
Loader environment has been saved to the boot device.
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************  
Sat Jun 13 18:50:40 2015 [nv2flash.restage.progress:NOTICE]: ReStage is not needed because the flash has no data.

WARNING: System ID mismatch. This usually occurs when replacing a boot device or NVRAM cards!
Override system ID? {y|n} y
No SVM keys found.
Firewall rules loaded.
Jun 14 02:51:07 Power outage protection flash de-staging: 17 cycles
Ipspace "ACP" created
WAFL CPLEDGER is enabled. Checklist = 0x7ff841ff
Waiting for giveback...(Press Ctrl-C to abort wait)

The manual giveback failed with "Partner is missing disks".

XXXX_FAS8200::> storage failover giveback -ofnode cluster2-01 

Warning: System ID changed on partner. Disk ownership will be updated with new
         system ID. Do you want to continue? {y|n}: y

Error: command failed: Failed to initiate giveback. Reason: Partner is missing
       disks. 

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    -        Waiting for giveback
cluster2-02    cluster2-01    false    System ID changed on partner (Old:
                                       xxxxxx944, New: xxxxxx887), Normal
                                       giveback not possible: partner
                                       missing file system disks
2 entries were displayed.

I manually aborted the wait for giveback and found that the controller itself could not see its disks either.

Pausing to check HA partner status ... 
lock was released, continuing boot ...
Waiting for disk ownership to change.........................Jun 14 02:57:55 [cluster2-01:cf.disk.inventory.mismatch:error]: Status of the disk ?.? (50000398:88118750:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk. 
Jun 14 02:57:55 [cluster2-01:cf.disk.inventory.mismatch:error]: Status of the disk ?.? (50000398:880A5F74:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk. 
Jun 14 02:57:55 [cluster2-01:cf.disk.inventory.mismatch:error]: Status of the disk ?.? (50000398:88119AC4:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk. 
Jun 14 02:57:55 [cluster2-01:cf.disk.invent.mismatchalt:ALERT]: Status of some of the disks has changed or the node (cluster2-01) is missing 143 disks (detailed logs have been throttled). 
Jun 14 02:57:55 [cluster2-01:callhome.sfo.miscount:error]: Call home for HA GROUP ERROR: DISK/SHELF COUNT MISMATCH 
......Jun 14 02:58:07 [cluster2-01:raid.assim.tree.noRootVol:error]: No usable root volume was found! 
WARNING: 0 disks found!

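As the output shows, the replaced controller had lost most of its disks. Looking further, the ownership recorded in the disk information still carried the old controller's System ID and had not been updated. Attempting a refresh had no effect either.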

XXXX_FAS8200::> storage failover show -ins
                                            Node: cluster2-02
                                    Partner Name: cluster2-01
                                   Node NVRAM ID: xxxxxx816
                                Partner NVRAM ID: xxxxxx944
                                Takeover Enabled: true
                                         HA Mode: ha
                               Takeover Possible: false                        
                    Reason Takeover not Possible: Local node is already in takeover state.
                                 Interconnect Up: true
                              Interconnect Links: RDMA Interconnect is up (Link up)
                               Interconnect Type: GOP (PLX PEX8725 NTB)
                               State Description: System ID changed on partner (Old: xxxxxx944, New: xxxxxx887), In takeover
                                   Partner State: Initializing
                             Time Until Takeover: -
         Reason Takeover not Possible by Partner: Local node is already in takeover state.
                           Auto Giveback Enabled: false
                           Check Partner Enabled: true
                  Takeover Detection Time (secs): 15
                       Takeover on Panic Enabled: true
                      Takeover on Reboot Enabled: true
               Delay Before Auto Giveback (secs): 600
                         Hardware Assist Enabled: true
                           Partner's Hwassist IP: 10.33.21.13
                         Partner's Hwassist Port: 4444
           Hwassist Health Check Interval (secs): 180
                            Hwassist Retry Count: 2                            
                                 Hwassist Status:                              
                 Time Until Auto Giveback (secs): -                            
                             Local Mailbox Disks: 2.0.0.P3, 2.0.1.P3           
                           Partner Mailbox Disks: 2.0.12.P3, 2.0.13.P3         
                     Missing Disks on Local Node: None                         
                   Missing Disks on Partner Node: 3.20.20, 3.20.11, 3.20.22, 3.20.8, 3.20.12, 3.20.18, 3.20.9, 3.20.5, 3.20.13, 3.20.17, 3.20.23, 3.20.21, 3.20.15, 3.20.19, 3.20.10, 3.20.14, 3.20.7, 3.20.0, 3.20.4, 3.20.16, 3.20.6, 3.20.2, 3.20.1, 3.20.3, 3.21.22, 3.21.21, 3.21.23, 3.21.20, 3.21.17, 3.21.18, 3.21.16, 3.21.19, 3.21.15, 3.21.12, 3.21.13, 3.21.14, 3.21.11, 3.21.10, 3.21.9, 3.21.5, 3.21.8, 3.21.6, 3.21.7, 3.21.2, 3.21.4, 3.21.1, 3.21.3, 3.21.0, 3.22.23, 3.22.22, 3.22.18, 3.22.21, 3.22.19, 3.22.20, 3.22.17, 3.22.16, 3.22.15, 3.22.14, 3.22.12, 3.22.13, 3.22.11, 3.22.10, 3.22.9, 3.22.8, 3.22.6, 3.22.7, 3.22.5, 3.22.4, 3.22.3, 3.22.2, 3.22.1, 3.22.0, 3.23.20, 3.23.23, 3.23.18, 3.23.22, 3.23.17, 3.23.21, 3.23.19, 3.23.16, 3.23.15, 3.23.7, 3.23.14, 3.23.13, 3.23.8, 3.23.12, 3.23.11, 3.23.9, 3.23.10, 3.23.6, 3.23.5, 3.23.3, 3.23.4, 3.23.2, 3.23.0, 3.23.1, 3.24.23, 3.24.22, 3.24.21, 3.24.20, 3.24.18, 3.24.19, 3.24.17, 3.24.16, 3.24.15, 3.24.14, 3.24.13, 3.24.12, 3.24.10, 3.24.11, 3.24.9, 3.24.8, 3.24.7, 3.24.6, 3.24.3, 3.24.5, 3.24.4, 3.24.1, 3.24.0, 3.24.2, 3.25.23, 3.25.22, 3.25.21, 3.25.20, 3.25.19, 3.25.18, 3.25.17, 3.25.16, 3.25.15, 3.25.14, 3.25.12, 3.25.13, 3.25.11, 3.25.10, 3.25.9, 3.25.8, 3.25.7, 3.25.5, 3.25.4, 3.25.3, 3.25.2, 3.25.1, 3.25.0
                             Time Since Takeover: 00:23:22
           Auto Giveback After Takeover On Panic: false
            Bypass Takeover Optimization Enabled: false
           Auto-giveback Override Vetoes Enabled: false
Auto Giveback Delay Before Terminating CIFS (minutes): 5
                                         HA Type: none
2 entries were displayed.   
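
With a dozen shelves attached, the "Missing Disks on Partner Node" field is hard to read by eye. The following Python sketch is not part of the original procedure; the function name count_missing_by_shelf is made up for illustration, and it assumes the disk names follow the stack.shelf.bay pattern seen above. It simply groups such a list by stack.shelf.

```python
from collections import Counter


def count_missing_by_shelf(missing_field: str) -> Counter:
    """Count disks per 'stack.shelf' in a comma-separated list of
    'stack.shelf.bay' names (the format shown in the output above)."""
    counts = Counter()
    for name in missing_field.split(","):
        name = name.strip()
        if not name or name.lower() == "none":
            continue
        parts = name.split(".")
        if len(parts) != 3:
            continue  # skip names in any other layout, e.g. partitions
        stack, shelf, _bay = parts
        counts[f"{stack}.{shelf}"] += 1
    return counts


if __name__ == "__main__":
    # A short excerpt of the list; paste the full field value in practice.
    sample = "3.20.20, 3.20.11, 3.21.22, 3.25.0"
    for shelf, n in sorted(count_missing_by_shelf(sample).items()):
        print(shelf, n)
```

Fed the full list above, it should report 24 disks on each of shelves 3.20 through 3.24 and 23 on shelf 3.25, which adds up to the 143 missing disks named in the cf.disk.invent.mismatchalt alert.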

XXXX_FAS8200::> run -node cluster2-02 -command disk show -v
  DISK       OWNER                    POOL   SERIAL NUMBER         HOME                     DR HOME                CHKSUM
------------ -------------            -----  -------------         -------------            -------------          --------
0a.00.16     cluster2-01(xxxxxx944)    Pool0  S396NX0J601836        cluster2-01(xxxxxx944)                          Block
0a.00.16P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601836NP001   cluster2-01(xxxxxx944)                          Block
0a.00.16P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601836NP002   cluster2-01(xxxxxx944)                          Block
0a.00.16P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601836NP003   cluster2-01(xxxxxx944)                          Block
0a.00.23     cluster2-01(xxxxxx944)    Pool0  S396NX0J601740        cluster2-01(xxxxxx944)                          Block
0a.00.23P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601740NP001   cluster2-01(xxxxxx944)                          Block
0a.00.23P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601740NP002   cluster2-01(xxxxxx944)                          Block
0a.00.23P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601740NP003   cluster2-01(xxxxxx944)                          Block
0a.00.2      cluster2-02(xxxxxx816)    Pool0  S396NX0J601781        cluster2-02(xxxxxx816)                          Block
0a.00.2P1    cluster2-02(xxxxxx816)    Pool0  S396NX0J601781NP001   cluster2-02(xxxxxx816)                          Block
0a.00.2P2    cluster2-02(xxxxxx816)    Pool0  S396NX0J601781NP002   cluster2-02(xxxxxx816)                          Block
0a.00.2P3    cluster2-02(xxxxxx816)    Pool0  S396NX0J601781NP003   cluster2-02(xxxxxx816)                          Block
0a.01.11     cluster2-02(xxxxxx816)    Pool0  S396NX0J601890        cluster2-02(xxxxxx816)                          Block
0a.01.11P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601890NP001   cluster2-02(xxxxxx816)                          Block
0a.01.11P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601890NP002   cluster2-02(xxxxxx816)                          Block
0a.01.11P3   cluster2-02(xxxxxx816)    Pool0  S396NX0J601890NP003   cluster2-02(xxxxxx816)                          Block
0b.20.14     cluster2-02(xxxxxx816)    Pool0  S396NX0JC34076        cluster2-01(xxxxxx944)                          Block
0b.20.16     cluster2-02(xxxxxx816)    Pool0  S396NX0K105816        cluster2-01(xxxxxx944)                          Block
0b.20.3      cluster2-02(xxxxxx816)    Pool0  S396NX0JC34082        cluster2-01(xxxxxx944)                          Block
0a.01.14     cluster2-01(xxxxxx944)    Pool0  S396NX0J601796        cluster2-01(xxxxxx944)                          Block
0a.01.14P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601796NP001   cluster2-01(xxxxxx944)                          Block
0a.01.14P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601796NP002   cluster2-01(xxxxxx944)                          Block
0a.01.14P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601796NP003   cluster2-01(xxxxxx944)                          Block
0a.00.19     cluster2-01(xxxxxx944)    Pool0  S396NX0J601931        cluster2-01(xxxxxx944)                          Block
0a.00.19P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601931NP001   cluster2-01(xxxxxx944)                          Block
0a.00.19P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601931NP002   cluster2-01(xxxxxx944)                          Block
0a.00.19P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601931NP003   cluster2-01(xxxxxx944)                          Block
0a.01.19     cluster2-01(xxxxxx944)    Pool0  S396NX0J601835        cluster2-01(xxxxxx944)                          Block
0a.01.19P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601835NP001   cluster2-01(xxxxxx944)                          Block
0a.01.19P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601835NP002   cluster2-01(xxxxxx944)                          Block
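
To check a long `disk show -v` listing for rows that still carry the old System ID, a small script can help. This is a rough Python sketch under stated assumptions: the row layout is taken from the excerpt above (DISK, OWNER, POOL, SERIAL NUMBER, HOME), the function name disks_referencing is hypothetical, and which matches actually indicate a problem is left to the operator, since a HOME entry naming the partner is expected while a takeover is in effect.

```python
import re
from typing import List, Tuple

# One data row: DISK  OWNER(name(id))  POOL  SERIAL  HOME(name(id)) ...
ROW = re.compile(
    r"^(\S+)\s+[^\s(]+\((\w+)\)\s+\S+\s+\S+\s+[^\s(]+\((\w+)\)"
)


def disks_referencing(output: str, system_id: str) -> List[Tuple[str, str]]:
    """Return (disk, column) pairs where OWNER or HOME carries system_id."""
    hits = []
    for line in output.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header, separator, or a row in another layout
        disk, owner_id, home_id = m.groups()
        if owner_id == system_id:
            hits.append((disk, "owner"))
        if home_id == system_id:
            hits.append((disk, "home"))
    return hits


if __name__ == "__main__":
    sample = (
        "0a.00.16     cluster2-01(xxxxxx944)    Pool0  S396NX0J601836        cluster2-01(xxxxxx944)    Block\n"
        "0a.00.2      cluster2-02(xxxxxx816)    Pool0  S396NX0J601781        cluster2-02(xxxxxx816)    Block\n"
    )
    for disk, column in disks_referencing(sample, "xxxxxx944"):
        print(disk, column)
```

Here "xxxxxx944" is the masked old System ID from the output above; a real run would use the full ID.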

I dropped into the nodeshell and tried a forced giveback, which returned no error.

cluster2-02(takeover)*> cf giveback -f
System ID changed on partner. Giveback will update the ownership of partner disks with system ID: xxxxxx887.
Do you wish to continue {y|n}? y

The giveback then proceeded normally.

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    -        Waiting for cluster applications to
                                       come online on the local node
                                       Offline applications: mgmt, vldb,
                                       vifmgr, bcomd, crs, scsi blade, clam.
cluster2-02    cluster2-01    true     System ID changed on partner (Old:
                                       xxxxxx944, New: xxxxxx887),
                                       Connected to cluster2-01, Partial
                                       giveback
2 entries were displayed.

XXXX_FAS8200::> storage failover show-giveback 
               Partner
Node           Aggregate         Giveback Status
-------------- ----------------- ---------------------------------------------
cluster2-01
                                 No aggregates to give back
cluster2-02
               CFO Aggregates    Done
               aggr1_cluster2_01
                                 Not attempted yet
               aggr3_cluster2_01
                                 Not attempted yet
               aggr5_cluster_01  Not attempted yet
               aggr2_cluster2_01
                                 Not attempted yet
6 entries were displayed.

Checking again after a while, the cluster status was healthy, but the controller still reported missing disks and its Takeover Possible state was false.

XXXX_FAS8200::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
cluster2-01           true    true
cluster2-02           true    true
2 entries were displayed.

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    false    System ID changed on local (Old:
                                       xxxxxx944, New: xxxxxx887),
                                       Connected to cluster2-02, Takeover
                                       is not possible: Local node missing
                                       partner disks
cluster2-02    cluster2-01    true     System ID changed on partner (Old:
                                       xxxxxx944, New: xxxxxx887),
                                       Connected to cluster2-01, Giveback
                                       of one or more SFO aggregates failed
2 entries were displayed.
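
Because the State Description column wraps across several lines, this table is awkward to scan or grep. The Python sketch below is one possible way to fold a single `storage failover show` table into one record per node. The function names are hypothetical, and the parsing rule (a data row starts in column 0 and has true, false, or - as its third field; indented lines continue the previous description) is inferred only from the output shown in this article.

```python
from typing import Dict, List


def parse_failover_show(output: str) -> List[Dict[str, str]]:
    """Fold one 'storage failover show' table into a record per node."""
    entries: List[Dict[str, str]] = []
    for raw in output.splitlines():
        if not raw.strip():
            continue
        fields = raw.split()
        indented = raw[0].isspace()
        if not indented and len(fields) >= 4 and fields[2] in ("true", "false", "-"):
            entries.append({
                "node": fields[0],
                "partner": fields[1],
                "takeover_possible": fields[2],
                "state": " ".join(fields[3:]),
            })
        elif indented and entries:
            # Wrapped continuation of the previous State Description.
            entries[-1]["state"] += " " + " ".join(fields)
    return entries


def nodes_without_takeover(entries: List[Dict[str, str]]) -> List[str]:
    """Names of nodes whose Takeover Possible column is not 'true'."""
    return [e["node"] for e in entries if e["takeover_possible"] != "true"]


if __name__ == "__main__":
    sample = "\n".join([
        "cluster2-01    cluster2-02    false    Connected to cluster2-02, Takeover",
        "                                       is not possible: Local node missing",
        "                                       partner disks",
        "cluster2-02    cluster2-01    true     Connected to cluster2-01",
    ])
    for entry in parse_failover_show(sample):
        print(entry["node"], entry["takeover_possible"], "|", entry["state"])
```

It handles one table at a time; pasting several command outputs together would attach the next table's indented header to the previous node's description.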

I tried storage failover giveback again. The command was accepted this time, but errors were still reported.

XXXX_FAS8200::> storage failover giveback -ofnode cluster2-01 

Warning: System ID changed on partner. Disk ownership will be updated with new
         system ID. Do you want to continue? {y|n}: y

Info: Run the storage failover show-giveback command to check giveback status. 

XXXX_FAS8200::> event log show
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
6/22/2023 09:23:12  cluster2-01      ALERT         cf.disk.invent.mismatchalt: Status of some of the disks has changed or the node (cluster2-01) is missing 143 disks (detailed logs have been throttled).
6/22/2023 09:23:12  cluster2-01      ERROR         cf.disk.inventory.mismatch: Status of the disk ?.? (5002538A:4812E370:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk.
6/22/2023 09:23:12  cluster2-01      ERROR         cf.disk.inventory.mismatch: Status of the disk ?.? (5002538A:4812E6E0:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk.
6/22/2023 09:23:12  cluster2-01      ERROR         cf.disk.inventory.mismatch: Status of the disk ?.? (5002538A:4812E550:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk.
6/22/2023 09:23:07  cluster2-01      ERROR         scsitarget.ispfct.linkBreak: Link break detected on Fibre Channel target adapter 0e. Firmware status code status1 0x2, status2 0x7, and status4 0x0.
6/22/2023 09:22:32  cluster2-01      ERROR         asup.post.drop: AutoSupport message (HA Group Notification from cluster2-01 (REBOOT (watchdog reset)) ALERT) for host (0) was not posted to NetApp. The system will drop the message.

On closer inspection of the disk information, the owning System ID had been updated on most disks, but some still had not changed.

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    false    Connected to cluster2-02, Partial
                                       giveback, Takeover is not possible:
                                       Local node missing partner disks
cluster2-02    cluster2-01    true     Connected to cluster2-01. Node owns
                                       aggregates belonging to another node
                                       in the cluster.
2 entries were displayed.

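Troubleshooting this took a long time. In the end the cause turned out to be a failed shelf controller module (IOM) on one of the shelves, which affected disk ownership. (I admit this logic is not entirely clear to me, since the shelves are daisy-chained in a loop.)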

XXXX_FAS8200::*> storage shelf show
                                                                      
                                                       Module Operational
       Shelf Name Shelf ID Serial Number   Model       Type   Status
----------------- -------- --------------- ----------- ------ -----------

Warning: Unable to list entries for kernel on node "cluster2-01": RPC: Couldn't
         make connection.
              2.0        0 SHFFG1739000083 DS224-12    IOM12  Normal
              2.1        1 SHFFG1739000084 DS224-12    IOM12  Normal
             2.10       10 SHFFG1802000239 DS224-12    IOM12  Normal
             2.11       11 SHFFG1751000243 DS224-12    IOM12  Normal
             2.12       12 SHFFG1826000243 DS224-12    IOM12  Normal
             2.13       13 SHFFG1826000245 DS224-12    IOM12  Normal
             3.20       20 SHFFG1809000126 DS224-12    IOM12  Normal
             3.21       21 SHFFG1810000394 DS224-12    IOM12  Normal
             3.22       22 SHFFG1810000390 DS224-12    IOM12  Normal
             3.23       23 SHFFG1810000392 DS224-12    IOM12  Error
             3.24       24 SHFFG1810000389 DS224-12    IOM12  Normal
             3.25       25 SHFFG1810000391 DS224-12    IOM12  Normal
12 entries were displayed.
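
On systems with many shelves, a single "Error" in the last column is easy to miss. The Python sketch below, which assumes the six-column layout shown above and uses a hypothetical function name, picks out shelves whose operational status is anything other than Normal.

```python
from typing import List, Tuple


def shelves_not_normal(output: str) -> List[Tuple[str, str, str]]:
    """Return (shelf name, serial number, status) for every shelf row
    whose Operational Status is not 'Normal'."""
    bad = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly six columns: name, ID, serial, model,
        # module type, status. Headers and warnings are filtered out below.
        if len(fields) != 6:
            continue
        name, shelf_id, serial, _model, _module, status = fields
        if "." not in name or not shelf_id.isdigit():
            continue  # separator row or unrelated text
        if status != "Normal":
            bad.append((name, serial, status))
    return bad


if __name__ == "__main__":
    sample = (
        "             3.22       22 SHFFG1810000390 DS224-12    IOM12  Normal\n"
        "             3.23       23 SHFFG1810000392 DS224-12    IOM12  Error\n"
    )
    for name, serial, status in shelves_not_normal(sample):
        print(name, serial, status)
```

Against the listing above it would flag only shelf 3.23.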

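After replacing the shelf controller module I refreshed ownership manually, but the same error remained. Because the message said "Node owns aggregates belonging to another node in the cluster", I decided to try relocating the aggregates held by controller B to controller A.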

XXXX_FAS8200::*> storage disk refresh-ownership -node cluster2-02

The aggregates were relocated from controller B to controller A without problems.

XXXX_FAS8200::*> storage aggregate relocation start -node cluster2-02 -destination cluster2-01 -aggregate-list aggr1_cluster2_01

The state finally returned to normal.

XXXX_FAS2720::*> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01     cluster2-02     true     Connected to cluster2-02
cluster2-02     cluster2-01     true     Connected to cluster2-01
2 entries were displayed.

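Summary

  • NetApp configuration is strongly logic-driven; to really master it you need a thorough understanding of the underlying logic.
  • ONTAP releases before 9.3 do have quite a few bugs, and I have run into several recently. If conditions allow, upgrade to 9.5 or later.
  • For core storage, buying vendor support is still recommended.
  • If you have other controller replacement problems, you are welcome to get in touch and compare notes.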

Reposted from blog.csdn.net/sjj222sjj/article/details/131791395