CRS fails to start when RAC hosts' clocks differ by more than 10 minutes

A customer reported that on a two-node 19c RAC, the database on one node would not start after a power outage. Checking with crsctl showed that several CRS processes had not come up:

[root@xifenf1 ~]# /u01/app/19.0/grid/bin/crsctl status res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.crf
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.ctssd
      1        ONLINE  OFFLINE                               STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.gipcd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.gpnpd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.mdnsd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.storage
      1        ONLINE  ONLINE       xifenf1                  STABLE
--------------------------------------------------------------------------------

The CRS alert log shows that this node's clock differs from the mean cluster time by more than 600 seconds, so the ctssd process cannot start:

2024-06-11 17:33:09.953 [OCSSD(5020)]CRS-1605: CSSD voting file is online: /dev/asm_ocr5; 
 details in /u01/app/grid/diag/crs/xifenf1/crs/trace/ocssd.trc.
2024-06-11 17:33:09.956 [OCSSD(5020)]CRS-1605: CSSD voting file is online: /dev/asm_ocr1; 
 details in /u01/app/grid/diag/crs/xifenf1/crs/trace/ocssd.trc.
2024-06-11 17:33:10.024 [OCSSD(5020)]CRS-1605: CSSD voting file is online: /dev/asm_ocr2; 
 details in /u01/app/grid/diag/crs/xifenf1/crs/trace/ocssd.trc.
2024-06-11 17:33:10.031 [OCSSD(5020)]CRS-1605: CSSD voting file is online: /dev/asm_ocr4; 
 details in /u01/app/grid/diag/crs/xifenf1/crs/trace/ocssd.trc.
2024-06-11 17:33:10.040 [OCSSD(5020)]CRS-1605: CSSD voting file is online: /dev/asm_ocr3; 
 details in /u01/app/grid/diag/crs/xifenf1/crs/trace/ocssd.trc.
2024-06-11 17:33:11.900 [OCSSD(5020)]CRS-1601: CSSD Reconfiguration complete. Active nodes are xifenf1 xifenf2 .
2024-06-11 17:33:13.344 [OCSSD(5020)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2024-06-11 17:33:13.809 [OCTSSD(5488)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 5488
2024-06-11 17:33:16.017 [OCTSSD(5488)]CRS-2407: The new Cluster Time Synchronization Service reference node is host xifenf2.
2024-06-11 17:33:16.018 [OCTSSD(5488)]CRS-2401: The Cluster Time Synchronization Service started on host xifenf1.
2024-06-11 17:33:16.105 [OCTSSD(5488)]CRS-2419: The clock on host xifenf1 differs from mean cluster time by 1031504618 microseconds. 
  The Cluster Time Synchronization Service will not perform time synchronization 
  because the time difference is beyond the permissible offset of 600 seconds. 
  Details in /u01/app/grid/diag/crs/xifenf1/crs/trace/octssd.trc.
2024-06-11 17:33:16.579 [OCTSSD(5488)]CRS-2402: The Cluster Time Synchronization Service aborted on host xifenf1. 
  Details at (:ctsselect_mstm4:) in /u01/app/grid/diag/crs/xifenf1/crs/trace/octssd.trc.
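
Before touching the clocks, it can help to confirm the offset that CTSS itself sees; a quick check, run from the grid user's environment on either node:

crsctl check ctss                # reports the CTSS mode and, when active, the offset
cluvfy comp clocksync -n all     # cluster-wide clock synchronization check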

Check the host clocks:

[grid@xifenf1 ~]$ date ;ssh xifenf2 date
Tue Jun 11 17:54:09 CST 2024
Tue Jun 11 18:04:34 CST 2024

Correct the host clock:

[root@xifenf1 ~]# date -s "20240611 18:06:00"
Tue Jun 11 18:06:00 CST 2024
[root@xifenf1 ~]# su - grid
Last login: Tue Jun 11 17:37:53 CST 2024 on pts/0
[grid@xifenf1 ~]$ date ;ssh xifenf2 date
Tue Jun 11 18:06:09 CST 2024
Tue Jun 11 18:05:34 CST 2024
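
Setting the clock by hand only removes the immediate symptom. To keep the nodes from drifting apart again, a time-sync client should run on every node; a minimal sketch, assuming RHEL/CentOS 7-style hosts and a reachable NTP server:

yum install -y chrony
systemctl enable --now chronyd
chronyc sources -v               # verify the configured time sources are reachable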

Restart CRS:

[root@xifenf1 ~]# /u01/app/19.0/grid/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'xifenf1'
CRS-2673: Attempting to stop 'ora.storage' on 'xifenf1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'xifenf1'
CRS-2673: Attempting to stop 'ora.crf' on 'xifenf1'
CRS-2677: Stop of 'ora.storage' on 'xifenf1' succeeded
CRS-2673: Attempting to stop 'ora.evmd' on 'xifenf1'
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'xifenf1'
CRS-2677: Stop of 'ora.mdnsd' on 'xifenf1' succeeded
CRS-2677: Stop of 'ora.crf' on 'xifenf1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'xifenf1' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'xifenf1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'xifenf1'
CRS-2677: Stop of 'ora.cssd' on 'xifenf1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'xifenf1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'xifenf1'
CRS-2677: Stop of 'ora.gpnpd' on 'xifenf1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'xifenf1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'xifenf1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@xifenf1 ~]# /u01/app/19.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@xifenf1 ~]# /u01/app/19.0/grid/bin/crsctl status res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.crf
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.crsd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.cssd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.ctssd
      1        ONLINE  ONLINE       xifenf1                  ACTIVE:35600,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.gipcd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.gpnpd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.mdnsd
      1        ONLINE  ONLINE       xifenf1                  STABLE
ora.storage
      1        ONLINE  ONLINE       xifenf1                  STABLE
--------------------------------------------------------------------------------

Handling ora.storage failing to start with ORA-12514

On a 19.11 cluster, CRS failed to start properly after node 2 was manually rebooted:

[grid@xff2 ~]$ crsctl status res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       xff2                    STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       xff2                    STABLE
ora.crf
      1        ONLINE  ONLINE       xff2                    STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       xff2                    STABLE
ora.ctssd
      1        ONLINE  ONLINE       xff2                    OBSERVER,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       xff2                    STABLE
ora.evmd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.gipcd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.gpnpd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.mdnsd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.storage
      1        ONLINE OFFLINE                                STABLE
--------------------------------------------------------------------------------

The CRS alert log shows:

2024-03-05 12:46:26.021 [CLSECHO(3653)]ACFS-9327: Verifying ADVM/ACFS devices.
2024-03-05 12:46:26.040 [CLSECHO(3661)]ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
2024-03-05 12:46:26.065 [CLSECHO(3673)]ACFS-9156: Detecting control device '/dev/ofsctl'.
2024-03-05 12:46:26.357 [CLSECHO(3703)]ACFS-9294: updating file /etc/sysconfig/oracledrivers.conf
2024-03-05 12:46:26.376 [CLSECHO(3711)]ACFS-9322: completed
2024-03-05 12:46:27.764 [CSSDMONITOR(3855)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 3855
2024-03-05 12:46:27.839 [OSYSMOND(3857)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 3857
2024-03-05 12:46:28.129 [CSSDAGENT(3890)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 3890
2024-03-05 12:46:29.125 [OCSSD(3910)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 3910
2024-03-05 12:46:30.187 [OCSSD(3910)]CRS-1713: CSSD daemon is started in hub mode
2024-03-05 12:46:31.428 [OCSSD(3910)]CRS-1707: Lease acquisition for node xff2 number 2 completed
2024-03-05 12:46:32.630 [OCSSD(3910)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/grid/diag/crs/xff2/crs/trace/ocssd.trc
2024-03-05 12:46:32.630 [OCSSD(3910)]CRS-1617: The information required to do node kill for node xff2 is incomplete; details at (:CSSNM00004:) in /u01/app/grid/diag/crs/xff2/crs/trace/ocssd.trc
2024-03-05 12:46:32.638 [OCSSD(3910)]CRS-1605: CSSD voting file is online: /dev/sda1; details in /u01/app/grid/diag/crs/xff2/crs/trace/ocssd.trc.
2024-03-05 12:46:33.546 [OCSSD(3910)]CRS-1601: CSSD Reconfiguration complete. Active nodes are xff1 xff2 .
2024-03-05 12:46:35.405 [OCSSD(3910)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2024-03-05 12:46:35.533 [OCTSSD(4138)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 4138
2024-03-05 12:46:36.339 [OCTSSD(4138)]CRS-2403: The Cluster Time Synchronization Service on host xff2 is in observer mode.
2024-03-05 12:46:37.601 [OCTSSD(4138)]CRS-2407: The new Cluster Time Synchronization Service reference node is host xff1.
2024-03-05 12:46:37.601 [OCTSSD(4138)]CRS-2401: The Cluster Time Synchronization Service started on host xff2.
2024-03-05 12:46:54.181 [ORAROOTAGENT(2427)]CRS-5019: All OCR locations are on ASM disk groups [SYSTEMDG], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/grid/diag/crs/xff2/crs/trace/ohasd_orarootagent_root.trc".
2024-03-05 12:47:15.209 [OLOGGERD(4553)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 4553
2024-03-05 12:52:04.581 [CRSCTL(8313)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u01/app/grid/diag/crs/xff2/crs/trace/crsctl_8313.trc.
2024-03-05 12:56:44.519 [ORAROOTAGENT(2427)]CRS-5818: Aborted command 'start' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/grid/diag/crs/xff2/crs/trace/ohasd_orarootagent_root.trc.
2024-03-05 12:56:44.608 [OHASD(2217)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.storage'. Details at (:CRSPE00221:) {0:5:3} in /u01/app/grid/diag/crs/xff2/crs/trace/ohasd.trc.
2024-03-05 12:56:44.606 [ORAROOTAGENT(2427)]CRS-5017: The resource action "ora.storage start" encountered the following error:
2024-03-05 12:56:44.606+agent's abort action pending. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/xff2/crs/trace/ohasd_orarootagent_root.trc".
2024-03-05 12:57:58.464 [CRSD(11801)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 11801
2024-03-05 12:58:12.059 [CRSD(11801)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u01/app/grid/diag/crs/xff2/crs/trace/crsd.trc.

The ohasd_orarootagent_root trace shows:

2024-03-05 12:52:00.769 :  OCRRAW:4255452928: kgfnConnect3: Got a Connection Error when connecting to ASM.

2024-03-05 12:52:00.771 :  OCRRAW:4255452928: kgfnConnect2: failed to connect

2024-03-05 12:52:00.771 :  OCRRAW:4255452928: kgfnConnect2Retry: failed to connect connect after 1 attempts, 124s elapsed

2024-03-05 12:52:00.771 :  OCRRAW:4255452928: kgfo_kge2slos error stack at kgfoAl06: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor


2024-03-05 12:52:00.771 :  OCRRAW:4255452928: -- trace dump on error exit --

2024-03-05 12:52:00.771 :  OCRRAW:4255452928: Error [kgfoAl06] in [kgfokge] at kgfo.c:2176

2024-03-05 12:52:00.771 :  OCRRAW:4255452928: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
ORA-12514: TNS:listener does not currently know of service requested

2024-03-05 12:52:00.771 :  OCRRAW:4255452928: Category: 7

"/u01/app/grid/diag/crs/xff2/crs/trace/crsctl_8313.trc" 208L, 11809C

2024-03-05 12:52:03.543 :  OCRRAW:4255452928: 9379 Error 4 opening dom root in 0xf9afdb79c0

2024-03-05 12:52:03.551 :  OCRRAW:4255452928: kgfnConnect2: kgfnGetBeqData failed

2024-03-05 12:52:03.577 :  OCRRAW:4255452928: kgfnConnect2Int: cstr=(DESCRIPTION=(TCP_USER_TIMEOUT=1)(CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=tcp)(HOST=<node1 private IP>)(PORT=1525)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))

2024-03-05 12:52:03.578 :  OCRRAW:4255452928: kgfnConnect2Int: ServerAttach

2024-03-05 12:52:04.579 :  OCRRAW:4255452928: kgfnServerAttachConnErrors: Encountered service based error 12514

2024-03-05 12:52:04.579 :  OCRRAW:4255452928: kgfnRecordErr 12514 OCI error:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor


2024-03-05 12:52:04.579 :  OCRRAW:4255452928: kgfnConnect3: Got a Connection Error when connecting to ASM.

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: kgfnConnect2: failed to connect

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: kgfnConnect2Retry: failed to connect connect after 1 attempts, 122s elapsed

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: kgfo_kge2slos error stack at kgfoAl06: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor


2024-03-05 12:52:04.581 :  OCRRAW:4255452928: -- trace dump on error exit --

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: Error [kgfoAl06] in [kgfokge] at kgfo.c:3180

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
ORA-12514: TNS:listener does not currently know of service requested

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: Category: 7

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: DepInfo: 12514

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: ADR is not properly configured

2024-03-05 12:52:04.581 :  OCRRAW:4255452928: -- trace dump end --

  OCRASM:4255452928: SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12514, loc=kgfokge

2024-03-05 12:52:04.581 :  OCRASM:4255452928: ASM Error Stack : ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor

2024-03-05 12:52:04.581 :  OCRASM:4255452928: proprasmo: kgfoCheckMount returned [7]
2024-03-05 12:52:04.581 :  OCRASM:4255452928: proprasmo: The ASM instance is down
2024-03-05 12:52:04.635 :  OCRRAW:4255452928: proprioo: Failed to open [+SYSTEMDG/xff-cluster/OCRFILE/registry.255.1072903025]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2024-03-05 12:52:04.635 :  OCRRAW:4255452928: proprioo: No OCR/OLR devices are usable
  OCRUTL:4255452928: u_fill_errorbuf: Error Info : [Insufficient quorum to open OCR devices]
 default:4255452928: u_set_gbl_comp_error: comptype '107' : error '0'
2024-03-05 12:52:04.635 :  OCRRAW:4255452928: proprinit: Could not open raw device
2024-03-05 12:52:04.635 : default:4255452928: a_init:7!: Backend init unsuccessful : [26]
2024-03-05 12:52:04.637 : default:4255452928: clsvactversion:4: Retrieving Active Version from local storage.

From this, the initial diagnosis is that node 2 fails when connecting to (DESCRIPTION=(TCP_USER_TIMEOUT=1)(CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=tcp)(HOST=<node1 private IP>)(PORT=1525)))(CONNECT_DATA=(SERVICE_NAME=+ASM))). Check the state of that listener on node 1:

[grid@xff1 ~]$ lsnrctl status ASMNET1LSNR_ASM

LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 05-MAR-2024 13:04:51

Copyright (c) 1991, 2021, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
STATUS of the LISTENER
------------------------
Alias                     ASMNET1LSNR_ASM
Version                   TNSLSNR for Linux: Version 19.0.0.0.0 - Production
Start Date                20-MAY-2021 23:53:50
Uptime                    25 days 8 hr. 15 min. 15 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/19c/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/xff1/asmnet1lsnr_asm/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=ASMNET1LSNR_ASM)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=<node1 private IP>)(PORT=1525)))
The listener supports no services
The command completed successfully

The listener has no services registered. Check the listener-related parameter configuration:

[grid@xff1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Tue Mar 5 13:26:29 2024
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.11.0.0.0

SQL> show parameter listener;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
forward_listener                     string
listener_networks                    string
local_listener                       string      
remote_listener                      string
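
It can also help to compare this against what the clusterware itself expects for the ASM listener; a quick cross-check, assuming the grid user on node 1:

srvctl config listener -asmlistener     # ASM listener name, network and port
lsnrctl services ASMNET1LSNR_ASM        # handlers currently registered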

The initial conclusion is that the ASMNET1LSNR_ASM listener on node 1 is unhealthy, most likely because the listener-related parameters of the ASM instance are wrong. The safer fix is to restart node 1 so that the listener parameters are regenerated and dynamic registration is restored; as a temporary workaround:

[grid@xff1 ~]$ sqlplus / as sysasm

SQL*Plus: Release 19.0.0.0.0 - Production on Tue Mar 5 13:05:11 2024
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.11.0.0.0

SQL> ALTER SYSTEM SET local_listener ='(ADDRESS=(PROTOCOL=TCP)(HOST=<node1 private IP>)(PORT=1525))' sid='+ASM1' SCOPE=MEMORY;

System altered.



[grid@xff1 ~]$ lsnrctl status ASMNET1LSNR_ASM

LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 05-MAR-2024 13:05:21

Copyright (c) 1991, 2021, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
STATUS of the LISTENER
------------------------
Alias                     ASMNET1LSNR_ASM
Version                   TNSLSNR for Linux: Version 19.0.0.0.0 - Production
Start Date                20-MAY-2021 23:53:50
Uptime                    25 days 8 hr. 15 min. 45 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/19c/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/xff1/asmnet1lsnr_asm/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=ASMNET1LSNR_ASM)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=<node1 private IP>)(PORT=1525)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "+ASM_DATA" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "+ASM_FRA" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "+ASM_SYSTEMDG" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
The command completed successfully
[grid@xff1 ~]$ 

After setting the local_listener parameter on node 1's ASM instance, the cluster started successfully:

[grid@xff2 ~]$ crsctl status res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       xff2                    STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       xff2                    STABLE
ora.crf
      1        ONLINE  ONLINE       xff2                    STABLE
ora.crsd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.cssd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       xff2                    STABLE
ora.ctssd
      1        ONLINE  ONLINE       xff2                    OBSERVER,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       xff2                    STABLE
ora.evmd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.gipcd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.gpnpd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.mdnsd
      1        ONLINE  ONLINE       xff2                    STABLE
ora.storage
      1        ONLINE  ONLINE       xff2                    STABLE
--------------------------------------------------------------------------------

udev_start causes VIP failover (a common scenario: triggered by adding disks online to a RAC)

While expanding ASM storage, the customer ran the udev_start command, after which all VIPs failed over and all application traffic was interrupted.

The first priority was restoring service by relocating all the VIPs back:

[grid@rac3 ~]$  srvctl relocate vip -i rac1 -n rac1 -f -v
VIP was relocated successfully.
[grid@rac3 ~]$  srvctl relocate vip -i rac2 -n rac2 -f -v
VIP was relocated successfully.
[grid@rac3 ~]$  srvctl relocate vip -i rac3 -n rac3 -f -v
VIP was relocated successfully.
[grid@rac3 ~]$  srvctl relocate vip -i rac4 -n rac4 -f -v
VIP was relocated successfully.

The VIPs returned to normal and the application recovered.

The cause: udev_start momentarily takes the network interfaces down, which triggers the VIP failover.

Review the ifcfg network configuration files:

[screenshot: ifcfg file contents]

The root cause is udev acting on the network interfaces. The recommended fix is to add HOTPLUG="no" to the corresponding ifcfg files (public, private, and any other networks that matter), as shown below.
Reference: Network interface going down when dynamically adding disks to storage using udev in RHEL 6 (Doc ID 1569028.1)
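
For illustration, a minimal ifcfg file with the recommended setting (device name and addressing are placeholders; adjust to the actual host):

# /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
HOTPLUG=no      # keep udev_start/hotplug events from bouncing this interface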

Removing OFFLINE ora.asmgroup resource entries

With Flex ASM in use, checking the cluster state shows a third, OFFLINE member of each ora.asmgroup-related resource. Running srvctl modify asm -count 2 forces the ASM instance count down to 2, so the OFFLINE entries no longer appear:

[grid@dbserver1 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.chad
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.net1.network
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.ons
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.proxy_advm
               OFFLINE OFFLINE      dbserver1                STABLE
               OFFLINE OFFLINE      dbserver2                STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        ONLINE  OFFLINE                               STABLE
ora.ASMNET2LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        ONLINE  OFFLINE                               STABLE
ora.DATA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.FRA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.SYSDG.dg(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asm(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                Started,STABLE
      2        ONLINE  ONLINE       dbserver2                Started,STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asmnet2.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.cvu
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.dbserver1.vip
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.dbserver2.vip
      1        ONLINE  ONLINE       dbserver2                STABLE
ora.xff.db
      1        ONLINE  ONLINE       dbserver1                Open,HOME=/u01/app/o
                                                             racle/product/19c/db
                                                             _1,STABLE
      2        ONLINE  ONLINE       dbserver2                Open,HOME=/u01/app/o
                                                             racle/product/19c/db
                                                             _1,STABLE
ora.qosmserver
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       dbserver1                STABLE
--------------------------------------------------------------------------------
[grid@dbserver1 ~]$ srvctl modify asm -count 2
[grid@dbserver1 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.chad
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.net1.network
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.ons
               ONLINE  ONLINE       dbserver1                STABLE
               ONLINE  ONLINE       dbserver2                STABLE
ora.proxy_advm
               OFFLINE OFFLINE      dbserver1                STABLE
               OFFLINE OFFLINE      dbserver2                STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.ASMNET2LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.DATA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.FRA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.SYSDG.dg(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.asm(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                Started,STABLE
      2        ONLINE  ONLINE       dbserver2                Started,STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.asmnet2.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       dbserver1                STABLE
      2        ONLINE  ONLINE       dbserver2                STABLE
ora.cvu
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.dbserver1.vip
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.dbserver2.vip
      1        ONLINE  ONLINE       dbserver2                STABLE
ora.xff.db
      1        ONLINE  ONLINE       dbserver1                Open,HOME=/u01/app/o
                                                             racle/product/19c/db
                                                             _1,STABLE
      2        ONLINE  ONLINE       dbserver2                Open,HOME=/u01/app/o
                                                             racle/product/19c/db
                                                             _1,STABLE
ora.qosmserver
      1        ONLINE  ONLINE       dbserver1                STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       dbserver1                STABLE
--------------------------------------------------------------------------------
[grid@dbserver1 ~]$ 
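
If a third ASM instance is ever wanted again (for example after adding a node back), the cardinality can be raised the same way; a sketch:

srvctl modify asm -count 3      # or: srvctl modify asm -count ALL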

NIC failure causes a database instance startup failure

On this cluster one node started normally while the other could not start its instance. Alert log from the failing node:

Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Receiver ospid 6386 [
Tue Mar 07 19:07:29 2023
Errors in file /u01/app/oracle/diag/rdbms/xff/xff2/trace/xff2_lms0_6386.trc:
IPC Send timeout detected. Receiver ospid 6402 [
Tue Mar 07 19:07:29 2023
Errors in file /u01/app/oracle/diag/rdbms/xff/xff2/trace/xff2_lms4_6402.trc:
Tue Mar 07 19:07:29 2023
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
System state dump requested by (instance=2, osid=6384 (LMD0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/xff/xff2/trace/xff2_diag_6374_20230307190729.trc
LMD0 (ospid: 6384): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20230307190729],
      requested by (instance=2, osid=6384 (LMD0)), summary=[abnormal instance termination].
Instance terminated by LMD0, pid = 6384

Alert log from the healthy node:

Tue Mar 07 19:02:07 2023
Reconfiguration started (old inc 20, new inc 22)
List of instances:
 1 2 (myinst: 1)
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Mar 07 19:02:08 2023
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 7: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
Tue Mar 07 19:02:08 2023
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Tue Mar 07 19:02:27 2023
IPC Send timeout detected. Sender: ospid 6936 [oracle@xffnode1.localdomain (PING)]
Receiver: inst 2 binc 441249706 ospid 59731
Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Sender: ospid 6946 [oracle@xffnode1.localdomain (LMS0)]
Receiver: inst 2 binc 429479852 ospid 6386
Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Sender: ospid 6962 [oracle@xffnode1.localdomain (LMS4)]
Receiver: inst 2 binc 429479854 ospid 6402
Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Sender: ospid 6966 [oracle@xffnode1.localdomain (LMS5)]

These logs confirm that the two nodes could not communicate properly, so the joining node could not be admitted to the cluster (reconfiguration never completed) and the instance failed to start. The first thing to check in such cases is the private interconnect. Analysis of the OSWatcher netstat data showed a rapidly growing number of "packet reassembles failed" errors.
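
A quick way to watch this counter on each node (the counter comes from the kernel's SNMP statistics, so no extra tooling is needed):

netstat -s | grep -i reassembl                # e.g. "NNN packet reassembles failed"
watch -n 5 'netstat -s | grep -i reassembl'   # watch whether it keeps growing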

This symptom usually points to the ipfrag_*_thresh kernel parameters sitting at their (too small) defaults. The following values were set:

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
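
A sketch of applying these values immediately and persisting them across reboots (standard sysctl usage, run as root):

sysctl -w net.ipv4.ipfrag_high_thresh=16777216
sysctl -w net.ipv4.ipfrag_low_thresh=15728640
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
EOF
sysctl -p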

With these settings the database could be started as a temporary measure, but instance reconfiguration still took far too long.

"packet reassembles failed" kept increasing. Closer inspection of the NICs showed that one member of the HAIP pair (dual 10GbE NICs) was faulty.

To limit the impact on database performance, the faulty NIC was temporarily disabled; after that, the database restarted normally.

The NIC will be re-enabled once the network-level problem is fixed.
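
Taking an interface down and bringing it back is a one-liner; "eth3" here is only a placeholder for whichever HAIP member NIC is faulty:

ip link set eth3 down      # take the bad interface out of service
ip link set eth3 up        # re-enable it once the link is repaired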

11.2 CRS startup timeout: the dd npohasd workaround

A fiber-link failure made the voting disks inaccessible, causing the host to reboot; after the reboot, the clusterware did not start.
OS and CRS versions:

[root@rac1 ~]# cat /etc/redhat-release 
CentOS release 6.9 (Final)
[root@rac1 ~]# sqlplus -v

SQL*Plus: Release 11.2.0.4.0 Production

Starting CRS manually hangs for a while and then fails:

[root@rac1 ~]# crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Check the startup processes:

[grid@rac1 ~]$ ps -ef|grep d.bin
root       7043      1  0 11:48 ?        00:00:00 /u01/app/grid/product/11.2.0/bin/ohasd.bin reboot
root       8311      1  0 11:53 ?        00:00:00 /u01/app/grid/product/11.2.0/bin/ohasd.bin reboot
grid      10984  10954  0 12:10 pts/2    00:00:00 grep d.bin

Experience suggests this is BUG:17229230 – DURING REBOOT, "OHASD.BIN REBOOT" REMAINS SLEEPING. The temporary workaround: start CRS in one session, then from a second session run:

/bin/dd if=/var/tmp/.oracle/npohasd of=/dev/null bs=1024 count=1
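
If the timing is hard to hit by hand, the read can be wrapped in a small loop (a convenience sketch; stop it with Ctrl-C once ohasd.bin has come up):

while :; do
  /bin/dd if=/var/tmp/.oracle/npohasd of=/dev/null bs=1024 count=1 2>/dev/null
  sleep 1
done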

After that, CRS started normally:

[root@rac1 ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@rac1 ~]# crsctl status res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown   
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       rac1                                         
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                               STARTING            
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1                                         
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       rac1                                         
ora.gpnpd
      1        ONLINE  ONLINE       rac1                                         
ora.mdnsd
      1        ONLINE  ONLINE       rac1                                         

Once the dd command was terminated, the cluster started normally.

Handling dropped disks in the OCR diskgroup

A fault took the OCR_0001 disk of the CRS diskgroup offline, and the voting files went from 3 down to 2:

WARNING: Write Failed. group:3 disk:1 AU:1 offset:4190208 size:4096
WARNING: Hbeat write to PST disk 1.3915948466 in group 3 failed. [4]
Mon Jun 14 15:31:11 2021
NOTE: process _b000_+asm1 (21889) initiating offline of disk 1.3915948466 (OCR_0001) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 14 for pid 28, osid 21889
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: checking PST for grp 3 done.
NOTE: sending set offline flag message 1047812201 to 1 disk(s) in group 3
WARNING: Disk OCR_0001 in mode 0x7f is now being offlined
INFO: Instance #2 could not find disk 1 in group 3
NOTE: initiating PST update: grp = 3, dsk = 1/0xe968a1b2, mask = 0x6a, op = clear
GMON updating disk modes for group 3 at 15 for pid 28, osid 21889
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: PST update grp = 3 completed successfully 
NOTE: initiating PST update: grp = 3, dsk = 1/0xe968a1b2, mask = 0x7e, op = clear
GMON updating disk modes for group 3 at 16 for pid 28, osid 21889
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: cache closing disk 1 of grp 3: OCR_0001
NOTE: PST update grp = 3 completed successfully 
Mon Jun 14 15:31:13 2021
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 3 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Successful voting file relocation on diskgroup OCR
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 2 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Successful voting file relocation on diskgroup OCR
Mon Jun 14 15:34:08 2021
WARNING: PST-initiated drop of 1 disk(s) in group 3(.1918390620))
SQL> alter diskgroup OCR drop disk OCR_0001 force /* ASM SERVER */ 
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=3
Mon Jun 14 15:34:10 2021
GMON updating for reconfiguration, group 3 at 17 for pid 28, osid 21889
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: cache closing disk 1 of grp 3: (not open) OCR_0001
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: group 3 PST updated.
Mon Jun 14 15:34:10 2021
NOTE: membership refresh pending for group 3/0x7258515c (OCR)
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 2 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Successful voting file relocation on diskgroup OCR
GMON querying group 3 at 18 for pid 18, osid 8900
NOTE: group OCR: updated PST location: disk 0000 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 1)
NOTE: cache closing disk 1 of grp 3: (not open) _DROPPED_0001_OCR
SUCCESS: refreshed membership for 3/0x7258515c (OCR)
SUCCESS: alter diskgroup OCR drop disk OCR_0001 force /* ASM SERVER */

After the rebalance triggered by the first disk loss completed, another disk dropped. The OCR diskgroup itself stayed mounted, but with only one disk left, the voting file could not be refreshed onto another disk within the diskgroup:

Tue Jun 15 04:41:42 2021
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
Tue Jun 15 04:41:42 2021
NOTE: process _b000_+asm1 (58548) initiating offline of disk 0.3915948465 (OCR_0000) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 23 for pid 28, osid 58548
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: checking PST for grp 3 done.
NOTE: sending set offline flag message 3615961191 to 1 disk(s) in group 3
WARNING: Disk OCR_0000 in mode 0x7f is now being offlined
INFO: Instance #2 could not find disk 1 in group 3
NOTE: initiating PST update: grp = 3, dsk = 0/0xe968a1b1, mask = 0x6a, op = clear
GMON updating disk modes for group 3 at 24 for pid 28, osid 58548
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: PST update grp = 3 completed successfully 
NOTE: initiating PST update: grp = 3, dsk = 0/0xe968a1b1, mask = 0x7e, op = clear
GMON updating disk modes for group 3 at 25 for pid 28, osid 58548
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: cache closing disk 0 of grp 3: OCR_0000
NOTE: PST update grp = 3 completed successfully 
Tue Jun 15 04:41:44 2021
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 2 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
WARNING: Waited 18 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 18 secs for write IO to PST disk 0 in group 3.
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
Tue Jun 15 04:44:21 2021
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
Tue Jun 15 04:44:21 2021
WARNING: PST-initiated drop of 1 disk(s) in group 3(.1918390620))
SQL> alter diskgroup OCR drop disk OCR_0000 force /* ASM SERVER */ 
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=3
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
Tue Jun 15 04:44:24 2021
GMON updating for reconfiguration, group 3 at 26 for pid 28, osid 58548
NOTE: cache closing disk 0 of grp 3: (not open) OCR_0000
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: group 3 PST updated.
NOTE: membership refresh pending for group 3/0x7258515c (OCR)
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
GMON querying group 3 at 27 for pid 18, osid 8900
NOTE: cache closing disk 0 of grp 3: (not open) _DROPPED_0000_OCR
SUCCESS: refreshed membership for 3/0x7258515c (OCR)
NOTE: starting rebalance of group 3/0x7258515c (OCR) at power 1
SUCCESS: alter diskgroup OCR drop disk OCR_0000 force /* ASM SERVER */

Query the state of the OCR diskgroup at this point:

[screenshot: OCR diskgroup query output]

The OCR diskgroup clearly has only one disk left. Query the voting files:

node1-> crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   3619aee7c3b04fc1bfa5c4ce659acbf7 (/dev/emcpowerc) [OCR]
 2. ONLINE   00bc3e79f7404ff2bf60925a7b8a5a6d (/dev/emcpowere) [OCR]
Located 2 voting disk(s).
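
The same picture can be cross-checked from the ASM side; a minimal query, assuming a SYSASM session (the paths follow this system's /dev/emcpower* naming):

select group_number, disk_number, name, path, mode_status, header_status
  from v$asm_disk
 where path like '/dev/emcpower%';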

Of the two voting files, one is on a disk still in the OCR diskgroup and the other is on the disk the diskgroup has already dropped. Try adding the previously offlined disk back into the OCR diskgroup:

SQL> alter diskgroup OCR add  disk '/dev/emcpowerc';
alter diskgroup OCR add  disk '/dev/emcpowerc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/emcpowerc' belongs to diskgroup "OCR"


SQL> alter diskgroup OCR add  disk '/dev/emcpowerc' force  
  2  ;
alter diskgroup OCR add  disk '/dev/emcpowerc' force
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 15191
Session ID: 1613 Serial number: 7

Check the alert log:

SQL> alter diskgroup OCR add  disk '/dev/emcpowerc' force
NOTE: GroupBlock outside rolling migration privileged region
NOTE: Assigning number (3,4) to disk (/dev/emcpowerc)
NOTE: requesting all-instance membership refresh for group=3
WARNING: ignoring disk /dev/emcpowerd in deep discovery
NOTE: initializing header on grp 3 disk OCR_0004
WARNING: ignoring disk /dev/emcpowerd in deep discovery
NOTE: requesting all-instance disk validation for group=3
NOTE: skipping rediscovery for group 3/0x725d2390 (OCR) on local instance.
NOTE: requesting all-instance disk validation for group=3
NOTE: skipping rediscovery for group 3/0x725d2390 (OCR) on local instance.
NOTE: Attempting voting file relocation on diskgroup OCR
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12207.trc  (incident=311185):
ORA-00600: internal error code, arguments: [kfdvfGetCurrent_baddsk], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_311185/+ASM1_rbal_12207_i311185.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ERROR: ORA-600 thrown in RBAL for group number 3
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12207.trc:
ORA-00600: internal error code, arguments: [kfdvfGetCurrent_baddsk], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12207.trc:
ORA-00600: internal error code, arguments: [kfdvfGetCurrent_baddsk], [], [], [], [], [], [], [], [], [], [], []
RBAL (ospid: 12207): terminating the instance due to error 488

The ORA-600 [kfdvfGetCurrent_baddsk] error makes the ADD DISK fail. The votedisk query above shows that /dev/emcpowerc, although offline in the OCR diskgroup, is still serving as a voting file, so it cannot be added back directly. As a workaround, add the other disk first:

SQL> alter diskgroup OCR add failgroup OCR_0001 disk '/dev/emcpowerd' force;

Diskgroup altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
node1-> crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   00bc3e79f7404ff2bf60925a7b8a5a6d (/dev/emcpowere) [OCR]
 2. ONLINE   0eef8152df5d4f41bf973ad5dc5a6cb1 (/dev/emcpowerd) [OCR]
Located 2 voting disk(s).

Once /dev/emcpowerd was added, /dev/emcpowerc stopped being a voting file (its place was taken by /dev/emcpowerd), so /dev/emcpowerc can now be added:

SQL> alter diskgroup OCR add failgroup OCR_0000 disk '/dev/emcpowerc' force;

Diskgroup altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
node1-> crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   00bc3e79f7404ff2bf60925a7b8a5a6d (/dev/emcpowere) [OCR]
 2. ONLINE   0eef8152df5d4f41bf973ad5dc5a6cb1 (/dev/emcpowerd) [OCR]
 3. ONLINE   4f6201f808dc4ff3bf928b14eae0d4a6 (/dev/emcpowerc) [OCR]
Located 3 voting disk(s).

ASMCMD> lsdsk -G ocr
Path
/dev/emcpowerc
/dev/emcpowerd
/dev/emcpowere
SQL> alter diskgroup OCR add failgroup OCR_0000 disk '/dev/emcpowerc' force 
NOTE: GroupBlock outside rolling migration privileged region
NOTE: Assigning number (3,0) to disk (/dev/emcpowerc)
NOTE: requesting all-instance membership refresh for group=3
NOTE: initializing header on grp 3 disk OCR_0000
NOTE: requesting all-instance disk validation for group=3
Mon Jan 24 17:47:42 2022
NOTE: skipping rediscovery for group 3/0x725dccb9 (OCR) on local instance.
NOTE: requesting all-instance disk validation for group=3
NOTE: skipping rediscovery for group 3/0x725dccb9 (OCR) on local instance.
Mon Jan 24 17:47:48 2022
GMON updating for reconfiguration, group 3 at 20 for pid 30, osid 16978
NOTE: group 3 PST updated.
NOTE: initiating PST update: grp = 3
GMON updating group 3 at 21 for pid 30, osid 16978
NOTE: group OCR: updated PST location: disk 0002 (PST copy 0)
NOTE: group OCR: updated PST location: disk 0005 (PST copy 1)
NOTE: group OCR: updated PST location: disk 0000 (PST copy 2)
NOTE: PST update grp = 3 completed successfully 
NOTE: membership refresh pending for group 3/0x725dccb9 (OCR)
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 2 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
GMON querying group 3 at 22 for pid 18, osid 15952
NOTE: cache opening disk 0 of grp 3: OCR_0000 path:/dev/emcpowerc
Mon Jan 24 17:47:53 2022
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 2 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: Failed voting file relocation on diskgroup OCR
GMON querying group 3 at 23 for pid 18, osid 15952
SUCCESS: refreshed membership for 3/0x725dccb9 (OCR)
Mon Jan 24 17:47:53 2022
SUCCESS: alter diskgroup OCR add failgroup OCR_0000 disk '/dev/emcpowerc' force
NOTE: starting rebalance of group 3/0x725dccb9 (OCR) at power 1
Starting background process ARB0
Mon Jan 24 17:47:53 2022
ARB0 started with pid=31, OS id=17092 
NOTE: assigning ARB0 to group 3/0x725dccb9 (OCR) with 1 parallel I/O
cellip.ora not found.
NOTE: F1X0 copy 3 relocating from 65534:4294967294 to 0:2 for diskgroup 3 (OCR)
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 3/0x725dccb9 (OCR)
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 2 voting file(s).
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR
NOTE: voting file allocation on grp 3 disk OCR_0000
NOTE: Successful voting file relocation on diskgroup OCR
Mon Jan 24 17:47:57 2022
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=3
NOTE: membership refresh pending for group 3/0x725dccb9 (OCR)
Mon Jan 24 17:48:03 2022
GMON querying group 3 at 24 for pid 18, osid 15952
SUCCESS: refreshed membership for 3/0x725dccb9 (OCR)
Mon Jan 24 17:48:06 2022
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Refresh completed on diskgroup OCR
. Found 3 voting file(s).
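
At this point, it is also worth confirming the OCR itself is healthy; a minimal check, run as root from the Grid Infrastructure home:

ocrcheck        # reports OCR version, used/free space and the integrity check result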

The voting files went from 2 back to 3 and the OCR diskgroup is back to its normal 3 disks, completing the recovery from the dropped OCR disks.

Analysis of a CRS-1013 "The OCR location in an ASM disk group is inaccessible" failure

A friend told me his cluster had suddenly failed and asked me to take a look. The clusterware alert log (originally in a Chinese locale):

2021-03-14 21:02:15.517 [OHASD(31771)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 31771
2021-03-14 21:02:15.561 [OHASD(31771)]CRS-0714: Oracle Clusterware Release 12.1.0.2.0.
2021-03-14 21:02:15.619 [OHASD(31771)]CRS-2112: The OLR service started on node rac1.
2021-03-14 21:02:15.791 [OHASD(31771)]CRS-1301: Oracle High Availability Service started on node rac1.
2021-03-14 21:02:15.910 [OHASD(31771)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2021-03-14 21:02:16.789 [CSSDAGENT(32015)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 32015
2021-03-14 21:02:16.868 [CSSDMONITOR(32017)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 32017
2021-03-14 21:02:17.751 [ORAROOTAGENT(32008)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 32008
2021-03-14 21:02:17.916 [ORAAGENT(32012)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 32012
2021-03-14 21:02:18.604 [ORAAGENT(32012)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:02:18.969 [ORAAGENT(32117)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 32117
2021-03-14 21:02:19.050 [MDNSD(32130)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 32130
2021-03-14 21:02:19.117 [EVMD(32132)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 32132
2021-03-14 21:02:20.078 [GPNPD(32151)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 32151
2021-03-14 21:02:21.145 [GIPCD(32172)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 32172
2021-03-14 21:02:21.163 [GPNPD(32151)]CRS-2328: GPNPD started on node rac1.
2021-03-14 21:02:22.172 [ORAROOTAGENT(32181)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 32181
2021-03-14 21:02:22.339 [CLSECHO(32204)]CRS-10001: 14-Mar-21 21:02 ACFS-9391: Checking for existing ADVM/ACFS installation.
2021-03-14 21:02:22.580 [CLSECHO(32209)]CRS-10001: 14-Mar-21 21:02 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
2021-03-14 21:02:22.598 [CLSECHO(32211)]CRS-10001: 14-Mar-21 21:02 ACFS-9393: Verifying ASM Administrator setup.
2021-03-14 21:02:22.646 [CLSECHO(32216)]CRS-10001: 14-Mar-21 21:02 ACFS-9308: Loading installed ADVM/ACFS drivers.
2021-03-14 21:02:22.678 [CLSECHO(32219)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: Loading 'oracleoks.ko' driver.
2021-03-14 21:02:22.809 [CLSECHO(32234)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: Loading 'oracleadvm.ko' driver.
2021-03-14 21:02:22.892 [CLSECHO(32290)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: Loading 'oracleacfs.ko' driver.
2021-03-14 21:02:23.054 [CLSECHO(32334)]CRS-10001: 14-Mar-21 21:02 ACFS-9327: Verifying ADVM/ACFS devices.
2021-03-14 21:02:23.079 [CLSECHO(32336)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
2021-03-14 21:02:23.108 [CLSECHO(32340)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: Detecting control device '/dev/ofsctl'.
2021-03-14 21:02:23.263 [CLSECHO(32346)]CRS-10001: 14-Mar-21 21:02 ACFS-9322: completed
2021-03-14 21:02:28.571 [CSSDMONITOR(32409)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 32409
2021-03-14 21:02:28.756 [CSSDAGENT(32425)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 32425
2021-03-14 21:02:28.975 [OCSSD(32436)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 32436
2021-03-14 21:02:30.072 [OCSSD(32436)]CRS-1713: CSSD daemon is started in hub mode
2021-03-14 21:02:46.185 [OCSSD(32436)]CRS-1707: Lease acquisition for node rac1 number 1 completed
2021-03-14 21:02:47.337 [OCSSD(32436)]CRS-1605: CSSD voting file is online: ORCL:OCR3; details in
 /u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc.
2021-03-14 21:02:47.357 [OCSSD(32436)]CRS-1605: CSSD voting file is online: ORCL:OCR2; details in
/u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc.
2021-03-14 21:02:47.365 [OCSSD(32436)]CRS-1605: CSSD voting file is online: ORCL:OCR1; details in
/u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc.
2021-03-14 21:02:48.781 [OCSSD(32436)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2021-03-14 21:02:50.971 [OCTSSD(32591)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 32591
2021-03-14 21:02:51.938 [OCTSSD(32591)]CRS-2403: The Cluster Time Synchronization Service on host rac1 is in observer mode.
2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2407: The new Cluster Time Synchronization Service reference node is host rac2.
2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2401: The Cluster Time Synchronization Service started on host rac1.
2021-03-14 21:02:52.167 [OCTSSD(32591)]CRS-2409: The clock on host rac1 is not synchronous with the mean cluster time.
No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2021-03-14 21:02:59.284 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:01.486 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:01.514 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:18.163 [OCTSSD(32591)]CRS-2407: The new Cluster Time Synchronization Service reference node is host rac1.
2021-03-14 21:03:19.406 [OCSSD(32436)]CRS-1625: Node rac2, number 2, was shut down
2021-03-14 21:03:19.419 [OCSSD(32436)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 .
2021-03-14 21:03:24.916 [OSYSMOND(318)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 318
2021-03-14 21:03:26.558 [CRSD(325)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 325
2021-03-14 21:03:27.750 [CRSD(325)]CRS-1012: The OCR service started on node rac1.
2021-03-14 21:03:27.807 [CRSD(325)]CRS-1201: CRSD started on node rac1.
2021-03-14 21:03:28.470 [ORAAGENT(1027)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 1027
2021-03-14 21:03:28.499 [ORAROOTAGENT(1031)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 1031
2021-03-14 21:03:28.515 [ORAAGENT(1036)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 1036
2021-03-14 21:03:28.666 [ORAAGENT(1036)]CRS-5011: Check of resource "oracledb" failed: details at "(:CLSN00007:)"
 (in "/u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_oracle.trc")
2021-03-14 21:03:30.649 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:30.718 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:30.722 [CRSD(325)]CRS-1024: The Cluster Ready Service terminated on this node because the ASM instance on this node is not active. Details at
(:PROCR00009:) in /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc.
2021-03-14 21:03:30.736 [ORAROOTAGENT(1031)]CRS-5822: Agent '/u01/app/grid/12.1.0/bin/orarootagent_root' disconnected from server.
Details at (:CRSAGF00117:) {0:3:3} in /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_orarootagent_root.trc.
2021-03-14 21:03:30.736 [ORAAGENT(1027)]CRS-5822: Agent '/u01/app/grid/12.1.0/bin/oraagent_grid' disconnected from server.
Details at (:CRSAGF00117:) {0:1:3} in /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc.
2021-03-14 21:03:30.793 [CRSD(1157)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 1157
2021-03-14 21:03:31.457 [OLOGGERD(1162)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 1162
2021-03-14 21:03:31.798 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:31.823 [ORAAGENT(32117)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)"
(in "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:40.234 [CRSD(1157)]CRS-1013: The OCR location in an ASM disk group is inaccessible.
Details in /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc.
2021-03-14 21:03:40.238 [CRSD(1157)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
ORA-15077: could not locate ASM instance serving a required diskgroup
]. Details at (:CRSD00111:) in /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc.

Looking at the whole cluster startup sequence, cssd and crsd both came up, and shortly afterwards crsd aborted because it could not access the OCR disk group. Since crsd did start at first, the OCR disk group must have been mounted successfully at that point; the later errors show it then became inaccessible. Based on experience, and given the repeated ora.asm check failures, the ASM instance itself was the most likely culprit. In a situation like this the best approach is to analyze the ASM alert log, which showed the following:

Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 1
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup ARCHLOG was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 2
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup DATA was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 3
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup OCR was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 5
Sun Mar 14 21:03:24 2021
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:3} */
Sun Mar 14 21:03:24 2021
WARNING: failed to online diskgroup resource ora.ARCHLOG.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.OCR.dg (unable to communicate with CRSD/OHASD)
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc  (incident=123423):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423/+ASM1_rbal_32721_i123423.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Mar 14 21:03:25 2021
ERROR: An unrecoverable error has been identified in ASM metadata.
Sun Mar 14 21:03:27 2021
NOTE: [crsd.bin@rac1.schic.org (TNS V1-V3) 325] opening OCR file +OCR.255.4294967295
Starting background process ASMB
Sun Mar 14 21:03:27 2021
ASMB started with pid=28, OS id=932 
Sun Mar 14 21:03:27 2021
NOTE: ASMB registering with ASM instance as Standard client 0xffffffffffffffff (reg:3401595347) (new connection)
Sun Mar 14 21:03:27 2021
NOTE: Standard client +ASM1:+ASM:racscan registered, osid 934, mbr 0x0, asmb 932 (reg:3401595347)
Sun Mar 14 21:03:27 2021
NOTE: ASMB connected to ASM instance +ASM1 osid: 934 (Flex mode; client id 0xffffffffffffffff)
Sun Mar 14 21:03:28 2021
NOTE: AMDU dump of disk group ARCHLOG initiated at /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423
ERROR: ORA-600 in COD recovery for diskgroup 1/0x730955f2 (ARCHLOG)
ERROR: ORA-600 thrown in RBAL for group number 1
Sun Mar 14 21:03:30 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc:
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Sun Mar 14 21:03:30 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc:
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
USER (ospid: 32721): terminating the instance due to error 488
Sun Mar 14 21:03:30 2021
System state dump requested by (instance=1, osid=32721 (RBAL)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_diag_32691_20210314210330.trc
Sun Mar 14 21:03:30 2021
Dumping diagnostic data in directory=[cdmp_20210314210330], requested by (instance=1, osid=32721 (RBAL)), summary=[abnormal instance termination].
Sun Mar 14 21:03:30 2021
Instance terminated by USER, pid = 32721

The ASM alert log confirms the suspicion: the OCR disk group was indeed mounted successfully first, and then the whole ASM instance crashed because of the ORA-00600 [kfdAuDealloc2] error against the ARCHLOG disk group. With ASM gone, crsd could no longer reach the OCR disk group, which is exactly why it reported "CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ORA-15077: could not locate ASM instance serving a required diskgroup]". The next step was to work out why the ARCHLOG disk group raised this error in the first place.

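Searching further back in the same ASM alert log for the first occurrence of the error (a grep sketch; the alert_+ASM1.log path follows the standard ADR layout seen in the trace paths above):

grep -n 'ORA-00600.*kfdAuDealloc2' /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log

The matches point to around 20:49 that evening, while files in the old disk groups were being cleaned up:
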
SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885' 
Sun Mar 14 20:46:46 2021
SUCCESS: /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885'
Sun Mar 14 20:49:24 2021
NOTE: Dropping directory '+archlog/oracledb/archivelog/2021_03_11' recursively
Sun Mar 14 20:49:24 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc  (incident=114015):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015/+ASM1_ora_15281_i114015.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Mar 14 20:49:24 2021
ERROR: An unrecoverable error has been identified in ASM metadata.
NOTE: AMDU dump of disk group ARCHLOG initiated at /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015
Sun Mar 14 20:49:28 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc  (incident=114016):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []

This database has some relevant history: a few days earlier, after a storage cache failure, it had been restored from backup into a new disk group, leaving the old disk groups unused. On the day of the incident the operations staff were apparently cleaning obsolete files out of the old disk groups; while removing archived logs they emptied +archlog/oracledb/archivelog/2021_03_11, and ASM's attempt to drop the now-empty directory hit the ORA-600 (most plausibly because the earlier storage cache incident had already corrupted that disk group's metadata). The root cause is therefore clear: the ARCHLOG disk group's metadata was damaged, so once its files had been deleted, dropping the empty directory tripped over the corruption and crashed the entire ASM instance, which in turn took down CRS. The fix was simple: since the ARCHLOG disk group was no longer needed at all, its disk headers were overwritten with dd so that the disk group is never mounted again at startup (a sketch follows), and the fault was resolved.
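
A minimal sketch of that header wipe, assuming a hypothetical ASMLib disk exposed as /dev/oracleasm/disks/ARCHLOG1 (the real member disks are not shown above); always confirm with kfed which disk group a disk belongs to before overwriting anything, and never do this to a disk group that is still needed:

kfed read /dev/oracleasm/disks/ARCHLOG1 | grep -E 'kfdhdb.grpname|kfdhdb.dskname'   # verify it is an ARCHLOG member
dd if=/dev/zero of=/dev/oracleasm/disks/ARCHLOG1 bs=1M count=10                     # destroy the disk header; irreversible

After the headers of all ARCHLOG member disks were cleared, the cluster started normally: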

[grid@rac1 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHLOG.dg
               ONLINE  OFFLINE      rac1                     STABLE
               ONLINE  OFFLINE      rac2                     STABLE
ora.DATA.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER1.lsnr
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.NEWDATA.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.OCR.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.asm
               ONLINE  ONLINE       rac1                     Started,STABLE
               ONLINE  ONLINE       rac2                     Started,STABLE
ora.net1.network
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.ons
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1                     STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1                     STABLE
ora.MGMTLSNR
      1        ONLINE  OFFLINE      rac2                     169.254.86.142 7.7.7
                                                             .1,STARTING
ora.cvu
      1        ONLINE  ONLINE       rac1                     STABLE
ora.mgmtdb
      1        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
ora.oc4j
      1        ONLINE  OFFLINE      rac1                     STARTING
ora.xff.db
      1        ONLINE  OFFLINE      rac1                     STARTING
      2        ONLINE  OFFLINE      rac2                     STARTING
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                     STABLE
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       rac1                     STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       rac1                     STABLE
--------------------------------------------------------------------------------
[grid@rac1 ~]$ 
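
Note that ora.ARCHLOG.dg still shows OFFLINE in the output above. Since the disk group is permanently gone, the leftover CRS resource can also be deregistered; one way to do it, using standard srvctl syntax:

srvctl remove diskgroup -g ARCHLOG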

has a disk HB, but no network HB: caused by traceroute failing between nodes

A customer reported that one node of a cluster was abnormal; checking showed that the crs processes were not running.


Restarting crs showed that the cssd process could not start normally:
[screenshot: ocssd.log reporting "has a disk HB, but no network HB" for the peer node]


The private network was clearly at fault. Further analysis showed that the nodes could ping each other over the private network, but traceroute to the other node did not go through:
[screenshots: traceroute over the private network failing in both directions]
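
The check can be reproduced with something like the following, assuming hypothetical private interconnect addresses 10.10.10.1 and 10.10.10.2 (the real addresses are not shown here); run from one node against the other:

ping -c 3 10.10.10.2        # ICMP echo succeeds, so the link itself is up
traceroute 10.10.10.2       # default UDP probes never reach the peer
traceroute -I 10.10.10.2    # ICMP-mode probes help confirm that only UDP is filtered

The distinction matters because the CSS network heartbeat runs over UDP on the private network: software that silently drops UDP leaves ping working while the heartbeat (and UDP-based traceroute) fails, which is exactly the "has a disk HB, but no network HB" symptom.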


The customer mentioned that security software had recently been installed on the servers; after they stopped it, traceroute worked again.


The cluster then started normally as well:

[root@his01 cssd]# crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
ora.FRA.dg
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
ora.LISTENER.lsnr
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
ora.OCRVOTE.dg
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
ora.asm
               ONLINE  ONLINE       his01                    Started
               ONLINE  ONLINE       his02                    Started
ora.gsd
               OFFLINE OFFLINE      his01
               OFFLINE OFFLINE      his02
ora.net1.network
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
ora.ons
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
ora.registry.acfs
               ONLINE  ONLINE       his01
               ONLINE  ONLINE       his02
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       his02
ora.cvu
      1        ONLINE  ONLINE       his02
ora.his01.vip
      1        ONLINE  ONLINE       his01
ora.his02.vip
      1        ONLINE  ONLINE       his02
ora.oc4j
      1        ONLINE  ONLINE       his02
ora.orcl.db
      1        ONLINE  ONLINE       his01                    Open
      2        ONLINE  ONLINE       his02                    Open
ora.scan1.vip
      1        ONLINE  ONLINE       his02

12.1: database crash caused by manually changing the operating system time

A customer running a 12.1.0.1 RAC database had an instance restart unexpectedly and asked for help analyzing the cause.
The database alert log reported ORA-15064:

Mon Apr 15 15:06:26 2019
WARNING: inbound connection timed out (ORA-3136)
Mon Apr 15 15:41:26 2019
NOTE: ASMB terminating
Mon Apr 15 15:41:26 2019
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_asmb_61426.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1892 Serial number: 29
Mon Apr 15 15:41:26 2019
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_asmb_61426.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1892 Serial number: 29
Mon Apr 15 15:41:26 2019
System state dump requested by (instance=1, osid=61426 (ASMB)), summary=[abnormal instance termination].
Mon Apr 15 15:41:26 2019
USER (ospid: 61426): terminating the instance due to error 15064
Mon Apr 15 15:41:26 2019
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_61287.trc
Mon Apr 15 15:41:27 2019
opiodr aborting process unknown ospid (1171) as a result of ORA-1092
Mon Apr 15 15:41:27 2019
ORA-1092 : opitsk aborting process

From this it is clear that the ASMB process failed, the database could no longer communicate with the ASM instance, and the database instance crashed as a result.

Analyzing the ASM alert log:

Mon Apr 15 15:41:26 2019
WARNING: client [+ASM1:+ASM] not responsive for 2069s; state=0x1. pid 23155
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: client [orcl1:orcl] not responsive for 2069s; state=0x1. killing pid 61436
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: fencing client [orcl1:orcl] after 2069 seconds (mbr 2)
WARNING: client [-MGMTDB:_mgmtdb] not responsive for 2070s; state=0x1. killing pid 24026
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: fencing client [-MGMTDB:_mgmtdb] after 2070 seconds (mbr 1)
Mon Apr 15 15:41:26 2019
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb
NOTE: cleaned up ASM client orcl1:orcl
Mon Apr 15 15:41:43 2019
NOTE: Standard client -MGMTDB:_mgmtdb registered, osid 183707, mbr 0x1 (reg:1371965153)
Mon Apr 15 15:42:16 2019
NOTE: Standard client orcl1:orcl registered, osid 184063, mbr 0x2 (reg:2088418628)
Mon Apr 15 15:44:30 2019
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.

The ASM log closely matches the description in the MOS note GEN0 terminating the ASM instance due to error 15082 (Doc ID 2096988.1). The customer confirmed that the system time had been adjusted via ntp, so the incident can essentially be attributed to Oracle Bug 19032250 (fixed in 12.1.0.2), which is triggered when ntp steps the clock by too large an amount (manually changing the time directly can cause the same problem).
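
When a time-related problem like this is suspected, it helps to first confirm how the cluster is keeping time; two standard checks are:

crsctl check ctss              # reports whether CTSS is in observer or active mode
cluvfy comp clocksync -n all   # verifies clock synchronization across all cluster nodes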

Recommendations for changing the time on a RAC system (a sketch of case 1 follows the list):
1. If the clock is behind, shut down the database and the clusterware, move the time forward directly, then start the clusterware and the database.
2. If the clock is ahead, shut down the database and the clusterware, wait until the true time has passed the time at which they were shut down, then set the clock back and start the clusterware and the database.
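
A minimal sketch of case 1 (clock behind), with the crsctl commands run as root and the srvctl commands as the database owner, on every node; the database name orcl comes from the logs above, and the date value is hypothetical:

srvctl stop database -d orcl      # stop the database cleanly first
crsctl stop crs                   # stop the clusterware stack (repeat on each node)
date -s "2019-04-15 16:00:00"     # set the correct, later time (hypothetical value)
crsctl start crs                  # start the clusterware stack again
srvctl start database -d orcl     # then restart the database

Case 2 differs only in waiting for the wall clock to pass the shutdown time before setting the time back.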