联系:手机/微信(+86 17813235971) QQ(107644445)
标题:一次 CRS-1013: ASM 磁盘组中的 OCR 位置不可访问 故障分析
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
有朋友告知我集群突然异常,让我给看看什么原因,集群alert日志
2021-03-14 21:02:15.517 [OHASD(31771)]CRS-8500: Oracle Clusterware OHASD 进程以操作系统进程 ID 31771 开头 2021-03-14 21:02:15.561 [OHASD(31771)]CRS-0714: Oracle Clusterware 发行版 12.1.0.2.0。 2021-03-14 21:02:15.619 [OHASD(31771)]CRS-2112: 已在节点 rac1 上启动 OLR 服务。 2021-03-14 21:02:15.791 [OHASD(31771)]CRS-1301: 已在节点 rac1 上启动 Oracle 高可用性服务。 2021-03-14 21:02:15.910 [OHASD(31771)]CRS-8017: 位置:/etc/oracle/lastgasp具有2个重新启动指导日志文件,0个已发布,0个出现错误 2021-03-14 21:02:16.789 [CSSDAGENT(32015)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 32015 开头 2021-03-14 21:02:16.868 [CSSDMONITOR(32017)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 32017 开头 2021-03-14 21:02:17.751 [ORAROOTAGENT(32008)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 32008 开头 2021-03-14 21:02:17.916 [ORAAGENT(32012)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 32012 开头 2021-03-14 21:02:18.604 [ORAAGENT(32012)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:02:18.969 [ORAAGENT(32117)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 32117 开头 2021-03-14 21:02:19.050 [MDNSD(32130)]CRS-8500: Oracle Clusterware MDNSD 进程以操作系统进程 ID 32130 开头 2021-03-14 21:02:19.117 [EVMD(32132)]CRS-8500: Oracle Clusterware EVMD 进程以操作系统进程 ID 32132 开头 2021-03-14 21:02:20.078 [GPNPD(32151)]CRS-8500: Oracle Clusterware GPNPD 进程以操作系统进程 ID 32151 开头 2021-03-14 21:02:21.145 [GIPCD(32172)]CRS-8500: Oracle Clusterware GIPCD 进程以操作系统进程 ID 32172 开头 2021-03-14 21:02:21.163 [GPNPD(32151)]CRS-2328: 已在节点 rac1 上启动 GPNPD。 2021-03-14 21:02:22.172 [ORAROOTAGENT(32181)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 32181 开头 2021-03-14 21:02:22.339 [CLSECHO(32204)]CRS-10001: 14-Mar-21 21:02 ACFS-9391: 正在检查现有 ADVM/ACFS 安装。 2021-03-14 21:02:22.580 [CLSECHO(32209)]CRS-10001: 14-Mar-21 21:02 ACFS-9392: 正在验证操作系统的 ADVM/ACFS 安装文件。 2021-03-14 21:02:22.598 [CLSECHO(32211)]CRS-10001: 14-Mar-21 21:02 ACFS-9393: 正在验证 ASM 管理员设置。 2021-03-14 21:02:22.646 [CLSECHO(32216)]CRS-10001: 14-Mar-21 21:02 ACFS-9308: 正在加载已安装的 ADVM/ACFS 驱动程序。 2021-03-14 21:02:22.678 [CLSECHO(32219)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleoks.ko' 驱动程序。 2021-03-14 21:02:22.809 [CLSECHO(32234)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleadvm.ko' 驱动程序。 2021-03-14 21:02:22.892 [CLSECHO(32290)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleacfs.ko' 驱动程序。 2021-03-14 21:02:23.054 [CLSECHO(32334)]CRS-10001: 14-Mar-21 21:02 ACFS-9327: 正在验证 ADVM/ACFS 设备。 2021-03-14 21:02:23.079 [CLSECHO(32336)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: 正在检测控制设备 '/dev/asm/.asm_ctl_spec'。 2021-03-14 21:02:23.108 [CLSECHO(32340)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: 正在检测控制设备 '/dev/ofsctl'。 2021-03-14 21:02:23.263 [CLSECHO(32346)]CRS-10001: 14-Mar-21 21:02 ACFS-9322: 已完成 2021-03-14 21:02:28.571 [CSSDMONITOR(32409)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 32409 开头 2021-03-14 21:02:28.756 [CSSDAGENT(32425)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 32425 开头 2021-03-14 21:02:28.975 [OCSSD(32436)]CRS-8500: Oracle Clusterware OCSSD 进程以操作系统进程 ID 32436 开头 2021-03-14 21:02:30.072 [OCSSD(32436)]CRS-1713: CSSD 守护程序已在 hub 模式下启动 2021-03-14 21:02:46.185 [OCSSD(32436)]CRS-1707: 节点 rac1 (编号为 1) 的租约获取已完成 2021-03-14 21:02:47.337 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR3; 详细资料见 /u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。 2021-03-14 21:02:47.357 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR2; 详细资料见 /u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。 2021-03-14 21:02:47.365 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR1; 详细资料见 /u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。 2021-03-14 21:02:48.781 [OCSSD(32436)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac1 rac2 。 2021-03-14 21:02:50.971 [OCTSSD(32591)]CRS-8500: Oracle Clusterware OCTSSD 进程以操作系统进程 ID 32591 开头 2021-03-14 21:02:51.938 [OCTSSD(32591)]CRS-2403: 主机 rac1 上的集群时间同步服务处于观察程序模式。 2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac2。 2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2401: 已在主机 rac1 上启动了集群时间同步服务。 2021-03-14 21:02:52.167 [OCTSSD(32591)]CRS-2409: 主机 rac1 上的时钟与集群标准时间不同步。 由于集群时间同步服务正在以观察程序模式运行, 所以未采取任何操作。 2021-03-14 21:02:59.284 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" ( 位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:01.486 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" ( 位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:01.514 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" ( 位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:18.163 [OCTSSD(32591)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac1。 2021-03-14 21:03:19.406 [OCSSD(32436)]CRS-1625: 节点 rac2 (编号为 2) 已关闭 2021-03-14 21:03:19.419 [OCSSD(32436)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac1 。 2021-03-14 21:03:24.916 [OSYSMOND(318)]CRS-8500: Oracle Clusterware OSYSMOND 进程以操作系统进程 ID 318 开头 2021-03-14 21:03:26.558 [CRSD(325)]CRS-8500: Oracle Clusterware CRSD 进程以操作系统进程 ID 325 开头 2021-03-14 21:03:27.750 [CRSD(325)]CRS-1012: 已在节点 rac1 上启动 OCR 服务。 2021-03-14 21:03:27.807 [CRSD(325)]CRS-1201: 已在节点 rac1 上启动 CRSD。 2021-03-14 21:03:28.470 [ORAAGENT(1027)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 1027 开头 2021-03-14 21:03:28.499 [ORAROOTAGENT(1031)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 1031 开头 2021-03-14 21:03:28.515 [ORAAGENT(1036)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 1036 开头 2021-03-14 21:03:28.666 [ORAAGENT(1036)]CRS-5011: 检查资源 "oracledb" 失败: 详细资料见 "(:CLSN00007:)" (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_oracle.trc") 2021-03-14 21:03:30.649 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:30.718 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:30.722 [CRSD(325)]CRS-1024: 由于此节点上的 ASM 实例未处于活动状态, 此节点上的集群就绪服务终止。详细信息见 (:PROCR00009:) (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc)。 2021-03-14 21:03:30.736 [ORAROOTAGENT(1031)]CRS-5822: 代理 '/u01/app/grid/12.1.0/bin/orarootagent_root' 已从服务器断开连接。 详细资料见 (:CRSAGF00117:) {0:3:3} (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_orarootagent_root.trc)。 2021-03-14 21:03:30.736 [ORAAGENT(1027)]CRS-5822: 代理 '/u01/app/grid/12.1.0/bin/oraagent_grid' 已从服务器断开连接。 详细资料见 (:CRSAGF00117:) {0:1:3} (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc)。 2021-03-14 21:03:30.793 [CRSD(1157)]CRS-8500: Oracle Clusterware CRSD 进程以操作系统进程 ID 1157 开头 2021-03-14 21:03:31.457 [OLOGGERD(1162)]CRS-8500: Oracle Clusterware OLOGGERD 进程以操作系统进程 ID 1162 开头 2021-03-14 21:03:31.798 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:31.823 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc") 2021-03-14 21:03:40.234 [CRSD(1157)]CRS-1013: ASM 磁盘组中的 OCR 位置不可访问。 详细资料见 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc。 2021-03-14 21:03:40.238 [CRSD(1157)]CRS-0804: 由于 Oracle 集群注册表错误 [PROC-26: 访问物理存储时出错 ORA-15077: 找不到提供所需磁盘组的 ASM 实例 ], 集群就绪服务中止。详细资料见 (:CRSD00111:) (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc)。
从整个集群的启动过程看cssd,crs都起来了,然后等一会由于crs无法访问ocr磁盘组,导致异常.开始crs起来了,证明ocr磁盘组应该是mount成功过.后面看错误提示又无法访问了.根据经验以及ora.asm失败的提示,怀疑很可能是asm实例出现问题了.对于这样的情况,分析asm的alert日志是最好的方法.通过分析日志发现
Sun Mar 14 21:03:24 2021 NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 1 Sun Mar 14 21:03:24 2021 SUCCESS: diskgroup ARCHLOG was mounted Sun Mar 14 21:03:24 2021 NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 2 Sun Mar 14 21:03:24 2021 SUCCESS: diskgroup DATA was mounted Sun Mar 14 21:03:24 2021 NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 3 Sun Mar 14 21:03:24 2021 SUCCESS: diskgroup OCR was mounted Sun Mar 14 21:03:24 2021 NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 5 Sun Mar 14 21:03:24 2021 SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:3} */ Sun Mar 14 21:03:24 2021 WARNING: failed to online diskgroup resource ora.ARCHLOG.dg (unable to communicate with CRSD/OHASD) WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD) WARNING: failed to online diskgroup resource ora.OCR.dg (unable to communicate with CRSD/OHASD) Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc (incident=123423): ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], [] Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423/+ASM1_rbal_32721_i123423.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Sun Mar 14 21:03:25 2021 ERROR: An unrecoverable error has been identified in ASM metadata. Sun Mar 14 21:03:27 2021 NOTE: [crsd.bin@rac1.schic.org (TNS V1-V3) 325] opening OCR file +OCR.255.4294967295 Starting background process ASMB Sun Mar 14 21:03:27 2021 ASMB started with pid=28, OS id=932 Sun Mar 14 21:03:27 2021 NOTE: ASMB registering with ASM instance as Standard client 0xffffffffffffffff (reg:3401595347) (new connection) Sun Mar 14 21:03:27 2021 NOTE: Standard client +ASM1:+ASM:racscan registered, osid 934, mbr 0x0, asmb 932 (reg:3401595347) Sun Mar 14 21:03:27 2021 NOTE: ASMB connected to ASM instance +ASM1 osid: 934 (Flex mode; client id 0xffffffffffffffff) Sun Mar 14 21:03:28 2021 NOTE: AMDU dump of disk group ARCHLOG initiated at /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423 ERROR: ORA-600 in COD recovery for diskgroup 1/0x730955f2 (ARCHLOG) ERROR: ORA-600 thrown in RBAL for group number 1 Sun Mar 14 21:03:30 2021 Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc: ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], [] Sun Mar 14 21:03:30 2021 Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc: ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], [] USER (ospid: 32721): terminating the instance due to error 488 Sun Mar 14 21:03:30 2021 System state dump requested by (instance=1, osid=32721 (RBAL)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_diag_32691_20210314210330.trc Sun Mar 14 21:03:30 2021 Dumping diagnostic data in directory=[cdmp_20210314210330], requested by (instance=1, osid=32721 (RBAL)), s ummary=[abnormal instance termination]. Sun Mar 14 21:03:30 2021 Instance terminated by USER, pid = 32721
通过上述日志,果然发现ocr磁盘组先mount成功,然后asm实例由于ARCHLOG磁盘组的ORA-00600 kfdAuDealloc2错误而导致整个实例crash,从而使得ocr磁盘组无法被crs访问,从而出现了”CRS-0804: 由于 Oracle 集群注册表错误 [PROC-26: 访问物理存储时出错 ORA-15077: 找不到提供所需磁盘组的 ASM 实例], 集群就绪服务中止”这样的错误提示.进一步分析为什么archlog进程会报这个错误.
SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885' Sun Mar 14 20:46:46 2021 SUCCESS: /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885' Sun Mar 14 20:49:24 2021 NOTE: Dropping directory '+archlog/oracledb/archivelog/2021_03_11' recursively Sun Mar 14 20:49:24 2021 Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc (incident=114015): ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], [] Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015/+ASM1_ora_15281_i114015.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Sun Mar 14 20:49:24 2021 ERROR: An unrecoverable error has been identified in ASM metadata. NOTE:AMDU dump of disk group ARCHLOG initiated at/u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015 Sun Mar 14 20:49:28 2021 Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc (incident=114016): ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
因为这个库有一个历史背景:几天前由于存储cache导致,数据库使用备份还原(还原到一个新磁盘组中,老磁盘组没有使用),今天估计是运维人员在清理老磁盘组中不要的文件,然后archlog中的归档日志的时候,清空了+archlog/oracledb/archivelog/2021_03_11中的文件,然后触发asm删除该目录的异常(异常原因估计和上次清理存储cache引起了该磁盘组的元数据异常有关).该故障的基本思路原因已经清楚:由于archlog磁盘组本身元数据库有问题,清理该磁盘组文件之后,引起该磁盘组删除空目录出发问题,从而使得整个asm 实例crash.进而引起crs异常.解决方法比较简单,因为archlog磁盘组本身已经不需要,直接dd掉磁盘头,让其启动的时候不再mount,故障解决
[grid@rac1 ~]$ crsctl status res -t -------------------------------------------------------------------------------- Name Target State Server State details -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.ARCHLOG.dg ONLINE OFFLINE rac1 STABLE ONLINE OFFLINE rac2 STABLE ora.DATA.dg ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE ora.LISTENER.lsnr ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE ora.LISTENER1.lsnr ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE ora.NEWDATA.dg ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE ora.OCR.dg ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE ora.asm ONLINE ONLINE rac1 Started,STABLE ONLINE ONLINE rac2 Started,STABLE ora.net1.network ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE ora.ons ONLINE ONLINE rac1 STABLE ONLINE ONLINE rac2 STABLE -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.LISTENER_SCAN1.lsnr 1 ONLINE ONLINE rac2 STABLE ora.LISTENER_SCAN2.lsnr 1 ONLINE ONLINE rac1 STABLE ora.LISTENER_SCAN3.lsnr 1 ONLINE ONLINE rac1 STABLE ora.MGMTLSNR 1 ONLINE OFFLINE rac2 169.254.86.142 7.7.7 .1,STARTING ora.cvu 1 ONLINE ONLINE rac1 STABLE ora.mgmtdb 1 ONLINE OFFLINE Instance Shutdown,ST ABLE ora.oc4j 1 ONLINE OFFLINE rac1 STARTING ora.xff.db 1 ONLINE OFFLINE rac1 STARTING 2 ONLINE OFFLINE rac2 STARTING ora.rac1.vip 1 ONLINE ONLINE rac1 STABLE ora.rac2.vip 1 ONLINE ONLINE rac2 STABLE ora.scan1.vip 1 ONLINE ONLINE rac2 STABLE ora.scan2.vip 1 ONLINE ONLINE rac1 STABLE ora.scan3.vip 1 ONLINE ONLINE rac1 STABLE -------------------------------------------------------------------------------- [grid@rac1 ~]$