ORA-01172 ORA-01151故障处理

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:ORA-01172 ORA-01151故障处理

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

接手客户一个云平台硬件故障恢复之后,数据库无法启动的case,通过分析alert日志,发现数据库在open过程中报ORA-01172: recovery of thread 1 stuck at block 2220167 of file 262,ORA-01151: use media recovery to recover block, restore backup if needed等相关错误(其实也就是在做实例恢复的过程中报了logically corrupt导致无法完成实例恢复)

Sat Nov 01 14:29:10 2025
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
 parallel recovery started with 15 processes
Started redo scan
Completed redo scan
 read 7034 KB redo, 937 data blocks need recovery
Started redo application at
 Thread 1: logseq 296553, block 389408
Recovery of Online Redo Log: Thread 1 Group 3 Seq 296553 Reading mem 0
  Mem# 0: /data/orcl/onlinelog/redo03a.log
Sat Nov 01 14:29:11 2025
Hex dump of (file 262, block 2220584) in trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p009_2648.trc
Sat Nov 01 14:29:11 2025
Hex dump of (file 262, block 2218886) in trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p007_2644.trc
Reading datafile '/data/orcl/datafile/xifenfei12.dbf' for corruption at rdba: 0x41a1db86 (file 262, block 2218886)
Reading datafile '/data/orcl/datafile/xifenfei12.dbf' for corruption at rdba: 0x41a1e228 (file 262, block 2220584)
Sat Nov 01 14:29:11 2025
Hex dump of (file 262, block 2219845) in trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p008_2646.trc
Reading datafile '/data/orcl/datafile/xifenfei12.dbf' for corruption at rdba: 0x41a1df45 (file 262, block 2219845)
Sat Nov 01 14:29:11 2025
Hex dump of (file 262, block 2220167) in trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p001_2632.trc
Reading datafile '/data/orcl/datafile/xifenfei12.dbf' for corruption at rdba: 0x41a1e087 (file 262, block 2220167)
Reread (file 262, block 2218886) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 1 STUCK AT BLOCK 2218886 OF FILE 262
Reread (file 262, block 2220584) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 1 STUCK AT BLOCK 2220584 OF FILE 262
Reread (file 262, block 2219845) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 1 STUCK AT BLOCK 2219845 OF FILE 262
Reread (file 262, block 2220167) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 1 STUCK AT BLOCK 2220167 OF FILE 262
Sat Nov 01 14:29:26 2025
Slave exiting with ORA-1172 exception
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p007_2644.trc:
ORA-01172: recovery of thread 1 stuck at block 2218886 of file 262
ORA-01151: use media recovery to recover block, restore backup if needed
Sat Nov 01 14:29:26 2025
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p008_2646.trc:
ORA-10388: parallel query server interrupt (failure)
Sat Nov 01 14:29:26 2025
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p009_2648.trc:
ORA-10388: parallel query server interrupt (failure)
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p008_2646.trc:
ORA-10388: parallel query server interrupt (failure)
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p009_2648.trc:
ORA-10388: parallel query server interrupt (failure)
Sat Nov 01 14:29:26 2025
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p001_2632.trc:
ORA-10388: parallel query server interrupt (failure)
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_p001_2632.trc:
ORA-10388: parallel query server interrupt (failure)
Sat Nov 01 14:29:26 2025
Aborting crash recovery due to slave death, attempting serial crash recovery
Beginning crash recovery of 1 threads
Started redo scan
Completed redo scan
 read 7034 KB redo, 937 data blocks need recovery
Started redo application at
 Thread 1: logseq 296553, block 389408
Recovery of Online Redo Log: Thread 1 Group 3 Seq 296553 Reading mem 0
  Mem# 0: /data/orcl/onlinelog/redo03a.log
Hex dump of (file 262,block 2220167) in trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_2606.trc
Reading datafile '/data/orcl/datafile/xifenfei12.dbf'for corruption at rdba: 0x41a1e087 (file 262,block 2220167)
Reread (file 262, block 2220167) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 1 STUCK AT BLOCK 2220167 OF FILE 262
Aborting crash recovery due to error 1172
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_2606.trc:
ORA-01172: recovery of thread 1 stuck at block 2220167 of file 262
ORA-01151: use media recovery to recover block, restore backup if needed
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_2606.trc:
ORA-01172: recovery of thread 1 stuck at block 2220167 of file 262
ORA-01151: use media recovery to recover block, restore backup if needed
ORA-1172 signalled during: ALTER DATABASE OPEN...

接手故障之后,尝试recover database恢复,结果报ORA-600 4552错误

SQL> recover database;
ORA-10562: Error occurred while applying redo to data block (file# 262, block#
2222153)
ORA-10564: tablespace XIFENFEI
ORA-01110: data file 262: '/data/orcl/datafile/xifenfei12.dbf'
ORA-10560: block type '0'
ORA-00600: internal error code, arguments: [4552], [1], [0], [], [], [], [],
[], [], [], [], []

关于ORA-600 4552对应的alert日志信息

Sat Nov 01 17:49:58 2025
ALTER DATABASE RECOVER  database  
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 16 slaves
Sat Nov 01 17:49:59 2025
Recovery of Online Redo Log: Thread 1 Group 3 Seq 296553 Reading mem 0
  Mem# 0: /data/orcl/onlinelog/redo03a.log
Sat Nov 01 17:49:59 2025
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_pr0c_28770.trc  (incident=1018821):
ORA-00600: internal error code, arguments: [4552], [1], [0], [], [], [], [], [], [], [], [], []
Incident details in:/u01/app/oracle/diag/rdbms/orcl/orcl/incident/incdir_1018821/orcl_pr0c_28770_i1018821.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sat Nov 01 17:50:03 2025
Sweep [inc][1018821]: completed
Sweep [inc2][1018821]: completed
Slave exiting with ORA-10562 exception
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_pr0c_28770.trc:
ORA-10562: Error occurred while applying redo to data block (file# 262, block# 2222153)
ORA-10564: tablespace xifenfei
ORA-01110: data file 262: '/data/orcl/datafile/xifenfei12.dbf'
ORA-10560: block type '0'
ORA-00600: internal error code, arguments: [4552], [1], [0], [], [], [], [], [], [], [], [], []
Recovery Slave PR0C previously exited with exception 10562
Media Recovery failed with error 448
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_pr00_28732.trc:
ORA-00283: recovery session canceled due to errors
ORA-00448: normal completion of background process
ORA-10562 signalled during: ALTER DATABASE RECOVER  database  ...

无法整个库级别recover,尝试数据文件recover操作

SQL> recover datafile 1;
Media recovery complete.
…………
SQL> recover datafile 22,23,24,26,25,27,28,29,30;
Media recovery complete.
…………
SQL>    recover datafile  251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261;
Media recovery complete.
SQL> recover datafile 262;
ORA-00283: recovery session canceled due to errors
ORA-00600: internal error code, arguments: [3020], [262], [2215808],
[1101123456], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 262, block# 2215808,
file offset is 972029952 bytes)
ORA-10564: tablespace XIFENFEI
ORA-01110: data file 262: '/data/orcl/datafile/xifenfei12.dbf'
ORA-10560: block type '0'

SQL> recover datafile 263;
Media recovery complete.

出file# 262数据文件之外,其他文件全部recover成功,对应的ORA-600 3020错误相关alert日志信息

Sat Nov 01 17:53:37 2025
ALTER DATABASE RECOVER  datafile 262  
Media Recovery Start
Serial Media Recovery started
Recovery of Online Redo Log: Thread 1 Group 3 Seq 296553 Reading mem 0
  Mem# 0: /data/orcl/onlinelog/redo03a.log
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_28561.trc  (incident=1018717):
ORA-00600: internal error code, arguments: [3020], [262], [2215808], [1101123456], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 262, block# 2215808, file offset is 972029952 bytes)
ORA-10564: tablespace xifenfei
ORA-01110: data file 262: '/data/orcl/datafile/xifenfei12.dbf'
ORA-10560: block type '0'
Incident details in:/u01/app/oracle/diag/rdbms/orcl/orcl/incident/incdir_1018717/orcl_ora_28561_i1018717.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Media Recovery failed with error 600
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 262  ...

对于这种情况,有两种处理方式:
1)在recover过程中对于报错的block标记为坏块,然后继续恢复,这样正常应用日志成功,再把标记的坏块修复好
2)直接修改该文件头跳过该文件跳过这些block的应用日志,直接骗过数据库
在本case中由于客户急着恢复业务,需要尽快处理,所以采用了第一个方案,这里我使用自研的m_scn(modify_scn)工具快速修改相关数据文件信息

[oracle@host-172-18-50-10 tmp]$ cat 1.txt
1@/data/orcl/datafile/system01.dbf
262@/data/orcl/datafile/xifenfei12.dbf
[oracle@host-172-18-50-10 tmp]$ ./m_scn 1.txt
Please Enter Password: 

===== Starting Datafile Header modification program =====
Datafile list file: 1.txt
Operation Mode: Only Modify Datafile Header CheckPoint
Block Size: 8192
Log Path: /tmp/modify_scn
---------------------------------------------------------
Preparing Datafile list file...
Verifying Datafile existence...
Datafile verification passed
Initializing working directory...
Recovery script created: /tmp/modify_scn/backup/recover_datafile.sh
---------------------------------------------------------
Starting Datafile Header processing (total 2 files)...
[1/2] Processing Datafile Header: /data/orcl/datafile/system01.dbf (File number: 1)
  - Skipping file number 1 (control file)
---------------------------------------------------------
[2/2] Processing Datafile Header: /data/orcl/datafile/xifenfei12.dbf (File number: 262)
  - Backing up Datafile header...
  - Executing Datafile Header modification with block size 8192...
  - Datafile Header processing completed
---------------------------------------------------------
Cleaning up temporary files...
================= All operations completed =================

Note: Execute /tmp/modify_scn/backup/recover_datafile.sh operation for rollback

然后查询相关scn信息,确认修改文件信息没有问题并尝试recover 262号文件

[oracle@host-172-18-50-10 tmp]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Sat Nov 1 18:02:18 2025

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup mount;
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 4.1823E+10 bytes
Fixed Size                  2262368 bytes
Variable Size            4294970016 bytes
Database Buffers         3.7447E+10 bytes
Redo Buffers               78614528 bytes
Database mounted.
SQL> set pages 10000
SQL> set numw 16
SQL> SELECT status,
  2  checkpoint_change#,
  3  to_char(checkpoint_time,'yyyy-mm-dd hh24:mi:ss') checkpoint_time,
  4  last_change#,
  5  count(*) ROW_NUM
FROM v$datafile
  6    7  GROUP BY status, checkpoint_change#, checkpoint_time,last_change#
ORDER BY status, checkpoint_change#, checkpoint_time;
  8  

set numw 16
col CHECKPOINT_TIME for a40
set lines 150
set pages 1000
SELECT status,
to_char(checkpoint_time,'yyyy-mm-dd hh24:mi:ss') checkpoint_time,FUZZY,checkpoint_change#,
count(*) ROW_NUM
FROM v$datafile_header
GROUP BY status, checkpoint_change#, to_char(checkpoint_time,'yyyy-mm-dd hh24:mi:ss'),fuzzy
ORDER BY status, checkpoint_change#, checkpoint_time;

SELECT dd.FILE#,
dd.NAME,
dd.STATUS,
dd.checkpoint_change# dfile_chkp_change,
dh.checkpoint_change# dfile_hed_chkp_change,
dh.recover,
dh.fuzzy
FROM v$datafile dd,
v$datafile_header dh
WHERE dd.FILE#=dh.FILE#
AND dd.checkpoint_change#<>dh.checkpoint_change#;


STATUS  CHECKPOINT_CHANGE# CHECKPOINT_TIME         LAST_CHANGE#          ROW_NUM
------- ------------------ ------------------- ---------------- ----------------
ONLINE      16816934458875 2025-11-01 17:58:28   16816934458875              258
RECOVER     16816934368799 2025-11-01 05:29:39   16816934456943                1
SYSTEM      16816934458875 2025-11-01 17:58:28   16816934458875                4

SQL> SQL> SQL> SQL> SQL> SQL> SQL>   2    3    4    5    6  
STATUS  CHECKPOINT_TIME                          FUZ CHECKPOINT_CHANGE#          ROW_NUM
------- ---------------------------------------- --- ------------------ ----------------
OFFLINE 2025-11-01 17:58:28                      NO      16816934458875                1
ONLINE  2025-11-01 17:58:28                      NO      16816934458875              262

SQL> SQL>   2    3    4    5    6    7    8    9   10   11  
           FILE#
----------------
NAME
----------------------------------------------------------------------------------
STATUS  DFILE_CHKP_CHANGE DFILE_HED_CHKP_CHANGE REC FUZ
------- ----------------- --------------------- --- ---
             262
/data/orcl/datafile/xifenfei12.dbf
RECOVER    16816934368799        16816934458875 YES NO

SQL> recover datafile 262;
Media recovery complete.

open数据库成功

SQL> alter database open;

Database altered.
Sat Nov 01 18:06:00 2025
ALTER DATABASE OPEN
Thread 1 opened at log sequence 296554
  Current log# 1 seq# 296554 mem# 0: /data/orcl/onlinelog/redo01a.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
[33941] Successfully onlined Undo Tablespace 143.
Undo initialization finished serial:0 start:12793234 end:12793304 diff:70 (0 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is ZHS16GBK
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Sat Nov 01 18:06:01 2025
QMNC started with pid=20, OS id=33973 
Completed: ALTER DATABASE OPEN

至此基本上完成本次恢复任务,后续根据alert日志,有个别表可能由于在file# 262中丢失一些数据导致不一致的问题进行单独,其他没有太大问题,最快帮客户恢复了业务

分布式存储故障导致数据库无法启动故障处理

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:分布式存储故障导致数据库无法启动故障处理

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

国内xx医院使用了国外医疗行业龙头的pacs系统,由于是一个历史库,存放在分布式存储中,由于存储同时多个节点故障,导致数据库多个文件异常,数据库无法启动,三方维护人员尝试通通过rman归档进行应用日志,结果发现日志有损坏报ORA-00354 ORA-00353,无法记录恢复,希望我们给予支持
ORA-00353

Mon Apr 29 13:28:40 2024
Media Recovery failed with error 354
Mon Apr 29 13:28:40 2024
Errors in file F:\XXXXXX_DB\ORACLE\ADMIN\diag\rdbms\xxx\msxxx1\trace\xxx_pr00_4568.trc:
ORA-00283: recovery session canceled due to errors
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 487424 change 8737273868 time 04/01/2024 01:38:25
ORA-00334: archived log: 'F:\XXXXXX_DB\ORADATA\XXX\LOGARC0000052184_0922116268.0001'
ORA-283 signalled during: alter database recover logfile 'F:\XXXXXX_DB\ORADATA\XXX\LOGARC0000052184_0922116268.0001'...

接手故障之后,通过尝试恢复发现除该错误之外,还有ORA-600 4552之类错误
ORA-600-4552


跳过这些异常文件恢复,最终确认异常文件有如下部分
20240511223413

这些文件由于日志无法正常应用(有日志损坏无法应用,有日志和数据文件block不匹配导致无法应用),这样的情况直接通过自研的Oracle Recovery Tools小工具直接修改文件头信息
orarecovery

然后尝试OPEN数据库结果报ORA-1207
ORA-1207

对于这个故障可以通过rectl或者using backup ctl方式处理,然后open数据库成功
20240510224845

由于该系统是历史库,不会有新业务写入,通过对异常表和索引进行处理之后,客户测试业务可以正常访问,完成本次恢复

记录一次由于坏块和不恰当恢复引起各种ORA-600案例

联系:手机/微信(+86 17813235971) QQ(107644445)

标题:记录一次由于坏块和不恰当恢复引起各种ORA-600案例

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

朋友让我帮忙处理一个不能open的库,打开alert日志一看,傻眼了,里面是各种ORA-600的错误应有尽有,被折腾的够惨
故障后重启,无法启动主要表现在block坏块,引起的各种ORA-600等错误

Mon Mar 02 16:09:27 2015
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
 parallel recovery started with 23 processes
Started redo scan
Completed redo scan
 read 962 KB redo, 256 data blocks need recovery
Started redo application at
 Thread 1: logseq 726, block 37343
Recovery of Online Redo Log: Thread 1 Group 3 Seq 726 Reading mem 0
  Mem# 0: /u01/app/oracle/oradata/oa/redo03.log
Mon Mar 02 16:09:27 2015
RECOVERY OF THREAD 1 STUCK AT BLOCK 1673 OF FILE 3
Completed redo application of 0.27MB
Mon Mar 02 16:09:27 2015
RECOVERY OF THREAD 1 STUCK AT BLOCK 3104 OF FILE 3
Mon Mar 02 16:09:27 2015
RECOVERY OF THREAD 1 STUCK AT BLOCK 3613 OF FILE 3
Mon Mar 02 16:09:28 2015
RECOVERY OF THREAD 1 STUCK AT BLOCK 272 OF FILE 3
Mon Mar 02 16:09:28 2015
RECOVERY OF THREAD 1 STUCK AT BLOCK 2512 OF FILE 3
Hex dump of (file 2, block 92889) in trace file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_dbw2_4158.trc
Corrupt block relative dba: 0x00816ad9 (file 2, block 92889)
Bad header found during preparing block for write
Data in bad block:
 type: 0 format: 0 rdba: 0x6ad90000
 last change scn: 0x0000.00c6a052 seq: 0x1 flg: 0x00
 spare1: 0x6 spare2: 0xa2 spare3: 0x5d7e
 consistency value in tail: 0xa0520001
 check value in block header: 0x0
 block checksum disabled
Mon Mar 02 16:09:28 2015
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p007_4196.trc  (incident=3833):
ORA-00600: internal error code, arguments: [4502], [1], [], [], [], [], [], [], [], [], [], []
Mon Mar 02 16:09:28 2015
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p013_4208.trc  (incident=3881):
ORA-00600: internal error code, arguments: [2037], [4259067], [4244307968], [159], [243], [0], [2162032704], [100728832], [], [], [], []
Slave exiting with ORA-1172 exception
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p009_4200.trc:
ORA-01172: recovery of thread 1 stuck at block 3613 of file 3
ORA-01151: use media recovery to recover block, restore backup if needed
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p001_4184.trc:
ORA-01172: recovery of thread 1 stuck at block 2512 of file 3
ORA-01151: use media recovery to recover block, restore backup if needed
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p021_4224.trc:
ORA-10388: parallel query server interrupt (failure)
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p021_4224.trc:
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_dbw2_4158.trc  (incident=3697):
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/oa/oa/incident/incdir_3697/oa_dbw2_4158_i3697.trc
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0xD2DDB7, kcbs_shrink_pool()+705] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_mman_4152.trc  (incident=3673):
ORA-07445: exception encountered: core dump [kcbs_shrink_pool()+705] [SIGSEGV] [ADDR:0x0] [PC:0xD2DDB7] [SI_KERNEL(general_protection)] []
Incident details in: /u01/app/oracle/diag/rdbms/oa/oa/incident/incdir_3673/oa_mman_4152_i3673.trc
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_dbw2_4158.trc:
Mon Mar 02 16:09:34 2015
Instance terminated by DBW2, pid = 4158

第二次重启后增加新错误ORA-00600[17182]

Mon Mar 02 16:39:50 2015
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_p002_4321.trc  (incident=4993):
ORA-00600: internal error code, arguments: [17182], [0x7F548C2BDBA8], [], [], [], [], [], [], [], [], [], []

进行了一些恢复处理后,日志中报错
主要体现在进行了不完全恢复,而且应该是对redo进行了重命名或者redo头损坏锁引起的一系列提示

Beginning crash recovery of 1 threads
Started redo scan
Completed redo scan
 read 962 KB redo, 256 data blocks need recovery
Started redo application at
 Thread 1: logseq 726, block 37343
Recovery of Online Redo Log: Thread 1 Group 3 Seq 726 Reading mem 0
  Mem# 0: /u01/app/oracle/oradata/oa/redo03.log
RECOVERY OF THREAD 1 STUCK AT BLOCK 1673 OF FILE 3
Aborting crash recovery due to error 1172
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_6644.trc:
ORA-01172: recovery of thread 1 stuck at block 1673 of file 3
ORA-01151: use media recovery to recover block, restore backup if needed
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_6644.trc:
ORA-01172: recovery of thread 1 stuck at block 1673 of file 3
ORA-01151: use media recovery to recover block, restore backup if needed
ORA-1172 signalled during: alter  database open...
Tue Mar 03 11:17:59 2015
Sweep [inc][17178]: completed
Sweep [inc][17177]: completed
Sweep [inc2][17178]: completed
Tue Mar 03 11:18:00 2015
ALTER DATABASE RECOVER  database until cancel
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 24 slaves
ORA-279 signalled during: ALTER DATABASE RECOVER  database until cancel  ...
ALTER DATABASE RECOVER    CONTINUE DEFAULT
Tue Mar 03 11:18:06 2015
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_pr00_6701.trc:
ORA-00266: name of archived log file needed
ORA-266 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
ALTER DATABASE RECOVER CANCEL
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_pr00_6701.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/u01/app/oracle/oradata/oa/system01.dbf'
Slave exiting with ORA-1547 exception
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_pr00_6701.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/u01/app/oracle/oradata/oa/system01.dbf'
ORA-10879 signalled during: ALTER DATABASE RECOVER CANCEL ...
Tue Mar 03 11:18:06 2015
Checker run found 4 new persistent data failures
Tue Mar 03 11:18:13 2015
alter database open resetlogs
RESETLOGS is being done without consistancy checks. This may result
in a corrupted database. The database should be recreated.
RESETLOGS after incomplete recovery UNTIL CHANGE 12986989
Resetting resetlogs activation ID 3278679642 (0xc36cae5a)
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_6644.trc:
ORA-00367: checksum error in log file header
ORA-00322: log 1 of thread 1 is not current copy
ORA-00312: online log 1 thread 1: '/u01/app/oracle/oradata/oa/redo01.log'
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_6644.trc:

再一步折腾,增加了_allow_resetlogs_corruption= TRUE之后数据库报ORA-600[2662]

Tue Mar 03 11:19:26 2015
SMON: enabling cache recovery
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_6864.trc  (incident=18195):
ORA-00600: internal error code, arguments: [2662], [0], [13007002], [0], [13016626], [4194545], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/oa/oa/incident/incdir_18195/oa_ora_6864_i18195.trc
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_6864.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [2662], [0], [13007002], [0], [13016626], [4194545], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
USER (ospid: 6864): terminating the instance due to error 704
Instance terminated by USER, pid = 6864
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (6864) as a result of ORA-1092
Tue Mar 03 11:19:29 2015
ORA-1092 : opitsk aborting process

进一步折腾,可以看出来undo已经被其offline,无法正常访问,导致系统报ORA-704和ORA-00376

Wed Mar 04 21:10:58 2015
SMON: enabling cache recovery
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_17074.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-00376: file 3 cannot be read at this time
ORA-01110: data file 3: '/u01/app/oracle/oradata/oa/undotbs01.dbf'
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_17074.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-00376: file 3 cannot be read at this time
ORA-01110: data file 3: '/u01/app/oracle/oradata/oa/undotbs01.dbf'
Error 704 happened during db open, shutting down database
USER (ospid: 17074): terminating the instance due to error 704
Instance terminated by USER, pid = 17074
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (17074) as a result of ORA-1092
Wed Mar 04 21:11:00 2015
ORA-1092 : opitsk aborting process

通过Oracle数据库异常恢复检查脚本(Oracle Database Recovery Check)检测结果见附件(xifenfei_db_recover_20150304),这里可以知道undo 不知道怎么折腾的数据文件scn较大而且还offline,
通过一些列方法(bbed,隐含参数等)调整数据库scn,强制启动数据库,报如下错误

Wed Mar 04 22:50:23 2015
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 3nkd3g3ju5ph1, SCN: 0x0000.4000003e):
select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_17807.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 10 with name "_SYSSMU10_3550978943$" too small
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_17807.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 10 with name "_SYSSMU10_3550978943$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 17807): terminating the instance due to error 704
Instance terminated by USER, pid = 17807
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (17807) as a result of ORA-1092

根据经验,该错误怀疑是文件头scn不够大,块延迟清理导致,进一步增加scn尝试,最后依旧是ORA-00704/ORA-00604/ORA-01555错误

Wed Mar 04 22:50:23 2015
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 3nkd3g3ju5ph1, SCN: 0x0000.4000003e):
select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_17807.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 10 with name "_SYSSMU10_3550978943$" too small
Errors in file /u01/app/oracle/diag/rdbms/oa/oa/trace/oa_ora_17807.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 10 with name "_SYSSMU10_3550978943$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 17807): terminating the instance due to error 704
Instance terminated by USER, pid = 17807
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (17807) as a result of ORA-1092

根据经验,在scn上做手脚估计难以解决给问题,对其启动过程做10046和errorstack分析发现

PARSING IN CURSOR #3 len=202 dep=2 uid=0 oct=3 lid=0 tim=1425481940448439 hv=3819099649 ad='64ff91af8' sqlid='3nkd3g3ju5ph1'
select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null
END OF STMT
PARSE #3:c=1000,e=334,p=0,cr=0,cu=0,mis=1,r=0,dep=2,og=4,plh=0,tim=1425481940448439
BINDS #3:
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5b3253a6f0  bln=22  avl=01  flg=05
  value=0
 Bind#1
  oacdty=01 mxl=32(06) mxlc=00 mal=00 scl=00 pre=00
  oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
  kxsbbbfp=7f5b3253a6b8  bln=32  avl=06  flg=05
  value="PROPS$"
 Bind#2
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5b3253a688  bln=24  avl=02  flg=05
  value=1
EXEC #3:c=0,e=640,p=0,cr=0,cu=0,mis=1,r=0,dep=2,og=4,plh=2853959010,tim=1425481940449147
WAIT #3: nam='db file sequential read' ela= 5 file#=1 block#=345 blocks=1 obj#=37 tim=1425481940449186
WAIT #3: nam='db file sequential read' ela= 4 file#=1 block#=44528 blocks=1 obj#=37 tim=1425481940449221
WAIT #3: nam='db file sequential read' ela= 3 file#=1 block#=5505 blocks=1 obj#=37 tim=1425481940449247
*** 2015-03-04 23:12:20.450
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0)
----- Error Stack Dump -----
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 10 with name "_SYSSMU10_3550978943$" too small
----- Current SQL Statement for this session (sql_id=g64r07v2jn8nq) -----
SELECT NULL FROM PROPS$ WHERE NAME='BOOTSTRAP_UPGRADE_ERROR'

这里可以发现是数据库在启动的过程中需要执行SELECT NULL FROM PROPS$ WHERE NAME=’BOOTSTRAP_UPGRADE_ERROR’语句,而该语句递归调用了select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null 语句。既然这样通过一些方法避免数据库启动之时查询SELECT NULL FROM PROPS$ WHERE NAME=’BOOTSTRAP_UPGRADE_ERROR’语句,果然数据库启动成功。

知识点补充
ORA-600 [4502] [a]

Arg [a] ITL entry with a lock count
Meaning: During ITL cleanout we clear all row locks but the ITL entry
	 still thinks there is an uncleared lock. Ie: ITL has a locked
	 row but there are no locked rows in the block

大体意思是数据库发现undo 的itl已经被清除,但是block中的itl依然存在,从而出现ORA-600[4502],引起该问题除bug外主要原因是坏块

ORA-600 [2037] [a] [b] {c} [d] [e] [f] [g]

Arg [a] Relative Data Block Address (RDBA) that the redo vector is for
Arg [b] The Block format
Arg {c} RDBA in the block itself
Arg [d] The block type
Arg [e] The sequence number
Arg [f] Flags, if set
Arg [g] The return value from the block head/tail checker.
DESCRIPTION:
  During recovery we are examining a block to ensure that it is not
  corrupt prior to applying any change vectors.
  The block has failed this check and this exception is raised

大体意思是在恢复过程中,正在检查的块,以确保它在应用任何变化向量之前不损坏。如果检查失败排除该异常ORA-600[2037],引起该问题除bug外主要原因是坏块

ORA-600 [kcbzpbuf_1],[a],[b]

Arg [a] Corruption reason
Arg [b] Calculate checksum flag
Corruption reason:
#define KCBH_GOOD    0                                     /* block is valid */
#define KCBH_ZERO    1             /* block header was entirely zero on disk */
#define KCBH_BROKEN  2      /* corruption could be from a partial disk write */
#define KCBH_CHKVAL  3               /* The check value for the block failed */
#define KCBH_CORRUPT 4     /* this is the wrong block or is not a data block */
#define KCBH_ZERONG  5               /* all zero block and it is not allowed */
Calculate checksum flag:
The possible values are 1 (Generate Checksum - db_block_checksum is enabled - default value)
                        0 (do not generate checksum - db_block_checksum=false)

kcbzpbuf_1是该错误的源码函数

ORA-600 [17182] [a] [b] {c} [d] [e]

DESCRIPTION:
  Oracle has detected that the magic number in a memory chunk header has been overwritten.
  This is a heap (in memory) corruption and there is no underlying data corruption.
  The error may occur in the one of the process specific heaps
  (the Call heap, PGA heap, or session heap) or in the shared heap (SGA).

ORACLE 发现在内存中重要的块头被重新,但是没有基础数据损坏,大部分和数据块或者内存损坏有关系.

ORA-600 [4552] [a] [b] {c} [d] [e]

DESCRIPTION:
  This assertion is raised because we are trying to unlock the rows in a
  block, but receive an incorrect block type.
  The second argument is the block type received.

ORACLE尝试对某行进行解锁但是接收到了不正确的数据块类型,Arg [b]是接收到的数据块类型

ORA-600 [2662] [a] [b] {c} [d] [e]

DESCRIPTION:
  A data block SCN is ahead of the current SCN.
  The ORA-600 [2662] occurs when an SCN is compared to the dependent SCN
  stored in a UGA variable.
  If the SCN is less than the dependent SCN then we signal the ORA-600 [2662]
  internal error.
ARGUMENTS:
  Arg [a]  Current SCN WRAP
  Arg [b]  Current SCN BASE
  Arg {c}  dependent SCN WRAP
  Arg [d]  dependent SCN BASE
  Arg [e]  Where present this is the DBA where the dependent SCN came from.

主要的含义就是oracle文件头scn比某个block dependent scn小从而出现该问题