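A Windows Oracle database could not be accessed after a restart. Investigation showed it was hitting Bug 4899479, for which Oracle has not published a complete fix. Based on my understanding of how the database starts up, I got it open again by bypassing roll-forward and rollback.
Database version and platform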
ORACLE:11.1.0.7 OS:WIN 2008 R2 X64
Errors when opening the database (alert log)
Tue Apr 16 12:36:31 2013
alter database open
Beginning crash recovery of 1 threads
parallel recovery started with 7 processes
Started redo scan
Completed redo scan
28878 redo blocks read, 7353 data blocks need recovery
Started redo application at
Thread 1: logseq 7960, block 14132
Recovery of Online Redo Log: Thread 1 Group 1 Seq 7960 Reading mem 0
  Mem# 0: D:\APP\SDWLJG-DB101\ORADATA\WLJG\REDO01.LOG
Tue Apr 16 12:36:32 2013
RECOVERY OF THREAD 1 STUCK AT BLOCK 915068 OF FILE 9
Hex dump of (file 9, block 1698691) in trace file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p001_1500.trc
Corrupt block relative dba: 0x0259eb83 (file 9, block 1698691)
Bad header found during crash/instance recovery
Data in bad block:
type: 0 format: 0 rdba: 0x0000a206
last change scn: 0x2359.0259eb83 seq: 0xf7 flg: 0x0b
spare1: 0x0 spare2: 0x0 spare3: 0x601
consistency value in tail: 0x02c10243
check value in block header: 0x0
block checksum disabled
Reread of rdba: 0x0259eb83 (file 9, block 1698691) found valid data
Slave exiting with ORA-1172 exception
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p001_1500.trc:
ORA-01172: recovery of thread 1 stuck at block 915068 of file 9
ORA-01151: use media recovery to recover block, restore backup if needed
Tue Apr 16 12:36:32 2013
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p003_4088.trc (incident=187558):
ORA-00600: internal error code, arguments: [2037], [12619645], [41474], [6], [1], [247], [12619645], [0], [], [], [], []
Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\incident\incdir_187558\wljg_p003_4088_i187558.trc
ORA-07445: exception encountered: core dump [kcbs_dump_adv_state()+1352] [ACCESS_VIOLATION] [ADDR:0xFFFFFFFFFFFFFFFF] [PC:0x16BFD20] [UNABLE_TO_READ] []
ORA-00600: internal error code, arguments: [2037], [12619645], [41474], [6], [1], [247], [12619645], [0], [], [], [], []
Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\incident\incdir_187559\wljg_p003_4088_i187559.trc
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p006_1216.trc (incident=187567):
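The addresses in the corrupt-block message can be decoded by hand. The Python sketch below assumes the usual smallfile-tablespace RDBA layout (top 10 bits = relative file number, low 22 bits = block number; bigfile tablespaces do not use this split) and treats an SCN printed as `wrap.base` as wrap * 2^32 + base. `decode_rdba` and `decode_scn` are hypothetical helpers written for illustration. Decoding 0x0259eb83 reproduces the log's own "file 9, block 1698691", while the rdba stored inside the "bad" block decodes to file 0, which does not point back to that block.

```python
# Decode block addresses and SCNs as printed in the alert log.
# ASSUMPTION: smallfile tablespace RDBA layout (10-bit file#, 22-bit block#).


def decode_rdba(rdba: int) -> tuple:
    """Split a relative data block address into (relative file#, block#)."""
    file_no = rdba >> 22          # high 10 bits
    block_no = rdba & 0x3FFFFF    # low 22 bits
    return file_no, block_no


def decode_scn(text: str) -> int:
    """Convert an SCN printed as 'wrap.base' (both hex) into one integer."""
    wrap_hex, base_hex = text.split(".")
    # int(..., 16) accepts an optional 0x prefix on the wrap part.
    return (int(wrap_hex, 16) << 32) + int(base_hex, 16)


if __name__ == "__main__":
    # Values copied from the alert log excerpt.
    print(decode_rdba(0x0259EB83))        # (9, 1698691): address of the block read
    print(decode_rdba(0x0000A206))        # (0, 41478): rdba stored in the "bad" header
    print(decode_scn("0x2359.0259eb83"))  # 38865198508931: its "last change scn"
```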
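The log reports a problem at file 9, block 915068, yet a dbv check of file 9 found no corrupt blocks. This agrees with the alert log's own "Reread of rdba: 0x0259eb83 (file 9, block 1698691) found valid data" line: the blocks on disk are intact.
Trace file contents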
Dump continued from file: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p003_4088.trc
ORA-00600: internal error code, arguments: [2037], [12620930], [41474], [2], [1], [247], [12619645], [0], [], [], [], []
** DBGRL Error: ARB Alert Log
** DBGRL Error: <msg time='2013-04-16T11:05:58.522+08:00' org_id='oracle' comp_id='rdbms' msg_id='dbgexProcessError:1097:3370026720' type='TRACE' level='16' host_id='SDWLSCJG-DB' host_addr='172.18.1.15'>
<txt>Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wlj
========= Dump for incident 129879 (ORA 600 [2037]) ========
*** 2013-04-16 11:05:58.522
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
calling                     call     entry                     argument values in hex
location                    type     point                     (? means dubious value)
--------------------------- -------- ------------------------- ---------------------------------------
ksedst1()+111               CALL???  skdstdst()+0              000000000 000000000 01CFC9B80 000000200
ksedst()+63                 CALL???  ksedst1()+0               000000005 021B00600 005D30C80 000002004
dbkedDefDump()+1012         CALL???  ksedst()+0                000000000 000000000 000000000 000000000
ksedmp()+51                 CALL???  dbkedDefDump()+0          000000003 000000002 021AF92C0 000405038
__PGOSF184_ksfdmp()+27      CALL???  ksedmp()+0                000000000 000000000 000000000 27F00000000
dbgexPhaseII()+266          CALL???  __PGOSF184_ksfdmp()+0     00000000D 0082FAE50 000000000 000000004
dbgexProcessError()+1313    CALL???  dbgexPhaseII()+0          021B00600 021AFCA50 000000201 000000000
dbgeExecuteForError()+55    CALL???  dbgexProcessError()+0     021B00600 021B07590 000000001 000000000
dbgePostErrorKGE()+1608     CALL???  dbgeExecuteForError()+0   021AFCA30 021AFCA80 00000002E 000000005
dbkePostKGE_kgsf()+65       CALL???  dbgePostErrorKGE()+0      01CFC99D0 021B0E080 000000258 021B0E080
kgeade()+556                CALL???  dbkePostKGE_kgsf()+0      000002000 000000000 000000009 000000004
kgeriv_int()+105            CALL???  kgeade()+0                3A4F00000003 000C09482 0FFFFFFFF 000000000
kgeriv()+27                 CALL???  kgeriv_int()+0            3A9A024E0 000000000 01CFC9410 000000000
kgesiv()+102                CALL???  kgeriv()+0                0000008D5 0000008C3 021AFD9A0 000AFDC73
ksesic7()+125               CALL???  kgesiv()+0                006371F20 000000007 27F912000 200000004
kcoexam()+248               CALL???  ksesic7()+0               2000007F5 000000000 000C09482 000000000
kcbtema()+2154              CALL???  kcoexam()+0               27FFC22C8 39E113470 3A940BBB8 000000000
kcrpap()+355                CALL???  kcbtema()+0               27FFC22C8 28BFC2628 000000000 021B10200
kcrpdv()+1655               CALL???  kcrpap()+0                021B101A0 000000002 000000004 000000512
kxfprdp()+1384              CALL???  kcrpdv()+0                3A7AD3098 000000000 00000000C 00757CF00
opirip()+1396               CALL???  kxfprdp()+0               00000001E 005CDB518 021AFF9E0 000000000
opidrv()+855                CALL???  opirip()+0                000000032 000000004 021AFFD30 000000000
sou2o()+52                  CALL???  opidrv()+213              000000032 000000004 021AFFD30 021AFFDB0
opimai_real()+295           CALL???  sou2o()+0                 000000000 7FEFD9819B5 000000000 000000000
opimai()+96                 CALL???  opimai_real()+0           000000000 000000000 000000000 000000000
BackgroundThreadStart()+695 CALL???  opimai()+0                021AFFE98 000000001 000000000 000000000
00000000775AF56D            CALL???  BackgroundThreadStart()+0 00A26B7A0 000000000 000000000 000000000
0000000077923281            CALL???  00000000775AF560          000000000 000000000 000000000 000000000
--------------------- Binary Stack Dump ---------------------
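Searching MOS turned up "During Startup (Open Database) Alert Log Shows ORA-600[2037] and ORA-7445[kcbs_dump_adv_state] [ID 551993.1]", which matches the errors shown here. The note gives the main cause as: "The database may crash and fail to open due to undo/redo corruption if you are using distributed transactions." In other words, when distributed transactions are in use, a crash can leave undo/redo corrupted, and the database then cannot open normally.
Recovery approach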
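MOS notes such as 551993.1 are keyed on the first ORA-600 argument ([2037] here), so matching a note starts with pulling the arguments out of the alert log and trace file. The Python sketch below does that for the two ORA-00600 lines quoted earlier; `parse_ora600` is a hypothetical helper written for this post, not an Oracle tool. Both occurrences share the first argument 2037, while the second and fourth arguments differ between them.

```python
import re
from typing import List, Optional, Tuple

# The part of an ORA-00600 message that carries the bracketed arguments.
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: (.*)")
# One bracketed argument, possibly empty, e.g. [2037] or [].
ARG_RE = re.compile(r"\[([^\]]*)\]")


def parse_ora600(line: str) -> Optional[Tuple[str, List[str]]]:
    """Return (first argument, remaining arguments) for an ORA-00600 line.

    Trailing empty [] slots are dropped. Returns None for other messages.
    """
    match = ORA600_RE.search(line)
    if match is None:
        return None
    args = ARG_RE.findall(match.group(1))
    while args and args[-1] == "":
        args.pop()
    if not args:
        return None
    return args[0], args[1:]


if __name__ == "__main__":
    alert_line = ("ORA-00600: internal error code, arguments: [2037], [12619645], "
                  "[41474], [6], [1], [247], [12619645], [0], [], [], [], []")
    trace_line = ("ORA-00600: internal error code, arguments: [2037], [12620930], "
                  "[41474], [2], [1], [247], [12619645], [0], [], [], [], []")
    for line in (alert_line, trace_line):
        print(parse_ora600(line))
```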
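The alert log shows that the parallel recovery slaves failed during roll-forward. I first set fast_start_parallel_rollback=false to try to stop the instance from recovering in parallel, but recovery still failed. (Strictly speaking, FAST_START_PARALLEL_ROLLBACK governs parallel rollback of dead transactions; the number of processes used for crash recovery is controlled by RECOVERY_PARALLELISM.) Since roll-forward could not get past this point, I set a hidden parameter to skip roll-forward altogether. Opening the database then failed with ORA-600 [2662], so I advanced the SCN; the next open attempt failed with ORA-600 [4194]. By changing the undo management mode, bypassing the affected transactions and taking the rollback segments out of use, I finally got the database open and recreated the undo tablespace. Rebuilding the database afterwards completed the recovery.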
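ORA-600 [2662] is generally described in MOS as a block's SCN being ahead of the database's current SCN, which is why advancing the SCN lets the open proceed. Assuming the commonly cited argument layout (current SCN wrap and base, then dependent SCN wrap and base; verify it against MOS for your release), the gap the current SCN must be pushed past can be computed as in this Python sketch. `Scn` and `scn_shortfall` are hypothetical helpers, and the example arguments are made up, since the post does not quote the actual [2662] error line.

```python
from typing import NamedTuple, Sequence


class Scn(NamedTuple):
    """An SCN split into its high-order wrap and low-order 32-bit base."""
    wrap: int
    base: int

    def as_int(self) -> int:
        return (self.wrap << 32) + self.base


def scn_shortfall(args: Sequence[int]) -> int:
    """How far the current SCN trails the dependent SCN in ORA-600 [2662].

    ASSUMED argument layout (check MOS for your release):
      args[0], args[1] = current SCN wrap, base
      args[2], args[3] = dependent SCN wrap, base
    Returns 0 when the current SCN is already at or past the dependent SCN.
    """
    current = Scn(args[0], args[1]).as_int()
    dependent = Scn(args[2], args[3]).as_int()
    return max(0, dependent - current)


if __name__ == "__main__":
    # Hypothetical arguments, not taken from this database.
    print(scn_shortfall([0, 1_000_000, 0, 1_250_000]))  # 250000
    print(scn_shortfall([0, 0xFFFFFFF0, 1, 0x10]))      # 32: the gap crosses a wrap
```

The wrap/base split matters in the second example: comparing the base values alone would suggest the current SCN is far ahead, when it is actually 32 behind.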