某集团ebs数据库redo undo丢失导致悲剧

联系:手机/微信(+86 17813235971) QQ(107644445)

标题:某集团ebs数据库redo undo丢失导致悲剧

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

某集团的ebs系统因磁盘空间不足把redo和undo存放到raid 0之上,而且该库无任何备份。最终悲剧发生了,raid 0异常导致redo undo全部丢失,数据库无法正常启动(我接手之时数据库已经resetlogs过,但是未成功)

Sun Jul 27 11:31:27 2014
SMON: enabling cache recovery
SMON: enabling tx recovery
Sun Jul 27 11:31:27 2014
Database Characterset is ZHS16GBK
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_454754.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 42 cannot be read at this time
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_454754.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 42 cannot be read at this time
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_454754.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 42 cannot be read at this time
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_663670.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 41 cannot be read at this time
ORA-01110: data file 41: '/prod/oracle/PROD/logdata/undo/undo2.dbf'
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
Instance terminated by USER, pid = 663670
ORA-1092 signalled during: ALTER DATABASE OPEN...

查询相关文件状态发现,undo表空间文件丢失,被offline处理
df_status
因为以前alert日志被清理,通过这里大概猜测是offline丢失的undo文件,然后resetlogs了数据库,现在处理方式为
使用_corrupted_rollback_segments屏蔽回滚段,然后尝试启动数据库

Tue Jul 29 11:40:39 2014
SMON: enabling cache recovery
SMON: enabling tx recovery
Tue Jul 29 11:40:39 2014
Database Characterset is ZHS16GBK
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_569378.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_569378.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_569378.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_585786.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
Instance terminated by USER, pid = 585786
ORA-1092 signalled during: alter database open...

该错误是由于数据库启动需要找到对应的回滚段,但是由于undo异常导致该回滚段无法找到,因此出现该错误,解决方法是通过修改数据scn,让其不找回滚段,从而屏蔽该错误.数据库启动后,删除undo重新创建新undo

Tue Jul 29 15:59:22 2014
drop tablespace undo2 including contents and datafiles
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01122: database file 41 failed verification check
ORA-01110: data file 41: '/prod/oracle/PROD/logdata/undo/undo2.dbf'
ORA-01565: error in identifying file '/prod/oracle/PROD/logdata/undo/undo2.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01259: unable to delete datafile /prod/oracle/PROD/logdata/undo/undo2.dbf
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01122: database file 42 failed verification check
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
ORA-01565: error in identifying file '/prod/oracle/PROD/logdata/undo/undo1.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
ORA-01259: unable to delete datafile /prod/oracle/PROD/logdata/undo/undo2.dbf
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01259: unable to delete datafile /prod/oracle/PROD/logdata/undo/undo1.dbf
Tue Jul 29 15:59:23 2014
Completed: drop tablespace undo2 including contents and datafiles
Tue Jul 29 15:59:56 2014
create undo tablespace undotbs1 datafile '/prod/oracle/PROD/logdata/undo_new01.dbf' size 100M autoextend on next 128M maxsize 30G
Tue Jul 29 15:59:57 2014
Completed: create undo tablespace undotbs1 datafile '/prod/oracle/PROD/logdata/undo_new01.dbf' size 100M autoextend on next 128M maxsize 30G
Tue Jul 29 16:00:03 2014
alter tablespace undotbs1 add datafile '/prod/oracle/PROD/logdata/undo_new02.dbf' size 100M autoextend on next 128M maxsize 30G
Completed: alter tablespace undotbs1 add datafile '/prod/oracle/PROD/logdata/undo_new02.dbf' size 100M autoextend on next 128M maxsize 30G

业务运行过程中,数据库报大量ORA-600 4097,ORA-600 kdsgrp1,ORA-600 kcfrbd_3错误

Tue Jul 29 16:07:03 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_950484.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:07:06 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_950484.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Tue Jul 29 16:10:06 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_917702.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:10:07 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_917702.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Tue Jul 29 16:12:45 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_m000_880692.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:21:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: 内部错误代码, 参数: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:21:37 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: 内部错误代码, 参数: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:21:56 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: 内部错误代码, 参数: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:22:18 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: 内部错误代码, 参数: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:22:28 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1105950.trc:
ORA-00600: 内部错误代码, 参数: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:22:33 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1159232.trc:
ORA-00600: 内部错误代码, 参数: [kcfrbd_3], [42], [61235], [1], [12800], [12800], [], []

出现该错误有几个原因和解决方法:
ORA-600 kdsgrp1 是因为相关坏块引起(tab,index,memory,cr block等),结合日志分析对象异常原因,根据具体情况确定对象然后选择合适处理方案(具体参考NOTE:1332252.1)
ORA-600 4097 由于数据库异常关闭然后open,创建回滚段,可能触发bug导致该问题(虽然说在当前版本修复,但是实际处理我确实按照NOTE:1030620.6解决)
ORA-600 kcfrbd_3 有事务的block被访问之后,根据回滚槽信息定位到相关回滚段,而正好新建的回滚段信息又和以前的名字编号一致,从而反馈出来是数据文件大小不够,从而出现该错误(具体参考NOTE:601798.1)
最终该数据库虽然恢复了,抢救了大量数据,但是对于ebs系统来说,丢失redo和undo数据的损失还是巨大的.再次温馨提示:数据库的redo,undo也很重要,数据库的备份更加重要

ORA-00600[kcfrbd_3]故障解决

联系:手机/微信(+86 17813235971) QQ(107644445)

标题:ORA-00600[kcfrbd_3]故障解决

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

朋友一数据库因为断电,被重建控制文件等操作折腾的最后出现在启动的时候出现ORA-00600[kcfrbd_3]

Wed Dec 05 10:26:34 2012
Thread 1 advanced to log sequence 11
Thread 1 opened at log sequence 11
  Current log# 1 seq# 11 mem# 0: E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\REDO01.LOG
Successful open of redo thread 1
Wed Dec 05 10:26:34 2012
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Dec 05 10:26:34 2012
SMON: enabling cache recovery
Wed Dec 05 10:26:35 2012
Successfully onlined Undo Tablespace 1.
Dictionary check beginning
Dictionary check complete
Wed Dec 05 10:26:35 2012
SMON: enabling tx recovery
Wed Dec 05 10:26:35 2012
Database Characterset is ZHS16GBK
Wed Dec 05 10:26:35 2012
Errors in file d:\oracle\product\10.2.0\admin\orcl\bdump\orcl_smon_548.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [2], [2279045], [1], [2277120], [2277120], [], []
replication_dependency_tracking turned off (no async multimaster replication found)
Wed Dec 05 10:26:36 2012
Fatal internal error happened while SMON was doing active transaction recovery.
Wed Dec 05 10:26:36 2012
Errors in file d:\oracle\product\10.2.0\admin\orcl\bdump\orcl_smon_548.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [2], [2279045], [1], [2277120], [2277120], [], []
SMON: terminating instance due to error 474

这个错误很明显:数据库已经open成功了,但是因为有事务不能正常被回滚,然后数据库的smon进程异常,从而使得数据库不能正常启动,解决该问题的方法也是很简单,就是常规的undo处理思路(使用人工undo管理,event屏蔽事务,隐含参数屏蔽回滚段),然后重建undo表空间,这个时候可以结合txchecker来检测是否有异常事务:如果有重要基表对象异常,需要重建库;如果是个别其他对象异常,可以通过重建该对象解决