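Recovering with unload table/user when DUL cannot run bootstrap

A friend recently caused a very serious incident through a mistaken operation and nearly ended up in court. During a database migration he remotely deleted several files created at the original database's initial installation, including the SYSTEM datafile. The disk was almost full and was still receiving a small amount of writes. The outcome was grim: the datafiles could not be recovered directly at the file system level, the database had no valid backup, and there was no record of table names or column names. The only option left was to rebuild the datafiles by reassembling I/O blocks at a low level. Then things went wrong again: the same disk had previously held another copy of the SYSTEM files. After several rounds of reassembly a reasonably good datafile was recovered. However, much of the data the third-party company recovered from this reassembled datafile and the undeleted business datafiles was wrong, so further analysis and recovery was still needed from us. This article describes how, when DUL cannot run the bootstrap command, commands such as unload table/user can still be used normally to recover as much data as possible. Consider how much work it would be to tell apart several hundred tables that have no table names or column names.
Configuring the SYSTEM file in DUL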

D:\xifenfei\system01.dbf
D:\TEMP\recover\dul\bak>dul
Data UnLoader: 11.2.0.0.4 - Internal Only - on Wed Sep 28 17:01:56 2016
with 64-bit io functions
Copyright (c) 1994 2016 Bernard van Duijnen All rights reserved.
 Strictly Oracle Internal Use Only
DUL> show datafiles;
Sorry, no valid data files found in control.txt

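With DUL's default datafile configuration, letting DUL discover the datafile on its own does not work.

Letting DUL identify the file with an arbitrary tablespace number and file number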

0 0 D:\xifenfei\system01.dbf
D:\TEMP\recover\dul\bak>dul
Data UnLoader: 11.2.0.0.4 - Internal Only - on Wed Sep 28 17:00:27 2016
with 64-bit io functions
Copyright (c) 1994 2016 Bernard van Duijnen All rights reserved.
 Strictly Oracle Internal Use Only
DUL: Warning: File Type mismatch 1 != 8
DUL: Warning: D:\xifenfei\system01.dbf Header tablespace number 3 != 0
DUL: Warning: D:\xifenfei\system01.dbf Header relative file number 1 != 0
Found db_id = 2948357999
Found db_name = XIFENFEI
DUL: Warning: Found mismatch while checking file D:\xifenfei\system01.dbf
DUL: Warning: DUL osd_parameter or control.dul configuration error
DUL: Warning: Given file number(0) in control file does not match file# in dba(1)

From these warnings we can tell that the SYSTEM file's header records tablespace number 3 and relative file number 1.
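
The probing step above can be scripted. The following is a minimal sketch, assuming the DUL warning text has been captured into a string; the helper names `header_ids_from_warnings` and `control_line` are hypothetical and are not part of DUL. It only extracts the two header values and formats the line for control.txt.

```python
import re

def header_ids_from_warnings(log_text):
    """Extract (ts#, rf#) as reported in DUL's header mismatch warnings."""
    ts = re.search(r"Header tablespace number\s+(\d+)", log_text)
    rf = re.search(r"Header relative file number\s+(\d+)", log_text)
    if not (ts and rf):
        raise ValueError("header warnings not found in DUL output")
    return int(ts.group(1)), int(rf.group(1))

def control_line(ts, rf, path):
    """Format one control.txt entry: tablespace number, file number, path."""
    return f"{ts} {rf} {path}"

# Warning text copied from the DUL session shown above.
sample = r"""DUL: Warning: D:\xifenfei\system01.dbf Header tablespace number 3
!= 0
DUL: Warning: D:\xifenfei\system01.dbf Header relative file number 1 != 0"""

ts, rf = header_ids_from_warnings(sample)
print(control_line(ts, rf, r"D:\xifenfei\system01.dbf"))
# prints: 3 1 D:\xifenfei\system01.dbf
```

Note that the values come from the file header itself, so a damaged header (as in this case) yields a line DUL accepts but that may still be wrong.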

Configuring SYSTEM again so that DUL recognizes it

3 1 D:\xifenfei\system01.dbf
D:\TEMP\recover\dul\bak>dul
Data UnLoader: 11.2.0.0.4 - Internal Only - on Wed Sep 28 17:03:46 2016
with 64-bit io functions
Copyright (c) 1994 2016 Bernard van Duijnen All rights reserved.
 Strictly Oracle Internal Use Only
DUL: Warning: File Type mismatch 1 != 8
Found db_id = 2948357999
Found db_name = XIFENFEI
DUL> show datafiles;
ts# rf# start   blocks offs open  err file name
  3   1     0   320257    0    1    0 D:\xifenfei\system01.dbf

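DUL now recognizes the SYSTEM file, but from experience we know that tablespace 3 must be wrong (the SYSTEM tablespace is ts# 0), so the following steps still run into many problems.

Attempting DUL bootstrap fails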

DUL> bootstrap;
Scanning SYSTEM tablespace to locate compatibility segment ...
DUL: Warning: No files found for tablespace 0
Reading EXT.dat 0 entries loaded and sorted 0 entries
Reading SEG.dat 0 entries loaded
Reading COMPATSEG.dat 0 entries loaded
Reading SCANNEDLOBPAGE.dat 0 entries loaded and sorted 0 entries
DUL: Error: No compatibility segments found

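Because the tablespace number is wrong, DUL cannot load the bootstrap$ table. In addition, bbed analysis shows that the bootstrap$ portion is missing from the recovered SYSTEM file.

Manually loading the data dictionary that DUL needs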

DUL> unload table OBJ$
  2     storage ( tablespace 3 segobjno 18 file 1 block 240);
. unloading table                      OBJ$   79074 rows unloaded
DUL> unload table TAB$( OBJ# number, DATAOBJ# number,
  2      cluster  C_OBJ#(OBJ#)
  3      storage ( tablespace 3 segobjno 2 tabno 1 file 1 block 144);
. unloading table                      TAB$    4482 rows unloaded
DUL> unload table COL$ ( OBJ# number, COL# number , SEGCOL# number,
  2      cluster C_OBJ#(OBJ#)
  3      storage ( tablespace 3 segobjno 2 tabno 5 file 1 block 144);
. unloading table                      COL$  114491 rows unloaded
DUL> unload table USER$
  2      cluster C_USER#(USER#)
  3      storage ( tablespace 3 segobjno 10 tabno 1 file 1 block 208);
. unloading table                     USER$      96 rows unloaded
---- Other tables omitted; process them in turn as needed

Attempting to recover data with DUL

DUL> desc portal_emr.BASEELEMENT;
Table PORTAL_EMR.BASEELEMENT
obj#= 87200, dataobj#= 87200, ts#= 9, file#= 7, block#=458
      tab#= 0, segcols= 8, clucols= 0
Column information:
icol# 01 segcol# 01       BENAME len   30 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 02 segcol# 02     TYPENAME len   30 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 03 segcol# 03     TYPETYPE len   22 type  2 NUMBER(0,0)
icol# 04 segcol# 04    BEXMLTEXT len 4000 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 05 segcol# 05 DEPTGROUPCODE len   30 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 06 segcol# 06     ISCOMMON len   22 type  2 NUMBER(0,0)
icol# 07 segcol# 07      BESPELL len   15 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 08 segcol# 08     ELEMTYPE len   22 type  2 NUMBER(0)
DUL> show datafiles;
ts# rf# start   blocks offs open  err file name
  3   1     0   320257    0    1    0 D:\xifenfei\system01.dbf
  9   7     0  4170425    0    1    0 D:\BaiduYunDownload\PORTAL_EMR
DUL> unload table portal_emr.BASEELEMENT;
. unloading table               BASEELEMENT    1913 rows unloaded

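This shows that when DUL cannot run the bootstrap command, loading the data dictionary manually restores normal unload table/user functionality. It avoids the usual fallback of scan followed by unloading tables that have no table or column names, and so maximizes what can be recovered.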

ASM disk header definition explained through kfed

Explanation of the main fields kfed reads from a data disk header

   % kfed read /dev/raw/raw1
   kfbh.endian:                          1 ; 0x000: 0x01
   kfbh.hard:                          130 ; 0x001: 0x82
   kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
   kfbh.datfmt:                          1 ; 0x003: 0x01
   kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
   kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
   kfbh.check:                  2932902794 ; 0x00c: 0xaed08b8a
   kfbh.fcn.base:                        0 ; 0x010: 0x00000000
   kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
   kfbh.spare1:                          0 ; 0x018: 0x00000000
   kfbh.spare2:                          0 ; 0x01c: 0x00000000
   kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
   kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000
   kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000
   kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
   kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
   kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
   kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
   kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
   kfdhdb.dsknum:                        0 ; 0x024: 0x0000
   kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
   kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
   kfdhdb.dskname:              ASM01_0000 ; 0x028: length=10
   kfdhdb.grpname:                   ASM01 ; 0x048: length=5
   kfdhdb.fgname:               ASM01_0000 ; 0x068: length=10
   kfdhdb.capname:                         ; 0x088: length=0
   kfdhdb.crestmp.hi:             32837774 ; 0x0a8: HOUR=0xe DAYS=0x4 MNTH=0x4 YEAR=0x7d4
   kfdhdb.crestmp.lo:           1555722240 ; 0x0ac: USEC=0x0 MSEC=0x29c SECS=0xb MINS=0x17
   kfdhdb.mntstmp.hi:             32837774 ; 0x0b0: HOUR=0xe DAYS=0x4 MNTH=0x4 YEAR=0x7d4
   kfdhdb.mntstmp.lo:           1563864064 ; 0x0b4: USEC=0x0 MSEC=0x1ab SECS=0x13 MINS=0x17
   kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
   kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
   kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
   kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
   kfdhdb.dsksize:                    9075 ; 0x0c4: 0x00002373
   kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002
   kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
   kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
   kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
   kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
   kfdhdb.redomirrors[1]:                0 ; 0x0da: 0x0000
   kfdhdb.redomirrors[2]:                0 ; 0x0dc: 0x0000
   kfdhdb.redomirrors[3]:                0 ; 0x0de: 0x0000
   kfdhdb.ub4spare[0]:                   0 ; 0x0e0: 0x00000000
   ...
   kfdhdb.ub4spare[60]:                  0 ; 0x1d0: 0x00000000
   kfdhdb.acdb.aba.seq:                  0 ; 0x1d4: 0x00000000
   kfdhdb.acdb.aba.blk:                  0 ; 0x1d8: 0x00000000
   kfdhdb.acdb.ents:                     0 ; 0x1dc: 0x0000
   kfdhdb.acdb.ub2spare:                 0 ; 0x1de: 0x0000
  Breakdown:
   kfbh.endian
     kf3.h   /* endianness of writer */
       Little endian = 1
       Big endian = 0
   kfbh.hard
     kf3.h   /* H.A.R.D. magic # and block size */
  kfbh.type
    kf3.h    /* metadata block type               */
  kfbh.datfmt
    kf3.h   /* metadata block data format        */
  kfbh.block
    kf3.h   /* block location of this block      */
      blk -- Disk header should have T=0 and NUMB=0x0
      obj -- Disk header should have TYPE=0x8 NUMB=<disknumber>
    blk and obj values are derived from a series of macros in kf3.h.  See
    "KFBL Macros" in kf3.h for more information.
  kfbh.check
    kf3.h   /* check value to verify consistency */
  kfbh.fcn
    kf3.h   /* change number of last change      */
  kfdhdb.driver
    kf3.h   /* OSMLIB driver reserved block  */
       If no driver is defined "ORCLDISK" is used.
  kfdhdb.compat
    kf3.h   /* Compatible software version   */
      example: 0x0a100000
      You get:
          a=10 1=1 so 10.1.0.0.0
  kfdhdb.dsknum
    kf3.h   /* OSM disk number               */
      This is the disk number.  The first disk being "0".  There can be up to
      ub2 disks in a diskgroup.  This allows for 65536 disks 0 through 65535.
  kfdhdb.grptyp
    kf3.h   /* Disk group type               */
  kfdhdb.hdrsts
    kf3.h   /* Disk header status            */
      This is what is used to determine if a disk is available or not to
      the diskgroup.  0x03 is the correct value for a valid status.
  kfdhdb.dskname   /* OSM disk name       */
  kfdhdb.grpname   /* OSM disk group name */
  kfdhdb.fgname    /* Failure group name  */
  kfdhdb.capname   /* Capacity grp, unused*/
    kf3.h
  kfdhdb.crestmp   /* Creation timestamp            */
  kfdhdb.mntstmp   /* Mount timestamp               */
    kf3.h To derive the hi and low time from an unformatted dump use the
    "KFTS Macros" in kf3.h.
  kfdhdb.secsize
    kf3.h   /* Disk sector size (bytes)      */
      This is the physical sector size of the disk in bytes. All I/O's to the
      disk are described in physical sectors. This must be a power of 2. An
      ideal value would be 4096, but most disks are formatted with 512 byte
      sectors. (from asmlib.h)
  kfdhdb.blksize
    kf3.h   /* Metadata block (bytes)        */
  kfdhdb.ausize
    kf3.h   /* Allocation Unit (bytes)       */
  kfdhdb.mfact
    kf3.h   /* Stride between phys addr AUs  */
  kfdhdb.dsksize
    kf3.h   /* Disk size in AUs              */
      Multiply by the AU size to get actual size of disk when added.
  kfdhdb.pmcnt
    kf3.h   /* Permanent phys addressed AUs  */
      Number of physically addressed allocation units.
  kfdhdb.fstlocn
    kf3.h   /* First FreeSpace table blk num */
      Used to find freespace.
  kfdhdb.altlocn
    kf3.h   /* First Allocation table blk num */
      Used to find allocated space.
  kfdhdb.f1b1locn
    kf3.h   /* File Directory blk 1 AU num   */
      Beginning of the file directory.
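
Several of the fields above can be decoded by hand. The sketch below is illustrative only: the bit layouts are inferred from the decoded values kfed prints in the dump above (they are not taken from kf3.h), and the function names are hypothetical. For kfdhdb.compat only the major and minor digits are decoded, since those are the only ones the example in the breakdown demonstrates.

```python
def decode_compat(value):
    """Major/minor version from kfdhdb.compat (0x0a100000 -> 10.1)."""
    major = (value >> 24) & 0xFF   # top byte: 0x0a = 10
    minor = (value >> 20) & 0xF    # next nibble: 0x1 = 1
    return major, minor

def decode_kfts(hi, lo):
    """Split a crestmp/mntstmp pair into fields (layout inferred from kfed output)."""
    return {
        "year": hi >> 14,
        "month": (hi >> 10) & 0xF,
        "day": (hi >> 5) & 0x1F,
        "hour": hi & 0x1F,
        "minute": (lo >> 26) & 0x3F,
        "second": (lo >> 20) & 0x3F,
        "msec": (lo >> 10) & 0x3FF,
        "usec": lo & 0x3FF,
    }

def disk_bytes(dsksize_aus, ausize):
    """Disk size in bytes: kfdhdb.dsksize (in AUs) times kfdhdb.ausize."""
    return dsksize_aus * ausize

print(decode_compat(168820736))            # (10, 1)
print(decode_kfts(32837774, 1555722240))   # year 2004, month 4, day 4, 14:23:11
print(disk_bytes(9075, 1048576))           # 9515827200
```

The timestamp result agrees with the YEAR=0x7d4 MNTH=0x4 DAYS=0x4 HOUR=0xe MINS=0x17 SECS=0xb annotations in the kfed output, which is the only check this sketch relies on.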

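Making obj$.obj# and obj$.dataobj# jump ahead by updating _NEXT_OBJECT

In some special cases (for example ORA-00600 [15267], ORA-00600 [KKDLCOB-OBJN-EXISTS], ORA-600 [15260]) you may need to push object_id in dba_objects forward. This post works out how to do that by experiment.
Database version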

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

Analyzing obj# and dataobj#

SQL> select max(obj#),max(dataobj#) from obj$;
 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51887         51907
SQL> select name from obj$ where obj#=51887;
NAME
------------------------------
T_DUL
SQL> select name from obj$ where dataobj#=51907;
NAME
------------------------------
_NEXT_OBJECT
SQL> select object_id,data_object_id from dba_objects where object_name='_NEXT_OBJECT';
no rows selected

Why is _NEXT_OBJECT missing from dba_objects?
Because the dba_objects view skips the _NEXT_OBJECT record.


Testing how obj# and dataobj# change after creating a new table

SQL>  create table t_xff as select * from dual;
Table created.
SQL> select max(obj#),max(dataobj#) from obj$;
 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898         51907
SQL> select name from obj$ where obj#=51898;
NAME
------------------------------
T_XFF
SQL> select max(object_id),max(data_object_id) from dba_objects where object_name='T_XFF';
MAX(OBJECT_ID) MAX(DATA_OBJECT_ID)
-------------- -------------------
         51898               51898

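The test confirms that obj# increases but dataobj# does not necessarily increase (dataobj# is already larger than obj#; obj# > dataobj# would be an abnormal condition).

Testing whether a database restart makes obj# and dataobj# jump

--- Normal restart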
SQL> SHUTDOWN IMMEDIATE;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> STARTUP
ORACLE instance started.
Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL> select max(obj#),max(dataobj#) from obj$;
 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898         51907
--- Forced restart
SQL> shutdown abort
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL> select max(obj#),max(dataobj#) from obj$;
 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898         51907

This proves that obj# and dataobj# do not change because of a database restart.

Making obj# jump

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup restrict
ORACLE instance started.
Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL>  update obj$ set dataobj#=1000000 where name='_NEXT_OBJECT';
1 row updated.
SQL> commit;
Commit complete.
SQL> shutdown abort;
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL> select max(obj#),max(dataobj#) from obj$;
 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898       1000000
SQL> create table t_www_xifenfei_com as select * from dual;
Table created.
SQL> select max(obj#),max(dataobj#) from obj$;
 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
   1000000       1000010
SQL> select max(object_id),max(data_object_id) from dba_objects;
MAX(OBJECT_ID) MAX(DATA_OBJECT_ID)
-------------- -------------------
       1000000             1000000
SQL> select object_name from dba_objects where object_id=1000000;
OBJECT_NAME
----------------------------------------------------------------
T_WWW_XIFENFEI_COM

Updating the _NEXT_OBJECT row makes obj# and dataobj# jump ahead (to 1,000,000).

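Database crash caused by alter system set events

I/O waits were high while importing a large amount of data into the database, and the new storage could not be attached directly. We therefore considered mounting it over NFS and placing the redo logs there to relieve the I/O pressure.
Database version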

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Productio
NLSRTL Version 10.2.0.4.0 - Production

Mount options (from the mount command)

10.240.10.1 /top/data4/nfs   /back1            nfs3
Aug 29 13:40 cio,rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,vers=3,timeo=600

Attempting to create the redo log

SQL> alter database add logfile group 13 ('/back1/newxff/redo13.log') size 2048m;
alter database add logfile group 13 ('/back1/newxff/redo13.log') size 2048m
*
ERROR at line 1:
ORA-00301: error in adding log file '/back1/newxff/redo13.log' - file cannot be
created
ORA-27054: NFS file system where the file is created or resides is not mounted
with correct options
Additional information: 6

Following the MOS note
ORA-27054 ERRORS WHEN RUNNING RMAN WITH NFS (Doc ID 387700.1)

SQL> Alter system set events '10298 trace name context forever,level 32';
System altered.
Mon Sep  5 10:10:18 2016
Thread 1 advanced to log sequence 109 (LGWR switch)
  Current log# 1 seq# 109 mem# 0: +DATA/xff/onlinelog/group_1.257.921671023
Mon Sep  5 10:12:19 2016
OS Pid: 160710 executed alter system set events '10298 trace name context forever,level 32'

Redo log created successfully

SQL> alter database add logfile group 13 ('/back1/newxff/redo13.log') size 2048m;
System altered.
Mon Sep  5 10:18:13 2016
alter database add logfile group 13 ('/back1/newxff/redo13.log') size 2048m
Mon Sep  5 10:18:43 2016
Completed: alter database add logfile group 13 ('/back1/newxff/redo13.log') size 2048m

Database crash

Mon Sep  5 10:19:06 2016
Errors in file /opt/oracle/admin/xff/bdump/xff1_lgwr_246566.trc:
ORA-00313: open failed for members of log group 13 of thread 1
ORA-00312: online log 13 thread 1: '/back1/newxff/redo13.log'
ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
Additional information: 6
Mon Sep  5 10:19:06 2016
Errors in file /opt/oracle/admin/xff/bdump/xff1_lgwr_246566.trc:
ORA-00313: open failed for members of log group 13 of thread 1
ORA-00312: online log 13 thread 1: '/back1/newxff/redo13.log'
ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
Additional information: 6
Mon Sep  5 10:19:06 2016
LGWR: terminating instance due to error 313
Mon Sep  5 10:19:06 2016
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/xff/bdump/xff1_diag_299654.trc

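The errors make it clear that the database went down for the same reason the redo log could not be created at first: ORA-27054. But why did creating the redo log succeed while using it failed?
The point to note is that the command used was alter system set events, which takes effect for the current session and for sessions created afterwards. It does not affect background processes that already exist. That explains the behavior: the redo log was created in the same command-line session that ran the events command, so creation succeeded. LGWR, however, has existed since instance startup, so the event had no effect on it; when LGWR tried to use the new redo log it failed, and the database crashed. To make an event take effect for existing processes, consider setting it per process with oradebug (in this case several background processes access redo, not only LGWR), or set it with the event= parameter and restart the database so that it takes effect.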

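Recovering from ORA-15042 at ASM startup caused by disks missing from rdisk on HP-UX

An old friend contacted me about a customer's database problem: ASM could not mount and reported that two disks were missing. He asked whether it could be recovered. Because it was an internal network environment, I worked from the scattered pieces of information he sent over; the analysis was roughly as follows.
ASM alert log errors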
ERROR: diskgroup DGROUP1 was not mounted

Fri Aug 12 16:03:12 EAT 2016
SQL> alter diskgroup DGROUP1 mount
Fri Aug 12 16:03:12 EAT 2016
NOTE: cache registered group DGROUP1 number=1 incarn=0xf6781b5c
Fri Aug 12 16:03:12 EAT 2016
NOTE: Hbeat: instance first (grp 1)
Fri Aug 12 16:03:16 EAT 2016
NOTE: start heartbeating (grp 1)
Fri Aug 12 16:03:16 EAT 2016
NOTE: cache dismounting group 1/0xF6781B5C (DGROUP1)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGROUP1 was not mounted

Mounting the ASM disk group from the foreground fails with ORA-15042


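This clearly shows that the ASM disk group cannot mount because ASM disks 15 and 16 are missing. The best way to recover ASM is to find those two disks. Analyzing the existing disks with kfed, we found that ASM disk 14 corresponds to disk160 and ASM disk 17 to disk163, so the first guess was that disk161 and disk162 were the problem. The data center checked the hardware and found no alerts.

OS-level analysis
Records unrelated to the conclusion are omitted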

ls -l /dev/rdisk
crw-rw----   1 oracle     dba         13 0x000070 Jan  1  2016 disk160
crw-rw----   1 oracle     dba         13 0x000073 Jan  1  2016 disk163
ls -l /dev/disk
brw-r-----   1 bin        sys          1 0x000070 Jan 13  2015 disk160
brw-r-----   1 bin        sys          1 0x000071 Jan 13  2015 disk161
brw-r-----   1 bin        sys          1 0x000072 Jan 13  2015 disk162
brw-r-----   1 bin        sys          1 0x000073 Jan 13  2015 disk163

Here we find that on HP-UX the disks all exist under /dev/disk, but some are missing under /dev/rdisk. We continue the analysis with ioscan.
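
The comparison above is a set difference between the two directory listings. A minimal sketch, assuming the `ls -l` output of both directories has been saved as text; `device_names` and `missing_raw_devices` are hypothetical helper names:

```python
def device_names(ls_output):
    """Return the device name (last field) of each non-empty `ls -l` line."""
    return [line.split()[-1] for line in ls_output.splitlines() if line.strip()]

def missing_raw_devices(block_names, raw_names):
    """Names present under /dev/disk but absent under /dev/rdisk."""
    return sorted(set(block_names) - set(raw_names))

# Listings copied from the output shown above.
rdisk = """crw-rw----   1 oracle     dba         13 0x000070 Jan  1  2016 disk160
crw-rw----   1 oracle     dba         13 0x000073 Jan  1  2016 disk163"""
disk = """brw-r-----   1 bin        sys          1 0x000070 Jan 13  2015 disk160
brw-r-----   1 bin        sys          1 0x000071 Jan 13  2015 disk161
brw-r-----   1 bin        sys          1 0x000072 Jan 13  2015 disk162
brw-r-----   1 bin        sys          1 0x000073 Jan 13  2015 disk163"""

print(missing_raw_devices(device_names(disk), device_names(rdisk)))
# prints: ['disk161', 'disk162']
```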

ioscan -fNnkC disk
disk    160  64000/0xfa00/0x70  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk160   /dev/rdisk/disk160
disk    161  64000/0xfa00/0x71  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk161
disk    162  64000/0xfa00/0x72  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk162
disk    163  64000/0xfa00/0x73  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk163   /dev/rdisk/disk163

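At this point we can be fairly sure that devices under /dev/rdisk have gone missing. Going one step further: since the rdisk entries are aggregated device names, we check whether the device names from before aggregation are normal.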

ioscan -m dsf
/dev/rdisk/disk160       /dev/rdsk/c29t12d4
                         /dev/rdsk/c28t12d4
/dev/rdisk/disk163       /dev/rdsk/c29t12d7
                         /dev/rdsk/c28t12d7
ls -l /dev/rdsk
crw-r-----   1 bin        sys        188 0x1dc000 Apr 22  2014 c29t12d0
crw-r-----   1 bin        sys        188 0x1dc100 Apr 22  2014 c29t12d1
crw-r-----   1 bin        sys        188 0x1dc300 Jan 13  2015 c29t12d3
crw-r-----   1 bin        sys        188 0x1dc400 Jan 13  2015 c29t12d4
crw-r-----   1 bin        sys        188 0x1dc500 Jan 13  2015 c29t12d5
crw-r-----   1 bin        sys        188 0x1dc600 Jan 13  2015 c29t12d6
crw-r-----   1 bin        sys        188 0x1dc700 Jan 13  2015 c29t12d7
crw-r-----   1 bin        sys        188 0x1cc100 Apr 22  2014 c28t12d1
crw-r-----   1 bin        sys        188 0x1cc300 Jan 13  2015 c28t12d3
crw-r-----   1 bin        sys        188 0x1cc400 Jan 13  2015 c28t12d4
crw-r-----   1 bin        sys        188 0x1cc500 Jan 13  2015 c28t12d5
crw-r-----   1 bin        sys        188 0x1cc600 Jan 13  2015 c28t12d6
crw-r-----   1 bin        sys        188 0x1cc700 Jan 13  2015 c28t12d7

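From this we can roughly conclude that /dev/rdsk/c28t12d5, /dev/rdsk/c28t12d6, /dev/rdsk/c29t12d5 and /dev/rdsk/c29t12d6 are the pre-aggregation device names of the /dev/rdisk/disk161 and disk162 we are looking for. In other words, our judgment is that only the character devices under /dev/rdisk have a problem; everything else is normal.

Repairing the problem with system commands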

insf -e -H 64000/0xfa00/0x71
insf -e -H 64000/0xfa00/0x72



/dev/rdisk/disk161 and /dev/rdisk/disk162 are now visible, so the device files appear to be back to normal at the OS level. Next, fix the disk permissions and ownership.

chmod 660 /dev/rdisk/disk161
chmod 660 /dev/rdisk/disk162
chown oracle:dba /dev/rdisk/disk161
chown oracle:dba /dev/rdisk/disk162

Start ASM normally, mount the disk group, and open the database


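This recovery was mainly a matter of diagnosing and fixing the problem at the operating system level, which gave a complete database recovery with zero data loss. A similar recovery case: an unrecognized partition preventing an ASM diskgroup from mounting.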

Handling ORA-600 kcbz_check_objd_typ_1

A customer's database hit ORA-600 kcbz_check_objd_typ_1 and they asked us to analyze and handle it remotely.
The ORA-600 kcbz_check_objd_typ_1 errors

Mon Aug  8 12:19:28 2016
Completed: ALTER DATABASE OPEN
Mon Aug  8 12:19:29 2016
db_recovery_file_dest_size of 20480 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Mon Aug  8 12:19:33 2016
Errors in file /home/oracle/admin/RT/bdump/rt_smon_1514.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ_1], [0], [0], [1], [], [], [], []
Mon Aug  8 12:20:21 2016
Shutting down archive processes
Mon Aug  8 12:20:26 2016
ARCH shutting down
ARC3: Archival stopped
Mon Aug  8 13:12:25 2016
Thread 1 advanced to log sequence 13804
  Current log# 3 seq# 13804 mem# 0: /home/oracle/product/10.2.0/oradata/RT/redo03a.log
Mon Aug  8 14:01:37 2016
Thread 1 advanced to log sequence 13805
  Current log# 2 seq# 13805 mem# 0: /home/oracle/product/10.2.0/oradata/RT/redo02a.log
Mon Aug  8 14:20:51 2016
Errors in file /home/oracle/admin/RT/bdump/rt_smon_1514.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ_1], [0], [0], [1], [], [], [], []
Mon Aug  8 15:54:47 2016
Thread 1 advanced to log sequence 13808
  Current log# 2 seq# 13808 mem# 0: /home/oracle/product/10.2.0/oradata/RT/redo02a.log
Mon Aug  8 16:21:48 2016
Errors in file /home/oracle/admin/RT/bdump/rt_smon_1514.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ_1], [0], [0], [1], [], [], [], []
Mon Aug  8 16:22:05 2016
Errors in file /home/oracle/admin/RT/bdump/rt_pmon_1500.trc:
ORA-00474: SMON process terminated with error

This is fairly clear: after the database reports ORA-600 kcbz_check_objd_typ_1 repeatedly, the SMON process terminates and the instance crashes.

SMON trace file

Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /home/oracle/product/10.2.0/db_1
System name:	SunOS
Node name:	st104
Release:	5.10
Version:	Generic_141445-09
Machine:	i86pc
Instance name: RT
Redo thread mounted by this instance: 1
Oracle process number: 12
Unix process pid: 1514, image: oracle@st104 (SMON)
*** 2016-08-08 12:19:26.868
*** SERVICE NAME:() 2016-08-08 12:19:26.868
*** SESSION ID:(383.1) 2016-08-08 12:19:26.868
Dead transaction 0x003d.002.0000f964 recovered by SMON
Dead transaction 0x0041.017.00004d55 recovered by SMON
Dead transaction 0x0047.002.0000180c recovered by SMON
Dead transaction 0x004c.01c.00001b57 recovered by SMON
*** SESSION ID:(383.1) 2016-08-08 12:19:27.470
DATA seg.obj=0, on-disk obj=925949, dsflg=0, dsobj=923715, cls=4
Formatted dump of block:
buffer tsn: 4 rdba: 0x0100336b (4/13163)
scn: 0x09c6.b2c7f7a2 seq: 0x02 flg: 0x04 tail: 0xf7a20602
frmt: 0x02 chkval: 0x649b type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
*** SESSION ID:(383.1) 2016-08-08 12:19:34.244
DATA seg.obj=0, on-disk obj=925950, dsflg=0, dsobj=923671, cls=4
Formatted dump of block:
buffer tsn: 4 rdba: 0x01003343 (4/13123)
scn: 0x09c6.b2c7f7dc seq: 0x02 flg: 0x04 tail: 0xf7dc0602
frmt: 0x02 chkval: 0x8013 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
*** SESSION ID:(383.1) 2016-08-08 12:19:35.197
DATA seg.obj=0, on-disk obj=925941, dsflg=0, dsobj=923657, cls=4
Formatted dump of block:
buffer tsn: 7 rdba: 0x01c03d53 (7/15699)
scn: 0x09c6.b2c7f570 seq: 0x02 flg: 0x04 tail: 0xf5700602
frmt: 0x02 chkval: 0xe5c5 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
*** SESSION ID:(383.1) 2016-08-08 12:19:38.965
DATA seg.obj=0, on-disk obj=925948, dsflg=0, dsobj=923656, cls=4
Formatted dump of block:
buffer tsn: 7 rdba: 0x01c03a6b (7/14955)
scn: 0x09c6.b2c7f745 seq: 0x02 flg: 0x04 tail: 0xf7450602
frmt: 0x02 chkval: 0x58c5 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1

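This shows blocks whose on-disk obj does not match the expected dataobj (dsobj in the trace).

Querying seg$ for type# = 3
type# = 3 marks temporary segments. Because of this inconsistency SMON cannot finish cleaning up the temp segments, so SMON terminates and the instance crashes.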

SQL> select file#, block#, ts# from seg$ where type# = 3;
     FILE#     BLOCK#        TS#
---------- ---------- ----------
         4      13163          4
         4      13123          4
         7      15699          7
         7      14955          7
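
The file#/block# pairs returned by this query are the same blocks that appear as rdba values in the SMON trace. A minimal sketch of the decoding, assuming the usual smallfile layout in which the top 10 bits of a relative DBA hold the relative file number and the low 22 bits hold the block number; `decode_rdba` is a hypothetical helper name:

```python
def decode_rdba(rdba):
    """Split a relative data block address into (relative file#, block#)."""
    return rdba >> 22, rdba & 0x3FFFFF

# rdba values copied from the SMON trace above.
for rdba in (0x0100336B, 0x01003343, 0x01C03D53, 0x01C03A6B):
    print(hex(rdba), decode_rdba(rdba))
# 0x100336b (4, 13163)
# 0x1003343 (4, 13123)
# 0x1c03d53 (7, 15699)
# 0x1c03a6b (7, 14955)
```

The decoded pairs agree with the (file/block) values the trace prints next to each rdba and with the four seg$ rows.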

How to handle ORA-600 kcbz_check_objd_typ_1

1) Check tablespace bitmap
SQL> oradebug setmypid
SQL> exec dbms_space_admin.tablespace_verify('&TBSP_NAME')
SQL> oradebug tracefile_name
or if the tablespace involved is an ASSM tablespace:
SQL> oradebug setmypid
SQL> exec dbms_space_admin.assm_tablespace_verify ('&TBSP_NAME',dbms_space_admin.TS_VERIFY_BITMAPS)
SQL> oradebug tracefile_name
This verification is expected to fail.
2) Corrupt these temp segments
SQL> exec dbms_space_admin.segment_corrupt('&TBSP_NAME', &FILE#, &BLOCK#)
3) Drop them
SQL> exec dbms_space_admin.segment_drop_corrupt('&TBSP_NAME', &FILE#, &BLOCK#)
4) Rebuild tablespace bitmap
SQL> exec dbms_space_admin.tablespace_rebuild_bitmaps('&TBSP_NAME')
5) Verify the tablespace again
SQL> oradebug setmypid
SQL> exec dbms_space_admin.tablespace_verify('&TBSP_NAME')
SQL> oradebug tracefile_name
or if the tablespace involved is an ASSM tablespace:
SQL> oradebug setmypid
SQL> exec dbms_space_admin.assm_tablespace_verify('&TBSP_NAME',dbms_space_admin.TS_VERIFY_BITMAPS)
SQL> oradebug tracefile_name

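Using udev on Linux 7 (Red Hat, Oracle Linux, CentOS)

More and more people are using Linux 7, but compared with versions 5 and 6 it has changed considerably. This post describes how to use udev on Linux 7 to get persistent device names and to change device permissions and ownership.
Linux version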

Oracle Linux Server release 7.1
[root@www.xifenfei.com ~]# uname -a
Linux www.xifenfei.com 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 04:05:24 PST 2015 x86_64 x86_64 x86_64 GNU/Linux

To expose the disk UUID in VMware Workstation, add the following to the vmx file

disk.enableUUID = "TRUE"

Viewing the disk partitions

[root@www.xifenfei.com ~]# fdisk -l
Disk /dev/sdb: 21.5 GB, 21474836480 bytes, 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xf60fe217
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     2099199     1048576   83  Linux
Disk /dev/sda: 42.9 GB, 42949672960 bytes, 83886080 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000bce7c
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     4204543     2101248   8e  Linux LVM
/dev/sda2   *     4204544    79702015    37748736   83  Linux
Disk /dev/sdc: 32.2 GB, 32212254720 bytes, 62914560 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/mapper/ol-swap: 2147 MB, 2147483648 bytes, 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Viewing the disk UUIDs

[root@www.xifenfei.com ~]# /usr/lib/udev/scsi_id -g -u /dev/sdb1
36000c29e91831cedbe69afe6cc08daf7
[root@www.xifenfei.com ~]# /usr/lib/udev/scsi_id -g -u /dev/sdc
36000c292495e9d9de6f21640cc7b53b9

udev binding

[root@www.xifenfei.com ~]# more /etc/udev/rules.d/99-my-asmdevices.rules
KERNEL=="sd*[!0-9]", ENV{DEVTYPE}=="disk", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d $devnode",
 RESULT=="36000c292495e9d9de6f21640cc7b53b9", RUN+="/bin/sh -c 'mknod /dev/xifenfei-sdc b $major $minor;
chown oracle:dba /dev/xifenfei-sdc; chmod 0660 /dev/xifenfei-sdc'"
KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id -g -u -d /dev/$parent",
RESULT=="36000c29e91831cedbe69afe6cc08daf7", SYMLINK+="xifenfei-sdb1", OWNER="oracle", GROUP="dba", MODE="0660"

Binding result

[root@www.xifenfei.com ~]# ls -l /dev/xifenfei-*
lrwxrwxrwx. 1 root   root     4 Aug  7 22:49 /dev/xifenfei-sdb1 -> sdb1
brw-rw----. 1 oracle dba  8, 32 Aug  7 22:36 /dev/xifenfei-sdc
[root@www.xifenfei.com ~]# ls -l /dev/sdb1
brw-rw----. 1 oracle dba 8, 17 Aug  7 22:49 /dev/sdb1

Using udev only to change disk permissions

[root@www.xifenfei.com ~]# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Command (m for help): n
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): p
Partition number (2-4, default 2):
First sector (2099200-41943039, default 2099200):
Using default value 2099200
Last sector, +sectors or +size{K,M,G} (2099200-41943039, default 41943039): +1G
Partition 2 of type Linux and of size 1 GiB is set
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
[root@www.xifenfei.com ~]# more /etc/udev/rules.d/99-my-asmdevices.rules
KERNEL=="sd?2", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id -g -u -d /dev/$parent",
 RESULT=="36000c29e91831cedbe69afe6cc08daf7",  OWNER="oracle", GROUP="dba", MODE="0660"
[root@www.xifenfei.com ~]# /sbin/udevadm trigger --type=devices --action=change
[root@www.xifenfei.com ~]# ls -l /dev/sdb2
brw-rw----. 1 oracle dba 8, 18 Aug  7 23:14 /dev/sdb2

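As shown, two methods of udev binding were used on Linux 7: one creates a real device node, the other uses a symbolic link. Thanks to lunar (Lunar's Oracle lab) for the help while learning Linux 7.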
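
When many partitions need the symlink style of rule, the lines can be generated rather than typed. This is only a sketch: `symlink_rule` is a hypothetical helper, the rule text simply mirrors the SYMLINK rule shown earlier, and the scsi_id path /usr/lib/udev/scsi_id is the one used on this system. Each generated rule is a single line, as it would appear in the rules file.

```python
def symlink_rule(kernel, scsi_id, name, owner="oracle", group="dba", mode="0660"):
    """Build one udev rule that symlinks a partition and sets owner/group/mode."""
    return (
        f'KERNEL=="{kernel}", SUBSYSTEM=="block", '
        f'PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", '
        f'RESULT=="{scsi_id}", SYMLINK+="{name}", '
        f'OWNER="{owner}", GROUP="{group}", MODE="{mode}"'
    )

# UUID and names taken from the example above.
rule = symlink_rule("sd?1", "36000c29e91831cedbe69afe6cc08daf7", "xifenfei-sdb1")
print(rule)
```

The output would still need to be written to a file under /etc/udev/rules.d and activated with udevadm trigger, as in the session above.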

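Handling ORA-600 13013 on the MON_MODS$ table

A friend reported that his database would run for a while after startup and then crash on its own, and asked us to find the cause. The analysis showed that the SMON process was hitting ORA-600 13013, which brought the database down.
Alert log errors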

Thu Aug  4 18:39:44 2016
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=33, OS id=22935
Thu Aug  4 18:39:44 2016
Completed: ALTER DATABASE OPEN
Thu Aug  4 18:39:44 2016
db_recovery_file_dest_size of 2048 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Aug  4 18:48:41 2016
Thread 1 advanced to log sequence 86746
  Current log# 3 seq# 86746 mem# 0: /opt/ora10/oradata/ora10g/redo03.log
Thu Aug  4 18:58:13 2016
Errors in file /opt/ora10/admin/ora10g/bdump/ora10g_smon_22449.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [482], [4198075], [40], [4198075], [17], []
Thu Aug  4 18:58:56 2016
Non-fatal internal error happenned while SMON was doing flushing of monitored table stats.
SMON encountered 8 out of maximum 100 non-fatal internal errors.
Thu Aug  4 18:59:06 2016
Errors in file /opt/ora10/admin/ora10g/bdump/ora10g_smon_22449.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [482], [4198075], [40], [4198075], [17], []
Thu Aug  4 18:59:08 2016
Errors in file /opt/ora10/admin/ora10g/bdump/ora10g_pmon_22413.trc:
ORA-00474: SMON process terminated with error
Thu Aug  4 18:59:08 2016
PMON: terminating instance due to error 474
Instance terminated by PMON, pid = 22413
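
When an alert log is long, the errors that precede the instance termination can be picked out mechanically. The following is a minimal Python sketch of that idea, not part of the original diagnosis; the function name `errors_before_termination` and the four-line sample are illustrative assumptions, with the sample lines copied from the log above.

```python
import re

# Matches the first argument of an ORA-00600 line as it appears in the alert log.
ORA600 = re.compile(r"ORA-00600: internal error code, arguments: \[(\d+)\]")

def errors_before_termination(lines):
    """Return the distinct ORA-600 first arguments seen before the instance terminates."""
    seen = []
    for line in lines:
        if "terminating instance" in line:
            break
        match = ORA600.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

# Sample lines copied from the alert log shown above.
sample = [
    "Errors in file /opt/ora10/admin/ora10g/bdump/ora10g_smon_22449.trc:",
    "ORA-00600: internal error code, arguments: [13013], [5001], [482], [4198075], [40], [4198075], [17], []",
    "ORA-00474: SMON process terminated with error",
    "PMON: terminating instance due to error 474",
]

print(errors_before_termination(sample))
```

For a real log the same function could be fed the lines of the alert file; only the part up to the first termination message is examined.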

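The log shows that the ORA-600 13013 error is what crashed the database, and it also contains the message "SMON was doing flushing of monitored table stats". From experience, this most likely relates to SMON flushing the DML monitoring information it collects for tables.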

Meaning of ORA-600 [13013]

ORA-600 [13013] [a] [b] [c] [d] [e] [f]
This format relates to Oracle Server 8.0.3 to 10.1
Arg [a] Passcount
Arg [b] Data Object number
Arg [c] Tablespace Relative DBA of block containing the row to be updated
Arg [d] Row Slot number
Arg [e] Relative DBA of block being updated (should be same as [c])
Arg [f] Code
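
To read the error line against this argument list, the bracketed values can be labelled one by one. The sketch below is only an illustration: the function `parse_13013` and the key names are made up here, and the sample line is the one from the alert log above.

```python
import re

# Labels follow the ORA-600 [13013] argument description quoted above.
ARG_NAMES = ("passcount", "data_object_number", "row_block_rdba",
             "row_slot", "updated_block_rdba", "code")

def parse_13013(line):
    """Label arguments [a]..[f] of an ORA-600 [13013] line."""
    values = re.findall(r"\[([^\]]*)\]", line)
    if not values or values[0] != "13013":
        raise ValueError("not an ORA-600 [13013] line")
    numbers = [int(v) for v in values[1:7]]
    return dict(zip(ARG_NAMES, numbers))

line = ("ORA-00600: internal error code, arguments: "
        "[13013], [5001], [482], [4198075], [40], [4198075], [17], []")
args = parse_13013(line)
print(args["data_object_number"])
# Arg [c] and Arg [e] are expected to name the same block.
print(args["row_block_rdba"] == args["updated_block_rdba"])
```

Here Arg [b] is 482, and Arg [c] and Arg [e] are both 4198075, as the description says they should be.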

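Based on these arguments and on the description in How to resolve ORA-00600 [13013], [5001] [ID 816784.1], the object involved can be looked up from Arg [b], which is 482 in this case.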

Object behind ORA-600 13013

SQL> select object_name from dba_objects where object_id=482
OBJECT_NAME
--------------------------------------------------------------------------------
MON_MODS$

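This object is exactly the table used to track DML changes, and SMON operates on it. An earlier article covered it: MON_MODS$和MON_MODS_ALL$统计DML操作次数 (counting DML operations with MON_MODS$ and MON_MODS_ALL$).
Fixing ORA-600 13013 on MON_MODS$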

SQL> analyze table mon_mods$ validate structure cascade;
analyze table mon_mods$ validate structure cascade
*
ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file
SQL> select index_name from dba_indexes where table_name='MON_MODS$';
INDEX_NAME
------------------------------
I_MON_MODS$_OBJ
SQL> ALTER INDEX I_MON_MODS$_OBJ REBUILD;
Index altered.
SQL> analyze table mon_mods$ validate structure cascade;
analyze table mon_mods$ validate structure cascade
*
ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file
SQL> CREATE TABLE MON_MODS_BAK AS SELECT * FROM MON_MODS$;
Table created.
SQL> SELECT COUNT(*) FROM MON_MODS$;
  COUNT(*)
----------
      1247
SQL> C/MON_MODS$/MON_MODS_BAK;
  1* SELECT COUNT(*) FROM MON_MODS_BAK
SQL> /
  COUNT(*)
----------
      1247
SQL> TRUNCATE TABLE MON_MODS$;
Table truncated.
SQL> INSERT INTO MON_MODS$ SELECT * fROM MON_MODS_BAK;
1247 rows created.
SQL> COMMIT;
Commit complete.
SQL>  analyze table mon_mods$ validate structure cascade;
Table analyzed.

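This completes the handling of the ORA-600 13013 error on MON_MODS$. The problem can also be solved by dropping and re-creating the I_MON_MODS$_OBJ index, but not by ALTER INDEX ... REBUILD, as the failed attempt above shows. The database no longer crashes because of this error.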

Recovering from ORA-600 4042

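Result of running the Oracle database recovery check script (Oracle Database Recovery Check)
[Figure: wrong_scn, output of the recovery check script]
The figure above shows that file 2 was not recovered successfully (the log has to be checked to find out why) and that file 3 had been taken offline earlier and needs historical archived logs (the database runs in NOARCHIVELOG mode, so this is set aside for now and handled later).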

Why the recovery of file 2 failed

Wed Aug  3 15:21:11 2016
ALTER DATABASE RECOVER  datafile 2
Wed Aug  3 15:21:11 2016
Media Recovery Start
 parallel recovery started with 2 processes
Wed Aug  3 15:21:11 2016
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1916 Reading mem 0
  Mem# 0 errs 0: /home/oracle/orabase/oradata/ORACLE/redo01.log
Wed Aug  3 15:21:11 2016
Errors in file /u01/app/oracle/admin/oracle/bdump/oracle_p001_22017.trc:
ORA-00600: internal error code, arguments: [3020], [2], [41], [8388649], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 41)
ORA-10564: tablespace UNDOTBS1
ORA-01110: data file 2: '/home/oracle/orabase/oradata/ORACLE/undotbs01.dbf'
ORA-10560: block type '0'
Wed Aug  3 15:21:13 2016
Errors in file /u01/app/oracle/admin/oracle/bdump/oracle_p001_22017.trc:
ORA-00600: internal error code, arguments: [3020], [2], [41], [8388649], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 41)
ORA-10564: tablespace UNDOTBS1
ORA-01110: data file 2: '/home/oracle/orabase/oradata/ORACLE/undotbs01.dbf'
ORA-10560: block type '0'
Wed Aug  3 15:21:18 2016
Media Recovery failed with error 12801
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 2  ...

The log shows that file 2 could not be recovered because of ORA-600 3020.
Handling file 2

SQL> recover  datafile 2 allow 1 corruption;
Media recovery complete.
Thu Aug  4 01:58:35 2016
ALTER DATABASE RECOVER  datafile 2 allow 1 corruption
Media Recovery Start
 ALLOW CORRUPTION option must use serial recovery
Thu Aug  4 01:58:35 2016
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1916 Reading mem 0
  Mem# 0 errs 0: /home/oracle/orabase/oradata/ORACLE/redo01.log
Thu Aug  4 01:58:35 2016
Media Recovery Complete (oracle)
Completed: ALTER DATABASE RECOVER  datafile 2 allow 1 corruption

Trying to open the database

SQL> alter database open ;
alter database open
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
Thu Aug  4 01:59:20 2016
alter database open
Thu Aug  4 01:59:21 2016
Beginning crash recovery of 1 threads
 parallel recovery started with 2 processes
Thu Aug  4 01:59:21 2016
Started redo scan
Thu Aug  4 01:59:21 2016
Completed redo scan
 1619 redo blocks read, 0 data blocks need recovery
Thu Aug  4 01:59:21 2016
Started redo application at
 Thread 1: logseq 1916, block 12724
Thu Aug  4 01:59:21 2016
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1916 Reading mem 0
  Mem# 0 errs 0: /home/oracle/orabase/oradata/ORACLE/redo01.log
Thu Aug  4 01:59:21 2016
Completed redo application
Thu Aug  4 01:59:21 2016
Completed crash recovery at
 Thread 1: logseq 1916, block 14343, scn 3303614971196
 0 data blocks read, 0 data blocks written, 1619 redo blocks read
Thu Aug  4 01:59:21 2016
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=18, OS id=5542
Thu Aug  4 01:59:21 2016
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=19, OS id=5544
Thu Aug  4 01:59:21 2016
Thread 1 advanced to log sequence 1917
Thread 1 opened at log sequence 1917
  Current log# 2 seq# 1917 mem# 0: /home/oracle/orabase/oradata/ORACLE/redo02.log
Successful open of redo thread 1
Thu Aug  4 01:59:21 2016
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Aug  4 01:59:21 2016
ARC1: STARTING ARCH PROCESSES
Thu Aug  4 01:59:21 2016
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
Thu Aug  4 01:59:21 2016
SMON: enabling cache recovery
Thu Aug  4 01:59:21 2016
ARC2: Archival started
ARC1: STARTING ARCH PROCESSES COMPLETE
ARC1: Becoming the heartbeat ARCH
ARC2 started with pid=20, OS id=5546
Thu Aug  4 01:59:21 2016
db_recovery_file_dest_size of 2048 MB is 1.05% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Aug  4 01:59:22 2016
Errors in file /u01/app/oracle/admin/oracle/udump/oracle_ora_5505.trc:
ORA-00600: internal error code, arguments: [4042], [0], [], [], [], [], [], []
Thu Aug  4 01:59:23 2016
Errors in file /u01/app/oracle/admin/oracle/udump/oracle_ora_5505.trc:
ORA-00600: internal error code, arguments: [4042], [0], [], [], [], [], [], []
Thu Aug  4 01:59:23 2016
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 5505
ORA-1092 signalled during: alter database open ...

The database cannot be opened because of the ORA-600 4042 error.
Analyzing ORA-600 4042

PARSING IN CURSOR #4 len=142 dep=1 uid=0 oct=3 lid=0 tim=1435788503594313 hv=361892850 ad='a7ab2db8'
select /*+ rule */ name,file#,block#,status$,user#,undosqn,xactsqn,scnbas,scnwrp,
DECODE(inst#,0,NULL,inst#),ts#,spare1 from undo$ where us#=:1
END OF STMT
PARSE #4:c=0,e=11,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,tim=1435788503594311
BINDS #4:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=2aae75802218  bln=22  avl=02  flg=05
  value=3
EXEC #4:c=0,e=39,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,tim=1435788503594393
FETCH #4:c=0,e=8,p=0,cr=2,cu=0,mis=0,r=1,dep=1,og=3,tim=1435788503594412
STAT #4 id=1 cnt=1 pid=0 pos=1 obj=15 op='TABLE ACCESS BY INDEX ROWID UNDO$ (cr=2 pr=0 pw=0 time=8 us)'
STAT #4 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=1 pr=0 pw=0 time=3 us)'
WAIT #1: nam='db file sequential read' ela= 10 file#=2 block#=41 blocks=1 obj#=-1 tim=1435788503594468
Dump of buffer cache at level 4 for tsn=1, rdba=8388649
BH (0x95ff3c58) file#: 2 rdba: 0x00800029 (2/41) class: 21 ba: 0x95ef0000
  set: 3 blksize: 8192 bsi: 0 set-flg: 2 pwbcnt: 0
  dbwrid: 0 obj: -1 objn: 0 tsn: 1 afn: 2
  hash: [a8b77880,a8b77880] lru: [95ff3dd0,a8e70338]
  ckptq: [NULL] fileq: [NULL] objq: [a43da110,a43da110]
  use: [a8e6e658,a8e6e658] wait: [NULL]
  st: XCURRENT md: SHR tch: 0
  flags: gotten_in_current_mode
  LRBA: [0x0.0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
  buffer tsn: 1 rdba: 0x00800029 (2/41)
  scn: 0x0000.00000000 seq: 0x01 flg: 0x01 tail: 0x00000001
  frmt: 0x02 chkval: 0x0000 type: 0x00=unknown
Hex dump of block: st=0, typ_found=0
Dump of memory from 0x0000000095EF0000 to 0x0000000095EF2000
095EF0000 0000A200 00800029 00000000 01010000  [....)...........]
095EF0010 00000000 00000000 00000000 00000000  [................]
        Repeat 509 times
095EF1FF0 00000000 00000000 00000000 00000001  [................]
Dump of memory from 0x0000000095EF0014 to 0x0000000095EF1FFC
095EF0010          00000000 00000000 00000000      [............]
095EF0020 00000000 00000000 00000000 00000000  [................]
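
The header words in the dump above can be decoded by hand to confirm which block this is. The following Python sketch does that under two assumptions that are not stated in the original: the trace prints little-endian 32-bit words (Linux x86-64), and the first 16 bytes of the block follow the commonly documented cache header layout (type, format, two spare bytes, rdba, SCN base, SCN wrap, seq, flg). The 10-bit/22-bit split of the rdba applies to smallfile tablespaces. All function names are illustrative.

```python
import struct

# First four 32-bit words of the block dump above, and the last word of the block.
HEADER_WORDS = (0x0000A200, 0x00800029, 0x00000000, 0x01010000)
TAIL_WORD = 0x00000001

def decode_rdba(rdba):
    """Split a relative DBA into (relative file#, block#): 10 bits + 22 bits."""
    return rdba >> 22, rdba & 0x3FFFFF

def parse_header(words):
    # Assumption: words were printed from little-endian memory.
    raw = struct.pack("<4I", *words)
    typ, frmt, _spare1, _spare2, rdba, scn_base, scn_wrap, seq, flg = \
        struct.unpack("<BBBBIIHBB", raw)
    return {"type": typ, "frmt": frmt, "rdba": rdba,
            "scn_base": scn_base, "scn_wrap": scn_wrap, "seq": seq, "flg": flg}

def expected_tail(header):
    # Tail check: low 16 bits of the SCN base, then block type, then seq.
    return ((header["scn_base"] & 0xFFFF) << 16) | (header["type"] << 8) | header["seq"]

header = parse_header(HEADER_WORDS)
print(decode_rdba(header["rdba"]))   # block named in the dump header
print(decode_rdba(8388649))          # fourth argument of the ORA-600 [3020] error
print(header["type"], expected_tail(header) == TAIL_WORD)
```

Both the rdba in the dump (0x00800029) and the value 8388649 from the ORA-600 [3020] arguments decode to file 2, block 41, and the block type byte is 0.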

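Here we can see that file 2 block 41 has type unknown. Looking back at the ORA-600 3020 error, the block reported there is this very block. It is therefore almost certain that the earlier allow 1 corruption operation caused the later ORA-600 4042 error. Oracle's official explanation of ORA-600 [4042]:
[Figure: ORA-600-4042, Oracle's description of the error]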


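The error was worked around by modifying the rollback segment status in undo$ (see: bbed修改undo$(回滚段)状态, using bbed to modify the rollback segment status in undo$).
Open the database normally, fix the SCN of file 3 and bring the datafile online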

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.
Total System Global Area 1224736768 bytes
Fixed Size                  2020384 bytes
Variable Size             318770144 bytes
Database Buffers          889192448 bytes
Redo Buffers               14753792 bytes
Database mounted.
SQL>   SELECT thread#,
  2           a.sequence#,
  3           a.group#,
  4           TO_CHAR (first_change#, '9999999999999999') "SCN",
  5           a.status,
  6           MEMBER
  7      FROM v$log a, v$logfile b
  8     WHERE a.group# = B.GROUP#
  9  ORDER BY a.sequence# DESC;
   THREAD#  SEQUENCE#     GROUP# SCN
---------- ---------- ---------- ----------------------------------
STATUS
--------------------------------
MEMBER
--------------------------------------------------------------------------------
         1       1919          1     3303615011212
CURRENT
/home/oracle/orabase/oradata/ORACLE/redo01.log
         1       1918          3     3303614991206
INACTIVE
/home/oracle/orabase/oradata/ORACLE/redo03.log
   THREAD#  SEQUENCE#     GROUP# SCN
---------- ---------- ---------- ----------------------------------
STATUS
--------------------------------
MEMBER
--------------------------------------------------------------------------------
         1       1917          2     3303614971197
INACTIVE
/home/oracle/orabase/oradata/ORACLE/redo02.log
SQL> recover database using backup controlfile;
ORA-00279: change 3303615011452 generated at 08/04/2016 02:06:52 needed for
thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORACLE/archivelog/2016_08_04/o1_mf_1_1919_%u
_.arc
ORA-00280: change 3303615011452 for thread 1 is in sequence #1919
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/home/oracle/orabase/oradata/ORACLE/redo01.log
Log applied.
Media recovery complete.
SQL> alter database datafile 3 online;
Database altered.
SQL> alter database open resetlogs;
Database altered.
SQL>
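
In the transcript above, the redo log passed to the recovery prompt is the member whose sequence# matches the sequence requested in ORA-00280. A minimal Python sketch of that matching step follows; the helper names `requested_sequence` and `pick_member` are hypothetical, and the data is copied from the v$log/v$logfile output and the ORA-00280 message above.

```python
import re

def requested_sequence(message):
    """Extract (thread, sequence) from an ORA-00280 message."""
    match = re.search(r"thread (\d+) is in sequence #(\d+)", message)
    if match is None:
        raise ValueError("not an ORA-00280 message")
    return int(match.group(1)), int(match.group(2))

def pick_member(logs, thread, sequence):
    """Return the online redo member holding the requested sequence, or None."""
    for log in logs:
        if log["thread"] == thread and log["sequence"] == sequence:
            return log["member"]
    return None

# Rows copied from the v$log / v$logfile query above.
logs = [
    {"thread": 1, "sequence": 1919, "group": 1, "status": "CURRENT",
     "member": "/home/oracle/orabase/oradata/ORACLE/redo01.log"},
    {"thread": 1, "sequence": 1918, "group": 3, "status": "INACTIVE",
     "member": "/home/oracle/orabase/oradata/ORACLE/redo03.log"},
    {"thread": 1, "sequence": 1917, "group": 2, "status": "INACTIVE",
     "member": "/home/oracle/orabase/oradata/ORACLE/redo02.log"},
]

message = "ORA-00280: change 3303615011452 for thread 1 is in sequence #1919"
thread, sequence = requested_sequence(message)
print(pick_member(logs, thread, sequence))
```

If no online member holds the requested sequence, the function returns None, which corresponds to the case where an archived log would be needed instead.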

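At this point the database is basically recovered. It is strongly recommended to rebuild the database through a logical export and import.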

Resolving ORA-600 4194/ORA-600 4193/ORA-600 4137

General steps for handling the common undo errors ORA-600 4193, ORA-600 4194 and ORA-600 4137.
Applicable versions

Oracle Database - Enterprise Edition - Version 9.2.0.1 to 11.2.0.4 [Release 9.2 to 11.2]
Information in this document applies to any platform.

Symptoms

The following error is occurring in the alert.log right before the database crashes.
ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []
This error indicates that a mismatch has been detected between redo records and rollback (undo) records.
ARGUMENTS:
Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block
Since we are adding a new undo record to our undo block, we would expect that the new record number
 is equal to the maximum record number in the undo block plus one. Before Oracle can add
a new undo record to the undo block it validates that this is correct. If this validation fails,
 then an ORA-600 [4194] will be triggered.
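
The validation described above amounts to a single comparison. The snippet below is only a model of that rule in Python, not Oracle's code; the function name and the record numbers in the example are hypothetical.

```python
def undo_record_is_consistent(max_record_in_block, record_from_redo):
    """Model of the check: the new undo record must be the block's maximum plus one."""
    return record_from_redo == max_record_in_block + 1

# Hypothetical values: Arg [a] is the maximum in the undo block, Arg [b] comes from redo.
print(undo_record_is_consistent(17, 18))
print(undo_record_is_consistent(17, 30))
```

In this model a mismatch such as the second call corresponds to the situation in which ORA-600 [4194] is raised with the two numbers as its arguments.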

Cause

This can also be caused by the following defect
Bug 8240762 Abstract: Undo corruptions with ORA-600 [4193]/ORA-600 [4194] or ORA-600 [4137] after SHRINK
Details:
Undo corruption may be caused after a shrink and the same undo block may be used
for two different transactions causing several internal errors like:
ORA-600 [4193] / ORA-600 [4194] for new transactions
ORA-600 [4137] for a transaction rollback

Steps

Best practice to create a new undo tablespace.
This method includes segment check.
Create pfile from spfile to edit
>create pfile from spfile;
1. Shutdown the instance
2. set the following parameters in the pfile
    undo_management = manual
    event = '10513 trace name context forever, level 2'
3. >startup restrict pfile=<initsid.ora>
4. >select tablespace_name, status, segment_name from dba_rollback_segs where status != 'OFFLINE';
This is critical - we are looking for all undo segments to be offline - System will always be online.
If any are 'PARTLY AVAILABLE' or 'NEEDS RECOVERY' - Please open an issue with Oracle Support or update the current SR.
If all offline then continue to the next step
5. Create new undo tablespace - example
>create undo tablespace <new undo tablespace> datafile <datafile> size 2000M;
6. Drop old undo tablespace
>drop tablespace <old undo tablespace> including contents and datafiles;
7. >shutdown immediate;
8. >startup nomount;  --> Using your Original spfile
9. Modify the spfile with the new undo tablespace name
  Alter system set undo_tablespace = '<new tablespace created in step 5>' scope=spfile;
10. >shutdown immediate;
11. >startup;  --> Using spfile
The reason we create a new undo tablespace first is to use new undo segment numbers
 that are higher than the current segments being used.
This way when a transaction goes to do block clean-out
the reference to that undo segment does not exist and continues with the block clean-out.
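
The decision in step 4 can be written down as a small check over the rows returned by the dba_rollback_segs query. This Python sketch is an illustration under assumptions: the function name and return strings are made up, the sample segment name is hypothetical, and the "review" outcome for other segments that are still online is not covered by the note itself.

```python
BLOCKING_STATUSES = {"PARTLY AVAILABLE", "NEEDS RECOVERY"}

def step4_decision(rows):
    """rows: (tablespace_name, status, segment_name) tuples with status != 'OFFLINE'."""
    for _tablespace, status, _segment in rows:
        if status in BLOCKING_STATUSES:
            return "stop: open or update an SR with Oracle Support"
    # The SYSTEM rollback segment is always online and is ignored here.
    others = [row for row in rows if row[2] != "SYSTEM"]
    if others:
        # Assumption: the note does not say what to do in this case.
        return "review: non-SYSTEM segments are not offline"
    return "continue with step 5"

print(step4_decision([("SYSTEM", "ONLINE", "SYSTEM")]))
print(step4_decision([("SYSTEM", "ONLINE", "SYSTEM"),
                      ("UNDOTBS1", "NEEDS RECOVERY", "_SYSSMU3$")]))
```

Only the first outcome, where nothing but the SYSTEM segment is returned, matches the "all offline" condition under which the procedure continues.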

Reference: Step by step to resolve ORA-600 4194 4193 4197 on database crash (Doc ID 1428786.1)