rac redo log file被意外覆盖数据库恢复

联系:手机/微信(+86 17813235971) QQ(107644445)

标题:rac redo log file被意外覆盖数据库恢复

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

当前日志被覆盖导致错误
朋友的一客户在一套rac上包含了两个数据库,其其中一个库增加redo group时候,覆盖了另外一个库的redo,悲剧的是刚好是current redo

Wed May 16 17:03:05 2012
ALTER DATABASE OPEN
This instance was first to open
Wed May 16 17:03:09 2012
Beginning crash recovery of 2 threads
 parallel recovery started with 15 processes
Wed May 16 17:03:11 2012
Started redo scan
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
Abort recovery for domain 0
Wed May 16 17:03:11 2012
Aborting crash recovery due to error 305
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
ORA-305 signalled during: ALTER DATABASE OPEN...
Wed May 16 17:03:13 2012
Shutting down instance (abort)

使用_allow_resetlogs_corruption= TRUE进行恢复

Wed May 16 18:16:48 2012
SMON: enabling cache recovery
Wed May 16 18:16:48 2012
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Wed May 16 18:16:48 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773],
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773],
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 2105454
ORA-1092 signalled during: alter database open resetlogs...

ORA-600[KCLCHKBLK_4], is signaled because the SCN in a tempfile block is too high.
The same reason caused the ORA-600[2662]s in the alert logs.
因为是临时文件的scn太大的问题,那就比较好解决:
启动数据库到mount状态,查询出来相关temp file,然后drop掉.

ORA-00600[6856]

Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []
ORACLE Instance odsdb1 (pid = 16) - Error 600 encountered while recovering transaction
(10, 8) on object 7162533.
Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []

这里的4193和4194是比较熟悉的,根据这里的提示猜测6856也是和undo有关系
ORA-600[6856]SMON is trying to recover a dead transaction.
But the undo application runs into an internal error (trying to delete a row that is already deleted).
因为smon回滚的时候出现上面错误,解决方法是想办法终止回滚,使用event=”10513 trace name context forever, level 2″.

ORA-00600[4193]/ORA-00600[4194]

Wed May 16 20:25:17 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:18 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:56 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00600: internal error code, arguments: [4193], [22248], [22252], [], [], [], [], []

太常见错误,不再做说明,虽然使用event是的库open成功,因为部分回滚段有问题,该错误还是会出现(还是喜欢直接屏蔽回滚段)

ORA-00600[ktpridestroy2]

Wed May 16 20:36:26 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2101296.trc:
ORA-00600: internal error code, arguments: [ktpridestroy2], [], [], [], [], [], [], []

This error could be the result of a corruption and involves the parallel rollback that SMON enables each startup.
解决:fast_start_parallel_rollback=false

ORA-00600[kturacf1]/ORA-00600[kcbgcur_9]

Wed May 16 20:49:15 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_2007088.trc:
ORA-00600: internal error code, arguments: [kturacf1], [2097152], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcbgcur_9], [780140563], [4], [4294959056], [2097152], [], [], []

ORA-00600[kturacf1]错误未查询到原因
ORA-00600[kcbgcur_9]错误原因可能是:Buffers are pinned in a specific class order to prevent internal deadlocks.
因为这两个错误是job产生非致命错误,在这次的处理过程中可以忽略

ORA-00600[4097]

Wed May 16 21:05:05 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_1716282.trc:
ORA-12012: error on auto execute of job 6603
ORA-20001: ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
ORA-06512: at "EPBI.UP_SYSLOG_ONLINE_USER", line 141
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []

When an instance has a rollback segment offline and the instance crashes, or
the user does a shutdown abort, the rollback segment wrap number does not get
updated. If that segment is then dropped and recreated immediately after the
instance is restarted, the wrap number could be lower than existing wrap
numbers. This will cause the ORA-600[4097] to occur in subsequent
transactions using Rollback.
这个错误也是因为回滚段wrap number未被及时更新导致的异常.

总结这次恢复过程
1.因当前redo丢失,使用隐含参数打开库,然后出现ORA-600[KCLCHKBLK_4](这个错误比较少见,更何况这个库是10.2.0.4)
2.undo出现问题出现ORA-00600[6856]错误不是很多见
3.接下来都是一些列undo导致的问题,其实如果开始就直接使用隐含参数删除掉有问题回滚段,效果可能会比event好.
4.因为部分trace文件没有拿到,未做深入分析,文章列出整体恢复思路
5.本次恢复的数据库版本是10.2.0.4,根据mos描述,很多错误应不会发生,但是实际还是发生了,MOS的版本范围,不要太看重.
6.其实这篇文章的本质不是展示恢复过程,而是再一次提醒:操作数据库慎重,特别是一台机器上多套库.

One thought on “rac redo log file被意外覆盖数据库恢复

  1. ORA-600 [kcbgcur_9]

    ERROR:
      Format: ORA-600 [kcbgcur_9] [a] [b] [c] [d]
    VERSIONS:
      versions 8.0 to 10.0
    DESCRIPTION:
      Buffers are pinned in a specific class order to prevent internal deadlocks.
      This exception means there is a class violation with the current
      buffers pinned in the wrong class order, or duplicate buffers of
      the same class.
      May occur with an ORA-372 error when trying to update read-only
      tablespace or partition via direct load.
    ARGUMENTS:
      Arg [a] The Database Block Address (DBA)
      Arg [b] The buffer class
      Arg [c] Block class order mask
      Arg [d] Block class checking mask
    FUNCTIONALITY:
      BUFFER CACHE
    IMPACT:
      PROCESS FAILURE
      NON CORRUPTIVE - No underlying data corruption.
    SUGGESTIONS:
      Check all system hardware (disk, memory) to ensure there are no  errors.
    
  2. ORA-600 [6006] ORA-600 [6856] During Startup Instance, Followed by Termination by SMON

    Applies to:
    Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3 - Release: 10.2 to 10.2
    Information in this document applies to any platform.
    Symptoms
    The following errors are occurring when starting the instance:
    ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
    ORA-00600: internal error code, arguments: [6856], [0], [60], [], [], []
    SMON will eventually terminate the instance.
    Cause
    These errors can occur if the database crashes under certain circumstances.
    The undo segments are trying to rollback a failed transaction and cannot.
    ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
    Oracle is undoing an index leaf key operation. If the key is not found,
    ORA-00600 [6006] is logged.
    ORA-00600: internal error code, arguments: [6856], [0], [60], [], [], []
    SMON is trying to recover a dead transaction. But the undo application runs into an
    internal error (trying to delete a row that is already deleted).
    Solution
    Review the trace files and look for the object(s) involved.
    If the trace file does not have a SQL statement, search on the following: "block dump header"
    In the block header there will be a seg/obj = hex value.
    Convert the hex to dec and this will give you the data_object_id.
    The alert.log may also show the affected object, for example:
    ORACLE Instance ORCL (pid = 8) - Error 600 encountered while recovering
    transaction (9999, 36) on object 45879.
    SQL>select object_id, data_object_id, owner, object_name from dba_objects where
    object_id = object# or data_object_id=object# ;
    This will be the object you need to work with.
    In case no object is shown at select of dba_objects check if object has been already dropped
    and is still present in recyclebin(in that case the problem can/will still be observed):
    SQL> select u.name,r.original_name,t.name from recyclebin$ r, user$ u, ts$ t where
         r.obj#=&object_id and r.ts#=t.ts# and r.owner#=u.user# ;
    To implement the solution:
    1. shutdown instance (if not already down)
    SQL> shutdown immediate
    2. set event - event="10513 trace name context forever, level 2"
    (event disables transaction recovery which was initiated by SMON)
    It would be best to create a pfile which is to be used for this action only:
    SQL>
    connect / as sysdba
    startup mount
    create pfile='/tmp/init$ORACLE_SID.ora' from spfile;
    shutdown immediate
    +
    echo 'event="10513 trace name context forever, level 2"' >> /tmp/init$ORACLE_SID.ora
    *) in case already a pfile is being used make a copy to /tmp and ...
    3. startup instance using the pfile as just created:
    SQL>
    connect / as sysdba
    startup pfile=/tmp/init$ORACLE_SID.ora
    4. Drop the object:
    a) If the object is an index the action is to drop the index, in case no create
    script is present you can extract one by:
    SQL>
    connect / as sysdba
    set long 1000000
    select dbms_metadata.get_ddl('INDEX',upper('&object_name'),upper('&owner')) from dual;
    b) If the object is a table there will likely be a need to salvage the content of
    the table before dropping the table.
    Possible options:
    - exp{dp} table
    - CTAS (CreateTableAsSelect)
    In case recyclebin is active it might be that object was already in recylebin OR has
    been moved by drop into the recyclebin, therefore check and purge the recyclebin (if applicable):
    SQL> select u.name,r.original_name,t.name from recyclebin$ r, user$ u, ts$ t
         where r.obj#=&object_id and r.ts#=t.ts# and r.owner#=u.user# ;
    SQL> purge &tablespace_name;
    or
    SQL> purge dba_recyclebin;
    5. shutdown the instance
    SQL>
    connect / as sysdba
    shutdown immediate
    6. remove the event
    Only needed in case the event was pushed into the spfile, if true the steps are:
    SQL>
    connect / as sysdba
    startup mount
    alter system reset event scope=spfile sid='*';
    shutdown immediate
    7. startup the instance
    SQL>
    connect / as sysdba
    startup
    8. recreate the affected object(s)
    
  3. ORA-600 [kclchkblk_4] and ORA-600 [2662] After Recovery of Database

    Applies to:
    Oracle Server - Enterprise Edition - Version: 9.0.1.0 to 10.1.0.2 - Release: 9.0.1 to 10.1
    Information in this document applies to any platform.
    ***Checked for relevance on 22-03-2012***
    Symptoms
    1) You restored the database from backup and did an incomplete recovery.
    2) Opened the database with resetlogs.
    3) After opening the database, you start getting following errors:
    ORA-00600 [kclchkblk_4]
    ORA-00600 [2662]
    Example output from ALERT:
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl_smon_7493.trc:
    ORA-600: internal error code, arguments: [kclchkblk_4], [1904],
    [18446744073431179384], [1904],
    [18446744073403569507], [], [], []
    Starting background process QMNC
    QMNC started with pid=24, OS id=8329
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl_smon_7493.trc:
    ORA-600: internal error code, arguments: [2662], [1904], [3988985522],
    [1904], [4016595064], [83
    88610], [], []
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl_smon_7493.trc:
    ORA-600: internal error code, arguments: [2662], [1904], [3988985525],
    [1904], [4016595064], [83
    88610], [], []
    SMON: terminating instance due to error 474
    Instance terminated by SMON, pid = 7493
    4) The call stack generated for ORA-600 [kclchkblk_4] looks similar to:
    kclchkblk kcbzib kcbgcur ktfbhget ktftfcload
    Changes
    This problem situation occurs in 10.1.0.2, after flashback database
    when trying to open database, using 'open database resetlogs'.
    Cause
    1) Error, ORA-600[KCLCHKBLK_4], is signaled because the SCN in a tempfile block
    is too high. The same reason caused the ORA-600[2662]s in the alert logs.
    2) This issue is because the tempfiles may not get reinitialized during open resetlogs.
    This problem is analyzed in unpublished
    Bug 3517013: GSIST: OPEN DB RESETLOG AFTER FLASHBACK DB FAILS ORA-600 [KCLCHKBLK_4], [1904],
    which is fixed in 10.1.0.3 and later in 10.1.
    This problem will not occure in 10.2 or later, since we nolonger reinitialize tempfiles in 10.2.
    Solution
    To workaround this problem once it has occured, please follow below steps
    to remove and recreate the tempfiles:
    First find the tempfiles in the database:
    SQL>select file_name, file_id from dba_temp_files;
    Startup mount
    Drop all the tempfiles:
    SQL>alter database tempfile drop;
    Alter database open;
    Add new tempfile(s).
    SQL>alter tablespace add tempfile size N;
    Restart the instance.
    To avoid hitting this problem, which is fixed in 10.1.0.3 and later:
    Apply 10.1.0.3 or later patchset or upgrade to 10.2 or later.
    Preferable solution is to upgrade to a supported version of RDBMS.
    See Document 757445.1 FAQ for Products in Extended Support
    
  4. 在一次三个节点的rac恢复过程中遇到如下错误

    Errors in file /u01/app/oracle/admin/ZSDB/udump/zsdb2_ora_18918.trc:
    ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [1727335936], [1], [1726872599], [], [], []
    

    这个库的tempfile已经全部被删除,而且从这里的错误提示a和c的值为1中也可以看出来是system01.dbf文件scn异常导致

  5. ORA-00600 [6006]
    ================
    Oracle is undoing an index leaf key operation. If the key is not found,
    ORA-00600 [6006] is logged.
    ORA-600[6006] is usually caused by a media corruption problem related to either a
    lost write to disk or a corruption on disk.

发表评论

邮箱地址不会被公开。 必填项已用*标注

2 × 3 =