Database SOS – 第136页 – Professional Database Recovery Support@+8617813235971

Systemstates分析会话阻塞—锁表

模拟会话被阻塞

--会话1
SQL> select sid from v$mystat where rownum=1;
       SID
----------
        15
SQL> create table t_xifenfei (id number,name varchar2(20));
Table created.
SQL> insert into t_xifenfei values(1,'xifenfei');
1 row created.
SQL> insert into t_xifenfei values(2,'www.xifenfei');
1 row created.
SQL> commit;
Commit complete.
SQL> select * from t_xifenfei;
        ID NAME
---------- ----------------------------------------
         1 xifenfei
         2 www.xifenfei
SQL> delete from t_xifenfei where id=1;
1 row deleted.
--会话2
SQL>  select sid from v$mystat where rownum=1;
       SID
----------
       143
SQL> delete from chf.t_xifenfei where id=1;
--hang住

新打开会话做Systemstates
我们假设不知道会话1和会话2的sid,现在特定的使用Systemstates分析问题,后面给出简单分析方法

SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_31027.trc
SQL> exit

阅读trace文件
会话2在这样的情况下hang住,而其他会话正常,第一反应是在lock级别阻塞了,而这个又是DML操作,很自然想到是TX,于是在trace文件中搜索”enq: TX” ,找到对应记录,然后向上找到对应的进程号,开始读相关内容,发现有小信息如下:

PROCESS 20:
  ----------------------------------------
--客户端信息
    client details:
      O/S info: user: oracle, term: pts/0, ospid: 30622
      machine: xifenfei program: sqlplus@xifenfei (TNS V1-V3)
      application name: SQL*Plus, hash value=3669949024
--进程相关session信息
    (session) sid: 15 ser: 151 trans: 0x343a4c2c, creator: 0x35fe2218
              flags: (0x45) USR/- flags_idl: (0x0) -/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              oct: 0, prv: 0, sql: (nil), psql: 0x2f6e7b68, user: 84/CHF
--被阻塞会话信息
    There are 1 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 143, ser: 229
      wait event: 'enq: TX - row lock contention'
        p1: 'name|mode'=0x54580006
        p2: 'usn<<16 | slot'=0x40005
        p3: 'sequence'=0x252
      row_wait_obj#: 75928, block#: 171, row#: 0, file# 4
      min_blocked_time: 296 secs, waiter_cache_ver: 7860
    Wait State:
      fixed_waits=0 flags=0x22 boundary=(nil)/-1
--54580006 is split into ASCII 54 + ASCII 58 (TX) + Mode 0006 (X) ...
SQL> select object_type,object_name,owner from dba_objects where object_id=75928;
OBJECT_TYP OBJECT_NAME          OWNER
---------- -------------------- ----------
TABLE      T_XIFENFEI           CHF
--持有锁的信息
      (enqueue) TX-00040005-00000252	DID: 0001-0014-0000009C
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x6
      mode: X, lock_flag: 0x0, lock: 0x343a4c6c, res: 0x353606a8
      own: 0x355ae5b8, sess: 0x355ae5b8, proc: 0x35fe2218, prv: 0x353606b0
--通过上述信息可以分析出结论：
sqlplus登录的sid=15的会话占用了TX mode=6(mode:x)的锁,阻塞了sid=143会话对chf.t_xifenfei表操作

找出被阻塞进程相关信息(sid 为143的进程),搜索”sid: 143″，阅读相关进程信息

PROCESS 21:
  ----------------------------------------
--相关session信息
 (session) sid: 143 ser: 229 trans: 0x343915a0, creator: 0x35fe2d3c
              flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              oct: 7, prv: 0, sql: 0x2f6cb234, psql: 0x2f6dd5cc, user: 0/SYS
--当前等待信息
 Current Wait Stack:
     0: waiting for 'enq: TX - row lock contention'
        name|mode=0x54580006, usn<<16 | slot=0x40005, sequence=0x252
        wait_id=12 seq_num=13 snap_id=1
        wait times: snap=5 min 1 sec, exc=5 min 1 sec, total=5 min 1 sec
        wait times: max=infinite, heur=5 min 1 sec
        wait counts: calls=101 os=101
        in_wait=1 iflags=0x15a0
--阻塞该会话的session信息
    There is at least one session blocking this session.
      Dumping 1 direct blocker(s):
        inst: 1, sid: 15, ser: 151
      Dumping final blocker:
        inst: 1, sid: 15, ser: 151
    Wait State:
      fixed_waits=0 flags=0x22 boundary=(nil)/-1
--请求锁信息
      SO: 0x352f8fcc, type: 8, owner: 0x35765fe8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
       proc=0x35fe2d3c, name=enqueue, file=ksq1.h LINE:380, pg=0
      (enqueue) TX-00040005-00000252	DID: 0001-0015-0000003B
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x6
      req: X, lock_flag: 0x10, lock: 0x352f8ff8, res: 0x353606a8
      own: 0x356f49b8, sess: 0x356f49b8, proc: 0x35fe2d3c, prv: 0x353606b8
--通过对被阻塞对象分析,可以得出和阻塞者相同的信息

对该问题的常规分析思路

--查询等待事件
SQL> select event,p1,p2,p3 from v$session where wait_class#<>6;
EVENT                                  P1         P2         P3
------------------------------ ---------- ---------- ----------
SQL*Net message to client      1650815232          1          0
enq: TX - row lock contention  1415053318     262149        594
--查询锁信息(因为通过上面的等待事件分析,TX可能引起会话hang)
SQL> SELECT * FROM v$lock where type in ('TM','TX');
ADDR     KADDR           SID TYPE        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- ---- ---------- ---------- ---------- ---------- ---------- ----------
352F8FCC 352F8FF8        143 TX       262149        594          0          6       4181          0
B69CC7A8 B69CC7D8        143 TM        75928          0          3          0       4181          0
B69CC7A8 B69CC7D8         15 TM        75928          0          3          0       4266          0
343A4C2C 343A4C6C         15 TX       262149        594          6          0       4267          1
--通过TM查询出来对应对象
SQL>  select object_type,object_name,owner from dba_objects where object_id=75928;
OBJECT_TYP OBJECT_NAME          OWNER
---------- -------------------- ----------
TABLE      T_XIFENFEI           CHF
--通过观察v$lock查询结果可以知道:
15会话的TX MODE=6的锁阻塞了143会话想会的的TX MODE=6的锁,从而是的143会话hang住

Systemstates分析参考文档:Understanding and Reading Systemstates或者[ID 423153.1]

找出 alter system kill session 'sid,serial#' kill 掉的数据库会话对应进程

当我们使用alter system kill session ‘sid,serial#’ 在数据库中kill掉某个会话的时候,如果你观察仔细会发现v$session.paddr发生了改变,从而是的不能直接通过关联v$process.add找出spid,然后进行其他操作.本文提供三种方法找该种情况下spid的方法.
数据库版本

SQL> select * from v$version;
BANNER
-----------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

会话1

SQL> select sid, SERIAL#,paddr from v$session where
  2  sid=(select sid from v$mystat where rownum=1);
       SID    SERIAL# PADDR
---------- ---------- --------
       133         53 35FE16F4

会话2

SQL> select sid, SERIAL#,paddr from v$session where
  2  sid=(select sid from v$mystat where rownum=1);
       SID    SERIAL# PADDR
---------- ---------- --------
       143         21 35FE2D3C

会话3

SQL> alter system kill session '133,53';
System altered.
SQL> alter system kill session '143,21';
System altered.
SQL> select sid, SERIAL#,paddr,status from v$session where sid in(133,143);
       SID    SERIAL# PADDR    STATUS
---------- ---------- -------- ----------------
       133         53 3547A3F4 KILLED
       143         21 3547A3F4 KILLED

证明alter system kill session后,v$session中的paddr发生了改变,这个时候如果需要找出原来的spid,不能使用v$session.paddr和v$process.addr关联获得

找出kill掉的spid方法1

SQL> select spid, program from v$process
  2      where program!= 'PSEUDO'
  3      and addr not in (select paddr from v$session)
  4      and addr not in (select paddr from v$bgprocess)
  5      and addr not in (select paddr from v$shared_server);
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
14260                                            oracle@xifenfei (L001)
14256                                            oracle@xifenfei (L000)
15300                                            oracle@xifenfei (TNS V1-V3)
14179                                            oracle@xifenfei (D000)
15318                                            oracle@xifenfei (TNS V1-V3)
14252                                            oracle@xifenfei (N000)
SQL> !ps -ef|grep 15300|grep -v grep
oracle   15300 14052  0 03:22 ?        00:00:00 oracleora11g (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
SQL> !ps -ef|grep 15318|grep -v grep
oracle   15318 15315  0 03:22 ?        00:00:00 oracleora11g (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

找出kill 掉的spid 方法2

SQL> SELECT s.username,s.status,
  2  x.ADDR,x.KSLLAPSC,x.KSLLAPSN,x.KSLLASPO,x.KSLLID1R,x.KSLLRTYP,
  3  decode(bitand (x.ksuprflg,2),0,null,1)
  4  FROM x$ksupr x,v$session s
  5  WHERE s.paddr(+)=x.addr
  6  and bitand(ksspaflg,1)!=0;
USERNAME   STATUS           ADDR       KSLLAPSC   KSLLAPSN KSLLASPO                   KSLLID1R KSLL DE
---------- ---------------- -------- ---------- ---------- ------------------------ ---------- ---- --
           ACTIVE           35FD5990          6         26 14121                             0      1
           ACTIVE           35FD6FD8          1         69 14055                             0      1
           ACTIVE           35FD8620          1         69 14055                             0      1
           ACTIVE           35FD9C68          1         69 14055                             0      1
           ACTIVE           35FDB2B0          8         27 15300                             0      1
           ACTIVE           35FDC8F8         12         36 15300                             0      1
           ACTIVE           35FDDF40          1         69 14055                             0      1
           ACTIVE           35FDF588          1         69 14055                             0      1
           ACTIVE           35FE3860          7         26 14236                             0      1
           ACTIVE           35FE4EA8          1         69 14224                             0      1
           ACTIVE           35FE64F0         63          2 14311                           377 EV   1
           ACTIVE           35FEA7C8          3         26 14155                           258 EV   1
           ACTIVE           35FE9180         59          2 14248                           378 EV   1
           ACTIVE           35FE9CA4         12          2 14603                             0      1
           ACTIVE           35FD64B4          1         69 14055                             0      1
           ACTIVE           35FD7AFC          2         27 14055                             0      1
           ACTIVE           35FD9144          2         27 15300                             0      1
           ACTIVE           35FDA78C          3         26 14171                             0      1
           ACTIVE           35FDBDD4         17          2 15255                             0      1
           ACTIVE           35FDD41C         22         26 14155                             0      1
           ACTIVE           35FDEA64         52         26 14155                             0      1
           ACTIVE           35FE4384          1         69 14224                             0      1
           ACTIVE           35FE59CC          1         69 14224                             0      1
           ACTIVE           35FEB2EC          2          2 14248                             0      1
           ACTIVE           35FEC934         11         26 14121                             0      1
SYS        ACTIVE           35FEF5C4          4         16 14117                             0
                            35FE0BD0          1         69 14055                             0
                            35FE865C          1         69 14117                             0
                            35FE7B38          1         69 14117                             0
                            35FE16F4          1         26 14155                             0
                            35FD4E6C          0          0                                   0
                            35FE00AC          2        279 14117                             0
                            35FE2D3C          0          0                                   0
                            35FE7014          2        335 14117                             0
--挑选username和status为null的会话
SQL> select spid,program from v$process where addr in (
  2  '35FE0BD0',
  3  '35FE865C',
  4  '35FE7B38',
  5  '35FE16F4',
  6  '35FD4E6C',
  7  '35FE00AC',
  8  '35FE2D3C',
  9  '35FE7014'
 10  );
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
                                                 PSEUDO
14179                                            oracle@xifenfei (D000)
14183                                            oracle@xifenfei (S000)
15300                                            oracle@xifenfei (TNS V1-V3)
15318                                            oracle@xifenfei (TNS V1-V3)
14252                                            oracle@xifenfei (N000)
14256                                            oracle@xifenfei (L000)
14260                                            oracle@xifenfei (L001)
8 rows selected.
--同样可以发现spid 15300和15318的进程已经在数据库中被kill掉

找出kill掉的spid方法3(11g特有)

SQL> select  spid,program  from v$process where addr in
  2  (select creator_addr from v$session where sid in(133,143));
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
15300                                            oracle@xifenfei (TNS V1-V3)
15318                                            oracle@xifenfei (TNS V1-V3)

找出kill掉的spid方法4(11g特有)

SQL> select * from V$DETACHED_SESSION;
      INDX PG_NAME                                                             SID    SERIAL#        PID
---------- ------------------------------------------------------------ ---------- ---------- ----------
         0 DEFAULT                                                             143         21         21
         1 DEFAULT                                                             133         53         19
SQL> select spid,program from v$process where pid in(21,19);
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
15300                                            oracle@xifenfei (TNS V1-V3)
15318                                            oracle@xifenfei (TNS V1-V3)

ROW CACHE LOCK基础说明
ROW CACHE LOCK等待事件是一个共享池相关的等待事件。是由于对于字典缓冲的访问造成的。
P1 – Cache Id
P2 – Mode Held
P3 – Mode Requested

mode 和REQUEST的取值：
KQRMNULL 0 null mode – not locked
KQRMS 3 share mode
KQRMX 5 exclusive mode
KQRMFAIL 10 fail to acquire instance lock

如果是RAC/OPS环境，前台进程发出锁请求，LCK0进程发出锁请求。如果是单实例模式，由前台进程直接发出锁请求。
在RAC/OPS环境下，前台进程会循环等待锁的获取，最多会等待60秒钟。在单实例环境，前台进程会循环1000次，等待3秒钟。PMON进程无论在哪种模式，都会等待5秒钟。
要注意的是单实例模式下和多实例模式下申请该锁调用的模块是不同的（kqrget()- 单实例，kqgigt()- 多实例）。
如果发现这个等待十分高，一般来说可能由于2种原因，一是共享池太小了，需要增加共享池，另外一种情况是SQL分析过于频繁，对于共享池的并发访问量过大。对于任何一种情况，绝大多数情况下加大共享池会有助于降低该等待，不过加大共享池的时候也要注意，并不一定所有的情况下增加共享池都会有明显的效果，特别是对于第二种情况，精确的分析十分重要。另外进一步分析，弄清楚哪些ROW CACHE的等待最为严重，有助于解决问题。

SQL查询

--查询row cache lock等待
select *  from v$session_wait where wait_class = 'row cache lock';
--查询rowcache 名称
select * from v$rowcache where cache# = &p1;

ENQUEUE TYPE
DC_TABLESPACES
Probably the most likely cause is the allocation of new extents. If extent sizes are set low then the application may constantly be requesting new extents and causing contention. Do you have objects with small extent sizes that are rapidly growing? (You may be able to spot these by looking for objects with large numbers of extents). Check the trace for insert/update activity, check the objects inserted into for number of extents.

DC_SEQUENCES
Check for appropriate caching of sequences for the application requirements.

DC_USERS
Deadlock and resulting “WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!” can occur if a session issues a GRANT to a user, and that user is in the process of logging on to the database.

DC_SEGMENTS
This is likely to be down to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose.

DB_ROLLBACK_SEGMENTS
This is due to rollback segment allocation. Just like dc_segments,identify what is holding the enqueue and also generate errorstacks. Remember that on a multi-node system (RAC) the holder may be on another node and so multiple systemstates from each node will be required.

DC_AWR_CONTROL
This enqueue is related to control of the Automatic Workload Repository. As such any operation manipulating the repository may hold this so look for processes blocking these.

Process OS id : xxxxx alive after kill

Process OS id : xxxxx alive after kill警告

Mon May 21 04:55:06 2012
Shutting down instance (immediate)
License high water mark = 373
Mon May 21 04:55:06 2012
Stopping Job queue slave processes
Mon May 21 04:55:06 2012
Job queue slave processes stopped
All dispatchers and shared servers shutdown
Mon May 21 04:55:14 2012
Process OS id : 9922 alive after kill
Errors in file
Mon May 21 04:55:16 2012
Process OS id : 8159 alive after kill
Errors in file /oracle/admin/resultdb/udump/resultdb_ora_14639.trc
Mon May 21 04:55:17 2012
Process OS id : 8285 alive after kill
Errors in file /oracle/admin/resultdb/udump/resultdb_ora_14639.trc
Mon May 21 04:55:33 2012
ALTER DATABASE CLOSE NORMAL

错误原因

On some platforms it takes some time to kill processes--AIX being one of those platforms
There have been previous reports of shutdown taking time on AIX and after all Oracle waits were taken out of
the picture it was determined to be due to the way the kill command is implemented on that platform.
Bug 4931101 ERRORS IN ALERT LOG DURING SHUTDOWN

处理建议

Ignore the error  as all processes will be closed and shutdown will complete successfully.

补充说明
本次出问题的数据库是运行在 linux 平台上的 10.2.0.3

主键表插入数据不提交,外键表插入数据被阻塞

有客户和我说:他在含主外键的表中实验发现,在主表数据未提交,然后在外键表插入该数据数据时,出现外键表hang住现象.我开始以为是不同的会话,根据oracle数据库的一致性原则,应该新会话在外键表中不能知道这个记录的存在,直接报错.
可是我实验结果证明:外键表会被阻塞.分析原因如下：
模拟环境

SQL> create table t_p(id number primary key,name varchar2(100));
Table created.
SQL> create table t_f(fid number primary key,pid number, foreign key(pid) references  t_p(id));
Table created.
--会话1
SQL> insert into t_p values(1,'xifenfei');
1 row created.
SQL> commit;
Commit complete.
--会话2
SQL> insert into t_f values(1,1);
1 row created.
SQL> commit;
--会话1
SQL> insert into t_p values(2,'XIFENFEI');
1 row created.
--会话2
SQL> insert into t_f values(2,2);
--hang住

通过实验发现,当主键数据没有提交,然后在外键表中插入该数据外键数据时,该条记录会处于hang住状态(等待),那是什么原因导致了这个等待呢？对会话2做一个10046的trace,发现如下

*** 2012-05-17 17:25:41.757
WAIT #3065187488: nam='enq: TX - row lock contention' ela= 27002895 name|mode=1415053316 usn<<16
| slot=262151 sequence=588 obj#=-1 tim=1337246741756917
EXEC #3065187488:c=4000,e=27004456,p=0,cr=2,cu=14,mis=0,r=0,dep=0,og=1,
plh=0,tim=1337246741757690
ERROR #3065187488:err=1013 tim=1337246741757751
STAT #3065187488 id=1 cnt=0 pid=0 pos=1 obj=0 op='LOAD TABLE CONVENTIONAL
(cr=0 pr=0 pw=0 time=12 us)'
WAIT #3065187488: nam='SQL*Net break/reset to client' ela= 581 driver
id=1650815232 break?=0 p3=0 obj#=-1 tim=1337246741782587
WAIT #3065187488: nam='SQL*Net message to client' ela= 2 driver id=1650815232
#bytes=1 p3=0 obj#=-1 tim=1337246741782668

通过这个trace发现,是因为TX锁导致了外键表上的插入操作被阻塞.出现该问题的原因
有两种可能:1.两次插入(主键表和外键表分别插入)在主键表上有不兼容锁;2.外键表上有不兼容性锁.

使用oradebug跟踪会话

oradebug setmypid
--EVENT 10704跟踪锁的使用情况
oradebug EVENT 10704 trace name context forever,level 10
--插入数据操作
oradebug EVENT 10704 trace name context off
oradebug TRACEFILE_NAME

跟踪主键表插入数据

*** 2012-05-17 19:05:52.410
ksqgtl *** TM-00012892-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343c0e54, ktcdix=2147483647, topxcb=0x343c0e54
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:05:52.411
ksucti: init txn DID from session DID
ksqgtl:
        ksqlkdid: 0001-0013-0000000F
*** 2012-05-17 19:05:52.429
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0013-0000000F
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0013-0000000F
ksqgtl: RETURNS 0
*** 2012-05-17 19:05:52.430
ksqgtl *** TM-00012894-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343c0e54, ktcdix=2147483647, topxcb=0x343c0e54
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:05:52.430
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0013-0000000F
*** 2012-05-17 19:05:52.430
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0013-0000000F
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0013-0000000F
ksqgtl: RETURNS 0
*** 2012-05-17 19:05:52.431
ksqgtl *** TX-00050019-00000307 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x343c0e54, ktcdix=2147483647, topxcb=0x343c0e54
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:05:52.431
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0013-0000000F
*** 2012-05-17 19:05:52.431
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0013-0000000F
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0013-0000000F
ksqgtl: RETURNS 0
SQL> SELECT TO_NUMBER(12892,'xxxxxxx') from dual;
TO_NUMBER(12892,'XXXXXXX')
--------------------------
                     75922
SQL> SELECT TO_NUMBER(12894,'xxxxxxx') from dual;
TO_NUMBER(12894,'XXXXXXX')
--------------------------
                     75924
SQL> select object_name from dba_objects where object_id in(75922,75924);
OBJECT_NAM
----------
T_P
T_F

通过锁使用情况跟踪可以知道,在主键表插入一条记录时,先在主键表获得TM锁,然后外键表获得TM锁,最后主键表获得TX MODE=6的锁。

跟踪外键表插入数据

*** 2012-05-17 19:49:24.912
ksqgtl *** TM-00012892-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.912
ksucti: init txn DID from session DID
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.913
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
ksqgtl: RETURNS 0
*** 2012-05-17 19:49:24.913
ksqgtl *** TM-00012894-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.913
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.913
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
ksqgtl: RETURNS 0
*** 2012-05-17 19:49:24.913
ksqgtl *** TX-0002001f-0000034a mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.913
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.914
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
ksqgtl: RETURNS 0
*** 2012-05-17 19:49:24.914
ksqgtl *** TX-00050019-00000307 mode=4 flags=0x10021 timeout=21474836 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.914
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.914
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
*** 2012-05-17 19:49:24.914
ksqcmi: TX,50019,307 mode=4 timeout=21474836

从这里可以发现:先在主键表和外键表上加上TM锁,然后外键表获得TX MODE=6的锁(这边成功,因为该表上未有其他级别不兼容锁),再需要在主键表上获得TX MODE=4(表结构共享锁+所有记录共享锁),但是这个时候,发现该锁上已经在主键表插入数据未提交的时候,已经含有了TX MODE=6的锁,从而使得TX MODE=4无法获得,从而使得外键表插入数据处于阻塞状态.

rac redo log file被意外覆盖数据库恢复

当前日志被覆盖导致错误
朋友的一客户在一套rac上包含了两个数据库,其其中一个库增加redo group时候,覆盖了另外一个库的redo,悲剧的是刚好是current redo

Wed May 16 17:03:05 2012
ALTER DATABASE OPEN
This instance was first to open
Wed May 16 17:03:09 2012
Beginning crash recovery of 2 threads
 parallel recovery started with 15 processes
Wed May 16 17:03:11 2012
Started redo scan
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
Abort recovery for domain 0
Wed May 16 17:03:11 2012
Aborting crash recovery due to error 305
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
ORA-305 signalled during: ALTER DATABASE OPEN...
Wed May 16 17:03:13 2012
Shutting down instance (abort)

使用_allow_resetlogs_corruption= TRUE进行恢复

Wed May 16 18:16:48 2012
SMON: enabling cache recovery
Wed May 16 18:16:48 2012
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Wed May 16 18:16:48 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773],
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773],
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 2105454
ORA-1092 signalled during: alter database open resetlogs...

ORA-600[KCLCHKBLK_4], is signaled because the SCN in a tempfile block is too high.
The same reason caused the ORA-600[2662]s in the alert logs.
因为是临时文件的scn太大的问题,那就比较好解决:
启动数据库到mount状态,查询出来相关temp file,然后drop掉.

ORA-00600[6856]

Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []
ORACLE Instance odsdb1 (pid = 16) - Error 600 encountered while recovering transaction
(10, 8） on object 7162533.
Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []

这里的4193和4194是比较熟悉的,根据这里的提示猜测6856也是和undo有关系
ORA-600[6856]SMON is trying to recover a dead transaction.
But the undo application runs into an internal error (trying to delete a row that is already deleted).
因为smon回滚的时候出现上面错误,解决方法是想办法终止回滚,使用event=”10513 trace name context forever, level 2″.

ORA-00600[4193]/ORA-00600[4194]

Wed May 16 20:25:17 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:18 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:56 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00600: internal error code, arguments: [4193], [22248], [22252], [], [], [], [], []

太常见错误,不再做说明,虽然使用event是的库open成功,因为部分回滚段有问题,该错误还是会出现(还是喜欢直接屏蔽回滚段)

ORA-00600[ktpridestroy2]

Wed May 16 20:36:26 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2101296.trc:
ORA-00600: internal error code, arguments: [ktpridestroy2], [], [], [], [], [], [], []

This error could be the result of a corruption and involves the parallel rollback that SMON enables each startup.
解决:fast_start_parallel_rollback=false

ORA-00600[kturacf1]/ORA-00600[kcbgcur_9]

Wed May 16 20:49:15 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_2007088.trc:
ORA-00600: internal error code, arguments: [kturacf1], [2097152], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcbgcur_9], [780140563], [4], [4294959056], [2097152], [], [], []

ORA-00600[kturacf1]错误未查询到原因
ORA-00600[kcbgcur_9]错误原因可能是:Buffers are pinned in a specific class order to prevent internal deadlocks.
因为这两个错误是job产生非致命错误,在这次的处理过程中可以忽略

ORA-00600[4097]

Wed May 16 21:05:05 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_1716282.trc:
ORA-12012: error on auto execute of job 6603
ORA-20001: ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
ORA-06512: at "EPBI.UP_SYSLOG_ONLINE_USER", line 141
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []

When an instance has a rollback segment offline and the instance crashes, or
the user does a shutdown abort, the rollback segment wrap number does not get
updated. If that segment is then dropped and recreated immediately after the
instance is restarted, the wrap number could be lower than existing wrap
numbers. This will cause the ORA-600[4097] to occur in subsequent
transactions using Rollback.
这个错误也是因为回滚段wrap number未被及时更新导致的异常.

总结这次恢复过程
1.因当前redo丢失,使用隐含参数打开库,然后出现ORA-600[KCLCHKBLK_4](这个错误比较少见,更何况这个库是10.2.0.4）
2.undo出现问题出现ORA-00600[6856]错误不是很多见
3.接下来都是一些列undo导致的问题,其实如果开始就直接使用隐含参数删除掉有问题回滚段,效果可能会比event好.
4.因为部分trace文件没有拿到,未做深入分析,文章列出整体恢复思路
5.本次恢复的数据库版本是10.2.0.4,根据mos描述,很多错误应不会发生,但是实际还是发生了,MOS的版本范围,不要太看重.
6.其实这篇文章的本质不是展示恢复过程,而是再一次提醒:操作数据库慎重,特别是一台机器上多套库.

library cache lock等待事件

Library cache lock介绍
Oracle利用Library cache lock和Library cache pin来实现并发控制，Library cache lock是在handle上获取的，而Library cache pin则是在data heap上获取。访问对象时，首先必须获取handle上的lock，然后将访问的数据pin在内存中。lock的作用是控制进程间的并发访问，而pin的作用是保证数据一致性，防止数据在访问时被交换出去。
lock和pin的实现类似于enqueue，在每个handle上都有lock和pin的holder list和waiter list，用来保存持有该资源和等待该资源的队列。

Library cache lock相关sql语句

--找出library cache lock等待sid,saddr信息
select sid,saddr from v$session where event= 'library cache lock';
SID        SADDR
---------- --------
16         572ed244
--找出blocked信息
select kgllkhdl Handle,kgllkreq Request, kglnaobj Object
from x$kgllk where kgllkses = '572ed244'
and kgllkreq > 0;
HANDLE   REQUEST   OBJECT
-------- ---------- ------------------
62d064dc          2 EMPLOYEES
--找出blocking信息
select kgllkses saddr,kgllkhdl handle,kgllkmod mod,kglnaobj object
from x$kgllk lock_a
where kgllkmod > 0
and exists (select lock_b.kgllkhdl from x$kgllk lock_b
where kgllkses = '572ed244' /* blocked session */
and lock_a.kgllkhdl = lock_b.kgllkhdl
and kgllkreq > 0);
SADDR     HANDLE   MOD         OBJECT
--------  -------- ---------- ------------
572eac94  62d064dc          3  EMPLOYEES
--blocking 会话信息
select sid,username,terminal,program from v$session where saddr = '572eac94'
SID        USERNAME     TERMINAL  PROGRAM
---------- -----------  --------- --------------------------------------------
12          SCOTT        pts/20    sqlplus@goblin.forgotten.realms (TNS V1-V3)
--所有blocked 会话
select sid,username,terminal,program from v$session
where saddr in
(select kgllkses from x$kgllk lock_a
 where kgllkreq > 0
 and exists (select lock_b.kgllkhdl from x$kgllk lock_b
             where kgllkses = '572eac94' /* blocking session */
             and lock_a.kgllkhdl = lock_b.kgllkhdl
             and kgllkreq = 0)
);
SID        USERNAME  TERMINAL  PROGRAM
---------- --------- --------- -------------------------------------------
13         SCOTT     pts/22    sqlplus@goblin.forgotten.realms (TNS V1-V3)
16         SCOTT     pts/7     sqlplus@goblin.forgotten.realms (TNS V1-V3)

library cache pin等待事件

library cache pin说明
library cache pin 事件是用来管理library cache的并发访问的, pin一个object会引起相应的heap被载入内存中,如果客户端需要修改或检测这个object它就必须在锁住后取得一个pin.library cache pin的等待时间为3秒钟,其中有1秒钟用于PMON后台进程,即在取得pin之前最多等待3秒钟,否则就超时.library cache pin通常是发生在编译或重新编译PL/SQL,VIEW,TYPES等object时.编译通常都是显性的,如安装应用程序,升级,安装补丁程序等,但object的重新编译也可能发生在object变得无效时.library cache pin的参数如下,有用的主要是P1和P2:
P1 – KGL Handle address.
P2 – Pin address
P3 – 10*Mode + Namespace
其中,P1,P2可与x$kglpn和x$kglob表相关.x$kglpn和x$kglob是ORACLE数据库的内部数据字典.
x$kglpn library cache pin信息
x$kglob library cache object信息

查询方法一

--通过查询V$SESSION_WAIT找出正在等待”library cache pin”的session
SELECT sid,
       SUBSTR (event, 1, 30),
       TO_CHAR(p1, 'xxxxxxxx') p1_16,
       --P1RAW P1_16,
       p2,
       p3
  FROM v$session_wait
 WHERE wait_time = 0 AND event LIKE 'library cache pin%';
--P1 列是Library Cache Handle Address
--P2 列是Library Cache Pin Address.
--找到相关session pin状态
SELECT ADDR,
       INDX,
       KGLPNADR,-- Library Cache Pin Address
       KGLPNUSE,
       KGLPNSES,--识别锁住此pin 的session
       KGLPNHDL,--Library Cache Handle Address
       kGLPNLCK,
       KGLPNMOD,-- Pin 锁
       KGLPNREQ-- Pin 请求
  FROM x$kglpn
 WHERE KGLPNHDL LIKE '%EB3EB8%';--p1_16
 --询X$KGLOB (Library Cache Object),可找到相关的object
SELECT KGLNAOBJ-- 相关object的名字(取前面80个字符)
  FROM X$KGLOB
 WHERE KGLHDADR LIKE '%EB3EB8%';--p1_16
 --查出占着pin锁的session目前正在做什么
SELECT a.sid, a.username, a.program
  FROM v$session a, x$kglpn b
 WHERE a.saddr = b.kglpnuse AND b.kglpnhdl LIKE '%EB3EB8%'--p1_16
  AND b.kgnmod <> 0;
   --查出阻塞者正执行的SQL语句
 SELECT sid, sql_text
  FROM v$session, v$sqlarea
 WHERE v$session.sql_address = v$sqlarea.address AND sid =&sid;

查询方法二

--通过查询DBA_LOCK_INTERNAL和V$SESSION_WAIT,可得到与”library cache pin” 等待相关的object的名字
SELECT TO_CHAR (SESSION_ID, '999') sid,
       SUBSTR (LOCK_TYPE, 1, 30) TYPE,
       SUBSTR (lock_id1, 1, 23) Object_Name,
       SUBSTR (mode_held, 1, 4) HELD,
       SUBSTR (mode_requested, 1, 4) REQ,
       lock_id2 Lock_addr
  FROM dba_lock_internal
 WHERE mode_requested <> 'None' AND mode_requested <> mode_held
       AND session_id IN
              (SELECT sid
                 FROM v$session_wait
                WHERE wait_time = 0 AND event LIKE 'library cache pin%');
 --查出”library cache pin”占有者(即阻塞者)的session id
 SELECT sid Holder,
       KGLPNUSE Sesion,
       KGLPNMOD Held,
       KGLPNREQ Req
  FROM sys.x$kglpn, v$session
 WHERE KGLPNHDL IN (SELECT p1raw
                      FROM v$session_wait
                     WHERE wait_time = 0 AND event LIKE 'library cache pin%')
       AND KGLPNMOD <> 0
       AND v$session.saddr = x$kglpn.kglpnuse;
 --查出”library cache pin”占有者(阻塞者)正在等什么
 SELECT sid, SUBSTR (event, 1, 30), wait_time
  FROM v$session_wait
 WHERE sid IN
          (SELECT sid
             FROM x$kglpn, v$session
            WHERE KGLPNHDL IN
                     (SELECT p1raw
                        FROM v$session_wait
                       WHERE wait_time = 0
                             AND event LIKE 'library cache pin%')
                  AND KGLPNMOD <> 0
                  AND v$session.saddr = x$kglpn.kglpnuse);
 --查出阻塞者正执行的SQL语句
 SELECT sid, sql_text
  FROM v$session, v$sqlarea
 WHERE v$session.sql_address = v$sqlarea.address AND sid =&sid;

cursor: pin S wait on X 等待事件

cursor: pin S整体描述

cursor: pin S A session waits on this event when it wants to update a shared mutex pin and another session
is currently in the process of updating a shared mutex pin for the same cursor object. This wait event should
rarely be seen because a shared mutex pin update is very fast.(Wait Time: Microseconds)
--Parameter说明
P1 Hash value of cursor
P2 Mutex value
64 bit platforms
8 bytes are used.
Top 4 bytes hold the session id (if the mutex is held X)
Bottom 4 bytes hold the ref count (if the mutex is held S).
32 bit platforms
4 bytes are used.
Top 2 bytes hold the session id (if the mutex is held X)
Bottom 2 bytes hold the ref count (if the mutex is held S).
P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps
--查询sql
SELECT a.*, s.sql_text
  FROM v$sql s,
       (SELECT sid,
               event,
               wait_class,
               p1 cursor_hash_value,
               p2raw Mutex_value,
               TO_NUMBER (SUBSTR (p2raw, 1, 8), 'xxxxxxxx') hold_mutex_x_sid
          FROM v$session_wait
         WHERE event LIKE 'cursor%') a
 WHERE s.HASH_VALUE = a.cursor_hash_value

cursor: pin S wait on X描述

- In previous versions of Oracle, library cache pin is protected by “library cache pin latch”.
- But in recent versions of Oracle(I believe it’s 10.2.0.2),
  library cache pin for the cursor LCO is protected by mutext.
- Mutex is allocated per LCO, so it enables fine-grained access control.
“cursor: pin S wait on X” wait event is mostly related to mutex and hard parse.
- When a process hard parses the SQL statement, it should acquire exclusive
  library cache pin for the corresponding LCO.
- This means that the process acquires the mutex in exclusive mode.
- Another process which also executes the same query needs to acquire the mutex
  but it’s being blocked by preceding process. The wait event is “cursor: pin S wait on X”.
--发生cursor: pin S wait on X原因
Frequent Hard Parses
If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.
High Version Counts
When Version counts become excessive, a long chain of versions needs to
be examined and this can lead to contention on this event
Known bugs
Bug 5907779 - Self deadlock hang on "cursor: pin S wait on X" (typically from DBMS_STATS) [ID 5907779.8]
Bug 7568642: BLOCKING_SESSION EMPTY FOR "CURSOR: PIN S WAIT ON X"

ORA-00600[kcratr1_lostwrt]/ORA-00600[3020]错误恢复

open数据库alert日志报ORA-00600[kcratr1_lostwrt]错误

Mon May 14 14:57:28 2012
ALTER DATABASE OPEN
Mon May 14 14:57:29 2012
Beginning crash recovery of 1 threads
Mon May 14 14:57:29 2012
Started redo scan
Mon May 14 14:57:29 2012
Errors in file d:\oracle\admin\cqgasold\udump\cqgasold_ora_504.trc:
ORA-00600: 内部错误代码，参数: [kcratr1_lostwrt], [], [], [], [], [], [], []
ORA-600 signalled during: alter database open...

查询相关SCN
同一个查询中SCN相同,省略

SQL> select file#,online_status "STATUS",to_char(change#,'9999999999999999') "SCN",
  2  To_char(time,'yyyy-mm-dd hh24:mi:ss')"TIME" from v$recover_file;
未选定行
SQL> select file#,to_char(checkpoint_change#,'999999999999999') "SCN",
  2  to_char(last_change#,'999999999999999')"STOP_SCN" from v$datafile;
     FILE# SCN              STOP_SCN
---------- ---------------- ----------------
         1       7842987188
         2       7842987188
         3       7842987188
SQL> select file#,to_char(checkpoint_change#,'9999999999999999') "SCN",
  2  to_char(RESETLOGS_CHANGE#,'9999999999999999') "RESETLOGS SCN"
  3  from v$datafile_header;
     FILE# SCN               RESETLOGS SCN
---------- ----------------- -----------------
         1        7842991811                 1
         2        7842991811                 1
         3        7842991811                 1

这里看到奇怪现象datafile scn小于datafile_header scn,数据库异常断电一般来说也不会出现这样的情况,个人猜测是错误的恢复或者使用历史控制文件导致,对于这样的现状,我先尝试着使用using backup controlfile方式恢复,结果失败.估计控制文件有异常,本着先拉起库原则,重建控制文件.

进行完全恢复

SQL> recover database;
ORA-00283: 恢复会话因错误而取消
ORA-00600: 内部错误代码，参数: [3020], [8388617], [1], [23403], [25], [112],[], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 9)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'
ORA-10560: block type 'KTU SMU HEADER BLOCK'

尝试跳过坏块继续恢复

SQL> recover database allow 1 corruption;
ORA-00283: 恢复会话因错误而取消
ORA-00600: 内部错误代码，参数: [3020], [8388610], [1], [23403], [2264], [16],[], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 2)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'
ORA-10560: block type 'KTFB Bitmapped File Space Header'

使用dbv检查坏块数量

C:\>dbv file='d:\oracle\oradata\cqgasold\undo_1.dbf' blocksize=8192
DBVERIFY: Release 9.2.0.5.0 - Production on 星期二 5月 15 19:43:42 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
DBVERIFY - 验证正在开始 : FILE = d:\oracle\oradata\cqgasold\undo_1.dbf
DBV-00200: 块, dba 8388617, 已经标记为崩溃
汇入的页369 - 可能是介质损坏
***
Corrupt block relative dba: 0x00800171 (file 2, block 369)
Fractured block found during dbv:
Data in bad block -
 type: 2 format: 2 rdba: 0x00800171
 last change scn: 0x0001.d37c57db seq: 0x1 flg: 0x04
 consistency value in tail: 0x4e930260
 check value in block header: 0x8202, computed block checksum: 0x4e5f
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***
汇入的页417 - 可能是介质损坏
***
Corrupt block relative dba: 0x008001a1 (file 2, block 417)
Fractured block found during dbv:
Data in bad block -
 type: 2 format: 2 rdba: 0x008001a1
 last change scn: 0x0001.d37c53d4 seq: 0x2 flg: 0x04
 consistency value in tail: 0x4b6b0201
 check value in block header: 0x6ae7, computed block checksum: 0x5abc
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***
…………
--类此记录很多,我放弃了跳过坏块修复的方法

恢复过程中提示坏块数据库文件离线恢复

SQL> alter database datafile 'd:\oracle\oradata\cqgasold\undo_1.dbf' offline;
数据库已更改。
SQL> recover database;
完成介质恢复。
SQL> alter database open;
alter database open
*
ERROR 位于第 1 行:
ORA-00604: 递归 SQL 层 1 出现错误
ORA-00376: 此时无法读取文件 2
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'

到了这一步,根据经验,数据库被open的可能性很多了,很有可能是open以后因为smon回滚导致数据库down

查看日志,屏蔽回滚段,完成恢复

Tue May 15 19:59:52 2012
alter database open
Tue May 15 19:59:52 2012
Beginning crash recovery of 1 threads
Tue May 15 19:59:52 2012
Started redo scan
Tue May 15 19:59:52 2012
Completed redo scan
 323 redo blocks read, 82 data blocks need recovery
Tue May 15 19:59:52 2012
Started recovery at
 Thread 1: logseq 23404, block 3, scn 0.0
Recovery of Online Redo Log: Thread 1 Group 4 Seq 23404 Reading mem 0
  Mem# 0 errs 0: F:\ORACLE\ORADATA\LOGCQGASOLD4.ORA
Tue May 15 19:59:52 2012
Completed redo application
Tue May 15 19:59:52 2012
Ended recovery at
 Thread 1: logseq 23404, block 326, scn 1.3548264979
 82 data blocks read, 82 data blocks written, 323 redo blocks read
Crash recovery completed successfully
Tue May 15 19:59:53 2012
Thread 1 advanced to log sequence 23405
Thread 1 opened at log sequence 23405
  Current log# 2 seq# 23405 mem# 0: D:\ORACLE\ORADATA\CQGASOLD\REDO02.LOG
Successful open of redo thread 1
Tue May 15 19:59:53 2012
SMON: enabling cache recovery
SMON: enabling tx recovery
Tue May 15 19:59:54 2012
Database Characterset is ZHS16GBK
Tue May 15 19:59:55 2012
replication_dependency_tracking turned off (no async multimaster replication found)
ORA-604 signalled during: alter database open...
Tue May 15 19:59:56 2012
SMON: about to recover undo segment 1
SMON: mark undo segment 1 as needs recovery
SMON: about to recover undo segment 2
SMON: mark undo segment 2 as needs recovery
SMON: about to recover undo segment 3
SMON: mark undo segment 3 as needs recovery
SMON: about to recover undo segment 4
SMON: mark undo segment 4 as needs recovery
SMON: about to recover undo segment 5
SMON: mark undo segment 5 as needs recovery
SMON: about to recover undo segment 6
SMON: mark undo segment 6 as needs recovery
SMON: about to recover undo segment 7
SMON: mark undo segment 7 as needs recovery
SMON: about to recover undo segment 8
SMON: mark undo segment 8 as needs recovery
SMON: about to recover undo segment 9
SMON: mark undo segment 9 as needs recovery
SMON: about to recover undo segment 10
SMON: mark undo segment 10 as needs recovery
Tue May 15 20:00:37 2012
Shutting down instance (abort)

看到这里,可以大概确定是因为undo文件离线,导致回滚段异常.
这个问题,基本上可以确定通过隐含参数屏蔽回滚段,然后open数据库,重建undo删除异常undo,数据库恢复完成。