sql_id和hash value的部分转换

从oracle 10g开始引进了sql_id,在老版本的oralce中,要表明一条sql,一般使用hash value,而在10g及其以后版本中一般建议使用sql_id,从9i的sp和10g的awr中也可以看出.对于Library Cache对象,Oracle使用MD5算法进行哈希,生成一个128位的Hash Value,其中低32位作为HASH VALUE显示,SQL_ID则取了后64位.既然hash value和sql_id之前存在着这样的关系,那么我们就可以通过函数实现两者的部分转换(因为最终取值长度不同,所以不能完全转换)
1.查询sql_id和hash value

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss')
  2  "www.xifenfei.com" from dual;
www.xifenfei.com
-------------------
2012-05-26 01:05:39
SQL> select sql_id,hash_value from v$sql where sql_text like
  2  'select * from dual';
SQL_ID        HASH_VALUE
------------- ----------
a5ks9fhw2v9s1  942515969

2.oracle自带函数转换sql_id to hash value

SQL> select dbms_utility.SQLID_TO_SQLHASH('a5ks9fhw2v9s1') hash_value FROM DUAL;
HASH_VALUE
----------
 942515969

3.自己编写函数sql_id to hash value

SQL> CREATE OR REPLACE FUNCTION sql_id_2_hash_value (sql_id VARCHAR2)
  2     RETURN NUMBER
  3  IS
  4     l_output   NUMBER := 0;
  5  BEGIN
  6         SELECT TRUNC (
  7                   MOD (
  8                      SUM (
  9                         (INSTR ('0123456789abcdfghjkmnpqrstuvwxyz',
 10                                 SUBSTR (LOWER (TRIM (sql_id)), LEVEL, 1))
 11                          - 1)
 12                         * POWER (32, LENGTH (TRIM (sql_id)) - LEVEL)),
 13                      POWER (2, 32)))
 14           INTO l_output
 15           FROM DUAL
 16     CONNECT BY LEVEL <= LENGTH (TRIM (sql_id));
 17     RETURN l_output;
 18  END;
 19  /
函数已创建。
SQL> select sql_id_2_hash_value('a5ks9fhw2v9s1') hash_value FROM DUAL;
HASH_VALUE
----------
 942515969

4.hash value 转换为部分 sql_id

SQL> CREATE OR REPLACE FUNCTION hash_value_2_sql_id (p_hash_value NUMBER)
  2     RETURN VARCHAR2
  3  IS
  4     l_output   VARCHAR2 (8) := '';
  5  BEGIN
  6     FOR i
  7        IN (    SELECT SUBSTR (
  8                          '0123456789abcdfghjkmnpqrstuvwxyz',
  9                          1
 10                          + FLOOR (
 11                               MOD (p_hash_value / (POWER (32, LEVEL - 1)), 32)),
 12                          1)
 13                          sqlidchar
 14                  FROM DUAL
 15            CONNECT BY LEVEL <= LN (p_hash_value) / LN (32)
 16              ORDER BY LEVEL DESC)
 17     LOOP
 18        l_output := l_output || i.sqlidchar;
 19     END LOOP;
 20
 21     RETURN l_output;
 22  END;
 23  /
函数已创建。
SQL> select hash_value_2_sql_id(942515969) from dual;
HASH_VALUE_2_SQL_ID(942515969)
--------------------------------------------------------
2v9s1

参考:http://blog.tanelpoder.com/2009/02/22/sql_id-is-just-a-fancy-representation-of-hash-value/

使用exp/imp 导入11g数据到9i

现在有个需求,需要使用exp/imp导入11g的数据库数据到9i中,解决这个问题一般来说想到三种方法思路,一个个尝试(其实从高版本服务端支持低版本客户端的原则,可以大概的猜测出使用9i的客户端处理该问题)
方法1:导出导入都使用11g客户端

--11g客户端导出
[oracle@xifenfei ~]$ exp chf/xifenfei file=/tmp/t_xifenfei.dmp
>log=/tmp/t_xifenfei.log tables=chf.t_xifenfei
Export: Release 11.2.0.3.0 - Production on Fri May 18 18:15:18 2012
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)
About to export specified tables via Conventional Path ...
. . exporting table                     T_XIFENFEI          2 rows exported
Export terminated successfully without warnings.
--11g客户端导入
[oracle@xifenfei ~]$ imp chf/xifenfei@ora9i file=/tmp/t_xifenfei_11g.dmp
>log=/tmp/t_xifenfei.log tables=chf.t_xifenfei
Import: Release 11.2.0.3.0 - Production on Fri May 18 18:17:24 2012
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
IMP-00058: ORACLE error 6550 encountered
ORA-06550: line 1, column 33:
PLS-00302: component 'SET_NO_OUTLINES' must be declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored
IMP-00000: Import terminated unsuccessfully

这个错误是版本不兼容导致:PLS-00302: component ‘SET_NO_OUTLINES’ must be declared

方法2:11g客户端导出,9i客户端导入

--11g客户端导出
[oracle@xifenfei ~]$ exp chf/xifenfei file=/tmp/t_xifenfei.dmp
>log=/tmp/t_xifenfei.log tables=chf.t_xifenfei
Export: Release 11.2.0.3.0 - Production on Fri May 18 18:15:18 2012
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)
About to export specified tables via Conventional Path ...
. . exporting table                     T_XIFENFEI          2 rows exported
Export terminated successfully without warnings.
--传输到9i
[oracle@xifenfei tmp]$ scp t_xifenfei.dmp 192.168.1.10:/tmp/
The authenticity of host '192.168.1.10 (192.168.1.10)' can't be established.
RSA key fingerprint is 3d:0c:d1:4b:45:bd:a3:f5:25:eb:4d:52:d2:32:03:69.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.10' (RSA) to the list of known hosts.
oracle@192.168.1.10's password:
t_xifenfei.dmp                          100%   56KB  56.0KB/s   00:00
--9i客户端导入
[oracle@xifenfei ~]$ imp chf/xifenfei file=/tmp/t_xifenfei.dmp tables=t_xifenfei
Import: Release 9.2.0.4.0 - Production on Thu May 24 23:32:18 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.4.0 - Production
IMP-00010: not a valid export file, header failed verification
IMP-00000: Import terminated unsuccessfully
--版本不兼容(高版本的dump文件低版本不能识别)

方法3:9i客户端导出,9i客户端导入

--9i客户端导出
[oracle@xifenfei ~]$ exp chf/xifenfei@ora11g file=/tmp/t_xifenfei_11g.dmp
>log=/tmp/t_xifenfei.log tables=chf.t_xifenfei
Export: Release 9.2.0.4.0 - Production on Thu May 24 23:37:20 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)
About to export specified tables via Conventional Path ...
. . exporting table                     T_XIFENFEI          2 rows exported
Export terminated successfully without warnings.
--9i客户端导入
[oracle@xifenfei log]$ imp chf/xifenfei file=/tmp/t_xifenfei_11g.dmp log=/tmp/xifenfei.log full=y
Import: Release 9.2.0.4.0 - Production on Fri May 25 03:22:14 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.4.0 - Production
Export file created by EXPORT:V09.02.00 via conventional path
import done in ZHS16GBK character set and AL16UTF16 NCHAR character setSegmentation fault
--导入数据遇到setSegmentation fault异常终止

解决setSegmentation fault异常终止

--修改exu9defpswitches视图
[oracle@xifenfei ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Fri May 18 22:29:00 2012
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> CREATE OR REPLACE VIEW exu9defpswitches (
  2                  compflgs, nlslensem ) AS
  3          SELECT  a.value, b.value
  4          FROM    sys.v$parameter a, sys.v$parameter b
  5          WHERE   a.name = 'plsql_code_type' AND
  6                  b.name = 'nls_length_semantics' ;
View created.
--9i导出11g数据
[oracle@xifenfei tmp]$ exp chf/xifenfei@ora11g file=/tmp/t_xifenfei_11g.dmp
>log=/tmp/xifenfei.log tables=t_xifenfei
Export: Release 9.2.0.4.0 - Production on Fri May 25 04:08:32 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)
About to export specified tables via Conventional Path ...
. . exporting table                     T_XIFENFEI          2 rows exported
Export terminated successfully without warnings.
--9i导入数据
[oracle@xifenfei tmp]$ imp chf/xifenfei file=/tmp/t_xifenfei_11g.dmp
>log=/tmp/xifenfei.log tables=t_xifenfei
Import: Release 9.2.0.4.0 - Production on Fri May 25 04:08:53 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.4.0 - Production
Export file created by EXPORT:V09.02.00 via conventional path
import done in ZHS16GBK character set and AL16UTF16 NCHAR character set
. importing CHF's objects into CHF
. . importing table                   "T_XIFENFEI"          2 rows imported
Import terminated successfully without warnings.
--至此导入成功,完成了11gr2数据导入到9ir2中

通过一系列的实验证明,需要把11g的数据导入到9i中,需要使用9i的客户端进行,其中exu9defpswitches视图需要重建,否则会出现setSegmentation fault异常,导致导入失败.

Hanganalyze分析会话阻塞—锁表

前两篇分别大概的介绍了Systemstates分析会话阻塞—锁表使用ass109.awk分析systemstate,这篇也大概的学习下hanganalyze.
模拟阻塞会话

--会话1
SQL> select * from v$version;
BANNER
-------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
SQL> select * from t_xifenfei;
        ID NAME
---------- ----------------------------------------
         1 xifenfei
         2 www.xifenfei
SQL> delete from t_xifenfei where id=2;
1 row deleted.
--会话2
SQL> delete from t_xifenfei where id=2;
--hang住

做hanganalyze

--sys登录
SQL> ORADEBUG setmypid
Statement processed.
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_13719.trc

分析hanganalyze 文件

--HANG ANALYSIS基本信息
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): ora11g.ora11g
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 15:53:16 ]
      NOTE: scheduling delay has not been sampled for 0.356486 secs 0.000000 secs from [ 15:53:12 - 15:53:17 ], 5 sec avg
    0.000000 secs from [ 15:52:17 - 15:53:17 ], 1 min avg
    0.000000 secs from [ 15:48:17 - 15:53:17 ], 5 min avg
===============================================================================
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'
     Chain 1 Signature Hash: 0x38c48850
===============================================================================
Non-intersecting chains:
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
--被阻塞会话信息
    Oracle session identified by:
    {
                instance: 1 (ora11g.ora11g)
                   os id: 13634
              process id: 21, oracle@xifenfei (TNS V1-V3)
              session id: 143
        session serial #: 281
    }
--等待信息
    is waiting for 'enq: TX - row lock contention' with wait info:
    {
                      p1: 'name|mode'=0x54580006
--54580006 is split into ASCII 54 + ASCII 58 (TX) + Mode 0006 (X) ...  在等待TX mode=6
                      p2: 'usn<<16 | slot'=0x20010
                      p3: 'sequence'=0x356
            time in wait: 1 min 56 sec
           timeout after: never
                 wait id: 24
                blocking: 0 sessions
             current sql: delete from t_xifenfei where id=2
             short stack: --省略
            wait history:
              * time between current wait and wait #1: 0.001471 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 10.776765 sec
                     wait id: 23              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000001 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000001 sec
                     wait id: 22              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000028 sec
              3.       event: 'SQL*Net message from client'
                 time waited: 0.000032 sec
                     wait id: 21              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }
    and is blocked by
 => Oracle session identified by:
--阻塞会话信息
    {
                instance: 1 (ora11g.ora11g)
                   os id: 13546
              process id: 20, oracle@xifenfei (TNS V1-V3)
              session id: 15
        session serial #: 189
    }
--该会话处于空闲状态
    which is waiting for 'SQL*Net message from client' with wait info:
    {
                      p1: 'driver id'=0x62657100
                      p2: '#bytes'=0x1
            time in wait: 2 min 26 sec
           timeout after: never
                 wait id: 29
                blocking: 1 session
             current sql: <none>
             short stack: --省略
            wait history:
              * time between current wait and wait #1: 0.000019 sec
              1.       event: 'SQL*Net message to client'
                 time waited: 0.000007 sec
                     wait id: 28              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.049656 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 9.759067 sec
                     wait id: 27              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000216 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000002 sec
                     wait id: 26              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }
Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'
Chain 1 Signature Hash: 0x38c48850
-------------------------------------------------------------------------------
===============================================================================
通过上述分析:大概可以得出sid=143因为请求enq: TX - row lock contention(TX mode=6)被sid=15阻塞

查询视图验证

SQL> select sid,event from v$session where wait_class#<>6;
       SID EVENT
---------- ------------------------------
        20 SQL*Net message to client
       143 enq: TX - row lock contention
SQL> select * from v$lock where type in('TX','TM');
ADDR     KADDR           SID TYPE        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- ---- ---------- ---------- ---------- ---------- ---------- ----------
352F8BEC 352F8C18        143 TX       131088        854          0          6       1862          0
B6B9C7A8 B6B9C7D8         15 TM        75928          0          3          0       1892          0
B6B9C7A8 B6B9C7D8        143 TM        75928          0          3          0       1862          0
343C0E54 343C0E94         15 TX       131088        854          6          0       1892          1
--查询结果sid=15的会话持有TX MODE=6阻塞sid=143的TX MODE=6的请求,和HANG ANALYSIS分析基本一致

参考文档:USING AND READING HANGANALYZE或者[ID 215858.1]

使用ass109.awk分析systemstate

本篇介绍工具ass109.awk使用,大大节约分析systemstate dump文件时间.当然如果要获得详细信息,还是需要人工去读相关进程的dump文件.
模拟会话被hang住

--会话1
SQL> select * from t_xifenfei;
        ID NAME
---------- ----------------------------------------
         1 xifenfei
         2 www.xifenfei
SQL> delete from t_xifenfei where id=1;
1 row deleted.
--会话2
SQL> delete from t_xifenfei where id=1;
--hang住

做systemstate

SQL> oradebug setmypid
Statement processed.
SQL>  oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_9976.trc
SQL> exit

使用ass109.awk分析dump文件

[oracle@xifenfei ~]$ awk -f ass109.awk /u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_9976.trc
Starting Systemstate 1
..................................
Ass.Awk Version 1.0.9 - Processing /u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_9976.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'pmon timer'
3:  0: waiting for 'rdbms ipc message'
4:  0: waiting for 'VKTM Logical Idle Wait'
5:  0: waiting for 'rdbms ipc message'
6:  0: waiting for 'DIAG idle wait'
7:  0: waiting for 'rdbms ipc message'
8:  0: waiting for 'DIAG idle wait'
9:  0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'smon timer'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17:
18:
19: 0: waiting for 'Space Manager: slave idle wait'
20: 0: waiting for 'SQL*Net message from client'
21: 0: waiting for 'enq: TX - row lock contention'[Enqueue TX-000A0020-0000024F]
     Cmd: Delete
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
27:
28:
30: 0: waiting for 'Streams AQ: qmn slave idle wait'
31: 0: waiting for 'rdbms ipc message'
33: 1: waited for 'Streams AQ: waiting for time management or cleanup tasks'
35: 0: waiting for 'rdbms ipc message'
41:
44:
Blockers
~~~~~~~~
        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.
                    Resource Holder State
Enqueue TX-000A0020-0000024F    20: 0: waiting for 'SQL*Net message from client'
Object Names
~~~~~~~~~~~~
Enqueue TX-000A0020-0000024F
30586 Lines Processed.
--从这里马上就可以知道pid 21 请求Enqueue TX被pid 20阻塞

下载:ass109.awk

Systemstates分析会话阻塞—锁表

模拟会话被阻塞

--会话1
SQL> select sid from v$mystat where rownum=1;
       SID
----------
        15
SQL> create table t_xifenfei (id number,name varchar2(20));
Table created.
SQL> insert into t_xifenfei values(1,'xifenfei');
1 row created.
SQL> insert into t_xifenfei values(2,'www.xifenfei');
1 row created.
SQL> commit;
Commit complete.
SQL> select * from t_xifenfei;
        ID NAME
---------- ----------------------------------------
         1 xifenfei
         2 www.xifenfei
SQL> delete from t_xifenfei where id=1;
1 row deleted.
--会话2
SQL>  select sid from v$mystat where rownum=1;
       SID
----------
       143
SQL> delete from chf.t_xifenfei where id=1;
--hang住

新打开会话做Systemstates
我们假设不知道会话1和会话2的sid,现在特定的使用Systemstates分析问题,后面给出简单分析方法

SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_31027.trc
SQL> exit

阅读trace文件
会话2在这样的情况下hang住,而其他会话正常,第一反应是在lock级别阻塞了,而这个又是DML操作,很自然想到是TX,于是在trace文件中搜索”enq: TX” ,找到对应记录,然后向上找到对应的进程号,开始读相关内容,发现有小信息如下:

PROCESS 20:
  ----------------------------------------
--客户端信息
    client details:
      O/S info: user: oracle, term: pts/0, ospid: 30622
      machine: xifenfei program: sqlplus@xifenfei (TNS V1-V3)
      application name: SQL*Plus, hash value=3669949024
--进程相关session信息
    (session) sid: 15 ser: 151 trans: 0x343a4c2c, creator: 0x35fe2218
              flags: (0x45) USR/- flags_idl: (0x0) -/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              oct: 0, prv: 0, sql: (nil), psql: 0x2f6e7b68, user: 84/CHF
--被阻塞会话信息
    There are 1 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 143, ser: 229
      wait event: 'enq: TX - row lock contention'
        p1: 'name|mode'=0x54580006
        p2: 'usn<<16 | slot'=0x40005
        p3: 'sequence'=0x252
      row_wait_obj#: 75928, block#: 171, row#: 0, file# 4
      min_blocked_time: 296 secs, waiter_cache_ver: 7860
    Wait State:
      fixed_waits=0 flags=0x22 boundary=(nil)/-1
--54580006 is split into ASCII 54 + ASCII 58 (TX) + Mode 0006 (X) ...
SQL> select object_type,object_name,owner from dba_objects where object_id=75928;
OBJECT_TYP OBJECT_NAME          OWNER
---------- -------------------- ----------
TABLE      T_XIFENFEI           CHF
--持有锁的信息
      (enqueue) TX-00040005-00000252	DID: 0001-0014-0000009C
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x6
      mode: X, lock_flag: 0x0, lock: 0x343a4c6c, res: 0x353606a8
      own: 0x355ae5b8, sess: 0x355ae5b8, proc: 0x35fe2218, prv: 0x353606b0
--通过上述信息可以分析出结论:
sqlplus登录的sid=15的会话占用了TX mode=6(mode:x)的锁,阻塞了sid=143会话对chf.t_xifenfei表操作

找出被阻塞进程相关信息(sid 为143的进程),搜索”sid: 143″,阅读相关进程信息

PROCESS 21:
  ----------------------------------------
--相关session信息
 (session) sid: 143 ser: 229 trans: 0x343915a0, creator: 0x35fe2d3c
              flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              oct: 7, prv: 0, sql: 0x2f6cb234, psql: 0x2f6dd5cc, user: 0/SYS
--当前等待信息
 Current Wait Stack:
     0: waiting for 'enq: TX - row lock contention'
        name|mode=0x54580006, usn<<16 | slot=0x40005, sequence=0x252
        wait_id=12 seq_num=13 snap_id=1
        wait times: snap=5 min 1 sec, exc=5 min 1 sec, total=5 min 1 sec
        wait times: max=infinite, heur=5 min 1 sec
        wait counts: calls=101 os=101
        in_wait=1 iflags=0x15a0
--阻塞该会话的session信息
    There is at least one session blocking this session.
      Dumping 1 direct blocker(s):
        inst: 1, sid: 15, ser: 151
      Dumping final blocker:
        inst: 1, sid: 15, ser: 151
    Wait State:
      fixed_waits=0 flags=0x22 boundary=(nil)/-1
--请求锁信息
      SO: 0x352f8fcc, type: 8, owner: 0x35765fe8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
       proc=0x35fe2d3c, name=enqueue, file=ksq1.h LINE:380, pg=0
      (enqueue) TX-00040005-00000252	DID: 0001-0015-0000003B
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x6
      req: X, lock_flag: 0x10, lock: 0x352f8ff8, res: 0x353606a8
      own: 0x356f49b8, sess: 0x356f49b8, proc: 0x35fe2d3c, prv: 0x353606b8
--通过对被阻塞对象分析,可以得出和阻塞者相同的信息

对该问题的常规分析思路

--查询等待事件
SQL> select event,p1,p2,p3 from v$session where wait_class#<>6;
EVENT                                  P1         P2         P3
------------------------------ ---------- ---------- ----------
SQL*Net message to client      1650815232          1          0
enq: TX - row lock contention  1415053318     262149        594
--查询锁信息(因为通过上面的等待事件分析,TX可能引起会话hang)
SQL> SELECT * FROM v$lock where type in ('TM','TX');
ADDR     KADDR           SID TYPE        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- ---- ---------- ---------- ---------- ---------- ---------- ----------
352F8FCC 352F8FF8        143 TX       262149        594          0          6       4181          0
B69CC7A8 B69CC7D8        143 TM        75928          0          3          0       4181          0
B69CC7A8 B69CC7D8         15 TM        75928          0          3          0       4266          0
343A4C2C 343A4C6C         15 TX       262149        594          6          0       4267          1
--通过TM查询出来对应对象
SQL>  select object_type,object_name,owner from dba_objects where object_id=75928;
OBJECT_TYP OBJECT_NAME          OWNER
---------- -------------------- ----------
TABLE      T_XIFENFEI           CHF
--通过观察v$lock查询结果可以知道:
15会话的TX MODE=6的锁阻塞了143会话想会的的TX MODE=6的锁,从而是的143会话hang住

Systemstates分析参考文档:Understanding and Reading Systemstates或者[ID 423153.1]

找出 alter system kill session 'sid,serial#' kill 掉的数据库会话对应进程

当我们使用alter system kill session ‘sid,serial#’ 在数据库中kill掉某个会话的时候,如果你观察仔细会发现v$session.paddr发生了改变,从而是的不能直接通过关联v$process.add找出spid,然后进行其他操作.本文提供三种方法找该种情况下spid的方法.
数据库版本

SQL> select * from v$version;
BANNER
-----------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

会话1

SQL> select sid, SERIAL#,paddr from v$session where
  2  sid=(select sid from v$mystat where rownum=1);
       SID    SERIAL# PADDR
---------- ---------- --------
       133         53 35FE16F4

会话2

SQL> select sid, SERIAL#,paddr from v$session where
  2  sid=(select sid from v$mystat where rownum=1);
       SID    SERIAL# PADDR
---------- ---------- --------
       143         21 35FE2D3C

会话3

SQL> alter system kill session '133,53';
System altered.
SQL> alter system kill session '143,21';
System altered.
SQL> select sid, SERIAL#,paddr,status from v$session where sid in(133,143);
       SID    SERIAL# PADDR    STATUS
---------- ---------- -------- ----------------
       133         53 3547A3F4 KILLED
       143         21 3547A3F4 KILLED

证明alter system kill session后,v$session中的paddr发生了改变,这个时候如果需要找出原来的spid,不能使用v$session.paddr和v$process.addr关联获得

找出kill掉的spid方法1

SQL> select spid, program from v$process
  2      where program!= 'PSEUDO'
  3      and addr not in (select paddr from v$session)
  4      and addr not in (select paddr from v$bgprocess)
  5      and addr not in (select paddr from v$shared_server);
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
14260                                            oracle@xifenfei (L001)
14256                                            oracle@xifenfei (L000)
15300                                            oracle@xifenfei (TNS V1-V3)
14179                                            oracle@xifenfei (D000)
15318                                            oracle@xifenfei (TNS V1-V3)
14252                                            oracle@xifenfei (N000)
SQL> !ps -ef|grep 15300|grep -v grep
oracle   15300 14052  0 03:22 ?        00:00:00 oracleora11g (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
SQL> !ps -ef|grep 15318|grep -v grep
oracle   15318 15315  0 03:22 ?        00:00:00 oracleora11g (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

找出kill 掉的spid 方法2

SQL> SELECT s.username,s.status,
  2  x.ADDR,x.KSLLAPSC,x.KSLLAPSN,x.KSLLASPO,x.KSLLID1R,x.KSLLRTYP,
  3  decode(bitand (x.ksuprflg,2),0,null,1)
  4  FROM x$ksupr x,v$session s
  5  WHERE s.paddr(+)=x.addr
  6  and bitand(ksspaflg,1)!=0;
USERNAME   STATUS           ADDR       KSLLAPSC   KSLLAPSN KSLLASPO                   KSLLID1R KSLL DE
---------- ---------------- -------- ---------- ---------- ------------------------ ---------- ---- --
           ACTIVE           35FD5990          6         26 14121                             0      1
           ACTIVE           35FD6FD8          1         69 14055                             0      1
           ACTIVE           35FD8620          1         69 14055                             0      1
           ACTIVE           35FD9C68          1         69 14055                             0      1
           ACTIVE           35FDB2B0          8         27 15300                             0      1
           ACTIVE           35FDC8F8         12         36 15300                             0      1
           ACTIVE           35FDDF40          1         69 14055                             0      1
           ACTIVE           35FDF588          1         69 14055                             0      1
           ACTIVE           35FE3860          7         26 14236                             0      1
           ACTIVE           35FE4EA8          1         69 14224                             0      1
           ACTIVE           35FE64F0         63          2 14311                           377 EV   1
           ACTIVE           35FEA7C8          3         26 14155                           258 EV   1
           ACTIVE           35FE9180         59          2 14248                           378 EV   1
           ACTIVE           35FE9CA4         12          2 14603                             0      1
           ACTIVE           35FD64B4          1         69 14055                             0      1
           ACTIVE           35FD7AFC          2         27 14055                             0      1
           ACTIVE           35FD9144          2         27 15300                             0      1
           ACTIVE           35FDA78C          3         26 14171                             0      1
           ACTIVE           35FDBDD4         17          2 15255                             0      1
           ACTIVE           35FDD41C         22         26 14155                             0      1
           ACTIVE           35FDEA64         52         26 14155                             0      1
           ACTIVE           35FE4384          1         69 14224                             0      1
           ACTIVE           35FE59CC          1         69 14224                             0      1
           ACTIVE           35FEB2EC          2          2 14248                             0      1
           ACTIVE           35FEC934         11         26 14121                             0      1
SYS        ACTIVE           35FEF5C4          4         16 14117                             0
                            35FE0BD0          1         69 14055                             0
                            35FE865C          1         69 14117                             0
                            35FE7B38          1         69 14117                             0
                            35FE16F4          1         26 14155                             0
                            35FD4E6C          0          0                                   0
                            35FE00AC          2        279 14117                             0
                            35FE2D3C          0          0                                   0
                            35FE7014          2        335 14117                             0
--挑选username和status为null的会话
SQL> select spid,program from v$process where addr in (
  2  '35FE0BD0',
  3  '35FE865C',
  4  '35FE7B38',
  5  '35FE16F4',
  6  '35FD4E6C',
  7  '35FE00AC',
  8  '35FE2D3C',
  9  '35FE7014'
 10  );
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
                                                 PSEUDO
14179                                            oracle@xifenfei (D000)
14183                                            oracle@xifenfei (S000)
15300                                            oracle@xifenfei (TNS V1-V3)
15318                                            oracle@xifenfei (TNS V1-V3)
14252                                            oracle@xifenfei (N000)
14256                                            oracle@xifenfei (L000)
14260                                            oracle@xifenfei (L001)
8 rows selected.
--同样可以发现spid 15300和15318的进程已经在数据库中被kill掉

找出kill掉的spid方法3(11g特有)

SQL> select  spid,program  from v$process where addr in
  2  (select creator_addr from v$session where sid in(133,143));
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
15300                                            oracle@xifenfei (TNS V1-V3)
15318                                            oracle@xifenfei (TNS V1-V3)

找出kill掉的spid方法4(11g特有)

SQL> select * from V$DETACHED_SESSION;
      INDX PG_NAME                                                             SID    SERIAL#        PID
---------- ------------------------------------------------------------ ---------- ---------- ----------
         0 DEFAULT                                                             143         21         21
         1 DEFAULT                                                             133         53         19
SQL> select spid,program from v$process where pid in(21,19);
SPID                                             PROGRAM
------------------------------------------------ ------------------------------
15300                                            oracle@xifenfei (TNS V1-V3)
15318                                            oracle@xifenfei (TNS V1-V3)

ROW CACHE LOCK等待事件

ROW CACHE LOCK基础说明
ROW CACHE LOCK等待事件是一个共享池相关的等待事件。是由于对于字典缓冲的访问造成的。
P1 – Cache Id
P2 – Mode Held
P3 – Mode Requested

mode 和REQUEST的取值:
KQRMNULL 0 null mode – not locked
KQRMS 3 share mode
KQRMX 5 exclusive mode
KQRMFAIL 10 fail to acquire instance lock

如果是RAC/OPS环境,前台进程发出锁请求,LCK0进程发出锁请求。如果是单实例模式,由前台进程直接发出锁请求。
在RAC/OPS环境下,前台进程会循环等待锁的获取,最多会等待60秒钟。在单实例环境,前台进程会循环1000次,等待3秒钟。PMON进程无论在哪种模式,都会等待5秒钟。
要注意的是单实例模式下和多实例模式下申请该锁调用的模块是不同的(kqrget()- 单实例,kqgigt()- 多实例)。
如果发现这个等待十分高,一般来说可能由于2种原因,一是共享池太小了,需要增加共享池,另外一种情况是SQL分析过于频繁,对于共享池的并发访问量过大。对于任何一种情况,绝大多数情况下加大共享池会有助于降低该等待,不过加大共享池的时候也要注意,并不一定所有的情况下增加共享池都会有明显的效果,特别是对于第二种情况,精确的分析十分重要。另外进一步分析,弄清楚哪些ROW CACHE的等待最为严重,有助于解决问题。

SQL查询

--查询row cache lock等待
select *  from v$session_wait where wait_class = 'row cache lock';
--查询rowcache 名称
select * from v$rowcache where cache# = &p1;

ENQUEUE TYPE
DC_TABLESPACES
Probably the most likely cause is the allocation of new extents. If extent sizes are set low then the application may constantly be requesting new extents and causing contention. Do you have objects with small extent sizes that are rapidly growing? (You may be able to spot these by looking for objects with large numbers of extents). Check the trace for insert/update activity, check the objects inserted into for number of extents.

DC_SEQUENCES
Check for appropriate caching of sequences for the application requirements.

DC_USERS
Deadlock and resulting “WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!” can occur if a session issues a GRANT to a user, and that user is in the process of logging on to the database.

DC_SEGMENTS
This is likely to be down to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose.

DB_ROLLBACK_SEGMENTS
This is due to rollback segment allocation. Just like dc_segments,identify what is holding the enqueue and also generate errorstacks. Remember that on a multi-node system (RAC) the holder may be on another node and so multiple systemstates from each node will be required.

DC_AWR_CONTROL
This enqueue is related to control of the Automatic Workload Repository. As such any operation manipulating the repository may hold this so look for processes blocking these.

Process OS id : xxxxx alive after kill

Process OS id : xxxxx alive after kill警告

Mon May 21 04:55:06 2012
Shutting down instance (immediate)
License high water mark = 373
Mon May 21 04:55:06 2012
Stopping Job queue slave processes
Mon May 21 04:55:06 2012
Job queue slave processes stopped
All dispatchers and shared servers shutdown
Mon May 21 04:55:14 2012
Process OS id : 9922 alive after kill
Errors in file
Mon May 21 04:55:16 2012
Process OS id : 8159 alive after kill
Errors in file /oracle/admin/resultdb/udump/resultdb_ora_14639.trc
Mon May 21 04:55:17 2012
Process OS id : 8285 alive after kill
Errors in file /oracle/admin/resultdb/udump/resultdb_ora_14639.trc
Mon May 21 04:55:33 2012
ALTER DATABASE CLOSE NORMAL

错误原因

On some platforms it takes some time to kill processes--AIX being one of those platforms
There have been previous reports of shutdown taking time on AIX and after all Oracle waits were taken out of
the picture it was determined to be due to the way the kill command is implemented on that platform.
Bug 4931101 ERRORS IN ALERT LOG DURING SHUTDOWN

处理建议

Ignore the error  as all processes will be closed and shutdown will complete successfully.

补充说明
本次出问题的数据库是运行在 linux 平台上的 10.2.0.3

主键表插入数据不提交,外键表插入数据被阻塞

有客户和我说:他在含主外键的表中实验发现,在主表数据未提交,然后在外键表插入该数据数据时,出现外键表hang住现象.我开始以为是不同的会话,根据oracle数据库的一致性原则,应该新会话在外键表中不能知道这个记录的存在,直接报错.
可是我实验结果证明:外键表会被阻塞.分析原因如下:
模拟环境

SQL> create table t_p(id number primary key,name varchar2(100));
Table created.
SQL> create table t_f(fid number primary key,pid number, foreign key(pid) references  t_p(id));
Table created.
--会话1
SQL> insert into t_p values(1,'xifenfei');
1 row created.
SQL> commit;
Commit complete.
--会话2
SQL> insert into t_f values(1,1);
1 row created.
SQL> commit;
--会话1
SQL> insert into t_p values(2,'XIFENFEI');
1 row created.
--会话2
SQL> insert into t_f values(2,2);
--hang住

通过实验发现,当主键数据没有提交,然后在外键表中插入该数据外键数据时,该条记录会处于hang住状态(等待),那是什么原因导致了这个等待呢?对会话2做一个10046的trace,发现如下

*** 2012-05-17 17:25:41.757
WAIT #3065187488: nam='enq: TX - row lock contention' ela= 27002895 name|mode=1415053316 usn<<16
| slot=262151 sequence=588 obj#=-1 tim=1337246741756917
EXEC #3065187488:c=4000,e=27004456,p=0,cr=2,cu=14,mis=0,r=0,dep=0,og=1,
plh=0,tim=1337246741757690
ERROR #3065187488:err=1013 tim=1337246741757751
STAT #3065187488 id=1 cnt=0 pid=0 pos=1 obj=0 op='LOAD TABLE CONVENTIONAL
(cr=0 pr=0 pw=0 time=12 us)'
WAIT #3065187488: nam='SQL*Net break/reset to client' ela= 581 driver
id=1650815232 break?=0 p3=0 obj#=-1 tim=1337246741782587
WAIT #3065187488: nam='SQL*Net message to client' ela= 2 driver id=1650815232
#bytes=1 p3=0 obj#=-1 tim=1337246741782668

通过这个trace发现,是因为TX锁导致了外键表上的插入操作被阻塞.出现该问题的原因
有两种可能:1.两次插入(主键表和外键表分别插入)在主键表上有不兼容锁;2.外键表上有不兼容性锁.

使用oradebug跟踪会话

oradebug setmypid
--EVENT 10704跟踪锁的使用情况
oradebug EVENT 10704 trace name context forever,level 10
--插入数据操作
oradebug EVENT 10704 trace name context off
oradebug TRACEFILE_NAME

跟踪主键表插入数据

*** 2012-05-17 19:05:52.410
ksqgtl *** TM-00012892-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343c0e54, ktcdix=2147483647, topxcb=0x343c0e54
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:05:52.411
ksucti: init txn DID from session DID
ksqgtl:
        ksqlkdid: 0001-0013-0000000F
*** 2012-05-17 19:05:52.429
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0013-0000000F
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0013-0000000F
ksqgtl: RETURNS 0
*** 2012-05-17 19:05:52.430
ksqgtl *** TM-00012894-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343c0e54, ktcdix=2147483647, topxcb=0x343c0e54
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:05:52.430
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0013-0000000F
*** 2012-05-17 19:05:52.430
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0013-0000000F
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0013-0000000F
ksqgtl: RETURNS 0
*** 2012-05-17 19:05:52.431
ksqgtl *** TX-00050019-00000307 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x343c0e54, ktcdix=2147483647, topxcb=0x343c0e54
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:05:52.431
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0013-0000000F
*** 2012-05-17 19:05:52.431
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0013-0000000F
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0013-0000000F
ksqgtl: RETURNS 0
SQL> SELECT TO_NUMBER(12892,'xxxxxxx') from dual;
TO_NUMBER(12892,'XXXXXXX')
--------------------------
                     75922
SQL> SELECT TO_NUMBER(12894,'xxxxxxx') from dual;
TO_NUMBER(12894,'XXXXXXX')
--------------------------
                     75924
SQL> select object_name from dba_objects where object_id in(75922,75924);
OBJECT_NAM
----------
T_P
T_F

通过锁使用情况跟踪可以知道,在主键表插入一条记录时,先在主键表获得TM锁,然后外键表获得TM锁,最后主键表获得TX MODE=6的锁。

跟踪外键表插入数据

*** 2012-05-17 19:49:24.912
ksqgtl *** TM-00012892-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.912
ksucti: init txn DID from session DID
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.913
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
ksqgtl: RETURNS 0
*** 2012-05-17 19:49:24.913
ksqgtl *** TM-00012894-00000000 mode=3 flags=0x401 timeout=21474836 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.913
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.913
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
ksqgtl: RETURNS 0
*** 2012-05-17 19:49:24.913
ksqgtl *** TX-0002001f-0000034a mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.913
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.914
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
ksqgtl: RETURNS 0
*** 2012-05-17 19:49:24.914
ksqgtl *** TX-00050019-00000307 mode=4 flags=0x10021 timeout=21474836 ***
ksqgtl: xcb=0x343a45e8, ktcdix=2147483647, topxcb=0x343a45e8
        ktcipt(topxcb)=0x0
*** 2012-05-17 19:49:24.914
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0015-00000064
*** 2012-05-17 19:49:24.914
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0015-00000064
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0015-00000064
*** 2012-05-17 19:49:24.914
ksqcmi: TX,50019,307 mode=4 timeout=21474836

从这里可以发现:先在主键表和外键表上加上TM锁,然后外键表获得TX MODE=6的锁(这边成功,因为该表上未有其他级别不兼容锁),再需要在主键表上获得TX MODE=4(表结构共享锁+所有记录共享锁),但是这个时候,发现该锁上已经在主键表插入数据未提交的时候,已经含有了TX MODE=6的锁,从而使得TX MODE=4无法获得,从而使得外键表插入数据处于阻塞状态.

rac redo log file被意外覆盖数据库恢复

当前日志被覆盖导致错误
朋友的一客户在一套rac上包含了两个数据库,其其中一个库增加redo group时候,覆盖了另外一个库的redo,悲剧的是刚好是current redo

Wed May 16 17:03:05 2012
ALTER DATABASE OPEN
This instance was first to open
Wed May 16 17:03:09 2012
Beginning crash recovery of 2 threads
 parallel recovery started with 15 processes
Wed May 16 17:03:11 2012
Started redo scan
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
Abort recovery for domain 0
Wed May 16 17:03:11 2012
Aborting crash recovery due to error 305
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
ORA-305 signalled during: ALTER DATABASE OPEN...
Wed May 16 17:03:13 2012
Shutting down instance (abort)

使用_allow_resetlogs_corruption= TRUE进行恢复

Wed May 16 18:16:48 2012
SMON: enabling cache recovery
Wed May 16 18:16:48 2012
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Wed May 16 18:16:48 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773],
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773],
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 2105454
ORA-1092 signalled during: alter database open resetlogs...

ORA-600[KCLCHKBLK_4], is signaled because the SCN in a tempfile block is too high.
The same reason caused the ORA-600[2662]s in the alert logs.
因为是临时文件的scn太大的问题,那就比较好解决:
启动数据库到mount状态,查询出来相关temp file,然后drop掉.

ORA-00600[6856]

Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []
ORACLE Instance odsdb1 (pid = 16) - Error 600 encountered while recovering transaction
(10, 8) on object 7162533.
Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []

这里的4193和4194是比较熟悉的,根据这里的提示猜测6856也是和undo有关系
ORA-600[6856]SMON is trying to recover a dead transaction.
But the undo application runs into an internal error (trying to delete a row that is already deleted).
因为smon回滚的时候出现上面错误,解决方法是想办法终止回滚,使用event=”10513 trace name context forever, level 2″.

ORA-00600[4193]/ORA-00600[4194]

Wed May 16 20:25:17 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:18 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:56 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00600: internal error code, arguments: [4193], [22248], [22252], [], [], [], [], []

太常见错误,不再做说明,虽然使用event是的库open成功,因为部分回滚段有问题,该错误还是会出现(还是喜欢直接屏蔽回滚段)

ORA-00600[ktpridestroy2]

Wed May 16 20:36:26 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2101296.trc:
ORA-00600: internal error code, arguments: [ktpridestroy2], [], [], [], [], [], [], []

This error could be the result of a corruption and involves the parallel rollback that SMON enables each startup.
解决:fast_start_parallel_rollback=false

ORA-00600[kturacf1]/ORA-00600[kcbgcur_9]

Wed May 16 20:49:15 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_2007088.trc:
ORA-00600: internal error code, arguments: [kturacf1], [2097152], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcbgcur_9], [780140563], [4], [4294959056], [2097152], [], [], []

ORA-00600[kturacf1]错误未查询到原因
ORA-00600[kcbgcur_9]错误原因可能是:Buffers are pinned in a specific class order to prevent internal deadlocks.
因为这两个错误是job产生非致命错误,在这次的处理过程中可以忽略

ORA-00600[4097]

Wed May 16 21:05:05 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_1716282.trc:
ORA-12012: error on auto execute of job 6603
ORA-20001: ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
ORA-06512: at "EPBI.UP_SYSLOG_ONLINE_USER", line 141
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []

When an instance has a rollback segment offline and the instance crashes, or
the user does a shutdown abort, the rollback segment wrap number does not get
updated. If that segment is then dropped and recreated immediately after the
instance is restarted, the wrap number could be lower than existing wrap
numbers. This will cause the ORA-600[4097] to occur in subsequent
transactions using Rollback.
这个错误也是因为回滚段wrap number未被及时更新导致的异常.

总结这次恢复过程
1.因当前redo丢失,使用隐含参数打开库,然后出现ORA-600[KCLCHKBLK_4](这个错误比较少见,更何况这个库是10.2.0.4)
2.undo出现问题出现ORA-00600[6856]错误不是很多见
3.接下来都是一些列undo导致的问题,其实如果开始就直接使用隐含参数删除掉有问题回滚段,效果可能会比event好.
4.因为部分trace文件没有拿到,未做深入分析,文章列出整体恢复思路
5.本次恢复的数据库版本是10.2.0.4,根据mos描述,很多错误应不会发生,但是实际还是发生了,MOS的版本范围,不要太看重.
6.其实这篇文章的本质不是展示恢复过程,而是再一次提醒:操作数据库慎重,特别是一台机器上多套库.