library cache lock等待事件

Library cache lock介绍
Oracle利用Library cache lock和Library cache pin来实现并发控制,Library cache lock是在handle上获取的,而Library cache pin则是在data heap上获取。访问对象时,首先必须获取handle上的lock,然后将访问的数据pin在内存中。lock的作用是控制进程间的并发访问,而pin的作用是保证数据一致性,防止数据在访问时被交换出去。
lock和pin的实现类似于enqueue,在每个handle上都有lock和pin的holder list和waiter list,用来保存持有该资源和等待该资源的队列。

Library cache lock相关sql语句

--找出library cache lock等待sid,saddr信息
select sid,saddr from v$session where event= 'library cache lock';
SID        SADDR
---------- --------
16         572ed244
--找出blocked信息
select kgllkhdl Handle,kgllkreq Request, kglnaobj Object
from x$kgllk where kgllkses = '572ed244'
and kgllkreq > 0;
HANDLE   REQUEST   OBJECT
-------- ---------- ------------------
62d064dc          2 EMPLOYEES
--找出blocking信息
select kgllkses saddr,kgllkhdl handle,kgllkmod mod,kglnaobj object
from x$kgllk lock_a
where kgllkmod > 0
and exists (select lock_b.kgllkhdl from x$kgllk lock_b
where kgllkses = '572ed244' /* blocked session */
and lock_a.kgllkhdl = lock_b.kgllkhdl
and kgllkreq > 0);
SADDR     HANDLE   MOD         OBJECT
--------  -------- ---------- ------------
572eac94  62d064dc          3  EMPLOYEES
--blocking 会话信息
select sid,username,terminal,program from v$session where saddr = '572eac94'
SID        USERNAME     TERMINAL  PROGRAM
---------- -----------  --------- --------------------------------------------
12          SCOTT        pts/20    sqlplus@goblin.forgotten.realms (TNS V1-V3)
--所有blocked 会话
select sid,username,terminal,program from v$session
where saddr in
(select kgllkses from x$kgllk lock_a
 where kgllkreq > 0
 and exists (select lock_b.kgllkhdl from x$kgllk lock_b
             where kgllkses = '572eac94' /* blocking session */
             and lock_a.kgllkhdl = lock_b.kgllkhdl
             and kgllkreq = 0)
);
SID        USERNAME  TERMINAL  PROGRAM
---------- --------- --------- -------------------------------------------
13         SCOTT     pts/22    sqlplus@goblin.forgotten.realms (TNS V1-V3)
16         SCOTT     pts/7     sqlplus@goblin.forgotten.realms (TNS V1-V3)

library cache pin等待事件

library cache pin说明
library cache pin 事件是用来管理library cache的并发访问的, pin一个object会引起相应的heap被载入内存中,如果客户端需要修改或检测这个object它就必须在锁住后取得一个pin.library cache pin的等待时间为3秒钟,其中有1秒钟用于PMON后台进程,即在取得pin之前最多等待3秒钟,否则就超时.library cache pin通常是发生在编译或重新编译PL/SQL,VIEW,TYPES等object时.编译通常都是显性的,如安装应用程序,升级,安装补丁程序等,但object的重新编译也可能发生在object变得无效时.library cache pin的参数如下,有用的主要是P1和P2:
P1 – KGL Handle address.
P2 – Pin address
P3 – 10*Mode + Namespace
其中,P1,P2可与x$kglpn和x$kglob表相关.x$kglpn和x$kglob是ORACLE数据库的内部数据字典.
x$kglpn library cache pin信息
x$kglob library cache object信息

查询方法一

--通过查询V$SESSION_WAIT找出正在等待”library cache pin”的session
SELECT sid,
       SUBSTR (event, 1, 30),
       TO_CHAR(p1, 'xxxxxxxx') p1_16,
       --P1RAW P1_16,
       p2,
       p3
  FROM v$session_wait
 WHERE wait_time = 0 AND event LIKE 'library cache pin%';
--P1 列是Library Cache Handle Address
--P2 列是Library Cache Pin Address.
--找到相关session pin状态
SELECT ADDR,
       INDX,
       KGLPNADR,-- Library Cache Pin Address
       KGLPNUSE,
       KGLPNSES,--识别锁住此pin 的session
       KGLPNHDL,--Library Cache Handle Address
       kGLPNLCK,
       KGLPNMOD,-- Pin 锁
       KGLPNREQ-- Pin 请求
  FROM x$kglpn
 WHERE KGLPNHDL LIKE '%EB3EB8%';--p1_16
 --询X$KGLOB (Library Cache Object),可找到相关的object
SELECT KGLNAOBJ-- 相关object的名字(取前面80个字符)
  FROM X$KGLOB
 WHERE KGLHDADR LIKE '%EB3EB8%';--p1_16
 --查出占着pin锁的session目前正在做什么
SELECT a.sid, a.username, a.program
  FROM v$session a, x$kglpn b
 WHERE a.saddr = b.kglpnuse AND b.kglpnhdl LIKE '%EB3EB8%'--p1_16
  AND b.kgnmod <> 0;
   --查出阻塞者正执行的SQL语句
 SELECT sid, sql_text
  FROM v$session, v$sqlarea
 WHERE v$session.sql_address = v$sqlarea.address AND sid =&sid;

查询方法二

--通过查询DBA_LOCK_INTERNAL和V$SESSION_WAIT,可得到与”library cache pin” 等待相关的object的名字
SELECT TO_CHAR (SESSION_ID, '999') sid,
       SUBSTR (LOCK_TYPE, 1, 30) TYPE,
       SUBSTR (lock_id1, 1, 23) Object_Name,
       SUBSTR (mode_held, 1, 4) HELD,
       SUBSTR (mode_requested, 1, 4) REQ,
       lock_id2 Lock_addr
  FROM dba_lock_internal
 WHERE mode_requested <> 'None' AND mode_requested <> mode_held
       AND session_id IN
              (SELECT sid
                 FROM v$session_wait
                WHERE wait_time = 0 AND event LIKE 'library cache pin%');
 --查出”library cache pin”占有者(即阻塞者)的session id
 SELECT sid Holder,
       KGLPNUSE Sesion,
       KGLPNMOD Held,
       KGLPNREQ Req
  FROM sys.x$kglpn, v$session
 WHERE KGLPNHDL IN (SELECT p1raw
                      FROM v$session_wait
                     WHERE wait_time = 0 AND event LIKE 'library cache pin%')
       AND KGLPNMOD <> 0
       AND v$session.saddr = x$kglpn.kglpnuse;
 --查出”library cache pin”占有者(阻塞者)正在等什么
 SELECT sid, SUBSTR (event, 1, 30), wait_time
  FROM v$session_wait
 WHERE sid IN
          (SELECT sid
             FROM x$kglpn, v$session
            WHERE KGLPNHDL IN
                     (SELECT p1raw
                        FROM v$session_wait
                       WHERE wait_time = 0
                             AND event LIKE 'library cache pin%')
                  AND KGLPNMOD <> 0
                  AND v$session.saddr = x$kglpn.kglpnuse);
 --查出阻塞者正执行的SQL语句
 SELECT sid, sql_text
  FROM v$session, v$sqlarea
 WHERE v$session.sql_address = v$sqlarea.address AND sid =&sid;

cursor: pin S wait on X 等待事件

cursor: pin S整体描述

cursor: pin S A session waits on this event when it wants to update a shared mutex pin and another session
is currently in the process of updating a shared mutex pin for the same cursor object. This wait event should
rarely be seen because a shared mutex pin update is very fast.(Wait Time: Microseconds)
--Parameter说明
P1 Hash value of cursor
P2 Mutex value
64 bit platforms
8 bytes are used.
Top 4 bytes hold the session id (if the mutex is held X)
Bottom 4 bytes hold the ref count (if the mutex is held S).
32 bit platforms
4 bytes are used.
Top 2 bytes hold the session id (if the mutex is held X)
Bottom 2 bytes hold the ref count (if the mutex is held S).
P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps
--查询sql
SELECT a.*, s.sql_text
  FROM v$sql s,
       (SELECT sid,
               event,
               wait_class,
               p1 cursor_hash_value,
               p2raw Mutex_value,
               TO_NUMBER (SUBSTR (p2raw, 1, 8), 'xxxxxxxx') hold_mutex_x_sid
          FROM v$session_wait
         WHERE event LIKE 'cursor%') a
 WHERE s.HASH_VALUE = a.cursor_hash_value

cursor: pin S wait on X描述

- In previous versions of Oracle, library cache pin is protected by “library cache pin latch”.
- But in recent versions of Oracle(I believe it’s 10.2.0.2),
  library cache pin for the cursor LCO is protected by mutext.
- Mutex is allocated per LCO, so it enables fine-grained access control.
“cursor: pin S wait on X” wait event is mostly related to mutex and hard parse.
- When a process hard parses the SQL statement, it should acquire exclusive
  library cache pin for the corresponding LCO.
- This means that the process acquires the mutex in exclusive mode.
- Another process which also executes the same query needs to acquire the mutex
  but it’s being blocked by preceding process. The wait event is “cursor: pin S wait on X”.
--发生cursor: pin S wait on X原因
Frequent Hard Parses
If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.
High Version Counts
When Version counts become excessive, a long chain of versions needs to
be examined and this can lead to contention on this event
Known bugs
Bug 5907779 - Self deadlock hang on "cursor: pin S wait on X" (typically from DBMS_STATS) [ID 5907779.8]
Bug 7568642: BLOCKING_SESSION EMPTY FOR "CURSOR: PIN S WAIT ON X"

ORA-00600[kcratr1_lostwrt]/ORA-00600[3020]错误恢复

open数据库alert日志报ORA-00600[kcratr1_lostwrt]错误

Mon May 14 14:57:28 2012
ALTER DATABASE OPEN
Mon May 14 14:57:29 2012
Beginning crash recovery of 1 threads
Mon May 14 14:57:29 2012
Started redo scan
Mon May 14 14:57:29 2012
Errors in file d:\oracle\admin\cqgasold\udump\cqgasold_ora_504.trc:
ORA-00600: 内部错误代码,参数: [kcratr1_lostwrt], [], [], [], [], [], [], []
ORA-600 signalled during: alter database open...

查询相关SCN
同一个查询中SCN相同,省略

SQL> select file#,online_status "STATUS",to_char(change#,'9999999999999999') "SCN",
  2  To_char(time,'yyyy-mm-dd hh24:mi:ss')"TIME" from v$recover_file;
未选定行
SQL> select file#,to_char(checkpoint_change#,'999999999999999') "SCN",
  2  to_char(last_change#,'999999999999999')"STOP_SCN" from v$datafile;
     FILE# SCN              STOP_SCN
---------- ---------------- ----------------
         1       7842987188
         2       7842987188
         3       7842987188
SQL> select file#,to_char(checkpoint_change#,'9999999999999999') "SCN",
  2  to_char(RESETLOGS_CHANGE#,'9999999999999999') "RESETLOGS SCN"
  3  from v$datafile_header;
     FILE# SCN               RESETLOGS SCN
---------- ----------------- -----------------
         1        7842991811                 1
         2        7842991811                 1
         3        7842991811                 1

这里看到奇怪现象datafile scn小于datafile_header scn,数据库异常断电一般来说也不会出现这样的情况,个人猜测是错误的恢复或者使用历史控制文件导致,对于这样的现状,我先尝试着使用using backup controlfile方式恢复,结果失败.估计控制文件有异常,本着先拉起库原则,重建控制文件.

进行完全恢复

SQL> recover database;
ORA-00283: 恢复会话因错误而取消
ORA-00600: 内部错误代码,参数: [3020], [8388617], [1], [23403], [25], [112],[], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 9)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'
ORA-10560: block type 'KTU SMU HEADER BLOCK'

尝试跳过坏块继续恢复

SQL> recover database allow 1 corruption;
ORA-00283: 恢复会话因错误而取消
ORA-00600: 内部错误代码,参数: [3020], [8388610], [1], [23403], [2264], [16],[], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 2)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'
ORA-10560: block type 'KTFB Bitmapped File Space Header'

使用dbv检查坏块数量

C:\>dbv file='d:\oracle\oradata\cqgasold\undo_1.dbf' blocksize=8192
DBVERIFY: Release 9.2.0.5.0 - Production on 星期二 5月 15 19:43:42 2012
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
DBVERIFY - 验证正在开始 : FILE = d:\oracle\oradata\cqgasold\undo_1.dbf
DBV-00200: 块, dba 8388617, 已经标记为崩溃
汇入的页369 - 可能是介质损坏
***
Corrupt block relative dba: 0x00800171 (file 2, block 369)
Fractured block found during dbv:
Data in bad block -
 type: 2 format: 2 rdba: 0x00800171
 last change scn: 0x0001.d37c57db seq: 0x1 flg: 0x04
 consistency value in tail: 0x4e930260
 check value in block header: 0x8202, computed block checksum: 0x4e5f
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***
汇入的页417 - 可能是介质损坏
***
Corrupt block relative dba: 0x008001a1 (file 2, block 417)
Fractured block found during dbv:
Data in bad block -
 type: 2 format: 2 rdba: 0x008001a1
 last change scn: 0x0001.d37c53d4 seq: 0x2 flg: 0x04
 consistency value in tail: 0x4b6b0201
 check value in block header: 0x6ae7, computed block checksum: 0x5abc
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***
…………
--类此记录很多,我放弃了跳过坏块修复的方法

恢复过程中提示坏块数据库文件离线恢复

SQL> alter database datafile 'd:\oracle\oradata\cqgasold\undo_1.dbf' offline;
数据库已更改。
SQL> recover database;
完成介质恢复。
SQL> alter database open;
alter database open
*
ERROR 位于第 1 行:
ORA-00604: 递归 SQL 层 1 出现错误
ORA-00376: 此时无法读取文件 2
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'

到了这一步,根据经验,数据库被open的可能性很多了,很有可能是open以后因为smon回滚导致数据库down

查看日志,屏蔽回滚段,完成恢复

Tue May 15 19:59:52 2012
alter database open
Tue May 15 19:59:52 2012
Beginning crash recovery of 1 threads
Tue May 15 19:59:52 2012
Started redo scan
Tue May 15 19:59:52 2012
Completed redo scan
 323 redo blocks read, 82 data blocks need recovery
Tue May 15 19:59:52 2012
Started recovery at
 Thread 1: logseq 23404, block 3, scn 0.0
Recovery of Online Redo Log: Thread 1 Group 4 Seq 23404 Reading mem 0
  Mem# 0 errs 0: F:\ORACLE\ORADATA\LOGCQGASOLD4.ORA
Tue May 15 19:59:52 2012
Completed redo application
Tue May 15 19:59:52 2012
Ended recovery at
 Thread 1: logseq 23404, block 326, scn 1.3548264979
 82 data blocks read, 82 data blocks written, 323 redo blocks read
Crash recovery completed successfully
Tue May 15 19:59:53 2012
Thread 1 advanced to log sequence 23405
Thread 1 opened at log sequence 23405
  Current log# 2 seq# 23405 mem# 0: D:\ORACLE\ORADATA\CQGASOLD\REDO02.LOG
Successful open of redo thread 1
Tue May 15 19:59:53 2012
SMON: enabling cache recovery
SMON: enabling tx recovery
Tue May 15 19:59:54 2012
Database Characterset is ZHS16GBK
Tue May 15 19:59:55 2012
replication_dependency_tracking turned off (no async multimaster replication found)
ORA-604 signalled during: alter database open...
Tue May 15 19:59:56 2012
SMON: about to recover undo segment 1
SMON: mark undo segment 1 as needs recovery
SMON: about to recover undo segment 2
SMON: mark undo segment 2 as needs recovery
SMON: about to recover undo segment 3
SMON: mark undo segment 3 as needs recovery
SMON: about to recover undo segment 4
SMON: mark undo segment 4 as needs recovery
SMON: about to recover undo segment 5
SMON: mark undo segment 5 as needs recovery
SMON: about to recover undo segment 6
SMON: mark undo segment 6 as needs recovery
SMON: about to recover undo segment 7
SMON: mark undo segment 7 as needs recovery
SMON: about to recover undo segment 8
SMON: mark undo segment 8 as needs recovery
SMON: about to recover undo segment 9
SMON: mark undo segment 9 as needs recovery
SMON: about to recover undo segment 10
SMON: mark undo segment 10 as needs recovery
Tue May 15 20:00:37 2012
Shutting down instance (abort)

看到这里,可以大概确定是因为undo文件离线,导致回滚段异常.
这个问题,基本上可以确定通过隐含参数屏蔽回滚段,然后open数据库,重建undo删除异常undo,数据库恢复完成。

记录一次rman备份ORA-19502/ORA-27063错误原因分析

rman备份出现ORA-19502/ORA-27063错误

RMAN> 2> 3> 4> 5> 6> 7> 8> 9> 10> 11> 12> 13> 14> 15> 16> 17> 18> 19> 20>
allocated channel: t11
channel t11: sid=824 instance=ncdb1 devtype=DISK
allocated channel: t12
channel t12: sid=838 instance=ncdb1 devtype=DISK
allocated channel: t13
channel t13: sid=809 instance=ncdb1 devtype=DISK
allocated channel: t14
channel t14: sid=886 instance=ncdb1 devtype=DISK
allocated channel: t15
channel t15: sid=620 instance=ncdb1 devtype=DISK
allocated channel: t16
channel t16: sid=599 instance=ncdb1 devtype=DISK
allocated channel: t17
channel t17: sid=482 instance=ncdb1 devtype=DISK
allocated channel: t18
channel t18: sid=506 instance=ncdb1 devtype=DISK
一共开通8个通道
channel t12: starting full datafile backupset
channel t12: specifying datafile(s) in backupset
input datafile fno=00008 name=/dev/rnc32g_39
input datafile fno=00016 name=/dev/rnc32g_47
input datafile fno=00024 name=/dev/rnc32g_57
input datafile fno=00032 name=/dev/rnc32g_25
input datafile fno=00040 name=/dev/rnc32g_33
input datafile fno=00048 name=/dev/rnc32g_3
input datafile fno=00056 name=/dev/rnc32g_11
input datafile fno=00064 name=/dev/rnc32g_19
input datafile fno=00072 name=/dev/rnc32g_67
input datafile fno=00080 name=/dev/rnc32g_106
input datafile fno=00088 name=/dev/rnc32g_114
input datafile fno=00096 name=/dev/rnc32g_87
input datafile fno=00104 name=/dev/rnc32g_95
input datafile fno=00112 name=/dev/rnc32g_103
input datafile fno=00120 name=/dev/rnc32g_75
input datafile fno=00003 name=/dev/rnc50_sysaux
input datafile fno=00130 name=/dev/rnc32g_119
channel t12: starting piece 1 at 14-MAY-12
--通道12备份数据文件
channel t17: starting full datafile backupset
channel t17: specifying datafile(s) in backupset
input datafile fno=00002 name=/dev/rnc32g_22
input datafile fno=00013 name=/dev/rnc32g_44
input datafile fno=00021 name=/dev/rnc32g_54
input datafile fno=00029 name=/dev/rnc32g_62
input datafile fno=00037 name=/dev/rnc32g_30
input datafile fno=00045 name=/dev/rnc32g_38
input datafile fno=00053 name=/dev/rnc32g_8
input datafile fno=00061 name=/dev/rnc32g_16
input datafile fno=00069 name=/dev/rnc32g_64
input datafile fno=00077 name=/dev/rncundo_33g_4
input datafile fno=00085 name=/dev/rnc32g_111
input datafile fno=00093 name=/dev/rnc32g_84
input datafile fno=00101 name=/dev/rnc32g_92
input datafile fno=00109 name=/dev/rnc32g_100
input datafile fno=00117 name=/dev/rnc32g_72
input datafile fno=00006 name=/dev/rnc50_4g_1
channel t17: starting piece 1 at 14-MAY-12
--通道17备份数据文件
channel t15: finished piece 1 at 15-MAY-12
piece handle=/rman/db_mpnb04jl_1_1 tag=TAG20120514T204954 comment=NONE
channel t15: backup set complete, elapsed time: 06:07:59
channel t11: finished piece 1 at 15-MAY-12
piece handle=/rman/db_mlnb04jk_1_1 tag=TAG20120514T204954 comment=NONE
channel t11: backup set complete, elapsed time: 06:17:25
channel t16: finished piece 1 at 15-MAY-12
piece handle=/rman/db_mqnb04jm_1_1 tag=TAG20120514T204954 comment=NONE
channel t16: backup set complete, elapsed time: 06:34:49
channel t14: finished piece 1 at 15-MAY-12
piece handle=/rman/db_monb04jl_1_1 tag=TAG20120514T204954 comment=NONE
channel t14: backup set complete, elapsed time: 06:40:05
channel t18: finished piece 1 at 15-MAY-12
piece handle=/rman/db_msnb04jn_1_1 tag=TAG20120514T204954 comment=NONE
channel t18: backup set complete, elapsed time: 06:43:38
channel t13: finished piece 1 at 15-MAY-12
piece handle=/rman/db_mnnb04jl_1_1 tag=TAG20120514T204954 comment=NONE
channel t13: backup set complete, elapsed time: 07:40:56
--这里可以看出rman的备份完成了通道11/13/14/15/16/18,也就是说目前为止通道12/17未完成.
RMAN-03009: failure of backup command on t12 channel at 05/15/2012 04:39:58
ORA-19502: write error on file "/rman/db_mmnb04jl_1_1", blockno 30481025 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: -1
Additional information: 1048576
ORA-19502: write error on file "/rman/db_mmnb04jl_1_1", blockno 30480897 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
channel t12 disabled, job failed on it will be run on another channel
--通道12报错(硬盘空间不足)
channel t11: starting full datafile backupset
channel t11: specifying datafile(s) in backupset
input datafile fno=00008 name=/dev/rnc32g_39
input datafile fno=00016 name=/dev/rnc32g_47
input datafile fno=00024 name=/dev/rnc32g_57
input datafile fno=00032 name=/dev/rnc32g_25
input datafile fno=00040 name=/dev/rnc32g_33
input datafile fno=00048 name=/dev/rnc32g_3
input datafile fno=00056 name=/dev/rnc32g_11
input datafile fno=00064 name=/dev/rnc32g_19
input datafile fno=00072 name=/dev/rnc32g_67
input datafile fno=00080 name=/dev/rnc32g_106
input datafile fno=00088 name=/dev/rnc32g_114
input datafile fno=00096 name=/dev/rnc32g_87
input datafile fno=00104 name=/dev/rnc32g_95
input datafile fno=00112 name=/dev/rnc32g_103
input datafile fno=00120 name=/dev/rnc32g_75
input datafile fno=00003 name=/dev/rnc50_sysaux
input datafile fno=00130 name=/dev/rnc32g_119
channel t11: starting piece 1 at 15-MAY-12
--在通道12报错后,通道11已经完成了上次备份,所以启动备份通道12出错的数据文件
RMAN-03009: failure of backup command on t17 channel at 05/15/2012 04:39:58
ORA-19502: write error on file "/rman/db_mrnb04jm_1_1", blockno 30753793 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: -1
Additional information: 1048576
ORA-19502: write error on file "/rman/db_mrnb04jm_1_1", blockno 30753665 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
channel t17 disabled, job failed on it will be run on another channel
--通道17也因为磁盘空间报错
channel t13: starting full datafile backupset
channel t13: specifying datafile(s) in backupset
input datafile fno=00002 name=/dev/rnc32g_22
input datafile fno=00013 name=/dev/rnc32g_44
input datafile fno=00021 name=/dev/rnc32g_54
input datafile fno=00029 name=/dev/rnc32g_62
input datafile fno=00037 name=/dev/rnc32g_30
input datafile fno=00045 name=/dev/rnc32g_38
input datafile fno=00053 name=/dev/rnc32g_8
input datafile fno=00061 name=/dev/rnc32g_16
input datafile fno=00069 name=/dev/rnc32g_64
input datafile fno=00077 name=/dev/rncundo_33g_4
input datafile fno=00085 name=/dev/rnc32g_111
input datafile fno=00093 name=/dev/rnc32g_84
input datafile fno=00101 name=/dev/rnc32g_92
input datafile fno=00109 name=/dev/rnc32g_100
input datafile fno=00117 name=/dev/rnc32g_72
input datafile fno=00006 name=/dev/rnc50_4g_1
channel t13: starting piece 1 at 15-MAY-12
--通道13也尝试备份通道17失败的数据文件
RMAN-03009: failure of backup command on t11 channel at 05/15/2012 04:39:59
ORA-19504: failed to create file "/rman/db_mtnb104u_1_1"
ORA-27044: unable to write the header block of file
IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: 3
Addition
--因为当前没有空闲空间,通道11终止,
--这个时候rman异常终止,导致后续的通道13终止记录未打印到日志

阅读完rman日志,很好理解因为存放rman备份的磁盘空间不足导致了一系列错误

检查磁盘剩余空间

Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4          10.00      9.75    3%     6548     1% /
/dev/hd2          10.00      4.55   55%    84383     8% /usr
/dev/hd9var        5.00      4.04   20%     6290     1% /var
/dev/hd3           5.00      3.87   23%     1551     1% /tmp
/dev/hd1          10.00      9.91    1%      382     1% /home
/proc                 -         -    -         -     -  /proc
/dev/hd10opt       5.00      4.89    3%     3502     1% /opt
/dev/archalv      99.00     82.98   17%       96     1% /archa
/dev/fslv01       40.00     19.49   52%    72324     2% /ora10
/dev/fslv00     1800.00    467.25   75%       10     1% /rman

这下让人迷糊了,磁盘空间还剩余467.25G,怎么会报错呢?

分析原因

RMAN-03009: failure of backup command on t12 channel at 05/15/2012 04:39:58
ORA-19502: write error on file "/rman/db_mmnb04jl_1_1", blockno 30481025 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: -1
Additional information: 1048576
ORA-19502: write error on file "/rman/db_mmnb04jl_1_1", blockno 30480897 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
channel t12 disabled, job failed on it will be run on another channel
RMAN-03009: failure of backup command on t17 channel at 05/15/2012 04:39:58
ORA-19502: write error on file "/rman/db_mrnb04jm_1_1", blockno 30753793 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: -1
Additional information: 1048576
ORA-19502: write error on file "/rman/db_mrnb04jm_1_1", blockno 30753665 (blocksize=8192)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
channel t17 disabled, job failed on it will be run on another channel

这两个通道在写入rman备份到磁盘中的时候,在05/15/2012 04:39:58发现磁盘空间不足,两个通道分别准备写入30480897/30753665号块的时候出错,那么当时这两个通道分别写入的数据块数为30480896/30753664,写入文件大小为(30480896+30753664)*8192/1024/1024/1024=467.1826171875G.这里可以看出磁盘剩余空间467.25G,其实当时已经写入了467.1826171875G,继续写入的时候出错.然后rman为了保证备份的正确性,自动删除了当时已经备份的467.1826171875G错误的备份文件.从而在备份结束后看到磁盘空间还有大量剩余而rman包空间不足的现象.

library cache latch等待事件

产生library cache latch原因

The library cache latches protect the cached SQL statements and objects' definitions held
in the library cache within the shared pool. The library cache latch must be acquired
in order to add a new statement to the library cache. During a parse, Oracle searches
the library cache for a matching statement. If one is not found, then Oracle will parse
the SQL statement, obtain the library cache latch and insert the new SQL.

每一个sql被执行之前,先要到library cache中根据hash_value查找parent cursor,这就需要先获得library cache latch;找到parent cursor后,就会去查找对应的child cursor,当发现无法找到时,就会释放library cache latch,获得share pool latch分配空间给硬解析后的产生的执行计划;然后再次获得library cache latch进行把执行计划放入share pool,转入library cache pin+lock(null模式)开始执行sql.library cache latch 的个数有限(与CPU_COUNT参数相关),当数据库中出现大量硬解析的时候,某一个sql无法得到library cache latch就会开始spin,达到spin count后还没得到,就会开始sleep,达到sleep时间后,醒来还再次试图过的library cache latch得不到就在spin再得不到又sleep…依此类推.
综上可知:在sql执行的过程中可以看出在出现High Versions Count和Hard Parse的情况下都有可能出现library cache latch等待.
关于Hard Parse见:shared pool latch 等待事件
关于High Versions Count见:关于High Versions Count总结

1._KGL_LATCH_COUNT控制library cache latches数量

The hidden parameter _KGL_LATCH_COUNT controls the number of library cache latches.
The default value should be adequate, but if contention for the library cache latch cannot  be resolved,
it one may consider increasing this value. The default value for _KGL_LATCH_COUNT is the next prime number
after CPU_COUNT. This value cannot exceed 67. 

2.Library cache: mutex X
在10g及其以后版本中,很多latch使用mutex代替,我们常见的Library cache: mutex X is similar to library cache wait in earlier version.(_kks_use_mutex_pin=false可以禁止mutex)

ORA-07445 [ACCESS_VIOLATION] [UNABLE_TO_READ] []

alert中发现ORA-07445错误
ORA-07445: exception encountered: core dump [PC:0x7FFF65D0] [ACCESS_VIOLATION] [ADDR:0xFFFFFFFF] [PC:0x7FFF65D0] [UNABLE_TO_READ] []错误,导致数据库down掉

Mon May 14 14:34:34 2012
Exception [type: ACCESS_VIOLATION, UNABLE_TO_READ] [ADDR:0xFFFFFFFF] [PC:0x7FFF65D0, {empty}]
Errors in file d:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_p001_1280.trc:
ORA-07445: exception encountered: core dump [PC:0x7FFF65D0] [ACCESS_VIOLATION]
[ADDR:0xFFFFFFFF] [PC:0x7FFF65D0] [UNABLE_TO_READ] []
Mon May 14 14:34:35 2012
Trace dumping is performing id=[cdmp_20120514143435]
Mon May 14 14:35:10 2012
Exception [type: ACCESS_VIOLATION, UNABLE_TO_READ] [ADDR:0xFFFFFFFF] [PC:0x7FFF65D0, {empty}]
Errors in file d:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_smon_1072.trc  (incident=164712):
ORA-07445: exception encountered: core dump [PC:0x7FFF65D0] [ACCESS_VIOLATION]
[ADDR:0xFFFFFFFF] [PC:0x7FFF65D0] [UNABLE_TO_READ] []
ORA-12080: Buffer cache miss for IOQ batching
Incident details in: d:\app\administrator\diag\rdbms\orcl\orcl\incident\incdir_164712\orcl_smon_1072_i164712.trc

分析trace文件

Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Windows Server 2003 Version V5.2 Service Pack 2
CPU                 : 8 - type 586, 4 Physical Cores
Process Affinity    : 0x00000000
Memory (Avail/Total): Ph:4892M/8189M, Ph+PgF:5638M/9795M, VA:925M/4095M
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 12
Windows thread id: 1072, image: ORACLE.EXE (SMON)
--以上信息得出操作系统和数据库版本2003 sp2+oracle11g(11.1.0.6 32位)
Dump continued from file: d:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_smon_1072.trc
ORA-07445: exception encountered: core dump [PC:0x7FFF65D0] [ACCESS_VIOLATION]
[ADDR:0xFFFFFFFF] [PC:0x7FFF65D0] [UNABLE_TO_READ] []
ORA-12080: Buffer cache miss for IOQ batching
========= Dump for incident 164712 (ORA 7445 [PC:0x7FFF65D0]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: ACCESS_VIOLATION, UNABLE_TO_READ] [ADDR:0xFFFFFFFF] [PC:0x7FFF65D0, {empty}]
--这里的ORA-07445 [ACCESS_VIOLATION][UNABLE_TO_READ]根据经验结合这里的32位的环境,
--怀疑是sga使用的内存太多,ORACLE数据库不能读SGA相关内存导致
--在trace中找出相关参数配置.
[0004]: processes=300
[0004]: sessions=335
[0004]: __shared_pool_size=1124073472
[0004]: __large_pool_size=8388608
[0004]: __java_pool_size=16777216
[0004]: __streams_pool_size=251658240
[0004]: streams_pool_size=251658240
[0004]: sga_target=0
[0004]: __sga_target=1887436800
[0004]: memory_target=3145728000
[0004]: memory_max_target=4722786304
[0004]: db_block_size=8192
[0004]: __db_cache_size=478150656
[0004]: __shared_io_pool_size=0
[0004]: compatible=11.1.0.0.0
[0004]: log_buffer=8851456
[0004]: __pga_aggregate_target=780140544
--这里可以看到sga_target分配了内存为1887436800=1.7578125G
--pga_aggregate_target分配了780140544=0.7265625G
--两者内存之和大于2G,超过了32位ORACLE默认限制

查询MOS发现[1341681.1]
该错误原因

This is a resource issue (memory in particular). 32-bit windows systems,
are limited to 2GB of addressable memory so if you are on this platform
it's likely you are simply exceeding the capabilities of the 32bit operating system.

解决建议

First recommendation :
If you have not already done so, add the /3GB switch to your boot.ini file and reboot the server. The
boot.ini will be located in the root directory on the drive where windows is installed. The switch, /3GB,
is placed at the end of the line that executes the WinNT loading process.
This will allow applications such as oracle access to 3Gb or memory instead of 2Gb.
Example:
[operating systems] multi(0)disk(0)rdisk(0)partition(2)\WINNT="Windows NT Server Version 4.00" /3GB
 Second recommendation :
You do not want to increase memory target. If anything, this should be decreased.
You are limited to under 2GB of addressable memory on 32bit windows (the limit is actually about 1.85GB).
This is for both SGA and PGA memory for all instances; you have to reduce the SGA size for the instance.
The recommendation is to reduce sga_target, memory_target, and memory_max_target.

9i库遇到ORA-01595/ORA-01594

在alert日志中发现ORA-01595/ORA-01594错误

Sat May 12 21:54:17 2012
Errors in file /oracle/app/admin/prmdb/bdump/prmdb2_smon_483522.trc:
ORA-01595: error freeing extent (2) of rollback segment (19))
ORA-01594: attempt to wrap into rollback segment (19) extent (2) which is being freed

分析trace文件

Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.8.0 - Production
ORACLE_HOME = /oracle/app/product/9.2.0
System name:    AIX
Node name:      prmsvr2
Release:        3
Version:        5
Machine:        0008585FD600
Instance name: prmdb2
Redo thread mounted by this instance: 2
Oracle process number: 14
Unix process pid: 483522, image: oracle@prmsvr2 (SMON)
*** 2011-05-03 15:28:47.858
*** SESSION ID:(17.1) 2011-05-03 15:28:47.843
*** 2011-05-03 15:28:47.858
SMON: Parallel transaction recovery tried
*** 2011-07-11 17:13:52.028
SMON: Restarting fast_start parallel rollback
*** 2011-07-11 17:28:39.705
SMON: Parallel transaction recovery tried
*** 2012-05-12 21:54:17.246   --当前问题时间点
SMON: following errors trapped and ignored:
ORA-01595: error freeing extent (2) of rollback segment (19))
ORA-01594: attempt to wrap into rollback segment (19) extent (2) which is being freed
--通过trace文件,我们没有获得关于该错误的其他有用信息

查询MOS相关信息[280151.1]
出现该错误原因

This is a known problem and there is an Internal Bug:2181139 for this Issue.
The error is signaled because smon is shrinking a rollback segment and this fails
because we need an extent to store some rollback information. This is a failure message
for the shrinking. Subsequently smon would succeed in doing that.
--当smon在shrink rollback segment因为需要一个extent存放rollback

解决建议

Ignore these error messages.
Normally adding more undo space should solve the problem,
but if space is not correcting the problem, please file an SR for assistance.
This error message logging is fixed in 10g.
--忽略该错误或者升级到10g

关于High Versions Count总结

High Versions Count后果
sql查询–>hash对比确定是否存在shared pool中(不讨论不存在情况)–>选择合适的children
可能导致library cache latch contention

When you have unnecessary versions of a cursor, each time that cursor is executed,
the parse engine has to search through the list of versions to see which is the cursor that you want.
This wastes CPU cycles that you could be using on something else.
High version counts can easily cause high contention for library cache latches.
A process parsing a SQL statement with many versions (children cursors) will need to scan
through all these children while holding on to a library cache latch.
This means that other processes needing the same latch will have to wait and
can lead to significant database-wide performance degradation.

引起High Versions Count原因

•UNBOUND_CURSOR - The existing child cursor was not fully built (in other words, it was not optimized)
•SQL_TYPE_MISMATCH - The SQL type does not match the existing child cursor
•**OPTIMIZER_MISMATCH - The optimizer environment does not match the existing child cursor.
For example:
select count(*) from emp; ->> 1 PARENT, 1 CHILD
alter session set optimizer_mode=ALL_ROWS
select count(*) from emp; ->> 1 PARENT, 2 CHILDREN
    (The optimizer mode has changed and therefore
       the existing child cannot be reused)
(The same applies with events - if I turned on tracing with 10046 than
I would get the OPTIMIZER_MISMATCH again and a 3rd child)
•OUTLINE_MISMATCH - The outlines do not match the existing child cursor
If my user had created stored outlines previously for this command and they were
stored in seperate categories (say "OUTLINES1" and "OUTLINES2") running:-
alter session set use_stored_outlines = OUTLINES1;
select count(*) from emp;
alter session set use_stored_oulines= OUTLINES2;
select count(*) from emp;
--> Would create a 2nd child as the outline used is different than the first run.
•STATS_ROW_MISMATCH - The existing statistics do not match the existing child cursor.
Check that 10046/sql_trace is not set on all sessions as this can cause this.
•LITERAL_MISMATCH - Non-data literal values do not match the existing child cursor
•SEC_DEPTH_MISMATCH - Security level does not match the existing child cursor
•EXPLAIN_PLAN_CURSOR - The child cursor is an explain plan cursor and should not be shared.
Explain plan statements will generate a new child by default - the mismatch will be this.
•BUFFERED_DML_MISMATCH - Buffered DML does not match the existing child cursor
•PDML_ENV_MISMATCH - PDML environment does not match the existing child cursor
•INST_DRTLD_MISMATCH - Insert direct load does not match the existing child cursor
•SLAVE_QC_MISMATCH -The existing child cursor is a slave cursor and the new one was issued by the coordinator
(or, the existing child cursor was issued by the coordinator and the new one is a slave cursor).
•TYPECHECK_MISMATCH - The existing child cursor is not fully optimized
•AUTH_CHECK_MISMATCH - Authorization/translation check failed for the existing child cursor
The user does not have permission to access the object in any previous version of the cursor.
A typical example would be where each user has it's own copy of a table
•**BIND_MISMATCH - The bind metadata does not match the existing child cursor. For example:
variable a varchar2(100);
select count(*) from emp where ename = :a ->> 1 PARENT, 1 CHILD
variable a varchar2(400);
select count(*) from emp where ename = :a ->> 1 PARENT, 2 CHILDREN
   (The bind 'a' has now changed in definition)
•DESCRIBE_MISMATCH - The typecheck heap is not present during the describe for the child cursor
•LANGUAGE_MISMATCH - The language handle does not match the existing child cursor
•TRANSLATION_MISMATCH - The base objects of the existing child cursor do not match.
The definition of the object does not match any current version.
Usually this is indicative of the same issue as "AUTH_CHECK_MISMATCH" where the object is different.
•ROW_LEVEL_SEC_MISMATCH - The row level security policies do not match
•INSUFF_PRIVS - Insufficient privileges on objects referenced by the existing child cursor
•INSUFF_PRIVS_REM - Insufficient privileges on remote objects referenced by the existing child cursor
•REMOTE_TRANS_MISMATCH - The remote base objects of the existing child cursor do not match
USER1: select count(*) from table@remote_db
USER2: select count(*) from table@remote_db
  (Although the SQL is identical, the dblink pointed to
   by remote_db may be a private dblink which resolves
   to a different object altogether)
•LOGMINER_SESSION_MISMATCH
•INCOMP_LTRL_MISMATCH
•OVERLAP_TIME_MISMATCH - error_on_overlap_time mismatch
•SQL_REDIRECT_MISMATCH - sql redirection mismatch
•MV_QUERY_GEN_MISMATCH - materialized view query generation
•USER_BIND_PEEK_MISMATCH - user bind peek mismatch
•TYPCHK_DEP_MISMATCH - cursor has typecheck dependencies
•NO_TRIGGER_MISMATCH - no trigger mismatch
•FLASHBACK_CURSOR - No cursor sharing for flashback
•ANYDATA_TRANSFORMATION - anydata transformation change
•INCOMPLETE_CURSOR - incomplete cursor.
When bind length is upgradeable (i.e. we found a child cursor that matches everything
else except that the bind length is not long enough), we mark the old cursor is not usable
and build a new one.  This means the version can be ignored.
•TOP_LEVEL_RPI_CURSOR - top level/rpi cursor
In a Parallel Query invocation this is expected behaviour (we purposely do not share)
•DIFFERENT_LONG_LENGTH - different long length
•LOGICAL_STANDBY_APPLY - logical standby apply mismatch
•DIFF_CALL_DURN - different call duration
•BIND_UACS_DIFF - bind uacs mismatch
•PLSQL_CMP_SWITCHS_DIFF - plsql compiler switches mismatch
•CURSOR_PARTS_MISMATCH - cursor-parts executed mismatch
•STB_OBJECT_MISMATCH - STB object different (now exists)
•ROW_SHIP_MISMATCH - row shipping capability mismatch
•PQ_SLAVE_MISMATCH - PQ slave mismatch
Check you want to be using PX with this reason code, as the problem could be caused by running
lots of small SQL statements which do not really need PX. If you are on < 11i you may be hitting Bug:4367986
•TOP_LEVEL_DDL_MISMATCH - top-level DDL cursor
•MULTI_PX_MISMATCH - multi-px and slave-compiled cursor
•BIND_PEEKED_PQ_MISMATCH - bind-peeked PQ cursor
•MV_REWRITE_MISMATCH - MV rewrite cursor
•ROLL_INVALID_MISMATCH - rolling invalidation window exceeded
This is caused by the rolling invalidation capability in DBMS_STATS.
The child cannot be shared as it's invalidation window is exceeded. See:
Note:557661.1  Rolling Cursor Invalidations with DBMS_STATS in Oracle10g (Doc ID 557661.1)
•OPTIMIZER_MODE_MISMATCH - optimizer mode mismatch
•PX_MISMATCH - parallel query mismatch
If running 11.1.0.6 and RAC see Bug:7352775.
Check that if (on each instance) parallel_instance_groups is set then instance_groups is set to the same.
•MV_STALEOBJ_MISMATCH - mv stale object mismatch
•FLASHBACK_TABLE_MISMATCH - flashback table mismatch
•LITREP_COMP_MISMATCH - literal replacement compilation mismatch
New in 11g :
•PLSQL_DEBUG - debug mismatch Session has debugging parameter plsql_debug set to true
•LOAD_OPTIMIZER_STATS  - Load optimizer stats for cursor sharing
•ACL_MISMATCH   -  Check ACL mismatch
•FLASHBACK_ARCHIVE_MISMATCH  - Flashback archive mismatch
•LOCK_USER_SCHEMA_FAILED  - Failed to lock user and schema
•REMOTE_MAPPING_MISMATCH  - Remote mapping mismatch
•LOAD_RUNTIME_HEAP_FAILED  - Runtime heap mismatch
•HASH_MATCH_FAILED  - Hash mismatch
Set to "Y" if sharing fails due to a hash mismatch, such as the case with mismatched histogram data
or a range predicate marked as unsafe by literal replacement (See Bug:3461251)
New in 11.2  :
•PURGED_CURSOR - cursor marked for purging
The cursor has been marked for purging with dbms_shared_pool.purge
•BIND_LENGTH_UPGRADEABLE - bind length upgradeable
 Could not be shared because a bind variable size was smaller than the new value beiing inserted
(marked as BIND_MISMATCH in earlier versions).
•USE_FEEDBACK_STATS - cardinality feedback
Cardinality feedback is being used and therefore a new plan could be formed for the current execution.
•BIND_EQUIV_FAILURE - The bind value's selectivity does not match that used to optimize the existing child cursor
There is no longer  ROW_LEVEL_SEC_MISMATCH in 11.2.

High Versions Count查询
1.使用version_rpt function查询

--install version_rpt3_21
connect / as sysdba
start version_rpt3_21.sql
-- Generate reports for all cursors with more than 100 versions using SQL_ID (10g and up)
select b.* from v$sqlarea a ,table(version_rpt(a.sql_id)) b where loaded_versions >=100;
-- Generate reports for all cursors with more than 100 versions using HASH_VALUE
select b.* from v$sqlarea a ,table(version_rpt(null,a.hash_value)) b where loaded_versions>=100;
-- Generate the report for cursor with sql_id cyzznbykb509s
select * from table(version_rpt('cyzznbykb509s'));

2.直接查询

select sql_id,version_count, ADDRESS,sql_text
from   v$sqlarea
where version_count > 10
order by version_count, hash_value;
•Version 9.2.X.X and below :
select * from v$sql_shared_cursor where kglhdpar = '0000000386BC2E58'
•Version 10.0.X.X and above:
select * from v$sql_shared_cursor where address = '0000000386BC2E58'
NOTE:The 'Y's denote a mismatch

High Versions Count跟踪

CURSORTRACE(10G及其以后版本)
TO trace on using:-
alter system set events
'immediate trace name cursortrace level 577, address hash_value';
(level 578/580 can be used for high level tracing (577=level 1, 578=level 2, 580=level 3)
To turn off tracing use:-
alter system set events
'immediate trace name cursortrace level 2147483648, address 1';
Please note: BUG:5555371 exists in 10.2 (fixed in 10.2.0.4) where cursor trace cannot fully be turned off
and single line entries will still be made to the trace file as a result.
The w/a is to restart the instance. How invasive this BUG is depends on the executions of the cursor
(and the size of the resultant trace file additions)
cursordump(11GR2)
alter system set events 'immediate trace name cursordump level 16'

version_rpt3_21脚本:下载
参考:Troubleshooting: High Version Count Issues [ID 296377.1]

ORACLE在AIX中产生SOFTWARE PROGRAM ABNORMALLY TERMINATED警告原因

数据库中发现如下错误
该错误的解决方案:ORA-07445[dbgrmqmqpk_query_pick_key()+0f88]

Dump file /oracle/diag/rdbms/sgerp5/sgerp5/incident/incdir_579300/sgerp5_m000_7602504_i579300.trc
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/11.1.0/db_1
System name:    AIX
Node name:  sgerp5
Release:    1
Version:    6
Machine:    00C8F0564C00
Instance name: sgerp5
Redo thread mounted by this instance: 1
Oracle process number: 138
Unix process pid: 7602504, image: oracle@sgerp5 (m000)
*** 2012-05-11 03:52:35.200
*** SESSION ID:(752.5029) 2012-05-11 03:52:35.200
*** CLIENT ID:() 2012-05-11 03:52:35.200
*** SERVICE NAME:(SYS$BACKGROUND) 2012-05-11 03:52:35.200
*** MODULE NAME:(MMON_SLAVE) 2012-05-11 03:52:35.200
*** ACTION NAME:(Auto-Purge Slave Action) 2012-05-11 03:52:35.200
Dump continued from file: /oracle/diag/rdbms/sgerp5/sgerp5/trace/sgerp5_m000_7602504.trc
ORA-07445: exception encountered: core dump [dbgrmqmqpk_query_pick_key()+0f88]
[SIGSEGV] [ADDR:0xB38F0000000049][PC:0x100213C08] [Address not mapped to object] []

errpt错误说明
在产生7445错误的同时观察aix系统错误日志发现SOFTWARE PROGRAM ABNORMALLY TERMINATED错误

sgerp5_[oracle]-->errpt -aj A924A5FC
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC
Date/Time:       Fri May 11 03:52:55 BEIST 2012
Sequence Number: 471
Machine Id:      00C8F0564C00
Node Id:         sgerp5
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes
SOFTWARE PROGRAM
User Causes
USER GENERATED SIGNAL
        Recommended Actions
        CORRECT THEN RETRY
Failure Causes
SOFTWARE PROGRAM
        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER
           6
USER'S PROCESS ID:
               7602504
FILE SYSTEM SERIAL NUMBER
          14
INODE NUMBER
           0      367648
CORE FILE NAME
/oracle/diag/rdbms/sgerp5/sgerp5/cdump/core_7602504/core
PROGRAM NAME
oracle
STACK EXECUTION DISABLED
           0
COME FROM ADDRESS REGISTER
sskgmcrea 0
PROCESSOR ID
  hw_fru_id: 1
  hw_cpu_id: 2
ADDITIONAL INFORMATION
skgdbgcra 224
??
ksdbgcra 3D0
ssexhd 978
??
Symptom Data
REPORTABLE
1
INTERNAL ERROR
SYMPTOM CODE
PCSS/SPI2 FLDS/oracle SIG/6 FLDS/skgdbgcra VALU/224

错误原因

This error is logged when a software program abnormally ends and causes a core dump. Users might
not be exiting applications correctly, the system might have been shut down while users were
working in application, or the user's terminal might have locked up and the application stopped
1)这里也就是说如果oracle进程在aix机器上异常终止,并且产生了一个core dump文件,
  就会出现SOFTWARE PROGRAM ABNORMALLY TERMINATED警告信息
2)用户登录系统没有正常退出,而系统被关闭
3)用户强制终止一个一个lock,而导致进程停止

本次AIX日志警告原因:由于进程7602504异常终止(ORA-07445错误)并且产生了 /oracle/diag/rdbms/sgerp5/sgerp5/cdump/core_7602504/core dump 文件,从而有了AIX中的SOFTWARE PROGRAM ABNORMALLY TERMINATED警告信息