今天一朋友让我帮忙分析他们的9.2.0.2 rac 节点2异常down掉原因,相关信息如下:
前提信息
OS:HP-UX B.11.31
DB:9.2.0.2.0 RAC
节点2alert日志信息
Fri Sep 7 13:13:49 2012
ARC0: Completed archiving log 11 thread 2 sequence 11651
Fri Sep 7 13:31:56 2012
Errors in file /oracle/admin/agent/udump/agent2_ora_797.trc:
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00028: your session has been killed
Fri Sep 7 13:31:58 2012
Errors in file /oracle/admin/agent/bdump/agent2_pmon_5938.trc:
ORA-30013: undo tablespace 'UNDOTBS1' is currently in use
Fri Sep 7 13:31:58 2012
PMON: terminating instance due to error 30013
Fri Sep 7 13:31:58 2012
Errors in file /oracle/admin/agent/bdump/agent2_lms7_6033.trc:
ORA-30013: undo tablespace '' is currently in use
Fri Sep 7 13:31:58 2012
…………
Errors in file /oracle/admin/agent/bdump/agent2_lms0_6027.trc:
ORA-30013: undo tablespace '' is currently in use
Fri Sep 7 13:31:58 2012
System state dump is made for local instance
Fri Sep 7 13:32:03 2012
Instance terminated by PMON, pid = 5938
Fri Sep 7 14:34:35 2012
这里可以看到因为ORA-30013的错误使得pmon进程异常,从而使得该rac的节点2 down掉.同时这里还发现了ORA-00600[kgavsd_3]错误,是否是因为该ORA-600导致了数据库异常down还是一个偶然机会,我们继续分析
查看ORA-600[kgavsd_3]相关trace文件
*********START PLSQL RUNTIME DUMP************
***Got ORA-28 while running PLSQL***
***********END PLSQL RUNTIME DUMP************
*** 2012-09-07 13:31:56.740
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00028: your session has been killed
Current SQL statement for this session:
--用户补档
DECLARE
OUT_ERR_CODE NUMBER;
OUT_ERR_MSG VARCHAR2(1000);
V_COUNT NUMBER;
BEGIN
WHILE TRUE LOOP
SELECT COUNT(*) INTO V_COUNT FROM amc_stat_log where proc_name in('pRunOdsChannelWareData')
and run_param=201208 AND STATE='A';
IF V_COUNT>0 THEN
dbms_output.put_line('exit loop '|| sysdate);
EXIT;
END IF;
sys.Dbms_Lock.sleep(600);
dbms_output.put_line('wake up '|| sysdate);
END LOOP;
PKG_AME_ODS_DATA.P_Add_TO_AgentServ(201208,OUT_ERR_CODE,OUT_ERR_MSG);
PKG_AME_ODS_DATA.P_Update_Serv_Ware_ID(201208, OUT_ERR_CODE, OUT_ERR_MSG);
PKG_AMS_SETTLE.P_COMMISION_51PRE_FLAG(201208,OUT_ERR_CODE, OUT_ERR_MSG);
END;
----- PL/SQL Call Stack -----
object line object
handle number name
c000000e84ec59d0 14 anonymous block
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
Cannot find symbol in .
Cannot find symbol in .
Cannot find symbol in .
ksedmp()+512 call 9fffffffffff3940 000000000 ?
C000000000000A17 ?
40000000025B6540 ?
ksfdmp()+64 call 9fffffffffff3940 000000003 ?
kgerinv()+352 call 9fffffffffff3940 60000000000466B0 ?
000000003 ?
C000000000000714 ?
4000000004EBA6A0 ?
00001821B ?
6000000000468EA8 ?
kgesinv()+48 call 9fffffffffff3940 60000000000466B0 ?
600000000059BD98 ?
600000000046B070 ?
60000000000179C0 ?
6000000000017950 ?
kgesin()+112 call 9fffffffffff3940 60000000000466B0 ?
600000000059BD98 ?
4000000000B44C10 ?
000000001 ?
9FFFFFFFFFFF4310 ?
$cold_kgavsd_stackl call 9fffffffffff3940 60000000000466B0 ?
et_done()+1184 600000000059BD98 ?
4000000000B44C10 ?
000000001 ? 000000000 ?
000000000 ? 000000000 ?
00001FE19 ?
pbesd_stacklet_done call 9fffffffffff3940 60000000000466B0 ?
()+96 000000002 ? 000000000 ?
9FFFFFFFBEC6AE68 ?
pfrrun()+3328 call 9fffffffffff3940 9FFFFFFFBEC6AE68 ?
C000000000001D42 ?
4000000001ABBDA0 ?
9FFFFFFFBEC6B06C ?
9FFFFFFFFFFF64F0 ?
6000000000468EA8 ?
000000000 ? 000000000 ?
peicnt()+480 call 0000000000000000 9FFFFFFFBEC6AEE8 ?
C000000000000996 ?
40000000019F3680 ?
kkxexe()+816 call 0000000000000000 9FFFFFFFFFFF64F0 ?
9FFFFFFFBEC6AE68 ?
C00000000000099B ?
4000000001DD90F0 ?
00000FE4F ?
9FFFFFFFFFFF5F00 ?
60000000000467F0 ?
4000000000B603F0 ?
opiexe()+11168 call 0000000000000000 000000000 ?
C000000000002C60 ?
4000000001BEF980 ?
00000800F ?
9FFFFFFFFFFF6470 ?
9FFFFFFFBEC6AEB2 ?
6000000000040900 ?
9FFFFFFFBEC6B534 ?
opiall0()+3184 call _etext_f()+23058430 000000004 ? 0000000C4 ?
09110686928 9FFFFFFFFFFF7B40 ?
C000000000002BDF ?
4000000001B26CD0 ?
000000000 ? 00000C893 ?
9FFFFFFFFFFF6690 ?
Cannot find symbol in .
kpoal8()+2064 call 9fffffffffff7ad0 000000001 ?
9FFFFFFFFFFF8304 ?
FFFFFFFFBFFFFFFF ?
9FFFFFFFFFFF83E4 ?
FFFFFFFFFFE7FBDF ?
9FFFFFFFFFFF7B88 ?
000000000 ?
6000000000474528 ?
opiodr()+3584 call 9fffffffffff81fc 6000000000040950 ?
000000000 ? 000000000 ?
C000000000002C60 ?
4000000001C09FE0 ?
00000C50B ?
9FFFFFFFFFFF81F0 ?
9FFFFFFFFFFF81D8 ?
ttcpip()+3776 call _etext_f()+23058430 00000005E ? 000000014 ?
09114957288 9FFFFFFFFFFFA5F0 ?
6000000000040918 ?
C000000000001ABD ?
4000000001AB3BA0 ?
000000000 ? 00000C59B ?
opitsk()+1872 call 9fffffffffffa200 6000000000049FC0 ?
000000001 ?
9FFFFFFFFFFFA5F0 ?
000000001 ?
9FFFFFFFFFFFA740 ?
9FFFFFFFFFFFA564 ?
9FFFFFFFBF780058 ?
000000000 ?
opiino()+3184 call 000000000000057b 000000000 ? 000000000 ?
C00000000000132B ?
4000000001F78730 ?
000008001 ?
opiodr()+3584 call 0000000000000000 6000000000548A38 ?
4000000000B606F0 ?
6000000000548A38 ?
C000000000002C60 ?
4000000001C09FE0 ?
00000A201 ?
9FFFFFFFFFFFBC90 ?
4000000000B606F0 ?
opidrv()+976 call _etext_f()+23058430 00000003C ? 000000004 ?
09114957288 9FFFFFFFFFFFEFB0 ?
6000000000040918 ?
sou2o()+80 call _etext_f()+23058430 000000004 ? 000000004 ?
09114957288 9FFFFFFFFFFFEFB0 ?
main()+352 call _etext_f()+23058430 9FFFFFFFFFFFEFD0 ?
09114957288 9FFFFFFFFFFFEFD4 ?
60000000004744F0 ?
9FFFFFFFFFFFEFB0 ?
main_opd_entry()+80 call _etext_f()+23058430 000000000 ?
09114957288 9FFFFFFFFFFFF498 ?
C000000000000004 ?
C00000000002BE30 ?
--------------------- Binary Stack Dump ---------------------
Process global information:
process: c000000d6428c0c0, call: c000000e46e772a8, xact: 0000000000000000,
curses: c000000d6437d020, usrses: c000000d6437d020
----------------------------------------
SO: c000000d6428c0c0, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=282, calls cur/top: c000000e46e772a8/c000000e46e772a8, flag: (0) -
int error: 28, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: c000000d64234f18 1 6
last post sent: 0 0 104
last post sent-location: kglpsl: in loop
last process posted by me: c000000d6428e900 23 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c000000d62234ee0
O/S info: user: oracle, term: UNKNOWN, ospid: 797
OSD pid info: Unix process pid: 797, image: oracle@gzagent2 (TNS V1-V3)
----------------------------------------
SO: c000000d6437d020, type: 4, owner: c000000d6428c0c0, flag: INIT/-/-/0x00
(session) trans: 0000000000000000, creator: c000000d6428c0c0, flag: (41) USR/- BSY/-/-/-/KIL/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: c000000e76b03f10, psql: 0000000000000000, user: 31/CUSTOM
O/S info: user: huangqianhai_lc, term: SVCTAG-D1MLV2X, ospid: 11124:11796, machine: WORKGROUP\SVCTAG-D1MLV2X
program: plsqldev.exe
application name: PL/SQL Developer, hash value=1190136663
action name: 测试窗口 - 新建, hash value=3604520210
last wait for 'null event' blocking sess=0x0 seq=142 wait_time=567341620
=ea60, =0, =0
temporary object counter: 0
----------------------------------------
通过这里可以看出来是因为pl/sql dev进行一个plsql的操作导致该错误发生,查询MOS[ID 403575.1]发现
Applies to:
Oracle Server - Enterprise Edition - Version: 9.2.0.7 and later [Release: 9.2 and later ]
Information in this document applies to any platform.
***Checked for relevance on 27-Oct-2010***
Symptoms
The following errors appears in the alert log file :
Probe:read_pipe: receive failed, status 3
Probe:S:debug_loop: timeout. Action 1
*********START PLSQL RUNTIME DUMP************
***Got ORA-604 while running PLSQL***
***********END PLSQL RUNTIME DUMP************
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00604: error occurred at recursive SQL level 2
Current SQL statement for this session:
begin :id := sys.dbms_transaction.local_transaction_id; end;
.
Cause
The ora-600 kgavsd_3 appears when calling kgavsd_stacklet_done function which is related to PLSQL DEBUG.
From traces, dbms_debug package is being used during trace generation.
The return code of 3 further indicates that the dbms_pipe message was interrupted.
Probably user cancelled a plsql program, so the appeared while trying to dump the stack
Solution
There is no data corruption over here.
The error appears to be due to abnormal termination of aPL/SQL Developer application while executing a PL/SQL block.
Changing the PL/SQL and/or the procedure code could help in avoiding this error message.
Hence, this error can be safely ignored.
查找trace文件确实发现有name=SYS.DBMS_DEBUG,进一步表明该错误是由于plsql dev工具使用debug模式运行上面的plsql而引起该错误的发生,但是因为mos中记录和错误不是完全的一致,所以不能十分确定是该错误导致数据库down掉
继续分析ORA-30013
Error: ORA-30013 (ORA-30013)
Text: undo tablespace '%s' is currently in use
---------------------------------------------------------------------------
Cause: the specified undo tablespace is currently used by another instance.
Action: Wait for the undo tablespace to become available or change to another name and reissue the statement.
这个说明是没有疑问的:因为2节点配置的当前undo是UNDOTBS2,而UNDOTBS1是1节点使用的,证明这里的undo确实发生了错误,继续查询mos发现Bug 3368552
Hdr: 3368552 9.2.0.3 RDBMS 9.2.0.3 RAC PRODID-5 PORTID-23
Abstract: RAC: ORA-30013 WHEN INSTANCE 2 ATTEMPTS TO ACCESS UNDO TABLESPACE OF INSTANCE 1
*** 01/12/04 06:21 am ***
TAR:
----
3554549.995
PROBLEM:
--------
The RAC database has been stable, but experienced an instance termination due
to ORA-30031 error in the alert log (instance 2):
...
Tue Dec 23 03:01:46 2003
ARC1: Evaluating archive log 4 thread 2 sequence 1116
ARC1: Beginning to archive log 4 thread 2 sequence 1116
Creating archive destination LOG_ARCHIVE_DEST_1:
'/oracle/oradata/VLDB/logs/archives/VLDBN2/VLDB_0000001116_0002.arc'
ARC1: Completed archiving log 4 thread 2 sequence 1116
Tue Dec 23 08:14:09 2003
Errors in file /oracle/admin/VLDB/bdump/vldbn2_pmon_22860.trc:
ORA-30013: undo tablespace 'UNDOTBS1' is currently in use
Tue Dec 23 08:14:09 2003
PMON: terminating instance due to error 30013
Tue Dec 23 08:14:10 2003
System state dump is made for local instance
Tue Dec 23 08:14:12 2003
Trace dumping is performing id=[cdmp_20031223081410]
Tue Dec 23 08:14:14 2003
Instance terminated by PMON, pid = 22860
<eof>
Instance 1 alert log shows only the reconfiguration and the cdump info:
..
Tue Dec 23 03:54:13 2003
Errors in file /oracle/admin/VLDB/udump/vldbn1_ora_13564.trc:
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
Additional information: 4
Tue Dec 23 08:14:10 2003
Trace dumping is performing id=[cdmp_20031223081410]
Tue Dec 23 08:14:12 2003
Reconfiguration started
List of nodes: 0,
Global Resource Directory frozen
one node partition
Communication channels reestablished
...
因为在9.2.0.3的RAC中有着该bug,那么我们可以大胆猜测在9.2.0.2中应该存在该bug,那么结合上面的ORA-00600[kgavsd_3]错误,我们大概还原该事故的全部:
1.节点1 dml操作了程序中报错的plsql中要范围的部分表对象,但是未提交(或者正在执行)
2.节点2 有用户使用pl/sql dev去执行程序中的plsql,因为是debug模式执行,需要UNDOTBS1的块来构建cr,从而使得节点2去访问UNDOTBS1,引发了Bug 3368552 从而使得数据库直接kill掉该plsql dev会话,进而出现ORA-00600[kgavsd_3]错误和pmon进程异常使得节点2 down掉