一、问题由来
今天有网友和我说,他的数据库始终只能mount,不能打开到open状态,具体状态为:
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01154: database busy. Open, close, mount, and dismount not allowed now
启动数据库一直停留在Database mounted.
最后报ORA-03113: end-of-file on communication channel
二、数据库环境
OS version:HP-UX Itanium version 11.31
Oracle version:Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit
三、日志文件错误
查看alert日志
Fri Nov 11 09:31:01 2011
Completed: ALTER DATABASE MOUNT
Fri Nov 11 09:31:08 2011
ALTER DATABASE OPEN
Fri Nov 11 09:40:32 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_dbw0_3122.trc:
ORA-00240: control file enqueue held for more than 120 seconds
Fri Nov 11 09:47:18 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_dbw0_3122.trc:
Fri Nov 11 10:02:53 2011
Errors in file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 3122'
Fri Nov 11 10:02:55 2011
System State dumped to trace file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc
Killing enqueue blocker (pid=3122) on resource CF-00000000-00000000
by killing session 54.1
Fri Nov 11 10:07:57 2011
Errors in file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 3122'
Fri Nov 11 10:07:59 2011
System State dumped to trace file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc
Killing enqueue blocker (pid=3122) on resource CF-00000000-00000000
by terminating the process
USER: terminating instance due to error 2103
Instance terminated by USER, pid = 3555
四、分析ORA-00240和ORA-00494
[oracle@node1 ~]$ oerr ora 240
00240, 00000, "control file enqueue held for more than %s seconds"
// *Cause: The current process did not release the control file enqueue within
// the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
// Services with the error information.
[oracle@node1 ~]$ oerr ora 494
00494, 00000, "enqueue%s held for too long%s by '%s'"
// *Cause: The specified process did not release the enqueue within
// the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
// Services with the incident information.
ORA-00240: control file enqueue held for more than 120 seconds
ORA-00494: enqueue [CF] held for too long (more than 900 seconds)
dbw0进程持有control file enqueue超过了默认的900s,导致系统kill掉持有这个enqueue的会话,然后数据库自动终止。现在问题定位在control file enqueue的持有时间为什么会超时(ORA-00240)上面
五、深入理解ORA-00240
ORA-00240: control file enqueue held for more than 120 seconds
The Developers confirmed that this is just a warning to let the DBA know that a CF enqueue is being held for more than 120 seconds.
This is not an error, and error will occur if a CF enqueue is held for more than 900 seconds (15 minutes) and this is not the case here.
The message occurs when there are many datafiles in the database. DBWriter (dbw0) is taking too much time to release the CF enqueue due to having to open these datafiles.
从mos的这几句中,可以分析出来,是因为dbw0在open数据文件的过程中占用了太多时间的CF enqueue。这说明和存储或者asm有关系
六、继续分析日志
分析日志,找出出现ORA-00240和ORA-00494之前出现过什么错误
Stopping Job queue slave processes, flags = 7
Thu Nov 10 17:52:26 2011
Job queue slave processes stopped
Thu Nov 10 17:54:23 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_asmb_26101.trc:
ORA-00600: internal error code, arguments: [kffmAllocate_1], [8], [2], [], [], [], [], []
Thu Nov 10 17:54:24 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_asmb_26101.trc:
ORA-00600: internal error code, arguments: [kffmAllocate_1], [8], [2], [], [], [], [], []
Thu Nov 10 17:54:24 2011
ASMB: terminating instance due to error 486
Termination issued to instance processes. Waiting for the processes to exit
Instance terminated by ASMB, pid = 26101
通过这个,发现和asm有关系,更加确定了自己的猜测
七、分析asm日志
Fri Nov 11 09:44:42 2011
Errors in file /opt/oracle/oracle1/admin/+ASM/bdump/+asm_gmon_24008.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: 4
Additional information: 2044
Additional information: -1
+asm_gmon_24008.trc中内容也就报这些错误,不再贴出
八、mos中查找ORA-27091和ORA-27072
Bug 8236874 – HP-Itanium: Intermittent ORA-27091, ORA-27072, Additional information: 4 on HPUX中有类此描述
ORA-01115: IO error reading block from file fff (block # bbb)
ORA-01110: data file 5: '/dev/xxx'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
Additional information: 4
Additional information: xxx
Additional information: yyy
问题很类此,进一步确信是asm的问题
八、在查询mos中ORA-00240和asm相关主题
发现HP Itanium – ORA-240 or process on ASM & Database hang [ID 1105825.1],阅读完毕,马上查去阅读gminer_ora_3555.trc文件
果然发现和mos中一致的描述
*** SESSION ID:(40.1) 2011-11-11 09:36:59.860
Waited for detached process: DBW0 for 300 seconds:
*** 2011-11-11 09:36:59.860
Dumping diagnostic information for DBW0:
OS pid = 3122
loadavg : 0.00 0.01 0.01
Swapinfo :
Avail = 124485.08Mb Used = 62062.20Mb
Swap free = 62422.88Mb Kernel rsvd = 4492.72Mb
Free Mem = 60374.91Mb
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S oracle 3122 1 0 154 20 e00000090de8b980 55372 e0000009203b1340 09:19:12 ? 0:06 ora_dbw0_gminer
Attaching to program: /opt/oracle/oracle1/bin/oracle, process 3122
warning: The shared libraries were not privately mapped; setting a
breakpoint in a shared library will not work until you rerun the program;
stepping over longjmp calls will not work as expected.
Please set the kernel variable "shlib_debug_enable" to 1 to enable the shared library debugging
warning: Load module /opt/oracle/oracle1/lib/libskgxp10.so has been stripped.
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libhasgen10.so has been stripped.
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libocr10.so has been stripped.
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libocrb10.so has been stripped.
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libocrutl10.so has been stripped.
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libdbcfg10.so has been stripped.
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libnnz10.so has been stripped.
Debugging information is not available.
0xc00000000043e3f0:0 in pw_wait+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) (gdb) #0 0xc00000000043e3f0:0 in pw_wait+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0x40000000094e5ee0:0 in pw_wait () at sskgp.c:2245
#2 0x4000000002cb36c0:0 in sskgpwwait () at sskgp.c:926
#3 0x4000000002f16440:0 in skgpwwait () at skgp.c:2137
#4 0x4000000002e28770:0 in ksliwat () at ksl.c:7428
#5 0x400000000402a7f0:0 in kslwaitns_timed () at ksl.c:7652
#6 0x4000000002d3de40:0 in kskthbwt () at ksk.c:2263
#7 0x4000000002d3c940:0 in kslwait () at ksl.c:7616
#8 0x40000000041780e0:0 in ksvsubmit () at ksv.c:1975
#9 0x4000000004d5b710:0 in kfncSlaveSubmit () at kfnc.c:2913
#10 0x4000000004d63120:0 in kfncFileIdentify () at kfnc.c:1638
#11 0x4000000004cfb890:0 in kfioIdentify () at kfio.c:1431
#12 0x4000000004137c20:0 in ksfd_osmopn () at ksfd.c:16583
#13 0x40000000044c9850:0 in $cold_ksfdopn+0xc0 () at kfio.c:1654
#14 0x4000000002630aa0:0 in kcfbid () at kcf.c:13447
#15 0x40000000025b72f0:0 in kcfida () at kcf.c:13599
#16 0x4000000002d3bc80:0 in ksbabs () at ksb.c:1390
#17 0x4000000003aed5f0:0 in ksbrdp () at ksb.c:2971
#18 0x4000000002807f50:0 in opirip () at opirip.c:283
#19 0x400000000200f680:0 in $cold_opidrv+0x580 ()
#20 0x400000000277edf0:0 in sou2o () at sou2o.c:86
#21 0x4000000001fff020:0 in $cold_opimai_real+0x280 ()
#22 0x400000000269c5e0:0 in main () at opimai.c:173
(gdb) Detaching from program: /opt/oracle/oracle1/bin/oracle, process 3122
九、HP官网描述
Issue
Any questions on the use of software developed and maintained by Oracle should involve Oracle technical support. Oracle, not HP, supports Oracle software.
All ITRC HP-UX version 11.31 March 2009 patch bundles were applied to two HP Integrity rx3600-based Oracle RAC nodes. After this installation, the crsd.bin process from the Oracle RAC Clustering software seemed to hang at random intervals, causing a cluster check process launched by a third party to wait indefinitely and to eventually consume all available RAM memory.
A situation where crsd.bin hangs may cause RAC to issue a Transfer of Control (TOC), that is, a system restart with crash dump.
Solution
If any patches related to Oracle RAC are included in a patch bundle (whether it is HP-UX specific patching that RAC depends upon or vice versa), RAC requires that the binaries be relinked prior to starting them.
An alternative might be to (temporarily) uninstall the patch bundle.
The relink of the binaries should be done by a DBA, they should know how to perform that action.
In the current situation, relinking the Oracle RAC binaries resolved the issue.
NOTE:
It seems this problem can also be caused by the installation of HP patches PHKL_38762 or PHKL_39145. QXCR1000940361 tracked the problem. This has been fixed now, and the problem is not present anymore when installing the superseded patch PHKL_40208.
The problem of QXCR1000940361 has existed before the fix of PHKL_39145 was introduced so we dont set a patch warning.
十、解决方案
问题定位完毕是因为缺少打HP patches PHKL_38762 or PHKL_39145的补丁导致
根据HP官网的建议,打HP patches PHKL_38762 or PHKL_39145解决这个问题,也可以打上PHKL_40208补丁问题解决
十一、补充说明
1、因为系统以前可以运行,那么我猜测这个bug应该有偶然性,让网友重启asm,然后启动数据库正常
2、在处理问题过程中,因缺少耐心,开始没有仔细阅读trace文件,导致问题分析过程中走了一些弯路
3、这个是我第一次发现因为系统补丁没有打,导致数据库异常