联系:手机/微信(+86 17813235971) QQ(107644445)
标题:11.2.0.3 adg库因 bug 16427872 导致smon占用大量cpu
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
检查数据库发现客户有一套核心的ADG库smon进程负载异常,单进程一直持有cpu 100%
[oracle@q9adg01 trace]$ top -c top - 14:00:14 up 83 days, 21:39, 4 users, load average: 10.34, 11.55, 11.25 Tasks: 1162 total, 3 running, 1157 sleeping, 0 stopped, 2 zombie Cpu(s): 1.7%us, 1.2%sy, 0.0%ni, 86.2%id, 10.7%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 264253752k total, 200445076k used, 63808676k free, 757684k buffers Swap: 33554424k total, 0k used, 33554424k free, 6529220k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5707 oracle 25 0 150g 20m 16m R 99.9 0.0 14273:00 ora_smon_q9db1 5285 oracle 16 0 13564 1952 820 R 31.5 0.0 0:02.49 top -c 5713 oracle 18 0 150g 20m 17m S 5.3 0.0 410:01.33 ora_asmb_q9db1 5821 oracle 15 0 150g 23m 17m S 5.3 0.0 4883:29 ora_lck0_q9db1 7596 oracle 15 0 150g 69m 37m S 5.3 0.0 5368:28 ora_pr00_q9db1 [oracle@q9adg02 ~]$ top -c top - 14:00:03 up 84 days, 19:36, 3 users, load average: 6.46, 6.96, 6.76 Tasks: 1045 total, 5 running, 1040 sleeping, 0 stopped, 0 zombie Cpu(s): 1.8%us, 1.0%sy, 0.0%ni, 93.4%id, 3.7%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 264253752k total, 196879216k used, 67374536k free, 425320k buffers Swap: 33554424k total, 0k used, 33554424k free, 4727836k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 11615 oracle 25 0 150g 16m 14m R 100.0 0.0 14272:55 ora_smon_q9db2 18173 oracle 16 0 150g 73m 37m D 18.6 0.0 24:33.91 oracleq9db2 (LOCAL=NO) 6561 oracle 15 0 150g 31m 25m R 12.2 0.0 0:48.50 oracleq9db2 (LOCAL=NO)
数据库版本和patch信息
14:18:05 sys@Q9DB>select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
SQL> SELECT INST_ID,DATABASE_ROLE,OPEN_MODE FROM GV$DATABASE;
INST_ID DATABASE_ROLE OPEN_MODE
---------- ---------------- --------------------
2 PHYSICAL STANDBY READ ONLY WITH APPLY
1 PHYSICAL STANDBY READ ONLY WITH APPLY
SQL >select inst_id,STARTUP_TIME from gv$instance;
INST_ID STARTUP_T
---------- ---------
2 01-NOV-13
1 01-NOV-13
[oracle@q9adg01 trace]$ /u01/app/oracle/product/11.2.0/db_1/OPatch/opatch lspatches
16056266;Database Patch Set Update : 11.2.0.3.6 (16056266)
16315641;Grid Infrastructure Patch Set Update : 11.2.0.3.6 (16083653)
SYSAUX表空间增加数据文件
SQL> select ts# from v$tablespace where name='SYSAUX';
TS#
----------
2
SQL> select file#,name,creation_time from v$datafile where ts#=2;
FILE# NAME CREATION_
---------- -------------------------------------------------- ---------
3 +DATA/q9db/datafile/sysaux.1412.818566605 12-MAR-08
151 +DATA/q9db/datafile/sysaux.1431.818566885 26-MAR-12
221 +DATA/q9db/datafile/sysaux.828.818547945 16-APR-12
1744 +DATA/q9db_adg/datafile/sysaux.2050.835459505 29-DEC-13
核对数据库确实在2013年12月29日对SYSAUX表空间增加了数据文件而且未重启数据库,触发Bug 16427872 Standby SMON spins on CPU after add/drop SYSAUX datafile on primary

Bug 16427872 Standby SMON spins on CPU after add/drop SYSAUX datafile on primary
在12.1.0.1中修复,在未修复前增加/删除sysaux的数据文件后,通过重启实例来解决该问题
Bug 16427872 : SMON STUCK ON LATCH FREE ON ACTIVE DATA GUARD AFTER ADDING A DATAFILE IN PRIMARY
Hdr: 16427872 11.2.0.3 RDBMS 11.2.0.3 SPACE PRODID-5 PORTID-226 Abstract: SMON STUCK ON LATCH FREE ON ACTIVE DATA GUARD AFTER ADDING A DATAFILE IN PRIMARY *** 03/03/13 05:24 pm *** (ADD: Impact/Symptom->WRONG RESULT - INCO...) *** 03/03/13 05:24 pm *** BUG TYPE CHOSEN =============== Code == Component: RDBMS == ====================== DETAILED PROBLEM DESCRIPTION ============================ The SMON process appears stuck on Latch Free on Active Data Guard after a datafile was added to the sysaux tablespace in the primary database. The SMON is moving between these few functions: ktm_back_to_smon, ktm_set_curfun, ktmmon, ktmSmonMain. DIAGNOSTIC ANALYSIS =================== We obtained the pstack outputs of the smon process ps -ef | grep smon script /tmp/stack.<pid of smon> date pstack <pid of smon> -- wait for 15 seconds date pstack <pid of smon> -- wait for 15 seconds date pstack <pid of smon> exit In stack12124.htm we can find the pstack of smon as follow: 876:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump> date Thu Feb 28 23:22:57 GMT 2013 877:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump> pstack 12124 #0 0x000000339f698b17 in times () from /lib64/libc.so.6 #1 0x00000000092cde1f in sltrgatime64 () #2 0x000000000107967a in ktm_set_curfun () #3 0x00000000010744c3 in ktmmon () #4 0x0000000001073198 in ktmSmonMain () #5 0x00000000023343b3 in ksbrdp () #6 0x000000000255a4ff in opirip () #7 0x000000000180c839 in opidrv () #8 0x0000000001e01f9b in sou2o () #9 0x0000000000a0b3ea in opimai_real () #10 0x0000000001e07f0c in ssthrdmain () #11 0x0000000000a0b2d1 in main () 878:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump> date Thu Feb 28 23:23:35 GMT 2013 880:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump> pstack 12124 #0 0x000000339f698b17 in times () from /lib64/libc.so.6 #1 0x00000000092cde1f in sltrgatime64 () #2 0x00000000010798ef in ktm_back_to_smon () #3 0x00000000010744e8 in ktmmon () #4 0x0000000001073198 in ktmSmonMain () #5 0x00000000023343b3 in ksbrdp () #6 0x000000000255a4ff in opirip () #7 0x000000000180c839 in opidrv () #8 0x0000000001e01f9b in sou2o () #9 0x0000000000a0b3ea in opimai_real () #10 0x0000000001e07f0c in ssthrdmain () #11 0x0000000000a0b2d1 in main () 881:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump> pstack 12124 #0 0x000000339f698b17 in times () from /lib64/libc.so.6 #1 0x00000000092cde1f in sltrgatime64 () #2 0x000000000107967a in ktm_set_curfun () #3 0x0000000001074425 in ktmmon () #4 0x0000000001073198 in ktmSmonMain () #5 0x00000000023343b3 in ksbrdp () #6 0x000000000255a4ff in opirip () #7 0x000000000180c839 in opidrv () #8 0x0000000001e01f9b in sou2o () #9 0x0000000000a0b3ea in opimai_real () #10 0x0000000001e07f0c in ssthrdmain () #11 0x0000000000a0b2d1 in main () 883:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump> exit This indicates that smon is moving between these few functions: ktm_back_to_smon, ktm_set_curfun, ktmmon and ktmSmonMain. This is matched to the errorstack level 3. A repeated query from v$session and v$process shows: TIME SID SEQ# PNAME P1 P2 P3 WAIT_TIME SECONDS_IN_WAIT STATE EVENT ----------------- ---------- ---------- ----- ---------- ---------- ---------- ---------- --------------- ------------------- ---------------------------------------------------------------- 13-03-01:04:03:34 1639 3702 SMON 1610802608 259 0 -1 25277 WAITED SHORT TIME latch free TIME SID SEQ# PNAME P1 P2 P3 WAIT_TIME SECONDS_IN_WAIT STATE EVENT ----------------- ---------- ---------- ----- ---------- ---------- ---------- ---------- --------------- ------------------- ---------------------------------------------------------------- 13-03-01:04:03:22 1639 3702 SMON 1610802608 259 0 -1 25325 WAITED SHORT TIME latch free TIME SID SEQ# PNAME P1 P2 P3 WAIT_TIME SECONDS_IN_WAIT STATE EVENT ----------------- ---------- ---------- ----- ---------- ---------- ---------- ---------- --------------- ------------------- ---------------------------------------------------------------- 13-03-01:04:03:09 1639 3702 SMON 1610802608 259 0 -1 25433 WAITED SHORT TIME latch free WORKAROUND? =========== No TECHNICAL IMPACT ================ Standby database unusable. RELATED ISSUES (bugs, forums, RFAs) =================================== Few bugs found: Defect 13712579 SMON CPU SPIKES ON ADG PHYSICAL STANDBY Defect 13060518 ADG 11.2.0.2 - SMON TAKES 100% CPU ON STANDBY SINCE ADDING A DATAFILE ON PRIMARY Defect 16092439 CDB EXIT : SMON IN ADG AT 100% CPU IN KTM All of them are not closed. We have set the following event: alter session set tracefile_identifier='SMON'; alter system set events ' 10500 level 15 '; After few minutes got a 1GB trace file smon.trc.gz, in it: ... 0: waited for 'latch free' address=0x6002e5b0, number=0x103, tries=0x0 wait_id=3701 seq_num=3702 snap_id=1 wait times: snap=0.000069 sec, exc=0.000069 sec, total=0.000069 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 287 min 5 sec of elapsed time 1: waited for 'smon timer' sleep time=0x12c, failed=0x0, =0x0 wait_id=3700 seq_num=3701 snap_id=1 wait times: snap=4 min 43 sec, exc=4 min 43 sec, total=4 min 43 sec wait times: max=5 min 0 sec wait counts: calls=1 os=95 occurred after 0.003400 sec of elapsed time 2: waited for 'smon timer' sleep time=0x12c, failed=0x0, =0x0 wait_id=3699 seq_num=3700 snap_id=1 wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec wait times: max=5 min 0 sec wait counts: calls=1 os=99 occurred after 0.000032 sec of elapsed time ... The above looks very similar to defect 13060518 "ADG 11.2.0.2 - SMON TAKES 100% CPU ON STANDBY SINCE ADDING A DATAFILE ON PRIMARY" (status 11). HOW OFTEN DOES THE ISSUE REPRODUCE AT CUSTOMER SITE? ==================================================== Intermittent DOES THE ISSUE REPRODUCE INTERNALLY? ==================================== No IS A TESTCASE AVAILABLE? ======================== No Link to IPS Package: ==================== "not available"