一个客户反馈说有一系统经常性负载比较高,让我帮忙分析原因
系统负载情况
[oracle@zwq-kfdialdb ~]$ top -c
top - 17:11:06 up 78 days, 1:12, 5 users, load average: 124.83, 125.90, 112.13
Tasks: 836 total, 152 running, 684 sleeping, 0 stopped, 0 zombie
Cpu(s): 98.1%us, 0.1%sy, 0.0%ni, 1.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132070908k total, 90494280k used, 41576628k free, 1147384k buffers
Swap: 67108856k total, 0k used, 67108856k free, 79109904k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12179 oracle 15 0 32.5g 142m 139m R 39.0 0.1 6:09.20 oracleahdial (LOCAL=NO)
11530 oracle 16 0 32.5g 469m 466m R 36.1 0.4 11:14.23 oracleahdial (LOCAL=NO)
11816 oracle 15 0 32.5g 467m 463m R 36.1 0.4 6:33.86 oracleahdial (LOCAL=NO)
11577 oracle 15 0 32.5g 480m 477m R 34.7 0.4 7:15.98 oracleahdial (LOCAL=NO)
12136 oracle 16 0 32.5g 455m 452m R 31.9 0.4 9:07.88 oracleahdial (LOCAL=NO)
11237 oracle 16 0 32.5g 997m 992m R 31.2 0.8 20:53.50 oracleahdial (LOCAL=NO)
11427 oracle 16 0 32.5g 137m 135m R 31.2 0.1 11:50.16 oracleahdial (LOCAL=NO)
12051 oracle 16 0 32.5g 459m 456m R 31.2 0.4 6:12.67 oracleahdial (LOCAL=NO)
[oracle@zwq-kfdialdb ~]$ vmstat 3 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
137 0 0 41566600 1147384 79109888 0 0 0 3 0 0 0 0 99 0 0
146 0 0 41567548 1147384 79109888 0 0 0 47 1058 32190 98 0 2 0 0
150 0 0 41568056 1147384 79109888 0 0 0 39 1081 31543 98 0 2 0 0
12 0 0 41568464 1147384 79109888 0 0 0 23 1056 32111 98 0 2 0 0
62 0 0 41568776 1147384 79109888 0 0 0 11 1067 31474 98 0 2 0 0
108 0 0 41568304 1147384 79109888 0 0 0 39 1059 31193 98 0 2 0 0
140 0 0 41569280 1147384 79109888 0 0 0 48 1063 31171 98 0 2 0 0
140 0 0 41569444 1147384 79109888 0 0 0 40 1075 30508 98 0 2 0 0
通过top和vmstat看出系统现在负载很高,主要都是用户进程导致.
查询等待事件
SQL> select event from v$session where wait_class#<>6;
EVENT
----------------------------------------------------------------
cursor: mutex S
SQL> /
EVENT
----------------------------------------------------------------
cursor: mutex S
SQL> /
EVENT
----------------------------------------------------------------
cursor: mutex S
SQL> /
EVENT
----------------------------------------------------------------
cursor: mutex S
SQL> SELECT a.*, s.sql_text
2 FROM v$sql s,
3 (SELECT sid,
4 event,
5 wait_class,
6 p1 cursor_hash_value,
7 p2raw Mutex_value,
8 TO_NUMBER (SUBSTR (p2raw, 1, 8), 'xxxxxxxx') hold_mutex_x_sid
9 FROM v$session_wait
10 WHERE event LIKE 'cursor%') a
11 WHERE s.HASH_VALUE = a.cursor_hash_value
12 /
no rows selected
SQL> select event from v$session where wait_class#<>6;
EVENT
----------------------------------------------------------------
SQL*Net message to client
数据库开始的等待事件只有cursor: mutex S,等该等待事件消失后系统负载也恢复正常
再次查看系统负载
[oracle@zwq-kfdialdb ~]$ top -c -i10
top - 17:13:51 up 78 days, 1:15, 6 users, load average: 12.57, 78.21, 96.45
Tasks: 702 total, 2 running, 700 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.2%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132070908k total, 86477808k used, 45593100k free, 1147500k buffers
Swap: 67108856k total, 0k used, 67108856k free, 79116036k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15045 oracle 16 0 13136 1476 724 R 10.6 0.0 0:00.12 top -c -i10
10010 root 10 -5 0 0 0 D 0.0 0.0 0:00.68 [kondemand/4]
10019 root 10 -5 0 0 0 D 0.0 0.0 1:41.58 [kondemand/13]
10020 root 10 -5 0 0 0 D 0.0 0.0 1:52.28 [kondemand/14]
10021 root 10 -5 0 0 0 R 0.0 0.0 2:01.54 [kondemand/15]
12166 root 24 0 10084 300 216 D 0.0 0.0 0:00.00 /opt/VRTSgab/gablogd
[oracle@zwq-kfdialdb ~]$ vmstat 3 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 45484264 1147528 79117360 0 0 0 3 0 0 0 0 99 0 0
2 0 0 45468452 1147528 79117456 0 0 4 161 1372 5369 2 2 97 0 0
3 0 0 45463712 1147528 79117584 0 0 4 187 1602 7253 6 0 93 0 0
1 0 0 45458220 1147528 79117648 0 0 1 99 1358 5821 2 0 98 0 0
0 0 0 45475168 1147528 79117712 0 0 0 41 1321 5321 2 0 98 0 0
0 0 0 45473624 1147528 79117744 0 0 3 104 1378 5455 2 0 98 0 0
2 0 0 45474656 1147528 79117776 0 0 0 55 1196 4872 1 0 99 0 0
0 0 0 45474376 1147532 79117824 0 0 8 113 1170 4990 2 0 98 0 0
1 0 0 45475440 1147532 79117872 0 0 1 56 1187 5301 3 0 97 0 0
1 0 0 45475824 1147532 79117888 0 0 3 99 1083 4643 3 0 97 0 0
结合上面的等待事件查询,我们可以大概评估出来,当cursor: mutex S等待消失后,系统负载也恢复正常,现在已经不存在环境,如果要找出问题只能够是借助AWR和ASH
分析ASH
Top User Events
Top SQL with Top Events
Activity Over Time
通过对ASH分析,可以大概确定,在这段时间内,引起系统负载高主要是cursor: mutex S导致
分析AWR
awr整体信息(从这里看数据库是相当的繁忙)
Load Profile(从这里看数据库业务比较小)
Top 5 Timed Foreground Events(主要等待事件cursor: mutex S,和前面分析相符)
OS LOAD(虽然和系统看到有一定出入,但是整体还是展示系统负载较高)
SQL ordered by Version Count(出现cursor: mutex S,因为load profile中解析不多,所以想到高版本问题,这里确实非常高)
通过这里的一些列分析,我们已经基本上可以确定,该数据库因为高版本问题导致cursor: mutex S以及library cache 相关等待严重,从而出现系统负载过高.
找出高版本原因
高版本相关信息和查询请见:关于High Versions Count总结
SQL> select * from table(version_rpt('f8b9tba7sfsb5'));
COLUMN_VALUE
--------------------------------------------------------------------------------
Version Count Report Version 3.2.1 -- Today's Date 19-jul-12 18:13
RDBMS Version :11.2.0.1.0 Host: zwq-kfdialdb Instance 1 : ahdial
==================================================================
Addr: 000000080FA4CEA0 Hash_Value: 2408014181 SQL_ID f8b9tba7sfsb5
Sharable_Mem: 206315729 bytes Parses: 48689
Stmt:
0 select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales
1 0_ where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and tblsps
2 ales0_.MOBILE_TELE_NO=:"SYS_B_1"
3
COLUMN_VALUE
--------------------------------------------------------------------------------
Versions Summary
----------------
AUTH_CHECK_MISMATCH :4
TRANSLATION_MISMATCH :4
ROLL_INVALID_MISMATCH :10219
PURGED_CURSOR :9
Total Versions:10219
Plan Hash Value Summary
-----------------------
COLUMN_VALUE
--------------------------------------------------------------------------------
Plan Hash Value Count
=============== =====
791727930 920
2820478500 9300
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for AUTH_CHECK_MISMATCH :
# of Ver PARSING_USER_ID PARSING_SCHEMA_ID PARSING_SCHEMA_NAME
========== =============== ================= ===================
10218 75 75 HOLLYSP
COLUMN_VALUE
--------------------------------------------------------------------------------
2 107 107 HOLLYSP_TEST
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for TRANSLATION_MISMATCH :
Summary of objects probably causing TRANSLATION_MISMATCH
Object# Owner.Object_Name
========= =================
76737 HOLLYSP.TBL_SP_SALES_RECORDS
HOLLYSP.TBL_SP_SALES_RECORDS
HOLLYSP_TEST.TBL_SP_SALES_RECORDS
COLUMN_VALUE
--------------------------------------------------------------------------------
107043 HOLLYSP_TEST.TBL_SP_SALES_RECORDS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for ROLL_INVALID_MISMATCH :
No details available
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for PURGED_CURSOR :
No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.
COLUMN_VALUE
--------------------------------------------------------------------------------
alter session set events
'immediate trace name cursortrace address 2408014181, level LLL';
To turn it off do use address 1, level 2147483648
================================================================
59 rows selected.
SQL> select * from table(version_rpt('6zhjf3qh8gyp9'));
COLUMN_VALUE
--------------------------------------------------------------------------------
Version Count Report Version 3.2.1 -- Today's Date 19-jul-12 18:15
RDBMS Version :11.2.0.1.0 Host: zwq-kfdialdb Instance 1 : ahdial
==================================================================
Addr: 00000007FD46D2A8 Hash_Value: 2693266089 SQL_ID 6zhjf3qh8gyp9
Sharable_Mem: 111904227 bytes Parses: 4880
Stmt:
0 select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales
1 0_ where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and tblsps
2 ales0_.CAMPAIGN_ID=:"SYS_B_1" and tblspsales0_.IS_CONN=:"SYS_B_2
3 " and tblspsales0_.SALES_TIME>=:"SYS_B_3" and tblspsales0_.SALES
4 _PERSON=:"SYS_B_4"
COLUMN_VALUE
--------------------------------------------------------------------------------
5
Versions Summary
----------------
BIND_MISMATCH :4804
INCOMP_LTRL_MISMATCH :372
HASH_MATCH_FAILED :4936
Total Versions:4935
Plan Hash Value Summary
COLUMN_VALUE
--------------------------------------------------------------------------------
-----------------------
Plan Hash Value Count
=============== =====
1645985080 3
2040125427 4933
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for BIND_MISMATCH :
Consolidated details for :
BIND_MISMATCH,USER_BIND_PEEK_MISMATCH,BIND_UACS_DIFF and
COLUMN_VALUE
--------------------------------------------------------------------------------
BIND_EQUIV_FAILURE (Mislabled as ROW_LEVEL_SEC_MISMATCH BY bug 6964441 in 11gR1)
from v$sql_bind_capture
COUNT(*) POSITION MIN(MAX_LENGTH) MAX(MAX_LENGTH) DATATYPE (PRECISION,SCALE)
======== ======== =============== =============== ======== ================
4936 1 32 32 1 (,)
4936 2 32 32 1 (,)
4936 3 32 32 1 (,)
4936 4 32 32 1 (,)
4936 5 32 32 1 (,)
COLUMN_VALUE
--------------------------------------------------------------------------------
SUM(DECODE(column,Y, 1, 0) FROM V$SQL
IS_OBSOLETE IS_BIND_SENSITIVE IS_BIND_AWARE IS_SHAREABLE
=========== ================= ============= ============
0 4309 0 4309
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for INCOMP_LTRL_MISMATCH :
No details available
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for HASH_MATCH_FAILED :
COLUMN_VALUE
--------------------------------------------------------------------------------
No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.
alter session set events
'immediate trace name cursortrace address 2693266089, level LLL';
To turn it off do use address 1, level 2147483648
================================================================
62 rows selected.
根据经验,出现这么多的高版本情况,很可能是cursor_sharing参数设置问题
--对应sql语句
select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales0_
where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and
tblspsales0_.CAMPAIGN_ID=:"SYS_B_1" and tblspsales0_.IS_CONN=:"SYS_B_2"
and tblspsales0_.SALES_TIME>=:"SYS_B_3" and tblspsales0_.SALES_PERSON=:"SYS_B_4"
--cursor_sharing参数
SQL> show parameter cursor_sharing;
NAME TYPE VALUE
------------------------------------ -------------------------------- ----------------
cursor_sharing string SIMILAR
根据oracle官方建议在11g中不推荐使用cursor_sharing=SIMILAR,其实在所有版本中都不推荐,设置为该值很容易导致高版本问题.而且该值会出现莫名其妙的,无法解释的高版本问题.而且根据oracle相关文档,在即将发布的12c版本中,将除掉SIMILAR值.对于客户库的该问题,因为很多sql未绑定参数,为了减少硬解析,建议在业务低谷时设置cursor_sharing=FORCE,并刷新shared pool.