今天中午发现福建生产库报负载有点异常,处理思路记录下来:
1、使用top命令查看系统,发现系统负载是比以前要搞(平时都是1以下,今天已经稳定在4左右,总是有部分进城占用cpu比较高,系统cpu等待明显)
1.1)第一反应是有人执行sql导致,抓取占用cpu较高的spid,查询出对应sql,发现都是一些比较简单sql
1.2)查询这些spid的客户端是应用服务器,也就是说不是人为执行,那在一个稳定的系统中,不会出现sql突然改变的原因
2、查询系统是否因为有对象被阻塞导致,查询发现无对象被阻塞
3、查询系统等待事件,发现几十个read by other session等待,都是从一台web的服务器上连接过来
SELECT * FROM v$session WHERE wait_class#<>6;
4、read by other session等待事件比较陌生,幸好伴随有db file sequential read的等待事件,初步怀疑读取数据到内存中等待导致
5、查询资料发现
read by other session Definition: When information is requested from the database, Oracle will first read the data from disk into the database buffer cache. If two or more sessions request the same information, the first session will read the data into the buffer cache while other sessions wait.
In previous versions this wait was classified under the “buffer busy waits” event.
However, in Oracle 10.1 and higher this wait time is now broken out into the “read by other session” wait event. Excessive waits for this event are typically due to several processes repeatedly reading the same blocks, e.g. many sessions scanning the same index or performing full-table scans on the same table. Tuning this issue is a matter of finding and eliminating this contention.
Confio concludes with a summary that “read by other session waits” are very similar to buffer busy waits
When a session waits on the “read by other session” event, it indicates a wait for another session to read the data from disk into the Oracle buffer cache.
If this happens too often the performance of the query or the entire database can suffer. Typically this is caused by contention for “hot” blocks or objects so it is imperative to find out which data is being contended for. Once that is known this document lists several alternative methods for solving the issue.
总结:两个或者多个会话同时需要把硬盘中的对象装载到data buffer中,当其中一个会话把对象装入后,其他会话就处于read by other session等待状态;这个是oracle 10g 从oracle 9i的buffer busy waits中分离出来的,也是需要一种热块现象
6、根据FILE#,BLOCK#查询热块对象
SELECT OWNER, SEGMENT_NAME, SEGMENT_TYPE, TABLESPACE_NAME, A.PARTITION_NAME
FROM DBA_EXTENTS A
WHERE FILE_ID = &FILE_ID
AND &BLOCK_ID BETWEEN BLOCK_ID AND BLOCK_ID + BLOCKS – 1;
7、通过这个对象查询出对应的操作语句
select * from v$sql where upper(sql_text) like ‘%object_name%’;
8、直接查找热点块对象语句
SELECT * FROM (SELECT O.OWNER, O.OBJECT_NAME, O.OBJECT_TYPE, SUM(TCH) TOUCHTIME, FROM X$BH B, DBA_OBJECTS O WHERE B.OBJ = O.DATA_OBJECT_ID AND B.TS# > 0 GROUP BY O.OWNER, O.OBJECT_NAME, O.OBJECT_TYPE ORDER BY SUM(TCH) DESC) WHERE ROWNUM <= 10 --或者 SELECT E.OWNER, E.SEGMENT_NAME, E.SEGMENT_TYPE FROM DBA_EXTENTS E, (SELECT * FROM (SELECT ADDR, TS#, FILE#, DBARFIL, DBABLK, TCH FROM X$BH ORDER BY TCH DESC) WHERE ROWNUM < 11) B WHERE E.RELATIVE_FNO = B.DBARFIL AND E.BLOCK_ID <= B.DBABLK AND E.BLOCK_ID + E.BLOCKS > B.DBABLK;
9、直接查找热点块操作语句
SELECT /*+rule*/ HASH_VALUE, SQL_TEXT FROM V$SQLTEXT WHERE (HASH_VALUE, ADDRESS) IN (SELECT A.HASH_VALUE, A.ADDRESS FROM V$SQLTEXT A, (SELECT DISTINCT A.OWNER, A.SEGMENT_NAME, A.SEGMENT_TYPE FROM DBA_EXTENTS A, (SELECT DBARFIL, DBABLK FROM (SELECT DBARFIL, DBABLK FROM X$BH ORDER BY TCH DESC) WHERE ROWNUM < 11) B WHERE A.RELATIVE_FNO = B.DBARFIL AND A.BLOCK_ID <= B.DBABLK AND A.BLOCK_ID + A.BLOCKS > B.DBABLK) B WHERE A.SQL_TEXT LIKE '%' || B.SEGMENT_NAME || '%' AND B.SEGMENT_TYPE = 'TABLE') ORDER BY HASH_VALUE, ADDRESS, PIECE;
10、也可以通过awr查询出来相关对象
Segments by Buffer Busy Waits
语句需要通过这些等待对象进行判断
到了1点钟左右,数据库read by other session等待事件消失,数据库恢复正常负载(因为系统还能够承受,所以当时没有采用kill进程的方法)
事后对查询出来的热点块对象和操作语句,已经访问服务器和开发确认的结果为:这个是一个报表功能,但是平时没有人用,所以也没有关注,今天突然被人用了下,导致这个问题发生,他们承诺在升级下个版本中解决这个问题。