|
ORACLE CPU过高的一次调整过程
发现CPU占用率一般都维持在90%以上,由于大量使用FOR UPDATE,造成大量LATCH等等待现象,其中某些线程SQL又运行非常频繁。很久以前就发现这些问题,给出相对意见,建议不采用FOR UPDATE或在FOR UPDATE上加NOWAIT(见http://spaces.msn.com/sunmoonkin ... ED01777CA!318.entry “ELECT FOR UPDATE 相关的知识”一章 ),由于项目进程问题一直没进行程序修改,决定在这次解决掉,以解后顾之忧。
调整前要知道如下问题
1,CPU是多少MHZ
2,SERVER 端的CPU是否有负荷较重
3,CLIENT 端的CPU是否有负荷较重
4,空闲时间(如半夜)CPU使用率是否超过15%,如果超过则需要特别注意了
5,CPU的PEAK LOAD
6,CPU的IDLE STATE
影响CPU的因素
1,高的无必要的解析会代价昂贵。
发现那些SQL运行了大量的PARSE
select sql_text, parse_calls, executions from v$sqlarea order by parse_calls desc;
SYS的总的PARSE情况
select name, value from v$sysstat where name = ’parse count%’;
只有硬解析才能减少,可以绑定变量,或增加每一个SESSION的CACHED CURSORS。
2,导致大量I/O的SQL也会明显占用CPU,如没有INDEX。BUFFER GETS一般会同CPU一块增长。可以通过v$sqlarea发现buffer_gets
3,其他等待时间,可以通过v$sesstat,v$sysstat查看
#VMSTAT 5 5 的CPU部分
us user用掉的
sy system用掉的
id 空闲
调整前
TOPAS
Name PID CPU% PgSp Owner
oracle 688416 25.2 4.6 orasbp
oracle 569658 24.9 4.5 orasbp
topas 676210 0.1 2.9 root
syncd 77964 0.0 0.5 root
hatsd 159792 0.0 8.3 root
或者ps aux|head
检查CPU数量
/usr/sbin/bindprocessor -q
The available processors are: 0 1 2 3
STATSPACK的信息
Total Wait wait Waits
Event Waits Timeouts Time (s) (ms) /txn
---------------------------- ------------ ---------- ---------- ------ --------
enqueue 5,465 121 1,793 328 0.6
latch free 2,986 2,669 21 7 0.3
根据消耗最多CPU的进程PID来得到SID详细信息
select s.sid from v$process p, v$session s
where s.paddr=p.addr and p.spid = &your_spid;
根据SID查SQL
SELECT SQL_TEXT from V$SQLTEXT_WITH_NEWLINES where HASH_VALUE
= (select sql_hash_value from v$session
where SID = <problem_SID_you_got_from_last_step>) ;
还有
select n.name,s.value
from v$statname n,V$sesstat s
where n.statistic# = s.statistic#
and value > 0
and s.sid = (select a.sid from v$process p,v$session a
where p.addr =a.paddr
and a.terminal = userenv(’terminal’))
order by n.class,n.name
用以上SQL完成SHELL(shell信息在后面whoit.sh),运行
sh whoit.sh 688416 来根据PID得到用户信息和SQL语句
在STATSPACK和我的whoit.sh都指定是这个SQL的问题
SELECT demessageid, fromid, apptype, demessage,appversion,vovers
ion ,toid,tag FROM de_receivelog WHERE (status = :1 and rownum<
=1 and (dealtime is null or dealtime<sysdate) and delock=0) or (
delock=1 and Update_Date<sysdate-2/24) order by RECEIVETIME for
update
再看等待事件
select sid||' '||event||' '|| total_waits||' '||average_wait from v$session_event where sid=25
SQL> /
SID||''||EVENT||''||TOTAL_WAITS||''||AVERAGE_WAIT
--------------------------------------------------------------------------------
25 latch free 46180 1
25 control file sequential read 4 0
25 log file sync 1 0
25 db file sequential read 202 0
25 db file scattered read 445 1
25 SQL*Net message to client 22 0
25 SQL*Net message from client 22 0
只有一个DEFAULT 池
NAME HIT_RATIO
---------------------------------------- ----------
DEFAULT .88042806
设置db_keep_cache_size池,并KEEP表
ALTER TABLE customer STORAGE (BUFFER_POOL RECYCLE|KEEP|DFAULT)
SQL> analyze table sbpopt.de_receivelog compute statistics;
建立相关索引
alter table TI_REPAIR_DEED storage(buffer_pool keep);
alter table de_receivelog storage(buffer_pool keep);
看到相关的等待都是LATCH FREE,enqueue,估计是由于SELECT FOR UPDATE并且全表扫描造成的
查看每个Session的CPU利用情况:
select ss.sid||' '||se.command||' '||ss.value CPU ||' '||se.username||' '||se.program
from v$sesstat ss, v$session se
where ss.statistic# in
(select statistic#
from v$statname
where name = 'CPU used by this session')
and se.sid=ss.sid
and ss.sid>6
order by ss.value
根据STATSPACK的HASH VALUE 用SQL>@sprepsql得到
STATSPACK SQL report for Hash Value: 1710202187 Module: JDBC Thin Client
DB Name DB Id Instance Inst Num Release Cluster Host
------------ ----------- ------------ -------- ----------- ------- ------------
SBP 3008872479 SBP 1 9.2.0.6.0 NO svodbp01
Start Id Start Time End Id End Time Duration(mins)
--------- ------------------- --------- ------------------- --------------
44 23-May-06 14:13:01 45 23-May-06 14:28:00 14.98
SQL Statistics
~~~~~~~~~~~~~~
-> CPU and Elapsed Time are in seconds (s) for Statement Total and in
milliseconds (ms) for Per Execute
% Snap
Statement Total Per Execute Total
--------------- --------------- ------
Buffer Gets: 6,938,821 7,608.4 63.34
Disk Reads: 0 0.0 .00
Rows processed: 197 0.2
CPU Time(s/ms): 508 557.2
Elapsed Time(s/ms): 607 665.8
Sorts: 1,292 1.4
Parse Calls: 191 .2
Invalidations: 0
Version count: 1
Sharable Mem(K): 22
Executions: 912
SQL Text
~~~~~~~~
SELECT demessageid, fromid, apptype, demessage,appversion,vovers
ion ,toid,tag FROM de_receivelog WHERE (status = :1 and rownum<
=1 and (dealtime is null or dealtime<sysdate) and delock=0) or (
delock=1 and Update_Date<sysdate-2/24) order by RECEIVETIME for
update nowait
===============================================================
15分钟内执行了912次,占用22K内存(不多),每次执行1.4个排序,共花了508秒,由于已经采取措施将此表KEEP与内存中,因此没有磁盘读,但每次的逻辑读7,608.4个块
然后将SQL语句的FOR UPDATE改成 FOR UPDATE NOWAIT; 这才是真正解决问题的地方
至此,通过观察CPU一般维持在25%以内。
等待事件
enqueue(队列) Indicates waits associated with internal queuing mechanism for locking various resources and components of Oracle. Please refer to Appendix B of the Oracle8i Reference Manual for the complete list of enqueues in Oracle.是一种保护共享资源的琐,锁可以保护纪录里面数据一类的共享资源,防止两个人同时修改该纪录,fifo,enqueue等待通常是 ST enqueue,HW enqueue ,TX4 enqueue。ST enqueue在数据字典管理方式表空间的物理空间分配和管理上发挥作用。当有此问题时可以转变为本地管理表空间或预先分配扩展。HW enqueue同段的高水位线一起使用,手工分配扩展可避免其上的等待。TX4 是最常见的,1,唯一索引的重复,需要COMMIT/ROLLBACK来释放ENQUEUE。2,多个并发的对同一位图索引片的修改,同上。3,多用户同时修改相同的数据块,如果已经没有空闲的ITL槽,则会引发一数据块的锁,运用增大initrans或maxtrans以容纳更多ITL槽的方法来解决,或增大PCTFREE。(ST:使用本地表空间或预先分配大扩展,HW预先分配扩展于高水水位线之上,TX4增大表或索引的initrans和maxtrans,TM为外键建立索引,察看日志放在快速磁盘上)
latch free(锁存器空闲,拴空闲) Indicates latch contention for the latch# that is being waited on. Ensure that you already have tuned the number of latches to their allowed maximums by setting the relevant init.ora parameters. If the problem persists, you should determine what is causing the contention for the latch and fix the underlying problem. Your goal should be to cure the disease not the symptom. A latch free event is a symptom of a bigger problem. For example, if the latch# derived from this is a library cache latch (assuming that the shared pool is configured appropriately), it may indicate a significant amount of hard parsing. This usually is a problem with applications that have hard-coded values in them. Either rewrite them with bind variables or upgrade to Oracle8i and use CURSOR_SHARING=force (or just look the other way).Latch Free是一种低级的同步锁机制,用以维持某些访寻和执行操作的顺序。Oracle通过enqueue来达成对重做线程/表/事务一类对象的并发使用,而通过Latch来达成对SGA中共享内存结构的保护。Latch速度快而成本低,往往通过单个的内存单元来实现。大部分Latch是互斥的。共享型Latch许可对其内存结构的并发读操作。当请求得到Latch但它已被其它进程占住时,将产生一条Latch free miss的纪录。大多数Latch问题关联到不使用绑定变量(Library Cache Latch),重做日志生成中问题(Redo Allocation Latch),内存缓冲区竞争问题(Cache Buffers LRU Chain)和内存缓冲区中‘热块’的数据块(Cache Buffers Chain)。也有一些Latch Waits是由软件错误而造成的。 首先确认足够的LATCH数,然后确认那个锁存器(V$SESSION_WAIT的P2),再确认是什么引起的.
#cat whoit.sh
sqlplus /nolog <<EOF
connect / as sysdba
col machine format a30
col program format a40
set line 200
select sid,serial# ,username,osuser,machine,program,process,to_char(logon_time,'yyyy/mm/dd hh24:mi:ss')
from v\$session where paddr in
( select addr from v\$process where spid in($1));
select sql_text from v\$sqltext_with_newlines
where hash_value in
(select SQL_HASH_VALUE from v\$session where
paddr in (select addr from v\$process where spid=$1)
)
order by piece;
exit;
EOF |
|