客户环境:db2 v82 fp13, aix 5.3一个采购管理系统,
问题描述:某些页面响应时间很慢,系统管理员通过系统监控,发现 CPU、I/O资源使用率高。
问题分析:这是典型的性能问题,分析的思路自然从database snapshot入手,发现rows read/rows selected过高,达到600左右,观察bufferpool同步读I/O,发现很低。这说明database有性能瓶颈。接着抓取sql snapshot快照,发现很多语句有问题。以下是找到的比较有意思的一条。
通过sql快照:
Number of executions = 3
Number of compilations = 1
Worst preparation time (ms) = 18
Best preparation time (ms) = 18
Internal rows deleted = 0
Internal rows inserted = 0
Rows read = 225976098
Internal rows updated = 0
Rows written = 77448
Statement sorts = 5520
Statement sort overflows = 6
Total sort time = 108
Buffer pool data logical reads = 166248
Buffer pool data physical reads = 0
Buffer pool temporary data logical reads = 3391188
Buffer pool temporary data physical reads = 0
Buffer pool index logical reads = 404475
Buffer pool index physical reads = 3467
Buffer pool temporary index logical reads = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms) = 134.657283
Total user cpu time (sec.ms) = 96.790617
Total system cpu time (sec.ms) = 1.878419
Statement text = select sum(a.zctamount) amount,aa.invoicecode code,aa.status from invoicedetail a,invoice aa,billing b,deliverydetail c,orderdetail d, ord dd,ordertype e WHERE 1=1 and a.invoiceid=aa.invoiceid and a.billingid=b.billingid and b.relatedtrandetailid=c.deliverydetailid and c.orderdetailid=d.orderdetailid and dd.ordertypeid=e.ordertypeid and dd.orderid=d.orderid and e.saleflag>0 and aa.partnerid=dd.billto and aa.orgid=dd.sellerid and b.billto=dd.billto and a.relatedtrantype=10 and aa.status=80 and dd.sellerid=100000001 and dd.billto=100002441 and dd.status in ( 50, 60, 70) and dd.ordertypeid in (100000242,100000000,100000001,100000002,100000003,100000004,100000011,100000005,100000170,100000006,100000007,100000008,100000151,100000155,100000156,100000157,100000158,100000070,100000163,100000165,100000194,100000202,100000203,100000159,100000132) GROUP BY aa.invoicecode ,aa.status
复制代码
发现该语句每次执行大概需要45秒,每次读的行数7100万行,发生排序1800次,根据这些指标,说明该SQL存在严重的性能问题。Rows_read对系统的性能影响很大,通常是由于过多的表扫描引起。
找到了语句,接下来很自然的看它的执行计划。通过explain + db2exfmt,发现语句的cost为537.262,也不算高啊。
继续分析,发现很多连接操作符是MSJOIN和NLJOIN,根据经验,这些操作符有可能是出问题的地方。我们找到比较可疑的一个NLJOIN(21)。(篇幅所限,只粘贴了最关键的部分)
这个NLJOIN的预测rows只有0.0529,继续定位它的左子树,Fetch的cost达到165,而IXSCAN 25)的估计返回值是83行,以下是25)操作符的详细解释:
25) IXSCAN: (Index Scan)
Cumulative Total Cost: 25.7581
Cumulative CPU Cost: 210702
Cumulative I/O Cost: 2
Cumulative Re-Total Cost: 0.0427741
Cumulative Re-CPU Cost: 155241
Cumulative Re-I/O Cost: 0
Cumulative First Row Cost: 25.7183
Estimated Bufferpool Buffers: 3
Arguments:
---------
MAXPAGES: (Maximum pages for prefetch)
1
PREFETCH: (Type of Prefetch)
NONE
ROWLOCK : (Row Lock intent)
NONE
SCANDIR : (Scan Direction)
FORWARD
TABLOCK : (Table Lock intent)
INTENT NONE
TBISOLVL: (Table access Isolation Level)
CURSOR STABILITY
Predicates:
----------
7) Start Key Predicate
Relational Operator: Equal (=)
Subquery Input Required: No
Filter Factor: 0.000172325
Predicate Text:
--------------
(Q2."BILLTO" = 100002441)
7) Stop Key Predicate
Relational Operator: Equal (=)
Subquery Input Required: No
Filter Factor: 0.000172325
Predicate Text:
--------------
(Q2."BILLTO" = 100002441)
Input Streams:
-------------
24) From Object DB2ADMIN.ORD_IDX9
Estimated number of rows: 486884
Number of columns: 2
Subquery predicate ID: Not Applicable
Column Names:
------------
+Q2.$RID$+Q2."BILLTO"
Output Streams:
--------------
25) To Operator #24
Estimated number of rows: 83.9021
Number of columns: 1
Subquery predicate ID: Not Applicable
Column Names:
------------
+Q2."BILLTO"
复制代码
附件: 您需要登录才可以下载或查看附件。没有帐号?注册
本主题由 System 于 2012-5-15 05:00 撤销鉴定
发短消息
加为好友
飞天 当前离线
UID6761 帖子185 主题11 金钱1333 米 阅读权限50 注册时间2009-9-15
2#
五星级会员 飞天发表于 2011-6-20 18:20 | 只看该作者
本帖最后由 飞天 于 2011-6-20 18:28 编辑
回复 1# 飞天
观察到该操作符执行的过滤条件为Q2."BILLTO" = 100002441,那么我们要看一下该条件的返回行数是多少,结果是2580,而估计的值是83.9,差了几十倍。
db2 “select count(*) from ord where BILLTO=100002441”
2957
这是典型的优化器低估问题,那么为什么会出现这种情况呢?统计信息,没错,就是统计信息,如果统计信息不准确,就会误导优化器选择错误的路径。
这时,检查syscat.tables的stats_time统计时间,发现该表最近做过统计信息收集。那么为什么造成这种情况呢?继续通过以下语句查看数据分布情况,
db2 “select billto, count(*) from ord group by billto order by desc”
发现数据及其不均匀,这时可以怀疑分布统计信息问题。查看sysstat.coldist系统表,果然不出所料,该表并没有做分布统计信息。这时,执行 runstats with distribution。但奇怪的是,执行后,访问计划没有变,时间仍然为45秒。
继续查sysstat.coldist,发现收集了10个frequency value,但查询条件100002441并没有出现在该表中,也就是并没有为该值收集统计信息。
TABNAME COLNAME TYPE SEQNO COLVALUE VALCOUNT DISTCOUNT
---------- ---------- ---- ------ ------------ --------- ----------
ORD BILLTO F 1 100000002 5145 -
ORD BILLTO F 2 100004319 4533 -
ORD BILLTO F 3 100008682 4129 -
ORD BILLTO F 4 100000003 4001 -
ORD BILLTO F 5 100004037 3696 -
ORD BILLTO F 6 100017876 3435 -
ORD BILLTO F 7 100005877 3309 -
ORD BILLTO F 8 100000001 3195 -
ORD BILLTO F 9 100004046 3071 -
ORD BILLTO F 10 100004030 3008 -
ORD BILLTO Q 1 100000001 3195 1
ORD BILLTO Q 2 100001808 26186 85
ORD BILLTO Q 3 100002389 51546 179
ORD BILLTO Q 4 100003137 76945 276
ORD BILLTO Q 5 100003272 102535 355
ORD BILLTO Q 6 100003993 128771 513
ORD BILLTO Q 7 100004078 154800 559
ORD BILLTO Q 8 100004558 179535 706
ORD BILLTO Q 9 100005661 205113 964
ORD BILLTO Q 10 100006660 231133 1192
ORD BILLTO Q 11 100007976 256415 1489
ORD BILLTO Q 12 100008682 282549 1695
ORD BILLTO Q 13 100009725 307643 1923
ORD BILLTO Q 14 100011096 333224 2257
ORD BILLTO Q 15 100011827 359223 2592
ORD BILLTO Q 16 100013373 384522 2923
ORD BILLTO Q 17 100015551 410143 3295
ORD BILLTO Q 18 100016969 436002 3806
ORD BILLTO Q 19 100018193 461560 4293
ORD BILLTO Q 20 100024510 487015 5827
复制代码
这时更改数据库参数NUM_FREQVALUES到25,观察sysstat.coldist,发现已经为100002441 值收集了统计信息。
TABNAME COLNAME TYPE SEQNO COLVALUE VALCOUNT DISTCOUNT
---------- ---------- ---- ------ ------------ --------- ----------
ORD BILLTO F 1 100000002 5145 -
ORD BILLTO F 2 100004319 4533 -
ORD BILLTO F 3 100008682 4129 -
ORD BILLTO F 4 100000003 4001 -
ORD BILLTO F 5 100004037 3696 -
ORD BILLTO F 6 100017876 3435 -
ORD BILLTO F 7 100005877 3309 -
ORD BILLTO F 8 100000001 3195 -
ORD BILLTO F 9 100004046 3071 -
ORD BILLTO F 10 100004030 3008 -
ORD BILLTO F 11 100002441 2957 -
ORD BILLTO F 12 100011195 2690 -
ORD BILLTO F 13 100002354 2383 -
ORD BILLTO F 14 100004078 2311 -
ORD BILLTO F 15 100003218 2300 -
ORD BILLTO F 16 100002402 2252 -
ORD BILLTO F 17 100004087 2094 -
ORD BILLTO F 18 100004127 1950 -
ORD BILLTO F 19 100010910 1934 -
ORD BILLTO F 20 100003300 1894 -
ORD BILLTO F 21 100004006 1877 -
ORD BILLTO F 22 100003932 1835 -
ORD BILLTO F 23 100001852 1830 -
ORD BILLTO F 24 100002367 1796 -
ORD BILLTO F 25 100002342 1710 -
ORD BILLTO Q 1 100000001 3195 1
ORD BILLTO Q 2 100001808 26186 85
ORD BILLTO Q 3 100002389 51546 179
ORD BILLTO Q 4 100003137 76945 276
ORD BILLTO Q 5 100003272 102535 355
ORD BILLTO Q 6 100003993 128771 513
ORD BILLTO Q 7 100004078 154800 559
ORD BILLTO Q 8 100004558 179535 706
ORD BILLTO Q 9 100005661 205113 964
ORD BILLTO Q 10 100006660 231133 1192
ORD BILLTO Q 11 100007976 256415 1489
ORD BILLTO Q 12 100008682 282549 1695
ORD BILLTO Q 13 100009725 307643 1923
ORD BILLTO Q 14 100011096 333224 2257
ORD BILLTO Q 15 100011827 359223 2592
ORD BILLTO Q 16 100013373 384522 2923
ORD BILLTO Q 17 100015551 410143 3295
ORD BILLTO Q 18 100016969 436002 3806
ORD BILLTO Q 19 100018193 461560 4293
ORD BILLTO Q 20 100024510 487015 5827
复制代码
这时,再次执行该sql,响应时间降为1秒,返回的行数为90850。
Rows selected = 1317
Rows read = 90850
Rows written = 13183
Total User CPU Time used by agent (s) = 1.077983
Total System CPU Time used by agent (s) = 0.000324
Host execution elapsed time = 0.000023
复制代码
总结:
当数据分别不均匀时,需要收集分布式统计信息,否则可能会误导优化器选择错误的路径。当发现cost很低,并出现MSJOIN和NLJOIN操作符时,很大的可能是优化器低估。这时需要找到哪个操作符发生的问题,并一步一步地验证优化器估计值与实际值的偏差。
http://www.db2china.net/club/thread-18522-1-3.html
猜你喜欢
转载自tianyihuyidao9.iteye.com/blog/1680676
今日推荐
周排行