Computing effective index selectivity under a frequency histogram

Views: 11 | Replies: 9 | 2017-3-27 08:04:23
I've been digging into histograms lately, and a question stumped me during today's testing. I'd appreciate it if someone could explain it.
The question is this:
When a frequency histogram exists on column X and a predicate such as "where X between A and B" appears, how is the selectivity computed?
First, create a table t1 with three columns: id, id2, and name.
SQL> create table t1
  2  (
  3    id   number,
  4    id2  number,
  5    name char(8)
  6  );
Table created.

Next, insert rows into t1 so that id=1 appears once, id=2 twice, ..., id=50 fifty times, 1275 rows in total:
SQL> begin
  2    for i in 1 .. 50 loop
  3      for j in 1 .. i loop
  4        insert into t1 values (i, i, rpad(i, 8, '0'));
  5      end loop;
  6      commit;
  7    end loop;
  8  end;
  9  /
PL/SQL procedure successfully completed.
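As a quick sanity check on the row counts, here is a minimal Python sketch mirroring the PL/SQL loops above (value i inserted i times):

```python
# Mirror of the PL/SQL nested loops: value i is inserted i times,
# so t1 should hold 1 + 2 + ... + 50 = 1275 rows.
rows = [i for i in range(1, 51) for _ in range(i)]
print(len(rows))       # 1275 rows in total
print(rows.count(1))   # id = 1 appears once
print(rows.count(50))  # id = 50 appears 50 times
```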

Then create a composite index on the id and id2 columns:
SQL> create index i_t1 on t1(id,id2);
Index created.

Next, gather statistics on t1 and build a histogram on the id column:
SQL> begin
  2    dbms_stats.gather_table_stats
  3    (
  4      'sys', 't1', cascade => true, estimate_percent => null,
  5      method_opt => 'for columns size skewonly id'
  6    );
  7* end;
SQL> /
PL/SQL procedure successfully completed.
Here are the column statistics for t1:
COLUMN_NAME DATA_TYPE NUM_DISTINCT DENSITY    LOW_VALUE        HIGH_VALUE       NUM_NULLS NUM_BUCKETS HISTOGRAM
----------- --------- ------------ ---------- ---------------- ---------------- --------- ----------- ---------
ID          NUMBER    50           .000392157 C102             C133             0         50          FREQUENCY
ID2         NUMBER    50           .02        C102             C133             0         1           NONE
NAME        CHAR      45           .022222222 3130303030303030 3930303030303030 0         1           NONE
Then run the following SQL with event 10053 enabled:
SQL> alter session set events '10053 trace name context forever,level 2';
Session altered.
SQL> select * from t1 where id between 1 and 4;

Execution Plan
----------------------------------------------------------
Plan hash value: 4068921349

------------------------------------------------------------------------------------
| Id  | Operation                    | Name | Rows | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |      |   10 |   150 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID | T1   |   10 |   150 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN           | I_T1 |   11 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=1 AND "ID"<=4)

Part of the 10053 trace:
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 3.00 NDV: 50 Nulls: 0 Density: 3.9216e-04 Min: 1 Max: 50
    Histogram: Freq  #Bkts: 50  UncompBkts: 1275  EndPtVals: 50
  Table: T1  Alias: T1
    Card: Original: 1275  Rounded: 10  Computed: 10.50  Non Adjusted: 10.50
  Access Path: TableScan
    Cost: 3.02  Resp: 3.02  Degree: 0
      Cost_io: 3.00  Cost_cpu: 284411
      Resp_io: 3.00  Resp_cpu: 284411
  Access Path: index (RangeScan)
    Index: I_T1
    resc_io: 3.00  resc_cpu: 25654
    ix_sel: 0.0082353  ix_sel_with_filters: 0.0082353
    Cost: 3.00  Resp: 3.00  Degree: 1
  Best:: AccessPath: IndexRange  Index: I_T1
         Cost: 3.00  Degree: 1  Resp: 3.00  Card: 10.50  Bytes: 0
***************************************
So how is the ix_sel value shown in the 10053 trace derived?
At first, I assumed ix_sel would be computed like this (1275 being the value of num_rows):
(1+2+3+4)/1275 = 0.00784314
Clearly that does not match the ix_sel shown in the trace. Could someone explain how this ix_sel value is computed? Thanks in advance!
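For reference, the arithmetic above can be checked in a few lines of Python. This is only a sketch using the numbers reported in the trace; note that the density reported for ID happens to equal 1/(2*num_rows):

```python
num_rows = 1275
density = 1 / (2 * num_rows)        # 0.000392157, the density reported for ID
naive = (1 + 2 + 3 + 4) / num_rows  # the first guess for ix_sel

print(round(naive, 8))              # 0.00784314 -- does not match the trace
print(round(naive + density, 7))    # 0.0082353  -- exactly the trace's ix_sel
```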
千问 | 2017-3-27 08:04:23
Oh, one addendum: my test database is version 10.2.0.1.
千问 | 2017-3-27 08:04:23
Jonathan Lewis's "Cost-Based Oracle Fundamentals" pp.66-67 talks about it.
Yong Huang
千问 | 2017-3-27 08:04:23
Originally posted by Yong Huang on 2011-5-6 05:55:
Jonathan Lewis's "Cost-Based Oracle Fundamentals" pp.66-67 talks about it.
Yong Huang

Could you please show me the part of the text you mentioned in "Cost-Based Oracle Fundamentals" pp.66-67?
I've checked that section of "Cost-Based Oracle Fundamentals" pp.66-67,
and what I see on page 66 is the following (in part):
We have crossed a partition boundary—Oracle has noted that we will be hitting partitions 1
and 2. Checking the table and column statistics, we find that between them the partitions have
160,000 rows, and the column has 400 distinct values with a range from 0 to 399, from which we
want the range 150 to 250. Let’s apply the same formula as last time: 160,000 * ((250 – 150)/399
+ 2/400 ) = 48,100. The result is not even close.
We could check to see whether Oracle has done two sets of arithmetic, one for the range
150 <= part_col < 200 and one for 200 <= part_col <= 250, and then added them. It hasn’t—this
would give the result 40,800.
The answer comes from the table-level statistics. There are 1,000,000 rows, with
1,000 different values, and a range of 0 to 999, giving us 1,000,000 * ((250 –150) / 999 + 2/1,000)
= 102,100.
With multiple-known partitions at parse time, Oracle uses the table-level statistics.
The subject Jonathan covers on pp.66-67 is how to get cardinality with partitioned objects,
but my test case is about how the effective index selectivity is computed for range predicates when a frequency histogram is in place.
Anyway, thanks for your comment.


千问 | 2017-3-27 08:04:23
For now I can't explain why the ix_sel is 0.0082353. Apparently that number equals
.000392157+0.00784314, i.e. density plus (1+2+3+4)/1275
or
(0.5+1+2+3+4)/1275
Either of the two needs some thinking to explain. I'll think about it.
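A quick Python check (a sketch, using only the numbers quoted in this thread) confirms the two decompositions are the same number, because the reported density equals 0.5/1275:

```python
num_rows = 1275
density = 1 / (2 * num_rows)              # 0.000392157, as reported in the trace
a = density + (1 + 2 + 3 + 4) / num_rows  # density plus the summed counts
b = (0.5 + 1 + 2 + 3 + 4) / num_rows      # the alternative reading
print(round(a, 7), round(b, 7))           # 0.0082353 0.0082353
```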
By the way, you quoted Jonathan's text wrong. pp.66-67 is the section called "Effective Index Selectivity" in Chapter 4. Maybe you're using a different edition.
Yong Huang
千问 | 2017-3-27 08:04:23
Thanks, Yong Huang.
I think your arithmetic works in my test case, and I did a few more tests by changing the range values in the predicate, i.e. "where id between 1 and 5", "where id between 1 and 6", and so on.
When "where id between 1 and 5":
ix_sel = (1+2+3+4+5)/1275 + density

When "where id between 1 and 6":
ix_sel = (1+2+3+4+5+6)/1275 + density
At the same time, I'm thinking about what the value of ix_sel would be if there were no histogram in place in my test case; the standard formula should be:
(high_end - low_end)/(high_value - low_value) + 2 * density
So I think the significant difference between the two kinds of arithmetic is how many times you add the density for bounded, closed range predicates: when there is no histogram, you add the density twice; when there is a frequency histogram, you add the density only once.
Because so far I haven't found any official evidence to prove this viewpoint, it's just my guess.

If anyone can find a convincing paper explaining how ix_sel is calculated in my test case, please let me know.


千问 | 2017-3-27 08:04:23
The standard formula works with fairly uniform data distribution. In your case, you have one 1, two 2's, three 3's, ... So you can't simply use the formula like this:
(4-1)/(1275-1) + 2 * density
The (4-1) part would be too small. Instead, you have to factor in the count, or endpoint_number as seen in user_histograms:

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1              1
              3              2
              6              3
             10              4
             15              5

For a frequency histogram, endpoint_number - lag(endpoint_number) over (order by endpoint_number) is the count of that value in the column. (The "order by" part can also be endpoint_value, but I don't think there's a difference.)
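That lag-difference can be sketched in Python against the endpoint rows shown above:

```python
# Per-value counts recovered from a frequency histogram:
# count(value) = endpoint_number - previous endpoint_number.
endpoints = [(1, 1), (3, 2), (6, 3), (10, 4), (15, 5)]  # (endpoint_number, endpoint_value)

counts = {}
prev = 0
for endpoint_number, endpoint_value in endpoints:
    counts[endpoint_value] = endpoint_number - prev
    prev = endpoint_number

print(counts)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
```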
Yong Huang
[Last edited by Yong Huang on 2011-5-7 08:48]
千问 | 2017-3-27 08:04:23
First of all, I think I need to show you some special cases, not mentioned in Jonathan's book, that matter when we do the math to get the right selectivity.
1. Still in my test table t1, on column id, with the predicate "where id between 1 and 4".
Regenerate the table statistics without building a histogram on column id.
Note that "1" equals the low_value of column id:
COLUMN_NAME DATA_TYPE NUM_DISTINCT DENSITY    LOW_VALUE        HIGH_VALUE       NUM_NULLS NUM_BUCKETS HISTOGRAM
----------- --------- ------------ ---------- ---------------- ---------------- --------- ----------- ---------
ID          NUMBER    50           .02        C102             C133             0         1           NONE
ID2         NUMBER    50           .02        C102             C133             0         1           NONE
NAME        CHAR      45           .022222222 3130303030303030 3930303030303030 0         1           NONE
We then run the query "select * from t1 where id between 1 and 4", and the value of ix_sel in the 10053 trace is:
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 3.00 NDV: 50 Nulls: 0 Density: 0.02 Min: 1 Max: 50
  Table: T1  Alias: T1
    Card: Original: 1275  Rounded: 104  Computed: 103.56  Non Adjusted: 103.56
  Access Path: TableScan
    Cost: 3.02  Resp: 3.02  Degree: 0
      Cost_io: 3.00  Cost_cpu: 292824
      Resp_io: 3.00  Resp_cpu: 292824
  Access Path: index (RangeScan)
    Index: I_T1
    resc_io: 3.00  resc_cpu: 61924
    ix_sel: 0.081224  ix_sel_with_filters: 0.081224
    Cost: 3.00  Resp: 3.00  Degree: 1
  Best:: AccessPath: IndexRange  Index: I_T1
         Cost: 3.00  Degree: 1  Resp: 3.00  Card: 103.56  Bytes: 0
***************************************
0.081224 = (4-1)/(50-1) + density
2. Run the query "select * from t1 where id between 45 and 50".
Note that "50" equals the high_value of column id.
Check the value of ix_sel in the 10053 trace:
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 3.00 NDV: 50 Nulls: 0 Density: 0.02 Min: 1 Max: 50
  Table: T1  Alias: T1
    Card: Original: 1275  Rounded: 156  Computed: 155.60  Non Adjusted: 155.60
  Access Path: TableScan
    Cost: 3.02  Resp: 3.02  Degree: 0
      Cost_io: 3.00  Cost_cpu: 297506
      Resp_io: 3.00  Resp_cpu: 297506
  Access Path: index (RangeScan)
    Index: I_T1
    resc_io: 3.00  resc_cpu: 82204
    ix_sel: 0.12204  ix_sel_with_filters: 0.12204
    Cost: 3.01  Resp: 3.01  Degree: 1
  Best:: AccessPath: IndexRange  Index: I_T1
         Cost: 3.01  Degree: 1  Resp: 3.01  Card: 155.60  Bytes: 0
***************************************
Here, "0.12204" also equals (50-45)/(50-1) + density
3. Run the query "select * from t1 where id between 1 and 50".
This time we ask for the entire range of values in t1.
Check the value of ix_sel in the 10053 trace:
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 3.00 NDV: 50 Nulls: 0 Density: 0.02 Min: 1 Max: 50
  Table: T1  Alias: T1
    Card: Original: 1275  Rounded: 1275  Computed: 1275.00  Non Adjusted: 1275.00
  Access Path: TableScan
    Cost: 3.03  Resp: 3.03  Degree: 0
      Cost_io: 3.00  Cost_cpu: 398236
      Resp_io: 3.00  Resp_cpu: 398236
  Access Path: index (RangeScan)
    Index: I_T1
    resc_io: 9.00  resc_cpu: 561343
    ix_sel: 1  ix_sel_with_filters: 1
    Cost: 9.04  Resp: 9.04  Degree: 1
  Best:: AccessPath: TableScan
         Cost: 3.03  Degree: 1  Resp: 3.03  Card: 1275.00  Bytes: 0
***************************************
Here, "1" probably equals (50-1)/(50-1) without adding density
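The three cases above can be replayed numerically. This is just a sketch using the statistics from this post (low_value 1, high_value 50, density 0.02):

```python
low_value, high_value, density = 1, 50, 0.02

def ix_sel(low_end, high_end, density_terms):
    # standard range selectivity plus a chosen number of density terms
    return (high_end - low_end) / (high_value - low_value) + density_terms * density

print(round(ix_sel(1, 4, 1), 6))    # 0.081224 -- low end == low_value: density once
print(round(ix_sel(45, 50, 1), 5))  # 0.12204  -- high end == high_value: density once
print(ix_sel(1, 50, 0))             # 1.0      -- both ends at the extremes: no density
```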
From the three special cases listed above, I draw this conclusion:
When one end of a bounded, closed range predicate equals the column's high_value or low_value, you add the density only once when you do the math; if both ends equal high_value and low_value respectively, there is no need to add the density at all; and when both ends fall strictly inside the range, as in the normal case, you add the density twice, as the standard formula says.
So by now I think I know how to do the math when a frequency histogram is in place, as in my original test case.
The standard formula is:
(high_end - low_end)/(high_value - low_value) + 2 * density
Let's call "(high_end - low_end)/(high_value - low_value)" "part 1" for short.
When there is a frequency histogram, "part 1" changes to this (in my original test case, where id between 1 and 4):
(1+2+3+4)/1275 --- this may be what "factor in the count" means, as Yong Huang said
And then, because "1" equals the low_value of column id, you add the density only once.
Thus comes the ix_sel in 10053: (1+2+3+4)/1275 + 0.000392157
I think this is a good explanation of why the density is added only once in my original test case: it is not because of the existence of the frequency histogram, but because one end of my range predicate equals the low_value of column id.
Finally, the data distribution in column id is indeed not uniform, but it is continuous from 1 to 50, so I think the standard formula still works well; see my three examples above without a histogram in place.
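Putting the pieces together for the original histogram case, here is a sketch of the proposed arithmetic: the counts come from the frequency histogram, and the density is added once because the low end equals low_value.

```python
num_rows = 1275
density = 1 / (2 * num_rows)           # .000392157 from the stats
counts = {v: v for v in range(1, 51)}  # in t1, value i occurs i times

part1 = sum(counts[v] for v in range(1, 5)) / num_rows  # (1+2+3+4)/1275
ix_sel = part1 + density               # density added only once
print(round(ix_sel, 7))                # 0.0082353, the value in the 10053 trace
```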
[Last edited by sprite731 on 2011-5-8 10:23]
千问 | 2017-3-27 08:04:23
You seem to be interested in whether, and how many times, you should add the density. But I think a bigger problem is this part:
(high_end-low_end)/(high_value-low_value)
This part becomes completely invalid in skewed cases. We know the correct calculation in your case is
(1+2+3+4)/...
How do you fit that sum of those four numbers into this part of the formula: high_end-low_end? Where's the minus sign in 1+2+3+4? Which one is the high and which is the low value? We can solve this problem first and then go on to the density part.
Yong Huang
千问 | 2017-3-27 08:04:23
Originally posted by Yong Huang on 2011-5-10 02:29:
You seem to be interested in the part whether and how many times you should add density. But I think a bigger problem is this part:
(high_end-low_end)/(high_value-low_value)
This part becomes completely invalid in skewed cases. We know the correct calculation in your case is
(1+2+3+4)/...
How do you fit that sum of those four numbers into this part of the formula: high_end-low_end? Where's the minus sign in 1+2+3+4? Which one is the high and which is the low value? We can solve this problem first and then go on to the density part.
Yong Huang


To Yong Huang:
Yes, the data distribution of table t1 in my original test case is skewed, just as you said, but it is continuous from 1 to 50.
So I think the standard formula "(high_end - low_end)(what you call the "required range")/(high_value - low_value)" in Jonathan's "Cost-Based Oracle Fundamentals" is workable in my test case when there is no frequency histogram in place, and I showed you three examples in my earlier post:
"where id between 1 and 4" with no histogram ----- selectivity = (4-1)/(50-1) + density
After that, I think that once we build a frequency histogram on column id of table t1, Oracle knows exactly the frequency (repetition count) of each value in the required range (i.e. "between 1 and 4" means one "1", two "2"s, three "3"s, and four "4"s) from the frequency histogram on column id,
so Oracle can use those real counts to calculate the selectivity instead of using "high_end - low_end":
"where id between 1 and 4" with a frequency histogram ------ selectivity = (1+2+3+4)/1275 + density
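As a closing sketch, here are the two formulas side by side, using only values quoted earlier in the thread (no-histogram case: low_value 1, high_value 50, density 0.02; histogram case: per-value counts summed, with the histogram density added once):

```python
num_rows = 1275

def sel_no_histogram(low_end, high_end):
    # standard formula; here the low end equals low_value, so density is added once
    return (high_end - low_end) / (50 - 1) + 0.02

def sel_freq_histogram(low_end, high_end):
    # per-value counts are known exactly: value i occurs i times in t1
    return sum(range(low_end, high_end + 1)) / num_rows + 1 / (2 * num_rows)

print(round(sel_no_histogram(1, 4), 6))    # 0.081224
print(round(sel_freq_histogram(1, 4), 7))  # 0.0082353
```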