Understanding Oracle Histograms
Histograms
are a feature of the Cost Based Optimizer (CBO) that helps to know the data
distribution in a column. Histograms should be created in columns with highly
skew data and are included in the filter clause (WHERE) of a SQL statement.
Histograms
play a very important role in the creation of the execution plan and we must be
careful in the method used to gather statistics in the DB objects.
Let me
show you how histograms work, for this test I have created a table T_HIST with
some data in it:
SQL> CREATE TABLE T_HIST AS SELECT LEVEL
C1, mod(round(dbms_random.value(1, 100000)),254) c2,mod(round(dbms_random.value(1, 100000)),254) c3,
'C' C4 from dual connect by level < 529101;
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1, 100000)),254) c2,
mod(round(dbms_random.value(1, 100000)),254) c3,
'P' from dual connect by level < 301;
300 rows created.
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1, 100000)),254) c2,
mod(round(dbms_random.value(1, 100000)),254) c3,
'R' from dual connect by level < 301;
300 rows created.
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1, 100000)),254) c2,
mod(round(dbms_random.value(1, 100000)),254) c3,
'T' from dual connect by level < 301;
300 rows created.
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1, 100000)),254) c2,
2 mod(round(dbms_random.value(1, 100000)),254) c3,
'A' from dual connect by level < 500001;
500000 rows created.
SQL> COMMIT;
Commit complete.
SQL>
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,
mod(round(dbms_random.value(1,
300 rows created.
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,
mod(round(dbms_random.value(1,
300 rows created.
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,
mod(round(dbms_random.value(1,
300 rows created.
SQL> INSERT INTO T_HIST SELECT LEVEL C1, mod(round(dbms_random.value(1,
2 mod(round(dbms_random.value(1,
500000 rows created.
SQL> COMMIT;
Commit complete.
SQL>
We will now create an index in the column C4:
SQL> CREATE INDEX T_HIST_IDX1 ON T_HIST(C4);
Index created.
SQL>
Index created.
SQL>
Let us
see how the data is distributed inside column C4:
SQL> SELECT C4,COUNT(*) FROM T_HIST GROUP BY C4 ORDER BY C4;
C COUNT(*)
- ----------
A 500000
C 529100
P 300
R 300
T 300
You can see that we have 5 distinct values in column C4 (A, C, P, R
& T) and the number of values that correspond to each. Let us take value ‘R’
as example, if we query the table the expected execution plan is to use the
Index created, because we need only 300 rows of the 1030000 rows total. Let’s
query:
SQL> explain plan for select /* test23 */
*
2 from T_HIST
3 where C4 = 'R'
4 order by
5 C1
6 ;
Explained.
SQL>
SET LINESIZE 130
SET PAGESIZE 0
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY); 2 3 4 5 6
Explained.
SQL> SQL> SQL> SQL> 2
Plan hash value: 3146893614
------------------------------ ------------------------------ ------------------------------ --
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------ ------------------------------ ------------------------------ --
| 0 | SELECT STATEMENT | | 30 | 1260 | 387 (1)| 00:00:05 |
| 1 | SORT ORDER BY | | 30 | 1260 | 387 (1)| 00:00:05 |
| 2 | TABLE ACCESS BY INDEX ROWID| T_HIST | 30 | 1260 | 386 (0)| 00:00:05 |
|* 3 | INDEX RANGE SCAN | T_HIST_IDX1 | 3607 | | 375 (0)| 00:00:05 |
------------------------------ ------------------------------ ------------------------------ --
Predicate Information (identified by operation id):
------------------------------ ---------------------
3 - access("C4"='R')
Note
-----
- dynamic sampling used for this statement (level=2)
SET PAGESIZE 0
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY); 2 3 4 5 6
Explained.
SQL> SQL> SQL> SQL> 2
Plan hash value: 3146893614
------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------
| 0 | SELECT STATEMENT | | 30 | 1260 | 387 (1)| 00:00:05 |
| 1 | SORT ORDER BY | | 30 | 1260 | 387 (1)| 00:00:05 |
| 2 | TABLE ACCESS BY INDEX ROWID| T_HIST | 30 | 1260 | 386 (0)| 00:00:05 |
|* 3 | INDEX RANGE SCAN | T_HIST_IDX1 | 3607 | | 375 (0)| 00:00:05 |
------------------------------
Predicate Information (identified by operation id):
------------------------------
3 - access("C4"='R')
Note
-----
- dynamic sampling used for this statement (level=2)
What happened? Execution plan is saying that our query will return only
30 rows! obviously this is totally wrong; also look at the estimated elapsed
time (5 sec). Ok, we are being unfair to the CBO as we are not giving all the
information required, so let’s gather statistics but without histograms:
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (ownname => 'ALFRED',
tabname=>'T_HIST',
estimate_percent=>dbms_stats.auto_sample_size,method_opt=>'for all
columns size 1');
PL/SQL procedure successfully completed.
SQL> explain plan for select /* test23 */ *
2 from T_HIST
3 where
C4 = 'R'
4 order by
5 C1
6 ;
Explained.
SQL>
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
2 FROM
TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 2911980760
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Bytes |TempSpc| Cost (%CPU)| Time
|
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT |
| 206K| 3017K|
| 1832 (2)| 00:00:22 |
| 1 | SORT ORDER BY |
| 206K| 3017K|
5672K| 1832 (2)| 00:00:22 |
|* 2 | TABLE ACCESS FULL| T_HIST | 206K|
3017K| | 774
(2)| 00:00:10 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("C4"='R')
What happened? Now the CBO decided to choose a Full Table Scan, it also
states that will return 206k rows and the estimated elapsed time is 22 sec.
That’s because we used AUTO_SAMPLE_SIZE so Oracle we decide how many rows to
sample and for sure values of ‘A’ & ‘C’ represent more than 99% of the
samples. But how can we tell the CBO that our data distribution is skewed? that’s
where histograms enter into play. Let’s now gather statistics using histogram
for column C4:
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (ownname => 'ALFRED',
tabname=>'T_HIST',
estimate_percent=>dbms_stats.auto_sample_size,method_opt=>'for columns C4
size 254');
PL/SQL procedure successfully completed.
SQL> explain plan for select /* test23 */ *
2 from T_HIST
3 where
C4 = 'R'
4 order by
5 C1
6 ;
Explained.
SQL>
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
2 FROM
TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3146893614
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | |
941 | 14115 | 8 (13)| 00:00:01 |
| 1 | SORT ORDER BY | |
941 | 14115 | 8 (13)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| T_HIST |
941 | 14115 | 7 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T_HIST_IDX1 | 941 |
| 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("C4"='R')
Now is using the Index again, states that will return 941 rows but look
how the cost decreased and the estimated elapsed time to 1 sec! Even if the CBO
wasn’t able to calculate the real number of rows being returned the cost was
fixed, in order to have the most accurate execution plan we must increase the
sample size. Let us make a test with 100% of the samples:
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (ownname => 'ALFRED',
tabname=>'T_HIST', estimate_percent=>100,method_opt=>'for columns C4
size 254');
PL/SQL procedure successfully completed.
SQL> explain plan for select /* test23 */ *
2 from T_HIST
3 where
C4 = 'R'
4 order by
5 C1
6 ;
Explained.
SQL>
SQL> SET LINESIZE 130
SQL> SET PAGESIZE 0
SQL> SELECT *
2 FROM
TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3146893614
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | |
300 | 4500 | 5
(20)| 00:00:01 |
| 1 | SORT ORDER BY | |
300 | 4500 | 5 (20)|
00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| T_HIST |
300 | 4500 | 4
(0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T_HIST_IDX1 | 300 |
| 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("C4"='R')
Now is totally accurate! Right? The execution plan generated by the CBO
is using the Index, the estimated rows are 300 which is correct and the cost is
even better. This is an example of a very small table, but is really an option
to gather statistics with estimate of 100% to a big table?, I don’t think so;
that’s why AUTO_SAMPLE_SIZE can be a good balance between sample size and
accuracy.
Column C4 now is having a frequency histogram because is having less than
254 distinct values (NDV), only 5; that means that frequency histogram has the
restriction of 254 buckets (254 NDV’s), seriously?. Let test it, we will gather
statistics to all columns of the table using AUTO_SAMPLE_SIZE and AUTO for
histograms:
SQL> exec
dbms_stats.gather_table_stats(ownname=>'ALFRED',tabname=>'T_HIST', estimate_percent=>null,
method_opt=>'FOR ALL COLUMNS SIZE AUTO');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS,
HISTOGRAM from user_tab_columns where table_name = 'T_HIST';
C1 529100
.00000189 1 NONE
C2
254 .003937008 1 NONE
C3
254 4.8544E-07 254
FREQUENCY
C4
5 4.8544E-07 5 FREQUENCY
What happened?, well column C3 was also candidate for a histogram because
it’s density (skewed) but look at the number of buckets is 254; that means that
we are on the edge of the limit in order to have a histogram in this column.
Don’t believe me? let’s make a quick test, we will insert a new distinct value
into this column, gather statistics again and see what happen:
SQL> insert into T_HIST values (100001,42, 99999999999,'B');
1 row created.
SQL> COMMIT;
Commit complete.
SQL> exec
dbms_stats.gather_table_stats(ownname=>'ALFRED',tabname=>'T_HIST',
estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE AUTO');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS,
HISTOGRAM from user_tab_columns where table_name = 'T_HIST';
C1
529100 .00000189 1 NONE
C2
254 .003937008 1 NONE
C3
255 .003921569 1 NONE
C4
6 4.8544E-07 6 FREQUENCY
Can you see how the number of distinct values (NVD) changed? now we can’t
have a histogram in that column, really?; well we can force it but look how low
is our density. Let’s try to force it and specify to the gather stats that we want
all columns with maximum of 254 buckets:
SQL> exec
dbms_stats.gather_table_stats(ownname=>'ALFRED',tabname=>'T_HIST', estimate_percent=>null,
method_opt=>'FOR ALL COLUMNS SIZE 254');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS,
HISTOGRAM from user_tab_columns where table_name = 'T_HIST';
C1 529100 1.9186E-06 254 HEIGHT BALANCED
C2
254 4.8544E-07 254
FREQUENCY
C3
255 .003935563 254 HEIGHT
BALANCED
C4
6 4.8544E-07 6 FREQUENCY
Look how it works? it can create frequency histogram on columns with
less than 254 NDV’s and height balanced histogram in columns with more than 254
NDV’s. Also note the first example, we used size of 1 (1 bucket) that means no
histogram because all values are inside only one bucket.
This is a basic example of how histograms work, honestly I was having a
lot of doubts before making this test and some of them are clarified reading Jonathan Lewis new post in allthinngsoracle.com, however
there’s a lot of path to walk through and even difficult as 12c have 2 more
types of histograms to help CBO.
Thanks,
Alfredo
Labels: CBO, DBMS_STATS, explain plan, histograms, statistics