CLUSTERING FACTOR DEMYSTIFIED PART – I

In this post, I will explain what is clustering factor, how and why does it affect the performance and then demonstrate it with the help of an example.
What is  Clustering Factor?
The Index clustering factor is a number which represents the degree to which the data in table is synchronized with the entries in the index. It  gives a rough measure of  how many I/Os the database would perform if it were to read every row in that table via the index in index order.  If the rows of a table on disk are sorted in the same order as the index keys, the database will perform a minimum number of I/Os on the table to read the entire table via the index.
Let’s try to understand it by taking an analogy.  I want to locate all the books (records)  by (for)  an author (a key value)  in the library (table)  using the catalog card (index) .  If books (records) are arranged authorwise (sorted by key)  in racks (table blocks) , my catalog card (index)  will show me information like this :
Book1(Record1)    Rack1(Block1)
Book2(Record2)    Rack1(Block1)
Book3(Record3)    Rack1(Block1)
Bookn(Recordn)    Rack1(Block1)
It means that I need to visit only one rack (block) provided all the books (records)  fit in one rack (block) i.e. Rack1(Block1) to get all the required books (records) . If one rack (block) is insufficient to hold all the books (records) ,Hence,
No. of racks (blocks) to be visited = number of distinct racks (blocks in rowids) listed in the card(index)
Consider a scenario where books (records) of the same author (key value) are scattered across various racks (blocks) . In the worst scenario i.e. each book is in a different rack, my catalog card (index)  will be  like this:
Book1(Record1)    Rack1(Block1)
Book2(Record2)    Rack2(Block2)
Book3(Record3)    Rack3(Block3)
Bookn(Recordn)    Rackn(Blockn)
Now, to get all the books (records) I will have to gather them from multiple racks (blocks) .
In the worst case,  no. of racks (blocks)  visited = no. of books  (records)
To calculate the clustering factor of an index during the gathering of index statistics, Oracle does the following :
For each entry in the index
(
   Compare the entry’s table rowid block with the block of the previous index entry.
If the block is different, Oracle increments the clustering factor by 1.
)
The minimum possible clustering factor is equal to the number of distinct  blocks identified through the index’s list of rowid’s.  An index with a low clustering_factor is closely aligned with the table and related rows reside together inside each data block, making indexes very desirable for optimal access.
The maximum clustering factor is the number of entries in the index i.e. each rowid points to a different block in the table.  An index with a high clustering factor is out-of-sequence with the rows in the table and large index range scans will consume lots of I/O.
How to find the clustering factor of an index 
Oracle provides a column called clustering_factor in the dba_indexes view that provides information on how the table rows are synchronized with the index. A low value is desirable.
Let’s demonstrate the concept:
- Create a table organized which contains two columns  –  id(number) and txt (char)
- Populate the table insert 35 records for each value of id where id ranges from 1 to 100
- In this case as records are added sequentially,  records for a key value are stored together
SQL> drop table organized purge; 
            create table organized (id number, txt char(900));

           begin
           for i in 1..100 loop
               insert into organized select i, lpad('x', 900, 'x')
               from    dba_objects where rownum < 35;
           end loop;
          end;
          /
-  create another table unorganized which is a replica of the ‘organized’ table but records are inserted in a random manner so that records for a key value may be scattered across different blocks (order by dbms_random.random).
SQL> drop table unorganized purge;
           create table unorganized as select * from organized order by dbms_random.random;
- Find out no. of blocks across which records of a key value are spread in the two tables.
- Note that in ‘unorganized’  table,  records  are scattered where in ‘organized’ table, records are clustered   
SQL> select org.id,  org.cnt organized_blocks , unorg.cnt unorganized_blocks
         from
            (select id, count(distinct(dbms_rowid.ROWID_BLOCK_NUMBER(rowid))) cnt
              from organized
              group by id)org ,
           ( select id,  count(distinct(dbms_rowid.ROWID_BLOCK_NUMBER(rowid))) cnt
             from unorganized
              group by id) unorg
        where org.id = unorg.id
        order by id;
     ID    ORGANIZED_BLOCKS UNORGANIZED_BLOCKS
———- —————- ——————
         1                5                 34
         2                6                 34
         3                6                 32
                      …..
        98                5                 32
        99                5                 34
       100               6                 33
-- Create index on id column on both the tables and gather statistics for both the tables
create index organized_idx on organized(id);
           create index unorganized_idx on unorganized(id);
          exec dbms_stats.gather_table_stats(USER, 'organized', estimate_percent => 100, method_opt=> 'for all indexed columns size 254');
        exec dbms_stats.gather_table_stats(USER, 'unorganized', estimate_percent => 100, method_opt=> 'for all indexed columns size 254');
– Check that both the tables contain same no. of rows/blocks.
SQL> select table_name, blocks, num_rows
           from user_tables
           where table_name like '%ORGANIZED'
           order by 1;
TABLE_NAME                         BLOCKS   NUM_ROWS
——————————      ———- ———-
ORGANIZED                                 488       3400
UNORGANIZED                           486       3400
– Check the index statistics
– Note that the index on table ‘organized‘ has a clustering factor (488) which is equal to the no. of blocks in the table i.e.  to fetch all the records for various key values using index,  blocks need not be switched unless all the records in the earlier block have been fetched
– Note that the index on table ‘unorganized‘ has a clustering factor (3299) which is close to the no. of rows   in the table (3400) i.e. almost every record for a key  value is in a different block and  to fetch all the records for a  key values using index,  block  needs to be switched almost for every record
SQL>set line 500
          col table_name for a15
          col index_name for a15
          select blevel,  leaf_blocks, table_name, index_name, clustering_factor
          from user_indexes
          where table_name like '%ORGANIZED%'
          order by 1;
    BLEVEL LEAF_BLOCKS TABLE_NAME      INDEX_NAME      CLUSTERING_FACTOR
———- ———– ————— ————— —————–
         1           7 ORGANIZED              ORGANIZED_IDX                 488
         1           7 UNORGANIZED     UNORGANIZED_IDX              3299
The  clustering factor is a measure of the no. of I/Os the database will perform against the table in order to read every row via the index. We can verify this fact by 
 executing a query with tracing enabled that will, in fact, read every row of the table via the index. We’ll do that by using an index hint to force the optimizer to use the index and count the non-null occurrences of a nullable column (text) that is not in the index. That will force the database to go from index to table for every single row:
SQL> alter session set tracefile_identifier = 'cluster_factor';
           alter session set sql_trace=true;
           select /*+ index(organized organized_idx) */ count(txt) from organized where id=id;
           select /*+ index(unorganized unorganized_idx) */ count(txt) from unorganized where id=id;
           alter session set sql_trace=false;
– Find out the name of trace file generated
SQL>col trace_file for a100
           select  value trace_file from v$diag_info
           where upper(name) like '%TRACE FILE%';
TRACE_FILE
—————————————————————————————————-
/u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_29116_cluster_factor.trc
– Run tkprof utility on the trace file generated 
cd /u01/app/oracle/diag/rdbms/orcl/orcl/trace
   tkprof orcl_ora_29116_cluster_factor.trc cluster_factor.out
   vi cluster_factor.out
– let’s analyze the tkprof output 
********************************************************************************
SQL ID: c3r8241qas3rn
Plan Hash: 2296712572
select /*+ index(organized organized_idx) */ count(txt)
from
 organized where id=id
Rows     Row Source Operation
——-  —————————————————
      1  SORT AGGREGATE (cr=496 pr=0 pw=0 time=0 us)
   3400   TABLE ACCESS BY INDEX ROWID ORGANIZED (cr=496 pr=0 pw=0 time=37892 us cost=496 size=3073600 card=3400)
   3400    INDEX FULL SCAN ORGANIZED_IDX (cr=8 pr=0 pw=0 time=9189 us cost=8 size=0 card=3400)(object id 75116)
********************************************************************************
As you can see
Total no. of I/Os performed against the index on ‘organized’ table = 8 (cr=8 in the INDEX FULL SCAN ORGANIZED_IDX row source)
Total I/O’s performed by the query = 496 (cr = 496 in the TABLE ACCESS BY INDEX ROWID ORGANIZED)
Hence , No. of I/O’s made against the table = 496 – 8 = 488 which is equal to the clustering factor of the index
********************************************************************************
SQL ID: d979q0sqdbabb
Plan Hash: 3317439903
select /*+ index(unorganized unorganized_idx) */ count(txt)
from
 unorganized where id=id
Rows     Row Source Operation
——-  —————————————————
      1  SORT AGGREGATE (cr=3307 pr=272 pw=0 time=0 us)
   3400   TABLE ACCESS BY INDEX ROWID UNORGANIZED (cr=3307 pr=272 pw=0 time=40536 us cost=3308 size=3073600 card=3400)
   3400    INDEX FULL SCAN UNORGANIZED_IDX (cr=8 pr=0 pw=0 time=9693 us cost=8 size=0 card=3400)(object id 75117)
********************************************************************************
Similarly, if I do the same analysis on the UNORGANIZED index,
Total no. of I/Os performed against the index on ‘unorganized’ table = 8 (cr=8 in the INDEX FULL SCAN UNORGANIZED_IDX row source)
Total I/O’s performed by the query = 3307 (cr = 3307 in the TABLE ACCESS BY INDEX ROWID ORGANIZED)
Hence , No. of I/O’s made against the table = 3307 – 8 = 3299 which is equal to the clustering factor of the index
So, for one table, the database performs 496 total I/O’s to retrieve exactly the same data as for the other table—which needed 3307 I/O’s.
Obviously, one of these indexes is going to be more useful for retrieving a larger number of rows than the other.
We can verify this  by using autotrace on a query against both  the tables:
SQL>set autotrace traceonly explain
          select /*+ index(organized organized_idx) */ count(txt)
          from  organized where id=id;
          set autotrace off
———————————————————————————————-
| Id  | Operation                                    | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
———————————————————————————————-
|   0 | SELECT STATEMENT             |               |     1 |   904 |   496   (0)| 00:00:06 |
|   1 |  SORT AGGREGATE                |               |     1 |   904 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| ORGANIZED     |  3400 |  3001K|   496   (0)| 00:00:06 |
|*  3 |    INDEX FULL SCAN           | ORGANIZED_IDX |  3400 |       |    8   (0)| 00:00:01 |
———————————————————————————————-
Note that there is a cost of 8 for using the index for the ORGANIZED table and index—about 8 I/O’s against the index i.e. the query will hit one root block (1)and the leaf blocks (7) .
Then the query will be doing 488 more I/Os against the table(= number of blocks in table), because the rows needed are all next to each other on a few database blocks, for a total cost of 496.
SQL>set autotrace traceonly explain
         select /*+ index(unorganized unorganized_idx) */ count(txt)
         from      unorganized where id=id;
         set autotrace off;
————————————————————————————————–
| Id  | Operation                    | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
————————————————————————————————
|   0 | SELECT STATEMENT             |                 |     1 |   904 |  3308   (1)| 00:00:40 |
|   1 |  SORT AGGREGATE              |                 |     1 |   904 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| UNORGANIZED     |  3400 |  3001K|  3308   (1)| 00:00:40 |
|*  3 |    INDEX FULL SCAN           | UNORGANIZED_IDX |  3400 |       |     8   (0)| 00:00:01 |
————————————————————————————————
In the case of UNORGANIZED index, the plan still has the same 8 I/Os against the index. But because the rows needed from the table are not next to each other, the optimizer estimates that the query will have to switch the block in  the table for every row it retrieves, and its estimated cost for 3400 row is 3299 (clustering factor) rows + 8 I/Os which comes to 3307 (almost = 3308 (listed)).
As you can see, both the plans expect to return the same number of rows: 1. Both the plans are using an index full scan. But the two plans have radically different costs: one has a low cost of 496 and the other a much higher cost of 3308—even though both plans are going after exactly the same set of rows from two tables that contain the same data!
The reason of this observation is quite clear : the data in the ‘unorganized’  table is not sorted in the same fashion as the data in the index. This leads to an  increase the clustering factor  and the database must do reads and rereads of the table thereby increasing the cost. With the ‘organized’  table, the table data and the index data were sorted identically.
How to resolve the performance issues due to high clustering factor?
 To improve the CF, the table must be rebuilt (and reordered). The data retrieval can be considerably speeded up by physically sequencing the rows in the same order as the key column. If we can group together the rows for a key value,  we can get all of the row with a single block read because the rows are together.  T o achieve this goal, various methods may be used :
    . Single table index clusters : Clusters related rows together onto the same data block
    . Manual row re-sequencing (CTAS with Order by) : Pre-orders data to avoid expensive disk sorts after retrieval.
. Using dbms_redefinition, for an in-place table reorganization
In my post Clustering Factor Demystified : Part – III, I have demonstrated the use of single table index and hash clusters to improve the clustering factor of an unorganized table.
Note :
- Rebuilding of index cannot  improve the Clustering Factor.
- If table has multiple indexes, careful consideration needs to be given by which index to order table.
- Row re-sequencing does not help queries that perform full-scans or index unique scans, and careful attention must be given to the nature of the queries against the target table.
- The degree to which resequencing improves the performance depends on how far out of sequence the rows are when you begin and how many rows you will be accessing in sequence.
Summary:
- A good Clustering Factor is equal (or near) to the values of number of blocks of table.- A bad Clustering Factor is equal (or near) to the number of rows of table.
– A low clustering factor is good and reflects strong clustering.
– A high clustering factor is bad and reflects weak clustering.
– The clustering factor may be lower than the number of blocks if there are empty blocks in the table below HWM and/or there are many rows that have null values for the indexed column(s).
– The clustering factor can never be greaterr than the no. of rows in the table.In my next post, I will demonstrate how to use CTAS order by to resequence the rows physically in the table.
Thanx for your time! I hope you found the post useful. Your comments and suggestions are always welcome!

 

Index Clustering Factor And Oracle
——————————————————————————————————–


Related links:

                                              ——————-

5 thoughts on “CLUSTERING FACTOR DEMYSTIFIED PART – I

    1. The table can be reorganized w.r.t. one index only. If you have multiple indexes, reorganize the table as per the index on the column which is queried most often.

      Regards
      Anju Garg

  1. Hello Anuj,

    I am frequent visitor of your blog and every time i am learning new things in oracle.

    Its very nice explanation and wishing you a bright future in oracle .

    Regards
    Ranjeet Sing

    1. Thanks Ranjeet for your time and feedback.

      It seems that you have made a typo. My name is Anju.

      Keep visiting the blog.
      Your comments and suggestions are always welcome!

      regards
      Anju

Leave a Reply to satya Cancel reply