Monthly Archives: December 2012


BUFFER CACHE WAIT EVENTS

 
  
In this post, I will discuss the wait events associated with the database buffer cache: symptoms, causes, and their solutions.
 
SYMPTOMS:
 
There are several symptoms that point to the buffer cache as a problem. If the following are observed in the Top 5 Timed Events (V$session_event, v$system_event), then investigate further:
 
- LATCH: CACHE BUFFER CHAINS (CBC): This generally points to hot blocks in the cache.
 
- LATCH: CACHE BUFFER LRU CHAINS: contention indicates that the buffer cache is too small, and block replacement is excessive.
 
- BUFFER BUSY WAITS: This is almost always an application tuning issue. Multiple processes are requesting the same block.
 
- DB FILE SEQUENTIAL READS : wait for a single block to be read synchronously from disk
 
- DB FILE SCATTERED READS : wait for multiple blocks to be read concurrently from disk
 
- DB FILE PARALLEL READS : wait for a batch of non-contiguous blocks read from disk; the I/O requests are issued in parallel but the session waits for all of them to complete
 
- FREE BUFFER WAITS: When the DBWn processes do not make sufficient free buffers to meet the demand, free buffer waits are seen.
 
   Let's discuss the above wait events in detail.
 
  As I discussed in my earlier article on buffer cache architecture, the buffer cache holds in-memory versions of data blocks for faster access. Each buffer in the buffer cache has an associated element in the buffer header array, externalized as x$bh. Buffer headers keep track of various attributes and the state of buffers in the buffer cache. These buffer headers are chained together in a doubly linked list and linked to a hash bucket. There are many hash buckets (the number of buckets is derived from, and governed by, the _db_block_hash_buckets parameter). Access (both inspection and change) to these hash chains is protected by cache buffers chains latches.
 
Further, buffer headers can be linked to and delinked from hash buckets dynamically.
 
Here is a simple algorithm for accessing a buffer:

1. Hash the data block address (DBA: a combination of file_id and block_id) to find the hash bucket.
2. Get the CBC latch protecting the hash bucket.
3. If successful (CBC latch obtained), walk the hash chain, reading buffer headers to see if the specific version of the block is already on the chain.

      If the specific version of the block is found,
         access the buffer in the buffer cache, protected by buffer pin/unpin actions.

      If the specific version of the block is not found,
          get the cache buffers lru chain latch to find a free buffer in the buffer cache,
          unlink the buffer header for that buffer from its current chain,
          link that buffer header to this hash chain,
          release the latch and
          read the block into that free buffer with the buffer header pinned.

   If not successful (CBC latch not obtained),
      spin for spin_count times and
      go to step 2.
      If the latch was still not obtained after spinning, then
         sleep (with exponentially increasing sleep time, up to an upper bound),
         wake up, and go to step 2.
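A quick way to see how contended these latches actually are is to look at their miss, spin and sleep counters. A minimal sketch against v$latch (assumes a user with access to the dynamic performance views):

SELECT name, gets, misses, spin_gets, sleeps
  FROM v$latch
 WHERE name IN ('cache buffers chains', 'cache buffers lru chain');

A high ratio of misses or sleeps to gets for 'cache buffers chains' suggests the symptoms discussed below are worth investigating.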
 
Let's talk about them one by one.
 
 
                                       LATCH : CACHE BUFFER CHAINS
 
The CBC latches are used when searching for, adding, or removing a buffer from the buffer cache.
 
  Contention arises when multiple users look for the same blocks, or for blocks hashing to the same bucket, and hence try to obtain the latch on the same bucket. Such blocks are called hot blocks. The wait event can occur when:

- The latch can't be obtained due to contention.
- The latch was obtained by one session and held for a long time while walking the chain because the chain was long; hence, others trying to obtain the latch have to wait.
 
Some of the activities that can cause this wait event are:
 
1. Multiple users trying to read codes/descriptions etc. from lookup tables.
 
Soln:
- Identify the hot block
  . Segments by logical reads in AWR report
  . v$segment_statistics (object_name, statistic_name, value)
 
    This query shows the top 10 statistic values, by object and statistic name.
 
    SELECT *
      FROM (SELECT owner, object_name, object_type, statistic_name, SUM(value)
              FROM v$segment_statistics
             GROUP BY owner, object_name, object_type, statistic_name
             ORDER BY SUM(value) DESC)
     WHERE ROWNUM <= 10;
 
- Find out the file# (P1) and block# (P2) from v$session_wait
 
  SQL> select sid, event, p1 file#, p2 block#
         from v$session_wait
        where event like '%buffer busy%';
 
- Find out the segment whose data is on this file/block
 
SQL> select owner, segment_name
       from dba_extents
      where file_id = &file#
        and &block# between block_id and block_id + blocks - 1;
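Alternatively, you can start from the most contended cache buffers chains child latch and map the buffers hanging off it back to their objects. A sketch, assuming it is run as SYS (x$bh is not exposed through a public view); the touch count column (tch) points to the hottest buffers on that chain:

SELECT o.owner, o.object_name, bh.dbablk, bh.tch
  FROM x$bh bh,
       (SELECT addr
          FROM (SELECT addr
                  FROM v$latch_children
                 WHERE name = 'cache buffers chains'
                 ORDER BY sleeps DESC)
         WHERE ROWNUM = 1) hot_latch,
       dba_objects o
 WHERE bh.hladdr = hot_latch.addr
   AND o.data_object_id = bh.obj
 ORDER BY bh.tch DESC;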
 
- Modify the application to use PL/SQL to read the lookup table once and store the codes/descriptions in local variables which can be accessed later, rather than reading from the table repeatedly.
 
2. Simultaneous update/select operations on the same block:
 
 CBC latch contention can become worse if a session is modifying a data block that other users are reading, because readers will clone the block with uncommitted changes and roll back those changes in the cloned copy. Since all these clone copies go into the same hash bucket and are protected by the same latch, the CBC chain gets longer, it takes more time to walk the chain, the latch is held longer, and hence users trying to obtain the latch have to wait longer.
 
Soln:
 
- Modify the application to commit frequently so that CR clones need not be created.
 
3. Multiple users running nested loop joins on a table and accessing the driven (inner) table via an index. Since an NL join is basically:

  For all rows in i
     look up a value in j  where j.field1 = i.val
  end loop

table j's index on field1 will get hit for every row returned from i. If the lookup on i returns a lot of rows and multiple users are running this same query, the root block of the index j(field1) is going to get hammered.
 
e.g.
 
  select t1.val, t2.val
    from t1, t2
   where t1.c1 = {value}
     and t2.id = t1.id;
 
Here, if there are a large no. of rows in table t1 where c1 = value, Oracle will repeatedly hit the index on table t2 (id).
 
In this case SQL statements with high BUFFER_GETS (logical reads) per EXECUTION are the main culprits. Multiple concurrent sessions are executing the same inefficient SQL that is going after the same data set.
 
SOLUTION:
———-
  Reducing contention for the cache buffers chains latch will usually require reducing logical I/O rates by tuning and minimizing the I/O requirements of the SQL involved. High I/O rates could be a sign of a hot block (meaning a block that is highly accessed).

-  Find the SQL (why is the application hitting the block so hard?) – SQL ordered by Buffer Gets in the AWR report
- Replace nested loop joins with hash joins (see the sketch below)
- Hash partition the index with the hot block
- Create a reverse key index
- Use hash clusters
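As a sketch of the first two index-related options, reusing the hypothetical t1/t2 example above (the hint placement and the partition count of 8 are illustrative assumptions, not recommendations):

-- Force a hash join instead of the nested loop that hammers the index root block
SELECT /*+ LEADING(t1) USE_HASH(t2) */ t1.val, t2.val
  FROM t1, t2
 WHERE t1.c1 = :value
   AND t2.id = t1.id;

-- Spread the hot index blocks by recreating the index as a global hash-partitioned index
CREATE INDEX t2_id_hash ON t2 (id)
  GLOBAL PARTITION BY HASH (id) PARTITIONS 8;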
 
4. This can also result from the use of sequences if the CACHE option is not used.
 
Soln:
- Create sequences with the CACHE option (for example, as shown below).
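For example (the sequence name and cache size below are hypothetical):

CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1 CACHE 1000;

-- or, for an existing NOCACHE sequence
ALTER SEQUENCE order_seq CACHE 1000;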
 
5. Inefficient SQL
 
   Multiple concurrent sessions are executing the same inefficient SQL that is going after the same data set.
 
Soln:
 Reducing contention for the cache buffer chains latch will usually require reducing logical I/O rates by tuning and minimizing the I/O requirements of  the SQL involved. High I/O rates could be a sign of a hot block (meaning a block highly accessed).
 
- Spread data out to reduce contention
- Export the table, increase PCTFREE significantly, and import the data. This minimizes the number of rows per block, spreading them over many blocks. Of course, this comes at the expense of storage, and full table scan operations will be slower (see the rebuild sketch below for an alternative that avoids the export/import).
- For indexes, you can rebuild them with higher PCTFREE values, bearing in mind that this may increase the height of the index.
- Consider reducing the block size: if the current block size is 16K, you may move the table or recreate the index in a tablespace with an 8K block size. This too will negatively impact full table scan operations. Also, using multiple block sizes increases management complexity.
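A minimal sketch of the PCTFREE approach without an export/import, using the hypothetical names hot_tab and hot_idx (a new PCTFREE applies only to blocks formatted after the change, hence the MOVE, and a MOVE leaves dependent indexes UNUSABLE):

ALTER TABLE hot_tab PCTFREE 50 PCTUSED 20;   -- applies to newly formatted blocks only
ALTER TABLE hot_tab MOVE;                    -- reorganize existing rows under the new PCTFREE
ALTER INDEX hot_idx REBUILD PCTFREE 50;      -- rebuild indexes invalidated by the MOVE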
 
6. Index leaf block chasing by very many processes scanning the same unselective index with very similar predicates.
 
Soln:
 
- Modify the application to use more selective predicate/index.
 
7. A very large buffer cache, which increases the length of the hash chains; walking a longer chain takes more time, so the latch is held longer, thereby increasing latch contention.
 
Soln:
- Increase the parameter _db_block_hash_buckets so that the number of buckets increases and hence the length of each hash chain decreases (this is a hidden parameter; change it only under the guidance of Oracle Support).
 
                      BUFFER BUSY WAITS
                      —————————
 
  This is the second wait event encountered in the case of logical I/O. It is a wait for a buffer that is being used in an unshareable way or is being read into the buffer cache. Buffer busy waits occur when the latch on the hash bucket has been obtained but the buffer itself is "busy" because:
- Another session is reading the block into the buffer
- Another session holds the buffer in an incompatible mode to our request
 
These waits indicate read/read, read/write, or write/write contention. Buffer busy waits are common in an I/O-bound Oracle system. This wait can be intensified by a large block size, as more rows can be contained within a block.
 
  These waits can be on :
 
- Data Blocks        : Multiple sessions trying to insert /update / delete rows in same block
- Segment Header : Multiple sessions trying to insert records in same segment
- Undo header      : A lot of undo is being generated and hence contention on undo segment header.
- Undo block        : Contention on the same undo block as multiple users querying the records  in same undo block
 
Diagnosis :
- Top 5 events in AWR report
- Buffer wait statistics section of the AWR report – shows the type of block the wait is on, i.e. data block, segment header or undo block (v$waitstat)
- V$session_event
- V$system_event
- v$waitstat
 
Solution :
 
- Wait on Data Blocks (v$waitstat) : If the wait is on the data block, we can move “hot” data to another block to avoid this hot block or use smaller blocks (to reduce the No. of rows per block, making it less “hot”).
    . Identify segments with buffer busy waits
         v$segment_statistics (object_name, statistic_name, value)
    . Eliminate HOT blocks from the application.
    . Check for repeatedly scanned / unselective indexes.
    . Reduce the number of rows per block : Try rebuilding the object with a higher PCTFREE and lower PCTUSED
    . Check for 'right-hand indexes' (indexes that get inserted into at the same point by many processes) – rebuild them as reverse key indexes
    . Increase INITRANS and MAXTRANS (where users are simultaneously accessing the same block).
      When a DML occurs on a block, the lock byte is set in the block and any user accessing the record(s) being changed must check the Interested Transaction List (ITL) for info regarding building the before image of the block. The Oracle database writes info into the ITL about all users who are "interested" in the state of the block. Increasing INITRANS creates space in the block for multiple ITL slots, allowing multiple concurrent user access. A rebuild sketch follows after this list.
       
- Wait on segment header (v$waitstat)
  . If using Manual segment space management,
      – Increase the number of FREELISTS and FREELIST GROUPS
      – Increase the PCTUSED-to-PCTFREE gap
      – Use ASSM
 
- Wait on undo header
  . Use Automatic undo management.
  . If using manual undo management, increase the number of rollback segments.
 
- Wait on undo block
   . Try to commit more often
   . Use larger rollback segments or larger undo tablespace.
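As a sketch of the data-block remedies above (object names are hypothetical; the INITRANS of already formatted blocks is not changed retroactively, hence the MOVE/REBUILD):

ALTER TABLE busy_tab MOVE INITRANS 10;    -- recreate the table blocks with more ITL slots
ALTER INDEX busy_pk REBUILD INITRANS 10;  -- indexes are left UNUSABLE by the MOVE and must be rebuilt
ALTER INDEX busy_seq_idx REBUILD REVERSE; -- right-hand (sequence-fed) index rebuilt as a reverse key index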
 
To practically simulate buffer busy wait and identify the hot block, visit this link.
 
                              LATCH: CACHE BUFFER LRU CHAIN
 
  This is the first wait which occurs in the case of physical I/O, i.e. the server process is unable to find the desired block in the buffer cache and hence needs to read it from disk into the cache. Before reading the block, it searches for a free buffer by scanning the LRU list, and to scan the LRU list it must obtain the latch on the LRU chain. The cache buffers lru chain latch is also acquired when writing a buffer back to disk, specifically when scanning the LRU (least recently used) chain for dirty blocks so that they can be delinked from the LRU list and linked to the write list (WRITE LIST – MAIN).
 
Diagnosis :
- Top 5 timed events in AWR report
- v$session_event
- v$system_event
 
   This wait is encountered when
 
1. Multiple users issuing statements that repeatedly scan large unselective indexes or perform full table scans, and hence scan the LRU chain to look for free buffers.
 
Soln:
- Identify SQL causing large amount of logical I/O –  SQL by buffer gets in AWR report
- Tune SQL to reduce no. of buffers required
 
2. An insufficiently sized buffer cache, which leads to frequent aging out of cached blocks; these blocks then need to be reread from disk, requiring a search for free buffers and hence the LRU latch.
 
Soln:
- Use AMM/ASMM
- Increase the DB buffer cache size as per
  . ADDM recommendations
  . v$db_cache_advice (a query sketch follows below)

- Increase the parameter _db_block_lru_latches to increase the number of latches protecting the LRU chains, so that contention on them is reduced (a hidden parameter; change it only under the guidance of Oracle Support).
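A sketch of reading the advisor data for the default buffer pool (requires the parameter db_cache_advice to be ON, which is the default when statistics_level is TYPICAL or ALL):

SELECT size_for_estimate, size_factor,
       estd_physical_read_factor, estd_physical_reads
  FROM v$db_cache_advice
 WHERE name = 'DEFAULT'
   AND advice_status = 'ON'
 ORDER BY size_for_estimate;

Look for the smallest size_for_estimate at which estd_physical_read_factor stops dropping significantly.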
 
                                                 FREE BUFFER WAITS
 
  After the latch on the LRU list has been obtained and the server process is scanning the LRU list for a free buffer, we wait on this event when no free buffer is available because there are too many dirty buffers in the cache.
 
Causes :
 
1. Buffer cache is too small
 
Soln:
- Use AMM/ASMM
- Increase DB buffer cache size as per
  . ADDM recommendation
  . v$db_cache_advice
 
2. DBWR is slow in writing modified buffers to disk and is unable to keep up with the write requests.
 
Soln:
- Increase I/O bandwidth by striping the datafiles.
- If asynchronous I/O is supported:
     . Enable asynchronous I/O (DISK_ASYNCH_IO = TRUE)
     . If multiple CPUs are available, increase the number of database writers (DB_WRITER_PROCESSES);
       otherwise, configure I/O slaves (set DBWR_IO_SLAVES to a non-zero value)
- If asynchronous I/O is not supported, configure I/O slaves (set DBWR_IO_SLAVES to a non-zero value).

A sketch of the corresponding parameter changes follows below.
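These are all static parameters, so the instance must be restarted for the changes to take effect; DB_WRITER_PROCESSES and DBWR_IO_SLAVES should not normally be set together. A minimal sketch:

ALTER SYSTEM SET disk_asynch_io      = TRUE SCOPE = SPFILE;
ALTER SYSTEM SET db_writer_processes = 4    SCOPE = SPFILE;  -- multi-CPU host with async I/O
-- ALTER SYSTEM SET dbwr_io_slaves   = 4    SCOPE = SPFILE;  -- only when async I/O is not available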
       
 
3. Large sorts, full table scans and/or scans of unselective indexes are filling the cache with modified blocks faster than DBWR is able to write them to disk. When the data is not pre-sorted, the rows for a key value are scattered across many blocks, and hence more blocks have to be read to get the records for a key.
 
Soln:
- Pre-sorting or reorganizing data can help
 
 
                        READ WAITS
 
  These waits are encountered when data blocks are being read physically from disk. They are always present, even in well tuned databases. The following strategy should be employed to decide if tuning is required:
 
- Compare total read time with the baseline value.
- If the total wait time is an excessive fraction of total DB time, there are two cases:
 
  1. The average time for each read is normal, but the number of disk reads is high

     Soln:
     – Try to decrease the number of reads: find the SQL statements issuing the most logical/physical reads (AWR report) and tune them.
     – A large number of physical reads may be due to a small DB_CACHE_SIZE – use the buffer cache advisor to increase the cache size.
 
  2. The average time for each read is high (> 15 ms)
 
     Soln:
     – Avoid reading from disk 
           . Cache tables/indexes
           . Use KEEP / RECYCLE pool
     – Stripe the datafiles for more bandwidth.
    – Reduce checkpointing. Whenever there is a checkpoint, data is written to disk. Since disk is busy servicing writes, reads get delayed.
 
 Various read waits can be separately tuned as follows:
 
                  DB FILE SCATTERED READ
 
  Waits for multiple blocks (up to DB_FILE_MULTIBLOCK_READ_COUNT) to be read from disk while performing full table scans / index fast full scans (no ORDER BY). This wait is encountered because:

- As blocks read by full table scans are pulled into memory, they are scattered throughout the buffer cache, since it is highly unlikely that they fall into contiguous buffers.

- Since a large number of blocks has to be read into the buffer cache, the server process has to search for a large number of free/usable buffers, which leads to waits.
 
Solutions :
- Try to cache frequently used small tables to avoid reading them into memory over and over again.
- Optimize multi-block I/O by setting the parameter DB_FILE_MULTIBLOCK_READ_COUNT.
- Partition pruning: partition tables/indexes so that only one partition is scanned.
- Consider the usage of multiple buffer pools.
- Optimize the SQL statements that initiated most of the waits (a query sketch for finding them follows below). The goal is to minimize the number of physical and logical reads.
     . Should the statement access the data by a full table scan or index FFS?
     . Would an index range or unique scan be more efficient?
     . Does the query use the right driving table?
     . Are the SQL predicates appropriate for a hash or merge join?
     . If full scans are appropriate, can parallel query improve the response time?
- Make sure all statistics are representative of the actual data. Check the LAST_ANALYZED date.
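A sketch for finding the statements responsible for most physical reads outside of an AWR report (v$sql shows only cursors still cached in the shared pool):

SELECT *
  FROM (SELECT sql_id, executions, disk_reads, buffer_gets,
               SUBSTR(sql_text, 1, 60) sql_text
          FROM v$sql
         ORDER BY disk_reads DESC)
 WHERE ROWNUM <= 10;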
 
The objective is to reduce the demands for both the logical and physical I/Os, and this is best achieved through SQL and application tuning.
 
                  DB FILE SEQUENTIAL READ
 
  These waits generally indicate single block reads (e.g. an index full scan: an index read driven by an ORDER BY clause).
 
Causes and solutions :
 
1. Use of an unselective index
 
Soln:
- Check indexes on the table to ensure that the right index is being used
- Check the column order of the index with the WHERE clause of the Top SQL statements
 
2. Fragmented Indexes : If the DBA_INDEXES.CLUSTERING_FACTOR of the index approaches the number of blocks in the table, then most of the rows in the table are ordered. This is desirable.
 
 However, if the clustering factor approaches the number of rows in the table, it means the rows in the table are randomly ordered and thus it requires more I/Os to complete the operation.
 
Soln:
You can improve the index's clustering factor by rebuilding the table so that its rows are physically ordered by the index key, and rebuilding the index thereafter (a sketch follows below).
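A sketch of checking and improving the clustering factor, assuming a hypothetical table T owned by HR with an index on col1:

-- How well does the table order match the index order?
SELECT index_name, clustering_factor, num_rows, leaf_blocks
  FROM dba_indexes
 WHERE owner = 'HR'
   AND table_name = 'T';

-- Rebuild the table ordered by the index key, then swap it in and recreate the index
CREATE TABLE t_sorted AS SELECT * FROM t ORDER BY col1;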
 
3. High I/O on a particular disk or mount point
 
Soln:
- Use partitioning to reduce the number of blocks being visited
- Make sure optimizer statistics are up to date
- Relocate 'hot' datafiles: place the tables used in the SQL statement on a faster part of the disk
- Consider the usage of multiple buffer pools and cache frequently used indexes/tables in the KEEP pool
- Inspect the execution plans of the SQL statements that access data through indexes
- Tune the I/O subsystem to return data faster


4. Bad application design

Soln:
- Optimize the SQL statements that initiated most of the waits. The goal is to minimize the number of physical and logical reads. Inspect the execution plans of the statements that access data through indexes, and examine each statement to see if:
     . it is doing a full-table scan when it should be using an index,
     . it is using the wrong index,
     . it can be rewritten to reduce the amount of data it retrieves,
     . it is appropriate for the statement to access data through index lookups, or whether a full table scan would be more efficient,
     . the statement uses the right driving table, i.e. whether the join order is proper.

5. Range scans on data spread across many different blocks

Soln:
- Check that range scans are not using reverse key indexes.
- Load the data in sorted order on the columns on which range scans will be performed.

6. Consider increasing the buffer cache to see if the expanded size will accommodate the additional blocks, thereby reducing the I/O and the wait.

                  DB FILE PARALLEL READ

  Waits for synchronous multiblock reads. The process has issued multiple I/O requests in parallel to read blocks from datafiles into memory (not during parallel query/DML), i.e. the process batches many single block I/O requests together and issues them in parallel.
 
Some of the most common wait problems and potential solutions are outlined here:
 
Sequential Read : indicates many index reads – tune the code (especially joins)
Scattered Read  : indicates many full table scans – tune the code; cache small tables
Free Buffer     : increase the DB_CACHE_SIZE; shorten the checkpoint; tune the code
Buffer Busy (segment header) : add freelists or freelist groups
Buffer Busy (data block)     : separate "hot" data; use reverse key indexes; use smaller blocks; increase INITRANS (debatable); reduce block popularity; make I/O faster
Buffer Busy (undo header)    : add rollback segments or areas
Buffer Busy (undo block)     : commit more often; use larger rollback segments or areas
————————————————————————————————————


SQL PROFILE DEMYSTIFIED : PART – II

In my earlier post SQL PROFILE DEMYSTIFIED : PART – I, I explained the purpose and working of SQL profiles. In this post, I will demonstrate the creation and use of a SQL profile for a poorly tuned SQL statement, and also explain the hints that are stored as part of the profile.
Let’s start …
- SETUP
-- create a table
SQL>conn hr/hr


    drop table t purge;
    create table t (id constraint id_pk primary key, pad) as
    select rownum, lpad('*', 4000, '*')
    from all_objects
    where rownum <= 10000;
-- check that  there is 1 row with id = 84 (126 – 42 )
SQL>select count(*) from hr.t where id+42=126;


COUNT(*)
———-
         1
ISSUE FOLLOWING STATEMENT AND SEE ITS EXECUTION PLAN –
– NOTE THAT ESTIMATED NO. OF ROWS = 108 WHICH NEED 1404 BYTES TO BE READ
SQL>CONN / AS SYSDBA


         SET AUTOTRACE TRACEONLY
         select count(*) from hr.t where id+42=126;
         SET AUTOTRACE OFF
Execution Plan
——————————————————————————-
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
——————————————————————————-
|   0 | SELECT STATEMENT      |       |     1 |    13 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    13 |            |          |
|*  2 |   INDEX FAST FULL SCAN| ID_PK |   108 |  1404 |     9   (0)| 00:00:01 |
——————————————————————————-
– Create SQL tuning task on above stmt
SQL>variable sqltext varchar2(4000);


BEGIN
 :sqltext := q'#select count(*) from hr.t where id+42=126#';
END;
/


variable spmtune   varchar2(30);


exec :spmtune := dbms_sqltune.create_tuning_task(sql_text => :sqltext);


PRINT SPMTUNE
SPMTUNE
——————————–
TASK_412
– Execute the tuning task
SQL>exec dbms_sqltune.execute_tuning_task(:spmtune);
- RUN REPORT OF SQL TUNING TASK TO VIEW RECOMMENDATIONS –
SQL>set long 10000
         set line 10000


         select dbms_sqltune.report_tuning_task(:spmtune, 'TEXT')
          from dual;
– I have trimmed off the output and am displaying only the recommendation
   of SQL profile
2- SQL Profile Finding (see explain plans section below)
——————————————————–
  A potentially better execution plan was found for this statement.
  Recommendation (estimated benefit: 18.05%)
  ——————————————
  – Consider accepting the recommended SQL profile.
    execute dbms_sqltune.accept_sql_profile(task_name => 'TASK_412',-
            task_owner => 'SYS', replace => TRUE);
  Validation results
  ——————
  The SQL profile was tested by executing both its plan and the original plan
  and measuring their respective execution statistics. A plan may have been
  only partially executed if the other could be run to completion in less time.
                           Original Plan  With SQL Profile  % Improved
                           ————-  —————-  ———-
  Completion Status:            COMPLETE          COMPLETE
DBMS_SQLTUNE.REPORT_TUNING_TASK(:SPMTUNE,'TEXT')
——————————————————————————–
  Elapsed Time(us):                1122              1270     -13.19 %
  CPU Time(us):                    1562              1562          0 %
  User I/O Time(us):                  0                 0
  Buffer Gets:                       22                18      18.18 %
  Physical Read Requests:             0                 0
  Physical Write Requests:            0                 0
  Physical Read Bytes:                0                 0
  Physical Write Bytes:               0                 0
  Rows Processed:                     1                 1
  Fetches:                            1                 1
  Executions:                         1                 1
– Accept the profile
SQL>execute dbms_sqltune.accept_sql_profile(task_name => 'TASK_412',-
            task_owner => 'SYS', replace => TRUE);
-- check that profile has been saved in data dictionary
SQL> SELECT NAME, SQL_TEXT
       FROM DBA_SQL_PROFILES
      WHERE SQL_TEXT LIKE 'select count(*) from hr.t where id+42=126%';
NAME                           SQL_TEXT
—————————— ————————————————
SYS_SQLPROF_013b9d57d27f0001   select count(*) from hr.t where id+42=126
--Let’s execute the statement again and check the execution plan now
set autotrace traceonly
select count(*) from hr.t where id+42=126;
set autotrace off
———————————————————-
| Id  | Operation        | Name  | Rows  | Bytes | Cost  |
———————————————————-
|   0 | SELECT STATEMENT |       |     1 |    13 |    21 |
|   1 |  SORT AGGREGATE  |       |     1 |    13 |       |
|*  2 |   INDEX FULL SCAN| ID_PK |     2 |    26 |    21 |
———————————————————-
Predicate Information (identified by operation id):
—————————————————
   2 - filter("ID"+42=126)
Note
—–
   – cpu costing is off (consider enabling it)
   – automatic DOP: Computed Degree of Parallelism is 1 because of parallel threshold
   - SQL profile "SYS_SQLPROF_013b9d57d27f0001" used for this statement
– Earlier plan was :
——————————————————————————-
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
——————————————————————————-
|   0 | SELECT STATEMENT      |       |     1 |    13 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    13 |            |          |
|*  2 |   INDEX FAST FULL SCAN| ID_PK |   108 |  1404 |     9   (0)| 00:00:01 |
——————————————————————————-
Note that
   - the estimated no. of rows is now 2, which is much closer to the actual value of 1
   - the estimated no. of bytes has dropped from 1404 to 26, i.e. by 1378
-- CHECK THE HINTS SAVED AS PART OF THE PROFILE WHICH WILL BE USED FOR FURTHER EXECUTIONS OF THE STATEMENT
– Please note that the name of the table containing the hints in Oracle 10g is SQLPROF$ATTR
SQL> SELECT *
       FROM DBMSHSXP_SQL_PROFILE_ATTR
      WHERE PROFILE_NAME = 'SYS_SQLPROF_013b9d57d27f0001';
PROFILE_NAME                   COMP_DATA
—————————— ——————————————————————————–
SYS_SQLPROF_013b9d57d27f0001   <outline_data><hint><![CDATA[OPT_ESTIMATE(@"SEL$1", TABLE, "T"@"SEL$1", SCALE_ROWS=0.0160247093)]]></hint><hint><![CDATA[COLUMN_STATS("HR"."T", "ID", scale, length=3)]]></hint><hint><![CDATA[TABLE_STATS("HR"."T", scale, blocks=10143 rows=9889.425)]]></hint><hint><![CDATA[IGNORE_OPTIM_EMBEDDED_HINTS]]></hint><hint><![CDATA[OPTIMIZER_FEATURES_ENABLE('8.0.0')]]></hint></outline_data>
- LET'S ANALYZE THE HINTS
HINT – 1
OPT_ESTIMATE(@"SEL$1", TABLE, "T"@"SEL$1", SCALE_ROWS=0.0160247093)
This hint tells the optimizer that the estimated no. of rows needs to be multiplied by 0.0160247093 to get
the actual no. of rows returned by the query.
Let’s check
– Earlier plan was :
——————————————————————————-
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
——————————————————————————-
|   0 | SELECT STATEMENT      |       |     1 |    13 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    13 |            |          |
|*  2 |   INDEX FAST FULL SCAN| ID_PK |   108 |  1404 |     9   (0)| 00:00:01 |
——————————————————————————-
As per earlier execution plan estimated no. of rows = 108
Let’s multiply it by the scaling factor of 0.0160247093 .
This gives us 108 * 0.0160247093 = 1.73 (= 2 approx as seen in the execution plan with SQL PROFILE)
which is much closer to 1 (the actual no. of rows returned).
HINT – 2
COLUMN_STATS("HR"."T", "ID", scale, length=3)
This hint tells the optimizer that the (average) length of the id column is 3.
Let’s check the average length of column id :
SQL> select sum(length(id))/count(*) from hr.t;
SUM(LENGTH(ID))/COUNT(*)
————————
                  3.8894
We see that the average length is 3.9, which is quite close to the value stored in the hint.
HINT – 3
TABLE_STATS("HR"."T", scale, blocks=10143 rows=9889.425)
This hint tells the optimizer that the no. of rows in the table = 9889.425 and blocks = 10143.
Let’s compare with actual values:
SQL> select count(*) from hr.t;
  COUNT(*)
———-
     10000
SQL> select owner, table_name, blocks from dba_tables where table_name='T';
OWNER                          TABLE_NAME                         BLOCKS
—————————— —————————— ———-
HR                             T                                   10143
Note that
  – the actual no. of rows (10000) is quite close to the value stored in the hint (9889.425)
  – the actual no. of blocks equals the stored value (10143)
HINT – 4
IGNORE_OPTIM_EMBEDDED_HINTS
This hint tells the optimizer to ignore any hints provided in the statement.
HINT – 5
OPTIMIZER_FEATURES_ENABLE('8.0.0')
This hint tells the optimizer to execute the statement with features of optimizer version 8.0.0
I hope this post was useful.
Your comments and suggestions are always welcome.
References:

http://kerryosborne.oracle-guy.com/2009/10/how-to-attach-a-sql-profile-to-a-different-statement-take-2/
Oracle’s OPT_ESTIMATE hint: Usage Guide

——————————————————————————————————-

 

SQL PROFILE DEMYSTIFIED : PART – I


In this post and the next one, I will discuss SQL Profiles – what their purpose is, how to create and use them, and how they are stored.

PURPOSE OF SQL PROFILE :

It is the DBA's responsibility to maintain SQL performance despite various changes which affect the optimizer. The changes could be a change in optimizer statistics, a schema change (e.g. creation of an index), a change in the size of a table, etc. A DBA should be able to influence the optimizer to use the most efficient plan. To achieve this, Oracle 9i introduced outlines, but outlines are about fixing the plan the optimizer will use for a query. So if there is a change in the environment, the plan would still be the same. For example, say I created an outline when the table was small and a full table scan was the most efficient access path. Later, as the table grew, if I still used the older outline, my execution plan would surely no longer be efficient. This is just one example; things can change in your environment in many ways, so an outline is not a very good approach for such situations.

HOW DO SQL PROFILES WORK?

To resolve this issue, Oracle 10g introduced SQL Profiles. A profile stores statement statistics in the data dictionary in the form of hints, which serve as key inputs to the optimizer. The statistics can be used to:

adjust the No. of rows returned by the query

e.g. 10 times as many as expected are returned from table test

OPT_ESTIMATE(@SEL$1, TABLE, TEST@SEL$1, SCALE_ROWS=10)

adjust the No. of rows returned through an index scan

e.g. 10 times fewer rows than expected are returned from table test through index test_idx

OPT_ESTIMATE(@SEL$1, INDEX_SCAN, TEST@SEL$1, TEST_IDX, SCALE_ROWS=.1)

adjust the No. of rows returned from a join

e.g. 4.2 times as many rows as expected are returned when tables test1 and test2 are joined

OPT_ESTIMATE(@SEL$1, JOIN, (TEST1@SEL$1,TEST2@SEL$1),SCALE_ROWS=4.2)

provide missing/stale statistics for a

. Table

TABLE_STATS(“HR”, “EMPLOYEES”, scale, blocks=10, rows=107)

. Column

COLUMN_STATS(“HR”, “EMPLOYEES”, “EMPLOYEE_ID”, scale,length=3 DISTINCT=107 nulls=0 min=100 max=207)

. Index

INDEX_STATS(“HR”, “EMPLOYEES”, “EMP_IDX”, scale, blocks=5, rows=107)

to set optimizer mode

e.g. ALL_ROWS

disable hints present in SQL statement

e.g. IGNORE_OPTIM_EMBEDDED_HINTS

to set OPTIMIZER_FEATURES_ENABLE

e.g. OPTIMIZER_FEATURES_ENABLE(DEFAULT)
Whenever the statement is executed, the hints are automatically applied to it. Thus the execution plan of the statement is not fixed; a profile is more of an advisor. We can say that a SQL profile serves the same purpose for a SQL statement as optimizer statistics serve for a table or index. With a change in the environment, the plan can adapt to the change if the profile is used.
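Once a profile has been accepted, you can check which cached cursors are actually picking it up; the SQL_PROFILE column of V$SQL holds the profile name (a minimal sketch):

SELECT sql_id, child_number, plan_hash_value, sql_profile
  FROM v$sql
 WHERE sql_profile IS NOT NULL;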

HOW ARE SQL PROFILES CREATED?

Profiles can be created on two occasions :

– When SQL Tuning Advisor is run manually on high resource SQL statements  in comprehensive mode

– When SQL Tuning Advisor runs automatically in default maintenance window.

In both cases, the query optimizer runs in tuning mode and performs additional analysis to check whether the execution plan it produces can be further improved. When a statement is executed, the optimizer makes estimates regarding its execution, e.g. the no. of rows returned by the query, the no. of buffer gets, the cost of the query, etc. In tuning mode, it additionally verifies whether its estimates are correct by employing various techniques:

– Sampling: a sample of data is taken and the predicates are applied to it. The new estimate is checked against the earlier estimate; if the difference is substantial, a correction factor is stored as part of the hint.

– Partial execution: a fragment of the SQL statement is executed. This method is preferred when the respective predicates provide efficient access paths.

– Past execution history: the past execution history of the statement is used to determine the correct settings. If the execution history indicates that a SQL statement is usually executed only partially, a hint for OPTIMIZER_MODE = FIRST_ROWS is set instead of ALL_ROWS.

Automatic Tuning optimizer chooses the appropriate estimate validation method.

After the estimates have been verified, the optimizer generates auxiliary information and a recommendation to the user to accept the profile. Once the user accepts the profile, the auxiliary information, in the form of hints, is stored persistently in the data dictionary. Whenever the statement is executed, that information is used to generate the most efficient execution plan.
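If a profile later stops being helpful, it can be disabled or dropped with DBMS_SQLTUNE (the profile name below is hypothetical):

BEGIN
  DBMS_SQLTUNE.ALTER_SQL_PROFILE('SYS_SQLPROF_0123456789abcdef', 'STATUS', 'DISABLED');
  -- DBMS_SQLTUNE.DROP_SQL_PROFILE(name => 'SYS_SQLPROF_0123456789abcdef');
END;
/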

CONCLUSION:

Thus, SQL Profiles, without any modification to the application code, the referenced objects or their statistics, provide the cost based optimizer with adjustments for

– Initialization parameters

– Estimated cardinalities

– Object statistics

SQL profiles provide a solution at the statement level.

In my next post, SQL PROFILE DEMYSTIFIED : PART – II, I will demonstrate the creation and use of a SQL profile for a poorly tuned SQL statement and also explain the hints that are stored as part of the profile.

References:
http://antognini.ch/papers/SQLProfiles_20060622.pdf
http://askdba.org/weblog/2011/12/plan-stability-using-sql-profiles-and-sql-plan-management/
http://kerryosborne.oracle-guy.com/2009/07/how-to-attach-a-sql-profile-to-a-different-statement/
http://kerryosborne.oracle-guy.com/2009/10/how-to-attach-a-sql-profile-to-a-different-statement-take-2/
SQL Profiles: Check what they do before accepting
Oracle SQL Profile: Why Multiple OPT_ESTIMATE
Oracle’s OPT_ESTIMATE hint: Usage Guide

 

——————————————————————————————————-

        

                                                              —————– 

 

11g R2 RAC : VOTING DISK DEMYSTIFIED

                                      Voting disk in 11g

In this post, I will write about voting disk – what does it contain, who updates it, how is it used, where is it stored and so on…
Voting disk is a key component of the clusterware, and its failure can lead to the inoperability of the cluster.
In RAC, at any point in time the clusterware must know which nodes are members of the cluster so that
- it can perform load balancing
- In case a node fails, it can perform failover of resources as defined in the resource profiles
- If a node joins, it can start resources on it as defined in OCR/OLR
- If a node joins, it can assign VIP to it in case GNS is in use
- If a node fails, it can execute callouts if defined
and so on
Hence, there must be a way by which clusterware can find out about the node membership.
  That is where the voting disk comes into the picture. It is the place where nodes mark their attendance. Consider an analogy where a manager wants to find out which of his subordinates are present: he can just check the attendance register and assign them their tasks accordingly. Similarly, the CSSD process on every node makes entries in the voting disk to ascertain the membership of that node. The voting disk records node membership information. If it ever fails, the entire clustered environment for Oracle 11g RAC will be adversely affected and a possible outage may result if the voting disk(s) are lost.
Also, in a cluster, communication between the various nodes is of paramount importance. Nodes which can't communicate with other nodes should be evicted from the cluster. While marking their own presence, all the nodes also register information in the voting disk about their ability to communicate with the other nodes; this is the network heartbeat. The CSSD process on each RAC node maintains its heartbeat in a block of size 1 OS block, at a node-specific offset in the voting disk. The written block has a header area with the node name, and the heartbeat counter increments on every write call, once per second. Thus the heartbeats of the various nodes are recorded at different offsets in the voting disk. In addition to maintaining its own disk block, the CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. Healthy nodes will have continuous network and disk heartbeats exchanged between them; a break in the heartbeat indicates a possible error scenario. If a node's disk block is not updated within a short timeout period, that node is considered unhealthy and may be rebooted to protect the database information. In this case, a message to that effect is written in the kill block of the node. Each node reads its kill block once per second; if the kill block has been overwritten, the node commits suicide (reboots itself).
During reconfiguration (a node joining or leaving), CSSD monitors all nodes and determines whether each node has a disk heartbeat, including those with no network heartbeat. If no disk heartbeat is detected, the node is declared dead.
 What is stored in voting disk?
——————————
Voting disks contain static and dynamic data.
Static data : Info about nodes in the cluster
Dynamic data : Disk heartbeat logging
It maintains and records important details about cluster node membership, such as
- which node is part of the cluster,
- who (node) is joining the cluster, and
- who (node) is leaving the cluster.
Why is voting disk needed ?
—————————
The voting disk files are used by Oracle Clusterware by way of a health check:
- by CSS to determine which nodes are currently members of the cluster
- in concert with other Cluster components such as CRS to shut down, fence, or reboot either single or multiple nodes whenever network communication is lost between any node within the cluster, in order to prevent the dreaded split-brain condition in which two or more instances attempt to control the RAC database. It  thus protects the database information.
- It will be used by the CSS daemon to arbitrate with peers that it cannot see over the private interconnect in the event of an outage, allowing it to salvage the largest fully connected subcluster for further operation.  It checks the voting disk to determine if there is a failure on any other nodes in the cluster. During this operation, NM will make an entry in the voting disk to inform its vote on availability. Similar operations are performed by other instances in the cluster. The three voting disks configured also provide a method to determine who in the cluster should survive. For example, if eviction of one of the nodes is necessitated by an unresponsive action, then the node that has two voting disks will start evicting the other node. NM alternates its action between the heartbeat and the voting disk to determine the availability of other nodes in the cluster.
The Voting disk is the key communication mechanism within the Oracle Clusterware where all nodes in the cluster read and write heartbeat information. CSSD processes (Cluster Services Synchronization Daemon) monitor the health of  RAC nodes employing two distinct heart beats: Network heart beat and Disk heart beat. Healthy nodes will have continuous network and disk heartbeats exchanged between the  nodes. Break in heart beat indicates a possible error scenario. There are few different scenarios possible with missing heart beats:
1. Network heart beat is successful, but disk heart beat is missed.
2. Disk heart beat is successful, but network heart beat is missed.
3. Both heart beats failed.
In addition, with numerous nodes, there are other possible scenarios too. Few possible scenarios:
1. Nodes have split in to N sets of nodes, communicating within the set, but not with members in other set.
2. Just one node is unhealthy.
Nodes with quorum will maintain active membership of the cluster and other node(s) will be fenced/rebooted.
A node must be able to access more than half of the voting disks
————————————————————————————–
A node must be able to access more than half of the voting disks at any time. For example, take a two node cluster with an even number of voting disks, say 2. Suppose Node1 can access only voting disk 1 and Node2 can access only voting disk 2. Then there is no common file where the clusterware can check the heartbeat of both nodes. Hence, if we have 2 voting disks, every node in the cluster must be able to access both of them. If we have 3 voting disks and both nodes are able to access more than half, i.e. 2, of the voting disks, there will be at least one disk accessible by both nodes, and the clusterware can use that disk to check the heartbeat of both nodes. Hence, each node should be able to access more than half the number of voting disks. A node that cannot do so will have to be evicted from the cluster to maintain the integrity of the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
 Loss of more than half your voting disks will cause the entire cluster to fail !!
Why should we have an odd number of voting disks?
————————————————————————–
Here is a table which shows, for different total numbers of voting disks, how many must be accessible and how many failures can be tolerated:

 Total no. of voting disks    No. that must be accessible    No. of failures tolerated
            1                              1                              0
            2                              2                              0
            3                              2                              1
            4                              3                              1
            5                              3                              2
            6                              4                              2
It can be seen that the number of voting disks whose failure can be tolerated is the same for 2n-1 as for 2n voting disks, where n can be 1, 2 or 3. Hence, to save a redundant voting disk, 2n-1, i.e. an odd number of voting disks, is desirable.
Where is voting disk stored?
—————————-
 The Voting Disk is a shared disk that will be accessed by all member nodes in the cluster during an operation. Hence, the voting disks must be on shared accessible storage .
- You should plan on allocating 280MB for each voting disk file.
- Prior to 11g R2 RAC, it could be placed on
     . a raw device
   . a clustered filesystem supported by Oracle RAC such as OCFS, Sun Cluster, or Veritas Cluster filesystem
- As of 11g R2 RAC, it can be placed on ASM disks. This simplifies management and improves performance, but it also posed a puzzle: for a node to join the cluster, it must be able to access the voting disk, but the voting disk is in ASM and ASM can't be up until the node is up. To resolve this issue, Oracle ASM reserves several blocks at a fixed location on every Oracle ASM disk used for storing a voting disk. As a result, Oracle Clusterware can access the voting disks in ASM even if the ASM instance is down, and CSS can continue to maintain the cluster even if the ASM instance has failed. The physical location of the voting files on the ASM disks used is fixed, i.e. the cluster stack does not rely on a running ASM instance to access the files. The location of the file is visible in the ASM disk header (dumping the file out of ASM with dd is quite easy):
oracle@rac1:~/ [+ASM1] kfed read /dev/sdf | grep -E 'vfstart|vfend'

kfdhdb.vfstart:                   96 ; 0x0ec: 0x00000060                          <

kfdhdb.vfend:                    128 ; 0x0f0: 0x00000080                         <

 – The voting disk is not striped but put as a whole on ASM Disks
 – In the event that the disk containing the voting disk fails, Oracle ASM will choose another disk on which to store this data.
 – It eliminates the need for using a third-party cluster volume manager.
 – you can reduce the complexity of managing disk partitions for voting disks during Oracle Clusterware installations.
 – The voting disk needs to be mirrored; should it become unavailable, the cluster will come down. Hence, you should maintain multiple copies of the voting disks on separate disks/LUNs so that you eliminate a single point of failure (SPOF) in your Oracle 11g RAC configuration.
– If voting disk is stored on ASM, multiplexing level of voting disk is decided by the redundancy of the diskgroup.
Redundancy of the diskgroup    No. of copies of voting disk    Minimum no. of disks in the diskgroup
External                                    1                                     1
Normal                                      3                                     3
High                                        5                                     5
- If voting disk is on a diskgroup with external redundancy, one copy of voting file will be stored on one disk in the diskgroup
-  If we store the voting disk on a diskgroup with normal redundancy, we should be able to tolerate the loss of one disk, i.e. even if we lose one disk, we should still have enough voting disks for the clusterware to continue. If the diskgroup has 2 disks (the minimum required for normal redundancy), we can store 2 copies of the voting disk on it. If we then lose one disk, only one copy of the voting disk is left and the clusterware won't be able to continue, because to continue it must be able to access more than half the number of voting disks, i.e. more than 2/2 = 1, i.e. at least 2.
Hence, to be able to tolerate the loss of one disk, we should have 3 copies of the voting disk on a diskgroup with normal redundancy . So, a normal redundancy diskgroup having voting disk should have minimum 3 disks in it.
– Similarly, if we store the voting disk on a diskgroup with high redundancy, 5 voting files are placed, each on one ASM disk, i.e. a high redundancy diskgroup should have at least 5 disks so that even if we lose 2 disks, the clusterware can continue.
 – Ensure that all the nodes participating in the cluster have read/write permissions on disks.
 – You can have up to a maximum of 15 voting disks. However, Oracle recommends that you do not go beyond five voting disks.
Backing up voting disk
———————–
In previous versions of Oracle Clusterware you needed to backup the voting disks with the dd command. Starting with Oracle Clusterware 11g Release 2 you no longer need to backup the voting disks. The voting disks are automatically backed up as a part of the OCR. In fact, Oracle explicitly indicates that you should not use a backup tool like dd to backup or restore voting disks. Doing so can lead to the loss of the voting disk.
Although the Voting disk contents are not changed frequently, you will need to back up the Voting disk file every time
- you add or remove a node from the cluster or
- immediately after you configure or upgrade a cluster.
  A node in the cluster must be able to access more than half of the voting disks at any time in order to be able to tolerate a failure of n voting disks. Therefore, it is strongly recommended that you configure an odd number of voting disks such as 3, 5, and so on.
Check the location of voting disk
grid@host01$crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
–  —–    —————–                ——— ———
 1. ONLINE   243ec3b2a3cf4fbbbfed6f20a1ef4319 (ORCL:ASMDISK01) [DATA]
Located 1 voting disk(s).
– We can see that only one copy of the voting disk is there, on the DATA diskgroup, which has external redundancy.
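When the voting files are kept in ASM, you can also confirm from the ASM instance which ASM disks carry a copy; in 11.2, V$ASM_DISK exposes a VOTING_FILE flag (a minimal sketch):

SELECT dg.name diskgroup, d.name disk, d.path, d.voting_file
  FROM v$asm_disk d, v$asm_diskgroup dg
 WHERE d.group_number = dg.group_number
   AND d.voting_file = 'Y';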
As I mentioned earlier, Oracle writes the voting devices to the underlying disks at pre-designated locations so that it can get the contents of these files when the cluster starts up.
Let's see that with an actual example. Let's look at the logs from CSS; they are located at $ORACLE_HOME/log/<hostname>/cssd. Here is an excerpt from one of the logs. The line says that it found a "potential" voting file on one of the disks – 243ec3b2-a3cf4fbb-bfed6f20-a1ef4319.
grid@host01$ vi /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log
search for string potential or File Universal ID – 243ec3……
2012-10-09 03:54:28.423: [    CSSD][986175376]clssnmvDiskVerify: Successful discovery for disk ORCL:ASMDISK01, UID 243ec3b2-a3cf4fbb-bfed6f20-a1ef4319,
Create another diskgroup test with normal redundancy and 2 disks.
Try to move voting disk from diskgroup data to test diskgroup
– This fails, as we should have at least 3 disks in the test diskgroup
[grid@host01 cssd]$ crsctl replace votedisk +test
Failed to create voting files on disk group test.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.
Add another disk to the test diskgroup and mark it as a quorum disk. The quorum disk is one small disk (300 MB should be on the safe side here, since a voting file is only about 280 MB in size) that keeps one mirror of the voting file. The other two disks will each contain one voting file plus all the other stripes of the database area, while the quorum disk will hold only that one voting file. A sketch of adding such a disk follows below.
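A sketch of adding such a quorum disk to the TEST diskgroup (the disk path and failgroup name are hypothetical; a QUORUM failgroup can hold only a voting file, no database data):

ALTER DISKGROUP test
  ADD QUORUM FAILGROUP test_quorum DISK '/dev/sdh' SIZE 300M;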
Now try to move the voting disk from the data diskgroup to the test diskgroup
– Now the operation is successful
[grid@host01 cssd]$ crsctl replace votedisk +test
Successful addition of voting disk 00ce3c95c6534f44bfffa645a3430bc3.
Successful addition of voting disk a3751063aec14f8ebfe8fb89fccf45ff.
Successful addition of voting disk 0fce89ac35834f99bff7b04ccaaa8006.
Successful deletion of voting disk 243ec3b2a3cf4fbbbfed6f20a1ef4319.
Successfully replaced voting disk group with +test.
CRS-4266: Voting file(s) successfully replaced
– Check the ocssd.log – search for 00ce3c9……
grid@host01$vi $ORACLE_HOME/log/host01/cssd/ocssd.log
2012-10-09 05:08:19.484: [    CSSD][997631888]  Listing unique IDs for 3 voting files:
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 1: 00ce3c95-c6534f44-bfffa645-a3430bc3
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 2: a3751063-aec14f8e-bfe8fb89-fccf45ff
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 3: 0fce89ac35834f99bff7b04ccaaa8006
I hope this information was useful.
Keep visiting the blog. Thanks for your time!

References:

Oracle 10g RAC Grid, Services & Clustering by Murali Vallath

http://orainternals.wordpress.com/2010/10/29/whats-in-a-voting-disk/

—————————————————————————————————–

Related links:

Home

11G R2 RAC Index
11g R2 RAC: GPNP Profile Demystified
11g R2 RAC: How To Identify The Master Node In RAC
11g R2 RAC:Node Eviction Due To CSSDagent Stopping
11g R2 RAC : Node Eviction Due To Member Kill Escalation
11g R2 RAC: Node Eviction Due To Missing Disk Heartbeat
11g R2 RAC: Node Eviction Due To Missing Network Heartbeat 
11g R2 RAC : OCR Demystified
11g R2 RAC : OLR  Demystified
How Does  11G R2 Clusterware Start ASM When ASM SPfile Is On ASM Itself?
Cache Fusion Demonstrated
Instance Recovery In RAC
Need For VIP In RAC
Recover Voting Disk – Scenario-I
Recover Voting Disk – Scenario-II