The SQL Optimizers

Whenever you execute a SQL statement, a component of the database known as the optimizer must decide how best to access the data operated on by that statement.
Oracle supports two optimizers: the rule-base optimizer (which was the original), and the cost-based optimizer.

To figure out the optimal execution path for a statement, the optimizers consider the following:
a. The syntax you've specified for the statement
b. Any conditions that the data must satisfy (the WHERE clauses)
c. The database tables your statement will need to access
d. All possible indexes that can be used in retrieving data from the table
e. The Oracle RDBMS version
f. The current optimizer mode
g. SQL statement hints
h. All available object statistics (generated via the ANALYZE command)
i. The physical table location (distributed SQL)
j. INIT.ORA settings (parallel query, async I/O, etc.)

Oracle gives you a choice of two optimizing alternatives:
the predictable rule-based optimizer and the more intelligent cost-based optimizer.

Understanding the Rule-Based Optimizer

The rule-based optimizer (RBO) uses a predefined set of precedence rules to figure out which path it will use to access the database.
The RDBMS kernel defaults to the rule-based optimizer under a number of conditions, including:

a. OPTIMIZER_MODE = RULE is specified in your INIT.ORA file
b. OPTIMIZER_MODE = CHOOSE is specified in your INIT.ORA file, and no statistics exist for any table involved in the statement
c. An ALTER SESSION SET OPTIMIZER_MODE = RULE command has been issued
d. An ALTER SESSION SET OPTIMIZER_MODE = CHOOSE command has been issued, and no statistics exist for any table involved in the statement
e. The rule hint (e.g., SELECT /*+ RULE */. . .) has been used in the statement

The rule-based optimizer is driven primarily by 20 condition rankings, or "golden rules."
These rules instruct the optimizer how to determine the execution path for a statement, when to choose one index over another,
and when to perform a full table scan.
These rules, shown in Table 1, are fixed, predetermined, and, in contrast with the cost-based optimizer, not influenced by outside sources (table volumes, index distributions, etc.).

Table 1: Rule-based optimizer condition rankings

Rank Condition

1 ROWID = constant
2 Cluster join with unique or primary key = constant
3 Hash cluster key with unique or primary key = constant
4 Entire Unique concatenated index = constant
5 Unique indexed column = constant
6 Entire cluster key = corresponding cluster key of another table in the same cluster
7 Hash cluster key = constant
8 Entire cluster key = constant
9 Entire non-UNIQUE CONCATENATED index = constant
10 Non-UNIQUE index merge
11 Entire concatenated index = lower bound
12 Most leading column(s) of concatenated index = constant
13 Indexed column between low value and high value or indexed column LIKE "ABC%" (bounded range)
14 Non-UNIQUE indexed column between low value and high value or indexed column like `ABC%' (bounded range)
15 UNIQUE indexed column or constant (unbounded range)
16 Non-UNIQUE indexed column or constant (unbounded range)
17 Equality on non-indexed = column or constant (sort/merge join)
18 MAX or MIN of single indexed columns
19 ORDER BY entire index
20 Full table scans



While knowing the rules is helpful, they alone do not tell you enough about how to tune for the rule-based optimizer. To overcome this deficiency, the following sections provide some information that the rules don't tell you.

What the RBO rules don't tell you #1
Only single column indexes are ever merged. Consider the following SQL and indexes:

SELECT col1, ...
FROM emp
WHERE emp_name = 'GURRY'
AND emp_no = 127
AND dept_no = 12

Index1 (dept_no)
Index2 (emp_no, emp_name)

The SELECT statement looks at all three indexed columns. Many people believe that Oracle will merge the two indexes, which involve those three columns, to return the requested data. In fact, only the two-column index is used; the single-column index is not used. While Oracle will merge two single-column indexes, it will not merge a multi-column index with another index.
There is one thing to be aware of with respect to this scenario. If the single-column index is a unique or primary key index, that would cause the single-column index to take precedence over the multi-column index. Compare rank 4 with rank 9 in Table 1.
NOTE: Oracle8i introduced a new hint, INDEX_JOIN, that allows you to join multi-column indexes.

What the RBO rules don't tell you #2
If all columns in an index are specified in the WHERE clause, that index will be used in preference to other indexes for which some columns are referenced. For example:

SELECT col1, ...
FROM emp
WHERE emp_name = 'GURRY'
AND emp_no = 127
AND dept_no = 12

Index1 (emp_name)
Index2 (emp_no, dept_no, cost_center)

In this example, only Index1 is used, because the WHERE clause includes all columns for that index, but does not include all columns for Index2.

What the RBO rules don't tell you #3
If multiple indexes can be applied to a WHERE clause, and they all have an equal number of columns specified, only the index created last will be used. For example:

SELECT col1, ...
FROM emp
WHERE emp_name = 'GURRY'
AND emp_no = 127
AND dept_no = 12
AND emp_category = 'CLERK'

Index1 (emp_name, emp_category) Created 4pm Feb 11th 2002
Index2 (emp_no, dept_no) Created 5pm Feb 11th 2002

In this example, only Index2 is used, because it was created at 5 p.m. and the other index was created at 4 p.m. This behavior can pose a problem, because if you rebuild indexes in a different order than they were first created, a different index may suddenly be used for your queries. To deal with this problem, many sites have a naming standard requiring that indexes are named in alphabetical order as they are created. Then, if a table is rebuilt, the indexes can be rebuilt in alphabetical order, preserving the correct creation order. You could, for example, number your indexes. Each new index added to a table would then be given the next number.

What the RBO rules don't tell you #4
If multiple columns of an index are being accessed with an = operator, that will override other operators such as LIKE or BETWEEN. Two ='s will override two ='s and a LIKE. For example:

SELECT col1, ...
FROM emp
WHERE emp_name LIKE 'GUR%'
AND emp_no = 127
AND dept_no = 12
AND emp_category = 'CLERK'
AND emp_class = 'C1'

Index1 (emp_category, emp_class, emp_name)
Index2 (emp_no, dept_no)

In this example, only Index2 is utilized despite Index1 having three columns accessed and Index2 having only two column accessed.

What the RBO rules don't tell you #5
A higher percentage of columns accessed will override a lower percentage of columns accessed. So generally, the optimizer will choose to use the index from which you specify the highest percentage of columns. However, as stated previously, all columns specified in a unique or primary key index will override the use of all other indexes. For example:

SELECT col1, ...
FROM emp
WHERE emp_name = 'GURRY'
AND emp_no = 127
AND emp_class = 'C1'

Index1 (emp_name, emp_class, emp_category)
Index2 (emp_no, dept_no)

In this example, only Index1 is utilized, because 66% of the columns are accessed. Index2 is not used because a lesser 50% of the indexed columns are used.

What the RBO rules don't tell you #6
If you join two tables, the rule-based optimizer needs to select a driving table. The table selected can have a significant impact on performance, particularly when the optimizer decides to use nested loops. A row will be returned from the driving table, and then the matching rows selected from the other table. It is important that as few rows as possible are selected from the driving table.

The rule-based optimizer uses the following rules to select the driving table:

1. A unique or primary key index will always cause the associated table to be selected as the driving table in front of a non-unique or non-primary key index.
2. An index for which you apply the equality operator (=) to all columns will take precedence over indexes from which you use only some columns, and will result in the underlying table being chosen as the driving table for the query.
3. The table that has a higher percentage of columns in an index will override the table that has a lesser percentage of columns indexed.
4. A table that satisfies one two-column index in the WHERE clause of a query will be chosen as the driving table in front of a table that satisfies two single-column indexes.
5. If two tables have the same number of index columns satisfied, the table that is listed last in the FROM clause will be the driving table. In the SQL below, the EMP table will be the driving table because it is listed last in the FROM clause.

SELECT ....
FROM DEPT d, EMP e
WHERE e.emp_name = 'GURRY'
AND d.dept_name = 'FINANCE'
AND d.dept_no = e.dept_no

What the RBO rules don't tell you #7
If a WHERE clause has a column that is the leading column on any index, the rule-based optimizer will use that index. The exception is if a function is placed on the leading index column in the WHERE clause. For example:

SELECT col1, ...
FROM emp
WHERE emp_name = 'GURRY'

Index1 (emp_name, emp_class, emp_category)
Index2 (emp_class, emp_name, emp_category)

Index1 will be used, because emp_name (used in the WHERE clause) is the leading column. Index2 will not be used, because emp_name is not the leading column.

The following example illustrates what happens when a function is applied to an indexed column:

SELECT col1, ...
FROM emp
WHERE LTRIM(emp_name) = 'GURRY'

In this case, because the LTRIM function has been applied to the column, no index will be used.

Understanding the Cost-Based Optimizer
The cost-based optimizer is a more sophisticated facility than the rule-based optimizer. To determine the best execution path for a statement, it uses database information such as table size, number of rows, key spread, and so forth, rather than rigid rules.

The information required by the cost-based optimizer is available once a table has been analyzed via the ANALYZE command, or via the DBMS_STATS facility. If a table has not been analyzed, the cost-based optimizer can use only rule-based logic to select the best access path. It is possible to run a schema with a combination of cost-based and rule-based behavior by having some tables analyzed and others not analyzed.

NOTE: The ANALYZE command and the DBMS_STATS functions collect statistics about tables, clusters, and indexes, and store those statistics in the data dictionary.

A SQL statement will default to the cost-based optimizer if any one of the tables involved in the statement has been analyzed. The cost-based optimizer then makes an educated guess as to the best access path for the other tables based on information in the data dictionary.

The RDBMS kernel defaults to using the cost-based optimizer under a number of situations, including the following:

1. OPTIMIZER_MODE = CHOOSE has been specified in the INIT.ORA file, and statistics exist for at least one table involved in the statement
2. An ALTER SESSION SET OPTIMIZER_MODE = CHOOSE command has been executed, and statistics exist for at least one table involved in the statement
3. An ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS (or ALL_ROWS) command has been executed, and statistics exist for at least one table involved in the statement
4. A statement uses the FIRST_ROWS or ALL_ROWS hint (e.g., SELECT /*+ FIRST_ROWS */. . .)

ANALYZE command
The way that you analyze your tables can have a dramatic effect on your SQL performance. If your DBA forgets to analyze tables or indexes after a table re-build, the impact on performance can be devastating. If your DBA analyzes each weekend, a new threshold may be reached and Oracle may change its execution plan. The new plan will more often than not be an improvement, but will occasionally be worse.
I cannot stress enough that if every SQL statement has been tuned, do not analyze just for the sake of it. One site that I tuned had a critical SQL statement that returned data in less than a second. The DBA analyzed each weekend believing that the execution path would continue to improve. One Monday, morning I got a phone call telling me that the response time had risen to 310 seconds.
If you do want to analyze frequently, use DBMS_STATS.EXPORT_SCHEMA_STATS to back up the existing statistics prior to re-analyzing. This gives you the ability to revert back to the previous statistics if things screw up.
When you analyze, you can have Oracle look at all rows in a table (ANALYZE COMPUTE) or at a sampling of rows (ANALYZE ESTIMATE). Typically, I use ANALYZE ESTIMATE for very large tables (1,000,000 rows or more), and ANALYZE COMPUTE for small to medium tables.
I strongly recommend that you analyze FOR ALL INDEXED COLUMNS for any table that can have severe data skewness. For example, if a large percentage of rows in a table has the same value in a given column, that represents skewness. The FOR ALL INDEXED COLUMNS option makes the cost-based optimizer aware of the skewness of a column's data in addition to the cardinality (number-distinct values) of that data.
When a table is analyzed using ANALYZE, all associated indexes are analyzed as well. If an index is subsequently dropped and recreated, it must be re-analyzed. Be aware that the procedures DBMS_STATS.GATHER_SCHEMA_STATS and GATHER_TABLE_STATS analyze only tables by default, not their indexes. When using those procedures, you must specify the CASCADE=>TRUE option for indexes to be analyzed as well.

Following are some sample ANALYZE statements:

ANALYZE TABLE EMP ESTIMATE STATISTICS SAMPLE 5 PERCENT FOR ALL INDEXED COLUMNS;

ANALYZE INDEX EMP_NDX1 ESTIMATE STATISTICS SAMPLE 5 PERCENT FOR ALL INDEXED COLUMNS;

ANALYZE TABLE EMP COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;

If you analyze a table by mistake, you can delete the statistics. For example:

ANALYZE TABLE EMP DELETE STATISTICS;

Analyzing can take an excessive amount of time if you use the COMPUTE option on large objects. We find that on almost every occasion, ANALYZE ESTIMATE 5 PERCENT on a large table forces the optimizer make the same decision as ANALYZE COMPUTE.

Tuning prior to releasing to production
A major dilemma that exists with respect to the cost-based optimizer (CBO) is how to tune the SQL for production prior to it being released. Most development and test databases will contain substantially fewer rows than a production database. It is therefore highly likely that the CBO will make different decisions on execution plans. Many sites can't afford the cost and inconvenience of copying the production database into a pre-production database.

Oracle8i and later provides various features to overcome this problem, including DBMS_STATS and the outline facility. Each is explained in more detail later in this book.

Inner workings of the cost-based optimizer
Unlike the rule-based optimizer, the cost-based optimizer does not have hard and fast path evaluation rules. The cost-based optimizer is flexible and can adapt to its environment. This adaptation is possible only once the necessary underlying object statistics have been refreshed (re-analyzed). What is constant is the method by which the cost-based optimizer calculates each possible execution plan and evaluates its cost (efficiency).

The cost-based optimizer's functionality can be (loosely) broken into the following steps:

Parse the SQL (check syntax, object privileges, etc.).

Generate a list of all potential execution plans.

Calculate (estimate) the cost of each execution plan using all available object statistics.

Select the execution plan with the lowest cost.

The cost-based optimizer will be used only if at least one table within a SQL statement has statistics (table statistics for unanalyzed tables are estimated). If no statistics are available for any table involved in the SQL, the RDBMS will resort to the rule-based optimizer, unless the cost-based optimizer is forced via statement-level HINTS or by an optimizer goal of ALL_ROWS or FIRST_ROWS.

To understand how the cost-based optimizer works and, ultimately, how to exploit it, we need to understand how it thinks.

Primary key and/or UNIQUE index equality
A UNIQUE index's selectivity is recognized as 100%. No other indexed access method is more precise. For this reason, a unique index is always used when available.

Non-UNIQUE index equality
For non-UNIQUE indexes, index selectivity is calculated. The cost-based optimizer makes the assumption that the table (and subsequent indexes) have uniform data spread unless you use the FOR ALL INDEXED COLUMNS option of the ANALYZE. That option will make the cost-based optimizer aware of how the data in the indexed columns is skewed.

Range evaluation
For index range execution plans, selectivity is evaluated. This evaluation is based on a column's most recent high-value and low-value statistics. Again, the cost-based optimizer makes the assumption that the table (and subsequent indexes) have uniform data spread unless you use the FOR ALL INDEXED COLUMNS option when analyzing the table.

Range evaluation over bind variables
For index range execution plans, selectivity is guessed. Prior to Oracle9i, because bind variable values are not available at parse time (values are passed to the cursor after the execution plan has been decided), the optimizer cannot make decisions based on bind variable values. The optimizer assumes a rule of thumb of 25% selectivity for unbounded bind variable ranges (e.g., WHERE dept_no = :b1) and 50% selectivity for bounded ranges (WHERE dept_no > :b1 AND dept_no < :b2). Beginning with Oracle9i, the cost-based optimizer obtains bind variable values prior to determining an execution plan.

Histograms
Prior to the introduction of histograms in Oracle 7.3, The cost-based optimizer could not distinguish grossly uneven key data spreads.

System resource usage
By default, the cost-based optimizer assumes that you are the only person accessing the database. Oracle9i gives you the ability to store information about system resource usage, and can make much better informed decisions based on workload (read up on the DBMS_STATS.GATHER_SYSTEM_STATS package).

Current statistics are important
The cost-based optimizer can make poor execution plan choices when a table has been analyzed but its indexes have not been, or when indexes have been analyzed but not the tables.

You should not force the database to use the cost-based optimizer via inline hints when no statistics are available for any table involved in the SQL.

Using old (obsolete) statistics can be more dangerous than estimating the statistics at runtime, but keep in mind that changing statistics frequently can also blow up in your face, particularly on a mission-critical system with lots of online users. Always back up your statistics before you re-analyze by using DBMS_STATS.EXPORT_SCHEMA_STATS.

Analyzing large tables and their associated indexes with the COMPUTE option will take a long, long time, requiring lots of CPU, I/O, and temporary tablespace resources. It is often overkill. Analyzing with a consistent value, for example, estimate 3%, will usually allow the cost-based optimizer to make optimal decisions

Combining the information provided by the selectivity rules with other database I/O information allows the cost-based optimizer to calculate the cost of an execution plan.

EXPLAIN PLAN for the cost-based optimizer
Oracle provides information on the cost of query execution via the EXPLAIN PLAN facility. EXPLAIN PLAN can be used to display the calculated execution cost(s) via some extensions to the utility. In particular, the plan table's COST column returns a value that increases or decreases to show the relative cost of a query. For example:

EXPLAIN PLAN FOR
SELECT count(*)
FROM winners, horses
WHERE winners.owner=horses.owner
AND winners.horse_name LIKE 'Mr %'

COLUMN "SQL" FORMAT a56

SELECT lpad(' ',2*level)||operation||''
||options ||' '||object_name||
decode(OBJECT_TYPE, '', '',
'('||object_type||')') "SQL",
cost "Cost", cardinality "Num Rows"
FROM plan_table
CONNECT BY prior id = parent_id
START WITH id = 0;

SQL Cost Num Rows
-----------------------------------------------
SELECT STATEMENT 44 1
SORT AGGREGATE
HASH JOIN 44 100469
INDEX RANGE SCAN MG1(NON-UNIQUE)
2 1471
INDEX FAST FULL SCAN OWNER_PK(UNIQUE)
4 6830

By manipulating the cost-based optimizer (i.e., via inline hints, by creating/removing indexes, or by adjusting the way that indexes or tables are analyzed), we can see the differences in the execution cost as calculated by the optimizer. Use EXPLAIN PLAN to look at different variations on a query, and choose the variation with the lowest relative cost.

For absolute optimal performance, many sites have the majority of the tables and indexes analyzed but a small number of tables that are used in isolation are not analyzed. This is usually to force rule-based behavior on the tables that are not analyzed. However, it is important that tables that have not been analyzed are not joined with tables that have been analyzed.

Some Common Optimizer Misconceptions
Let's clear up some common misconceptions regarding the optimizers:

Oracle8i and Oracle9i don't support the rule-based optimizer
This is totally false. Certain publications mentioned this some time ago, but Oracle now assures us that this is definitely not true.

Hints can't be used with the rule-based optimizer
The large majority of hints can indeed be applied to SQL statements using the rule-based optimizer.

SQL tuned for rule will run well in cost
If you are very lucky it may, but when you transfer to cost, expect a handful of SQL statements that require tuning. However, there is not a single site that I have transferred and been unable to tune.

SQL tuned for cost will run well in rule
This is highly unlikely, unless the code was written with knowledge of the rule-based optimizer.

You can't run rule and cost together
You can run both together by setting the INIT.ORA parameter OPTIMIZER_MODE to CHOOSE, and having some tables analyzed and others not. Be careful that you don't join tables that are analyzed with tables that are not analyzed.

Which Optimizer to Use?
If you are currently using the rule-based optimizer, I strongly recommend that you transfer to the cost-based optimizer. Here is a list of the reasons why:

1. The time spent coding is reduced.
2. Coders do not need to be aware of the rules.
3. There are more features, and far more tuning tools, available for the cost-based optimizer.
4. The chances of third-party packages performing well has been improved considerably. Many third-party packages are written to run on DB2, Informix, and SQL*Server, as well as on Oracle. The code has not been written to suit the rule-based optimizer; it has been written in a generic fashion.
5. End users can develop tuned code without having to learn a large set of optimizer rules.
6. The cost-based optimizer has improved dramatically from one version of Oracle to the next. Development of the rule-based optimizer is stalled.
7. There is less risk from adding new indexes.
8. There are many features that are available only with the cost-based optimizer. These features include recognition of materialized views, star transformation, the use of function indexes, and so on. The number of such features is huge, and as time goes on, the gap between cost and rule will widen.
9. Oracle has introduced features such as the DBMS_STATS package and outlines to get around known problems with the inconsistency of the cost-based optimizer across environments.