4.1 Universal Techniques for Controlling Plans

This section describes a number of database-independent techniques you can use to control execution plans. The techniques are good for the following purposes:

Enabling use of the index you want
Preventing use of the wrong indexes
Enabling the join order you want
Preventing join orders you do not want
Choosing the order to execute outer queries and subqueries
Providing the cost-based optimizer with good data
Fooling the cost-based optimizer with bad data

These vendor-independent techniques often offer an alternative method to achieve ends you could also achieve with vendor-specific methods. When you have a choice, the vendor-specific methods are usually cleaner. However, some problems are solvable only by these universal techniques, which offer solutions that can sometimes work on SQL that is intended to run on multiple vendor databases.

4.1.1 Enabling Use of the Index You Want

To enable efficient use of an index, you need a reasonably selective condition on the leading column (or only column) of that index. The condition must also be expressed in a way that enables the database to establish a reasonably narrow index range for the index values. The ideal form this takes is:

SomeAlias.Leading_Indexed_Column=<Expression>

In less ideal cases, the comparison is with some range of values, using BETWEEN, LIKE, <, >, <=, or >=. These range comparisons also potentially enable use of the index, but the index range is likely to be larger and the resulting query therefore slower. If the index range is too large, the optimizer might conclude that the index is not worth using and choose another path to the data. When you combine equalities and range conditions for multicolumn indexes, you should prefer indexes that lead with the columns that have equality conditions and finish with columns that have range conditions. Note that the left side of the comparison simply names the column, with no function around the column, and no expression (such as addition) using the column. Use of a function, a type conversion, or an arithmetic expression on the side with the indexed column will generally disable use of that index.

Type conversions are a particularly subtle way that SQL sometimes disables use of an index. DB2 returns an error if you compare two expressions with incompatible types. SQL Server prefers to perform the implicit conversion on the side of the comparison that does not disable the index. Oracle implicitly converts character-type expressions to the type of the other side, even when this disables index use. For example, consider this expression:

P.Phone_Number=5551212

If Phone_Number were a character-type column, this would likely evaluate internally on Oracle and SQL Server as:

On Oracle:     TO_NUMBER(P.Phone_Number)=5551212
On SQL Server: P.Phone_Number=CAST(5551212 AS VARCHAR)

SQL Server preserves indexed access to the column. On Oracle, the implicit use of TO_NUMBER( ) disables use of the index just as surely as if you made the expression explicit. (The only real difference is that the problem is harder to find in the implicit form.) The same problem can plague index use for joins, as well as for single-table conditions. For example, consider the join:

P.Phone_Number=C.Contact_Number

If Contact_Number were a number type and Phone_Number were a character type, the implicit conversion on Oracle would prevent an index-driven nested-loops join from C to P. A join in the other direction would be unhindered.

The expression opposite the indexed column reference can be arbitrarily complex. However, it must not reference columns in the same alias with the indexed column. For example, consider the condition:

P.Phone_Number=P.Area_Code||'5551212'

The database cannot drive into the index on P.Phone_Number with this condition, because the database must reach alias P before it can evaluate the expression on the right side. This chicken-and-egg problem prevents identifying (with the index) the subset of the table that meets this condition until after the database examines the whole table.

The final way that SQL often disables index use is with conditions combined with OR. For example, consider the query:

SELECT ... 
FROM Order_Details D, ... 
WHERE ...
  AND (D.Order_ID=:1 or :1 IS NULL)
  AND ...

In this example, the database can reach Order_Details through an index on Order_ID if the bind variable :1 happens to be nonnull. But if :1 is bound to a null value, there is no restriction at all on Order_ID and thus no use for that index. Since the database cannot tell what :1 will be bound to when it parses the SQL and prepares the plan, it will find no good opportunity to use the index. In this case, the solution is to create a two-part plan, with each part optimized for one of the cases:

SELECT ... 
FROM Order_Details D, ... 
WHERE ...
  AND D.Order_ID=:1
  AND :1 IS NOT NULL
  AND ...
UNION ALL
SELECT ... 
FROM Order_Details D, ... 
WHERE ...
  AND :1 IS NULL
  AND ...

When you view the execution plan for this query, it shows both indexed access through the index on Order_Details(Order_ID) and full-table-scan access to Order_Details. This might appear to be the worst of both worlds, but you are saved by the conditions:

AND :1 IS NOT NULL
...
AND :1 IS NULL

These conditions make no reference at all to any data in the database, so the database can and does evaluate them before it even begins reading data for that half of the combined statement. Therefore, it never actually executes the full table scan when :1 is not null, and it never actually executes an indexed read (or any other part of the execution plan for the first half of the query) when :1 is null. This amounts to a method to branch your execution plan depending on conditions on the bind variables, the variables that determine which data is available to drive the query. The only catch is that you must ensure that the conditions on the bind variables are mutually exclusive, so that exactly one of the branches actually returns data. For example, if you have another bind variable to provide Customer_Name, you might put together a query like this:

SELECT ... 
FROM Order_Details D, Customers C, ... 
WHERE ...
  AND D.Order_ID=:1
  AND :1 IS NOT NULL
  AND (C.Customer_Name=:2 OR :2 IS NULL)
  AND ...
UNION ALL
SELECT ... 
FROM Order_Details D, Customers C, ... 
WHERE ...
  AND :1 IS NULL
  AND :2 IS NOT NULL
  AND C.Customer_Name=:2
  AND ...
UNION ALL
SELECT ... 
FROM Order_Details D, Customers C, ... 
WHERE ...
  AND :1 IS NULL
  AND :2 IS NULL
  AND ...

This could support a three-part plan, in which the database would:

Drive into Orders on the index on Order_ID (your first choice), when possible.
Otherwise, drive into Customers on the index on Customer_Name (your second choice) when it has no Order_ID specified but has a customer name.
Otherwise, just get all the rows, probably beginning with a full table scan, when it has no selective conditions at all.

In any case, the conditions on the bind variables in the three parts are contrived to be mutually exclusive:

AND :1 IS NOT NULL
...
AND :1 IS NULL
AND :2 IS NOT NULL
...
AND :1 IS NULL
AND :2 IS NULL

4.1.2 Preventing Use of the Wrong Indexes

Join expressions are usually simple, usually between consistent types, and usually between numerical IDs. Conditions on the driving table are usually simple and compatible with index use. A more frequent problem than enabling use of the right index is preventing use of the wrong indexes. In many queries, there are multiple single-table conditions that are capable of reaching multiple indexes, but you want to use only a specific one of those indexes. Join conditions are usually expressed to allow index-driven joins in either direction, although only one of the possible join directions turns out to be optimal. Occasionally, you'll prefer to disable use of an index on a join altogether, to force a hash or sort-merge join.

To disable use of an index, create the simplest possible expression around the indexed column reference. For example, you should prevent use of an index on Status_Code for the unselective condition on closed orders, as the number of closed orders will eclipse open orders as you do more and more business:

O.Status_Code='CL'

Since Status_Code is a character-type column, a simple expression to disable index use without changing the results would simply concatenate an empty string to the end of Status_Code:

On Oracle and DB2: O.Status_Code||''='CL'
On SQL Server:     O.Status_Code+''='CL'

For number-type columns, you can add 0:

O.Region_ID+0=137

All databases have some sort of function that evaluates to the first argument when the argument is null and otherwise returns the second argument. On Oracle, the function is NVL( ). On SQL Server and DB2, it is COALESCE( ). If both arguments are the same column, the function always returns the same result as the bare column, regardless of the column type. Therefore, this makes a handy recipe to deactivate index use regardless of column type:

On Oracle:             NVL(O.Order_Date,O.Order_Date)=<Value>
On DB2 and SQL Server: COALESCE(O.Order_Date,O.Order_Date)=<Value>

In a join condition, a join that disables an indexed path to O.Region_ID (but not to R.Region_ID) could look like this:

O.Region_ID+0=R.Region_ID

Using the type-independent approach, this same join would look like this:

NVL(O.Region_ID,O.Region_ID)=R.Region_ID

4.1.3 Enabling the Join Order You Want

Apart from unintentionally disabled indexes, there are two issues that sometimes disable desired join orders:

Outer joins
Missing redundant join conditions

4.1.3.1 Outer joins

Consider an outer join query, in Oracle-style notation:

SELECT ... 
FROM Employees E, Locations L
WHERE E.Location_ID=L.Location_ID(+)

or in the newer, universal notation:

SELECT ... 
FROM Employees E LEFT OUTER JOIN Locations L
     ON E.Location_ID=L.Location_ID

This query requests employee records with their matching locations, when employees have locations; otherwise, null location data is used for employees that have no matching locations. Based on the request, it is clear that the query cannot effectively drive from Locations to Employees, since even employees without locations are needed. Consider a case in which this query is just a template to which an application adds conditions that depend on search criteria provided by an end user. If the end user wants to see employees for a particular location, the application might create this query:

SELECT ... 
FROM Employees E LEFT OUTER JOIN Locations L
     ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'

In the outer case of the join from Employees to Locations, L.Description will be assigned a generated value of null, and the condition on L.Description will be false. Only the inner case of the join will return rows that might meet the restriction on L.Description, so now it makes perfect sense to drive the query in the other join order, from Locations to Employees. However, the existence of the outer join often prevents automated optimizers from allowing this reversed order on the outer joins, so you need to make the join explicitly an inner join to get the reversed join direction:

SELECT ... 
FROM Employees E INNER JOIN Locations L
     ON E.Location_ID=L.Location_ID WHERE L.Description='Headquarters'

4.1.3.2 Missing redundant join conditions

Normally, between any number of tables, the join count is the number of tables minus one. For example, between three tables, you expect to find two joins. Occasionally, a query permits an extra, redundant join. For example, if you have an Addresses table that contains all addresses significant to the company, it might have a one-to-zero or one-to-one relationship with the earlier Locations table, which contains only locations owned by the company and which references Addresses through a matching primary key. In this case, you might find a query like the following:

SELECT ... 
FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND E.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628

By transitivity (if a=b and b=c, then a=c), you can deduce that the condition L.Location_ID=A.Address_ID must be true for all rows this query would return. However, that condition is not explicit in the query, and not all databases will deduce it and fill it in if it is left out. The best plan, in this case, will likely begin with all addresses within that ZIP Code and immediately join to Locations to discard all addresses except the one or two that correspond to company locations, before joining to Employees. Since that join order requires the missing join condition to support an indexed path from Addresses to Locations, you should make the missing join condition explicit:

SELECT ... 
FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND E.Location_ID=A.Address_ID
  AND L.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628

Since you do not want to follow the join from Addresses to Employees directly, you could also remove, if necessary, the redundant join condition E.Location_ID=A.Address_ID, to discourage that unwanted join operation.

4.1.4 Preventing Join Orders You Do Not Want

Forcing joins in the direction you want, using the earlier techniques for preventing use of the wrong indexes, will prevent many undesired join orders. What do you do when you want the database to follow a particular join direction eventually, but not too early in the execution plan? You cannot afford to disable an index, because you must use that index eventually, just not too early. Consider the following two joins, in which you want to start the query with reads of T1 and then join to T2 before joining to T3:

... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID=T3.Key3_ID ...

Here, you want to follow nested loops into both T2 and T3, following indexes in the keys mentioned and reaching T2 before reaching T3. To postpone the join you want to happen later, make it depend (or at least to appear to depend) on data from the join that must happen earlier. Here is a solution:

... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID ...

You and I know that the second version is logically equivalent to the first. However, the database just finds an expression on the left side of the second join that depends on both T1 and T2 (not recognizing that no value from T2 can change the result), so it won't try to perform the join to T3 until after T2.

If necessary, you can string together joins like this to completely constrain a join order. For each join after the first, add a logically irrelevant component referencing one of the columns added in the preceding join to the join expression. For example, if you want to reach tables T1 through T5 in numerical order, you can use the following. Notice that the join condition for the T3 table uses the expression 0*T2.Key2_ID to force the join to T2 to occur first. Likewise, the join condition for the T4 table uses 0*T3.Key3_ID to force T3 to be joined first.

... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID
AND T1.Key4_ID+0*T3.Key3_ID=T4.Key4_ID
AND T1.Key4_ID+0*T4.Key4_ID=T5.Key5_ID ...

I'll apply this method to a concrete example. Consider the following SQL, adapted from Chapter 3:

SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'

Assume that you have an execution plan that drives from the index on the employee's last name, but you find that the join to the employee's location (alias LE) to discard employees at locations other than Dallas is unfortunately happening last, after the other joins (to M and LM). You should join to LE immediately from E, to minimize the number of rows you need to join to the other two tables. Starting from E, the join to LM is not immediately possible, so if you prevent the join to M before LE, you should get the join order you want. Here's how:

SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID+0*LE.Location_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID 
  AND LE.Description='Dallas'

The key here is that I've made the join to M dependent on the value from LE. The expression 0*LE.Location_ID forces the optimizer to join to LE before M. Because of the multiply-by-zero, the added expression has no effect on the results returned by the query.

4.1.5 Forcing Execution Order for Outer Queries and Subqueries

Most queries with subqueries can logically drive from either the outer query or the subquery. Depending on the selectivity of the subquery condition, either choice can be best. The choice generally arises for queries with EXISTS or IN conditions. You can always convert between an EXISTS condition on a correlated subquery and the equivalent IN condition on a noncorrelated subquery. For example, you can convert this:

SELECT ... 
FROM Departments D 
WHERE EXISTS (SELECT NULL FROM Employees E
                          WHERE E.Department_ID=D.Department_ID)

to this:

SELECT ... 
FROM Departments D
WHERE D.Department_ID IN (SELECT E.Department_ID FROM Employees E)

The first form implies that the database drives from the outer query to the subquery. For each row returned by the outer query, the database executes the join in the subquery. The second form implies that you begin with the list of distinct departments that have employees, as found in the noncorrelated subquery, and drive from that list into the matching list of such departments in the outer query. Sometimes, the database itself follows this implied join order, although some databases can make the conversion internally if their optimizer finds that the alternate order is better. To make your own SQL more readable and to make it work well regardless of whether your database can convert the forms internally, use the form that implies the order you want. To force that order even when the database could make the conversion, use the same join-direction-forcing technique used in Section 4.1.4. Thus, an EXISTS condition that forces the outer query to execute first would look like this:

SELECT ... 
FROM Departments D 
WHERE EXISTS (SELECT NULL FROM Employees E
                          WHERE E.Department_ID=D.Department_ID+0)

For the contrary order, an IN condition that forces the implied driving order from the subquery to the outer query would look like this:

SELECT ... 
FROM Departments D
WHERE D.Department_ID IN (SELECT E.Department_ID+0 FROM Employees E)

This latter order would be a bad idea, unless you found a strange case in which you had more departments than employees!

You can have several subqueries in which the database either must drive from the outer query to the subquery (such as NOT EXISTS subqueries) or should drive in that order. Such a case implies a choice of the order of execution of the subqueries. You can also have choices about whether to execute subqueries after completing the outer query, or at the first opportunity, as soon as the correlation join is possible, or at some point between these extremes.

The first tactic for controlling the order of subquery execution is simply to list the subqueries in order in the WHERE clause (i.e., the top subquery to be executed should be listed first). This is one of the few times when WHERE-clause order seems to matter.

Rarely, the database will execute a subquery sooner than you would like. The same tactic for postponing joins (described in Section 4.1.4) works for correlation joins, the joins in subqueries that correlate the subqueries to the outer queries. For example, consider this query:

SELECT ... 
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL 
              FROM Order_Details OD
              WHERE O.Order_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')

For this query you might find that the subquery runs as soon as you reach the driving Orders table, but you might wish to perform the join to Customers first, to discard nongovernmental orders, before you take the expense of the subquery execution. In this case, this would be the transformation to postpone the correlation join:

SELECT ... 
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL 
              FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')

Notice the addition of +0*C.Customer_ID to the subquery's WHERE clause. This ensures the join to Customers occurs first, before the subquery executes.

4.1.6 Providing the Cost-Based Optimizer with Good Data

On any cost-based optimizer (that is, for any query except one running on the Oracle rule-based optimizer, since only Oracle has a rule-based optimizer), the second most common source of poor execution plans (after missing indexes) is missing statistics on the tables, columns, and indexes involved in the query. In all, cost-based optimizers do a fairly good job of finding the best plan without help when they have good information to begin with. However, when they are missing informationfor example, because a table or index has been rebuilt without regenerating statistics for that objectthey tend to make terrible assumptions.

If you are running on any database except Oracle, or if you are on Oracle's cost-based optimizer (as is most common and as Oracle recommends) and not forcing the rule-based optimizer, the first thing you should try if you are not getting the execution plan you want is to regenerate statistics on every table and index relevant to the query. Standard statistics will usually suffice to get reasonable execution plans.

Cost-based optimizers usually assume that data is uniformly distributed. For example, if the optimizer statistics show a table of 1,000,000 rows with 50,000 distinct values for some indexed foreign key, the database will optimize on the assumption that every value of that key will match exactly 20 rows. For most indexed columns, like foreign keys, this assumption of a uniform data distribution works well. However, some columns have highly skewed distributions, such as status, code, or type columns, or foreign keys to status or type tables. For example, consider this query:

SELECT ... FROM Orders WHERE Status_Code = 'OP'

There might only be three or four values of Status_Code across a 1,000,000-row Orders table, but if 'OP' means this is an open order, not yet fulfilled or cancelled, this condition is far more selective than the optimizer would expect based solely on the number of distinct values. If the column had an index, the optimizer might never use that index if it knew only the small number of distinct indexed values. However, on some databases, you can generate added statistics that let the database know not only the number of distinct values but also the distribution of those values, a necessary step when you have such highly skewed distributions.

4.1.7 Fooling the Cost-Based Optimizer with Incorrect Data

This last technique is dangerous, and I recommend it only as a last resort. Sometimes, you want to simulate a large database on a small, development database. If you can extrapolate (or, better, measure from an actual database) statistics that apply to a large database, you can manually modify the data-dictionary tables that store those statistics for the optimizer, to fool the optimizer into thinking it is working with a large database. The small database will have statistics that show large tables with many distinct values on most indexes. This is a handy way to see execution plans that will apply to production volumes when you have only a test database with toy data volumes. For such toy-sized databases, there is no great risk to this approach. On production databases, the optimizer will occasionally make better choices if it has the wrong data, usually if it has data that exaggerates the selectivity of desired indexes or that exaggerates the size of a table when a full table scan is undesirable.

Imagine reversing the logic the optimizer follows: ask "What would I need to believe about the tables and indexes of this query to find an alternative plan (the alternative that you, the human optimizer, want) much more attractive?" It is not hard to fool the optimizer into doing what you want rather than what it would choose on its own, if you lie to it about the statistics. However, on production systems, this is dangerous in several ways:

As soon as anyone regenerates statistics for the tables or indexes, the optimizer will revert to the original error, unless the manual statistics-tweak is reapplied. You will have to rigorously control statistics generation to prevent this.
As soon as the database optimizer improveswith the next release, perhapsit is denied the chance to exploit those improvements with correct data.
Most importantly, every other query against the tables and indexes with false statistics is at risk and will potentially be harmed, just to help the one query you wanted to tune when you fudged the statistics.

I have never needed to play this card to get an adequately optimized plan on Oracle, SQL Server, or DB2, and I recommend you avoid it if possible.