2.6 Calculating Selectivity

When tuning queries, you can best picture nonjoin (single-table) conditions as filters. These filters pass through the table rows that the application needs (the rows satisfying the conditions), while discarding the rows that the application does not need (the rows failing to satisfy the query conditions). The application has functional reasons to exclude unneeded rows, but from a performance point of view the job of the filter is to save the database work and time. The trick of tuning is to avoid work, as much as possible, on rows destined for the trash heap.

Selectivity is the power of a filter as a row-excluding tool, expressed as the fraction of table rows that the filter passes through. The database can apply filters at three points, with varying success at minimizing query cost:


Determining the index range condition

Conditions that determine the limits of an index range scan require no work at all on the excluded rows.


Determining the table rows reached from the index

Sometimes, conditions do not determine the index range limits but nevertheless can be evaluated in the index before reaching the table. These conditions require touching ultimately excluded row entries in the index, but not in the table.


Determining the rows returned after table access

If a condition requires columns held in the table but not in an index, a database cannot evaluate that condition until it reads the table rows. Filters evaluated in the table are of no value in reducing the costs of reaching the rows of that table, though they reduce network costs because the excluded rows do not need to be returned. If the filtered table is not the last or the only table touched, any filter also reduces the costs of joins to other tables later in the execution plan.

2.6.1 Filter Selectivity

In this section, I explain how to calculate the selectivity of conditions on a table. Let's begin with some definitions:


Individual-condition filter selectivity

The fraction of table rows that satisfy a single condition on that table.


Multiple-condition filter selectivity

The fraction of table rows that satisfy the combination of conditions referring exclusively to that table.


Filter independence

The assumption, usually true, that you can calculate multiple-condition selectivity as the simple product of individual-condition selectivity fractions. For example, a condition on a person's first name and a condition on a person's Zip Code are logically independent. You can assume that the fraction of rows that have both the correct first name and the correct Zip Code will be roughly the product of the fraction of rows that have the correct name and the fraction that have the correct Zip Code. For example, if 1/100th of the rows contain the desired first name and 1/500th of the rows contain the desired Zip Code, then the multiple-condition filter selectivity is 1/100 x 1/500 = 1/50,000.


Filter redundancy

The opposite of filter independence. The truth of one condition guarantees the truth of another. For example, a condition on the Zip Code likely guarantees a single value for telephone area code, so the selectivity of conditions on both would be no better than the selectivity of the Zip Code alone. You can always test for complete or partial filter redundancy by calculating the multiple-condition filter selectivity based on filter independence and seeing whether it equals the actual selectivity of the combination of the conditions.
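
As a minimal sketch of that test (the Customers table here is reused from later examples, but the Zip_Code and Area_Code columns and the specific values are hypothetical illustrations), you could run four counts:

SELECT COUNT(*) FROM Customers WHERE Zip_Code = '94303';
SELECT COUNT(*) FROM Customers WHERE Area_Code = 650;
SELECT COUNT(*) FROM Customers WHERE Zip_Code = '94303' AND Area_Code = 650;
SELECT COUNT(*) FROM Customers;

If the third count divided by the fourth roughly equals the product of the first two counts each divided by the fourth, the filters are independent; if it is much larger, the filters are partially or fully redundant.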

When you tune a query and evaluate filter selectivity, start by asking if the query stands alone or if it represents a whole group of queries. If it represents a group of queries, ask about the distribution of values within that group. For example, consider the query:[5]

[5] Note that I usually replace the SELECT list with ... in my examples. It turns out that the list of columns and expressions that you select has little impact on query performance, which is dominated by the list of tables in the FROM clause and the conditions in the WHERE clause.

SELECT ... FROM Orders WHERE Unpaid_Flag='Y';

This query has (we would hope) high selectivity, being true for a small fraction of the complete history of orders. If you can count on finding a value of 'Y' for Unpaid_Flag in the query, you might want an index on that column. But if the query is part of a group that just as often searches for the unselective condition Unpaid_Flag='N', you should likely avoid the index. In this example, the meaning of the flag is likely special to the query, driving the very purpose of the query (to find bills to send out), so you can count on finding primarily queries against 'Y', which is the rare value.

Yes, I promised earlier that you did not need to understand the application to tune the SQL. Really, you do not. You can always ask the developers if the application SQL will consistently point to the rare indexed value. However, you might be surprised how much you can deduce about an application with just a little thought about what makes sense, given the names of the tables and columns.


To calculate the selectivity of the condition Unpaid_Flag='Y', begin by executing the following two queries:

SELECT COUNT(*) FROM Orders WHERE Unpaid_Flag='Y';
SELECT COUNT(*) FROM Orders;

The selectivity of the condition is the first count divided by the second.
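
If you prefer a single statement, a minimal sketch of the same ratio (assuming Oracle-style numeric division, which does not truncate) is:

SELECT SUM(CASE WHEN Unpaid_Flag = 'Y' THEN 1 ELSE 0 END) / COUNT(*) AS Selectivity
FROM Orders;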

On the other hand, consider the query:

SELECT ... FROM Order_Details WHERE Order_ID=:id;

End users would query for details of orders roughly at random. You could assume this even if the application replaced the bind variable :id with an actual value, since an application would have no reason to always query the same order. The query against any particular ID does not stand alone, but represents a whole family of queries that you should tune as a unit. End users are as likely to query one order as another, so you could estimate the filter selectivity with:

SELECT 1/COUNT(DISTINCT Order_ID) FROM Order_Details;

A more subtle case arises when the end user might query on any value, but the end user is more likely to query on common values than uncommon values. For example, if operators bring up customers by finding all customers that have the last name of the calling customer, they will bring up common last names more often than uncommon ones with a query such as:

SELECT ... FROM Customers WHERE Last_Name = 'SMITH';

Here, if you just counted distinct names, as you counted order IDs earlier, you would see an over-optimistic selectivity that assumed you were just as likely to search for Last_Name='KMETEC' as for Last_Name='SMITH'. Each last name has a selectivity of n(i)/C, where n(i) is the count of rows with the ith nonnull last name and C is the count of all rows in the table. If choosing any last name were equally probable, you could just average n(i)/C over all the last names. That average would equal one over the number of distinct names. However, the probability of searching on a last name in this scenario is n(i)/C', where C' is the count of rows having nonnull last names. Therefore, you really need the sum of the selectivities times the probability of seeing each selectivity (i.e., the sum of (n(i)/C') x (n(i)/C) over all last names). Since C' is also the sum of the individual n(i) values, you can compute the filter selectivity in SQL as follows:

SELECT SUM(COUNT(Last_Name)*COUNT(Last_Name))/
        (SUM(COUNT(Last_Name))*SUM(COUNT(*))) 
FROM Customers GROUP BY Last_Name;
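
To see why the weighting matters, consider purely hypothetical numbers: a 1,000,000-row table (C) with 980,000 nonnull last names (C'), of which 19,600 are 'SMITH'. The chance that a given lookup searches for SMITH is 19,600/980,000 = 0.02, and the selectivity of that search is 19,600/1,000,000 = 0.0196, so SMITH contributes 0.02 x 0.0196 = 0.000392 to the weighted sum. A rare name appearing in only 98 rows contributes (98/980,000) x (98/1,000,000), roughly 0.00000001, which is negligible. Common names dominate the average, so the weighted selectivity comes out considerably worse (larger) than the naive estimate of one over the number of distinct names.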

2.6.2 Index Range-Condition Selectivity

Index range-condition selectivity is the fraction of table rows the database examines in an index while scanning the index. Every index range scan must have two endpoints that define the beginning and end of the range. These endpoints can be equivalent to positive or negative infinity, meaning that the range can be unbounded on either end, though not generally on both ends. The range can exclude either or both of the range endpoints, depending on the nature of the bounding condition. For example, the range condition (Salary > 4000 AND Salary < 8000) excludes both endpoints with > and < bounding conditions.

Common problems can prevent a database from finding a concrete beginning and end for a range scan, even given a very selective condition on the indexed column, and these generally rule out effective use of the index. It is hard or even impossible (depending on the function) to translate a condition on SomeFunction(Some_Column) into a single range condition on a simple index leading with Some_Column. Generally, a database does not even try to do so.

Function-based indexes, supported on Oracle, are the main exception to this rule. They allow you to specifically index the result of some expression on a table's columns (for example, UPPER(Last_Name)) to drive indexed access to conditions on that expression (for example, UPPER(Last_Name) LIKE 'SMITH%').
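
As a rough sketch of Oracle's syntax for such an index (the index name here is hypothetical):

CREATE INDEX Customers_Upper_Last_Name ON Customers (UPPER(Last_Name));

-- With this index in place, Oracle can drive the following query from an
-- index range scan on the uppercased name:
SELECT * FROM Customers WHERE UPPER(Last_Name) LIKE 'SMITH%';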


Therefore, expressions that do more than simply name a column do not usually enable indexed access to ranges defined on that column, often making indexes useless for queries using those expressions. This is a double-edged sword: it forces you to rewrite some queries to enable the index that you want to use, but it is also a useful tool, allowing you to rewrite other queries to prevent the database from using indexes that you do not want to use.

A subtle example of a function that disables use of an index is when you compare expressions of different types and the database applies a type-conversion function implicitly. For example, the type-inconsistent condition CharacterColumn=94303 implicitly becomes, in Oracle, TO_NUMBER(CharacterColumn)=94303. To resolve this problem and enable use of an index on the character column, perform the conversion explicitly on the other side. For example:

Replace

CharacterColumn=94303

with

CharacterColumn=TO_CHAR(94303)

or, equivalently, with the character literal CharacterColumn='94303'.

Comparing a character expression to anything else in Oracle results in the character value converting to the other type, unless you convert the other side of the comparison explicitly. This is a bit surprising, since numbers and dates can always convert without error to character strings, but character strings frequently fail to convert to numbers and dates.


Even when comparing numbers to numbers, type conversion can lead to difficulties when a database vendor supports different number types, such as integer and decimal types. Type conversions also interfere with efficient, indexed join paths for which a foreign key of one type points to a primary key of another type.

To avoid the problem of implicit type conversion preventing the use of indexes, your database design should always use foreign keys with the same type as the primary keys they reference.
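
A minimal sketch of that design rule, with hypothetical column definitions (Oracle types assumed):

CREATE TABLE Orders (
  Order_ID   NUMBER(10) PRIMARY KEY,
  Order_Date DATE
);

CREATE TABLE Order_Details (
  Order_Detail_ID NUMBER(10) PRIMARY KEY,
  Order_ID        NUMBER(10) REFERENCES Orders(Order_ID), -- same type as the referenced key
  Quantity        NUMBER(5)
);

Because Order_Details.Order_ID has exactly the type of Orders.Order_ID, a join between the two columns needs no implicit conversion and can use an index on either side.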


Several rules apply when you figure out how large an index range scan will be:

  • Conditions on the leading column of the index under consideration are suited to provide a range start or endpoint. Conditions on later columns of the same index are not suited to provide a range start or endpoint, unless you also have exact equality conditions on all preceding columns of that index. For example, an index on (Date_Column, ID_Number) applied to the condition Date_Column >= TO_DATE('2003/01/01', 'YYYY/MM/DD') delivers a range scan fully determined by the date condition. Further conditions on the second column, such as ID_Number=137, do not further narrow the index range scan. To further narrow the range based on the second condition, you have to look at a long list of ranges, one for each possible value of Date_Column that satisfies the first condition, but databases do not do this. However, if you flip the index column order to (ID_Number, Date_Column), then the same two conditions together define the range endpoints and you get a much shorter, faster index range scan, as illustrated in the sketch following this list.

Since query conditions on dates (and even more so on date-times, which are dates that include a time component) tend not to be equality conditions, multicolumn indexes usually should not lead with a date-type column.


  • Most databases assume that Indexed_Col IS NULL denotes too large a range to be useful, and they ignore such conditions when establishing range endpoints.

DB2 is the exception to this rule. Oracle does not create index entries for rows in which all of the indexed columns are null, but DB2 does, and it appears to treat nulls as just another value, with the same likelihood of appearing as other values. This tends to make DB2 all too likely to choose a plan driving from an is-null condition, since nulls, where allowed, tend to be much more common than individual nonnull values.


  • The database assumes that Indexed_Col IS NOT NULL covers too large a range to be useful, so the database will not drive to an index from this condition. In rare cases, having any nonnull value is so rare that an index range scan over all possible nonnull values is beneficial. In such cases, if you can figure out a safe lower or upper limit to the range of all possible values, you can enable a range scan with a condition such as Positive_ID_Column > -1 or Date_Column > TO_DATE('0001/01/01','YYYY/MM/DD').

  • Indexed_Char_Col LIKE 'ABC%' establishes the start and end of a legal range for an index, with Indexed_Char_Col as the leading column of the index. (LIKE is SQL's pattern-matching comparator, and % is the pattern-matching wildcard.)

  • Indexed_Char_Col LIKE 'ABC%DEF' also establishes the start and end of a legal range, but in this case only the 'ABC%' part of the comparison string contributes to narrowing the index range scan.

  • Indexed_Number_Column LIKE '123%' does not establish the start and end of a legal range, because the LIKE comparison is meaningful only when applied to character strings and the database must implicitly convert Indexed_Number_Column to a character string to check this condition, disabling any index leading with Indexed_Number_Column. In terms native to numbers, this condition would point to a whole series of ranges:

    (Indexed_Number_Column >= 123 AND Indexed_Number_Column < 124) OR 
    (Indexed_Number_Column >= 1230 AND Indexed_Number_Column < 1240) OR 
    (Indexed_Number_Column >= 12300 AND Indexed_Number_Column < 12400) OR...
  • Indexed_Char_Col LIKE '%ABC%' does not establish the start and end of a legal range, because the leading wildcard implies this pattern might be matched anywhere in the entire range of the index.

  • Equality (=), BETWEEN, and most inequality (<, <=, >, >=) conditions on first columns of indexes establish legitimate index range endpoints.

  • The not-equal-to inequality condition, usually expressed as != or <>, does not establish an index range, because the database assumes this condition to be too unselective to be worth indexed access. If the excluded value covers almost all the rows, with other values being rare, you can usefully enable indexed access by replacing the negative condition Column!='DominantValue' with Column IN (<List of all other values, all of which are rare>), although it could prove inconvenient to keep this list up-to-date as the application evolves.

  • Series of conditions combined by OR or by an IN list can lead to a series of range scans, but only if every such condition points to a legal range. For example, IDColumn IN (123,124,125) yields three legal equality conditions that deliver three range scans, and (Name='Smith' OR Name='Jones') yields a pair of range scans, but (Name='Smith' OR Name IS NULL) does not enable use of an index (except on DB2), since IS NULL does not mark a legal range.
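
Here is the sketch of the column-order effect promised in the first item of the preceding list. The table and index names are hypothetical, and the syntax assumes Oracle:

CREATE INDEX Some_Table_Date_First ON Some_Table (Date_Column, ID_Number);
CREATE INDEX Some_Table_ID_First ON Some_Table (ID_Number, Date_Column);

SELECT *
  FROM Some_Table
 WHERE Date_Column >= TO_DATE('2003/01/01', 'YYYY/MM/DD')
   AND ID_Number = 137;
-- With the Date_First index, only the date condition bounds the range scan;
-- ID_Number = 137 is checked on each index entry inside that range but does
-- not narrow it. With the ID_First index, the equality on ID_Number plus the
-- date condition together bound a much shorter range scan.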

If you have conditions that do not specify a restricted range on at least the first column of an index, the only way to use an index at all is to perform a full index scan, a scan of every index entry, across all leaf blocks. Databases will not usually choose full index scans, because to read an entire index is expensive and full index scans usually end up pointing to much of the table as well, with a net cost greater than the competing full table scan. You can force full index scans (often without meaning to) by adding a query hint (more on these later) that instructs the database to use an index even though it would not choose to use that index on its own. More often than not, adding such a hint is a mistake, a desperate attempt to get something better than a full table scan. Usually, this happens when a developer values an index too much, actually leading to a worse plan than the plan the database would choose without help.

Even in cases for which a full index scan is better than the competing full table scan, it is almost surely worse than a third alternative: usually driving from a different, sometimes new index or changing a condition on a function to enable a narrow range scan.

2.6.3 Selectivity on Table Rows Reached from the Index

Since indexes are compact and usually better cached than tables, the selectivity of an index range scan matters less than efficient table access. Even when an index and table are perfectly cached, the selectivity on the table matters more than on the index, because you read about 300 index rows in a single logical I/O to a leaf block but you must do a separate logical I/O for every table row. This brings us to the second part of evaluating the usefulness of an index: how narrowly will the index restrict access to the table? Fortunately, database implementations are smart enough to check all conditions as early in the execution plan as possible, before touching a table, when the index allows. This reduces table reads whenever an index includes columns required to evaluate a condition, even if the database cannot use that condition to determine the range scan endpoints. For example, consider a Customers table with one index that includes the Area_Code, Phone_Number, Last_Name, and First_Name columns. Consider a query against this table with the conditions:

WHERE Area_Code=916 AND UPPER(First_Name)='IVA'

Only the first condition contributes to the endpoints of the index range scan. The second condition, on the fourth indexed column, fails to narrow that index range scan for three reasons:

  • The second index column, Phone_Number, has no equality condition specified (no condition at all, for that matter).

  • The third column, Last_Name, also has no equality condition specified.

  • The condition on First_Name is disabled as range-limiting by the UPPER function.

Fortunately, none of these reasons prevents the database from checking the second condition before accessing the table. Because the index contains the First_Name column, the database can test conditions against that column using data from the index. In this case, the database uses the Area_Code condition to define an index range scan. Then, as it looks at each index entry in the range, the database discards entries with first names other than Iva. The likely result is just one or two table-row reads on this selective combination of conditions.
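
To make the preceding example concrete, here is a sketch of the wide index and the query it serves (the index name is hypothetical; Oracle syntax assumed):

CREATE INDEX Customers_Phone_Name ON Customers
  (Area_Code, Phone_Number, Last_Name, First_Name);

SELECT *
  FROM Customers
 WHERE Area_Code = 916
   AND UPPER(First_Name) = 'IVA';
-- Area_Code = 916 alone bounds the index range scan; the condition on
-- UPPER(First_Name) is evaluated against First_Name values stored in the
-- index entries, so most excluded rows are discarded before any table access.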

Let's examine this situation more closely to decide whether it would be worthwhile to create a better index for this combination of conditions. Both Area_Code and First_Name will see skewed distributions. You will query the common area codes and common first names more often than the uncommon ones, so follow the approach for skewed distributions and find the individual selectivity factors:

SELECT SUM(COUNT(Area_Code)*COUNT(Area_Code))/
        (SUM(COUNT(Area_Code))*SUM(COUNT(*))) 
FROM Customers GROUP BY Area_Code;

SELECT SUM(COUNT(First_Name)*COUNT(First_Name))/
        (SUM(COUNT(First_Name))*SUM(COUNT(*))) 
FROM Customers GROUP BY First_Name;

Let's say that the first calculation yields a selectivity of 0.0086, and the second yields a selectivity of 0.018. Let's further assume that you have a 1,000,000-row table and the index references 200 rows per leaf block (less than usual, because this is an unusually wide index key). The condition on Area_Code alone determines the number of index blocks scanned, so first find the number of rows that condition references. The calculation is 0.0086 x 1,000,000 = 8,600, indicating that the database scans 43 (8,600/200) leaf blocks. (You can always neglect root and branch block reads in single-table queries.) These 43 leaf blocks are likely well cached and require about a millisecond to read.

Databases can perform about 60,000 logical I/Os per second on typical CPUs. However, your mileage will vary considerably. Throughout this book, I include figures like 300 index row entries per leaf block and 60,000 logical I/Os per second because I have found them to be roughly correct and useful for an intuitive understanding of tuning priorities. Across a wide range of database and CPU vendors and a wide range of block and column sizes, these numbers can easily be off by a factor of four or more, and performance will no doubt improve greatly in the years after this book is published. Your intuitive understanding from knowing at least the order-of-magnitude ballpark for these numbers will still help, though, and I hope you find remembering a number easier than remembering a complex, conditional range.


Going to the table with 8,600 reads could be a problem, but, fortunately, you pick up the additional selectivity of the condition on First_Name before reaching the table, reducing the row count to about 155 (8,600 x 0.018) rows. You will see about 155 logical I/Os to the table, with a few physical I/Os, since you will see a less favorable hit ratio on the table, compared to the more compact index.

The multicolumn filter selectivity, 0.0086 x 0.018 = 0.000155, is well into the range that favors indexed access, even though the individual selectivity of the first column alone is in the gray area. Notice that even if you modify the query or the data model to pick up the full selectivity in the index range conditions, the cost will go down only marginally, eliminating most of the 43 well-cached index leaf-block reads and speeding the query by about a millisecond. This will have no effect on the 155 more expensive table-block reads. (You could modify the application to do a case-sensitive match on the first name, and create a new index to cover just these two columns. Alternatively, you could modify the table to store the uppercase first names and index the new column together with area code.) Therefore, even though this query and index are not perfectly suited to one another, the index is good enough to use. It is not worthwhile to add a new index and change the query for the small potential gain.

2.6.4 Combining Indexes

Occasionally, the database finds equality-condition filters that point to different single-column indexes. By combining index range scans on these indexes, the database can pick up the multicolumn filter selectivity before reaching the table. Let's reuse the earlier example of a query that specifies area code and customer first name, but replace the multicolumn index with single-column indexes on the two columns. Further, replace the condition on UPPER(First_Name) with a simple match on the column, assuming the column already stores uppercase values. Typical conditions are then:

WHERE Area_Code=415 AND First_Name='BRUCE';

You can now assume that you have closer to the typical 300 rows or more per index leaf block, since these are single-column indexes on short columns. Using the same single-column filter selectivity factors and assuming as before that the table holds 1,000,000 rows, you can again predict that the Area_Code condition will yield 8,600 (0.0086 x 1,000,000) rows, requiring 29 (8,600/300) index leaf blocks. The second index range scan, for First_Name, will yield 18,000 (0.018 x 1,000,000) rows, requiring 60 (18,000/300) index leaf blocks. Since these are both equality conditions, the database finds the two rowid lists for these two conditions presorted, and it can easily merge these two sorted lists to find the same 155 table rowids found with the earlier multicolumn index.

The operation just described, merging lists of rowids from two indexes, is called the index AND-EQUAL MERGE operation, and the database can do this between any number of equality conditions on single-column indexes. The cost of the table access is the same as finding the rows with one multicolumn index, but you must read more index leaf blocks: in this case, 89 (29+60) instead of the 43 leaf blocks of the earlier example.

As you can see, the AND-EQUAL MERGE is more expensive than a multicolumn index, but it might be better than using just one single-column index. But this case, in which the AND-EQUAL MERGE is reasonably attractive, is rare. Almost always, either the most selective single-column condition is fine by itself and the cost of the less selective index range scan exceeds the savings in table access, or the added cost of the AND-EQUAL MERGE justifies creating a multicolumn index. When you see an AND-EQUAL MERGE in an execution plan, you should almost always either suppress use of the less selective index or create a multicolumn index to use instead. The new multicolumn index should start with the more selective column (in this case, Area_Code) and should usually take the place of any index on that column alone. Such an index sometimes enables you to drop the index on the other column as well.
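
A minimal sketch of that replacement, assuming the existing single-column index is named Customers_Area_Code (both index names here are hypothetical):

-- Retire the single-column index on the more selective column...
DROP INDEX Customers_Area_Code;
-- ...and replace it with a multicolumn index leading with that column.
CREATE INDEX Customers_Area_First ON Customers (Area_Code, First_Name);

If no other query depends on it, the remaining single-column index on First_Name then becomes a candidate to drop as well.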