There are three main steps involved in tuning on SQL Server:
Provide the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately.
Modify the query to prevent execution plans that you do not want, mainly using methods specific to SQL Server.
Force a simple execution plan with FORCEPLAN when necessary.
Proving that a little knowledge is a dangerous thing, cost-based optimizers often do a terrible job if they do not have statistics on all the tables and indexes involved in the query. It is therefore imperative to maintain statistics on tables and indexes reliably; this includes regenerating statistics anytime table volumes change much or anytime tables or indexes are rebuilt. It is safest to regenerate statistics periodically, during times when load is relatively quiet, nightly or at least weekly. Run the following from Query Analyzer, then cut and paste the resulting UPDATE STATISTICS commands into the query window and run them as well:
-- file called updateall.sql -- update your whole database SELECT 'UPDATE STATISTICS ', name FROM sysobjects WHERE type = 'U'
Often, queries include conditions on highly skewed distributions, such as conditions on special types, codes, or flags, when these columns have only a few values. SQL Server automatically maintains statistics on the distribution of indexed column values, enabling SQL Server to estimate selectivities automatically, even when indexed columns have skewed distributions.
Occasionally, it is useful to help the SQL Server estimate the selectivity of condition with a skewed distribution even when the distribution applies to a nonindexed column. In such a case, you need to specially request data on that column. For example, to request a statistics group named Eflag on the nonindexed column Exempt_Flag of the Employees table, run:
CREATE STATISTICS EFlag on Employees(Exempt_Flag)
As an example of a case in which such skewed distributions might apply, consider an Orders table in which you have a Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled once the application has been running for a few months, causing a steady rise in 'CL' values. A steady, significant fraction of orders would end up cancelled, so the value 'CA' would also eventually point to a large list of orders. However, as long as the business keeps up with incoming orders, the number of open orders would remain moderate and steady, even as data accumulates for years. Quite early, a condition that specified Status_Code='OP' would be selective enough to prefer joining to the corresponding table early, even if Status_Code is not indexed, and it is important to enable the optimizer to realize this fact, preferably without a lot of manual tuning. This requires that your SQL actually mention the specific value that applies to the condition, rather than use a generic stored procedure that only fills in the value of the constant after the parse, at execution time.
You should usually tune SQL Server with hints. Hints generally go in either the FROM clause, when they apply to a specific table access, or in the SQL Server OPTION( ) clause at the very end of a query. These are the most useful hints:
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to use the specified index to access that table alias. The older alternative syntax INDEX=<Index_Name> is also supported, but it might be dropped in the future, so I don't recommend it. Even more obsolete and dangerous is the still-supported method of naming the internal object ID that corresponds to the desired index. Naming the index you want with the ID is horribly unreliable, because the index will get a new ID if anyone ever drops it and recreates it, or if the application moves to a new SQL Server database.
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to use a full table scan to reach that table alias.
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to read the specified table alias without requiring read locks or otherwise enforcing a consistent read. Read locks on SQL Server can create a bottleneck when combined with heavy update activity on a table. This hint avoids such a bottleneck, potentially at the cost of a consistent view of the data as of a single moment in time.
These two different hints can each immediately precede the JOIN keyword in the FROM clause, instructing SQL Server to perform the specified join with the specified join method. These hints require the new-style join syntax with the JOIN keyword in the FROM clause. The presence of even a single hint of this type also forces all joins to take place in the same order the aliases are listed in the FROM clause.
This hint goes at the end of a query and forces all joins to follow nested loops.
This hint goes at the end of a query and forces all joins to take place in the same order the aliases are listed in the FROM clause.
This hint simply instructs SQL Server to attempt to reach the first returned rows as quickly as possible, which generally favors a nested-loops execution plan. Its effect is much like the OPTION(LOOP JOIN) hint, although in theory SQL Server might recognize that no execution plan could reach the first rows quickly in a query with an explicit ORDER BY, nullifying any effect from OPTION(FAST 1). The OPTION(FAST 1) hint is the equivalent of the FIRST_ROWS hint on Oracle and the OPTIMIZE FOR 1 ROW hint on DB2.
These hints can be combined. You can place multiple hints within a single WITH clause, separating them with commasfor example, WITH (INDEX(Employee_First_Name), NOLOCK). Multiple hints in a single OPTION clause also are separated by commasfor example, OPTION(LOOP JOIN, FORCE ORDER). Together, these hints give full control of the join order, the join methods, and the table access methods.
I'll demonstrate tuning with hints on a couple of queries. If you choose a robust all-nested-loops plan that drives from the employee last name to the other tables in optimum order, using the primary keys to reach the other tables, this query's hints force the desired plan:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description, M.First_Name, M.Last_Name, LM.Description FROM Employees E WITH (INDEX(Employee_Last_Name)) INNER JOIN Locations LE WITH (INDEX(Location_PKey)) ON E.Location_ID=LE.Location_ID INNER JOIN Employees M WITH (INDEX(Employee_PKey)) ON E.Manager_ID=M.Employee_ID INNER JOIN Locations LM WITH (INDEX(Location_PKey)) ON M.Location_ID=LM.Location_ID WHERE E.Last_Name = 'Johnson' AND LE.Description='Dallas' OPTION(LOOP JOIN, FORCE ORDER)
SET SHOWPLAN_TEXT ON (as described in Chapter 3) generates the following results when you run this query from SQL Server Query Analyzer:
StmtText ----------------------------------------------------------------- |--Bookmark Lookup(...(...[Locations] AS [LM])) |--Nested Loops(Inner Join) |--Bookmark Lookup(...(...[Employees] AS [M])) | |--Nested Loops(Inner Join) | |--Filter(WHERE:([LE].[Description]='Dallas')) | | |--Bookmark Lookup(...(...[Locations] AS [LE])) | | |--Nested Loops(Inner Join) | | |--Bookmark Lookup(...(...[Employees] AS [E])) | | | |--Index Seek(...(... (wrapped line) [Employees].[Employee_Last_Name] (wrapped line) AS [E]), SEEK:([E].[Last_Name]='Johnson') ORDERED) | | |--Index Seek(...(...[Locations].[Location_PKey] (wrapped line) AS [LE]), SEEK:([LE].[Location_ID]=[E].[Location_ID]) ORDERED) | |--Index Seek(...(...[Employees].[Employee_PKey] (wrapped line) AS [M]), SEEK:([M].[Employee_ID]=[E].[Manager_ID]) ORDERED) |--Index Seek(..(...[Locations].[Location_PKey] (wrapped line) AS [LM]), SEEK:([LM].[Location_ID]=[M].[Location_ID]) ORDERED) (12 row(s) affected)
If you don't want to specify all nested loops, you might need the join HASH and LOOP hints, as shown in the following alternative to the last query:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description, M.First_Name, M.Last_Name, LM.Description FROM Employees E WITH (INDEX(Employee_Last_Name)) INNER HASH JOIN Locations LE WITH (INDEX(Location_Description)) ON E.Location_ID=LE.Location_ID INNER LOOP JOIN Employees M WITH (INDEX(Employee_PKey)) ON E.Manager_ID=M.Employee_ID INNER LOOP JOIN Locations LM WITH (INDEX(Location_PKey)) ON M.Location_ID=LM.Location_ID WHERE E.Last_Name = 'Johnson' AND LE.Description='Dallas'
The preceding query delivers the following execution plan, triggered by SET SHOWPLAN_TEXT ON:
StmtText ------------------------------------------------------------- |--Bookmark Lookup(...(...[Locations] AS [LM])) |--Nested Loops(Inner Join) |--Bookmark Lookup(...(...[Employees] AS [M])) | |--Nested Loops(Inner Join) | |--Hash Match(Inner Join... (wrapped line) ([E].[Location_ID])=([LE].[Location_ID])...) | | |--Bookmark Lookup(...(...[Employees] AS [E])) | | | |--Index Seek(...(...[Employees].[Employee_Last_Name] (wrapped line) AS [E]), SEEK:([E].[Last_Name]='Johnson') ORDERED) | | |--Bookmark Lookup(...(...[Locations] AS [LE])) | | |--Index Seek(...(...[Locations].[Location_Description] (wrapped line) AS [LE]), SEEK:([LE].[Description]='Dallas') ORDERED) | |--Index Seek(...(...[Employees].[Employee_PKey] (wrapped line) AS [M]), SEEK:([M].[Employee_ID]=[E].[Manager_ID]) ORDERED) |--Index Seek(...(...[Locations].[Location_PKey] (wrapped line) AS [LM]), SEEK:([LM].[Location_ID]=[M].[Location_ID]) ORDERED) (11 row(s) affected)
There are two basic extremes involved in tuning with hints such as those in this example:
Use as little direction as possible to get the execution plan you want, or at least to get close enough to the plan you want for reasonable performance. This approach reasons that SQL Server has more information than you have, and it should be left free to adapt to changing data distributions and take advantage of improvements in SQL Server with future releases. By leaving SQL Server the maximum degrees of freedom, you maximize its power to optimize well for you in the future. However, until you try, you won't know how much direction the SQL Server will need if it did not get the plan right in the first place, so this approach is likely to be iterative, involving adding one hint at a time until SQL Server delivers a good plan.
If you did not get the plan you wanted from SQL Server automatically, assume the database has made bad assumptions that will propagate to distort all of its calculations. Therefore, leave it with very little freedom, specifying essentially the whole plan you want.
If you are confident in your chosen execution plan, as you should be if you apply the methods I describe later in this book, there is little reason to hold back from fully specifying that plan. I have yet to find a case in which a well-chosen, robust execution plan needed to evolve to handle new data distributions or new database features. On the other hand, it is easy for SQL with a partially restricting set of hints to go wrong, especially if some table or index loses its statistics. When SQL Server chooses incorrectly, the error that made the database choose incorrectly is likely to propagate over the entire plan. However, the OPTION(FAST 1) hint is the sort of instruction that can be useful even when SQL Server has perfect information, simply specifying that the time to reach the first row is more important than the time to reach the last row.
An older method to tune on both Microsoft SQL Server and Sybase is the FORCEPLAN option. You execute this option with a standalone SQL statement:
SET FORCEPLAN ON
This option affects all SQL for that connection until you execute this statement:
SET FORCEPLAN OFF
When FORCEPLAN is ON, the database is instructed to perform only the simplest optimization on the SQL it sees. It generally uses nested-loops execution plans that drive through indexes and join tables in the same order you list them in the FROM clause. When this is the sort of plan you want, SET FORCEPLAN can be ideal, not only forcing the plan simply, but even saving parse time that would otherwise be wasted considering a much wider range of plans, especially for joins of many tables. It is a blunt-edged sword, so to speak, so only use it when you know the FROM-clause join order is the correct join order and you want nested loops.