Online steps in a business process that run in less than a second are unlikely to significantly slow the end user who is performing the process. Even steps that take well over a second can often be made convenient to the end user if those parts of the process are taken offline and made into batch processes. The only business-driven need for queries to run very fastunder a half a second, for examplecomes about when a single step in a business process requires queries to run repeatedly. Some application designs repeat a query hundreds of times for a single online event, or up to millions of times for a single batch process. In these cases, clearly, a query would have to run in a few milliseconds or less to meet an end user's needs, and most queries are not that fast, even after perfect tuning. When you run into this problem, the real issue is the need for so many queries, not the runtime of each query. There are three basic solutions to the need for repeated queries:
Cache everything the repeated queries might need, once, in application memory, and subsequently get data as needed from the application cache.
Consolidate the repeated queries into a new, single query.
Merge the repeated queries into joins that are added to a single query the application already runs.
Before I discuss these three solutions, consider a typical repeated query, which retrieves the pricing history for a given product:
SELECT Product_Price FROM Price_List WHERE Product_ID = :1 AND Effective_Date <= :today ORDER BY Effective_Date DESC;
To find the best solution to such a repeated query, you need to consider several questions:
How many times must the application run the query sequentially to perform a task?
What is the source of the values of :1 and :today that drive the query?
How many rows would you need to preread and cache for the cache to supply all the data the application might need? How much future use could you get from that cache?
What is the rowcount range and average for the query? How many of those rows does the application actually use?
The next few sections describe three possible solutions and show how the answers to these questions can lead to one of those solutions.
If the set of possible Product_IDs is small compared to the number of times the example query repeats, you have a good case to cache the whole set. For example, we might have 10 Product_IDs while repeating the query 1,000 times, so it makes more sense to precache results across the whole set and read from the cache in place of repeated queries to the database. For example, you could issue this query:
SELECT Product_Price FROM Price_List WHERE Effective_Date <= :today ORDER BY Product_ID ASC, Effective_Date DESC;
The cache could then hold these resultsfor example, in a structure in application memory using hash buckets or a binary tree that gives rapid (microseconds) access to the results for a particular Product_ID. Since individual reads from the first approach outnumber the Product_IDs, this approach offers two advantages in terms of the cost of database access:
The database reads fewer rows, with fewer logical I/Os. In the original approach, the database read each row more than once, on average, since it hit the average Product_ID more than once. These reads were likely less efficient in terms of logical I/Os per row as well, because they likely drove through an index read that required more than one logical I/O per row read. The precaching approach, in contrast, likely drives from a full table scan that reads several rows per logical I/O.
The database gets the data in a single query, likely with a single round trip over the network. Even with reuse of a preparsed query, the repeated-query approach requires at least one round trip to the database per repeat of the query, with considerable overhead.
In the current example, the repeated query likely returns multiple rows for each Product_ID, but it is quite likely that only the row with the highest Effective_Date returned (the first one in the sort order) is relevant to the application. An ideal caching algorithm would take this into account and simply discard the other rows, saving memory and other costs to fill the cache.
Therefore, even if such a cache is used one set of times, to perform a single business task for a single end user, it can cost less than the repeated queries it makes unnecessary. If the cache continues to be useful for future tasks by the same end user, the benefit increases, even justifying a cache that is initially more expensive to populate than the repeated queries for a single task. If the cache resides in shared memory that is accessible to multiple end users, assuming its contents are useful to the whole community of end users, the benefits multiply still further.
Caching in the Application Versus Caching in the DatabaseThe database already caches frequently used database blocks in its own shared memory, so why have a redundant application cache holding data the database is probably caching already? The answer is certainly not to reduce physical I/O; both approaches to caching work well at eliminating physical I/O. The application cache does have advantages, though:
The price paid for a fast application cache is increased application complexity. A few simple, static caches for repeatedly read rowsets are easy enough, but if you get carried away, you will have reinvented the database software functionality in your own application. At the extreme, you will be using Oracle, SQL Server, or DB2 as nothing more than a glorified fileserver, and your own application will be responsible for all the real work of a database, with orders of magnitude less development behind it than behind those databases. |
A blind query to read every row that a repeated query might potentially touch can be too expensive, and the result set might be too large to cache. However, that does not prevent the application from consolidating the multiple queries into a single query with one of two forms. In the current example, each pass through the loop would find a Product_ID to drive the initial repeated query. You could run through the loops without querying and simply compile a list of IDs to use in a single consolidated query. For the example, these forms would look like this:
SELECT Product_Price FROM Price_List WHERE Product_ID in (<long list of IDs>) AND Effective_Date <= :today ORDER BY Product_ID ASC, Effective_Date DESC;
or this:
SELECT Product_Price FROM Price_List WHERE Product_ID in (<subquery that returns the list of IDs>) AND Effective_Date <= :today ORDER BY Product_ID ASC, Effective_Date DESC;
This still has the advantages of eliminating the overhead of handling multiple queries, and it handles a case in which caching the set of all possible query results would not work, but it produces a result that yields far less row reuse. In all, this query-consolidation approach resolves a broad class of repeated-query performance problems.
Usually, the source of the variables in repeated queries is an earlier query. In the current example, :today surely comes from the clock/calendar, but :1 almost certainly comes from an earlier query that returned a long list of Product_IDs. The most likely reason behind the repeated query is to fill in a current (with the most recent Effective_Date) product price for each of the rows the earlier query returned, based on the first row that the query returns each time it runs. Since the goal of the query is to find a matching price row for each of the earlier query's rows, this is functionally like a join, though not nearly so efficient.
The technique in these cases is to convert these to actual joins. When these repeated queries are just simple single-row queries on a primary key, the conversion to a join is obvious. In cases like the current example, the solution is subtler. The query returned a price history sorted by Effective_Date to place the currently effective price at the top. This currently effective price record is the only record you actually want. To consolidate this repeated query with the query that provides the list of Product_IDs, you have to find a way to join with the first row a sorted query returns without reading the other rows.
There are several approaches that resolve this requirement, and I will describe two. Each approach begins with a new column, Current_Price_Flag, which equals 'Y' if and only if the price is the price currently in effect:
If rows are never created with future Effective_Dates, create a trigger that sets the newly obsolete price's Current_Price_Flag to 'N' and that creates new prices with Current_Price_Flag set to 'Y'.
If not-yet-effective price records are sometimes created, run a nightly batch process that looks for future effective price records that just became current at midnight, and have that process update those records to have Current_Price_Flag equal to 'Y', while updating Current_Price_Flag of newly obsolete prices to 'N'. This nightly batch process will probably supplement trigger-driven updates from new price records that are current at creation time.
Given such a properly maintained Current_Price_Flag column, the join is simple. If the initial query looked like this:
SELECT ... OD.Product_ID, ... FROM ... Order_Details OD, ... WHERE ...
the new query that incorporates the join would be:
SELECT ... OD.Product_ID, ..., PL.Product_Price FROM ... Order_Details OD, ..., Price_List PL WHERE ... AND OD.Product_ID=PL.Product_ID AND PL.Current_Price_Flag='Y'
Not all procedural code is replaced by a join as easily as in this last example. However, SQL incorporates powerful logic expressions, such as CASE, that enable virtually any procedural data-lookup function to be converted to some series of joins combined with complex (often nested) SQL expressions.